linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [v5 0/3] "Hotremove" persistent memory
@ 2019-05-02 18:43 Pavel Tatashin
  2019-05-02 18:43 ` [v5 1/3] device-dax: fix memory and resource leak if hotplug fails Pavel Tatashin
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 18:43 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, linux-kernel, linux-mm,
	linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
	keith.busch, vishal.l.verma, dave.jiang, zwisler,
	thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
	baiyaowei, tiwai, jglisse, david

Changelog:
v5
- Addressed comments from Dan Williams: made remove_memory() to return
  an error code, and use this function from dax.

v4
- Addressed comments from Dave Hansen

v3
- Addressed comments from David Hildenbrand. Don't release
  lock_device_hotplug after checking memory status, and rename
  memblock_offlined_cb() to check_memblock_offlined_cb()

v2
- Dan Williams mentioned that drv->remove() return is ignored
  by unbind. Unbind always succeeds. Because we cannot guarantee
  that memory can be offlined from the driver, don't even
  attempt to do so. Simply check that every section is offlined
  beforehand and only then proceed with removing dax memory.

---

Recently, adding a persistent memory to be used like a regular RAM was
added to Linux. This work extends this functionality to also allow hot
removing persistent memory.

We (Microsoft) have an important use case for this functionality.

The requirement is for physical machines with small amount of RAM (~8G)
to be able to reboot in a very short period of time (<1s). Yet, there is
a userland state that is expensive to recreate (~2G).

The solution is to boot machines with 2G preserved for persistent
memory.

Copy the state, and hotadd the persistent memory so machine still has
all 8G available for runtime. Before reboot, offline and hotremove
device-dax 2G, copy the memory that is needed to be preserved to pmem0
device, and reboot.

The series of operations look like this:

1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
   and free ramdisk.
2. Convert raw pmem0 to devdax
   ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
3. Hotadd to System RAM
   echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
   echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
   echo online_movable > /sys/devices/system/memoryXXX/state
4. Before reboot hotremove device-dax memory from System RAM
   echo offline > /sys/devices/system/memoryXXX/state
   echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create raw pmem0 device
   ndctl create-namespace --mode raw  -e namespace0.0 -f
6. Copy the state that was stored by apps to ramdisk to pmem device
7. Do kexec reboot or reboot through firmware if firmware does not
   zero memory in pmem0 region (These machines have only regular
   volatile memory). So to have pmem0 device either memmap kernel
   parameter is used, or devices nodes in dtb are specified.

Pavel Tatashin (3):
  device-dax: fix memory and resource leak if hotplug fails
  mm/hotplug: make remove_memory() interface useable
  device-dax: "Hotremove" persistent memory that is used like normal RAM

 drivers/dax/dax-private.h      |  2 ++
 drivers/dax/kmem.c             | 46 ++++++++++++++++++++++---
 include/linux/memory_hotplug.h |  8 +++--
 mm/memory_hotplug.c            | 61 ++++++++++++++++++++++------------
 4 files changed, 89 insertions(+), 28 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [v5 1/3] device-dax: fix memory and resource leak if hotplug fails
  2019-05-02 18:43 [v5 0/3] "Hotremove" persistent memory Pavel Tatashin
@ 2019-05-02 18:43 ` Pavel Tatashin
  2019-05-02 18:43 ` [v5 2/3] mm/hotplug: make remove_memory() interface useable Pavel Tatashin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 18:43 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, linux-kernel, linux-mm,
	linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
	keith.busch, vishal.l.verma, dave.jiang, zwisler,
	thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
	baiyaowei, tiwai, jglisse, david

When add_memory() function fails, the resource and the memory should be
freed.

Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
 drivers/dax/kmem.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a02318c6d28a..4c0131857133 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -66,8 +66,11 @@ int dev_dax_kmem_probe(struct device *dev)
 	new_res->name = dev_name(dev);
 
 	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
-	if (rc)
+	if (rc) {
+		release_resource(new_res);
+		kfree(new_res);
 		return rc;
+	}
 
 	return 0;
 }
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [v5 2/3] mm/hotplug: make remove_memory() interface useable
  2019-05-02 18:43 [v5 0/3] "Hotremove" persistent memory Pavel Tatashin
  2019-05-02 18:43 ` [v5 1/3] device-dax: fix memory and resource leak if hotplug fails Pavel Tatashin
@ 2019-05-02 18:43 ` Pavel Tatashin
  2019-05-03 10:06   ` David Hildenbrand
  2019-05-06 17:57   ` Dave Hansen
  2019-05-02 18:43 ` [v5 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM Pavel Tatashin
  2019-05-02 20:50 ` [v5 0/3] "Hotremove" persistent memory Verma, Vishal L
  3 siblings, 2 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 18:43 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, linux-kernel, linux-mm,
	linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
	keith.busch, vishal.l.verma, dave.jiang, zwisler,
	thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
	baiyaowei, tiwai, jglisse, david

As of right now remove_memory() interface is inherently broken. It tries
to remove memory but panics if some memory is not offline. The problem
is that it is impossible to ensure that all memory blocks are offline as
this function also takes lock_device_hotplug that is required to
change memory state via sysfs.

So, between calling this function and offlining all memory blocks there
is always a window when lock_device_hotplug is released, and therefore,
there is always a chance for a panic during this window.

Make this interface to return an error if memory removal fails. This way
it is safe to call this function without panicking machine, and also
makes it symmetric to add_memory() which already returns an error.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/memory_hotplug.h |  8 +++--
 mm/memory_hotplug.c            | 61 ++++++++++++++++++++++------------
 2 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8ade08c50d26..5438a2d92560 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -304,7 +304,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
-extern void remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 
 #else
@@ -321,7 +321,11 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	return -EINVAL;
 }
 
-static inline void remove_memory(int nid, u64 start, u64 size) {}
+static inline bool remove_memory(int nid, u64 start, u64 size)
+{
+	return -EBUSY;
+}
+
 static inline void __remove_memory(int nid, u64 start, u64 size) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8c454e82d4f6..a826aededa1a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1778,9 +1778,10 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 		endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
 		pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
 			&beginpa, &endpa);
-	}
 
-	return ret;
+		return -EBUSY;
+	}
+	return 0;
 }
 
 static int check_cpu_on_node(pg_data_t *pgdat)
@@ -1843,19 +1844,9 @@ void try_offline_node(int nid)
 }
 EXPORT_SYMBOL(try_offline_node);
 
-/**
- * remove_memory
- * @nid: the node ID
- * @start: physical address of the region to remove
- * @size: size of the region to remove
- *
- * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
- * and online/offline operations before this call, as required by
- * try_offline_node().
- */
-void __ref __remove_memory(int nid, u64 start, u64 size)
+static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
-	int ret;
+	int rc = 0;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -1863,13 +1854,13 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 
 	/*
 	 * All memory blocks must be offlined before removing memory.  Check
-	 * whether all memory blocks in question are offline and trigger a BUG()
+	 * whether all memory blocks in question are offline and return error
 	 * if this is not the case.
 	 */
-	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
-				check_memblock_offlined_cb);
-	if (ret)
-		BUG();
+	rc = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
+			       check_memblock_offlined_cb);
+	if (rc)
+		goto done;
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
@@ -1879,14 +1870,42 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 
 	try_offline_node(nid);
 
+done:
 	mem_hotplug_done();
+	return rc;
 }
 
-void remove_memory(int nid, u64 start, u64 size)
+/**
+ * remove_memory
+ * @nid: the node ID
+ * @start: physical address of the region to remove
+ * @size: size of the region to remove
+ *
+ * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
+ * and online/offline operations before this call, as required by
+ * try_offline_node().
+ */
+void __remove_memory(int nid, u64 start, u64 size)
 {
+
+	/*
+	 * trigger BUG() is some memory is not offlined prior to calling this
+	 * function
+	 */
+	if (try_remove_memory(nid, start, size))
+		BUG();
+}
+
+/* Remove memory if every memory block is offline, otherwise return false */
+int remove_memory(int nid, u64 start, u64 size)
+{
+	int rc;
+
 	lock_device_hotplug();
-	__remove_memory(nid, start, size);
+	rc  = try_remove_memory(nid, start, size);
 	unlock_device_hotplug();
+
+	return rc;
 }
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [v5 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM
  2019-05-02 18:43 [v5 0/3] "Hotremove" persistent memory Pavel Tatashin
  2019-05-02 18:43 ` [v5 1/3] device-dax: fix memory and resource leak if hotplug fails Pavel Tatashin
  2019-05-02 18:43 ` [v5 2/3] mm/hotplug: make remove_memory() interface useable Pavel Tatashin
@ 2019-05-02 18:43 ` Pavel Tatashin
  2019-05-02 20:50 ` [v5 0/3] "Hotremove" persistent memory Verma, Vishal L
  3 siblings, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 18:43 UTC (permalink / raw)
  To: pasha.tatashin, jmorris, sashal, linux-kernel, linux-mm,
	linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
	keith.busch, vishal.l.verma, dave.jiang, zwisler,
	thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
	baiyaowei, tiwai, jglisse, david

It is now allowed to use persistent memory like a regular RAM, but
currently there is no way to remove this memory until machine is
rebooted.

This work expands the functionality to also allows hotremoving
previously hotplugged persistent memory, and recover the device for use
for other purposes.

To hotremove persistent memory, the management software must first
offline all memory blocks of dax region, and than unbind it from
device-dax/kmem driver. So, operations should look like this:

echo offline > /sys/devices/system/memory/memoryN/state
...
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Note: if unbind is done without offlining memory beforehand, it won't be
possible to do dax0.0 hotremove, and dax's memory is going to be part of
System RAM until reboot.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 drivers/dax/dax-private.h |  2 ++
 drivers/dax/kmem.c        | 41 +++++++++++++++++++++++++++++++++++----
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index a45612148ca0..999aaf3a29b3 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -53,6 +53,7 @@ struct dax_region {
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
  * @ref: pgmap reference count (driver owned)
  * @cmp: @ref final put completion (driver owned)
+ * @dax_mem_res: physical address range of hotadded DAX memory
  */
 struct dev_dax {
 	struct dax_region *region;
@@ -62,6 +63,7 @@ struct dev_dax {
 	struct dev_pagemap pgmap;
 	struct percpu_ref ref;
 	struct completion cmp;
+	struct resource *dax_kmem_res;
 };
 
 static inline struct dev_dax *to_dev_dax(struct device *dev)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 4c0131857133..3d0a7e702c94 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -71,21 +71,54 @@ int dev_dax_kmem_probe(struct device *dev)
 		kfree(new_res);
 		return rc;
 	}
+	dev_dax->dax_kmem_res = new_res;
 
 	return 0;
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static int dev_dax_kmem_remove(struct device *dev)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct resource *res = dev_dax->dax_kmem_res;
+	resource_size_t kmem_start = res->start;
+	resource_size_t kmem_size = resource_size(res);
+	int rc;
+
+	/*
+	 * We have one shot for removing memory, if some memory blocks were not
+	 * offline prior to calling this function remove_memory() will fail, and
+	 * there is no way to hotremove this memory until reboot because device
+	 * unbind will succeed even if we return failure.
+	 */
+	rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
+	if (rc) {
+		dev_err(dev,
+			"DAX region %pR cannot be hotremoved until the next reboot\n",
+			res);
+		return rc;
+	}
+
+	/* Release and free dax resources */
+	release_resource(res);
+	kfree(res);
+	dev_dax->dax_kmem_res = NULL;
+
+	return 0;
+}
+#else
 static int dev_dax_kmem_remove(struct device *dev)
 {
 	/*
-	 * Purposely leak the request_mem_region() for the device-dax
-	 * range and return '0' to ->remove() attempts. The removal of
-	 * the device from the driver always succeeds, but the region
-	 * is permanently pinned as reserved by the unreleased
+	 * Without hotremove purposely leak the request_mem_region() for the
+	 * device-dax range and return '0' to ->remove() attempts. The removal
+	 * of the device from the driver always succeeds, but the region is
+	 * permanently pinned as reserved by the unreleased
 	 * request_mem_region().
 	 */
 	return 0;
 }
+#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 static struct dax_device_driver device_dax_kmem_driver = {
 	.drv = {
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-02 18:43 [v5 0/3] "Hotremove" persistent memory Pavel Tatashin
                   ` (2 preceding siblings ...)
  2019-05-02 18:43 ` [v5 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM Pavel Tatashin
@ 2019-05-02 20:50 ` Verma, Vishal L
  2019-05-02 21:44   ` Pavel Tatashin
  2019-05-15 18:11   ` Pavel Tatashin
  3 siblings, 2 replies; 21+ messages in thread
From: Verma, Vishal L @ 2019-05-02 20:50 UTC (permalink / raw)
  To: linux-kernel, jmorris, tiwai, sashal, pasha.tatashin, linux-mm,
	dave.hansen, david, bp, Williams, Dan J, akpm, linux-nvdimm,
	jglisse, zwisler, mhocko, Jiang, Dave, bhelgaas, Busch, Keith,
	thomas.lendacky, Huang, Ying, Wu, Fengguang, baiyaowei

On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> The series of operations look like this:
> 
> 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
>    and free ramdisk.
> 2. Convert raw pmem0 to devdax
>    ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> 3. Hotadd to System RAM
>    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>    echo online_movable > /sys/devices/system/memoryXXX/state
> 4. Before reboot hotremove device-dax memory from System RAM
>    echo offline > /sys/devices/system/memoryXXX/state
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Hi Pavel,

I am working on adding this sort of a workflow into a new daxctl command
(daxctl-reconfigure-device)- this will allow changing the 'mode' of a
dax device to kmem, online the resulting memory, and with your patches,
also attempt to offline the memory, and change back to device-dax.

In running with these patches, and testing the offlining part, I ran
into the following lockdep below.

This is with just these three patches on top of -rc7.


[  +0.004886] ======================================================
[  +0.001576] WARNING: possible circular locking dependency detected
[  +0.001506] 5.1.0-rc7+ #13 Tainted: G           O     
[  +0.000929] ------------------------------------------------------
[  +0.000708] daxctl/22950 is trying to acquire lock:
[  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
[  +0.000922] 
              but task is already holding lock:
[  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[  +0.000960] 
              which lock already depends on the new lock.

[  +0.001001] 
              the existing dependency chain (in reverse order) is:
[  +0.000837] 
              -> #3 (mem_sysfs_mutex){+.+.}:
[  +0.000631]        __mutex_lock+0x82/0x9a0
[  +0.000477]        unregister_memory_section+0x22/0xa0
[  +0.000582]        __remove_pages+0xe9/0x520
[  +0.000489]        arch_remove_memory+0x81/0xc0
[  +0.000510]        devm_memremap_pages_release+0x180/0x270
[  +0.000633]        release_nodes+0x234/0x280
[  +0.000483]        device_release_driver_internal+0xf4/0x1d0
[  +0.000701]        bus_remove_device+0xfc/0x170
[  +0.000529]        device_del+0x16a/0x380
[  +0.000459]        unregister_dev_dax+0x23/0x50
[  +0.000526]        release_nodes+0x234/0x280
[  +0.000487]        device_release_driver_internal+0xf4/0x1d0
[  +0.000646]        unbind_store+0x9b/0x130
[  +0.000467]        kernfs_fop_write+0xf0/0x1a0
[  +0.000510]        vfs_write+0xba/0x1c0
[  +0.000438]        ksys_write+0x5a/0xe0
[  +0.000521]        do_syscall_64+0x60/0x210
[  +0.000489]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000637] 
              -> #2 (mem_hotplug_lock.rw_sem){++++}:
[  +0.000717]        get_online_mems+0x3e/0x80
[  +0.000491]        kmem_cache_create_usercopy+0x2e/0x270
[  +0.000609]        kmem_cache_create+0x12/0x20
[  +0.000507]        ptlock_cache_init+0x20/0x28
[  +0.000506]        start_kernel+0x240/0x4d0
[  +0.000480]        secondary_startup_64+0xa4/0xb0
[  +0.000539] 
              -> #1 (cpu_hotplug_lock.rw_sem){++++}:
[  +0.000784]        cpus_read_lock+0x3e/0x80
[  +0.000511]        online_pages+0x37/0x310
[  +0.000469]        memory_subsys_online+0x34/0x60
[  +0.000611]        device_online+0x60/0x80
[  +0.000611]        state_store+0x66/0xd0
[  +0.000552]        kernfs_fop_write+0xf0/0x1a0
[  +0.000649]        vfs_write+0xba/0x1c0
[  +0.000487]        ksys_write+0x5a/0xe0
[  +0.000459]        do_syscall_64+0x60/0x210
[  +0.000482]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000646] 
              -> #0 (kn->count#424){++++}:
[  +0.000669]        lock_acquire+0x9e/0x180
[  +0.000471]        __kernfs_remove+0x26a/0x310
[  +0.000518]        kernfs_remove_by_name_ns+0x40/0x80
[  +0.000583]        remove_files.isra.1+0x30/0x70
[  +0.000555]        sysfs_remove_group+0x3d/0x80
[  +0.000524]        sysfs_remove_groups+0x29/0x40
[  +0.000532]        device_remove_attrs+0x42/0x80
[  +0.000522]        device_del+0x162/0x380
[  +0.000464]        device_unregister+0x16/0x60
[  +0.000505]        unregister_memory_section+0x6e/0xa0
[  +0.000591]        __remove_pages+0xe9/0x520
[  +0.000492]        arch_remove_memory+0x81/0xc0
[  +0.000568]        try_remove_memory+0xba/0xd0
[  +0.000510]        remove_memory+0x23/0x40
[  +0.000483]        dev_dax_kmem_remove+0x29/0x57 [kmem]
[  +0.000608]        device_release_driver_internal+0xe4/0x1d0
[  +0.000637]        unbind_store+0x9b/0x130
[  +0.000464]        kernfs_fop_write+0xf0/0x1a0
[  +0.000685]        vfs_write+0xba/0x1c0
[  +0.000594]        ksys_write+0x5a/0xe0
[  +0.000449]        do_syscall_64+0x60/0x210
[  +0.000481]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000619] 
              other info that might help us debug this:

[  +0.000889] Chain exists of:
                kn->count#424 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex

[  +0.001269]  Possible unsafe locking scenario:

[  +0.000652]        CPU0                    CPU1
[  +0.000505]        ----                    ----
[  +0.000523]   lock(mem_sysfs_mutex);
[  +0.000422]                                lock(mem_hotplug_lock.rw_sem);
[  +0.000905]                                lock(mem_sysfs_mutex);
[  +0.000793]   lock(kn->count#424);
[  +0.000394] 
               *** DEADLOCK ***

[  +0.000665] 7 locks held by daxctl/22950:
[  +0.000458]  #0: 000000005f6d3c13 (sb_writers#4){.+.+}, at: vfs_write+0x159/0x1c0
[  +0.000943]  #1: 00000000e468825d (&of->mutex){+.+.}, at: kernfs_fop_write+0xbd/0x1a0
[  +0.000895]  #2: 00000000caa17dbb (&dev->mutex){....}, at: device_release_driver_internal+0x1a/0x1d0
[  +0.001019]  #3: 000000002119b22c (device_hotplug_lock){+.+.}, at: remove_memory+0x16/0x40
[  +0.000942]  #4: 00000000150c8efe (cpu_hotplug_lock.rw_sem){++++}, at: try_remove_memory+0x2e/0xd0
[  +0.001019]  #5: 000000003d6b2a0f (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x25/0x120
[  +0.001118]  #6: 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[  +0.001033] 
              stack backtrace:
[  +0.000507] CPU: 5 PID: 22950 Comm: daxctl Tainted: G           O      5.1.0-rc7+ #13
[  +0.000896] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[  +0.001360] Call Trace:
[  +0.000293]  dump_stack+0x85/0xc0
[  +0.000390]  print_circular_bug.isra.41.cold.60+0x15c/0x195
[  +0.000651]  check_prev_add.constprop.50+0x5fd/0xbe0
[  +0.000563]  ? call_rcu_zapped+0x80/0x80
[  +0.000449]  __lock_acquire+0xcee/0xfd0
[  +0.000437]  lock_acquire+0x9e/0x180
[  +0.000428]  ? kernfs_remove_by_name_ns+0x40/0x80
[  +0.000531]  __kernfs_remove+0x26a/0x310
[  +0.000451]  ? kernfs_remove_by_name_ns+0x40/0x80
[  +0.000529]  ? kernfs_name_hash+0x12/0x80
[  +0.000462]  kernfs_remove_by_name_ns+0x40/0x80
[  +0.000513]  remove_files.isra.1+0x30/0x70
[  +0.000483]  sysfs_remove_group+0x3d/0x80
[  +0.000458]  sysfs_remove_groups+0x29/0x40
[  +0.000477]  device_remove_attrs+0x42/0x80
[  +0.000461]  device_del+0x162/0x380
[  +0.000399]  device_unregister+0x16/0x60
[  +0.000442]  unregister_memory_section+0x6e/0xa0
[  +0.001232]  __remove_pages+0xe9/0x520
[  +0.000443]  arch_remove_memory+0x81/0xc0
[  +0.000459]  try_remove_memory+0xba/0xd0
[  +0.000460]  remove_memory+0x23/0x40
[  +0.000461]  dev_dax_kmem_remove+0x29/0x57 [kmem]
[  +0.000603]  device_release_driver_internal+0xe4/0x1d0
[  +0.000590]  unbind_store+0x9b/0x130
[  +0.000409]  kernfs_fop_write+0xf0/0x1a0
[  +0.000448]  vfs_write+0xba/0x1c0
[  +0.000395]  ksys_write+0x5a/0xe0
[  +0.000382]  do_syscall_64+0x60/0x210
[  +0.000418]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000573] RIP: 0033:0x7fd1f7442fa8
[  +0.000407] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 75 77 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  +0.002119] RSP: 002b:00007ffd48f58e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000833] RAX: ffffffffffffffda RBX: 000000000210c817 RCX: 00007fd1f7442fa8
[  +0.000795] RDX: 0000000000000007 RSI: 000000000210c817 RDI: 0000000000000003
[  +0.000816] RBP: 0000000000000007 R08: 000000000210c7d0 R09: 00007fd1f74d4e80
[  +0.000808] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[  +0.000819] R13: 00007fd1f72b9ce8 R14: 0000000000000000 R15: 00007ffd48f58e70

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-02 20:50 ` [v5 0/3] "Hotremove" persistent memory Verma, Vishal L
@ 2019-05-02 21:44   ` Pavel Tatashin
  2019-05-02 22:29     ` Verma, Vishal L
  2019-05-15 18:11   ` Pavel Tatashin
  1 sibling, 1 reply; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 21:44 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: linux-kernel, jmorris, tiwai, sashal, linux-mm, dave.hansen,
	david, bp, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
	mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
	Huang, Ying, Wu, Fengguang, baiyaowei

On Thu, May 2, 2019 at 4:50 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
>
> On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> > The series of operations look like this:
> >
> > 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
> >    and free ramdisk.
> > 2. Convert raw pmem0 to devdax
> >    ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> > 3. Hotadd to System RAM
> >    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> >    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> >    echo online_movable > /sys/devices/system/memoryXXX/state
> > 4. Before reboot hotremove device-dax memory from System RAM
> >    echo offline > /sys/devices/system/memoryXXX/state
> >    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>
> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.

Hi Verma,

Thank you for testing. I wonder if there is a command sequence that I
could run to reproduce it?
Also, could you please send your config and qemu arguments.

Thank you,
Pasha

>
>
> [  +0.004886] ======================================================
> [  +0.001576] WARNING: possible circular locking dependency detected
> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G           O
> [  +0.000929] ------------------------------------------------------
> [  +0.000708] daxctl/22950 is trying to acquire lock:
> [  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000922]
>               but task is already holding lock:
> [  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
> [  +0.000960]
>               which lock already depends on the new lock.
>
> [  +0.001001]
>               the existing dependency chain (in reverse order) is:
> [  +0.000837]
>               -> #3 (mem_sysfs_mutex){+.+.}:
> [  +0.000631]        __mutex_lock+0x82/0x9a0
> [  +0.000477]        unregister_memory_section+0x22/0xa0
> [  +0.000582]        __remove_pages+0xe9/0x520
> [  +0.000489]        arch_remove_memory+0x81/0xc0
> [  +0.000510]        devm_memremap_pages_release+0x180/0x270
> [  +0.000633]        release_nodes+0x234/0x280
> [  +0.000483]        device_release_driver_internal+0xf4/0x1d0
> [  +0.000701]        bus_remove_device+0xfc/0x170
> [  +0.000529]        device_del+0x16a/0x380
> [  +0.000459]        unregister_dev_dax+0x23/0x50
> [  +0.000526]        release_nodes+0x234/0x280
> [  +0.000487]        device_release_driver_internal+0xf4/0x1d0
> [  +0.000646]        unbind_store+0x9b/0x130
> [  +0.000467]        kernfs_fop_write+0xf0/0x1a0
> [  +0.000510]        vfs_write+0xba/0x1c0
> [  +0.000438]        ksys_write+0x5a/0xe0
> [  +0.000521]        do_syscall_64+0x60/0x210
> [  +0.000489]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  +0.000637]
>               -> #2 (mem_hotplug_lock.rw_sem){++++}:
> [  +0.000717]        get_online_mems+0x3e/0x80
> [  +0.000491]        kmem_cache_create_usercopy+0x2e/0x270
> [  +0.000609]        kmem_cache_create+0x12/0x20
> [  +0.000507]        ptlock_cache_init+0x20/0x28
> [  +0.000506]        start_kernel+0x240/0x4d0
> [  +0.000480]        secondary_startup_64+0xa4/0xb0
> [  +0.000539]
>               -> #1 (cpu_hotplug_lock.rw_sem){++++}:
> [  +0.000784]        cpus_read_lock+0x3e/0x80
> [  +0.000511]        online_pages+0x37/0x310
> [  +0.000469]        memory_subsys_online+0x34/0x60
> [  +0.000611]        device_online+0x60/0x80
> [  +0.000611]        state_store+0x66/0xd0
> [  +0.000552]        kernfs_fop_write+0xf0/0x1a0
> [  +0.000649]        vfs_write+0xba/0x1c0
> [  +0.000487]        ksys_write+0x5a/0xe0
> [  +0.000459]        do_syscall_64+0x60/0x210
> [  +0.000482]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  +0.000646]
>               -> #0 (kn->count#424){++++}:
> [  +0.000669]        lock_acquire+0x9e/0x180
> [  +0.000471]        __kernfs_remove+0x26a/0x310
> [  +0.000518]        kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000583]        remove_files.isra.1+0x30/0x70
> [  +0.000555]        sysfs_remove_group+0x3d/0x80
> [  +0.000524]        sysfs_remove_groups+0x29/0x40
> [  +0.000532]        device_remove_attrs+0x42/0x80
> [  +0.000522]        device_del+0x162/0x380
> [  +0.000464]        device_unregister+0x16/0x60
> [  +0.000505]        unregister_memory_section+0x6e/0xa0
> [  +0.000591]        __remove_pages+0xe9/0x520
> [  +0.000492]        arch_remove_memory+0x81/0xc0
> [  +0.000568]        try_remove_memory+0xba/0xd0
> [  +0.000510]        remove_memory+0x23/0x40
> [  +0.000483]        dev_dax_kmem_remove+0x29/0x57 [kmem]
> [  +0.000608]        device_release_driver_internal+0xe4/0x1d0
> [  +0.000637]        unbind_store+0x9b/0x130
> [  +0.000464]        kernfs_fop_write+0xf0/0x1a0
> [  +0.000685]        vfs_write+0xba/0x1c0
> [  +0.000594]        ksys_write+0x5a/0xe0
> [  +0.000449]        do_syscall_64+0x60/0x210
> [  +0.000481]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  +0.000619]
>               other info that might help us debug this:
>
> [  +0.000889] Chain exists of:
>                 kn->count#424 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
>
> [  +0.001269]  Possible unsafe locking scenario:
>
> [  +0.000652]        CPU0                    CPU1
> [  +0.000505]        ----                    ----
> [  +0.000523]   lock(mem_sysfs_mutex);
> [  +0.000422]                                lock(mem_hotplug_lock.rw_sem);
> [  +0.000905]                                lock(mem_sysfs_mutex);
> [  +0.000793]   lock(kn->count#424);
> [  +0.000394]
>                *** DEADLOCK ***
>
> [  +0.000665] 7 locks held by daxctl/22950:
> [  +0.000458]  #0: 000000005f6d3c13 (sb_writers#4){.+.+}, at: vfs_write+0x159/0x1c0
> [  +0.000943]  #1: 00000000e468825d (&of->mutex){+.+.}, at: kernfs_fop_write+0xbd/0x1a0
> [  +0.000895]  #2: 00000000caa17dbb (&dev->mutex){....}, at: device_release_driver_internal+0x1a/0x1d0
> [  +0.001019]  #3: 000000002119b22c (device_hotplug_lock){+.+.}, at: remove_memory+0x16/0x40
> [  +0.000942]  #4: 00000000150c8efe (cpu_hotplug_lock.rw_sem){++++}, at: try_remove_memory+0x2e/0xd0
> [  +0.001019]  #5: 000000003d6b2a0f (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x25/0x120
> [  +0.001118]  #6: 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
> [  +0.001033]
>               stack backtrace:
> [  +0.000507] CPU: 5 PID: 22950 Comm: daxctl Tainted: G           O      5.1.0-rc7+ #13
> [  +0.000896] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
> [  +0.001360] Call Trace:
> [  +0.000293]  dump_stack+0x85/0xc0
> [  +0.000390]  print_circular_bug.isra.41.cold.60+0x15c/0x195
> [  +0.000651]  check_prev_add.constprop.50+0x5fd/0xbe0
> [  +0.000563]  ? call_rcu_zapped+0x80/0x80
> [  +0.000449]  __lock_acquire+0xcee/0xfd0
> [  +0.000437]  lock_acquire+0x9e/0x180
> [  +0.000428]  ? kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000531]  __kernfs_remove+0x26a/0x310
> [  +0.000451]  ? kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000529]  ? kernfs_name_hash+0x12/0x80
> [  +0.000462]  kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000513]  remove_files.isra.1+0x30/0x70
> [  +0.000483]  sysfs_remove_group+0x3d/0x80
> [  +0.000458]  sysfs_remove_groups+0x29/0x40
> [  +0.000477]  device_remove_attrs+0x42/0x80
> [  +0.000461]  device_del+0x162/0x380
> [  +0.000399]  device_unregister+0x16/0x60
> [  +0.000442]  unregister_memory_section+0x6e/0xa0
> [  +0.001232]  __remove_pages+0xe9/0x520
> [  +0.000443]  arch_remove_memory+0x81/0xc0
> [  +0.000459]  try_remove_memory+0xba/0xd0
> [  +0.000460]  remove_memory+0x23/0x40
> [  +0.000461]  dev_dax_kmem_remove+0x29/0x57 [kmem]
> [  +0.000603]  device_release_driver_internal+0xe4/0x1d0
> [  +0.000590]  unbind_store+0x9b/0x130
> [  +0.000409]  kernfs_fop_write+0xf0/0x1a0
> [  +0.000448]  vfs_write+0xba/0x1c0
> [  +0.000395]  ksys_write+0x5a/0xe0
> [  +0.000382]  do_syscall_64+0x60/0x210
> [  +0.000418]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  +0.000573] RIP: 0033:0x7fd1f7442fa8
> [  +0.000407] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 75 77 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
> [  +0.002119] RSP: 002b:00007ffd48f58e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  +0.000833] RAX: ffffffffffffffda RBX: 000000000210c817 RCX: 00007fd1f7442fa8
> [  +0.000795] RDX: 0000000000000007 RSI: 000000000210c817 RDI: 0000000000000003
> [  +0.000816] RBP: 0000000000000007 R08: 000000000210c7d0 R09: 00007fd1f74d4e80
> [  +0.000808] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> [  +0.000819] R13: 00007fd1f72b9ce8 R14: 0000000000000000 R15: 00007ffd48f58e70

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-02 21:44   ` Pavel Tatashin
@ 2019-05-02 22:29     ` Verma, Vishal L
  2019-05-02 22:36       ` Pavel Tatashin
  0 siblings, 1 reply; 21+ messages in thread
From: Verma, Vishal L @ 2019-05-02 22:29 UTC (permalink / raw)
  To: pasha.tatashin
  Cc: linux-kernel, jmorris, sashal, bp, linux-mm, david, dave.hansen,
	tiwai, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
	mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
	Huang, Ying, Wu, Fengguang, baiyaowei

On Thu, 2019-05-02 at 17:44 -0400, Pavel Tatashin wrote:

> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> > 
> > This is with just these three patches on top of -rc7.
> 
> Hi Verma,
> 
> Thank you for testing. I wonder if there is a command sequence that I
> could run to reproduce it?
> Also, could you please send your config and qemu arguments.
> 
Yes, here is the qemu config:

qemu-system-x86_64
	-machine accel=kvm
	-machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
	-cpu Haswell-noTSX
	-m 12G,slots=3,maxmem=44G
	-realtime mlock=off
	-smp 8,sockets=2,cores=4,threads=1
	-numa node,nodeid=0,cpus=0-3,mem=6G
	-numa node,nodeid=1,cpus=4-7,mem=6G
	-numa node,nodeid=2
	-numa node,nodeid=3
	-drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
	-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
	-object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
	-device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
	-object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
	-device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
	-serial stdio
	-display none

For the command list - I'm using WIP patches to ndctl/daxctl to add the
command I mentioned earlier. Using this command, I can reproduce the
lockdep issue. I thought I should be able to reproduce the issue by
onlining/offlining through sysfs directly too - something like:

   node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
   for mem in /sys/devices/system/node/node"$node"/memory*; do
     echo "offline" > $mem/state
   done

But with that I can't reproduce the problem.

I'll try to dig a bit deeper into what might be happening, the daxctl
modifications simply amount to doing the same thing as above in C, so
I'm not immediately sure what might be happening.

If you're interested, I can post the ndctl patches - maybe as an RFC -
to test with.

Thanks,
-Vishal




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-02 22:29     ` Verma, Vishal L
@ 2019-05-02 22:36       ` Pavel Tatashin
  2019-05-03 21:48         ` Verma, Vishal L
  0 siblings, 1 reply; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 22:36 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: linux-kernel, jmorris, sashal, bp, linux-mm, david, dave.hansen,
	tiwai, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
	mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
	Huang, Ying, Wu, Fengguang, baiyaowei

On Thu, May 2, 2019 at 6:29 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
>
> On Thu, 2019-05-02 at 17:44 -0400, Pavel Tatashin wrote:
>
> > > In running with these patches, and testing the offlining part, I ran
> > > into the following lockdep below.
> > >
> > > This is with just these three patches on top of -rc7.
> >
> > Hi Verma,
> >
> > Thank you for testing. I wonder if there is a command sequence that I
> > could run to reproduce it?
> > Also, could you please send your config and qemu arguments.
> >
> Yes, here is the qemu config:
>
> qemu-system-x86_64
>         -machine accel=kvm
>         -machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
>         -cpu Haswell-noTSX
>         -m 12G,slots=3,maxmem=44G
>         -realtime mlock=off
>         -smp 8,sockets=2,cores=4,threads=1
>         -numa node,nodeid=0,cpus=0-3,mem=6G
>         -numa node,nodeid=1,cpus=4-7,mem=6G
>         -numa node,nodeid=2
>         -numa node,nodeid=3
>         -drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
>         -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
>         -object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
>         -device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
>         -object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
>         -device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
>         -serial stdio
>         -display none
>
> For the command list - I'm using WIP patches to ndctl/daxctl to add the
> command I mentioned earlier. Using this command, I can reproduce the
> lockdep issue. I thought I should be able to reproduce the issue by
> onlining/offlining through sysfs directly too - something like:
>
>    node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
>    for mem in /sys/devices/system/node/node"$node"/memory*; do
>      echo "offline" > $mem/state
>    done
>
> But with that I can't reproduce the problem.
>
> I'll try to dig a bit deeper into what might be happening, the daxctl
> modifications simply amount to doing the same thing as above in C, so
> I'm not immediately sure what might be happening.
>
> If you're interested, I can post the ndctl patches - maybe as an RFC -
> to test with.

I could apply the patches and test with them. Also, could you please
send your kernel config.

Thank you,
Pasha

>
> Thanks,
> -Vishal
>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
  2019-05-02 18:43 ` [v5 2/3] mm/hotplug: make remove_memory() interface useable Pavel Tatashin
@ 2019-05-03 10:06   ` David Hildenbrand
  2019-05-06 17:57   ` Dave Hansen
  1 sibling, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2019-05-03 10:06 UTC (permalink / raw)
  To: Pavel Tatashin, jmorris, sashal, linux-kernel, linux-mm,
	linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
	keith.busch, vishal.l.verma, dave.jiang, zwisler,
	thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
	baiyaowei, tiwai, jglisse

On 02.05.19 20:43, Pavel Tatashin wrote:
> As of right now remove_memory() interface is inherently broken. It tries
> to remove memory but panics if some memory is not offline. The problem
> is that it is impossible to ensure that all memory blocks are offline as
> this function also takes lock_device_hotplug that is required to
> change memory state via sysfs.
> 

The existing interface can actually work today by registering a hotplug
notifier and rejecting any onlining attempts. But I agree that this way,
the interface becomes more usable.

> So, between calling this function and offlining all memory blocks there
> is always a window when lock_device_hotplug is released, and therefore,
> there is always a chance for a panic during this window.
> 
> Make this interface to return an error if memory removal fails. This way
> it is safe to call this function without panicking machine, and also
> makes it symmetric to add_memory() which already returns an error.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>> ---
>  include/linux/memory_hotplug.h |  8 +++--
>  mm/memory_hotplug.c            | 61 ++++++++++++++++++++++------------
>  2 files changed, 46 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 8ade08c50d26..5438a2d92560 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -304,7 +304,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
>  extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
>  extern void try_offline_node(int nid);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> -extern void remove_memory(int nid, u64 start, u64 size);
> +extern int remove_memory(int nid, u64 start, u64 size);
>  extern void __remove_memory(int nid, u64 start, u64 size);
>  
>  #else
> @@ -321,7 +321,11 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	return -EINVAL;
>  }
>  
> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline bool remove_memory(int nid, u64 start, u64 size)
> +{
> +	return -EBUSY;
> +}
> +
>  static inline void __remove_memory(int nid, u64 start, u64 size) {}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8c454e82d4f6..a826aededa1a 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1778,9 +1778,10 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
>  		endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
>  		pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
>  			&beginpa, &endpa);
> -	}
>  
> -	return ret;
> +		return -EBUSY;
> +	}
> +	return 0;
>  }
>  
>  static int check_cpu_on_node(pg_data_t *pgdat)
> @@ -1843,19 +1844,9 @@ void try_offline_node(int nid)
>  }
>  EXPORT_SYMBOL(try_offline_node);
>  
> -/**
> - * remove_memory
> - * @nid: the node ID
> - * @start: physical address of the region to remove
> - * @size: size of the region to remove
> - *
> - * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> - * and online/offline operations before this call, as required by
> - * try_offline_node().
> - */
> -void __ref __remove_memory(int nid, u64 start, u64 size)
> +static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  {
> -	int ret;
> +	int rc = 0;
>  
>  	BUG_ON(check_hotplug_memory_range(start, size));
>  
> @@ -1863,13 +1854,13 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>  
>  	/*
>  	 * All memory blocks must be offlined before removing memory.  Check
> -	 * whether all memory blocks in question are offline and trigger a BUG()
> +	 * whether all memory blocks in question are offline and return error
>  	 * if this is not the case.
>  	 */
> -	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> -				check_memblock_offlined_cb);
> -	if (ret)
> -		BUG();
> +	rc = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> +			       check_memblock_offlined_cb);
> +	if (rc)
> +		goto done;
>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
> @@ -1879,14 +1870,42 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>  
>  	try_offline_node(nid);
>  
> +done:
>  	mem_hotplug_done();
> +	return rc;
>  }
>  
> -void remove_memory(int nid, u64 start, u64 size)
> +/**
> + * remove_memory
> + * @nid: the node ID
> + * @start: physical address of the region to remove
> + * @size: size of the region to remove
> + *
> + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> + * and online/offline operations before this call, as required by
> + * try_offline_node().
> + */
> +void __remove_memory(int nid, u64 start, u64 size)
>  {
> +
> +	/*
> +	 * trigger BUG() is some memory is not offlined prior to calling this
> +	 * function
> +	 */
> +	if (try_remove_memory(nid, start, size))
> +		BUG();
> +}
> +
> +/* Remove memory if every memory block is offline, otherwise return false */

Comment is wrong "return false"

> +int remove_memory(int nid, u64 start, u64 size)
> +{
> +	int rc;
> +
>  	lock_device_hotplug();
> -	__remove_memory(nid, start, size);
> +	rc  = try_remove_memory(nid, start, size);
>  	unlock_device_hotplug();
> +
> +	return rc;
>  }
>  EXPORT_SYMBOL_GPL(remove_memory);
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> 


Looks sane to me

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-02 22:36       ` Pavel Tatashin
@ 2019-05-03 21:48         ` Verma, Vishal L
  0 siblings, 0 replies; 21+ messages in thread
From: Verma, Vishal L @ 2019-05-03 21:48 UTC (permalink / raw)
  To: pasha.tatashin
  Cc: linux-kernel, jmorris, sashal, bp, linux-mm, david, dave.hansen,
	tiwai, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
	mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
	Huang, Ying, Wu, Fengguang, baiyaowei


[-- Attachment #1.1: Type: text/plain, Size: 2644 bytes --]

On Thu, 2019-05-02 at 18:36 -0400, Pavel Tatashin wrote:
> > Yes, here is the qemu config:
> > 
> > qemu-system-x86_64
> >         -machine accel=kvm
> >         -machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
> >         -cpu Haswell-noTSX
> >         -m 12G,slots=3,maxmem=44G
> >         -realtime mlock=off
> >         -smp 8,sockets=2,cores=4,threads=1
> >         -numa node,nodeid=0,cpus=0-3,mem=6G
> >         -numa node,nodeid=1,cpus=4-7,mem=6G
> >         -numa node,nodeid=2
> >         -numa node,nodeid=3
> >         -drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
> >         -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
> >         -object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
> >         -device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
> >         -object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
> >         -device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
> >         -serial stdio
> >         -display none
> > 
> > For the command list - I'm using WIP patches to ndctl/daxctl to add the
> > command I mentioned earlier. Using this command, I can reproduce the
> > lockdep issue. I thought I should be able to reproduce the issue by
> > onlining/offlining through sysfs directly too - something like:
> > 
> >    node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
> >    for mem in /sys/devices/system/node/node"$node"/memory*; do
> >      echo "offline" > $mem/state
> >    done
> > 
> > But with that I can't reproduce the problem.
> > 
> > I'll try to dig a bit deeper into what might be happening, the daxctl
> > modifications simply amount to doing the same thing as above in C, so
> > I'm not immediately sure what might be happening.
> > 
> > If you're interested, I can post the ndctl patches - maybe as an RFC -
> > to test with.
> 
> I could apply the patches and test with them. Also, could you please
> send your kernel config.
> 
Hi Pavel,

I've CC'd you on the patches mentioned above, and also pushed them to a
'kmem-pending' branch on github:

https://github.com/pmem/ndctl/tree/kmem-pending

After building ndctl from the above, you will want to run:

# daxctl reconfigure-device --mode=system-ram dax0.0
(this will also have onlined the memory sections)

# daxctl reconfigure-device --mode=devdax --attempt-offline dax0.0
(this triggers the lockdep warnings)

I've attached the kernel config here too (gzipped).

Thanks,
	-Vishal

[-- Attachment #1.2: config.gz --]
[-- Type: application/gzip, Size: 30367 bytes --]

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 3270 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
  2019-05-02 18:43 ` [v5 2/3] mm/hotplug: make remove_memory() interface useable Pavel Tatashin
  2019-05-03 10:06   ` David Hildenbrand
@ 2019-05-06 17:57   ` Dave Hansen
  2019-05-06 18:01     ` Dan Williams
  2019-05-06 18:13     ` Pavel Tatashin
  1 sibling, 2 replies; 21+ messages in thread
From: Dave Hansen @ 2019-05-06 17:57 UTC (permalink / raw)
  To: Pavel Tatashin, jmorris, sashal, linux-kernel, linux-mm,
	linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
	keith.busch, vishal.l.verma, dave.jiang, zwisler,
	thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
	baiyaowei, tiwai, jglisse, david

> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline bool remove_memory(int nid, u64 start, u64 size)
> +{
> +	return -EBUSY;
> +}

This seems like an appropriate place for a WARN_ONCE(), if someone
manages to call remove_memory() with hotplug disabled.

BTW, I looked and can't think of a better errno, but -EBUSY probably
isn't the best error code, right?

> -void remove_memory(int nid, u64 start, u64 size)
> +/**
> + * remove_memory
> + * @nid: the node ID
> + * @start: physical address of the region to remove
> + * @size: size of the region to remove
> + *
> + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> + * and online/offline operations before this call, as required by
> + * try_offline_node().
> + */
> +void __remove_memory(int nid, u64 start, u64 size)
>  {
> +
> +	/*
> +	 * trigger BUG() is some memory is not offlined prior to calling this
> +	 * function
> +	 */
> +	if (try_remove_memory(nid, start, size))
> +		BUG();
> +}

Could we call this remove_offline_memory()?  That way, it makes _some_
sense why we would BUG() if the memory isn't offline.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
  2019-05-06 17:57   ` Dave Hansen
@ 2019-05-06 18:01     ` Dan Williams
  2019-05-06 18:04       ` Dave Hansen
  2019-05-17 18:10       ` Pavel Tatashin
  2019-05-06 18:13     ` Pavel Tatashin
  1 sibling, 2 replies; 21+ messages in thread
From: Dan Williams @ 2019-05-06 18:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Pavel Tatashin, James Morris, Sasha Levin,
	Linux Kernel Mailing List, Linux MM, linux-nvdimm, Andrew Morton,
	Michal Hocko, Dave Hansen, Keith Busch, Vishal L Verma,
	Dave Jiang, Ross Zwisler, Tom Lendacky, Huang, Ying,
	Fengguang Wu, Borislav Petkov, Bjorn Helgaas, Yaowei Bai,
	Takashi Iwai, Jérôme Glisse, David Hildenbrand

On Mon, May 6, 2019 at 10:57 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > +{
> > +     return -EBUSY;
> > +}
>
> This seems like an appropriate place for a WARN_ONCE(), if someone
> manages to call remove_memory() with hotplug disabled.
>
> BTW, I looked and can't think of a better errno, but -EBUSY probably
> isn't the best error code, right?
>
> > -void remove_memory(int nid, u64 start, u64 size)
> > +/**
> > + * remove_memory
> > + * @nid: the node ID
> > + * @start: physical address of the region to remove
> > + * @size: size of the region to remove
> > + *
> > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > + * and online/offline operations before this call, as required by
> > + * try_offline_node().
> > + */
> > +void __remove_memory(int nid, u64 start, u64 size)
> >  {
> > +
> > +     /*
> > +      * trigger BUG() is some memory is not offlined prior to calling this
> > +      * function
> > +      */
> > +     if (try_remove_memory(nid, start, size))
> > +             BUG();
> > +}
>
> Could we call this remove_offline_memory()?  That way, it makes _some_
> sense why we would BUG() if the memory isn't offline.

Please WARN() instead of BUG() because failing to remove memory should
not be system fatal.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
  2019-05-06 18:01     ` Dan Williams
@ 2019-05-06 18:04       ` Dave Hansen
  2019-05-06 18:18         ` Pavel Tatashin
  2019-05-17 18:10       ` Pavel Tatashin
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2019-05-06 18:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Pavel Tatashin, James Morris, Sasha Levin,
	Linux Kernel Mailing List, Linux MM, linux-nvdimm, Andrew Morton,
	Michal Hocko, Dave Hansen, Keith Busch, Vishal L Verma,
	Dave Jiang, Ross Zwisler, Tom Lendacky, Huang, Ying,
	Fengguang Wu, Borislav Petkov, Bjorn Helgaas, Yaowei Bai,
	Takashi Iwai, Jérôme Glisse, David Hildenbrand

On 5/6/19 11:01 AM, Dan Williams wrote:
>>> +void __remove_memory(int nid, u64 start, u64 size)
>>>  {
>>> +
>>> +     /*
>>> +      * trigger BUG() is some memory is not offlined prior to calling this
>>> +      * function
>>> +      */
>>> +     if (try_remove_memory(nid, start, size))
>>> +             BUG();
>>> +}
>> Could we call this remove_offline_memory()?  That way, it makes _some_
>> sense why we would BUG() if the memory isn't offline.
> Please WARN() instead of BUG() because failing to remove memory should
> not be system fatal.

That is my preference as well.  But, the existing code BUG()s, so I'm
OK-ish with this staying for the moment until we have a better handle on
what all the callers do if this fails.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
  2019-05-06 17:57   ` Dave Hansen
  2019-05-06 18:01     ` Dan Williams
@ 2019-05-06 18:13     ` Pavel Tatashin
  1 sibling, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-06 18:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: James Morris, Sasha Levin, LKML, linux-mm, linux-nvdimm,
	Andrew Morton, Michal Hocko, Dave Hansen, Dan Williams,
	Keith Busch, Vishal L Verma, Dave Jiang, Ross Zwisler,
	Tom Lendacky, Huang, Ying, Fengguang Wu, Borislav Petkov,
	Bjorn Helgaas, Yaowei Bai, Takashi Iwai, Jérôme Glisse,
	David Hildenbrand

On Mon, May 6, 2019 at 1:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > +{
> > +     return -EBUSY;
> > +}
>
> This seems like an appropriate place for a WARN_ONCE(), if someone
> manages to call remove_memory() with hotplug disabled.
>
> BTW, I looked and can't think of a better errno, but -EBUSY probably
> isn't the best error code, right?

Same here, I looked and did not find any better then -EBUSY. Also, it
is close to check_cpu_on_node() in the same file.

>
> > -void remove_memory(int nid, u64 start, u64 size)
> > +/**
> > + * remove_memory
> > + * @nid: the node ID
> > + * @start: physical address of the region to remove
> > + * @size: size of the region to remove
> > + *
> > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > + * and online/offline operations before this call, as required by
> > + * try_offline_node().
> > + */
> > +void __remove_memory(int nid, u64 start, u64 size)
> >  {
> > +
> > +     /*
> > +      * trigger BUG() is some memory is not offlined prior to calling this
> > +      * function
> > +      */
> > +     if (try_remove_memory(nid, start, size))
> > +             BUG();
> > +}
>
> Could we call this remove_offline_memory()?  That way, it makes _some_
> sense why we would BUG() if the memory isn't offline.

Sure, I will rename this function.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
  2019-05-06 18:04       ` Dave Hansen
@ 2019-05-06 18:18         ` Pavel Tatashin
  0 siblings, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-06 18:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dan Williams, James Morris, Sasha Levin,
	Linux Kernel Mailing List, Linux MM, linux-nvdimm, Andrew Morton,
	Michal Hocko, Dave Hansen, Keith Busch, Vishal L Verma,
	Dave Jiang, Ross Zwisler, Tom Lendacky, Huang, Ying,
	Fengguang Wu, Borislav Petkov, Bjorn Helgaas, Yaowei Bai,
	Takashi Iwai, Jérôme Glisse, David Hildenbrand

On Mon, May 6, 2019 at 2:04 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/6/19 11:01 AM, Dan Williams wrote:
> >>> +void __remove_memory(int nid, u64 start, u64 size)
> >>>  {
> >>> +
> >>> +     /*
> >>> +      * trigger BUG() is some memory is not offlined prior to calling this
> >>> +      * function
> >>> +      */
> >>> +     if (try_remove_memory(nid, start, size))
> >>> +             BUG();
> >>> +}
> >> Could we call this remove_offline_memory()?  That way, it makes _some_
> >> sense why we would BUG() if the memory isn't offline.
> > Please WARN() instead of BUG() because failing to remove memory should
> > not be system fatal.
>
> That is my preference as well.  But, the existing code BUG()s, so I'm
> OK-ish with this staying for the moment until we have a better handle on
> what all the callers do if this fails.

Yes, this is the reason why I BUG() here. The current code does this,
and I was not sure what would happen if we simply continue executing.
Of course, I would prefer to return failure, so the callers can act
appropriately, but let's make one thing at a time, this should not be
part of this series.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-02 20:50 ` [v5 0/3] "Hotremove" persistent memory Verma, Vishal L
  2019-05-02 21:44   ` Pavel Tatashin
@ 2019-05-15 18:11   ` Pavel Tatashin
  2019-05-16  0:42     ` Dan Williams
  1 sibling, 1 reply; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-15 18:11 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: linux-kernel, jmorris, tiwai, sashal, linux-mm, dave.hansen,
	david, bp, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
	mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
	Huang, Ying, Wu, Fengguang, baiyaowei

> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.
>
>
> [  +0.004886] ======================================================
> [  +0.001576] WARNING: possible circular locking dependency detected
> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G           O
> [  +0.000929] ------------------------------------------------------
> [  +0.000708] daxctl/22950 is trying to acquire lock:
> [  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000922]
>               but task is already holding lock:
> [  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0

I have studied this issue, and now have a clear understanding why it
happens, I am not yet sure how to fix it, so suggestions are welcomed
:)

Here is the problem:

When we offline pages we have the following call stack:

# echo offline > /sys/devices/system/memory/memory8/state
ksys_write
 vfs_write
  __vfs_write
   kernfs_fop_write
    kernfs_get_active
     lock_acquire                       kn->count#122 (lock for
"memory8/state" kn)
    sysfs_kf_write
     dev_attr_store
      state_store
       device_offline
        memory_subsys_offline
         memory_block_action
          offline_pages
           __offline_pages
            percpu_down_write
             down_write
              lock_acquire              mem_hotplug_lock.rw_sem

When we unbind dax0.0 we have the following  stack:
# echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
drv_attr_store
 unbind_store
  device_driver_detach
   device_release_driver_internal
    dev_dax_kmem_remove
     remove_memory                      device_hotplug_lock
      try_remove_memory                 mem_hotplug_lock.rw_sem
       arch_remove_memory
        __remove_pages
         __remove_section
          unregister_memory_section
           remove_memory_section        mem_sysfs_mutex
            unregister_memory
             device_unregister
              device_del
               device_remove_attrs
                sysfs_remove_groups
                 sysfs_remove_group
                  remove_files
                   kernfs_remove_by_name
                    kernfs_remove_by_name_ns
                     __kernfs_remove    kn->count#122

So, lockdep found the ordering issue with the above two stacks:

1. kn->count#122 -> mem_hotplug_lock.rw_sem
2. mem_hotplug_lock.rw_sem -> kn->count#122

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-15 18:11   ` Pavel Tatashin
@ 2019-05-16  0:42     ` Dan Williams
  2019-05-16  7:10       ` David Hildenbrand
  2019-05-17 14:09       ` Pavel Tatashin
  0 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2019-05-16  0:42 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Verma, Vishal L, linux-kernel, jmorris, tiwai, sashal, linux-mm,
	dave.hansen, david, bp, akpm, linux-nvdimm, jglisse, zwisler,
	mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
	Huang, Ying, Wu, Fengguang, baiyaowei

On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> > Hi Pavel,
> >
> > I am working on adding this sort of a workflow into a new daxctl command
> > (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> > dax device to kmem, online the resulting memory, and with your patches,
> > also attempt to offline the memory, and change back to device-dax.
> >
> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> >
> > This is with just these three patches on top of -rc7.
> >
> >
> > [  +0.004886] ======================================================
> > [  +0.001576] WARNING: possible circular locking dependency detected
> > [  +0.001506] 5.1.0-rc7+ #13 Tainted: G           O
> > [  +0.000929] ------------------------------------------------------
> > [  +0.000708] daxctl/22950 is trying to acquire lock:
> > [  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> > [  +0.000922]
> >               but task is already holding lock:
> > [  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>
> I have studied this issue, and now have a clear understanding why it
> happens, I am not yet sure how to fix it, so suggestions are welcomed
> :)

I would think that ACPI hotplug would have a similar problem, but it does this:

                acpi_unbind_memory_blocks(info);
                __remove_memory(nid, info->start_addr, info->length);

I wonder if that ordering prevents going too deep into the
device_unregister() call stack that you highlighted below.


>
> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
>  vfs_write
>   __vfs_write
>    kernfs_fop_write
>     kernfs_get_active
>      lock_acquire                       kn->count#122 (lock for
> "memory8/state" kn)
>     sysfs_kf_write
>      dev_attr_store
>       state_store
>        device_offline
>         memory_subsys_offline
>          memory_block_action
>           offline_pages
>            __offline_pages
>             percpu_down_write
>              down_write
>               lock_acquire              mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following  stack:
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
>  unbind_store
>   device_driver_detach
>    device_release_driver_internal
>     dev_dax_kmem_remove
>      remove_memory                      device_hotplug_lock
>       try_remove_memory                 mem_hotplug_lock.rw_sem
>        arch_remove_memory
>         __remove_pages
>          __remove_section
>           unregister_memory_section
>            remove_memory_section        mem_sysfs_mutex
>             unregister_memory
>              device_unregister
>               device_del
>                device_remove_attrs
>                 sysfs_remove_groups
>                  sysfs_remove_group
>                   remove_files
>                    kernfs_remove_by_name
>                     kernfs_remove_by_name_ns
>                      __kernfs_remove    kn->count#122
>
> So, lockdep found the ordering issue with the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-16  0:42     ` Dan Williams
@ 2019-05-16  7:10       ` David Hildenbrand
  2019-05-17 14:09       ` Pavel Tatashin
  1 sibling, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2019-05-16  7:10 UTC (permalink / raw)
  To: Dan Williams, Pavel Tatashin
  Cc: Verma, Vishal L, linux-kernel, jmorris, tiwai, sashal, linux-mm,
	dave.hansen, bp, akpm, linux-nvdimm, jglisse, zwisler, mhocko,
	Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky, Huang,
	Ying, Wu, Fengguang, baiyaowei

On 16.05.19 02:42, Dan Williams wrote:
> On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
>>> Hi Pavel,
>>>
>>> I am working on adding this sort of a workflow into a new daxctl command
>>> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
>>> dax device to kmem, online the resulting memory, and with your patches,
>>> also attempt to offline the memory, and change back to device-dax.
>>>
>>> In running with these patches, and testing the offlining part, I ran
>>> into the following lockdep below.
>>>
>>> This is with just these three patches on top of -rc7.
>>>
>>>
>>> [  +0.004886] ======================================================
>>> [  +0.001576] WARNING: possible circular locking dependency detected
>>> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G           O
>>> [  +0.000929] ------------------------------------------------------
>>> [  +0.000708] daxctl/22950 is trying to acquire lock:
>>> [  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
>>> [  +0.000922]
>>>               but task is already holding lock:
>>> [  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>>
>> I have studied this issue, and now have a clear understanding why it
>> happens, I am not yet sure how to fix it, so suggestions are welcomed
>> :)
> 
> I would think that ACPI hotplug would have a similar problem, but it does this:
> 
>                 acpi_unbind_memory_blocks(info);
>                 __remove_memory(nid, info->start_addr, info->length);
> 
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
> 

If that doesn't help, after we have

[PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling

we could probably pull the memory device removal phase out from the
mem_hotplug_lock protection and let it be protected by the
device_hotplug_lock only. Might require some more work, though.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-16  0:42     ` Dan Williams
  2019-05-16  7:10       ` David Hildenbrand
@ 2019-05-17 14:09       ` Pavel Tatashin
  2019-05-20  7:57         ` David Hildenbrand
  1 sibling, 1 reply; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-17 14:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Verma, Vishal L, linux-kernel, jmorris, tiwai, sashal, linux-mm,
	dave.hansen, david, bp, akpm, linux-nvdimm, jglisse, zwisler,
	mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
	Huang, Ying, Wu, Fengguang, baiyaowei

[-- Attachment #1: Type: text/plain, Size: 11948 bytes --]

>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
>                 acpi_unbind_memory_blocks(info);
>                 __remove_memory(nid, info->start_addr, info->length);

ACPI does have exactly the same problem, so this is not a bug for this
series, I will submit a new version of my series with comments
addressed, but without fix for this issue.

I was able to reproduce this issue on the current mainline kernel.
Also, I been thinking more about how to fix it, and there is no easy
fix without a major hotplug redesign. Basically, we have to remove
sysfs memory entries before or after memory is hotplugged/hotremoved.
But, we also have to guarantee that hotplug/hotremove will succeed or
reinstate sysfs entries.

Qemu script:

qemu-system-x86_64                                                      \
        -enable-kvm                                                     \
        -cpu host                                                       \
        -parallel none                                                  \
        -echr 1                                                         \
        -serial none                                                    \
        -chardev stdio,id=console,signal=off,mux=on                     \
        -serial chardev:console                                         \
        -mon chardev=console                                            \
        -vga none                                                       \
        -display none                                                   \
        -kernel pmem/native/arch/x86/boot/bzImage                       \
        -m 8G,slots=1,maxmem=16G                                        \
        -smp 8                                                          \
        -fsdev local,id=virtfs1,path=/,security_model=none              \
        -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs            \
        -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
TERM=xterm ip=dhcp loglevel=7'

Config is attached.

Steps to reproduce:
#
# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)

# echo online_movable > /sys/devices/system/memory/memory79/state
[   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
[   23.032591] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [   32.013950] Offlined Pages 32768
[   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
[   32.014843] Policy zone: Normal
[   32.015733]
[   32.015881] ======================================================
[   32.016390] WARNING: possible circular locking dependency detected
[   32.016881] 5.1.0_pt_pmem #38 Not tainted
[   32.017202] ------------------------------------------------------
[   32.017680] kworker/u16:4/380 is trying to acquire lock:
[   32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
kernfs_remove_by_name_ns+0x3b/0x80
[   32.018745]
[   32.018745] but task is already holding lock:
[   32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[   32.019859]
[   32.019859] which lock already depends on the new lock.
[   32.019859]
[   32.020499]
[   32.020499] the existing dependency chain (in reverse order) is:
[   32.021080]
[   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[   32.021522]        __mutex_lock+0x8b/0x900
[   32.021843]        hotplug_memory_register+0x26/0xa0
[   32.022231]        __add_pages+0xe7/0x160
[   32.022545]        add_pages+0xd/0x60
[   32.022835]        add_memory_resource+0xc3/0x1d0
[   32.023207]        __add_memory+0x57/0x80
[   32.023530]        acpi_memory_device_add+0x13a/0x2d0
[   32.023928]        acpi_bus_attach+0xf1/0x200
[   32.024272]        acpi_bus_scan+0x3e/0x90
[   32.024597]        acpi_device_hotplug+0x284/0x3e0
[   32.024972]        acpi_hotplug_work_fn+0x15/0x20
[   32.025342]        process_one_work+0x2a0/0x650
[   32.025755]        worker_thread+0x34/0x3d0
[   32.026077]        kthread+0x118/0x130
[   32.026442]        ret_from_fork+0x3a/0x50
[   32.026766]
[   32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
[   32.027261]        get_online_mems+0x39/0x80
[   32.027600]        kmem_cache_create_usercopy+0x29/0x2c0
[   32.028019]        kmem_cache_create+0xd/0x10
[   32.028367]        ptlock_cache_init+0x1b/0x23
[   32.028724]        start_kernel+0x1d2/0x4b8
[   32.029060]        secondary_startup_64+0xa4/0xb0
[   32.029447]
[   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
[   32.030007]        cpus_read_lock+0x39/0x80
[   32.030360]        __offline_pages+0x32/0x790
[   32.030709]        memory_subsys_offline+0x3a/0x60
[   32.031089]        device_offline+0x7e/0xb0
[   32.031425]        acpi_bus_offline+0xd8/0x140
[   32.031821]        acpi_device_hotplug+0x1b2/0x3e0
[   32.032202]        acpi_hotplug_work_fn+0x15/0x20
[   32.032576]        process_one_work+0x2a0/0x650
[   32.032942]        worker_thread+0x34/0x3d0
[   32.033283]        kthread+0x118/0x130
[   32.033588]        ret_from_fork+0x3a/0x50
[   32.033919]
[   32.033919] -> #1 (&device->physical_node_lock){+.+.}:
[   32.034450]        __mutex_lock+0x8b/0x900
[   32.034784]        acpi_get_first_physical_node+0x16/0x60
[   32.035217]        acpi_companion_match+0x3b/0x60
[   32.035594]        acpi_device_uevent_modalias+0x9/0x20
[   32.036012]        platform_uevent+0xd/0x40
[   32.036352]        dev_uevent+0x85/0x1c0
[   32.036674]        kobject_uevent_env+0x1e2/0x640
[   32.037044]        kobject_synth_uevent+0x2b7/0x324
[   32.037428]        uevent_store+0x17/0x30
[   32.037752]        kernfs_fop_write+0xeb/0x1a0
[   32.038112]        vfs_write+0xb2/0x1b0
[   32.038417]        ksys_write+0x57/0xd0
[   32.038721]        do_syscall_64+0x4b/0x1a0
[   32.039053]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   32.039491]
[   32.039491] -> #0 (kn->count#18){++++}:
[   32.039913]        lock_acquire+0xaa/0x180
[   32.040242]        __kernfs_remove+0x244/0x2d0
[   32.040593]        kernfs_remove_by_name_ns+0x3b/0x80
[   32.040991]        device_del+0x14a/0x370
[   32.041309]        device_unregister+0x9/0x20
[   32.041653]        unregister_memory_section+0x69/0xa0
[   32.042059]        __remove_pages+0x112/0x460
[   32.042402]        arch_remove_memory+0x6f/0xa0
[   32.042758]        __remove_memory+0xab/0x130
[   32.043103]        acpi_memory_device_remove+0x67/0xe0
[   32.043537]        acpi_bus_trim+0x50/0x90
[   32.043889]        acpi_device_hotplug+0x2fa/0x3e0
[   32.044300]        acpi_hotplug_work_fn+0x15/0x20
[   32.044686]        process_one_work+0x2a0/0x650
[   32.045044]        worker_thread+0x34/0x3d0
[   32.045381]        kthread+0x118/0x130
[   32.045679]        ret_from_fork+0x3a/0x50
[   32.046005]
[   32.046005] other info that might help us debug this:
[   32.046005]
[   32.046636] Chain exists of:
[   32.046636]   kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
[   32.046636]
[   32.047514]  Possible unsafe locking scenario:
[   32.047514]
[   32.047976]        CPU0                    CPU1
[   32.048337]        ----                    ----
[   32.048697]   lock(mem_sysfs_mutex);
[   32.048983]                                lock(mem_hotplug_lock.rw_sem);
[   32.049519]                                lock(mem_sysfs_mutex);
[   32.050004]   lock(kn->count#18);
[   32.050270]
[   32.050270]  *** DEADLOCK ***
[   32.050270]
[   32.050736] 7 locks held by kworker/u16:4/380:
[   32.051087]  #0: 00000000a22fe78e
((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
[   32.051830]  #1: 00000000944f2dca
((work_completion)(&hpw->work)){+.+.}, at:
process_one_work+0x21e/0x650
[   32.052577]  #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
acpi_device_hotplug+0x2e/0x3e0
[   32.053271]  #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
acpi_device_hotplug+0x3c/0x3e0
[   32.053916]  #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
at: __remove_memory+0x3b/0x130
[   32.054602]  #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
at: percpu_down_write+0x1d/0x110
[   32.055315]  #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[   32.056016]
[   32.056016] stack backtrace:
[   32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
[   32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[   32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   32.058144] Call Trace:
[   32.058344]  dump_stack+0x67/0x90
[   32.058604]  print_circular_bug.cold.60+0x15c/0x195
[   32.058989]  __lock_acquire+0x17de/0x1d30
[   32.059308]  ? find_held_lock+0x2d/0x90
[   32.059611]  ? __kernfs_remove+0x199/0x2d0
[   32.059937]  lock_acquire+0xaa/0x180
[   32.060223]  ? kernfs_remove_by_name_ns+0x3b/0x80
[   32.060596]  __kernfs_remove+0x244/0x2d0
[   32.060908]  ? kernfs_remove_by_name_ns+0x3b/0x80
[   32.061283]  ? kernfs_name_hash+0xd/0x80
[   32.061596]  ? kernfs_find_ns+0x68/0xf0
[   32.061907]  kernfs_remove_by_name_ns+0x3b/0x80
[   32.062266]  device_del+0x14a/0x370
[   32.062548]  ? unregister_mem_sect_under_nodes+0x4f/0xc0
[   32.062973]  device_unregister+0x9/0x20
[   32.063285]  unregister_memory_section+0x69/0xa0
[   32.063651]  __remove_pages+0x112/0x460
[   32.063949]  arch_remove_memory+0x6f/0xa0
[   32.064271]  __remove_memory+0xab/0x130
[   32.064579]  ? walk_memory_range+0xa1/0xe0
[   32.064907]  acpi_memory_device_remove+0x67/0xe0
[   32.065274]  acpi_bus_trim+0x50/0x90
[   32.065560]  acpi_device_hotplug+0x2fa/0x3e0
[   32.065900]  acpi_hotplug_work_fn+0x15/0x20
[   32.066249]  process_one_work+0x2a0/0x650
[   32.066591]  worker_thread+0x34/0x3d0
[   32.066925]  ? process_one_work+0x650/0x650
[   32.067275]  kthread+0x118/0x130
[   32.067542]  ? kthread_create_on_node+0x60/0x60
[   32.067909]  ret_from_fork+0x3a/0x50

>
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
>
>
> >
> > Here is the problem:
> >
> > When we offline pages we have the following call stack:
> >
> > # echo offline > /sys/devices/system/memory/memory8/state
> > ksys_write
> >  vfs_write
> >   __vfs_write
> >    kernfs_fop_write
> >     kernfs_get_active
> >      lock_acquire                       kn->count#122 (lock for
> > "memory8/state" kn)
> >     sysfs_kf_write
> >      dev_attr_store
> >       state_store
> >        device_offline
> >         memory_subsys_offline
> >          memory_block_action
> >           offline_pages
> >            __offline_pages
> >             percpu_down_write
> >              down_write
> >               lock_acquire              mem_hotplug_lock.rw_sem
> >
> > When we unbind dax0.0 we have the following  stack:
> > # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> > drv_attr_store
> >  unbind_store
> >   device_driver_detach
> >    device_release_driver_internal
> >     dev_dax_kmem_remove
> >      remove_memory                      device_hotplug_lock
> >       try_remove_memory                 mem_hotplug_lock.rw_sem
> >        arch_remove_memory
> >         __remove_pages
> >          __remove_section
> >           unregister_memory_section
> >            remove_memory_section        mem_sysfs_mutex
> >             unregister_memory
> >              device_unregister
> >               device_del
> >                device_remove_attrs
> >                 sysfs_remove_groups
> >                  sysfs_remove_group
> >                   remove_files
> >                    kernfs_remove_by_name
> >                     kernfs_remove_by_name_ns
> >                      __kernfs_remove    kn->count#122
> >
> > So, lockdep found the ordering issue with the above two stacks:
> >
> > 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> > 2. mem_hotplug_lock.rw_sem -> kn->count#122

[-- Attachment #2: x86.config.bz2 --]
[-- Type: application/x-bzip, Size: 24701 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
  2019-05-06 18:01     ` Dan Williams
  2019-05-06 18:04       ` Dave Hansen
@ 2019-05-17 18:10       ` Pavel Tatashin
  1 sibling, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-17 18:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, James Morris, Sasha Levin,
	Linux Kernel Mailing List, Linux MM, linux-nvdimm, Andrew Morton,
	Michal Hocko, Dave Hansen, Keith Busch, Vishal L Verma,
	Dave Jiang, Ross Zwisler, Tom Lendacky, Huang, Ying,
	Fengguang Wu, Borislav Petkov, Bjorn Helgaas, Yaowei Bai,
	Takashi Iwai, Jérôme Glisse, David Hildenbrand

Hi Dan,

Thank you very much for your review, my comments below:

On Mon, May 6, 2019 at 2:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, May 6, 2019 at 10:57 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > > +{
> > > +     return -EBUSY;
> > > +}
> >
> > This seems like an appropriate place for a WARN_ONCE(), if someone
> > manages to call remove_memory() with hotplug disabled.

I decided not to do WARN_ONCE(), in all likelihood compiler will
simply optimize this function out, but with WARN_ONCE() some traces of
it will remain.

> >
> > BTW, I looked and can't think of a better errno, but -EBUSY probably
> > isn't the best error code, right?

-EBUSY is the only error that is returned in case of error by real
remove_memory(), so I think it is OK to keep it here.

> >
> > > -void remove_memory(int nid, u64 start, u64 size)
> > > +/**
> > > + * remove_memory
> > > + * @nid: the node ID
> > > + * @start: physical address of the region to remove
> > > + * @size: size of the region to remove
> > > + *
> > > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > > + * and online/offline operations before this call, as required by
> > > + * try_offline_node().
> > > + */
> > > +void __remove_memory(int nid, u64 start, u64 size)
> > >  {
> > > +
> > > +     /*
> > > +      * trigger BUG() is some memory is not offlined prior to calling this
> > > +      * function
> > > +      */
> > > +     if (try_remove_memory(nid, start, size))
> > > +             BUG();
> > > +}
> >
> > Could we call this remove_offline_memory()?  That way, it makes _some_
> > sense why we would BUG() if the memory isn't offline.

It is this particular code path, the second one: remove_memory(),
actually tries to remove memory and returns failure if it can't. So, I
think the current name is OK.

>
> Please WARN() instead of BUG() because failing to remove memory should
> not be system fatal.

As mentioned earlier, I will keep BUG(), because existing code does
that, and there is no good handling of this code to return on error.

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [v5 0/3] "Hotremove" persistent memory
  2019-05-17 14:09       ` Pavel Tatashin
@ 2019-05-20  7:57         ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2019-05-20  7:57 UTC (permalink / raw)
  To: Pavel Tatashin, Dan Williams
  Cc: Verma, Vishal L, linux-kernel, jmorris, tiwai, sashal, linux-mm,
	dave.hansen, bp, akpm, linux-nvdimm, jglisse, zwisler, mhocko,
	Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky, Huang,
	Ying, Wu, Fengguang, baiyaowei

On 17.05.19 16:09, Pavel Tatashin wrote:
>>
>> I would think that ACPI hotplug would have a similar problem, but it does this:
>>
>>                 acpi_unbind_memory_blocks(info);
>>                 __remove_memory(nid, info->start_addr, info->length);
> 
> ACPI does have exactly the same problem, so this is not a bug for this
> series, I will submit a new version of my series with comments
> addressed, but without fix for this issue.
> 
> I was able to reproduce this issue on the current mainline kernel.
> Also, I been thinking more about how to fix it, and there is no easy
> fix without a major hotplug redesign. Basically, we have to remove
> sysfs memory entries before or after memory is hotplugged/hotremoved.
> But, we also have to guarantee that hotplug/hotremove will succeed or
> reinstate sysfs entries.
> 
> Qemu script:
> 
> qemu-system-x86_64                                                      \
>         -enable-kvm                                                     \
>         -cpu host                                                       \
>         -parallel none                                                  \
>         -echr 1                                                         \
>         -serial none                                                    \
>         -chardev stdio,id=console,signal=off,mux=on                     \
>         -serial chardev:console                                         \
>         -mon chardev=console                                            \
>         -vga none                                                       \
>         -display none                                                   \
>         -kernel pmem/native/arch/x86/boot/bzImage                       \
>         -m 8G,slots=1,maxmem=16G                                        \
>         -smp 8                                                          \
>         -fsdev local,id=virtfs1,path=/,security_model=none              \
>         -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs            \
>         -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
> TERM=xterm ip=dhcp loglevel=7'
> 
> Config is attached.
> 
> Steps to reproduce:
> #
> # QEMU 4.0.0 monitor - type 'help' for more information
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> (qemu)
> 
> # echo online_movable > /sys/devices/system/memory/memory79/state
> [   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
> [   23.032591] Policy zone: Normal
> # (qemu) device_del dimm1
> (qemu) [   32.013950] Offlined Pages 32768
> [   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
> [   32.014843] Policy zone: Normal
> [   32.015733]
> [   32.015881] ======================================================
> [   32.016390] WARNING: possible circular locking dependency detected
> [   32.016881] 5.1.0_pt_pmem #38 Not tainted
> [   32.017202] ------------------------------------------------------
> [   32.017680] kworker/u16:4/380 is trying to acquire lock:
> [   32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
> kernfs_remove_by_name_ns+0x3b/0x80
> [   32.018745]
> [   32.018745] but task is already holding lock:
> [   32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x1d/0xa0
> [   32.019859]
> [   32.019859] which lock already depends on the new lock.
> [   32.019859]
> [   32.020499]
> [   32.020499] the existing dependency chain (in reverse order) is:
> [   32.021080]
> [   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
> [   32.021522]        __mutex_lock+0x8b/0x900
> [   32.021843]        hotplug_memory_register+0x26/0xa0
> [   32.022231]        __add_pages+0xe7/0x160
> [   32.022545]        add_pages+0xd/0x60
> [   32.022835]        add_memory_resource+0xc3/0x1d0
> [   32.023207]        __add_memory+0x57/0x80
> [   32.023530]        acpi_memory_device_add+0x13a/0x2d0
> [   32.023928]        acpi_bus_attach+0xf1/0x200
> [   32.024272]        acpi_bus_scan+0x3e/0x90
> [   32.024597]        acpi_device_hotplug+0x284/0x3e0
> [   32.024972]        acpi_hotplug_work_fn+0x15/0x20
> [   32.025342]        process_one_work+0x2a0/0x650
> [   32.025755]        worker_thread+0x34/0x3d0
> [   32.026077]        kthread+0x118/0x130
> [   32.026442]        ret_from_fork+0x3a/0x50
> [   32.026766]
> [   32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
> [   32.027261]        get_online_mems+0x39/0x80
> [   32.027600]        kmem_cache_create_usercopy+0x29/0x2c0
> [   32.028019]        kmem_cache_create+0xd/0x10
> [   32.028367]        ptlock_cache_init+0x1b/0x23
> [   32.028724]        start_kernel+0x1d2/0x4b8
> [   32.029060]        secondary_startup_64+0xa4/0xb0
> [   32.029447]
> [   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
> [   32.030007]        cpus_read_lock+0x39/0x80
> [   32.030360]        __offline_pages+0x32/0x790
> [   32.030709]        memory_subsys_offline+0x3a/0x60
> [   32.031089]        device_offline+0x7e/0xb0
> [   32.031425]        acpi_bus_offline+0xd8/0x140
> [   32.031821]        acpi_device_hotplug+0x1b2/0x3e0
> [   32.032202]        acpi_hotplug_work_fn+0x15/0x20
> [   32.032576]        process_one_work+0x2a0/0x650
> [   32.032942]        worker_thread+0x34/0x3d0
> [   32.033283]        kthread+0x118/0x130
> [   32.033588]        ret_from_fork+0x3a/0x50
> [   32.033919]
> [   32.033919] -> #1 (&device->physical_node_lock){+.+.}:
> [   32.034450]        __mutex_lock+0x8b/0x900
> [   32.034784]        acpi_get_first_physical_node+0x16/0x60
> [   32.035217]        acpi_companion_match+0x3b/0x60
> [   32.035594]        acpi_device_uevent_modalias+0x9/0x20
> [   32.036012]        platform_uevent+0xd/0x40
> [   32.036352]        dev_uevent+0x85/0x1c0
> [   32.036674]        kobject_uevent_env+0x1e2/0x640
> [   32.037044]        kobject_synth_uevent+0x2b7/0x324
> [   32.037428]        uevent_store+0x17/0x30
> [   32.037752]        kernfs_fop_write+0xeb/0x1a0
> [   32.038112]        vfs_write+0xb2/0x1b0
> [   32.038417]        ksys_write+0x57/0xd0
> [   32.038721]        do_syscall_64+0x4b/0x1a0
> [   32.039053]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   32.039491]
> [   32.039491] -> #0 (kn->count#18){++++}:
> [   32.039913]        lock_acquire+0xaa/0x180
> [   32.040242]        __kernfs_remove+0x244/0x2d0
> [   32.040593]        kernfs_remove_by_name_ns+0x3b/0x80
> [   32.040991]        device_del+0x14a/0x370
> [   32.041309]        device_unregister+0x9/0x20
> [   32.041653]        unregister_memory_section+0x69/0xa0
> [   32.042059]        __remove_pages+0x112/0x460
> [   32.042402]        arch_remove_memory+0x6f/0xa0
> [   32.042758]        __remove_memory+0xab/0x130
> [   32.043103]        acpi_memory_device_remove+0x67/0xe0
> [   32.043537]        acpi_bus_trim+0x50/0x90
> [   32.043889]        acpi_device_hotplug+0x2fa/0x3e0
> [   32.044300]        acpi_hotplug_work_fn+0x15/0x20
> [   32.044686]        process_one_work+0x2a0/0x650
> [   32.045044]        worker_thread+0x34/0x3d0
> [   32.045381]        kthread+0x118/0x130
> [   32.045679]        ret_from_fork+0x3a/0x50
> [   32.046005]
> [   32.046005] other info that might help us debug this:
> [   32.046005]
> [   32.046636] Chain exists of:
> [   32.046636]   kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
> [   32.046636]
> [   32.047514]  Possible unsafe locking scenario:
> [   32.047514]
> [   32.047976]        CPU0                    CPU1
> [   32.048337]        ----                    ----
> [   32.048697]   lock(mem_sysfs_mutex);
> [   32.048983]                                lock(mem_hotplug_lock.rw_sem);
> [   32.049519]                                lock(mem_sysfs_mutex);
> [   32.050004]   lock(kn->count#18);
> [   32.050270]
> [   32.050270]  *** DEADLOCK ***
> [   32.050270]
> [   32.050736] 7 locks held by kworker/u16:4/380:
> [   32.051087]  #0: 00000000a22fe78e
> ((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
> [   32.051830]  #1: 00000000944f2dca
> ((work_completion)(&hpw->work)){+.+.}, at:
> process_one_work+0x21e/0x650
> [   32.052577]  #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
> acpi_device_hotplug+0x2e/0x3e0
> [   32.053271]  #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
> acpi_device_hotplug+0x3c/0x3e0
> [   32.053916]  #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
> at: __remove_memory+0x3b/0x130
> [   32.054602]  #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
> at: percpu_down_write+0x1d/0x110
> [   32.055315]  #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x1d/0xa0
> [   32.056016]
> [   32.056016] stack backtrace:
> [   32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
> [   32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.12.0-20181126_142135-anatol 04/01/2014
> [   32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [   32.058144] Call Trace:
> [   32.058344]  dump_stack+0x67/0x90
> [   32.058604]  print_circular_bug.cold.60+0x15c/0x195
> [   32.058989]  __lock_acquire+0x17de/0x1d30
> [   32.059308]  ? find_held_lock+0x2d/0x90
> [   32.059611]  ? __kernfs_remove+0x199/0x2d0
> [   32.059937]  lock_acquire+0xaa/0x180
> [   32.060223]  ? kernfs_remove_by_name_ns+0x3b/0x80
> [   32.060596]  __kernfs_remove+0x244/0x2d0
> [   32.060908]  ? kernfs_remove_by_name_ns+0x3b/0x80
> [   32.061283]  ? kernfs_name_hash+0xd/0x80
> [   32.061596]  ? kernfs_find_ns+0x68/0xf0
> [   32.061907]  kernfs_remove_by_name_ns+0x3b/0x80
> [   32.062266]  device_del+0x14a/0x370
> [   32.062548]  ? unregister_mem_sect_under_nodes+0x4f/0xc0
> [   32.062973]  device_unregister+0x9/0x20
> [   32.063285]  unregister_memory_section+0x69/0xa0
> [   32.063651]  __remove_pages+0x112/0x460
> [   32.063949]  arch_remove_memory+0x6f/0xa0
> [   32.064271]  __remove_memory+0xab/0x130
> [   32.064579]  ? walk_memory_range+0xa1/0xe0
> [   32.064907]  acpi_memory_device_remove+0x67/0xe0
> [   32.065274]  acpi_bus_trim+0x50/0x90
> [   32.065560]  acpi_device_hotplug+0x2fa/0x3e0
> [   32.065900]  acpi_hotplug_work_fn+0x15/0x20
> [   32.066249]  process_one_work+0x2a0/0x650
> [   32.066591]  worker_thread+0x34/0x3d0
> [   32.066925]  ? process_one_work+0x650/0x650
> [   32.067275]  kthread+0x118/0x130
> [   32.067542]  ? kthread_create_on_node+0x60/0x60
> [   32.067909]  ret_from_fork+0x3a/0x50
> 
>>
>> I wonder if that ordering prevents going too deep into the
>> device_unregister() call stack that you highlighted below.
>>
>>
>>>
>>> Here is the problem:
>>>
>>> When we offline pages we have the following call stack:
>>>
>>> # echo offline > /sys/devices/system/memory/memory8/state
>>> ksys_write
>>>  vfs_write
>>>   __vfs_write
>>>    kernfs_fop_write
>>>     kernfs_get_active
>>>      lock_acquire                       kn->count#122 (lock for
>>> "memory8/state" kn)
>>>     sysfs_kf_write
>>>      dev_attr_store
>>>       state_store
>>>        device_offline
>>>         memory_subsys_offline
>>>          memory_block_action
>>>           offline_pages
>>>            __offline_pages
>>>             percpu_down_write
>>>              down_write
>>>               lock_acquire              mem_hotplug_lock.rw_sem
>>>
>>> When we unbind dax0.0 we have the following  stack:
>>> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>>> drv_attr_store
>>>  unbind_store
>>>   device_driver_detach
>>>    device_release_driver_internal
>>>     dev_dax_kmem_remove
>>>      remove_memory                      device_hotplug_lock
>>>       try_remove_memory                 mem_hotplug_lock.rw_sem
>>>        arch_remove_memory
>>>         __remove_pages
>>>          __remove_section
>>>           unregister_memory_section
>>>            remove_memory_section        mem_sysfs_mutex
>>>             unregister_memory
>>>              device_unregister
>>>               device_del
>>>                device_remove_attrs
>>>                 sysfs_remove_groups
>>>                  sysfs_remove_group
>>>                   remove_files
>>>                    kernfs_remove_by_name
>>>                     kernfs_remove_by_name_ns
>>>                      __kernfs_remove    kn->count#122
>>>
>>> So, lockdep found the ordering issue with the above two stacks:
>>>
>>> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
>>> 2. mem_hotplug_lock.rw_sem -> kn->count#122

I once documented locking behavior in

Documentation/core-api/memory-hotplug.rst

Both, device_online() and __remove_memory() always have to be called
holding the device_hotplug_lock(), to avoid such races.

# echo offline > /sys/devices/system/memory/memory8/state
and
# echo 0 > /sys/devices/system/memory/memory8/online

either end up in:

-> online_store()
-- lock_device_hotplug_sysfs();
-- device_offline(dev)

-> state_store()
-- lock_device_hotplug_sysfs();
--  device_online(&mem->dev);

So the device_hotplug_lock prohibits the race you describe.

BUT

There is a possible race between the device_hotplug_lock and the
kn->count#122. However, that race can never trigger, as userspace
properly backs off in case it cannot get grip of the device_hotplug_lock.

(lock_device_hotplug_sysfs does a mutex_trylock(&device_hotplug_lock))

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-05-20  7:58 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-02 18:43 [v5 0/3] "Hotremove" persistent memory Pavel Tatashin
2019-05-02 18:43 ` [v5 1/3] device-dax: fix memory and resource leak if hotplug fails Pavel Tatashin
2019-05-02 18:43 ` [v5 2/3] mm/hotplug: make remove_memory() interface useable Pavel Tatashin
2019-05-03 10:06   ` David Hildenbrand
2019-05-06 17:57   ` Dave Hansen
2019-05-06 18:01     ` Dan Williams
2019-05-06 18:04       ` Dave Hansen
2019-05-06 18:18         ` Pavel Tatashin
2019-05-17 18:10       ` Pavel Tatashin
2019-05-06 18:13     ` Pavel Tatashin
2019-05-02 18:43 ` [v5 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM Pavel Tatashin
2019-05-02 20:50 ` [v5 0/3] "Hotremove" persistent memory Verma, Vishal L
2019-05-02 21:44   ` Pavel Tatashin
2019-05-02 22:29     ` Verma, Vishal L
2019-05-02 22:36       ` Pavel Tatashin
2019-05-03 21:48         ` Verma, Vishal L
2019-05-15 18:11   ` Pavel Tatashin
2019-05-16  0:42     ` Dan Williams
2019-05-16  7:10       ` David Hildenbrand
2019-05-17 14:09       ` Pavel Tatashin
2019-05-20  7:57         ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).