* [v5 1/3] device-dax: fix memory and resource leak if hotplug fails
2019-05-02 18:43 [v5 0/3] "Hotremove" persistent memory Pavel Tatashin
@ 2019-05-02 18:43 ` Pavel Tatashin
2019-05-02 18:43 ` [v5 2/3] mm/hotplug: make remove_memory() interface useable Pavel Tatashin
` (2 subsequent siblings)
3 siblings, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 18:43 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, linux-kernel, linux-mm,
linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
keith.busch, vishal.l.verma, dave.jiang, zwisler,
thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
baiyaowei, tiwai, jglisse, david
When add_memory() function fails, the resource and the memory should be
freed.
Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
drivers/dax/kmem.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a02318c6d28a..4c0131857133 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -66,8 +66,11 @@ int dev_dax_kmem_probe(struct device *dev)
new_res->name = dev_name(dev);
rc = add_memory(numa_node, new_res->start, resource_size(new_res));
- if (rc)
+ if (rc) {
+ release_resource(new_res);
+ kfree(new_res);
return rc;
+ }
return 0;
}
--
2.21.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [v5 2/3] mm/hotplug: make remove_memory() interface useable
2019-05-02 18:43 [v5 0/3] "Hotremove" persistent memory Pavel Tatashin
2019-05-02 18:43 ` [v5 1/3] device-dax: fix memory and resource leak if hotplug fails Pavel Tatashin
@ 2019-05-02 18:43 ` Pavel Tatashin
2019-05-03 10:06 ` David Hildenbrand
2019-05-06 17:57 ` Dave Hansen
2019-05-02 18:43 ` [v5 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM Pavel Tatashin
2019-05-02 20:50 ` [v5 0/3] "Hotremove" persistent memory Verma, Vishal L
3 siblings, 2 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 18:43 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, linux-kernel, linux-mm,
linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
keith.busch, vishal.l.verma, dave.jiang, zwisler,
thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
baiyaowei, tiwai, jglisse, david
As of right now remove_memory() interface is inherently broken. It tries
to remove memory but panics if some memory is not offline. The problem
is that it is impossible to ensure that all memory blocks are offline as
this function also takes lock_device_hotplug that is required to
change memory state via sysfs.
So, between calling this function and offlining all memory blocks there
is always a window when lock_device_hotplug is released, and therefore,
there is always a chance for a panic during this window.
Make this interface to return an error if memory removal fails. This way
it is safe to call this function without panicking machine, and also
makes it symmetric to add_memory() which already returns an error.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
include/linux/memory_hotplug.h | 8 +++--
mm/memory_hotplug.c | 61 ++++++++++++++++++++++------------
2 files changed, 46 insertions(+), 23 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8ade08c50d26..5438a2d92560 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -304,7 +304,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
extern void try_offline_node(int nid);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
-extern void remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
extern void __remove_memory(int nid, u64 start, u64 size);
#else
@@ -321,7 +321,11 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
return -EINVAL;
}
-static inline void remove_memory(int nid, u64 start, u64 size) {}
+static inline bool remove_memory(int nid, u64 start, u64 size)
+{
+ return -EBUSY;
+}
+
static inline void __remove_memory(int nid, u64 start, u64 size) {}
#endif /* CONFIG_MEMORY_HOTREMOVE */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8c454e82d4f6..a826aededa1a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1778,9 +1778,10 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
&beginpa, &endpa);
- }
- return ret;
+ return -EBUSY;
+ }
+ return 0;
}
static int check_cpu_on_node(pg_data_t *pgdat)
@@ -1843,19 +1844,9 @@ void try_offline_node(int nid)
}
EXPORT_SYMBOL(try_offline_node);
-/**
- * remove_memory
- * @nid: the node ID
- * @start: physical address of the region to remove
- * @size: size of the region to remove
- *
- * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
- * and online/offline operations before this call, as required by
- * try_offline_node().
- */
-void __ref __remove_memory(int nid, u64 start, u64 size)
+static int __ref try_remove_memory(int nid, u64 start, u64 size)
{
- int ret;
+ int rc = 0;
BUG_ON(check_hotplug_memory_range(start, size));
@@ -1863,13 +1854,13 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
/*
* All memory blocks must be offlined before removing memory. Check
- * whether all memory blocks in question are offline and trigger a BUG()
+ * whether all memory blocks in question are offline and return error
* if this is not the case.
*/
- ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
- check_memblock_offlined_cb);
- if (ret)
- BUG();
+ rc = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
+ check_memblock_offlined_cb);
+ if (rc)
+ goto done;
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
@@ -1879,14 +1870,42 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
try_offline_node(nid);
+done:
mem_hotplug_done();
+ return rc;
}
-void remove_memory(int nid, u64 start, u64 size)
+/**
+ * remove_memory
+ * @nid: the node ID
+ * @start: physical address of the region to remove
+ * @size: size of the region to remove
+ *
+ * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
+ * and online/offline operations before this call, as required by
+ * try_offline_node().
+ */
+void __remove_memory(int nid, u64 start, u64 size)
{
+
+ /*
+ * trigger BUG() is some memory is not offlined prior to calling this
+ * function
+ */
+ if (try_remove_memory(nid, start, size))
+ BUG();
+}
+
+/* Remove memory if every memory block is offline, otherwise return false */
+int remove_memory(int nid, u64 start, u64 size)
+{
+ int rc;
+
lock_device_hotplug();
- __remove_memory(nid, start, size);
+ rc = try_remove_memory(nid, start, size);
unlock_device_hotplug();
+
+ return rc;
}
EXPORT_SYMBOL_GPL(remove_memory);
#endif /* CONFIG_MEMORY_HOTREMOVE */
--
2.21.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
2019-05-02 18:43 ` [v5 2/3] mm/hotplug: make remove_memory() interface useable Pavel Tatashin
@ 2019-05-03 10:06 ` David Hildenbrand
2019-05-06 17:57 ` Dave Hansen
1 sibling, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2019-05-03 10:06 UTC (permalink / raw)
To: Pavel Tatashin, jmorris, sashal, linux-kernel, linux-mm,
linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
keith.busch, vishal.l.verma, dave.jiang, zwisler,
thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
baiyaowei, tiwai, jglisse
On 02.05.19 20:43, Pavel Tatashin wrote:
> As of right now remove_memory() interface is inherently broken. It tries
> to remove memory but panics if some memory is not offline. The problem
> is that it is impossible to ensure that all memory blocks are offline as
> this function also takes lock_device_hotplug that is required to
> change memory state via sysfs.
>
The existing interface can actually work today by registering a hotplug
notifier and rejecting any onlining attempts. But I agree that this way,
the interface becomes more usable.
> So, between calling this function and offlining all memory blocks there
> is always a window when lock_device_hotplug is released, and therefore,
> there is always a chance for a panic during this window.
>
> Make this interface to return an error if memory removal fails. This way
> it is safe to call this function without panicking machine, and also
> makes it symmetric to add_memory() which already returns an error.
>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>> ---
> include/linux/memory_hotplug.h | 8 +++--
> mm/memory_hotplug.c | 61 ++++++++++++++++++++++------------
> 2 files changed, 46 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 8ade08c50d26..5438a2d92560 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -304,7 +304,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
> extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
> extern void try_offline_node(int nid);
> extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> -extern void remove_memory(int nid, u64 start, u64 size);
> +extern int remove_memory(int nid, u64 start, u64 size);
> extern void __remove_memory(int nid, u64 start, u64 size);
>
> #else
> @@ -321,7 +321,11 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> return -EINVAL;
> }
>
> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline bool remove_memory(int nid, u64 start, u64 size)
> +{
> + return -EBUSY;
> +}
> +
> static inline void __remove_memory(int nid, u64 start, u64 size) {}
> #endif /* CONFIG_MEMORY_HOTREMOVE */
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8c454e82d4f6..a826aededa1a 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1778,9 +1778,10 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
> endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
> pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
> &beginpa, &endpa);
> - }
>
> - return ret;
> + return -EBUSY;
> + }
> + return 0;
> }
>
> static int check_cpu_on_node(pg_data_t *pgdat)
> @@ -1843,19 +1844,9 @@ void try_offline_node(int nid)
> }
> EXPORT_SYMBOL(try_offline_node);
>
> -/**
> - * remove_memory
> - * @nid: the node ID
> - * @start: physical address of the region to remove
> - * @size: size of the region to remove
> - *
> - * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> - * and online/offline operations before this call, as required by
> - * try_offline_node().
> - */
> -void __ref __remove_memory(int nid, u64 start, u64 size)
> +static int __ref try_remove_memory(int nid, u64 start, u64 size)
> {
> - int ret;
> + int rc = 0;
>
> BUG_ON(check_hotplug_memory_range(start, size));
>
> @@ -1863,13 +1854,13 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>
> /*
> * All memory blocks must be offlined before removing memory. Check
> - * whether all memory blocks in question are offline and trigger a BUG()
> + * whether all memory blocks in question are offline and return error
> * if this is not the case.
> */
> - ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> - check_memblock_offlined_cb);
> - if (ret)
> - BUG();
> + rc = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> + check_memblock_offlined_cb);
> + if (rc)
> + goto done;
>
> /* remove memmap entry */
> firmware_map_remove(start, start + size, "System RAM");
> @@ -1879,14 +1870,42 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>
> try_offline_node(nid);
>
> +done:
> mem_hotplug_done();
> + return rc;
> }
>
> -void remove_memory(int nid, u64 start, u64 size)
> +/**
> + * remove_memory
> + * @nid: the node ID
> + * @start: physical address of the region to remove
> + * @size: size of the region to remove
> + *
> + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> + * and online/offline operations before this call, as required by
> + * try_offline_node().
> + */
> +void __remove_memory(int nid, u64 start, u64 size)
> {
> +
> + /*
> + * trigger BUG() is some memory is not offlined prior to calling this
> + * function
> + */
> + if (try_remove_memory(nid, start, size))
> + BUG();
> +}
> +
> +/* Remove memory if every memory block is offline, otherwise return false */
Comment is wrong "return false"
> +int remove_memory(int nid, u64 start, u64 size)
> +{
> + int rc;
> +
> lock_device_hotplug();
> - __remove_memory(nid, start, size);
> + rc = try_remove_memory(nid, start, size);
> unlock_device_hotplug();
> +
> + return rc;
> }
> EXPORT_SYMBOL_GPL(remove_memory);
> #endif /* CONFIG_MEMORY_HOTREMOVE */
>
Looks sane to me
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
2019-05-02 18:43 ` [v5 2/3] mm/hotplug: make remove_memory() interface useable Pavel Tatashin
2019-05-03 10:06 ` David Hildenbrand
@ 2019-05-06 17:57 ` Dave Hansen
2019-05-06 18:01 ` Dan Williams
2019-05-06 18:13 ` Pavel Tatashin
1 sibling, 2 replies; 21+ messages in thread
From: Dave Hansen @ 2019-05-06 17:57 UTC (permalink / raw)
To: Pavel Tatashin, jmorris, sashal, linux-kernel, linux-mm,
linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
keith.busch, vishal.l.verma, dave.jiang, zwisler,
thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
baiyaowei, tiwai, jglisse, david
> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline bool remove_memory(int nid, u64 start, u64 size)
> +{
> + return -EBUSY;
> +}
This seems like an appropriate place for a WARN_ONCE(), if someone
manages to call remove_memory() with hotplug disabled.
BTW, I looked and can't think of a better errno, but -EBUSY probably
isn't the best error code, right?
> -void remove_memory(int nid, u64 start, u64 size)
> +/**
> + * remove_memory
> + * @nid: the node ID
> + * @start: physical address of the region to remove
> + * @size: size of the region to remove
> + *
> + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> + * and online/offline operations before this call, as required by
> + * try_offline_node().
> + */
> +void __remove_memory(int nid, u64 start, u64 size)
> {
> +
> + /*
> + * trigger BUG() is some memory is not offlined prior to calling this
> + * function
> + */
> + if (try_remove_memory(nid, start, size))
> + BUG();
> +}
Could we call this remove_offline_memory()? That way, it makes _some_
sense why we would BUG() if the memory isn't offline.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
2019-05-06 17:57 ` Dave Hansen
@ 2019-05-06 18:01 ` Dan Williams
2019-05-06 18:04 ` Dave Hansen
2019-05-17 18:10 ` Pavel Tatashin
2019-05-06 18:13 ` Pavel Tatashin
1 sibling, 2 replies; 21+ messages in thread
From: Dan Williams @ 2019-05-06 18:01 UTC (permalink / raw)
To: Dave Hansen
Cc: Pavel Tatashin, James Morris, Sasha Levin,
Linux Kernel Mailing List, Linux MM, linux-nvdimm, Andrew Morton,
Michal Hocko, Dave Hansen, Keith Busch, Vishal L Verma,
Dave Jiang, Ross Zwisler, Tom Lendacky, Huang, Ying,
Fengguang Wu, Borislav Petkov, Bjorn Helgaas, Yaowei Bai,
Takashi Iwai, Jérôme Glisse, David Hildenbrand
On Mon, May 6, 2019 at 10:57 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > +{
> > + return -EBUSY;
> > +}
>
> This seems like an appropriate place for a WARN_ONCE(), if someone
> manages to call remove_memory() with hotplug disabled.
>
> BTW, I looked and can't think of a better errno, but -EBUSY probably
> isn't the best error code, right?
>
> > -void remove_memory(int nid, u64 start, u64 size)
> > +/**
> > + * remove_memory
> > + * @nid: the node ID
> > + * @start: physical address of the region to remove
> > + * @size: size of the region to remove
> > + *
> > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > + * and online/offline operations before this call, as required by
> > + * try_offline_node().
> > + */
> > +void __remove_memory(int nid, u64 start, u64 size)
> > {
> > +
> > + /*
> > + * trigger BUG() is some memory is not offlined prior to calling this
> > + * function
> > + */
> > + if (try_remove_memory(nid, start, size))
> > + BUG();
> > +}
>
> Could we call this remove_offline_memory()? That way, it makes _some_
> sense why we would BUG() if the memory isn't offline.
Please WARN() instead of BUG() because failing to remove memory should
not be system fatal.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
2019-05-06 18:01 ` Dan Williams
@ 2019-05-06 18:04 ` Dave Hansen
2019-05-06 18:18 ` Pavel Tatashin
2019-05-17 18:10 ` Pavel Tatashin
1 sibling, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2019-05-06 18:04 UTC (permalink / raw)
To: Dan Williams
Cc: Pavel Tatashin, James Morris, Sasha Levin,
Linux Kernel Mailing List, Linux MM, linux-nvdimm, Andrew Morton,
Michal Hocko, Dave Hansen, Keith Busch, Vishal L Verma,
Dave Jiang, Ross Zwisler, Tom Lendacky, Huang, Ying,
Fengguang Wu, Borislav Petkov, Bjorn Helgaas, Yaowei Bai,
Takashi Iwai, Jérôme Glisse, David Hildenbrand
On 5/6/19 11:01 AM, Dan Williams wrote:
>>> +void __remove_memory(int nid, u64 start, u64 size)
>>> {
>>> +
>>> + /*
>>> + * trigger BUG() is some memory is not offlined prior to calling this
>>> + * function
>>> + */
>>> + if (try_remove_memory(nid, start, size))
>>> + BUG();
>>> +}
>> Could we call this remove_offline_memory()? That way, it makes _some_
>> sense why we would BUG() if the memory isn't offline.
> Please WARN() instead of BUG() because failing to remove memory should
> not be system fatal.
That is my preference as well. But, the existing code BUG()s, so I'm
OK-ish with this staying for the moment until we have a better handle on
what all the callers do if this fails.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
2019-05-06 18:04 ` Dave Hansen
@ 2019-05-06 18:18 ` Pavel Tatashin
0 siblings, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-06 18:18 UTC (permalink / raw)
To: Dave Hansen
Cc: Dan Williams, James Morris, Sasha Levin,
Linux Kernel Mailing List, Linux MM, linux-nvdimm, Andrew Morton,
Michal Hocko, Dave Hansen, Keith Busch, Vishal L Verma,
Dave Jiang, Ross Zwisler, Tom Lendacky, Huang, Ying,
Fengguang Wu, Borislav Petkov, Bjorn Helgaas, Yaowei Bai,
Takashi Iwai, Jérôme Glisse, David Hildenbrand
On Mon, May 6, 2019 at 2:04 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/6/19 11:01 AM, Dan Williams wrote:
> >>> +void __remove_memory(int nid, u64 start, u64 size)
> >>> {
> >>> +
> >>> + /*
> >>> + * trigger BUG() is some memory is not offlined prior to calling this
> >>> + * function
> >>> + */
> >>> + if (try_remove_memory(nid, start, size))
> >>> + BUG();
> >>> +}
> >> Could we call this remove_offline_memory()? That way, it makes _some_
> >> sense why we would BUG() if the memory isn't offline.
> > Please WARN() instead of BUG() because failing to remove memory should
> > not be system fatal.
>
> That is my preference as well. But, the existing code BUG()s, so I'm
> OK-ish with this staying for the moment until we have a better handle on
> what all the callers do if this fails.
Yes, this is the reason why I BUG() here. The current code does this,
and I was not sure what would happen if we simply continue executing.
Of course, I would prefer to return failure, so the callers can act
appropriately, but let's make one thing at a time, this should not be
part of this series.
Thank you,
Pasha
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
2019-05-06 18:01 ` Dan Williams
2019-05-06 18:04 ` Dave Hansen
@ 2019-05-17 18:10 ` Pavel Tatashin
1 sibling, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-17 18:10 UTC (permalink / raw)
To: Dan Williams
Cc: Dave Hansen, James Morris, Sasha Levin,
Linux Kernel Mailing List, Linux MM, linux-nvdimm, Andrew Morton,
Michal Hocko, Dave Hansen, Keith Busch, Vishal L Verma,
Dave Jiang, Ross Zwisler, Tom Lendacky, Huang, Ying,
Fengguang Wu, Borislav Petkov, Bjorn Helgaas, Yaowei Bai,
Takashi Iwai, Jérôme Glisse, David Hildenbrand
Hi Dan,
Thank you very much for your review, my comments below:
On Mon, May 6, 2019 at 2:01 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, May 6, 2019 at 10:57 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > > +{
> > > + return -EBUSY;
> > > +}
> >
> > This seems like an appropriate place for a WARN_ONCE(), if someone
> > manages to call remove_memory() with hotplug disabled.
I decided not to do WARN_ONCE(), in all likelihood compiler will
simply optimize this function out, but with WARN_ONCE() some traces of
it will remain.
> >
> > BTW, I looked and can't think of a better errno, but -EBUSY probably
> > isn't the best error code, right?
-EBUSY is the only error that is returned in case of error by real
remove_memory(), so I think it is OK to keep it here.
> >
> > > -void remove_memory(int nid, u64 start, u64 size)
> > > +/**
> > > + * remove_memory
> > > + * @nid: the node ID
> > > + * @start: physical address of the region to remove
> > > + * @size: size of the region to remove
> > > + *
> > > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > > + * and online/offline operations before this call, as required by
> > > + * try_offline_node().
> > > + */
> > > +void __remove_memory(int nid, u64 start, u64 size)
> > > {
> > > +
> > > + /*
> > > + * trigger BUG() is some memory is not offlined prior to calling this
> > > + * function
> > > + */
> > > + if (try_remove_memory(nid, start, size))
> > > + BUG();
> > > +}
> >
> > Could we call this remove_offline_memory()? That way, it makes _some_
> > sense why we would BUG() if the memory isn't offline.
It is this particular code path, the second one: remove_memory(),
actually tries to remove memory and returns failure if it can't. So, I
think the current name is OK.
>
> Please WARN() instead of BUG() because failing to remove memory should
> not be system fatal.
As mentioned earlier, I will keep BUG(), because existing code does
that, and there is no good handling of this code to return on error.
Thank you,
Pavel
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
2019-05-06 17:57 ` Dave Hansen
2019-05-06 18:01 ` Dan Williams
@ 2019-05-06 18:13 ` Pavel Tatashin
1 sibling, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-06 18:13 UTC (permalink / raw)
To: Dave Hansen
Cc: James Morris, Sasha Levin, LKML, linux-mm, linux-nvdimm,
Andrew Morton, Michal Hocko, Dave Hansen, Dan Williams,
Keith Busch, Vishal L Verma, Dave Jiang, Ross Zwisler,
Tom Lendacky, Huang, Ying, Fengguang Wu, Borislav Petkov,
Bjorn Helgaas, Yaowei Bai, Takashi Iwai, Jérôme Glisse,
David Hildenbrand
On Mon, May 6, 2019 at 1:57 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > +{
> > + return -EBUSY;
> > +}
>
> This seems like an appropriate place for a WARN_ONCE(), if someone
> manages to call remove_memory() with hotplug disabled.
>
> BTW, I looked and can't think of a better errno, but -EBUSY probably
> isn't the best error code, right?
Same here, I looked and did not find any better then -EBUSY. Also, it
is close to check_cpu_on_node() in the same file.
>
> > -void remove_memory(int nid, u64 start, u64 size)
> > +/**
> > + * remove_memory
> > + * @nid: the node ID
> > + * @start: physical address of the region to remove
> > + * @size: size of the region to remove
> > + *
> > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > + * and online/offline operations before this call, as required by
> > + * try_offline_node().
> > + */
> > +void __remove_memory(int nid, u64 start, u64 size)
> > {
> > +
> > + /*
> > + * trigger BUG() is some memory is not offlined prior to calling this
> > + * function
> > + */
> > + if (try_remove_memory(nid, start, size))
> > + BUG();
> > +}
>
> Could we call this remove_offline_memory()? That way, it makes _some_
> sense why we would BUG() if the memory isn't offline.
Sure, I will rename this function.
Thank you,
Pasha
^ permalink raw reply [flat|nested] 21+ messages in thread
* [v5 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM
2019-05-02 18:43 [v5 0/3] "Hotremove" persistent memory Pavel Tatashin
2019-05-02 18:43 ` [v5 1/3] device-dax: fix memory and resource leak if hotplug fails Pavel Tatashin
2019-05-02 18:43 ` [v5 2/3] mm/hotplug: make remove_memory() interface useable Pavel Tatashin
@ 2019-05-02 18:43 ` Pavel Tatashin
2019-05-02 20:50 ` [v5 0/3] "Hotremove" persistent memory Verma, Vishal L
3 siblings, 0 replies; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 18:43 UTC (permalink / raw)
To: pasha.tatashin, jmorris, sashal, linux-kernel, linux-mm,
linux-nvdimm, akpm, mhocko, dave.hansen, dan.j.williams,
keith.busch, vishal.l.verma, dave.jiang, zwisler,
thomas.lendacky, ying.huang, fengguang.wu, bp, bhelgaas,
baiyaowei, tiwai, jglisse, david
It is now allowed to use persistent memory like a regular RAM, but
currently there is no way to remove this memory until machine is
rebooted.
This work expands the functionality to also allows hotremoving
previously hotplugged persistent memory, and recover the device for use
for other purposes.
To hotremove persistent memory, the management software must first
offline all memory blocks of dax region, and than unbind it from
device-dax/kmem driver. So, operations should look like this:
echo offline > /sys/devices/system/memory/memoryN/state
...
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
Note: if unbind is done without offlining memory beforehand, it won't be
possible to do dax0.0 hotremove, and dax's memory is going to be part of
System RAM until reboot.
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
drivers/dax/dax-private.h | 2 ++
drivers/dax/kmem.c | 41 +++++++++++++++++++++++++++++++++++----
2 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index a45612148ca0..999aaf3a29b3 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -53,6 +53,7 @@ struct dax_region {
* @pgmap - pgmap for memmap setup / lifetime (driver owned)
* @ref: pgmap reference count (driver owned)
* @cmp: @ref final put completion (driver owned)
+ * @dax_mem_res: physical address range of hotadded DAX memory
*/
struct dev_dax {
struct dax_region *region;
@@ -62,6 +63,7 @@ struct dev_dax {
struct dev_pagemap pgmap;
struct percpu_ref ref;
struct completion cmp;
+ struct resource *dax_kmem_res;
};
static inline struct dev_dax *to_dev_dax(struct device *dev)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 4c0131857133..3d0a7e702c94 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -71,21 +71,54 @@ int dev_dax_kmem_probe(struct device *dev)
kfree(new_res);
return rc;
}
+ dev_dax->dax_kmem_res = new_res;
return 0;
}
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static int dev_dax_kmem_remove(struct device *dev)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct resource *res = dev_dax->dax_kmem_res;
+ resource_size_t kmem_start = res->start;
+ resource_size_t kmem_size = resource_size(res);
+ int rc;
+
+ /*
+ * We have one shot for removing memory, if some memory blocks were not
+ * offline prior to calling this function remove_memory() will fail, and
+ * there is no way to hotremove this memory until reboot because device
+ * unbind will succeed even if we return failure.
+ */
+ rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
+ if (rc) {
+ dev_err(dev,
+ "DAX region %pR cannot be hotremoved until the next reboot\n",
+ res);
+ return rc;
+ }
+
+ /* Release and free dax resources */
+ release_resource(res);
+ kfree(res);
+ dev_dax->dax_kmem_res = NULL;
+
+ return 0;
+}
+#else
static int dev_dax_kmem_remove(struct device *dev)
{
/*
- * Purposely leak the request_mem_region() for the device-dax
- * range and return '0' to ->remove() attempts. The removal of
- * the device from the driver always succeeds, but the region
- * is permanently pinned as reserved by the unreleased
+ * Without hotremove purposely leak the request_mem_region() for the
+ * device-dax range and return '0' to ->remove() attempts. The removal
+ * of the device from the driver always succeeds, but the region is
+ * permanently pinned as reserved by the unreleased
* request_mem_region().
*/
return 0;
}
+#endif /* CONFIG_MEMORY_HOTREMOVE */
static struct dax_device_driver device_dax_kmem_driver = {
.drv = {
--
2.21.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-02 18:43 [v5 0/3] "Hotremove" persistent memory Pavel Tatashin
` (2 preceding siblings ...)
2019-05-02 18:43 ` [v5 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM Pavel Tatashin
@ 2019-05-02 20:50 ` Verma, Vishal L
2019-05-02 21:44 ` Pavel Tatashin
2019-05-15 18:11 ` Pavel Tatashin
3 siblings, 2 replies; 21+ messages in thread
From: Verma, Vishal L @ 2019-05-02 20:50 UTC (permalink / raw)
To: linux-kernel, jmorris, tiwai, sashal, pasha.tatashin, linux-mm,
dave.hansen, david, bp, Williams, Dan J, akpm, linux-nvdimm,
jglisse, zwisler, mhocko, Jiang, Dave, bhelgaas, Busch, Keith,
thomas.lendacky, Huang, Ying, Wu, Fengguang, baiyaowei
On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> The series of operations look like this:
>
> 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
> and free ramdisk.
> 2. Convert raw pmem0 to devdax
> ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> 3. Hotadd to System RAM
> echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> echo online_movable > /sys/devices/system/memoryXXX/state
> 4. Before reboot hotremove device-dax memory from System RAM
> echo offline > /sys/devices/system/memoryXXX/state
> echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
Hi Pavel,
I am working on adding this sort of a workflow into a new daxctl command
(daxctl-reconfigure-device)- this will allow changing the 'mode' of a
dax device to kmem, online the resulting memory, and with your patches,
also attempt to offline the memory, and change back to device-dax.
In running with these patches, and testing the offlining part, I ran
into the following lockdep below.
This is with just these three patches on top of -rc7.
[ +0.004886] ======================================================
[ +0.001576] WARNING: possible circular locking dependency detected
[ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
[ +0.000929] ------------------------------------------------------
[ +0.000708] daxctl/22950 is trying to acquire lock:
[ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
[ +0.000922]
but task is already holding lock:
[ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[ +0.000960]
which lock already depends on the new lock.
[ +0.001001]
the existing dependency chain (in reverse order) is:
[ +0.000837]
-> #3 (mem_sysfs_mutex){+.+.}:
[ +0.000631] __mutex_lock+0x82/0x9a0
[ +0.000477] unregister_memory_section+0x22/0xa0
[ +0.000582] __remove_pages+0xe9/0x520
[ +0.000489] arch_remove_memory+0x81/0xc0
[ +0.000510] devm_memremap_pages_release+0x180/0x270
[ +0.000633] release_nodes+0x234/0x280
[ +0.000483] device_release_driver_internal+0xf4/0x1d0
[ +0.000701] bus_remove_device+0xfc/0x170
[ +0.000529] device_del+0x16a/0x380
[ +0.000459] unregister_dev_dax+0x23/0x50
[ +0.000526] release_nodes+0x234/0x280
[ +0.000487] device_release_driver_internal+0xf4/0x1d0
[ +0.000646] unbind_store+0x9b/0x130
[ +0.000467] kernfs_fop_write+0xf0/0x1a0
[ +0.000510] vfs_write+0xba/0x1c0
[ +0.000438] ksys_write+0x5a/0xe0
[ +0.000521] do_syscall_64+0x60/0x210
[ +0.000489] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ +0.000637]
-> #2 (mem_hotplug_lock.rw_sem){++++}:
[ +0.000717] get_online_mems+0x3e/0x80
[ +0.000491] kmem_cache_create_usercopy+0x2e/0x270
[ +0.000609] kmem_cache_create+0x12/0x20
[ +0.000507] ptlock_cache_init+0x20/0x28
[ +0.000506] start_kernel+0x240/0x4d0
[ +0.000480] secondary_startup_64+0xa4/0xb0
[ +0.000539]
-> #1 (cpu_hotplug_lock.rw_sem){++++}:
[ +0.000784] cpus_read_lock+0x3e/0x80
[ +0.000511] online_pages+0x37/0x310
[ +0.000469] memory_subsys_online+0x34/0x60
[ +0.000611] device_online+0x60/0x80
[ +0.000611] state_store+0x66/0xd0
[ +0.000552] kernfs_fop_write+0xf0/0x1a0
[ +0.000649] vfs_write+0xba/0x1c0
[ +0.000487] ksys_write+0x5a/0xe0
[ +0.000459] do_syscall_64+0x60/0x210
[ +0.000482] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ +0.000646]
-> #0 (kn->count#424){++++}:
[ +0.000669] lock_acquire+0x9e/0x180
[ +0.000471] __kernfs_remove+0x26a/0x310
[ +0.000518] kernfs_remove_by_name_ns+0x40/0x80
[ +0.000583] remove_files.isra.1+0x30/0x70
[ +0.000555] sysfs_remove_group+0x3d/0x80
[ +0.000524] sysfs_remove_groups+0x29/0x40
[ +0.000532] device_remove_attrs+0x42/0x80
[ +0.000522] device_del+0x162/0x380
[ +0.000464] device_unregister+0x16/0x60
[ +0.000505] unregister_memory_section+0x6e/0xa0
[ +0.000591] __remove_pages+0xe9/0x520
[ +0.000492] arch_remove_memory+0x81/0xc0
[ +0.000568] try_remove_memory+0xba/0xd0
[ +0.000510] remove_memory+0x23/0x40
[ +0.000483] dev_dax_kmem_remove+0x29/0x57 [kmem]
[ +0.000608] device_release_driver_internal+0xe4/0x1d0
[ +0.000637] unbind_store+0x9b/0x130
[ +0.000464] kernfs_fop_write+0xf0/0x1a0
[ +0.000685] vfs_write+0xba/0x1c0
[ +0.000594] ksys_write+0x5a/0xe0
[ +0.000449] do_syscall_64+0x60/0x210
[ +0.000481] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ +0.000619]
other info that might help us debug this:
[ +0.000889] Chain exists of:
kn->count#424 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
[ +0.001269] Possible unsafe locking scenario:
[ +0.000652] CPU0 CPU1
[ +0.000505] ---- ----
[ +0.000523] lock(mem_sysfs_mutex);
[ +0.000422] lock(mem_hotplug_lock.rw_sem);
[ +0.000905] lock(mem_sysfs_mutex);
[ +0.000793] lock(kn->count#424);
[ +0.000394]
*** DEADLOCK ***
[ +0.000665] 7 locks held by daxctl/22950:
[ +0.000458] #0: 000000005f6d3c13 (sb_writers#4){.+.+}, at: vfs_write+0x159/0x1c0
[ +0.000943] #1: 00000000e468825d (&of->mutex){+.+.}, at: kernfs_fop_write+0xbd/0x1a0
[ +0.000895] #2: 00000000caa17dbb (&dev->mutex){....}, at: device_release_driver_internal+0x1a/0x1d0
[ +0.001019] #3: 000000002119b22c (device_hotplug_lock){+.+.}, at: remove_memory+0x16/0x40
[ +0.000942] #4: 00000000150c8efe (cpu_hotplug_lock.rw_sem){++++}, at: try_remove_memory+0x2e/0xd0
[ +0.001019] #5: 000000003d6b2a0f (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x25/0x120
[ +0.001118] #6: 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[ +0.001033]
stack backtrace:
[ +0.000507] CPU: 5 PID: 22950 Comm: daxctl Tainted: G O 5.1.0-rc7+ #13
[ +0.000896] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[ +0.001360] Call Trace:
[ +0.000293] dump_stack+0x85/0xc0
[ +0.000390] print_circular_bug.isra.41.cold.60+0x15c/0x195
[ +0.000651] check_prev_add.constprop.50+0x5fd/0xbe0
[ +0.000563] ? call_rcu_zapped+0x80/0x80
[ +0.000449] __lock_acquire+0xcee/0xfd0
[ +0.000437] lock_acquire+0x9e/0x180
[ +0.000428] ? kernfs_remove_by_name_ns+0x40/0x80
[ +0.000531] __kernfs_remove+0x26a/0x310
[ +0.000451] ? kernfs_remove_by_name_ns+0x40/0x80
[ +0.000529] ? kernfs_name_hash+0x12/0x80
[ +0.000462] kernfs_remove_by_name_ns+0x40/0x80
[ +0.000513] remove_files.isra.1+0x30/0x70
[ +0.000483] sysfs_remove_group+0x3d/0x80
[ +0.000458] sysfs_remove_groups+0x29/0x40
[ +0.000477] device_remove_attrs+0x42/0x80
[ +0.000461] device_del+0x162/0x380
[ +0.000399] device_unregister+0x16/0x60
[ +0.000442] unregister_memory_section+0x6e/0xa0
[ +0.001232] __remove_pages+0xe9/0x520
[ +0.000443] arch_remove_memory+0x81/0xc0
[ +0.000459] try_remove_memory+0xba/0xd0
[ +0.000460] remove_memory+0x23/0x40
[ +0.000461] dev_dax_kmem_remove+0x29/0x57 [kmem]
[ +0.000603] device_release_driver_internal+0xe4/0x1d0
[ +0.000590] unbind_store+0x9b/0x130
[ +0.000409] kernfs_fop_write+0xf0/0x1a0
[ +0.000448] vfs_write+0xba/0x1c0
[ +0.000395] ksys_write+0x5a/0xe0
[ +0.000382] do_syscall_64+0x60/0x210
[ +0.000418] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ +0.000573] RIP: 0033:0x7fd1f7442fa8
[ +0.000407] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 75 77 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ +0.002119] RSP: 002b:00007ffd48f58e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ +0.000833] RAX: ffffffffffffffda RBX: 000000000210c817 RCX: 00007fd1f7442fa8
[ +0.000795] RDX: 0000000000000007 RSI: 000000000210c817 RDI: 0000000000000003
[ +0.000816] RBP: 0000000000000007 R08: 000000000210c7d0 R09: 00007fd1f74d4e80
[ +0.000808] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ +0.000819] R13: 00007fd1f72b9ce8 R14: 0000000000000000 R15: 00007ffd48f58e70
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-02 20:50 ` [v5 0/3] "Hotremove" persistent memory Verma, Vishal L
@ 2019-05-02 21:44 ` Pavel Tatashin
2019-05-02 22:29 ` Verma, Vishal L
2019-05-15 18:11 ` Pavel Tatashin
1 sibling, 1 reply; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 21:44 UTC (permalink / raw)
To: Verma, Vishal L
Cc: linux-kernel, jmorris, tiwai, sashal, linux-mm, dave.hansen,
david, bp, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
Huang, Ying, Wu, Fengguang, baiyaowei
On Thu, May 2, 2019 at 4:50 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
>
> On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> > The series of operations look like this:
> >
> > 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
> > and free ramdisk.
> > 2. Convert raw pmem0 to devdax
> > ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> > 3. Hotadd to System RAM
> > echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> > echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> > echo online_movable > /sys/devices/system/memoryXXX/state
> > 4. Before reboot hotremove device-dax memory from System RAM
> > echo offline > /sys/devices/system/memoryXXX/state
> > echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>
> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.
Hi Verma,
Thank you for testing. I wonder if there is a command sequence that I
could run to reproduce it?
Also, could you please send your config and qemu arguments.
Thank you,
Pasha
>
>
> [ +0.004886] ======================================================
> [ +0.001576] WARNING: possible circular locking dependency detected
> [ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
> [ +0.000929] ------------------------------------------------------
> [ +0.000708] daxctl/22950 is trying to acquire lock:
> [ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000922]
> but task is already holding lock:
> [ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
> [ +0.000960]
> which lock already depends on the new lock.
>
> [ +0.001001]
> the existing dependency chain (in reverse order) is:
> [ +0.000837]
> -> #3 (mem_sysfs_mutex){+.+.}:
> [ +0.000631] __mutex_lock+0x82/0x9a0
> [ +0.000477] unregister_memory_section+0x22/0xa0
> [ +0.000582] __remove_pages+0xe9/0x520
> [ +0.000489] arch_remove_memory+0x81/0xc0
> [ +0.000510] devm_memremap_pages_release+0x180/0x270
> [ +0.000633] release_nodes+0x234/0x280
> [ +0.000483] device_release_driver_internal+0xf4/0x1d0
> [ +0.000701] bus_remove_device+0xfc/0x170
> [ +0.000529] device_del+0x16a/0x380
> [ +0.000459] unregister_dev_dax+0x23/0x50
> [ +0.000526] release_nodes+0x234/0x280
> [ +0.000487] device_release_driver_internal+0xf4/0x1d0
> [ +0.000646] unbind_store+0x9b/0x130
> [ +0.000467] kernfs_fop_write+0xf0/0x1a0
> [ +0.000510] vfs_write+0xba/0x1c0
> [ +0.000438] ksys_write+0x5a/0xe0
> [ +0.000521] do_syscall_64+0x60/0x210
> [ +0.000489] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ +0.000637]
> -> #2 (mem_hotplug_lock.rw_sem){++++}:
> [ +0.000717] get_online_mems+0x3e/0x80
> [ +0.000491] kmem_cache_create_usercopy+0x2e/0x270
> [ +0.000609] kmem_cache_create+0x12/0x20
> [ +0.000507] ptlock_cache_init+0x20/0x28
> [ +0.000506] start_kernel+0x240/0x4d0
> [ +0.000480] secondary_startup_64+0xa4/0xb0
> [ +0.000539]
> -> #1 (cpu_hotplug_lock.rw_sem){++++}:
> [ +0.000784] cpus_read_lock+0x3e/0x80
> [ +0.000511] online_pages+0x37/0x310
> [ +0.000469] memory_subsys_online+0x34/0x60
> [ +0.000611] device_online+0x60/0x80
> [ +0.000611] state_store+0x66/0xd0
> [ +0.000552] kernfs_fop_write+0xf0/0x1a0
> [ +0.000649] vfs_write+0xba/0x1c0
> [ +0.000487] ksys_write+0x5a/0xe0
> [ +0.000459] do_syscall_64+0x60/0x210
> [ +0.000482] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ +0.000646]
> -> #0 (kn->count#424){++++}:
> [ +0.000669] lock_acquire+0x9e/0x180
> [ +0.000471] __kernfs_remove+0x26a/0x310
> [ +0.000518] kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000583] remove_files.isra.1+0x30/0x70
> [ +0.000555] sysfs_remove_group+0x3d/0x80
> [ +0.000524] sysfs_remove_groups+0x29/0x40
> [ +0.000532] device_remove_attrs+0x42/0x80
> [ +0.000522] device_del+0x162/0x380
> [ +0.000464] device_unregister+0x16/0x60
> [ +0.000505] unregister_memory_section+0x6e/0xa0
> [ +0.000591] __remove_pages+0xe9/0x520
> [ +0.000492] arch_remove_memory+0x81/0xc0
> [ +0.000568] try_remove_memory+0xba/0xd0
> [ +0.000510] remove_memory+0x23/0x40
> [ +0.000483] dev_dax_kmem_remove+0x29/0x57 [kmem]
> [ +0.000608] device_release_driver_internal+0xe4/0x1d0
> [ +0.000637] unbind_store+0x9b/0x130
> [ +0.000464] kernfs_fop_write+0xf0/0x1a0
> [ +0.000685] vfs_write+0xba/0x1c0
> [ +0.000594] ksys_write+0x5a/0xe0
> [ +0.000449] do_syscall_64+0x60/0x210
> [ +0.000481] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ +0.000619]
> other info that might help us debug this:
>
> [ +0.000889] Chain exists of:
> kn->count#424 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
>
> [ +0.001269] Possible unsafe locking scenario:
>
> [ +0.000652] CPU0 CPU1
> [ +0.000505] ---- ----
> [ +0.000523] lock(mem_sysfs_mutex);
> [ +0.000422] lock(mem_hotplug_lock.rw_sem);
> [ +0.000905] lock(mem_sysfs_mutex);
> [ +0.000793] lock(kn->count#424);
> [ +0.000394]
> *** DEADLOCK ***
>
> [ +0.000665] 7 locks held by daxctl/22950:
> [ +0.000458] #0: 000000005f6d3c13 (sb_writers#4){.+.+}, at: vfs_write+0x159/0x1c0
> [ +0.000943] #1: 00000000e468825d (&of->mutex){+.+.}, at: kernfs_fop_write+0xbd/0x1a0
> [ +0.000895] #2: 00000000caa17dbb (&dev->mutex){....}, at: device_release_driver_internal+0x1a/0x1d0
> [ +0.001019] #3: 000000002119b22c (device_hotplug_lock){+.+.}, at: remove_memory+0x16/0x40
> [ +0.000942] #4: 00000000150c8efe (cpu_hotplug_lock.rw_sem){++++}, at: try_remove_memory+0x2e/0xd0
> [ +0.001019] #5: 000000003d6b2a0f (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x25/0x120
> [ +0.001118] #6: 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
> [ +0.001033]
> stack backtrace:
> [ +0.000507] CPU: 5 PID: 22950 Comm: daxctl Tainted: G O 5.1.0-rc7+ #13
> [ +0.000896] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
> [ +0.001360] Call Trace:
> [ +0.000293] dump_stack+0x85/0xc0
> [ +0.000390] print_circular_bug.isra.41.cold.60+0x15c/0x195
> [ +0.000651] check_prev_add.constprop.50+0x5fd/0xbe0
> [ +0.000563] ? call_rcu_zapped+0x80/0x80
> [ +0.000449] __lock_acquire+0xcee/0xfd0
> [ +0.000437] lock_acquire+0x9e/0x180
> [ +0.000428] ? kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000531] __kernfs_remove+0x26a/0x310
> [ +0.000451] ? kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000529] ? kernfs_name_hash+0x12/0x80
> [ +0.000462] kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000513] remove_files.isra.1+0x30/0x70
> [ +0.000483] sysfs_remove_group+0x3d/0x80
> [ +0.000458] sysfs_remove_groups+0x29/0x40
> [ +0.000477] device_remove_attrs+0x42/0x80
> [ +0.000461] device_del+0x162/0x380
> [ +0.000399] device_unregister+0x16/0x60
> [ +0.000442] unregister_memory_section+0x6e/0xa0
> [ +0.001232] __remove_pages+0xe9/0x520
> [ +0.000443] arch_remove_memory+0x81/0xc0
> [ +0.000459] try_remove_memory+0xba/0xd0
> [ +0.000460] remove_memory+0x23/0x40
> [ +0.000461] dev_dax_kmem_remove+0x29/0x57 [kmem]
> [ +0.000603] device_release_driver_internal+0xe4/0x1d0
> [ +0.000590] unbind_store+0x9b/0x130
> [ +0.000409] kernfs_fop_write+0xf0/0x1a0
> [ +0.000448] vfs_write+0xba/0x1c0
> [ +0.000395] ksys_write+0x5a/0xe0
> [ +0.000382] do_syscall_64+0x60/0x210
> [ +0.000418] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ +0.000573] RIP: 0033:0x7fd1f7442fa8
> [ +0.000407] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 75 77 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
> [ +0.002119] RSP: 002b:00007ffd48f58e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ +0.000833] RAX: ffffffffffffffda RBX: 000000000210c817 RCX: 00007fd1f7442fa8
> [ +0.000795] RDX: 0000000000000007 RSI: 000000000210c817 RDI: 0000000000000003
> [ +0.000816] RBP: 0000000000000007 R08: 000000000210c7d0 R09: 00007fd1f74d4e80
> [ +0.000808] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> [ +0.000819] R13: 00007fd1f72b9ce8 R14: 0000000000000000 R15: 00007ffd48f58e70
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-02 21:44 ` Pavel Tatashin
@ 2019-05-02 22:29 ` Verma, Vishal L
2019-05-02 22:36 ` Pavel Tatashin
0 siblings, 1 reply; 21+ messages in thread
From: Verma, Vishal L @ 2019-05-02 22:29 UTC (permalink / raw)
To: pasha.tatashin
Cc: linux-kernel, jmorris, sashal, bp, linux-mm, david, dave.hansen,
tiwai, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
Huang, Ying, Wu, Fengguang, baiyaowei
On Thu, 2019-05-02 at 17:44 -0400, Pavel Tatashin wrote:
> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> >
> > This is with just these three patches on top of -rc7.
>
> Hi Verma,
>
> Thank you for testing. I wonder if there is a command sequence that I
> could run to reproduce it?
> Also, could you please send your config and qemu arguments.
>
Yes, here is the qemu config:
qemu-system-x86_64
-machine accel=kvm
-machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
-cpu Haswell-noTSX
-m 12G,slots=3,maxmem=44G
-realtime mlock=off
-smp 8,sockets=2,cores=4,threads=1
-numa node,nodeid=0,cpus=0-3,mem=6G
-numa node,nodeid=1,cpus=4-7,mem=6G
-numa node,nodeid=2
-numa node,nodeid=3
-drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
-object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
-device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
-object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
-device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
-serial stdio
-display none
For the command list - I'm using WIP patches to ndctl/daxctl to add the
command I mentioned earlier. Using this command, I can reproduce the
lockdep issue. I thought I should be able to reproduce the issue by
onlining/offlining through sysfs directly too - something like:
node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
for mem in /sys/devices/system/node/node"$node"/memory*; do
echo "offline" > $mem/state
done
But with that I can't reproduce the problem.
I'll try to dig a bit deeper into what might be happening, the daxctl
modifications simply amount to doing the same thing as above in C, so
I'm not immediately sure what might be happening.
If you're interested, I can post the ndctl patches - maybe as an RFC -
to test with.
Thanks,
-Vishal
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-02 22:29 ` Verma, Vishal L
@ 2019-05-02 22:36 ` Pavel Tatashin
2019-05-03 21:48 ` Verma, Vishal L
0 siblings, 1 reply; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-02 22:36 UTC (permalink / raw)
To: Verma, Vishal L
Cc: linux-kernel, jmorris, sashal, bp, linux-mm, david, dave.hansen,
tiwai, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
Huang, Ying, Wu, Fengguang, baiyaowei
On Thu, May 2, 2019 at 6:29 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
>
> On Thu, 2019-05-02 at 17:44 -0400, Pavel Tatashin wrote:
>
> > > In running with these patches, and testing the offlining part, I ran
> > > into the following lockdep below.
> > >
> > > This is with just these three patches on top of -rc7.
> >
> > Hi Verma,
> >
> > Thank you for testing. I wonder if there is a command sequence that I
> > could run to reproduce it?
> > Also, could you please send your config and qemu arguments.
> >
> Yes, here is the qemu config:
>
> qemu-system-x86_64
> -machine accel=kvm
> -machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
> -cpu Haswell-noTSX
> -m 12G,slots=3,maxmem=44G
> -realtime mlock=off
> -smp 8,sockets=2,cores=4,threads=1
> -numa node,nodeid=0,cpus=0-3,mem=6G
> -numa node,nodeid=1,cpus=4-7,mem=6G
> -numa node,nodeid=2
> -numa node,nodeid=3
> -drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
> -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
> -object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
> -device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
> -object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
> -device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
> -serial stdio
> -display none
>
> For the command list - I'm using WIP patches to ndctl/daxctl to add the
> command I mentioned earlier. Using this command, I can reproduce the
> lockdep issue. I thought I should be able to reproduce the issue by
> onlining/offlining through sysfs directly too - something like:
>
> node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
> for mem in /sys/devices/system/node/node"$node"/memory*; do
> echo "offline" > $mem/state
> done
>
> But with that I can't reproduce the problem.
>
> I'll try to dig a bit deeper into what might be happening, the daxctl
> modifications simply amount to doing the same thing as above in C, so
> I'm not immediately sure what might be happening.
>
> If you're interested, I can post the ndctl patches - maybe as an RFC -
> to test with.
I could apply the patches and test with them. Also, could you please
send your kernel config.
Thank you,
Pasha
>
> Thanks,
> -Vishal
>
>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-02 22:36 ` Pavel Tatashin
@ 2019-05-03 21:48 ` Verma, Vishal L
0 siblings, 0 replies; 21+ messages in thread
From: Verma, Vishal L @ 2019-05-03 21:48 UTC (permalink / raw)
To: pasha.tatashin
Cc: linux-kernel, jmorris, sashal, bp, linux-mm, david, dave.hansen,
tiwai, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
Huang, Ying, Wu, Fengguang, baiyaowei
[-- Attachment #1.1: Type: text/plain, Size: 2644 bytes --]
On Thu, 2019-05-02 at 18:36 -0400, Pavel Tatashin wrote:
> > Yes, here is the qemu config:
> >
> > qemu-system-x86_64
> > -machine accel=kvm
> > -machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
> > -cpu Haswell-noTSX
> > -m 12G,slots=3,maxmem=44G
> > -realtime mlock=off
> > -smp 8,sockets=2,cores=4,threads=1
> > -numa node,nodeid=0,cpus=0-3,mem=6G
> > -numa node,nodeid=1,cpus=4-7,mem=6G
> > -numa node,nodeid=2
> > -numa node,nodeid=3
> > -drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
> > -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
> > -object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
> > -device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
> > -object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
> > -device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
> > -serial stdio
> > -display none
> >
> > For the command list - I'm using WIP patches to ndctl/daxctl to add the
> > command I mentioned earlier. Using this command, I can reproduce the
> > lockdep issue. I thought I should be able to reproduce the issue by
> > onlining/offlining through sysfs directly too - something like:
> >
> > node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
> > for mem in /sys/devices/system/node/node"$node"/memory*; do
> > echo "offline" > $mem/state
> > done
> >
> > But with that I can't reproduce the problem.
> >
> > I'll try to dig a bit deeper into what might be happening, the daxctl
> > modifications simply amount to doing the same thing as above in C, so
> > I'm not immediately sure what might be happening.
> >
> > If you're interested, I can post the ndctl patches - maybe as an RFC -
> > to test with.
>
> I could apply the patches and test with them. Also, could you please
> send your kernel config.
>
Hi Pavel,
I've CC'd you on the patches mentioned above, and also pushed them to a
'kmem-pending' branch on github:
https://github.com/pmem/ndctl/tree/kmem-pending
After building ndctl from the above, you will want to run:
# daxctl reconfigure-device --mode=system-ram dax0.0
(this will also have onlined the memory sections)
# daxctl reconfigure-device --mode=devdax --attempt-offline dax0.0
(this triggers the lockdep warnings)
I've attached the kernel config here too (gzipped).
Thanks,
-Vishal
[-- Attachment #1.2: config.gz --]
[-- Type: application/gzip, Size: 30367 bytes --]
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 3270 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-02 20:50 ` [v5 0/3] "Hotremove" persistent memory Verma, Vishal L
2019-05-02 21:44 ` Pavel Tatashin
@ 2019-05-15 18:11 ` Pavel Tatashin
2019-05-16 0:42 ` Dan Williams
1 sibling, 1 reply; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-15 18:11 UTC (permalink / raw)
To: Verma, Vishal L
Cc: linux-kernel, jmorris, tiwai, sashal, linux-mm, dave.hansen,
david, bp, Williams, Dan J, akpm, linux-nvdimm, jglisse, zwisler,
mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
Huang, Ying, Wu, Fengguang, baiyaowei
> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.
>
>
> [ +0.004886] ======================================================
> [ +0.001576] WARNING: possible circular locking dependency detected
> [ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
> [ +0.000929] ------------------------------------------------------
> [ +0.000708] daxctl/22950 is trying to acquire lock:
> [ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> [ +0.000922]
> but task is already holding lock:
> [ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
I have studied this issue, and now have a clear understanding why it
happens, I am not yet sure how to fix it, so suggestions are welcomed
:)
Here is the problem:
When we offline pages we have the following call stack:
# echo offline > /sys/devices/system/memory/memory8/state
ksys_write
vfs_write
__vfs_write
kernfs_fop_write
kernfs_get_active
lock_acquire kn->count#122 (lock for
"memory8/state" kn)
sysfs_kf_write
dev_attr_store
state_store
device_offline
memory_subsys_offline
memory_block_action
offline_pages
__offline_pages
percpu_down_write
down_write
lock_acquire mem_hotplug_lock.rw_sem
When we unbind dax0.0 we have the following stack:
# echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
drv_attr_store
unbind_store
device_driver_detach
device_release_driver_internal
dev_dax_kmem_remove
remove_memory device_hotplug_lock
try_remove_memory mem_hotplug_lock.rw_sem
arch_remove_memory
__remove_pages
__remove_section
unregister_memory_section
remove_memory_section mem_sysfs_mutex
unregister_memory
device_unregister
device_del
device_remove_attrs
sysfs_remove_groups
sysfs_remove_group
remove_files
kernfs_remove_by_name
kernfs_remove_by_name_ns
__kernfs_remove kn->count#122
So, lockdep found the ordering issue with the above two stacks:
1. kn->count#122 -> mem_hotplug_lock.rw_sem
2. mem_hotplug_lock.rw_sem -> kn->count#122
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-15 18:11 ` Pavel Tatashin
@ 2019-05-16 0:42 ` Dan Williams
2019-05-16 7:10 ` David Hildenbrand
2019-05-17 14:09 ` Pavel Tatashin
0 siblings, 2 replies; 21+ messages in thread
From: Dan Williams @ 2019-05-16 0:42 UTC (permalink / raw)
To: Pavel Tatashin
Cc: Verma, Vishal L, linux-kernel, jmorris, tiwai, sashal, linux-mm,
dave.hansen, david, bp, akpm, linux-nvdimm, jglisse, zwisler,
mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
Huang, Ying, Wu, Fengguang, baiyaowei
On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> > Hi Pavel,
> >
> > I am working on adding this sort of a workflow into a new daxctl command
> > (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> > dax device to kmem, online the resulting memory, and with your patches,
> > also attempt to offline the memory, and change back to device-dax.
> >
> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> >
> > This is with just these three patches on top of -rc7.
> >
> >
> > [ +0.004886] ======================================================
> > [ +0.001576] WARNING: possible circular locking dependency detected
> > [ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
> > [ +0.000929] ------------------------------------------------------
> > [ +0.000708] daxctl/22950 is trying to acquire lock:
> > [ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> > [ +0.000922]
> > but task is already holding lock:
> > [ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>
> I have studied this issue, and now have a clear understanding why it
> happens, I am not yet sure how to fix it, so suggestions are welcomed
> :)
I would think that ACPI hotplug would have a similar problem, but it does this:
acpi_unbind_memory_blocks(info);
__remove_memory(nid, info->start_addr, info->length);
I wonder if that ordering prevents going too deep into the
device_unregister() call stack that you highlighted below.
>
> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
> vfs_write
> __vfs_write
> kernfs_fop_write
> kernfs_get_active
> lock_acquire kn->count#122 (lock for
> "memory8/state" kn)
> sysfs_kf_write
> dev_attr_store
> state_store
> device_offline
> memory_subsys_offline
> memory_block_action
> offline_pages
> __offline_pages
> percpu_down_write
> down_write
> lock_acquire mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following stack:
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
> unbind_store
> device_driver_detach
> device_release_driver_internal
> dev_dax_kmem_remove
> remove_memory device_hotplug_lock
> try_remove_memory mem_hotplug_lock.rw_sem
> arch_remove_memory
> __remove_pages
> __remove_section
> unregister_memory_section
> remove_memory_section mem_sysfs_mutex
> unregister_memory
> device_unregister
> device_del
> device_remove_attrs
> sysfs_remove_groups
> sysfs_remove_group
> remove_files
> kernfs_remove_by_name
> kernfs_remove_by_name_ns
> __kernfs_remove kn->count#122
>
> So, lockdep found the ordering issue with the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-16 0:42 ` Dan Williams
@ 2019-05-16 7:10 ` David Hildenbrand
2019-05-17 14:09 ` Pavel Tatashin
1 sibling, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2019-05-16 7:10 UTC (permalink / raw)
To: Dan Williams, Pavel Tatashin
Cc: Verma, Vishal L, linux-kernel, jmorris, tiwai, sashal, linux-mm,
dave.hansen, bp, akpm, linux-nvdimm, jglisse, zwisler, mhocko,
Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky, Huang,
Ying, Wu, Fengguang, baiyaowei
On 16.05.19 02:42, Dan Williams wrote:
> On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
>>> Hi Pavel,
>>>
>>> I am working on adding this sort of a workflow into a new daxctl command
>>> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
>>> dax device to kmem, online the resulting memory, and with your patches,
>>> also attempt to offline the memory, and change back to device-dax.
>>>
>>> In running with these patches, and testing the offlining part, I ran
>>> into the following lockdep below.
>>>
>>> This is with just these three patches on top of -rc7.
>>>
>>>
>>> [ +0.004886] ======================================================
>>> [ +0.001576] WARNING: possible circular locking dependency detected
>>> [ +0.001506] 5.1.0-rc7+ #13 Tainted: G O
>>> [ +0.000929] ------------------------------------------------------
>>> [ +0.000708] daxctl/22950 is trying to acquire lock:
>>> [ +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
>>> [ +0.000922]
>>> but task is already holding lock:
>>> [ +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>>
>> I have studied this issue, and now have a clear understanding why it
>> happens, I am not yet sure how to fix it, so suggestions are welcomed
>> :)
>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
> acpi_unbind_memory_blocks(info);
> __remove_memory(nid, info->start_addr, info->length);
>
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
>
If that doesn't help, after we have
[PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling
we could probably pull the memory device removal phase out from the
mem_hotplug_lock protection and let it be protected by the
device_hotplug_lock only. Might require some more work, though.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-16 0:42 ` Dan Williams
2019-05-16 7:10 ` David Hildenbrand
@ 2019-05-17 14:09 ` Pavel Tatashin
2019-05-20 7:57 ` David Hildenbrand
1 sibling, 1 reply; 21+ messages in thread
From: Pavel Tatashin @ 2019-05-17 14:09 UTC (permalink / raw)
To: Dan Williams
Cc: Verma, Vishal L, linux-kernel, jmorris, tiwai, sashal, linux-mm,
dave.hansen, david, bp, akpm, linux-nvdimm, jglisse, zwisler,
mhocko, Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky,
Huang, Ying, Wu, Fengguang, baiyaowei
[-- Attachment #1: Type: text/plain, Size: 11948 bytes --]
>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
> acpi_unbind_memory_blocks(info);
> __remove_memory(nid, info->start_addr, info->length);
ACPI does have exactly the same problem, so this is not a bug for this
series, I will submit a new version of my series with comments
addressed, but without fix for this issue.
I was able to reproduce this issue on the current mainline kernel.
Also, I been thinking more about how to fix it, and there is no easy
fix without a major hotplug redesign. Basically, we have to remove
sysfs memory entries before or after memory is hotplugged/hotremoved.
But, we also have to guarantee that hotplug/hotremove will succeed or
reinstate sysfs entries.
Qemu script:
qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-parallel none \
-echr 1 \
-serial none \
-chardev stdio,id=console,signal=off,mux=on \
-serial chardev:console \
-mon chardev=console \
-vga none \
-display none \
-kernel pmem/native/arch/x86/boot/bzImage \
-m 8G,slots=1,maxmem=16G \
-smp 8 \
-fsdev local,id=virtfs1,path=/,security_model=none \
-device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
-append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
TERM=xterm ip=dhcp loglevel=7'
Config is attached.
Steps to reproduce:
#
# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)
# echo online_movable > /sys/devices/system/memory/memory79/state
[ 23.029552] Built 1 zonelists, mobility grouping on. Total pages: 2045370
[ 23.032591] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [ 32.013950] Offlined Pages 32768
[ 32.014307] Built 1 zonelists, mobility grouping on. Total pages: 2031022
[ 32.014843] Policy zone: Normal
[ 32.015733]
[ 32.015881] ======================================================
[ 32.016390] WARNING: possible circular locking dependency detected
[ 32.016881] 5.1.0_pt_pmem #38 Not tainted
[ 32.017202] ------------------------------------------------------
[ 32.017680] kworker/u16:4/380 is trying to acquire lock:
[ 32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
kernfs_remove_by_name_ns+0x3b/0x80
[ 32.018745]
[ 32.018745] but task is already holding lock:
[ 32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[ 32.019859]
[ 32.019859] which lock already depends on the new lock.
[ 32.019859]
[ 32.020499]
[ 32.020499] the existing dependency chain (in reverse order) is:
[ 32.021080]
[ 32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[ 32.021522] __mutex_lock+0x8b/0x900
[ 32.021843] hotplug_memory_register+0x26/0xa0
[ 32.022231] __add_pages+0xe7/0x160
[ 32.022545] add_pages+0xd/0x60
[ 32.022835] add_memory_resource+0xc3/0x1d0
[ 32.023207] __add_memory+0x57/0x80
[ 32.023530] acpi_memory_device_add+0x13a/0x2d0
[ 32.023928] acpi_bus_attach+0xf1/0x200
[ 32.024272] acpi_bus_scan+0x3e/0x90
[ 32.024597] acpi_device_hotplug+0x284/0x3e0
[ 32.024972] acpi_hotplug_work_fn+0x15/0x20
[ 32.025342] process_one_work+0x2a0/0x650
[ 32.025755] worker_thread+0x34/0x3d0
[ 32.026077] kthread+0x118/0x130
[ 32.026442] ret_from_fork+0x3a/0x50
[ 32.026766]
[ 32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
[ 32.027261] get_online_mems+0x39/0x80
[ 32.027600] kmem_cache_create_usercopy+0x29/0x2c0
[ 32.028019] kmem_cache_create+0xd/0x10
[ 32.028367] ptlock_cache_init+0x1b/0x23
[ 32.028724] start_kernel+0x1d2/0x4b8
[ 32.029060] secondary_startup_64+0xa4/0xb0
[ 32.029447]
[ 32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
[ 32.030007] cpus_read_lock+0x39/0x80
[ 32.030360] __offline_pages+0x32/0x790
[ 32.030709] memory_subsys_offline+0x3a/0x60
[ 32.031089] device_offline+0x7e/0xb0
[ 32.031425] acpi_bus_offline+0xd8/0x140
[ 32.031821] acpi_device_hotplug+0x1b2/0x3e0
[ 32.032202] acpi_hotplug_work_fn+0x15/0x20
[ 32.032576] process_one_work+0x2a0/0x650
[ 32.032942] worker_thread+0x34/0x3d0
[ 32.033283] kthread+0x118/0x130
[ 32.033588] ret_from_fork+0x3a/0x50
[ 32.033919]
[ 32.033919] -> #1 (&device->physical_node_lock){+.+.}:
[ 32.034450] __mutex_lock+0x8b/0x900
[ 32.034784] acpi_get_first_physical_node+0x16/0x60
[ 32.035217] acpi_companion_match+0x3b/0x60
[ 32.035594] acpi_device_uevent_modalias+0x9/0x20
[ 32.036012] platform_uevent+0xd/0x40
[ 32.036352] dev_uevent+0x85/0x1c0
[ 32.036674] kobject_uevent_env+0x1e2/0x640
[ 32.037044] kobject_synth_uevent+0x2b7/0x324
[ 32.037428] uevent_store+0x17/0x30
[ 32.037752] kernfs_fop_write+0xeb/0x1a0
[ 32.038112] vfs_write+0xb2/0x1b0
[ 32.038417] ksys_write+0x57/0xd0
[ 32.038721] do_syscall_64+0x4b/0x1a0
[ 32.039053] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 32.039491]
[ 32.039491] -> #0 (kn->count#18){++++}:
[ 32.039913] lock_acquire+0xaa/0x180
[ 32.040242] __kernfs_remove+0x244/0x2d0
[ 32.040593] kernfs_remove_by_name_ns+0x3b/0x80
[ 32.040991] device_del+0x14a/0x370
[ 32.041309] device_unregister+0x9/0x20
[ 32.041653] unregister_memory_section+0x69/0xa0
[ 32.042059] __remove_pages+0x112/0x460
[ 32.042402] arch_remove_memory+0x6f/0xa0
[ 32.042758] __remove_memory+0xab/0x130
[ 32.043103] acpi_memory_device_remove+0x67/0xe0
[ 32.043537] acpi_bus_trim+0x50/0x90
[ 32.043889] acpi_device_hotplug+0x2fa/0x3e0
[ 32.044300] acpi_hotplug_work_fn+0x15/0x20
[ 32.044686] process_one_work+0x2a0/0x650
[ 32.045044] worker_thread+0x34/0x3d0
[ 32.045381] kthread+0x118/0x130
[ 32.045679] ret_from_fork+0x3a/0x50
[ 32.046005]
[ 32.046005] other info that might help us debug this:
[ 32.046005]
[ 32.046636] Chain exists of:
[ 32.046636] kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
[ 32.046636]
[ 32.047514] Possible unsafe locking scenario:
[ 32.047514]
[ 32.047976] CPU0 CPU1
[ 32.048337] ---- ----
[ 32.048697] lock(mem_sysfs_mutex);
[ 32.048983] lock(mem_hotplug_lock.rw_sem);
[ 32.049519] lock(mem_sysfs_mutex);
[ 32.050004] lock(kn->count#18);
[ 32.050270]
[ 32.050270] *** DEADLOCK ***
[ 32.050270]
[ 32.050736] 7 locks held by kworker/u16:4/380:
[ 32.051087] #0: 00000000a22fe78e
((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
[ 32.051830] #1: 00000000944f2dca
((work_completion)(&hpw->work)){+.+.}, at:
process_one_work+0x21e/0x650
[ 32.052577] #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
acpi_device_hotplug+0x2e/0x3e0
[ 32.053271] #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
acpi_device_hotplug+0x3c/0x3e0
[ 32.053916] #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
at: __remove_memory+0x3b/0x130
[ 32.054602] #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
at: percpu_down_write+0x1d/0x110
[ 32.055315] #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[ 32.056016]
[ 32.056016] stack backtrace:
[ 32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
[ 32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[ 32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 32.058144] Call Trace:
[ 32.058344] dump_stack+0x67/0x90
[ 32.058604] print_circular_bug.cold.60+0x15c/0x195
[ 32.058989] __lock_acquire+0x17de/0x1d30
[ 32.059308] ? find_held_lock+0x2d/0x90
[ 32.059611] ? __kernfs_remove+0x199/0x2d0
[ 32.059937] lock_acquire+0xaa/0x180
[ 32.060223] ? kernfs_remove_by_name_ns+0x3b/0x80
[ 32.060596] __kernfs_remove+0x244/0x2d0
[ 32.060908] ? kernfs_remove_by_name_ns+0x3b/0x80
[ 32.061283] ? kernfs_name_hash+0xd/0x80
[ 32.061596] ? kernfs_find_ns+0x68/0xf0
[ 32.061907] kernfs_remove_by_name_ns+0x3b/0x80
[ 32.062266] device_del+0x14a/0x370
[ 32.062548] ? unregister_mem_sect_under_nodes+0x4f/0xc0
[ 32.062973] device_unregister+0x9/0x20
[ 32.063285] unregister_memory_section+0x69/0xa0
[ 32.063651] __remove_pages+0x112/0x460
[ 32.063949] arch_remove_memory+0x6f/0xa0
[ 32.064271] __remove_memory+0xab/0x130
[ 32.064579] ? walk_memory_range+0xa1/0xe0
[ 32.064907] acpi_memory_device_remove+0x67/0xe0
[ 32.065274] acpi_bus_trim+0x50/0x90
[ 32.065560] acpi_device_hotplug+0x2fa/0x3e0
[ 32.065900] acpi_hotplug_work_fn+0x15/0x20
[ 32.066249] process_one_work+0x2a0/0x650
[ 32.066591] worker_thread+0x34/0x3d0
[ 32.066925] ? process_one_work+0x650/0x650
[ 32.067275] kthread+0x118/0x130
[ 32.067542] ? kthread_create_on_node+0x60/0x60
[ 32.067909] ret_from_fork+0x3a/0x50
>
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
>
>
> >
> > Here is the problem:
> >
> > When we offline pages we have the following call stack:
> >
> > # echo offline > /sys/devices/system/memory/memory8/state
> > ksys_write
> > vfs_write
> > __vfs_write
> > kernfs_fop_write
> > kernfs_get_active
> > lock_acquire kn->count#122 (lock for
> > "memory8/state" kn)
> > sysfs_kf_write
> > dev_attr_store
> > state_store
> > device_offline
> > memory_subsys_offline
> > memory_block_action
> > offline_pages
> > __offline_pages
> > percpu_down_write
> > down_write
> > lock_acquire mem_hotplug_lock.rw_sem
> >
> > When we unbind dax0.0 we have the following stack:
> > # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> > drv_attr_store
> > unbind_store
> > device_driver_detach
> > device_release_driver_internal
> > dev_dax_kmem_remove
> > remove_memory device_hotplug_lock
> > try_remove_memory mem_hotplug_lock.rw_sem
> > arch_remove_memory
> > __remove_pages
> > __remove_section
> > unregister_memory_section
> > remove_memory_section mem_sysfs_mutex
> > unregister_memory
> > device_unregister
> > device_del
> > device_remove_attrs
> > sysfs_remove_groups
> > sysfs_remove_group
> > remove_files
> > kernfs_remove_by_name
> > kernfs_remove_by_name_ns
> > __kernfs_remove kn->count#122
> >
> > So, lockdep found the ordering issue with the above two stacks:
> >
> > 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> > 2. mem_hotplug_lock.rw_sem -> kn->count#122
[-- Attachment #2: x86.config.bz2 --]
[-- Type: application/x-bzip, Size: 24701 bytes --]
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [v5 0/3] "Hotremove" persistent memory
2019-05-17 14:09 ` Pavel Tatashin
@ 2019-05-20 7:57 ` David Hildenbrand
0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2019-05-20 7:57 UTC (permalink / raw)
To: Pavel Tatashin, Dan Williams
Cc: Verma, Vishal L, linux-kernel, jmorris, tiwai, sashal, linux-mm,
dave.hansen, bp, akpm, linux-nvdimm, jglisse, zwisler, mhocko,
Jiang, Dave, bhelgaas, Busch, Keith, thomas.lendacky, Huang,
Ying, Wu, Fengguang, baiyaowei
On 17.05.19 16:09, Pavel Tatashin wrote:
>>
>> I would think that ACPI hotplug would have a similar problem, but it does this:
>>
>> acpi_unbind_memory_blocks(info);
>> __remove_memory(nid, info->start_addr, info->length);
>
> ACPI does have exactly the same problem, so this is not a bug for this
> series, I will submit a new version of my series with comments
> addressed, but without fix for this issue.
>
> I was able to reproduce this issue on the current mainline kernel.
> Also, I been thinking more about how to fix it, and there is no easy
> fix without a major hotplug redesign. Basically, we have to remove
> sysfs memory entries before or after memory is hotplugged/hotremoved.
> But, we also have to guarantee that hotplug/hotremove will succeed or
> reinstate sysfs entries.
>
> Qemu script:
>
> qemu-system-x86_64 \
> -enable-kvm \
> -cpu host \
> -parallel none \
> -echr 1 \
> -serial none \
> -chardev stdio,id=console,signal=off,mux=on \
> -serial chardev:console \
> -mon chardev=console \
> -vga none \
> -display none \
> -kernel pmem/native/arch/x86/boot/bzImage \
> -m 8G,slots=1,maxmem=16G \
> -smp 8 \
> -fsdev local,id=virtfs1,path=/,security_model=none \
> -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
> -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
> TERM=xterm ip=dhcp loglevel=7'
>
> Config is attached.
>
> Steps to reproduce:
> #
> # QEMU 4.0.0 monitor - type 'help' for more information
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> (qemu)
>
> # echo online_movable > /sys/devices/system/memory/memory79/state
> [ 23.029552] Built 1 zonelists, mobility grouping on. Total pages: 2045370
> [ 23.032591] Policy zone: Normal
> # (qemu) device_del dimm1
> (qemu) [ 32.013950] Offlined Pages 32768
> [ 32.014307] Built 1 zonelists, mobility grouping on. Total pages: 2031022
> [ 32.014843] Policy zone: Normal
> [ 32.015733]
> [ 32.015881] ======================================================
> [ 32.016390] WARNING: possible circular locking dependency detected
> [ 32.016881] 5.1.0_pt_pmem #38 Not tainted
> [ 32.017202] ------------------------------------------------------
> [ 32.017680] kworker/u16:4/380 is trying to acquire lock:
> [ 32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
> kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.018745]
> [ 32.018745] but task is already holding lock:
> [ 32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x1d/0xa0
> [ 32.019859]
> [ 32.019859] which lock already depends on the new lock.
> [ 32.019859]
> [ 32.020499]
> [ 32.020499] the existing dependency chain (in reverse order) is:
> [ 32.021080]
> [ 32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
> [ 32.021522] __mutex_lock+0x8b/0x900
> [ 32.021843] hotplug_memory_register+0x26/0xa0
> [ 32.022231] __add_pages+0xe7/0x160
> [ 32.022545] add_pages+0xd/0x60
> [ 32.022835] add_memory_resource+0xc3/0x1d0
> [ 32.023207] __add_memory+0x57/0x80
> [ 32.023530] acpi_memory_device_add+0x13a/0x2d0
> [ 32.023928] acpi_bus_attach+0xf1/0x200
> [ 32.024272] acpi_bus_scan+0x3e/0x90
> [ 32.024597] acpi_device_hotplug+0x284/0x3e0
> [ 32.024972] acpi_hotplug_work_fn+0x15/0x20
> [ 32.025342] process_one_work+0x2a0/0x650
> [ 32.025755] worker_thread+0x34/0x3d0
> [ 32.026077] kthread+0x118/0x130
> [ 32.026442] ret_from_fork+0x3a/0x50
> [ 32.026766]
> [ 32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
> [ 32.027261] get_online_mems+0x39/0x80
> [ 32.027600] kmem_cache_create_usercopy+0x29/0x2c0
> [ 32.028019] kmem_cache_create+0xd/0x10
> [ 32.028367] ptlock_cache_init+0x1b/0x23
> [ 32.028724] start_kernel+0x1d2/0x4b8
> [ 32.029060] secondary_startup_64+0xa4/0xb0
> [ 32.029447]
> [ 32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
> [ 32.030007] cpus_read_lock+0x39/0x80
> [ 32.030360] __offline_pages+0x32/0x790
> [ 32.030709] memory_subsys_offline+0x3a/0x60
> [ 32.031089] device_offline+0x7e/0xb0
> [ 32.031425] acpi_bus_offline+0xd8/0x140
> [ 32.031821] acpi_device_hotplug+0x1b2/0x3e0
> [ 32.032202] acpi_hotplug_work_fn+0x15/0x20
> [ 32.032576] process_one_work+0x2a0/0x650
> [ 32.032942] worker_thread+0x34/0x3d0
> [ 32.033283] kthread+0x118/0x130
> [ 32.033588] ret_from_fork+0x3a/0x50
> [ 32.033919]
> [ 32.033919] -> #1 (&device->physical_node_lock){+.+.}:
> [ 32.034450] __mutex_lock+0x8b/0x900
> [ 32.034784] acpi_get_first_physical_node+0x16/0x60
> [ 32.035217] acpi_companion_match+0x3b/0x60
> [ 32.035594] acpi_device_uevent_modalias+0x9/0x20
> [ 32.036012] platform_uevent+0xd/0x40
> [ 32.036352] dev_uevent+0x85/0x1c0
> [ 32.036674] kobject_uevent_env+0x1e2/0x640
> [ 32.037044] kobject_synth_uevent+0x2b7/0x324
> [ 32.037428] uevent_store+0x17/0x30
> [ 32.037752] kernfs_fop_write+0xeb/0x1a0
> [ 32.038112] vfs_write+0xb2/0x1b0
> [ 32.038417] ksys_write+0x57/0xd0
> [ 32.038721] do_syscall_64+0x4b/0x1a0
> [ 32.039053] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 32.039491]
> [ 32.039491] -> #0 (kn->count#18){++++}:
> [ 32.039913] lock_acquire+0xaa/0x180
> [ 32.040242] __kernfs_remove+0x244/0x2d0
> [ 32.040593] kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.040991] device_del+0x14a/0x370
> [ 32.041309] device_unregister+0x9/0x20
> [ 32.041653] unregister_memory_section+0x69/0xa0
> [ 32.042059] __remove_pages+0x112/0x460
> [ 32.042402] arch_remove_memory+0x6f/0xa0
> [ 32.042758] __remove_memory+0xab/0x130
> [ 32.043103] acpi_memory_device_remove+0x67/0xe0
> [ 32.043537] acpi_bus_trim+0x50/0x90
> [ 32.043889] acpi_device_hotplug+0x2fa/0x3e0
> [ 32.044300] acpi_hotplug_work_fn+0x15/0x20
> [ 32.044686] process_one_work+0x2a0/0x650
> [ 32.045044] worker_thread+0x34/0x3d0
> [ 32.045381] kthread+0x118/0x130
> [ 32.045679] ret_from_fork+0x3a/0x50
> [ 32.046005]
> [ 32.046005] other info that might help us debug this:
> [ 32.046005]
> [ 32.046636] Chain exists of:
> [ 32.046636] kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
> [ 32.046636]
> [ 32.047514] Possible unsafe locking scenario:
> [ 32.047514]
> [ 32.047976] CPU0 CPU1
> [ 32.048337] ---- ----
> [ 32.048697] lock(mem_sysfs_mutex);
> [ 32.048983] lock(mem_hotplug_lock.rw_sem);
> [ 32.049519] lock(mem_sysfs_mutex);
> [ 32.050004] lock(kn->count#18);
> [ 32.050270]
> [ 32.050270] *** DEADLOCK ***
> [ 32.050270]
> [ 32.050736] 7 locks held by kworker/u16:4/380:
> [ 32.051087] #0: 00000000a22fe78e
> ((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
> [ 32.051830] #1: 00000000944f2dca
> ((work_completion)(&hpw->work)){+.+.}, at:
> process_one_work+0x21e/0x650
> [ 32.052577] #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
> acpi_device_hotplug+0x2e/0x3e0
> [ 32.053271] #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
> acpi_device_hotplug+0x3c/0x3e0
> [ 32.053916] #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
> at: __remove_memory+0x3b/0x130
> [ 32.054602] #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
> at: percpu_down_write+0x1d/0x110
> [ 32.055315] #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x1d/0xa0
> [ 32.056016]
> [ 32.056016] stack backtrace:
> [ 32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
> [ 32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.12.0-20181126_142135-anatol 04/01/2014
> [ 32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [ 32.058144] Call Trace:
> [ 32.058344] dump_stack+0x67/0x90
> [ 32.058604] print_circular_bug.cold.60+0x15c/0x195
> [ 32.058989] __lock_acquire+0x17de/0x1d30
> [ 32.059308] ? find_held_lock+0x2d/0x90
> [ 32.059611] ? __kernfs_remove+0x199/0x2d0
> [ 32.059937] lock_acquire+0xaa/0x180
> [ 32.060223] ? kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.060596] __kernfs_remove+0x244/0x2d0
> [ 32.060908] ? kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.061283] ? kernfs_name_hash+0xd/0x80
> [ 32.061596] ? kernfs_find_ns+0x68/0xf0
> [ 32.061907] kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.062266] device_del+0x14a/0x370
> [ 32.062548] ? unregister_mem_sect_under_nodes+0x4f/0xc0
> [ 32.062973] device_unregister+0x9/0x20
> [ 32.063285] unregister_memory_section+0x69/0xa0
> [ 32.063651] __remove_pages+0x112/0x460
> [ 32.063949] arch_remove_memory+0x6f/0xa0
> [ 32.064271] __remove_memory+0xab/0x130
> [ 32.064579] ? walk_memory_range+0xa1/0xe0
> [ 32.064907] acpi_memory_device_remove+0x67/0xe0
> [ 32.065274] acpi_bus_trim+0x50/0x90
> [ 32.065560] acpi_device_hotplug+0x2fa/0x3e0
> [ 32.065900] acpi_hotplug_work_fn+0x15/0x20
> [ 32.066249] process_one_work+0x2a0/0x650
> [ 32.066591] worker_thread+0x34/0x3d0
> [ 32.066925] ? process_one_work+0x650/0x650
> [ 32.067275] kthread+0x118/0x130
> [ 32.067542] ? kthread_create_on_node+0x60/0x60
> [ 32.067909] ret_from_fork+0x3a/0x50
>
>>
>> I wonder if that ordering prevents going too deep into the
>> device_unregister() call stack that you highlighted below.
>>
>>
>>>
>>> Here is the problem:
>>>
>>> When we offline pages we have the following call stack:
>>>
>>> # echo offline > /sys/devices/system/memory/memory8/state
>>> ksys_write
>>> vfs_write
>>> __vfs_write
>>> kernfs_fop_write
>>> kernfs_get_active
>>> lock_acquire kn->count#122 (lock for
>>> "memory8/state" kn)
>>> sysfs_kf_write
>>> dev_attr_store
>>> state_store
>>> device_offline
>>> memory_subsys_offline
>>> memory_block_action
>>> offline_pages
>>> __offline_pages
>>> percpu_down_write
>>> down_write
>>> lock_acquire mem_hotplug_lock.rw_sem
>>>
>>> When we unbind dax0.0 we have the following stack:
>>> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>>> drv_attr_store
>>> unbind_store
>>> device_driver_detach
>>> device_release_driver_internal
>>> dev_dax_kmem_remove
>>> remove_memory device_hotplug_lock
>>> try_remove_memory mem_hotplug_lock.rw_sem
>>> arch_remove_memory
>>> __remove_pages
>>> __remove_section
>>> unregister_memory_section
>>> remove_memory_section mem_sysfs_mutex
>>> unregister_memory
>>> device_unregister
>>> device_del
>>> device_remove_attrs
>>> sysfs_remove_groups
>>> sysfs_remove_group
>>> remove_files
>>> kernfs_remove_by_name
>>> kernfs_remove_by_name_ns
>>> __kernfs_remove kn->count#122
>>>
>>> So, lockdep found the ordering issue with the above two stacks:
>>>
>>> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
>>> 2. mem_hotplug_lock.rw_sem -> kn->count#122
I once documented locking behavior in
Documentation/core-api/memory-hotplug.rst
Both, device_online() and __remove_memory() always have to be called
holding the device_hotplug_lock(), to avoid such races.
# echo offline > /sys/devices/system/memory/memory8/state
and
# echo 0 > /sys/devices/system/memory/memory8/online
either end up in:
-> online_store()
-- lock_device_hotplug_sysfs();
-- device_offline(dev)
-> state_store()
-- lock_device_hotplug_sysfs();
-- device_online(&mem->dev);
So the device_hotplug_lock prohibits the race you describe.
BUT
There is a possible race between the device_hotplug_lock and the
kn->count#122. However, that race can never trigger, as userspace
properly backs off in case it cannot get grip of the device_hotplug_lock.
(lock_device_hotplug_sysfs does a mutex_trylock(&device_hotplug_lock))
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 21+ messages in thread