* [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
@ 2019-05-07 23:55 Dan Williams
  2019-05-07 23:56 ` [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action() Dan Williams
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Dan Williams @ 2019-05-07 23:55 UTC
  To: akpm
  Cc: Ira Weiny, Bjorn Helgaas, Logan Gunthorpe, Christoph Hellwig,
	Jérôme Glisse, Greg Kroah-Hartman, Rafael J. Wysocki,
	linux-kernel, linux-nvdimm, linux-mm

Changes since v1 [1]:
- Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)

- Refresh the p2pdma patch headers to match the format of other p2pdma
  patches (Bjorn)

- Collect Ira's reviewed-by

[1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/

---

Logan audited the devm_memremap_pages() shutdown path and noticed that
it was possible to proceed to arch_remove_memory() before all
potential page references had been reaped.

Introduce a new ->cleanup() callback to do the work of waiting for any
straggling page references and then perform the percpu_ref_exit() in
devm_memremap_pages_release() context.

For p2pdma this involves some deeper reworks to reference count
resources on a per-instance basis rather than a per-pci-device basis. A
modified genalloc api is introduced to convey a driver-private pointer
through gen_pool_{alloc,free}() interfaces. Also, a
devm_memunmap_pages() api is introduced since p2pdma does not
auto-release resources on a setup failure.

The dax and pmem changes pass the nvdimm unit tests, and the p2pdma
changes should now pass testing with the pci_p2pdma_release() fix.
Jérôme, how does this look for HMM?

In general, I think these patches / fixes are suitable for v5.2-rc1 or
v5.2-rc2, and since they touch kernel/memremap.c and various other
pieces of the core, they should go through the -mm tree. These patches
merge cleanly with the current state of -next, pass the nvdimm unit
tests, and are exposed to the 0day robot with no issues reported
(https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending).

---

Dan Williams (6):
      drivers/base/devres: Introduce devm_release_action()
      mm/devm_memremap_pages: Introduce devm_memunmap_pages
      PCI/P2PDMA: Fix the gen_pool_add_virt() failure path
      lib/genalloc: Introduce chunk owners
      PCI/P2PDMA: Track pgmap references per resource, not globally
      mm/devm_memremap_pages: Fix final page put race


 drivers/base/devres.c             |   24 +++++++-
 drivers/dax/device.c              |   13 +---
 drivers/nvdimm/pmem.c             |   17 ++++-
 drivers/pci/p2pdma.c              |  115 +++++++++++++++++++++++--------------
 include/linux/device.h            |    1 +
 include/linux/genalloc.h          |   55 ++++++++++++++++--
 include/linux/memremap.h          |    8 +++
 kernel/memremap.c                 |   23 ++++++-
 lib/genalloc.c                    |   51 ++++++++--------
 mm/hmm.c                          |   14 +----
 tools/testing/nvdimm/test/iomap.c |    2 +
 11 files changed, 217 insertions(+), 106 deletions(-)



* [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action()
  2019-05-07 23:55 [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Dan Williams
@ 2019-05-07 23:56 ` Dan Williams
  2019-05-14 19:12   ` Greg Kroah-Hartman
  2019-05-07 23:56 ` [PATCH v2 2/6] mm/devm_memremap_pages: Introduce devm_memunmap_pages Dan Williams
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2019-05-07 23:56 UTC
  To: akpm
  Cc: Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig,
	Greg Kroah-Hartman, Rafael J. Wysocki, Ira Weiny, linux-kernel,
	linux-nvdimm, linux-mm

The devm_add_action() facility allows a resource allocation routine to
add custom devm semantics. One such user is devm_memremap_pages().

There is now a need to manually trigger devm_memremap_pages_release().
Introduce devm_release_action() so the release action can be triggered
via a new devm_memunmap_pages() api in a follow-on change.
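
For illustration, a minimal sketch of the intended pairing (all my_*
names here are invented for the example, not part of the patch):

#include <linux/device.h>

struct my_ctx {
	bool quiesced;
};

static void my_teardown(void *data)
{
	struct my_ctx *ctx = data;

	ctx->quiesced = true;	/* stand-in for real teardown work */
}

static int my_probe(struct device *dev, struct my_ctx *ctx)
{
	/* my_teardown(ctx) runs automatically at device unbind... */
	return devm_add_action_or_reset(dev, my_teardown, ctx);
}

static void my_early_shutdown(struct device *dev, struct my_ctx *ctx)
{
	/*
	 * ...or can be triggered ahead of unbind: devm_release_action()
	 * runs my_teardown(ctx) now and removes it from the devres stack
	 * so it does not run a second time at unbind.
	 */
	devm_release_action(dev, my_teardown, ctx);
}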

Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/base/devres.c  |   24 +++++++++++++++++++++++-
 include/linux/device.h |    1 +
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/base/devres.c b/drivers/base/devres.c
index e038e2b3b7ea..0bbb328bd17f 100644
--- a/drivers/base/devres.c
+++ b/drivers/base/devres.c
@@ -755,10 +755,32 @@ void devm_remove_action(struct device *dev, void (*action)(void *), void *data)
 
 	WARN_ON(devres_destroy(dev, devm_action_release, devm_action_match,
 			       &devres));
-
 }
 EXPORT_SYMBOL_GPL(devm_remove_action);
 
+/**
+ * devm_release_action() - release previously added custom action
+ * @dev: Device that owns the action
+ * @action: Function implementing the action
+ * @data: Pointer to data passed to @action implementation
+ *
+ * Releases and removes instance of @action previously added by
+ * devm_add_action().  Both action and data should match one of the
+ * existing entries.
+ */
+void devm_release_action(struct device *dev, void (*action)(void *), void *data)
+{
+	struct action_devres devres = {
+		.data = data,
+		.action = action,
+	};
+
+	WARN_ON(devres_release(dev, devm_action_release, devm_action_match,
+			       &devres));
+
+}
+EXPORT_SYMBOL_GPL(devm_release_action);
+
 /*
  * Managed kmalloc/kfree
  */
diff --git a/include/linux/device.h b/include/linux/device.h
index 4e6987e11f68..6d7fd5370f3d 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -713,6 +713,7 @@ void __iomem *devm_of_iomap(struct device *dev,
 /* allows to add/remove a custom action to devres stack */
 int devm_add_action(struct device *dev, void (*action)(void *), void *data);
 void devm_remove_action(struct device *dev, void (*action)(void *), void *data);
+void devm_release_action(struct device *dev, void (*action)(void *), void *data);
 
 static inline int devm_add_action_or_reset(struct device *dev,
 					   void (*action)(void *), void *data)



* [PATCH v2 2/6] mm/devm_memremap_pages: Introduce devm_memunmap_pages
  2019-05-07 23:55 [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Dan Williams
  2019-05-07 23:56 ` [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action() Dan Williams
@ 2019-05-07 23:56 ` Dan Williams
  2019-05-07 23:56 ` [PATCH v2 3/6] PCI/P2PDMA: Fix the gen_pool_add_virt() failure path Dan Williams
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2019-05-07 23:56 UTC
  To: akpm
  Cc: Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig, Ira Weiny,
	linux-kernel, linux-nvdimm, linux-mm

Use the new devm_release_action() facility to allow
devm_memremap_pages_release() to be manually triggered.
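
A minimal usage sketch (my_map_then_bail() is an invented name, and it
assumes pgmap was initialized per the devm_memremap_pages()
requirements, i.e. res, ref and kill are populated):

#include <linux/err.h>
#include <linux/memremap.h>

static int my_map_then_bail(struct device *dev, struct dev_pagemap *pgmap)
{
	void *addr = devm_memremap_pages(dev, pgmap);

	if (IS_ERR(addr))
		return PTR_ERR(addr);

	/* ...a later setup step fails... */

	/* Unwind the mapping now rather than waiting for device unbind. */
	devm_memunmap_pages(dev, pgmap);
	return -ENXIO;
}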

Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memremap.h |    6 ++++++
 kernel/memremap.c        |    6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f0628660d541..7601ee314c4a 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -100,6 +100,7 @@ struct dev_pagemap {
 
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
+void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap);
 
@@ -118,6 +119,11 @@ static inline void *devm_memremap_pages(struct device *dev,
 	return ERR_PTR(-ENXIO);
 }
 
+static inline void devm_memunmap_pages(struct device *dev,
+		struct dev_pagemap *pgmap)
+{
+}
+
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index a856cb5ff192..65afbacab44e 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -266,6 +266,12 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 }
 EXPORT_SYMBOL_GPL(devm_memremap_pages);
 
+void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap)
+{
+	devm_release_action(dev, devm_memremap_pages_release, pgmap);
+}
+EXPORT_SYMBOL_GPL(devm_memunmap_pages);
+
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {
 	/* number of pfns from base where pfn_to_page() is valid */



* [PATCH v2 3/6] PCI/P2PDMA: Fix the gen_pool_add_virt() failure path
  2019-05-07 23:55 [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Dan Williams
  2019-05-07 23:56 ` [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action() Dan Williams
  2019-05-07 23:56 ` [PATCH v2 2/6] mm/devm_memremap_pages: Introduce devm_memunmap_pages Dan Williams
@ 2019-05-07 23:56 ` Dan Williams
  2019-05-07 23:56 ` [PATCH v2 4/6] lib/genalloc: Introduce chunk owners Dan Williams
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2019-05-07 23:56 UTC
  To: akpm
  Cc: Logan Gunthorpe, Christoph Hellwig, Ira Weiny, Bjorn Helgaas,
	linux-kernel, linux-nvdimm, linux-mm

The pci_p2pdma_add_resource() implementation immediately frees the pgmap
if gen_pool_add_virt() fails. However, that means that when @dev
triggers a devres release devm_memremap_pages_release() will crash
trying to access the freed @pgmap.

Use the new devm_memunmap_pages() to manually free the mapping in the
error path.

Fixes: 52916982af48 ("PCI/P2PDMA: Support peer-to-peer memory")
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/pci/p2pdma.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index c52298d76e64..595a534bd749 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -208,13 +208,15 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 			pci_bus_address(pdev, bar) + offset,
 			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
 	if (error)
-		goto pgmap_free;
+		goto pages_free;
 
 	pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
 		 &pgmap->res);
 
 	return 0;
 
+pages_free:
+	devm_memunmap_pages(&pdev->dev, pgmap);
 pgmap_free:
 	devm_kfree(&pdev->dev, pgmap);
 	return error;



* [PATCH v2 4/6] lib/genalloc: Introduce chunk owners
  2019-05-07 23:55 [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Dan Williams
                   ` (2 preceding siblings ...)
  2019-05-07 23:56 ` [PATCH v2 3/6] PCI/P2PDMA: Fix the gen_pool_add_virt() failure path Dan Williams
@ 2019-05-07 23:56 ` Dan Williams
  2019-05-07 23:56 ` [PATCH v2 5/6] PCI/P2PDMA: Track pgmap references per resource, not globally Dan Williams
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2019-05-07 23:56 UTC
  To: akpm
  Cc: Logan Gunthorpe, Bjorn Helgaas, Jérôme Glisse,
	Christoph Hellwig, Ira Weiny, linux-kernel, linux-nvdimm,
	linux-mm

The p2pdma facility enables a provider to publish a pool of dma
addresses for a consumer to allocate. A genpool is used internally by
p2pdma to collect dma resources, 'chunks', to be handed out to
consumers. Whenever a consumer allocates a resource it needs to pin the
'struct dev_pagemap' instance that backs the chunk selected by
pci_alloc_p2pmem().

Currently that reference is taken globally on the entire provider
device. That sets up a lifetime mismatch whereby the p2pdma core needs
to maintain hacks to make sure the percpu_ref is not released twice.

This lifetime mismatch also stands in the way of a fix to
devm_memremap_pages() whereby devm_memremap_pages_release() must wait
for the percpu_ref ->release() callback to complete before it can
proceed to teardown pages.

So, towards fixing this situation, introduce the ability to store a
'chunk owner' at gen_pool_add() time, and a facility to retrieve the
owner at gen_pool_{alloc,free}() time. For p2pdma this will be used to
store and recall individual dev_pagemap reference counter instances
per-chunk.
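
A sketch of the intended plumbing, modeled on the p2pdma conversion in
a later patch (publish_chunk() and alloc_pinned() are invented names):

#include <linux/genalloc.h>
#include <linux/percpu-refcount.h>

/* At publish time, stash the chunk's reference counter as its owner. */
static int publish_chunk(struct gen_pool *pool, unsigned long vaddr,
		phys_addr_t phys, size_t size, int nid, struct percpu_ref *ref)
{
	return gen_pool_add_owner(pool, vaddr, phys, size, nid, ref);
}

/* At alloc time, recall the owner and pin exactly that chunk. */
static unsigned long alloc_pinned(struct gen_pool *pool, size_t size)
{
	struct percpu_ref *ref;
	unsigned long vaddr;

	vaddr = gen_pool_alloc_owner(pool, size, (void **)&ref);
	if (vaddr && !percpu_ref_tryget_live(ref)) {
		gen_pool_free(pool, vaddr, size);
		vaddr = 0;
	}
	return vaddr;
}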

Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/genalloc.h |   55 +++++++++++++++++++++++++++++++++++++++++-----
 lib/genalloc.c           |   51 +++++++++++++++++++++----------------------
 2 files changed, 74 insertions(+), 32 deletions(-)

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index dd0a452373e7..a337313e064f 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -75,6 +75,7 @@ struct gen_pool_chunk {
 	struct list_head next_chunk;	/* next chunk in pool */
 	atomic_long_t avail;
 	phys_addr_t phys_addr;		/* physical starting address of memory chunk */
+	void *owner;			/* private data to retrieve at alloc time */
 	unsigned long start_addr;	/* start address of memory chunk */
 	unsigned long end_addr;		/* end address of memory chunk (inclusive) */
 	unsigned long bits[0];		/* bitmap for allocating memory chunk */
@@ -96,8 +97,15 @@ struct genpool_data_fixed {
 
 extern struct gen_pool *gen_pool_create(int, int);
 extern phys_addr_t gen_pool_virt_to_phys(struct gen_pool *pool, unsigned long);
-extern int gen_pool_add_virt(struct gen_pool *, unsigned long, phys_addr_t,
-			     size_t, int);
+extern int gen_pool_add_owner(struct gen_pool *, unsigned long, phys_addr_t,
+			     size_t, int, void *);
+
+static inline int gen_pool_add_virt(struct gen_pool *pool, unsigned long addr,
+		phys_addr_t phys, size_t size, int nid)
+{
+	return gen_pool_add_owner(pool, addr, phys, size, nid, NULL);
+}
+
 /**
  * gen_pool_add - add a new chunk of special memory to the pool
  * @pool: pool to add new memory chunk to
@@ -116,12 +124,47 @@ static inline int gen_pool_add(struct gen_pool *pool, unsigned long addr,
 	return gen_pool_add_virt(pool, addr, -1, size, nid);
 }
 extern void gen_pool_destroy(struct gen_pool *);
-extern unsigned long gen_pool_alloc(struct gen_pool *, size_t);
-extern unsigned long gen_pool_alloc_algo(struct gen_pool *, size_t,
-		genpool_algo_t algo, void *data);
+unsigned long gen_pool_alloc_algo_owner(struct gen_pool *pool, size_t size,
+		genpool_algo_t algo, void *data, void **owner);
+
+static inline unsigned long gen_pool_alloc_owner(struct gen_pool *pool,
+		size_t size, void **owner)
+{
+	return gen_pool_alloc_algo_owner(pool, size, pool->algo, pool->data,
+			owner);
+}
+
+static inline unsigned long gen_pool_alloc_algo(struct gen_pool *pool,
+		size_t size, genpool_algo_t algo, void *data)
+{
+	return gen_pool_alloc_algo_owner(pool, size, algo, data, NULL);
+}
+
+/**
+ * gen_pool_alloc - allocate special memory from the pool
+ * @pool: pool to allocate from
+ * @size: number of bytes to allocate from the pool
+ *
+ * Allocate the requested number of bytes from the specified pool.
+ * Uses the pool allocation function (with first-fit algorithm by default).
+ * Can not be used in NMI handler on architectures without
+ * NMI-safe cmpxchg implementation.
+ */
+static inline unsigned long gen_pool_alloc(struct gen_pool *pool, size_t size)
+{
+	return gen_pool_alloc_algo(pool, size, pool->algo, pool->data);
+}
+
 extern void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size,
 		dma_addr_t *dma);
-extern void gen_pool_free(struct gen_pool *, unsigned long, size_t);
+extern void gen_pool_free_owner(struct gen_pool *pool, unsigned long addr,
+		size_t size, void **owner);
+static inline void gen_pool_free(struct gen_pool *pool, unsigned long addr,
+                size_t size)
+{
+	gen_pool_free_owner(pool, addr, size, NULL);
+}
+
 extern void gen_pool_for_each_chunk(struct gen_pool *,
 	void (*)(struct gen_pool *, struct gen_pool_chunk *, void *), void *);
 extern size_t gen_pool_avail(struct gen_pool *);
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 7e85d1e37a6e..770c769d7cb7 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -168,20 +168,21 @@ struct gen_pool *gen_pool_create(int min_alloc_order, int nid)
 EXPORT_SYMBOL(gen_pool_create);
 
 /**
- * gen_pool_add_virt - add a new chunk of special memory to the pool
+ * gen_pool_add_owner - add a new chunk of special memory to the pool
  * @pool: pool to add new memory chunk to
  * @virt: virtual starting address of memory chunk to add to pool
  * @phys: physical starting address of memory chunk to add to pool
  * @size: size in bytes of the memory chunk to add to pool
  * @nid: node id of the node the chunk structure and bitmap should be
  *       allocated on, or -1
+ * @owner: private data the publisher would like to recall at alloc time
  *
  * Add a new chunk of special memory to the specified pool.
  *
  * Returns 0 on success or a -ve errno on failure.
  */
-int gen_pool_add_virt(struct gen_pool *pool, unsigned long virt, phys_addr_t phys,
-		 size_t size, int nid)
+int gen_pool_add_owner(struct gen_pool *pool, unsigned long virt, phys_addr_t phys,
+		 size_t size, int nid, void *owner)
 {
 	struct gen_pool_chunk *chunk;
 	int nbits = size >> pool->min_alloc_order;
@@ -195,6 +196,7 @@ int gen_pool_add_virt(struct gen_pool *pool, unsigned long virt, phys_addr_t phy
 	chunk->phys_addr = phys;
 	chunk->start_addr = virt;
 	chunk->end_addr = virt + size - 1;
+	chunk->owner = owner;
 	atomic_long_set(&chunk->avail, size);
 
 	spin_lock(&pool->lock);
@@ -203,7 +205,7 @@ int gen_pool_add_virt(struct gen_pool *pool, unsigned long virt, phys_addr_t phy
 
 	return 0;
 }
-EXPORT_SYMBOL(gen_pool_add_virt);
+EXPORT_SYMBOL(gen_pool_add_owner);
 
 /**
  * gen_pool_virt_to_phys - return the physical address of memory
@@ -260,35 +262,20 @@ void gen_pool_destroy(struct gen_pool *pool)
 EXPORT_SYMBOL(gen_pool_destroy);
 
 /**
- * gen_pool_alloc - allocate special memory from the pool
- * @pool: pool to allocate from
- * @size: number of bytes to allocate from the pool
- *
- * Allocate the requested number of bytes from the specified pool.
- * Uses the pool allocation function (with first-fit algorithm by default).
- * Can not be used in NMI handler on architectures without
- * NMI-safe cmpxchg implementation.
- */
-unsigned long gen_pool_alloc(struct gen_pool *pool, size_t size)
-{
-	return gen_pool_alloc_algo(pool, size, pool->algo, pool->data);
-}
-EXPORT_SYMBOL(gen_pool_alloc);
-
-/**
- * gen_pool_alloc_algo - allocate special memory from the pool
+ * gen_pool_alloc_algo_owner - allocate special memory from the pool
  * @pool: pool to allocate from
  * @size: number of bytes to allocate from the pool
  * @algo: algorithm passed from caller
  * @data: data passed to algorithm
+ * @owner: optionally retrieve the chunk owner
  *
  * Allocate the requested number of bytes from the specified pool.
  * Uses the pool allocation function (with first-fit algorithm by default).
  * Can not be used in NMI handler on architectures without
  * NMI-safe cmpxchg implementation.
  */
-unsigned long gen_pool_alloc_algo(struct gen_pool *pool, size_t size,
-		genpool_algo_t algo, void *data)
+unsigned long gen_pool_alloc_algo_owner(struct gen_pool *pool, size_t size,
+		genpool_algo_t algo, void *data, void **owner)
 {
 	struct gen_pool_chunk *chunk;
 	unsigned long addr = 0;
@@ -299,6 +286,9 @@ unsigned long gen_pool_alloc_algo(struct gen_pool *pool, size_t size,
 	BUG_ON(in_nmi());
 #endif
 
+	if (owner)
+		*owner = NULL;
+
 	if (size == 0)
 		return 0;
 
@@ -326,12 +316,14 @@ unsigned long gen_pool_alloc_algo(struct gen_pool *pool, size_t size,
 		addr = chunk->start_addr + ((unsigned long)start_bit << order);
 		size = nbits << order;
 		atomic_long_sub(size, &chunk->avail);
+		if (owner)
+			*owner = chunk->owner;
 		break;
 	}
 	rcu_read_unlock();
 	return addr;
 }
-EXPORT_SYMBOL(gen_pool_alloc_algo);
+EXPORT_SYMBOL(gen_pool_alloc_algo_owner);
 
 /**
  * gen_pool_dma_alloc - allocate special memory from the pool for DMA usage
@@ -367,12 +359,14 @@ EXPORT_SYMBOL(gen_pool_dma_alloc);
  * @pool: pool to free to
  * @addr: starting address of memory to free back to pool
  * @size: size in bytes of memory to free
+ * @owner: private data stashed at gen_pool_add() time
  *
  * Free previously allocated special memory back to the specified
  * pool.  Can not be used in NMI handler on architectures without
  * NMI-safe cmpxchg implementation.
  */
-void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size)
+void gen_pool_free_owner(struct gen_pool *pool, unsigned long addr, size_t size,
+		void **owner)
 {
 	struct gen_pool_chunk *chunk;
 	int order = pool->min_alloc_order;
@@ -382,6 +376,9 @@ void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size)
 	BUG_ON(in_nmi());
 #endif
 
+	if (owner)
+		*owner = NULL;
+
 	nbits = (size + (1UL << order) - 1) >> order;
 	rcu_read_lock();
 	list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk) {
@@ -392,6 +389,8 @@ void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size)
 			BUG_ON(remain);
 			size = nbits << order;
 			atomic_long_add(size, &chunk->avail);
+			if (owner)
+				*owner = chunk->owner;
 			rcu_read_unlock();
 			return;
 		}
@@ -399,7 +398,7 @@ void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size)
 	rcu_read_unlock();
 	BUG();
 }
-EXPORT_SYMBOL(gen_pool_free);
+EXPORT_SYMBOL(gen_pool_free_owner);
 
 /**
  * gen_pool_for_each_chunk - call func for every chunk of generic memory pool



* [PATCH v2 5/6] PCI/P2PDMA: Track pgmap references per resource, not globally
  2019-05-07 23:55 [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Dan Williams
                   ` (3 preceding siblings ...)
  2019-05-07 23:56 ` [PATCH v2 4/6] lib/genalloc: Introduce chunk owners Dan Williams
@ 2019-05-07 23:56 ` Dan Williams
  2019-05-07 23:56 ` [PATCH v2 6/6] mm/devm_memremap_pages: Fix final page put race Dan Williams
  2019-05-08 17:05 ` [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Logan Gunthorpe
  6 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2019-05-07 23:56 UTC
  To: akpm
  Cc: Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig, Ira Weiny,
	linux-kernel, linux-nvdimm, linux-mm

In preparation for fixing a race between devm_memremap_pages_release()
and the final put of a page from the device-page-map, allocate a
percpu-ref per p2pdma resource mapping.

Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/pci/p2pdma.c |  124 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 81 insertions(+), 43 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 595a534bd749..54d475569058 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -20,12 +20,16 @@
 #include <linux/seq_buf.h>
 
 struct pci_p2pdma {
-	struct percpu_ref devmap_ref;
-	struct completion devmap_ref_done;
 	struct gen_pool *pool;
 	bool p2pmem_published;
 };
 
+struct p2pdma_pagemap {
+	struct dev_pagemap pgmap;
+	struct percpu_ref ref;
+	struct completion ref_done;
+};
+
 static ssize_t size_show(struct device *dev, struct device_attribute *attr,
 			 char *buf)
 {
@@ -74,41 +78,45 @@ static const struct attribute_group p2pmem_group = {
 	.name = "p2pmem",
 };
 
+static struct p2pdma_pagemap *to_p2p_pgmap(struct percpu_ref *ref)
+{
+	return container_of(ref, struct p2pdma_pagemap, ref);
+}
+
 static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
 {
-	struct pci_p2pdma *p2p =
-		container_of(ref, struct pci_p2pdma, devmap_ref);
+	struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
 
-	complete_all(&p2p->devmap_ref_done);
+	complete(&p2p_pgmap->ref_done);
 }
 
 static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
 {
-	/*
-	 * pci_p2pdma_add_resource() may be called multiple times
-	 * by a driver and may register the percpu_kill devm action multiple
-	 * times. We only want the first action to actually kill the
-	 * percpu_ref.
-	 */
-	if (percpu_ref_is_dying(ref))
-		return;
-
 	percpu_ref_kill(ref);
 }
 
+static void pci_p2pdma_percpu_cleanup(void *ref)
+{
+	struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
+
+	wait_for_completion(&p2p_pgmap->ref_done);
+	percpu_ref_exit(&p2p_pgmap->ref);
+}
+
 static void pci_p2pdma_release(void *data)
 {
 	struct pci_dev *pdev = data;
+	struct pci_p2pdma *p2pdma = pdev->p2pdma;
 
-	if (!pdev->p2pdma)
+	if (!p2pdma)
 		return;
 
-	wait_for_completion(&pdev->p2pdma->devmap_ref_done);
-	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+	/* Flush and disable pci_alloc_p2p_mem() */
+	pdev->p2pdma = NULL;
+	synchronize_rcu();
 
-	gen_pool_destroy(pdev->p2pdma->pool);
+	gen_pool_destroy(p2pdma->pool);
 	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
-	pdev->p2pdma = NULL;
 }
 
 static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -124,12 +132,6 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 	if (!p2p->pool)
 		goto out;
 
-	init_completion(&p2p->devmap_ref_done);
-	error = percpu_ref_init(&p2p->devmap_ref,
-			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
-	if (error)
-		goto out_pool_destroy;
-
 	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
 	if (error)
 		goto out_pool_destroy;
@@ -163,6 +165,7 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 			    u64 offset)
 {
+	struct p2pdma_pagemap *p2p_pgmap;
 	struct dev_pagemap *pgmap;
 	void *addr;
 	int error;
@@ -185,14 +188,32 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 			return error;
 	}
 
-	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
-	if (!pgmap)
+	p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
+	if (!p2p_pgmap)
 		return -ENOMEM;
 
+	init_completion(&p2p_pgmap->ref_done);
+	error = percpu_ref_init(&p2p_pgmap->ref,
+			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+	if (error)
+		goto pgmap_free;
+
+	/*
+	 * FIXME: the percpu_ref_exit needs to be coordinated internal
+	 * to devm_memremap_pages_release(). Duplicate the same ordering
+	 * as other devm_memremap_pages() users for now.
+	 */
+	error = devm_add_action(&pdev->dev, pci_p2pdma_percpu_cleanup,
+			&p2p_pgmap->ref);
+	if (error)
+		goto ref_cleanup;
+
+	pgmap = &p2p_pgmap->pgmap;
+
 	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
 	pgmap->res.end = pgmap->res.start + size - 1;
 	pgmap->res.flags = pci_resource_flags(pdev, bar);
-	pgmap->ref = &pdev->p2pdma->devmap_ref;
+	pgmap->ref = &p2p_pgmap->ref;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 	pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
 		pci_resource_start(pdev, bar);
@@ -201,12 +222,13 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
 		error = PTR_ERR(addr);
-		goto pgmap_free;
+		goto ref_cleanup;
 	}
 
-	error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
+	error = gen_pool_add_owner(pdev->p2pdma->pool, (unsigned long)addr,
 			pci_bus_address(pdev, bar) + offset,
-			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+			resource_size(&pgmap->res), dev_to_node(&pdev->dev),
+			&p2p_pgmap->ref);
 	if (error)
 		goto pages_free;
 
@@ -217,8 +239,10 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 
 pages_free:
 	devm_memunmap_pages(&pdev->dev, pgmap);
+ref_cleanup:
+	percpu_ref_exit(&p2p_pgmap->ref);
 pgmap_free:
-	devm_kfree(&pdev->dev, pgmap);
+	devm_kfree(&pdev->dev, p2p_pgmap);
 	return error;
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
@@ -555,19 +579,30 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_find_many);
  */
 void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
 {
-	void *ret;
+	void *ret = NULL;
+	struct percpu_ref *ref;
 
+	/*
+	 * Pairs with synchronize_rcu() in pci_p2pdma_release() to
+	 * ensure pdev->p2pdma is non-NULL for the duration of the
+	 * read-lock.
+	 */
+	rcu_read_lock();
 	if (unlikely(!pdev->p2pdma))
-		return NULL;
-
-	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
-		return NULL;
-
-	ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
+		goto out;
 
-	if (unlikely(!ret))
-		percpu_ref_put(&pdev->p2pdma->devmap_ref);
+	ret = (void *)gen_pool_alloc_owner(pdev->p2pdma->pool, size,
+			(void **) &ref);
+	if (!ret)
+		goto out;
 
+	if (unlikely(!percpu_ref_tryget_live(ref))) {
+		gen_pool_free(pdev->p2pdma->pool, (unsigned long) ret, size);
+		ret = NULL;
+		goto out;
+	}
+out:
+	rcu_read_unlock();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
@@ -580,8 +615,11 @@ EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
  */
 void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
 {
-	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
-	percpu_ref_put(&pdev->p2pdma->devmap_ref);
+	struct percpu_ref *ref;
+
+	gen_pool_free_owner(pdev->p2pdma->pool, (uintptr_t)addr, size,
+			(void **) &ref);
+	percpu_ref_put(ref);
 }
 EXPORT_SYMBOL_GPL(pci_free_p2pmem);
 



* [PATCH v2 6/6] mm/devm_memremap_pages: Fix final page put race
  2019-05-07 23:55 [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Dan Williams
                   ` (4 preceding siblings ...)
  2019-05-07 23:56 ` [PATCH v2 5/6] PCI/P2PDMA: Track pgmap references per resource, not globally Dan Williams
@ 2019-05-07 23:56 ` Dan Williams
  2019-05-08 17:05 ` [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Logan Gunthorpe
  6 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2019-05-07 23:56 UTC
  To: akpm
  Cc: Logan Gunthorpe, Bjorn Helgaas, Jérôme Glisse,
	Christoph Hellwig, Ira Weiny, linux-kernel, linux-nvdimm,
	linux-mm

Logan noticed that devm_memremap_pages_release() kills the percpu_ref,
drops all the page references that were acquired at init, and then
immediately proceeds to unplug, via arch_remove_memory(), the backing
pages for the pagemap. If for some reason device shutdown actually
collides with a busy / elevated-ref-count page then arch_remove_memory()
should be deferred until after that reference is dropped.

As it stands the "wait for last page ref drop" happens *after*
devm_memremap_pages_release() returns, which is obviously too late and
can lead to crashes.

Fix this situation by assigning the responsibility to wait for the
percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
callback. Implement the new cleanup callback for all
devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.
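
A condensed sketch of the resulting contract, modeled on the p2pdma
conversion below (struct my_map and the my_ref_* names are invented):
->kill() stops new references, and ->cleanup() must not return until
@ref is idle and exited:

#include <linux/completion.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>

struct my_map {
	struct dev_pagemap pgmap;
	struct percpu_ref ref;
	struct completion done;
};

static void my_ref_release(struct percpu_ref *ref)
{
	struct my_map *map = container_of(ref, struct my_map, ref);

	complete(&map->done);	/* the final page reference was dropped */
}

static void my_ref_kill(struct percpu_ref *ref)
{
	percpu_ref_kill(ref);	/* make percpu_ref_tryget_live() fail */
}

static void my_ref_cleanup(struct percpu_ref *ref)
{
	struct my_map *map = container_of(ref, struct my_map, ref);

	wait_for_completion(&map->done);	/* wait out straggler refs */
	percpu_ref_exit(ref);
}

/* Setup wires up both callbacks before mapping:
 *
 *	init_completion(&map->done);
 *	percpu_ref_init(&map->ref, my_ref_release, 0, GFP_KERNEL);
 *	map->pgmap.ref = &map->ref;
 *	map->pgmap.kill = my_ref_kill;
 *	map->pgmap.cleanup = my_ref_cleanup;
 *	addr = devm_memremap_pages(dev, &map->pgmap);
 */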

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Fixes: 41e94a851304 ("add devm_memremap_pages")
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c              |   13 +++----------
 drivers/nvdimm/pmem.c             |   17 +++++++++++++----
 drivers/pci/p2pdma.c              |   17 +++--------------
 include/linux/memremap.h          |    2 ++
 kernel/memremap.c                 |   17 ++++++++++++-----
 mm/hmm.c                          |   14 +++-----------
 tools/testing/nvdimm/test/iomap.c |    2 ++
 7 files changed, 38 insertions(+), 44 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index e428468ab661..e3aa78dd1bb0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -27,9 +27,8 @@ static void dev_dax_percpu_release(struct percpu_ref *ref)
 	complete(&dev_dax->cmp);
 }
 
-static void dev_dax_percpu_exit(void *data)
+static void dev_dax_percpu_exit(struct percpu_ref *ref)
 {
-	struct percpu_ref *ref = data;
 	struct dev_dax *dev_dax = ref_to_dev_dax(ref);
 
 	dev_dbg(&dev_dax->dev, "%s\n", __func__);
@@ -468,18 +467,12 @@ int dev_dax_probe(struct device *dev)
 	if (rc)
 		return rc;
 
-	rc = devm_add_action_or_reset(dev, dev_dax_percpu_exit, &dev_dax->ref);
-	if (rc)
-		return rc;
-
 	dev_dax->pgmap.ref = &dev_dax->ref;
 	dev_dax->pgmap.kill = dev_dax_percpu_kill;
+	dev_dax->pgmap.cleanup = dev_dax_percpu_exit;
 	addr = devm_memremap_pages(dev, &dev_dax->pgmap);
-	if (IS_ERR(addr)) {
-		devm_remove_action(dev, dev_dax_percpu_exit, &dev_dax->ref);
-		percpu_ref_exit(&dev_dax->ref);
+	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-	}
 
 	inode = dax_inode(dax_dev);
 	cdev = inode->i_cdev;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0279eb1da3ef..1c9181712fa4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -304,11 +304,19 @@ static const struct attribute_group *pmem_attribute_groups[] = {
 	NULL,
 };
 
-static void pmem_release_queue(void *q)
+static void __pmem_release_queue(struct percpu_ref *ref)
 {
+	struct request_queue *q;
+
+	q = container_of(ref, typeof(*q), q_usage_counter);
 	blk_cleanup_queue(q);
 }
 
+static void pmem_release_queue(void *ref)
+{
+	__pmem_release_queue(ref);
+}
+
 static void pmem_freeze_queue(struct percpu_ref *ref)
 {
 	struct request_queue *q;
@@ -400,12 +408,10 @@ static int pmem_attach_disk(struct device *dev,
 	if (!q)
 		return -ENOMEM;
 
-	if (devm_add_action_or_reset(dev, pmem_release_queue, q))
-		return -ENOMEM;
-
 	pmem->pfn_flags = PFN_DEV;
 	pmem->pgmap.ref = &q->q_usage_counter;
 	pmem->pgmap.kill = pmem_freeze_queue;
+	pmem->pgmap.cleanup = __pmem_release_queue;
 	if (is_nd_pfn(dev)) {
 		if (setup_pagemap_fsdax(dev, &pmem->pgmap))
 			return -ENOMEM;
@@ -426,6 +432,9 @@ static int pmem_attach_disk(struct device *dev,
 		pmem->pfn_flags |= PFN_MAP;
 		memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
 	} else {
+		if (devm_add_action_or_reset(dev, pmem_release_queue,
+					&q->q_usage_counter))
+			return -ENOMEM;
 		addr = devm_memremap(dev, pmem->phys_addr,
 				pmem->size, ARCH_MEMREMAP_PMEM);
 		memcpy(&bb_res, &nsio->res, sizeof(bb_res));
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 54d475569058..a7a66b958720 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -95,7 +95,7 @@ static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
 	percpu_ref_kill(ref);
 }
 
-static void pci_p2pdma_percpu_cleanup(void *ref)
+static void pci_p2pdma_percpu_cleanup(struct percpu_ref *ref)
 {
 	struct p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(ref);
 
@@ -198,16 +198,6 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	if (error)
 		goto pgmap_free;
 
-	/*
-	 * FIXME: the percpu_ref_exit needs to be coordinated internal
-	 * to devm_memremap_pages_release(). Duplicate the same ordering
-	 * as other devm_memremap_pages() users for now.
-	 */
-	error = devm_add_action(&pdev->dev, pci_p2pdma_percpu_cleanup,
-			&p2p_pgmap->ref);
-	if (error)
-		goto ref_cleanup;
-
 	pgmap = &p2p_pgmap->pgmap;
 
 	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
@@ -218,11 +208,12 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
 		pci_resource_start(pdev, bar);
 	pgmap->kill = pci_p2pdma_percpu_kill;
+	pgmap->cleanup = pci_p2pdma_percpu_cleanup;
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
 		error = PTR_ERR(addr);
-		goto ref_cleanup;
+		goto pgmap_free;
 	}
 
 	error = gen_pool_add_owner(pdev->p2pdma->pool, (unsigned long)addr,
@@ -239,8 +230,6 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 
 pages_free:
 	devm_memunmap_pages(&pdev->dev, pgmap);
-ref_cleanup:
-	percpu_ref_exit(&p2p_pgmap->ref);
 pgmap_free:
 	devm_kfree(&pdev->dev, p2p_pgmap);
 	return error;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7601ee314c4a..1732dea030b2 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -81,6 +81,7 @@ typedef void (*dev_page_free_t)(struct page *page, void *data);
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @kill: callback to transition @ref to the dead state
+ * @cleanup: callback to wait for @ref to be idle and reap it
  * @dev: host device of the mapping for debug
  * @data: private data pointer for page_free()
  * @type: memory type: see MEMORY_* in memory_hotplug.h
@@ -92,6 +93,7 @@ struct dev_pagemap {
 	struct resource res;
 	struct percpu_ref *ref;
 	void (*kill)(struct percpu_ref *ref);
+	void (*cleanup)(struct percpu_ref *ref);
 	struct device *dev;
 	void *data;
 	enum memory_type type;
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 65afbacab44e..05d1af5a2f15 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -96,6 +96,7 @@ static void devm_memremap_pages_release(void *data)
 	pgmap->kill(pgmap->ref);
 	for_each_device_pfn(pfn, pgmap)
 		put_page(pfn_to_page(pfn));
+	pgmap->cleanup(pgmap->ref);
 
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
@@ -134,8 +135,8 @@ static void devm_memremap_pages_release(void *data)
  * 2/ The altmap field may optionally be initialized, in which case altmap_valid
  *    must be set to true
  *
- * 3/ pgmap->ref must be 'live' on entry and will be killed at
- *    devm_memremap_pages_release() time, or if this routine fails.
+ * 3/ pgmap->ref must be 'live' on entry and will be killed and reaped
+ *    at devm_memremap_pages_release() time, or if this routine fails.
  *
  * 4/ res is expected to be a host memory range that could feasibly be
  *    treated as a "System RAM" range, i.e. not a device mmio range, but
@@ -151,8 +152,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	pgprot_t pgprot = PAGE_KERNEL;
 	int error, nid, is_ram;
 
-	if (!pgmap->ref || !pgmap->kill)
+	if (!pgmap->ref || !pgmap->kill || !pgmap->cleanup) {
+		WARN(1, "Missing reference count teardown definition\n");
 		return ERR_PTR(-EINVAL);
+	}
 
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
@@ -163,14 +166,16 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	if (conflict_pgmap) {
 		dev_WARN(dev, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
-		return ERR_PTR(-ENOMEM);
+		error = -ENOMEM;
+		goto err_array;
 	}
 
 	conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_end), NULL);
 	if (conflict_pgmap) {
 		dev_WARN(dev, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
-		return ERR_PTR(-ENOMEM);
+		error = -ENOMEM;
+		goto err_array;
 	}
 
 	is_ram = region_intersects(align_start, align_size,
@@ -262,6 +267,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	pgmap_array_delete(res);
  err_array:
 	pgmap->kill(pgmap->ref);
+	pgmap->cleanup(pgmap->ref);
+
 	return ERR_PTR(error);
 }
 EXPORT_SYMBOL_GPL(devm_memremap_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index fe1cd87e49ac..225ade644058 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -975,9 +975,8 @@ static void hmm_devmem_ref_release(struct percpu_ref *ref)
 	complete(&devmem->completion);
 }
 
-static void hmm_devmem_ref_exit(void *data)
+static void hmm_devmem_ref_exit(struct percpu_ref *ref)
 {
-	struct percpu_ref *ref = data;
 	struct hmm_devmem *devmem;
 
 	devmem = container_of(ref, struct hmm_devmem, ref);
@@ -1054,10 +1053,6 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	if (ret)
 		return ERR_PTR(ret);
 
-	ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit, &devmem->ref);
-	if (ret)
-		return ERR_PTR(ret);
-
 	size = ALIGN(size, PA_SECTION_SIZE);
 	addr = min((unsigned long)iomem_resource.end,
 		   (1UL << MAX_PHYSMEM_BITS) - 1);
@@ -1096,6 +1091,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	devmem->pagemap.ref = &devmem->ref;
 	devmem->pagemap.data = devmem;
 	devmem->pagemap.kill = hmm_devmem_ref_kill;
+	devmem->pagemap.cleanup = hmm_devmem_ref_exit;
 
 	result = devm_memremap_pages(devmem->device, &devmem->pagemap);
 	if (IS_ERR(result))
@@ -1133,11 +1129,6 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	if (ret)
 		return ERR_PTR(ret);
 
-	ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit,
-			&devmem->ref);
-	if (ret)
-		return ERR_PTR(ret);
-
 	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
 	devmem->pfn_last = devmem->pfn_first +
 			   (resource_size(devmem->resource) >> PAGE_SHIFT);
@@ -1150,6 +1141,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	devmem->pagemap.ref = &devmem->ref;
 	devmem->pagemap.data = devmem;
 	devmem->pagemap.kill = hmm_devmem_ref_kill;
+	devmem->pagemap.cleanup = hmm_devmem_ref_exit;
 
 	result = devm_memremap_pages(devmem->device, &devmem->pagemap);
 	if (IS_ERR(result))
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index c6635fee27d8..219dd0a1cb08 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -108,7 +108,9 @@ static void nfit_test_kill(void *_pgmap)
 {
 	struct dev_pagemap *pgmap = _pgmap;
 
+	WARN_ON(!pgmap || !pgmap->ref || !pgmap->kill || !pgmap->cleanup);
 	pgmap->kill(pgmap->ref);
+	pgmap->cleanup(pgmap->ref);
 }
 
 void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)



* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-07 23:55 [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Dan Williams
                   ` (5 preceding siblings ...)
  2019-05-07 23:56 ` [PATCH v2 6/6] mm/devm_memremap_pages: Fix final page put race Dan Williams
@ 2019-05-08 17:05 ` Logan Gunthorpe
  2019-05-13 19:22   ` Logan Gunthorpe
  6 siblings, 1 reply; 19+ messages in thread
From: Logan Gunthorpe @ 2019-05-08 17:05 UTC
  To: Dan Williams, akpm
  Cc: Ira Weiny, Bjorn Helgaas, Christoph Hellwig,
	Jérôme Glisse, Greg Kroah-Hartman, Rafael J. Wysocki,
	linux-kernel, linux-nvdimm, linux-mm



On 2019-05-07 5:55 p.m., Dan Williams wrote:
> Changes since v1 [1]:
> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
> 
> - Refresh the p2pdma patch headers to match the format of other p2pdma
>    patches (Bjorn)
> 
> - Collect Ira's reviewed-by
> 
> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/

This series looks good to me:

Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

However, I haven't tested it yet but I intend to later this week.

Thanks,

Logan



* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-08 17:05 ` [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Logan Gunthorpe
@ 2019-05-13 19:22   ` Logan Gunthorpe
  2019-05-14 18:51     ` Jane Chu
  2019-05-31  4:17     ` Dan Williams
  0 siblings, 2 replies; 19+ messages in thread
From: Logan Gunthorpe @ 2019-05-13 19:22 UTC
  To: Dan Williams, akpm
  Cc: Ira Weiny, Bjorn Helgaas, Christoph Hellwig,
	Jérôme Glisse, Greg Kroah-Hartman, Rafael J. Wysocki,
	linux-kernel, linux-nvdimm, linux-mm



On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
> 
> 
> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>> Changes since v1 [1]:
>> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>>
>> - Refresh the p2pdma patch headers to match the format of other p2pdma
>>    patches (Bjorn)
>>
>> - Collect Ira's reviewed-by
>>
>> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
> 
> This series looks good to me:
> 
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> 
> However, I haven't tested it yet but I intend to later this week.

I've tested libnvdimm-pending which includes this series on my setup and
everything works great.

Thanks,

Logan



* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-13 19:22   ` Logan Gunthorpe
@ 2019-05-14 18:51     ` Jane Chu
  2019-05-14 19:04       ` Dan Williams
  2019-05-31  4:17     ` Dan Williams
  1 sibling, 1 reply; 19+ messages in thread
From: Jane Chu @ 2019-05-14 18:51 UTC
  To: Logan Gunthorpe, Dan Williams, akpm
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, linux-nvdimm,
	linux-kernel, linux-mm, Jérôme Glisse, Bjorn Helgaas,
	Christoph Hellwig


On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:

>
> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
>>
>> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>>> Changes since v1 [1]:
>>> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>>>
>>> - Refresh the p2pdma patch headers to match the format of other p2pdma
>>>     patches (Bjorn)
>>>
>>> - Collect Ira's reviewed-by
>>>
>>> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
>> This series looks good to me:
>>
>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>>
>> However, I haven't tested it yet but I intend to later this week.
> I've tested libnvdimm-pending which includes this series on my setup and
> everything works great.

Just wondering: in a different scenario where pmem pages are exported to
a KVM guest, and then by mistake the user issues "ndctl destroy-namespace -f",
will the kernel wait indefinitely until the user figures out to kill the guest
and release the pmem pages?

thanks,
-jane
  

>
> Thanks,
>
> Logan
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm



* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-14 18:51     ` Jane Chu
@ 2019-05-14 19:04       ` Dan Williams
  2019-05-14 21:18         ` Jane Chu
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Williams @ 2019-05-14 19:04 UTC
  To: Jane Chu
  Cc: Logan Gunthorpe, Andrew Morton, Rafael J. Wysocki,
	Greg Kroah-Hartman, linux-nvdimm, Linux Kernel Mailing List,
	Linux MM, Jérôme Glisse, Bjorn Helgaas,
	Christoph Hellwig

On Tue, May 14, 2019 at 11:53 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:
>
> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
>
> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>
> Changes since v1 [1]:
> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>
> - Refresh the p2pdma patch headers to match the format of other p2pdma
>    patches (Bjorn)
>
> - Collect Ira's reviewed-by
>
> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
>
> This series looks good to me:
>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>
> However, I haven't tested it yet but I intend to later this week.
>
> I've tested libnvdimm-pending which includes this series on my setup and
> everything works great.
>
> Just wondering: in a different scenario where pmem pages are exported to
> a KVM guest, and then by mistake the user issues "ndctl destroy-namespace -f",
> will the kernel wait indefinitely until the user figures out to kill the guest
> and release the pmem pages?

It depends on whether the pages are pinned. Typically DAX memory
mappings assigned to a guest are not pinned in the host and can be
invalidated at any time. The pinning only occurs with VFIO and
device-assignment, which isn't the common case, especially since that
configuration is blocked by fsdax. However, with devdax, yes you can
arrange for the system to go into an indefinite wait.

This somewhat ties back to the get_user_pages() vs DAX debate. The
indefinite stall issue with device-assignment could be addressed with
a requirement to hold a lease and expect that a lease revocation event
may escalate to SIGKILL in response to 'ndctl destroy-namespace'. The
expectation with device-dax is that it is already a raw interface with
pointy edges and caveats, but I would not be opposed to introducing a
lease semantic.



* Re: [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action()
  2019-05-07 23:56 ` [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action() Dan Williams
@ 2019-05-14 19:12   ` Greg Kroah-Hartman
  2019-05-14 19:24     ` Dan Williams
  0 siblings, 1 reply; 19+ messages in thread
From: Greg Kroah-Hartman @ 2019-05-14 19:12 UTC
  To: Dan Williams
  Cc: akpm, Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig,
	Rafael J. Wysocki, Ira Weiny, linux-kernel, linux-nvdimm,
	linux-mm

On Tue, May 07, 2019 at 04:56:05PM -0700, Dan Williams wrote:
> The devm_add_action() facility allows a resource allocation routine to
> add custom devm semantics. One such user is devm_memremap_pages().
> 
> There is now a need to manually trigger devm_memremap_pages_release().
> Introduce devm_release_action() so the release action can be triggered
> via a new devm_memunmap_pages() api in a follow-on change.
> 
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/base/devres.c  |   24 +++++++++++++++++++++++-
>  include/linux/device.h |    1 +
>  2 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/devres.c b/drivers/base/devres.c
> index e038e2b3b7ea..0bbb328bd17f 100644
> --- a/drivers/base/devres.c
> +++ b/drivers/base/devres.c
> @@ -755,10 +755,32 @@ void devm_remove_action(struct device *dev, void (*action)(void *), void *data)
>  
>  	WARN_ON(devres_destroy(dev, devm_action_release, devm_action_match,
>  			       &devres));
> -
>  }
>  EXPORT_SYMBOL_GPL(devm_remove_action);
>  
> +/**
> + * devm_release_action() - release previously added custom action
> + * @dev: Device that owns the action
> + * @action: Function implementing the action
> + * @data: Pointer to data passed to @action implementation
> + *
> + * Releases and removes instance of @action previously added by
> + * devm_add_action().  Both action and data should match one of the
> + * existing entries.
> + */
> +void devm_release_action(struct device *dev, void (*action)(void *), void *data)
> +{
> +	struct action_devres devres = {
> +		.data = data,
> +		.action = action,
> +	};
> +
> +	WARN_ON(devres_release(dev, devm_action_release, devm_action_match,
> +			       &devres));

How does the WARN_ON help here? Are we going to start getting syzbot
reports of this happening?

How can this fail?

thanks,

greg k-h



* Re: [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action()
  2019-05-14 19:12   ` Greg Kroah-Hartman
@ 2019-05-14 19:24     ` Dan Williams
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Williams @ 2019-05-14 19:24 UTC
  To: Greg Kroah-Hartman
  Cc: Andrew Morton, Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig,
	Rafael J. Wysocki, Ira Weiny, Linux Kernel Mailing List,
	linux-nvdimm, Linux MM

On Tue, May 14, 2019 at 12:12 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Tue, May 07, 2019 at 04:56:05PM -0700, Dan Williams wrote:
> > The devm_add_action() facility allows a resource allocation routine to
> > add custom devm semantics. One such user is devm_memremap_pages().
> >
> > There is now a need to manually trigger devm_memremap_pages_release().
> > Introduce devm_release_action() so the release action can be triggered
> > via a new devm_memunmap_pages() api in a follow-on change.
> >
> > Cc: Logan Gunthorpe <logang@deltatee.com>
> > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> > Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  drivers/base/devres.c  |   24 +++++++++++++++++++++++-
> >  include/linux/device.h |    1 +
> >  2 files changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/devres.c b/drivers/base/devres.c
> > index e038e2b3b7ea..0bbb328bd17f 100644
> > --- a/drivers/base/devres.c
> > +++ b/drivers/base/devres.c
> > @@ -755,10 +755,32 @@ void devm_remove_action(struct device *dev, void (*action)(void *), void *data)
> >
> >       WARN_ON(devres_destroy(dev, devm_action_release, devm_action_match,
> >                              &devres));
> > -
> >  }
> >  EXPORT_SYMBOL_GPL(devm_remove_action);
> >
> > +/**
> > + * devm_release_action() - release previously added custom action
> > + * @dev: Device that owns the action
> > + * @action: Function implementing the action
> > + * @data: Pointer to data passed to @action implementation
> > + *
> > + * Releases and removes instance of @action previously added by
> > + * devm_add_action().  Both action and data should match one of the
> > + * existing entries.
> > + */
> > +void devm_release_action(struct device *dev, void (*action)(void *), void *data)
> > +{
> > +     struct action_devres devres = {
> > +             .data = data,
> > +             .action = action,
> > +     };
> > +
> > +     WARN_ON(devres_release(dev, devm_action_release, devm_action_match,
> > +                            &devres));
>
> How does the WARN_ON help here? Are we going to start getting syzbot
> reports of this happening?

Hopefully, yes, if developers misuse the api they get a loud
notification similar to devm_remove_action() misuse.

> How can this fail?

It's a catch to make sure that @dev actually has a live devres
resource that can be found via @action and @data.



* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-14 19:04       ` Dan Williams
@ 2019-05-14 21:18         ` Jane Chu
  2019-05-16 16:45           ` Jane Chu
  0 siblings, 1 reply; 19+ messages in thread
From: Jane Chu @ 2019-05-14 21:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Andrew Morton, Rafael J. Wysocki,
	Greg Kroah-Hartman, linux-nvdimm, Linux Kernel Mailing List,
	Linux MM, Jérôme Glisse, Bjorn Helgaas,
	Christoph Hellwig

On 5/14/2019 12:04 PM, Dan Williams wrote:

> On Tue, May 14, 2019 at 11:53 AM Jane Chu <jane.chu@oracle.com> wrote:
>> On 5/13/2019 12:22 PM, Logan Gunthorpe wrote:
>>
>> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
>>
>> On 2019-05-07 5:55 p.m., Dan Williams wrote:
>>
>> Changes since v1 [1]:
>> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
>>
>> - Refresh the p2pdma patch headers to match the format of other p2pdma
>>     patches (Bjorn)
>>
>> - Collect Ira's reviewed-by
>>
>> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
>>
>> This series looks good to me:
>>
>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>>
>> However, I haven't tested it yet but I intend to later this week.
>>
>> I've tested libnvdimm-pending which includes this series on my setup and
>> everything works great.
>>
>> Just wondering: in a different scenario where pmem pages are exported to
>> a KVM guest, and then by mistake the user issues "ndctl destroy-namespace -f",
>> will the kernel wait indefinitely until the user figures out to kill the guest
>> and release the pmem pages?
> It depends on whether the pages are pinned. Typically DAX memory
> mappings assigned to a guest are not pinned in the host and can be
> invalidated at any time. The pinning only occurs with VFIO and
> device-assignment which isn't the common case, especially since that
> configuration is blocked by fsdax. However, with devdax, yes you can
> arrange for the system to go into an indefinite wait.
>
> This somewhat ties back to the get_user_pages() vs DAX debate. The
> indefinite stall issue with device-assignment could be addressed with
> a requirement to hold a lease and expect that a lease revocation event
> may escalate to SIGKILL in response to 'ndctl destroy-namespace'. The
> expectation with device-dax is that it is already a raw interface with
> pointy edges and caveats, but I would not be opposed to introducing a
> lease semantic.

Thanks for the quick response, Dan.

I am not convinced that the get_user_pages() vs FS-DAX dilemma is a perfect
analogy to the "ndctl destroy-namespace -f" vs namespace-is-busy dilemma.

Others might disagree with me, but I thought there is no risk of panic
if we make "ndctl destroy-namespace -f" fail in order to honor a clean
shutdown of the user application. Also, both actions are on the same host,
so in theory the admin could shut down the application before attempting
a destructive action.

By allowing 'opposite' actions to compete with each other at fine
granularity, there is potential for panic in general, not necessarily
only with pinned pages I guess.  I just ran an experiment and panic'd the system.

So, given that Optane DCPMEM is generally for server/cloud deployment, and
that RAS takes priority over administrative commands on servers, allowing
a namespace management command to panic the kernel is not an option, is it?

Here is my stress experiment -

Start out with ./create_nm.sh to create as many 48G devdax namespaces
as possible. Once that's completed, fire up 6 actions in quick
succession in the order below:
   -> ndctl destroy-namespace all -f
   -> ./create_nm.sh
   -> ndctl destroy-namespace all -f
   -> ./create_nm.sh
   -> ndctl destroy-namespace all -f
   -> ./create_nm.sh

==========  console message =======
Kernel 5.1.0-rc7-next-20190501-libnvdimm-pending on an x86_64

ban25uut130 login: [ 1620.866813] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 1620.874585] #PF: supervisor read access in kernel mode
[ 1620.880319] #PF: error_code(0x0000) - not-present page
[ 1620.886052] PGD 0 P4D 0
[ 1620.888879] Oops: 0000 [#1] SMP NOPTI
[ 1620.892964] CPU: 19 PID: 5611 Comm: kworker/u130:3 Tainted: G        W         5.1.0-rc7-next-20190501-libnvdimm-pending #5
[ 1620.905389] Hardware name: Oracle Corporation ORACLE SERVER X8-2L/ASM,MTHRBD,2U, BIOS 52020101 05/07/2019
[ 1620.916069] Workqueue: events_unbound async_run_entry_fn
[ 1620.921997] RIP: 0010:klist_put+0x1b/0x6c
[ 1620.926471] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83 e4 fe 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74 10 48 8b 03 a8 01 74 02
[ 1620.947427] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246
[ 1620.953258] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX: 0000000000000000
[ 1620.961223] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI: 0000000000000000
[ 1620.969185] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09: ffffffffbb54b3c0
[ 1620.977150] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12: 0000000000000000
[ 1620.985116] R13: ffff94d18dcfd000 R14: 0000000000000001 R15: ffff955000caf140
[ 1620.993081] FS:  0000000000000000(0000) GS:ffff95679f4c0000(0000) knlGS:0000000000000000
[ 1621.002113] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1621.008524] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4: 00000000007606e0
[ 1621.016487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1621.024450] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1621.032413] PKRU: 55555554
[ 1621.035433] Call Trace:
[ 1621.038161]  klist_del+0xe/0x10
[ 1621.041667]  device_del+0x8a/0x2c9
[ 1621.045463]  ? __switch_to_asm+0x34/0x70
[ 1621.049840]  ? __switch_to_asm+0x40/0x70
[ 1621.054220]  device_unregister+0x44/0x4f
[ 1621.058603]  nd_async_device_unregister+0x22/0x2d [libnvdimm]
[ 1621.065016]  async_run_entry_fn+0x47/0x15a
[ 1621.069588]  process_one_work+0x1a2/0x2eb
[ 1621.074064]  worker_thread+0x1b8/0x26e
[ 1621.078239]  ? cancel_delayed_work_sync+0x15/0x15
[ 1621.083490]  kthread+0xf8/0xfd
[ 1621.086897]  ? kthread_destroy_worker+0x45/0x45
[ 1621.091954]  ret_from_fork+0x1f/0x40
[ 1621.095944] Modules linked in: xt_REDIRECT xt_nat xt_CHECKSUM iptable_mangle xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter iptable_filter scsi_transport_iscsi ip6table_nat ip6_tables iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vfat fat skx_edac intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel ipmi_si crypto_simd cryptd glue_helper ipmi_devintf ipmi_msghandler sg pcspkr dax_pmem_compat device_dax dax_pmem_core i2c_i801 pcc_cpufreq lpc_ich ioatdma wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c nd_pmem nd_btt sr_mod cdrom sd_mod mgag200 drm_kms_helper syscopyarea crc32c_intel sysfillrect sysimgblt fb_sys_fops ttm megaraid_sas drm igb ahci libahci ptp libata pps_core dca i2c_algo_bit nfit libnvdimm uas usb_storage dm_mirror dm_region_hash dm_log dm_mod
[ 1621.189449] CR2: 0000000000000020
[ 1621.193169] ---[ end trace 7c3f7029ef24aa5a ]---
[ 1621.305383] RIP: 0010:klist_put+0x1b/0x6c
[ 1621.309860] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83 e4 fe 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74 10 48 8b 03 a8 01 74 02
[ 1621.330809] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246
[ 1621.336642] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX: 0000000000000000
[ 1621.344606] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI: 0000000000000000
[ 1621.352570] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09: ffffffffbb54b3c0
[ 1621.360533] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12: 0000000000000000
[ 1621.368496] R13: ffff94d18dcfd000 R14: 0000000000000001 R15: ffff955000caf140
[ 1621.376460] FS:  0000000000000000(0000) GS:ffff95679f4c0000(0000) knlGS:0000000000000000
[ 1621.385490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1621.391902] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4: 00000000007606e0
[ 1621.399867] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1621.407830] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1621.415793] PKRU: 55555554
[ 1621.418814] Kernel panic - not syncing: Fatal exception
[ 1621.424740] Kernel Offset: 0x39000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1621.550711] ---[ end Kernel panic - not syncing: Fatal exception ]---


Thanks!
-jane


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-14 21:18         ` Jane Chu
@ 2019-05-16 16:45           ` Jane Chu
  2019-05-16 21:42             ` jane.chu
  2019-05-16 21:51             ` Dan Williams
  0 siblings, 2 replies; 19+ messages in thread
From: Jane Chu @ 2019-05-16 16:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux MM, Jérôme Glisse,
	Bjorn Helgaas, Andrew Morton, Christoph Hellwig


Hi,

I'm able to reproduce the panic below by running two sets of ndctl
commands that actually serve legitimate purposes in parallel (unlike
the brute-force experiment earlier), each set in an indefinite loop.
This time it takes about an hour to panic.  But I gather the cause
is probably the same: I've overlapped ndctl commands on the same
region.

Could we add a check in nd_ioctl() such that, if there is an ongoing
ndctl command on a region, a subsequent ndctl request will fail
immediately with something to the effect of -EAGAIN? The rationale
being that the kernel should protect itself against user mistakes.
A minimal sketch of the idea is below.
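
Something along these lines, perhaps -- where ND_REGION_CMD_BUSY, the
cmd_flags field, and the two helper functions are made-up names for
illustration, not existing libnvdimm code:

#include <linux/bitops.h>
#include <linux/errno.h>
#include <linux/fs.h>

#define ND_REGION_CMD_BUSY	0	/* hypothetical flag bit */

static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	/* hypothetical lookup of the region this command targets */
	struct nd_region *nd_region = nd_region_from_file(file);
	long rc;

	/* reject a command that overlaps one already in flight */
	if (test_and_set_bit(ND_REGION_CMD_BUSY, &nd_region->cmd_flags))
		return -EAGAIN;

	rc = nd_do_ioctl(file, cmd, arg);	/* the real work, unchanged */

	clear_bit(ND_REGION_CMD_BUSY, &nd_region->cmd_flags);
	return rc;
}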

Also, since the subject fix is for a different problem, and has been
verified, I'm happy to see it go upstream, so we have a better
code base to dig deeper into how the destructive ndctl
commands interact with typical mission-critical applications, including
but not limited to RDMA.

thanks,
-jane

On 5/14/2019 2:18 PM, Jane Chu wrote:
> [.. full quote of the 5/14 exchange, including the console trace,
> snipped; identical to the messages above ..]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-16 16:45           ` Jane Chu
@ 2019-05-16 21:42             ` jane.chu
  2019-05-16 21:51             ` Dan Williams
  1 sibling, 0 replies; 19+ messages in thread
From: jane.chu @ 2019-05-16 21:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Rafael J. Wysocki, Greg Kroah-Hartman, linux-nvdimm,
	Linux Kernel Mailing List, Linux MM, Jérôme Glisse,
	Bjorn Helgaas, Andrew Morton, Christoph Hellwig

Apologies for resending in plain text.
-jane

On 5/16/19 9:45 AM, Jane Chu wrote:
> [.. full quote of the 16:45 message snipped; identical to the
> message above ..]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-16 16:45           ` Jane Chu
  2019-05-16 21:42             ` jane.chu
@ 2019-05-16 21:51             ` Dan Williams
  2019-05-17  0:01               ` Jane Chu
  1 sibling, 1 reply; 19+ messages in thread
From: Dan Williams @ 2019-05-16 21:51 UTC (permalink / raw)
  To: Jane Chu
  Cc: linux-nvdimm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux MM, Jérôme Glisse,
	Bjorn Helgaas, Andrew Morton, Christoph Hellwig

On Thu, May 16, 2019 at 9:45 AM Jane Chu <jane.chu@oracle.com> wrote:
>
> Hi,
>
> I'm able to reproduce the panic below by running two sets of ndctl
> commands that actually serve legitimate purposes in parallel (unlike
> the brute-force experiment earlier), each set in an indefinite loop.
> This time it takes about an hour to panic.  But I gather the cause
> is probably the same: I've overlapped ndctl commands on the same
> region.
>
> Could we add a check in nd_ioctl() such that, if there is an ongoing
> ndctl command on a region, a subsequent ndctl request will fail
> immediately with something to the effect of -EAGAIN? The rationale
> being that the kernel should protect itself against user mistakes.

We do already have locking in the driver to prevent configuration
collisions. The problem looks to be broken assumptions about running
the device unregistration path in a separate thread outside the lock.
I suspect there may be incorrect assumptions about the userspace
visibility of the device relative to teardown actions. To be clear,
this isn't the nd_ioctl() path; this is the sysfs path. A simplified
sketch of the suspect pattern follows.
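
Roughly, the pattern under suspicion looks like this (a simplified
sketch, not the exact libnvdimm code):

#include <linux/async.h>
#include <linux/device.h>

static ASYNC_DOMAIN_EXCLUSIVE(nd_async_domain);

static void nd_async_device_unregister(void *d, async_cookie_t cookie)
{
	struct device *dev = d;

	/* runs later, in a workqueue thread, with no bus lock held */
	device_unregister(dev);
	put_device(dev);
}

static void example_unregister(struct device *dev)
{
	nvdimm_bus_lock(dev);
	/* ... detach the device from region state under the lock ... */
	nvdimm_bus_unlock(dev);

	/*
	 * The device_del() happens later, outside the lock, so a racing
	 * namespace creation can still find the half-torn-down device.
	 */
	async_schedule_domain(nd_async_device_unregister, dev,
			&nd_async_domain);
}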


> Also, since the subject fix is for a different problem, and has been
> verified, I'm happy to see it go upstream, so we have a better
> code base to dig deeper into how the destructive ndctl
> commands interact with typical mission-critical applications, including
> but not limited to RDMA.

Right, the crash signature you are seeing looks unrelated to the issue
being addressed in these patches, which is device teardown racing active
page pins. I'll start the investigation on the crash signature, but
again I don't think it reads on this fix series. For reference, the
shape of the fix is sketched below.
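
Reduced to a sketch, the series makes the release path wait out any
straggling page references before tearing the mapping down (struct
example_pmem and its fields are illustrative, not the exact upstream
structures):

#include <linux/completion.h>
#include <linux/percpu-refcount.h>

struct example_pmem {
	struct percpu_ref ref;		/* one count per page pin */
	struct completion done;		/* fired by the final page put */
};

static void example_ref_release(struct percpu_ref *ref)
{
	struct example_pmem *pmem = container_of(ref, struct example_pmem, ref);

	complete(&pmem->done);		/* the last page reference is gone */
}

static void example_pages_release(struct example_pmem *pmem)
{
	percpu_ref_kill(&pmem->ref);		/* refuse new pins */
	wait_for_completion(&pmem->done);	/* wait out stragglers */
	percpu_ref_exit(&pmem->ref);
	/* only now is it safe to proceed to arch_remove_memory() */
}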


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-16 21:51             ` Dan Williams
@ 2019-05-17  0:01               ` Jane Chu
  0 siblings, 0 replies; 19+ messages in thread
From: Jane Chu @ 2019-05-17  0:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-nvdimm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux MM, Jérôme Glisse,
	Bjorn Helgaas, Andrew Morton, Christoph Hellwig

On 5/16/2019 2:51 PM, Dan Williams wrote:

> On Thu, May 16, 2019 at 9:45 AM Jane Chu <jane.chu@oracle.com> wrote:
>> Hi,
>>
>> I'm able to reproduce the panic below by running two sets of ndctl
>> commands that actually serve legitimate purposes in parallel (unlike
>> the brute-force experiment earlier), each set in an indefinite loop.
>> This time it takes about an hour to panic.  But I gather the cause
>> is probably the same: I've overlapped ndctl commands on the same
>> region.
>>
>> Could we add a check in nd_ioctl() such that, if there is an ongoing
>> ndctl command on a region, a subsequent ndctl request will fail
>> immediately with something to the effect of -EAGAIN? The rationale
>> being that the kernel should protect itself against user mistakes.
> We do already have locking in the driver to prevent configuration
> collisions. The problem looks to be broken assumptions about running
> the device unregistration path in a separate thread outside the lock.
> I suspect there may be incorrect assumptions about the userspace
> visibility of the device relative to teardown actions. To be clear,
> this isn't the nd_ioctl() path; this is the sysfs path.

I see, thanks!

>
>> Also, since the subject fix is for a different problem, and has been
>> verified, I'm happy to see it go upstream, so we have a better
>> code base to dig deeper into how the destructive ndctl
>> commands interact with typical mission-critical applications, including
>> but not limited to RDMA.
> Right, the crash signature you are seeing looks unrelated to the issue
> being addressed in these patches, which is device teardown racing active
> page pins. I'll start the investigation on the crash signature, but
> again I don't think it reads on this fix series.

Agreed on investigating the crash as a separate issue; looking forward
to seeing this patchset upstream.

Thanks!
-jane


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
  2019-05-13 19:22   ` Logan Gunthorpe
  2019-05-14 18:51     ` Jane Chu
@ 2019-05-31  4:17     ` Dan Williams
  1 sibling, 0 replies; 19+ messages in thread
From: Dan Williams @ 2019-05-31  4:17 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, Ira Weiny, Bjorn Helgaas, Christoph Hellwig,
	Jérôme Glisse, Greg Kroah-Hartman, Rafael J. Wysocki,
	Linux Kernel Mailing List, linux-nvdimm, Linux MM

On Mon, May 13, 2019 at 12:22 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:
> >
> >
> > On 2019-05-07 5:55 p.m., Dan Williams wrote:
> >> Changes since v1 [1]:
> >> - Fix a NULL-pointer deref crash in pci_p2pdma_release() (Logan)
> >>
> >> - Refresh the p2pdma patch headers to match the format of other p2pdma
> >>    patches (Bjorn)
> >>
> >> - Collect Ira's reviewed-by
> >>
> >> [1]: https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@dwillia2-desk3.amr.corp.intel.com/
> >
> > This series looks good to me:
> >
> > Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> >
> > However, I haven't tested it yet, but I intend to later this week.
>
> I've tested libnvdimm-pending which includes this series on my setup and
> everything works great.

Hi Andrew,

With this tested-by, can we move forward on this fix set? I'm not aware
of any other remaining comments. Greg had a question about
"drivers/base/devres: Introduce devm_release_action()" that I
answered, but otherwise the feedback has gone silent.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-05-31  4:17 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-07 23:55 [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Dan Williams
2019-05-07 23:56 ` [PATCH v2 1/6] drivers/base/devres: Introduce devm_release_action() Dan Williams
2019-05-14 19:12   ` Greg Kroah-Hartman
2019-05-14 19:24     ` Dan Williams
2019-05-07 23:56 ` [PATCH v2 2/6] mm/devm_memremap_pages: Introduce devm_memunmap_pages Dan Williams
2019-05-07 23:56 ` [PATCH v2 3/6] PCI/P2PDMA: Fix the gen_pool_add_virt() failure path Dan Williams
2019-05-07 23:56 ` [PATCH v2 4/6] lib/genalloc: Introduce chunk owners Dan Williams
2019-05-07 23:56 ` [PATCH v2 5/6] PCI/P2PDMA: Track pgmap references per resource, not globally Dan Williams
2019-05-07 23:56 ` [PATCH v2 6/6] mm/devm_memremap_pages: Fix final page put race Dan Williams
2019-05-08 17:05 ` [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race Logan Gunthorpe
2019-05-13 19:22   ` Logan Gunthorpe
2019-05-14 18:51     ` Jane Chu
2019-05-14 19:04       ` Dan Williams
2019-05-14 21:18         ` Jane Chu
2019-05-16 16:45           ` Jane Chu
2019-05-16 21:42             ` jane.chu
2019-05-16 21:51             ` Dan Williams
2019-05-17  0:01               ` Jane Chu
2019-05-31  4:17     ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).