linux-kernel.vger.kernel.org archive mirror
* [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only
@ 2018-11-20 23:12 Dan Williams
  2018-11-20 23:12 ` [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Dan Williams @ 2018-11-20 23:12 UTC (permalink / raw)
  To: akpm
  Cc: stable, Balbir Singh, Logan Gunthorpe, Christoph Hellwig,
	Jérôme Glisse, Michal Hocko, torvalds, linux-mm, linux-kernel,
	dri-devel

Changes since v7 [1]:
* Rebase on next-20181119

[1]: https://lkml.org/lkml/2018/10/12/878

---

At Maintainer Summit, Greg brought up a topic I proposed around
EXPORT_SYMBOL_GPL usage. The motivation was to consider when
EXPORT_SYMBOL_GPL is warranted and what the criteria are for taking the
exceptional step of reclassifying an existing export. Specifically, I
wanted to make the case that although the line is fuzzy and hard to
specify in abstract terms, it is nonetheless clear that
devm_memremap_pages() and HMM (Heterogeneous Memory Management) have
crossed it. The devm_memremap_pages() facility should have been
EXPORT_SYMBOL_GPL from the beginning, and HMM as a derivative of that
functionality should have naturally picked up that designation as well.

Contrary to typical rules, the HMM infrastructure was merged upstream
with zero in-tree consumers. There was a promise at the time that those
users would be merged "soon", but it has been over a year with no drivers
arriving. While the Nouveau driver is about to belatedly make good on
that promise, it is clear that HMM was targeted first and foremost at an
out-of-tree consumer.

HMM is derived from devm_memremap_pages(), a facility Christoph and I
spearheaded to support persistent memory. It combines a device lifetime
model with a dynamically created 'struct page' / memmap array for any
physical address range. It enables coordination and control of the many
code paths in the kernel built to interact with memory via 'struct page'
objects. With HMM the integration goes even deeper by allowing device
drivers to hook and manipulate page fault and page free events.

One interpretation of when EXPORT_SYMBOL is suitable is when it is
exporting stable and generic leaf functionality.  The
devm_memremap_pages() facility continues to see expanding use cases,
peer-to-peer DMA being the most recent, with no clear end date when it
will stop attracting reworks and semantic changes. It is not suitable to
export devm_memremap_pages() as a stable 3rd party driver API due to the
fact that it is still changing and manipulates core behavior. Moreover,
it is not in the best interest of the long term development of the core
memory management subsystem to permit any external driver to effectively
define its own system-wide memory management policies with no
encouragement to engage with upstream.

I am also concerned that HMM was designed in a way to minimize further
engagement with the core-MM. With these hooks in place, device drivers
are free to implement their own policies without much consideration for
whether and how the core-MM could grow to meet that need. Going forward,
not only should HMM be EXPORT_SYMBOL_GPL, but the
core-MM should be allowed the opportunity and stimulus to change and
address these new use cases as first class functionality.

There is some more detailed justification in the individual changelogs.
The 0day infrastructure has reported build success on 102 configs and
this survives the libnvdimm unit test suite. Setting aside the
controversial aspect, the diffstat is compelling at:

	7 files changed, 126 insertions(+), 323 deletions(-)

---

Dan Williams (7):
      mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL
      mm, devm_memremap_pages: Kill mapping "System RAM" support
      mm, devm_memremap_pages: Fix shutdown handling
      mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support
      mm, hmm: Use devm semantics for hmm_devmem_{add,remove}
      mm, hmm: Replace hmm_devmem_pages_create() with devm_memremap_pages()
      mm, hmm: Mark hmm_devmem_{add,add_resource} EXPORT_SYMBOL_GPL


 drivers/dax/pmem.c                |   14 --
 drivers/nvdimm/pmem.c             |   13 +-
 include/linux/hmm.h               |    4 
 include/linux/memremap.h          |    2 
 kernel/memremap.c                 |   94 +++++++----
 mm/hmm.c                          |  305 +++++--------------------------------
 tools/testing/nvdimm/test/iomap.c |   17 ++
 7 files changed, 126 insertions(+), 323 deletions(-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
@ 2018-11-20 23:12 ` Dan Williams
  2018-11-22 13:30   ` Michal Hocko
  2018-11-20 23:13 ` [PATCH v8 2/7] mm, devm_memremap_pages: Kill mapping "System RAM" support Dan Williams
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-11-20 23:12 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Jérôme Glisse, Christoph Hellwig,
	torvalds, linux-mm, linux-kernel, dri-devel

devm_memremap_pages() is a facility that can create struct page entries
for any arbitrary range and give drivers the ability to subvert core
aspects of page management.

Specifically the facility is tightly integrated with the kernel's memory
hotplug functionality. It injects an altmap argument deep into the
architecture specific vmemmap implementation to allow allocating from
specific reserved pages, and it has Linux specific assumptions about
page structure reference counting relative to get_user_pages() and
get_user_pages_fast(). It was an oversight and a mistake that this was
not marked EXPORT_SYMBOL_GPL from the outset.

Again, devm_memremap_pages() exposes and relies upon core kernel
internal assumptions and will continue to evolve along with 'struct
page', memory hotplug, and support for new memory types / topologies.
Only an in-kernel GPL-only driver is expected to keep up with this
ongoing evolution. This interface, and functionality derived from this
interface, is not suitable for kernel-external drivers.

Cc: Michal Hocko <mhocko@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/memremap.c                 |    2 +-
 tools/testing/nvdimm/test/iomap.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 9eced2cc9f94..61dbcaa95530 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -233,7 +233,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
  err_array:
 	return ERR_PTR(error);
 }
-EXPORT_SYMBOL(devm_memremap_pages);
+EXPORT_SYMBOL_GPL(devm_memremap_pages);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index ff9d3a5825e1..ed18a0cbc0c8 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -113,7 +113,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 		return nfit_res->buf + offset - nfit_res->res.start;
 	return devm_memremap_pages(dev, pgmap);
 }
-EXPORT_SYMBOL(__wrap_devm_memremap_pages);
+EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);
 
 pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags)
 {



* [PATCH v8 2/7] mm, devm_memremap_pages: Kill mapping "System RAM" support
  2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
  2018-11-20 23:12 ` [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
@ 2018-11-20 23:13 ` Dan Williams
  2018-11-20 23:13 ` [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling Dan Williams
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-11-20 23:13 UTC (permalink / raw)
  To: akpm
  Cc: Jérôme Glisse, Christoph Hellwig, Logan Gunthorpe,
	torvalds, linux-mm, linux-kernel, dri-devel

Given that devm_memremap_pages() requires a percpu_ref that is torn
down by devm_memremap_pages_release(), the current support for mapping
RAM is broken.

Support for remapping "System RAM" has been broken since the beginning
and there is no existing user of this code path, so just kill the
support and make it an explicit error.

This cleanup also simplifies a follow-on patch to fix the error path
when setting a devm release action for devm_memremap_pages_release()
fails.
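
As a rough userspace illustration of the new rule (not the kernel code
itself), the sketch below classifies a requested range against a single
RAM range and rejects everything that is not disjoint. The names
`struct range`, classify_against_ram() and check_remap_allowed() are
hypothetical, and the real region_intersects() walks the iomem resource
tree rather than comparing against one range.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the three outcomes used in the patch; values are arbitrary here. */
enum region_kind { REGION_DISJOINT, REGION_INTERSECTS, REGION_MIXED };

/* Hypothetical half-open range [start, start + size). */
struct range {
	uint64_t start;
	uint64_t size;
};

/* Simplified stand-in for region_intersects() against one RAM range. */
enum region_kind classify_against_ram(struct range req, struct range ram)
{
	uint64_t req_end = req.start + req.size;
	uint64_t ram_end = ram.start + ram.size;

	assert(req.size > 0 && ram.size > 0);

	if (req_end <= ram.start || req.start >= ram_end)
		return REGION_DISJOINT;		/* no overlap with RAM */
	if (req.start >= ram.start && req_end <= ram_end)
		return REGION_INTERSECTS;	/* entirely RAM */
	return REGION_MIXED;			/* partly RAM, partly not */
}

/* After this patch, anything that touches RAM is an explicit error. */
int check_remap_allowed(struct range req, struct range ram)
{
	if (classify_against_ram(req, ram) != REGION_DISJOINT)
		return -ENXIO;
	return 0;
}
```

Before the patch only the mixed case failed; the fully-RAM case returned
a linear-map address without ever setting up the release action.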

Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/memremap.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 61dbcaa95530..99d14940acfa 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -167,15 +167,12 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	is_ram = region_intersects(align_start, align_size,
 		IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
 
-	if (is_ram == REGION_MIXED) {
-		WARN_ONCE(1, "%s attempted on mixed region %pr\n",
-				__func__, res);
+	if (is_ram != REGION_DISJOINT) {
+		WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
+				is_ram == REGION_MIXED ? "mixed" : "ram", res);
 		return ERR_PTR(-ENXIO);
 	}
 
-	if (is_ram == REGION_INTERSECTS)
-		return __va(res->start);
-
 	if (!pgmap->ref)
 		return ERR_PTR(-EINVAL);
 



* [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
  2018-11-20 23:12 ` [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
  2018-11-20 23:13 ` [PATCH v8 2/7] mm, devm_memremap_pages: Kill mapping "System RAM" support Dan Williams
@ 2018-11-20 23:13 ` Dan Williams
  2018-11-27 21:43   ` Logan Gunthorpe
  2018-11-20 23:13 ` [PATCH v8 4/7] mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support Dan Williams
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-11-20 23:13 UTC (permalink / raw)
  To: akpm
  Cc: stable, Jérôme Glisse, Logan Gunthorpe, Christoph Hellwig,
	torvalds, linux-mm, linux-kernel, dri-devel

The last step before devm_memremap_pages() returns success is to
allocate a release action, devm_memremap_pages_release(), to tear the
entire setup down. However, the result from devm_add_action() is not
checked.

Checking the error from devm_add_action() is not enough. The API
currently relies on the fact that the percpu_ref it is using is killed
by the time the devm_memremap_pages_release() is run. Rather than
continue this awkward situation, offload the responsibility of killing
the percpu_ref to devm_memremap_pages_release() directly. This allows
devm_memremap_pages() to do the right thing relative to init failures
and shutdown.

Without this change we could fail to register the teardown of
devm_memremap_pages(). The likelihood of hitting this failure is tiny as
small memory allocations almost always succeed. However, the impact of
the failure is large given any future reconfiguration, or
disable/enable, of an nvdimm namespace will fail forever as subsequent
calls to devm_memremap_pages() will fail to set up the pgmap_radix since
there will be stale entries for the physical address range.

An argument could be made to require that the ->kill() operation be set
in the @pgmap arg rather than passed in separately. However, it helps
code readability, tracking the lifetime of a given instance, to be able
to grep the kill routine directly at the devm_memremap_pages() call
site.
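
The ownership change can be sketched in plain userspace C. This is only
a mock under stated assumptions: `struct mock_ref` stands in for struct
percpu_ref, `struct mock_pagemap` for struct dev_pagemap, and the
`inject_failure` parameter is a test knob that does not exist in the
real interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct percpu_ref: only tracks live/dead. */
struct mock_ref {
	bool dead;
};

/* Stand-in for struct dev_pagemap with the new ->kill() member. */
struct mock_pagemap {
	struct mock_ref *ref;
	void (*kill)(struct mock_ref *ref);
	bool mapped;
};

/* Driver-supplied callback: transition the reference to the dead state. */
void mock_kill(struct mock_ref *ref)
{
	ref->dead = true;
}

/* Teardown kills the reference first, then undoes the mapping. */
void mock_release(struct mock_pagemap *pgmap)
{
	pgmap->kill(pgmap->ref);
	pgmap->mapped = false;
}

/*
 * Setup validates the callback up front. Once past validation, every
 * failure path kills the reference, so the caller has nothing to unwind.
 */
int mock_remap(struct mock_pagemap *pgmap, bool inject_failure)
{
	if (!pgmap->ref || !pgmap->kill)
		return -EINVAL;
	assert(!pgmap->ref->dead);	/* ref must be live on entry */

	if (inject_failure) {
		pgmap->kill(pgmap->ref);
		return -ENXIO;
	}
	pgmap->mapped = true;
	return 0;
}
```

A caller sets both ref and kill before calling the setup routine and,
on error, simply returns; the reference is already dead in that case.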

Cc: <stable@vger.kernel.org>
Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/pmem.c                |   14 +++-----------
 drivers/nvdimm/pmem.c             |   13 +++++--------
 include/linux/memremap.h          |    2 ++
 kernel/memremap.c                 |   30 ++++++++++++++----------------
 tools/testing/nvdimm/test/iomap.c |   15 ++++++++++++++-
 5 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 99e2aace8078..2c1f459c0c63 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -48,9 +48,8 @@ static void dax_pmem_percpu_exit(void *data)
 	percpu_ref_exit(ref);
 }
 
-static void dax_pmem_percpu_kill(void *data)
+static void dax_pmem_percpu_kill(struct percpu_ref *ref)
 {
-	struct percpu_ref *ref = data;
 	struct dax_pmem *dax_pmem = to_dax_pmem(ref);
 
 	dev_dbg(dax_pmem->dev, "trace\n");
@@ -112,17 +111,10 @@ static int dax_pmem_probe(struct device *dev)
 	}
 
 	dax_pmem->pgmap.ref = &dax_pmem->ref;
+	dax_pmem->pgmap.kill = dax_pmem_percpu_kill;
 	addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
-	if (IS_ERR(addr)) {
-		devm_remove_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
-		percpu_ref_exit(&dax_pmem->ref);
+	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-	}
-
-	rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill,
-							&dax_pmem->ref);
-	if (rc)
-		return rc;
 
 	/* adjust the dax_region resource to the start of data */
 	memcpy(&res, &dax_pmem->pgmap.res, sizeof(res));
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f7019294740c..bc2f700feef8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -309,8 +309,11 @@ static void pmem_release_queue(void *q)
 	blk_cleanup_queue(q);
 }
 
-static void pmem_freeze_queue(void *q)
+static void pmem_freeze_queue(struct percpu_ref *ref)
 {
+	struct request_queue *q;
+
+	q = container_of(ref, typeof(*q), q_usage_counter);
 	blk_freeze_queue_start(q);
 }
 
@@ -402,6 +405,7 @@ static int pmem_attach_disk(struct device *dev,
 
 	pmem->pfn_flags = PFN_DEV;
 	pmem->pgmap.ref = &q->q_usage_counter;
+	pmem->pgmap.kill = pmem_freeze_queue;
 	if (is_nd_pfn(dev)) {
 		if (setup_pagemap_fsdax(dev, &pmem->pgmap))
 			return -ENOMEM;
@@ -427,13 +431,6 @@ static int pmem_attach_disk(struct device *dev,
 		memcpy(&bb_res, &nsio->res, sizeof(bb_res));
 	}
 
-	/*
-	 * At release time the queue must be frozen before
-	 * devm_memremap_pages is unwound
-	 */
-	if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
-		return -ENOMEM;
-
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 	pmem->virt_addr = addr;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 0ac69ddf5fc4..55db66b3716f 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -111,6 +111,7 @@ typedef void (*dev_page_free_t)(struct page *page, void *data);
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
+ * @kill: callback to transition @ref to the dead state
  * @dev: host device of the mapping for debug
  * @data: private data pointer for page_free()
  * @type: memory type: see MEMORY_* in memory_hotplug.h
@@ -122,6 +123,7 @@ struct dev_pagemap {
 	bool altmap_valid;
 	struct resource res;
 	struct percpu_ref *ref;
+	void (*kill)(struct percpu_ref *ref);
 	struct device *dev;
 	void *data;
 	enum memory_type type;
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 99d14940acfa..5e45f0c327a5 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -88,14 +88,10 @@ static void devm_memremap_pages_release(void *data)
 	resource_size_t align_start, align_size;
 	unsigned long pfn;
 
+	pgmap->kill(pgmap->ref);
 	for_each_device_pfn(pfn, pgmap)
 		put_page(pfn_to_page(pfn));
 
-	if (percpu_ref_tryget_live(pgmap->ref)) {
-		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
-		percpu_ref_put(pgmap->ref);
-	}
-
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
@@ -116,7 +112,7 @@ static void devm_memremap_pages_release(void *data)
 /**
  * devm_memremap_pages - remap and provide memmap backing for the given resource
  * @dev: hosting device for @res
- * @pgmap: pointer to a struct dev_pgmap
+ * @pgmap: pointer to a struct dev_pagemap
  *
  * Notes:
  * 1/ At a minimum the res, ref and type members of @pgmap must be initialized
@@ -125,11 +121,8 @@ static void devm_memremap_pages_release(void *data)
  * 2/ The altmap field may optionally be initialized, in which case altmap_valid
  *    must be set to true
  *
- * 3/ pgmap.ref must be 'live' on entry and 'dead' before devm_memunmap_pages()
- *    time (or devm release event). The expected order of events is that ref has
- *    been through percpu_ref_kill() before devm_memremap_pages_release(). The
- *    wait for the completion of all references being dropped and
- *    percpu_ref_exit() must occur after devm_memremap_pages_release().
+ * 3/ pgmap->ref must be 'live' on entry and will be killed at
+ *    devm_memremap_pages_release() time, or if this routine fails.
  *
  * 4/ res is expected to be a host memory range that could feasibly be
  *    treated as a "System RAM" range, i.e. not a device mmio range, but
@@ -145,6 +138,9 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	pgprot_t pgprot = PAGE_KERNEL;
 	int error, nid, is_ram;
 
+	if (!pgmap->ref || !pgmap->kill)
+		return ERR_PTR(-EINVAL);
+
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
 		- align_start;
@@ -170,12 +166,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	if (is_ram != REGION_DISJOINT) {
 		WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
 				is_ram == REGION_MIXED ? "mixed" : "ram", res);
-		return ERR_PTR(-ENXIO);
+		error = -ENXIO;
+		goto err_array;
 	}
 
-	if (!pgmap->ref)
-		return ERR_PTR(-EINVAL);
-
 	pgmap->dev = dev;
 
 	error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(res->start),
@@ -217,7 +211,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 				align_size >> PAGE_SHIFT, pgmap);
 	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
 
-	devm_add_action(dev, devm_memremap_pages_release, pgmap);
+	error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
+			pgmap);
+	if (error)
+		return ERR_PTR(error);
 
 	return __va(res->start);
 
@@ -228,6 +225,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
  err_pfn_remap:
 	pgmap_array_delete(res);
  err_array:
+	pgmap->kill(pgmap->ref);
 	return ERR_PTR(error);
 }
 EXPORT_SYMBOL_GPL(devm_memremap_pages);
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index ed18a0cbc0c8..c6635fee27d8 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -104,13 +104,26 @@ void *__wrap_devm_memremap(struct device *dev, resource_size_t offset,
 }
 EXPORT_SYMBOL(__wrap_devm_memremap);
 
+static void nfit_test_kill(void *_pgmap)
+{
+	struct dev_pagemap *pgmap = _pgmap;
+
+	pgmap->kill(pgmap->ref);
+}
+
 void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 {
 	resource_size_t offset = pgmap->res.start;
 	struct nfit_test_resource *nfit_res = get_nfit_res(offset);
 
-	if (nfit_res)
+	if (nfit_res) {
+		int rc;
+
+		rc = devm_add_action_or_reset(dev, nfit_test_kill, pgmap);
+		if (rc)
+			return ERR_PTR(rc);
 		return nfit_res->buf + offset - nfit_res->res.start;
+	}
 	return devm_memremap_pages(dev, pgmap);
 }
 EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);



* [PATCH v8 4/7] mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support
  2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
                   ` (2 preceding siblings ...)
  2018-11-20 23:13 ` [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling Dan Williams
@ 2018-11-20 23:13 ` Dan Williams
  2018-11-23 10:48   ` David Hildenbrand
  2018-11-20 23:13 ` [PATCH v8 5/7] mm, hmm: Use devm semantics for hmm_devmem_{add, remove} Dan Williams
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-11-20 23:13 UTC (permalink / raw)
  To: akpm
  Cc: Jérôme Glisse, Christoph Hellwig, Logan Gunthorpe,
	torvalds, linux-mm, linux-kernel, dri-devel

In preparation for consolidating all ZONE_DEVICE enabling via
devm_memremap_pages(), teach it how to handle the constraints of
MEMORY_DEVICE_PRIVATE ranges.

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
[jglisse: call move_pfn_range_to_zone for MEMORY_DEVICE_PRIVATE]
Acked-by: Christoph Hellwig <hch@lst.de>
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/memremap.c |   53 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 41 insertions(+), 12 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 5e45f0c327a5..3eef989ef035 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -98,9 +98,15 @@ static void devm_memremap_pages_release(void *data)
 		- align_start;
 
 	mem_hotplug_begin();
-	arch_remove_memory(align_start, align_size, pgmap->altmap_valid ?
-			&pgmap->altmap : NULL);
-	kasan_remove_zero_shadow(__va(align_start), align_size);
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+		pfn = align_start >> PAGE_SHIFT;
+		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
+				align_size >> PAGE_SHIFT, NULL);
+	} else {
+		arch_remove_memory(align_start, align_size,
+				pgmap->altmap_valid ? &pgmap->altmap : NULL);
+		kasan_remove_zero_shadow(__va(align_start), align_size);
+	}
 	mem_hotplug_done();
 
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
@@ -187,17 +193,40 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 		goto err_pfn_remap;
 
 	mem_hotplug_begin();
-	error = kasan_add_zero_shadow(__va(align_start), align_size);
-	if (error) {
-		mem_hotplug_done();
-		goto err_kasan;
+
+	/*
+	 * For device private memory we call add_pages() as we only need to
+	 * allocate and initialize struct page for the device memory. More-
+	 * over the device memory is un-accessible thus we do not want to
+	 * create a linear mapping for the memory like arch_add_memory()
+	 * would do.
+	 *
+	 * For all other device memory types, which are accessible by
+	 * the CPU, we do want the linear mapping and thus use
+	 * arch_add_memory().
+	 */
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+		error = add_pages(nid, align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT, NULL, false);
+	} else {
+		error = kasan_add_zero_shadow(__va(align_start), align_size);
+		if (error) {
+			mem_hotplug_done();
+			goto err_kasan;
+		}
+
+		error = arch_add_memory(nid, align_start, align_size, altmap,
+				false);
+	}
+
+	if (!error) {
+		struct zone *zone;
+
+		zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
+		move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT, altmap);
 	}
 
-	error = arch_add_memory(nid, align_start, align_size, altmap, false);
-	if (!error)
-		move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
-					align_start >> PAGE_SHIFT,
-					align_size >> PAGE_SHIFT, altmap);
 	mem_hotplug_done();
 	if (error)
 		goto err_add_memory;



* [PATCH v8 5/7] mm, hmm: Use devm semantics for hmm_devmem_{add, remove}
  2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
                   ` (3 preceding siblings ...)
  2018-11-20 23:13 ` [PATCH v8 4/7] mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support Dan Williams
@ 2018-11-20 23:13 ` Dan Williams
  2018-11-20 23:13 ` [PATCH v8 6/7] mm, hmm: Replace hmm_devmem_pages_create() with devm_memremap_pages() Dan Williams
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-11-20 23:13 UTC (permalink / raw)
  To: akpm
  Cc: Christoph Hellwig, Jérôme Glisse, Logan Gunthorpe,
	torvalds, linux-mm, linux-kernel, dri-devel

devm semantics arrange for resources to be torn down when
device-driver-probe fails or when device-driver-release completes.
As with devm_memremap_pages(), there is no need to support an explicit
remove operation when users properly adhere to devm semantics.

Note that devm_kzalloc() automatically handles allocating node-local
memory.
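
For readers less familiar with devm, here is a minimal userspace mock
of the two properties this patch leans on: registered actions run in
reverse order at teardown, and the "_or_reset" variant runs the action
immediately when registration itself fails. All `mock_*` names and the
fixed-size action table are assumptions made for illustration; the real
devres core allocates entries dynamically.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_MAX_ACTIONS 4

typedef void (*mock_action_fn)(void *data);

/* Stand-in for a struct device carrying a stack of release actions. */
struct mock_device {
	struct {
		mock_action_fn fn;
		void *data;
	} actions[MOCK_MAX_ACTIONS];
	size_t nr_actions;
};

/*
 * Register an action. If registration fails (here: table full), run the
 * action right away so the caller does not need its own unwind path.
 */
int mock_add_action_or_reset(struct mock_device *dev, mock_action_fn fn,
			     void *data)
{
	assert(fn != NULL);

	if (dev->nr_actions == MOCK_MAX_ACTIONS) {
		fn(data);
		return -ENOMEM;
	}
	dev->actions[dev->nr_actions].fn = fn;
	dev->actions[dev->nr_actions].data = data;
	dev->nr_actions++;
	return 0;
}

/* Teardown: run the registered actions last-in, first-out. */
void mock_release_all(struct mock_device *dev)
{
	while (dev->nr_actions > 0) {
		dev->nr_actions--;
		dev->actions[dev->nr_actions].fn(
				dev->actions[dev->nr_actions].data);
	}
}

/* Example action: append a one-character tag so teardown order is visible. */
struct mock_trace {
	char buf[16];
	size_t len;
};

struct mock_step {
	struct mock_trace *trace;
	char tag;
};

void mock_record(void *data)
{
	struct mock_step *step = data;

	if (step->trace->len < sizeof(step->trace->buf) - 1)
		step->trace->buf[step->trace->len++] = step->tag;
}
```

Because teardown is last-in, first-out, the order in which a driver
registers its actions determines the order in which they are undone,
which is why no explicit remove routine is needed.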

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/hmm.h |    4 --
 mm/hmm.c            |  127 ++++++++++-----------------------------------------
 2 files changed, 25 insertions(+), 106 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index c6fb869a81c0..ed89fbc525d2 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -512,8 +512,7 @@ struct hmm_devmem {
  * enough and allocate struct page for it.
  *
  * The device driver can wrap the hmm_devmem struct inside a private device
- * driver struct. The device driver must call hmm_devmem_remove() before the
- * device goes away and before freeing the hmm_devmem struct memory.
+ * driver struct.
  */
 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 				  struct device *device,
@@ -521,7 +520,6 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 					   struct device *device,
 					   struct resource *res);
-void hmm_devmem_remove(struct hmm_devmem *devmem);
 
 /*
  * hmm_devmem_page_set_drvdata - set per-page driver data field
diff --git a/mm/hmm.c b/mm/hmm.c
index 90c34f3d1243..8510881e7b44 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -987,7 +987,6 @@ static void hmm_devmem_ref_exit(void *data)
 
 	devmem = container_of(ref, struct hmm_devmem, ref);
 	percpu_ref_exit(ref);
-	devm_remove_action(devmem->device, &hmm_devmem_ref_exit, data);
 }
 
 static void hmm_devmem_ref_kill(void *data)
@@ -998,7 +997,6 @@ static void hmm_devmem_ref_kill(void *data)
 	devmem = container_of(ref, struct hmm_devmem, ref);
 	percpu_ref_kill(ref);
 	wait_for_completion(&devmem->completion);
-	devm_remove_action(devmem->device, &hmm_devmem_ref_kill, data);
 }
 
 static int hmm_devmem_fault(struct vm_area_struct *vma,
@@ -1036,7 +1034,7 @@ static void hmm_devmem_radix_release(struct resource *resource)
 	mutex_unlock(&hmm_devmem_lock);
 }
 
-static void hmm_devmem_release(struct device *dev, void *data)
+static void hmm_devmem_release(void *data)
 {
 	struct hmm_devmem *devmem = data;
 	struct resource *resource = devmem->resource;
@@ -1044,11 +1042,6 @@ static void hmm_devmem_release(struct device *dev, void *data)
 	struct zone *zone;
 	struct page *page;
 
-	if (percpu_ref_tryget_live(&devmem->ref)) {
-		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
-		percpu_ref_put(&devmem->ref);
-	}
-
 	/* pages are dead and unused, undo the arch mapping */
 	start_pfn = (resource->start & ~(PA_SECTION_SIZE - 1)) >> PAGE_SHIFT;
 	npages = ALIGN(resource_size(resource), PA_SECTION_SIZE) >> PAGE_SHIFT;
@@ -1174,19 +1167,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	return ret;
 }
 
-static int hmm_devmem_match(struct device *dev, void *data, void *match_data)
-{
-	struct hmm_devmem *devmem = data;
-
-	return devmem->resource == match_data;
-}
-
-static void hmm_devmem_pages_remove(struct hmm_devmem *devmem)
-{
-	devres_release(devmem->device, &hmm_devmem_release,
-		       &hmm_devmem_match, devmem->resource);
-}
-
 /*
  * hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory
  *
@@ -1214,8 +1194,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 
 	dev_pagemap_get_ops();
 
-	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
-				   GFP_KERNEL, dev_to_node(device));
+	devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL);
 	if (!devmem)
 		return ERR_PTR(-ENOMEM);
 
@@ -1229,11 +1208,11 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
 			      0, GFP_KERNEL);
 	if (ret)
-		goto error_percpu_ref;
+		return ERR_PTR(ret);
 
-	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit, &devmem->ref);
 	if (ret)
-		goto error_devm_add_action;
+		return ERR_PTR(ret);
 
 	size = ALIGN(size, PA_SECTION_SIZE);
 	addr = min((unsigned long)iomem_resource.end,
@@ -1253,16 +1232,12 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 
 		devmem->resource = devm_request_mem_region(device, addr, size,
 							   dev_name(device));
-		if (!devmem->resource) {
-			ret = -ENOMEM;
-			goto error_no_resource;
-		}
+		if (!devmem->resource)
+			return ERR_PTR(-ENOMEM);
 		break;
 	}
-	if (!devmem->resource) {
-		ret = -ERANGE;
-		goto error_no_resource;
-	}
+	if (!devmem->resource)
+		return ERR_PTR(-ERANGE);
 
 	devmem->resource->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
 	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
@@ -1271,28 +1246,13 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 
 	ret = hmm_devmem_pages_create(devmem);
 	if (ret)
-		goto error_pages;
-
-	devres_add(device, devmem);
+		return ERR_PTR(ret);
 
-	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
-	if (ret) {
-		hmm_devmem_remove(devmem);
+	ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem);
+	if (ret)
 		return ERR_PTR(ret);
-	}
 
 	return devmem;
-
-error_pages:
-	devm_release_mem_region(device, devmem->resource->start,
-				resource_size(devmem->resource));
-error_no_resource:
-error_devm_add_action:
-	hmm_devmem_ref_kill(&devmem->ref);
-	hmm_devmem_ref_exit(&devmem->ref);
-error_percpu_ref:
-	devres_free(devmem);
-	return ERR_PTR(ret);
 }
 EXPORT_SYMBOL(hmm_devmem_add);
 
@@ -1308,8 +1268,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 
 	dev_pagemap_get_ops();
 
-	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
-				   GFP_KERNEL, dev_to_node(device));
+	devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL);
 	if (!devmem)
 		return ERR_PTR(-ENOMEM);
 
@@ -1323,12 +1282,12 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
 			      0, GFP_KERNEL);
 	if (ret)
-		goto error_percpu_ref;
+		return ERR_PTR(ret);
 
-	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit,
+			&devmem->ref);
 	if (ret)
-		goto error_devm_add_action;
-
+		return ERR_PTR(ret);
 
 	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
 	devmem->pfn_last = devmem->pfn_first +
@@ -1336,59 +1295,21 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 
 	ret = hmm_devmem_pages_create(devmem);
 	if (ret)
-		goto error_devm_add_action;
+		return ERR_PTR(ret);
 
-	devres_add(device, devmem);
+	ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem);
+	if (ret)
+		return ERR_PTR(ret);
 
-	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
-	if (ret) {
-		hmm_devmem_remove(devmem);
+	ret = devm_add_action_or_reset(device, hmm_devmem_ref_kill,
+			&devmem->ref);
+	if (ret)
 		return ERR_PTR(ret);
-	}
 
 	return devmem;
-
-error_devm_add_action:
-	hmm_devmem_ref_kill(&devmem->ref);
-	hmm_devmem_ref_exit(&devmem->ref);
-error_percpu_ref:
-	devres_free(devmem);
-	return ERR_PTR(ret);
 }
 EXPORT_SYMBOL(hmm_devmem_add_resource);
 
-/*
- * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
- *
- * @devmem: hmm_devmem struct use to track and manage the ZONE_DEVICE memory
- *
- * This will hot-unplug memory that was hotplugged by hmm_devmem_add on behalf
- * of the device driver. It will free struct page and remove the resource that
- * reserved the physical address range for this device memory.
- */
-void hmm_devmem_remove(struct hmm_devmem *devmem)
-{
-	resource_size_t start, size;
-	struct device *device;
-	bool cdm = false;
-
-	if (!devmem)
-		return;
-
-	device = devmem->device;
-	start = devmem->resource->start;
-	size = resource_size(devmem->resource);
-
-	cdm = devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY;
-	hmm_devmem_ref_kill(&devmem->ref);
-	hmm_devmem_ref_exit(&devmem->ref);
-	hmm_devmem_pages_remove(devmem);
-
-	if (!cdm)
-		devm_release_mem_region(device, start, size);
-}
-EXPORT_SYMBOL(hmm_devmem_remove);
-
 /*
  * A device driver that wants to handle multiple devices memory through a
  * single fake device can use hmm_device to do so. This is purely a helper
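
The error labels removed in the diff above are no longer needed because
devm_add_action_or_reset() runs the action immediately when it cannot
register it, so the caller can simply return the error. What follows is
only a rough userspace model of that behaviour, not kernel code: every
demo_* name, the fixed-size action array, and the fail_next_add test hook
are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_ACTIONS 4

struct demo_action {
	void (*fn)(void *data);
	void *data;
};

/* Hypothetical stand-in for the per-device list of release actions. */
struct demo_device {
	struct demo_action actions[DEMO_MAX_ACTIONS];
	size_t nr_actions;
	int fail_next_add;	/* test hook: simulate an allocation failure */
};

/* Register a release action; this may fail, like any allocation. */
static int demo_add_action(struct demo_device *dev, void (*fn)(void *),
			   void *data)
{
	if (dev->fail_next_add || dev->nr_actions == DEMO_MAX_ACTIONS) {
		dev->fail_next_add = 0;
		return -ENOMEM;
	}
	dev->actions[dev->nr_actions].fn = fn;
	dev->actions[dev->nr_actions].data = data;
	dev->nr_actions++;
	return 0;
}

/* On registration failure, run the action right away so nothing leaks. */
static int demo_add_action_or_reset(struct demo_device *dev,
				    void (*fn)(void *), void *data)
{
	int ret = demo_add_action(dev, fn, data);

	if (ret)
		fn(data);
	return ret;
}

/* Device teardown: run the registered actions in reverse order. */
static void demo_release_all(struct demo_device *dev)
{
	while (dev->nr_actions) {
		struct demo_action *a = &dev->actions[--dev->nr_actions];

		a->fn(a->data);
	}
}

/* Teardown action used by the demo: counts how often it was invoked. */
static void demo_teardown(void *data)
{
	++*(int *)data;
}

/* Returns how many times the teardown ran for one setup/release cycle. */
int demo_teardown_count(int fail_registration)
{
	struct demo_device dev = { .fail_next_add = fail_registration };
	int teardowns = 0;
	int ret = demo_add_action_or_reset(&dev, demo_teardown, &teardowns);

	assert((ret != 0) == (fail_registration != 0));
	demo_release_all(&dev);
	return teardowns;
}
```

In this model the teardown runs exactly once whether or not registration
succeeded, which is why the callers in the diff can return ERR_PTR(ret)
without any unwind labels.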



* [PATCH v8 6/7] mm, hmm: Replace hmm_devmem_pages_create() with devm_memremap_pages()
  2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
                   ` (4 preceding siblings ...)
  2018-11-20 23:13 ` [PATCH v8 5/7] mm, hmm: Use devm semantics for hmm_devmem_{add, remove} Dan Williams
@ 2018-11-20 23:13 ` Dan Williams
  2018-11-20 23:13 ` [PATCH v8 7/7] mm, hmm: Mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL Dan Williams
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-11-20 23:13 UTC (permalink / raw)
  To: akpm
  Cc: Christoph Hellwig, Jérôme Glisse, Balbir Singh,
	Logan Gunthorpe, torvalds, linux-mm, linux-kernel, dri-devel

Commit e8d513483300 "memremap: change devm_memremap_pages interface to
use struct dev_pagemap" refactored devm_memremap_pages() to allow a
dev_pagemap instance to be supplied. Passing in a dev_pagemap interface
simplifies the design of pgmap type drivers in that they can rely on
container_of() to lookup any private data associated with the given
dev_pagemap instance.
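
The container_of() lookup described above can be sketched in userspace C
roughly as follows. This is an illustration only: struct demo_pagemap,
struct demo_devmem, and the demo_* helpers are made-up stand-ins rather
than the kernel's dev_pagemap or hmm_devmem, and the macro is a simplified
version without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel macro (no type checking). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for the generic object the core code hands back. */
struct demo_pagemap {
	int type;
};

/* Hypothetical driver object that embeds the generic one by value. */
struct demo_devmem {
	int private_cookie;
	struct demo_pagemap pagemap;
};

/* Recover the enclosing driver object from a pointer to its member. */
static struct demo_devmem *demo_pgmap_to_devmem(struct demo_pagemap *pgmap)
{
	return demo_container_of(pgmap, struct demo_devmem, pagemap);
}

/* A callback that only receives the pagemap still finds its private data. */
int demo_lookup(void)
{
	struct demo_devmem devmem = { .private_cookie = 42 };
	struct demo_devmem *found = demo_pgmap_to_devmem(&devmem.pagemap);

	assert(found == &devmem);
	return found->private_cookie;
}
```

Because the generic object is embedded rather than pointed to, the driver
needs neither a separate lookup table nor an opaque data cookie to get
back to its own state.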

In addition to the cleanups this also gives hmm users multi-order-radix
improvements that arrived with commit ab1b597ee0e4 "mm,
devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups"

As part of the conversion to the devm_memremap_pages() method of
handling the percpu_ref relative to when pages are put, the percpu_ref
completion needs to move to hmm_devmem_ref_exit(). See commit
71389703839e ("mm, zone_device: Replace {get, put}_zone_device_page...")
for details.
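
The ordering change can be modelled in a few lines of userspace C. This is
a single-threaded sketch under stated assumptions, not the kernel
implementation: struct demo_ref and the demo_ref_*() helpers are
hypothetical, the completion is assumed to be signalled when the last
reference is dropped, and where the kernel would sleep in
wait_for_completion() the model returns -EBUSY instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a percpu_ref plus its completion. */
struct demo_ref {
	long count;	/* starts at 1: the initial reference */
	bool dead;	/* set by kill: no new live references */
	bool completed;	/* set when the last reference is dropped */
};

static void demo_ref_init(struct demo_ref *ref)
{
	ref->count = 1;
	ref->dead = false;
	ref->completed = false;
}

/* Taking a live reference fails once the ref has been killed. */
static bool demo_ref_tryget_live(struct demo_ref *ref)
{
	if (ref->dead)
		return false;
	ref->count++;
	return true;
}

/* Dropping the last reference signals the completion. */
static void demo_ref_put(struct demo_ref *ref)
{
	assert(ref->count > 0);
	if (--ref->count == 0)
		ref->completed = true;
}

/* Kill only marks the ref dead and drops the initial reference. */
static void demo_ref_kill(struct demo_ref *ref)
{
	ref->dead = true;
	demo_ref_put(ref);
}

/*
 * Exit is where the wait now lives. The kernel would block here; this
 * model reports -EBUSY while references are still outstanding.
 */
static int demo_ref_exit(struct demo_ref *ref)
{
	return ref->completed ? 0 : -EBUSY;
}

/* One page holds a reference across kill; exit succeeds only after the put. */
int demo_shutdown_sequence(void)
{
	struct demo_ref ref;
	bool got, got_after_kill;
	int early, late;

	demo_ref_init(&ref);
	got = demo_ref_tryget_live(&ref);		/* a page is in use */
	demo_ref_kill(&ref);
	got_after_kill = demo_ref_tryget_live(&ref);	/* must fail */
	early = demo_ref_exit(&ref);			/* page still held */
	demo_ref_put(&ref);				/* page is put */
	late = demo_ref_exit(&ref);

	return got && !got_after_kill && early == -EBUSY && late == 0;
}
```

In the model, kill returns immediately while a page reference is still
outstanding, and only the exit step depends on the last put having
happened.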

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/hmm.c |  196 ++++++++------------------------------------------------------
 1 file changed, 26 insertions(+), 170 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 8510881e7b44..bf2495d9de81 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -986,17 +986,16 @@ static void hmm_devmem_ref_exit(void *data)
 	struct hmm_devmem *devmem;
 
 	devmem = container_of(ref, struct hmm_devmem, ref);
+	wait_for_completion(&devmem->completion);
 	percpu_ref_exit(ref);
 }
 
-static void hmm_devmem_ref_kill(void *data)
+static void hmm_devmem_ref_kill(struct percpu_ref *ref)
 {
-	struct percpu_ref *ref = data;
 	struct hmm_devmem *devmem;
 
 	devmem = container_of(ref, struct hmm_devmem, ref);
 	percpu_ref_kill(ref);
-	wait_for_completion(&devmem->completion);
 }
 
 static int hmm_devmem_fault(struct vm_area_struct *vma,
@@ -1019,154 +1018,6 @@ static void hmm_devmem_free(struct page *page, void *data)
 	devmem->ops->free(devmem, page);
 }
 
-static DEFINE_MUTEX(hmm_devmem_lock);
-static RADIX_TREE(hmm_devmem_radix, GFP_KERNEL);
-
-static void hmm_devmem_radix_release(struct resource *resource)
-{
-	resource_size_t key;
-
-	mutex_lock(&hmm_devmem_lock);
-	for (key = resource->start;
-	     key <= resource->end;
-	     key += PA_SECTION_SIZE)
-		radix_tree_delete(&hmm_devmem_radix, key >> PA_SECTION_SHIFT);
-	mutex_unlock(&hmm_devmem_lock);
-}
-
-static void hmm_devmem_release(void *data)
-{
-	struct hmm_devmem *devmem = data;
-	struct resource *resource = devmem->resource;
-	unsigned long start_pfn, npages;
-	struct zone *zone;
-	struct page *page;
-
-	/* pages are dead and unused, undo the arch mapping */
-	start_pfn = (resource->start & ~(PA_SECTION_SIZE - 1)) >> PAGE_SHIFT;
-	npages = ALIGN(resource_size(resource), PA_SECTION_SIZE) >> PAGE_SHIFT;
-
-	page = pfn_to_page(start_pfn);
-	zone = page_zone(page);
-
-	mem_hotplug_begin();
-	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
-		__remove_pages(zone, start_pfn, npages, NULL);
-	else
-		arch_remove_memory(start_pfn << PAGE_SHIFT,
-				   npages << PAGE_SHIFT, NULL);
-	mem_hotplug_done();
-
-	hmm_devmem_radix_release(resource);
-}
-
-static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
-{
-	resource_size_t key, align_start, align_size, align_end;
-	struct device *device = devmem->device;
-	int ret, nid, is_ram;
-
-	align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
-	align_size = ALIGN(devmem->resource->start +
-			   resource_size(devmem->resource),
-			   PA_SECTION_SIZE) - align_start;
-
-	is_ram = region_intersects(align_start, align_size,
-				   IORESOURCE_SYSTEM_RAM,
-				   IORES_DESC_NONE);
-	if (is_ram == REGION_MIXED) {
-		WARN_ONCE(1, "%s attempted on mixed region %pr\n",
-				__func__, devmem->resource);
-		return -ENXIO;
-	}
-	if (is_ram == REGION_INTERSECTS)
-		return -ENXIO;
-
-	if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
-		devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
-	else
-		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
-
-	devmem->pagemap.res = *devmem->resource;
-	devmem->pagemap.page_fault = hmm_devmem_fault;
-	devmem->pagemap.page_free = hmm_devmem_free;
-	devmem->pagemap.dev = devmem->device;
-	devmem->pagemap.ref = &devmem->ref;
-	devmem->pagemap.data = devmem;
-
-	mutex_lock(&hmm_devmem_lock);
-	align_end = align_start + align_size - 1;
-	for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
-		struct hmm_devmem *dup;
-
-		dup = radix_tree_lookup(&hmm_devmem_radix,
-					key >> PA_SECTION_SHIFT);
-		if (dup) {
-			dev_err(device, "%s: collides with mapping for %s\n",
-				__func__, dev_name(dup->device));
-			mutex_unlock(&hmm_devmem_lock);
-			ret = -EBUSY;
-			goto error;
-		}
-		ret = radix_tree_insert(&hmm_devmem_radix,
-					key >> PA_SECTION_SHIFT,
-					devmem);
-		if (ret) {
-			dev_err(device, "%s: failed: %d\n", __func__, ret);
-			mutex_unlock(&hmm_devmem_lock);
-			goto error_radix;
-		}
-	}
-	mutex_unlock(&hmm_devmem_lock);
-
-	nid = dev_to_node(device);
-	if (nid < 0)
-		nid = numa_mem_id();
-
-	mem_hotplug_begin();
-	/*
-	 * For device private memory we call add_pages() as we only need to
-	 * allocate and initialize struct page for the device memory. More-
-	 * over the device memory is un-accessible thus we do not want to
-	 * create a linear mapping for the memory like arch_add_memory()
-	 * would do.
-	 *
-	 * For device public memory, which is accesible by the CPU, we do
-	 * want the linear mapping and thus use arch_add_memory().
-	 */
-	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
-		ret = arch_add_memory(nid, align_start, align_size, NULL,
-				false);
-	else
-		ret = add_pages(nid, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL, false);
-	if (ret) {
-		mem_hotplug_done();
-		goto error_add_memory;
-	}
-	move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
-				align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL);
-	mem_hotplug_done();
-
-	/*
-	 * Initialization of the pages has been deferred until now in order
-	 * to allow us to do the work while not holding the hotplug lock.
-	 */
-	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
-				align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, &devmem->pagemap);
-
-	return 0;
-
-error_add_memory:
-	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
-error_radix:
-	hmm_devmem_radix_release(devmem->resource);
-error:
-	return ret;
-}
-
 /*
  * hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory
  *
@@ -1190,6 +1041,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 {
 	struct hmm_devmem *devmem;
 	resource_size_t addr;
+	void *result;
 	int ret;
 
 	dev_pagemap_get_ops();
@@ -1244,14 +1096,18 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	devmem->pfn_last = devmem->pfn_first +
 			   (resource_size(devmem->resource) >> PAGE_SHIFT);
 
-	ret = hmm_devmem_pages_create(devmem);
-	if (ret)
-		return ERR_PTR(ret);
-
-	ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem);
-	if (ret)
-		return ERR_PTR(ret);
+	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	devmem->pagemap.res = *devmem->resource;
+	devmem->pagemap.page_fault = hmm_devmem_fault;
+	devmem->pagemap.page_free = hmm_devmem_free;
+	devmem->pagemap.altmap_valid = false;
+	devmem->pagemap.ref = &devmem->ref;
+	devmem->pagemap.data = devmem;
+	devmem->pagemap.kill = hmm_devmem_ref_kill;
 
+	result = devm_memremap_pages(devmem->device, &devmem->pagemap);
+	if (IS_ERR(result))
+		return result;
 	return devmem;
 }
 EXPORT_SYMBOL(hmm_devmem_add);
@@ -1261,6 +1117,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 					   struct resource *res)
 {
 	struct hmm_devmem *devmem;
+	void *result;
 	int ret;
 
 	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
@@ -1293,19 +1150,18 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	devmem->pfn_last = devmem->pfn_first +
 			   (resource_size(devmem->resource) >> PAGE_SHIFT);
 
-	ret = hmm_devmem_pages_create(devmem);
-	if (ret)
-		return ERR_PTR(ret);
-
-	ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem);
-	if (ret)
-		return ERR_PTR(ret);
-
-	ret = devm_add_action_or_reset(device, hmm_devmem_ref_kill,
-			&devmem->ref);
-	if (ret)
-		return ERR_PTR(ret);
+	devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+	devmem->pagemap.res = *devmem->resource;
+	devmem->pagemap.page_fault = hmm_devmem_fault;
+	devmem->pagemap.page_free = hmm_devmem_free;
+	devmem->pagemap.altmap_valid = false;
+	devmem->pagemap.ref = &devmem->ref;
+	devmem->pagemap.data = devmem;
+	devmem->pagemap.kill = hmm_devmem_ref_kill;
 
+	result = devm_memremap_pages(devmem->device, &devmem->pagemap);
+	if (IS_ERR(result))
+		return result;
 	return devmem;
 }
 EXPORT_SYMBOL(hmm_devmem_add_resource);



* [PATCH v8 7/7] mm, hmm: Mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
  2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
                   ` (5 preceding siblings ...)
  2018-11-20 23:13 ` [PATCH v8 6/7] mm, hmm: Replace hmm_devmem_pages_create() with devm_memremap_pages() Dan Williams
@ 2018-11-20 23:13 ` Dan Williams
  2018-11-22  1:20 ` [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Andrew Morton
  2018-12-03 23:37 ` Jerome Glisse
  8 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-11-20 23:13 UTC (permalink / raw)
  To: akpm
  Cc: Logan Gunthorpe, Jérôme Glisse, Christoph Hellwig,
	torvalds, linux-mm, linux-kernel, dri-devel

The routines hmm_devmem_add() and hmm_devmem_add_resource() duplicated
devm_memremap_pages() and are now simple wrappers around the core
facility to inject a dev_pagemap instance into the global pgmap_radix
and hook page-idle events. The devm_memremap_pages() interface is base
infrastructure for HMM. HMM has more and deeper ties into the kernel
memory management implementation than base ZONE_DEVICE, which is itself
an EXPORT_SYMBOL_GPL facility.

Originally, the HMM page structure creation routines copied the
devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify
the implementations was discussed during the initial review:
http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html
Recent work to extend devm_memremap_pages() for the peer-to-peer-DMA
facility enabled this cleanup to move forward.

In addition to the integration with devm_memremap_pages() HMM depends on
other GPL-only symbols:

    mmu_notifier_unregister_no_release
    percpu_ref
    region_intersects
    __class_create

It goes further to consume / indirectly expose functionality that is not
exported to any other driver:

    alloc_pages_vma
    walk_page_range

HMM is derived from devm_memremap_pages(), and extends deep core-kernel
fundamentals. Similar to devm_memremap_pages(), mark its entry points
EXPORT_SYMBOL_GPL().

Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/hmm.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index bf2495d9de81..50fbaf80f95e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1110,7 +1110,7 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 		return result;
 	return devmem;
 }
-EXPORT_SYMBOL(hmm_devmem_add);
+EXPORT_SYMBOL_GPL(hmm_devmem_add);
 
 struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 					   struct device *device,
@@ -1164,7 +1164,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 		return result;
 	return devmem;
 }
-EXPORT_SYMBOL(hmm_devmem_add_resource);
+EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);
 
 /*
  * A device driver that wants to handle multiple devices memory through a



* Re: [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only
  2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
                   ` (6 preceding siblings ...)
  2018-11-20 23:13 ` [PATCH v8 7/7] mm, hmm: Mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL Dan Williams
@ 2018-11-22  1:20 ` Andrew Morton
  2018-11-25 22:04   ` Pavel Machek
  2018-12-03 23:37 ` Jerome Glisse
  8 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2018-11-22  1:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: stable, Balbir Singh, Logan Gunthorpe, Christoph Hellwig,
	Jérôme Glisse, Michal Hocko, torvalds, linux-mm,
	linux-kernel, dri-devel

On Tue, 20 Nov 2018 15:12:49 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> Changes since v7 [1]:
> At Maintainer Summit, Greg brought up a topic I proposed around
> EXPORT_SYMBOL_GPL usage. The motivation was considerations for when
> EXPORT_SYMBOL_GPL is warranted and the criteria for taking the
> exceptional step of reclassifying an existing export. Specifically, I
> wanted to make the case that although the line is fuzzy and hard to
> specify in abstract terms, it is nonetheless clear that
> devm_memremap_pages() and HMM (Heterogeneous Memory Management) have
> crossed it. The devm_memremap_pages() facility should have been
> EXPORT_SYMBOL_GPL from the beginning, and HMM as a derivative of that
> functionality should have naturally picked up that designation as well.
> 
> Contrary to typical rules, the HMM infrastructure was merged upstream
> with zero in-tree consumers. There was a promise at the time that those
> users would be merged "soon", but it has been over a year with no drivers
> arriving. While the Nouveau driver is about to belatedly make good on
> that promise it is clear that HMM was targeted first and foremost at an
> out-of-tree consumer.
> 
> HMM is derived from devm_memremap_pages(), a facility Christoph and I
> spearheaded to support persistent memory. It combines a device lifetime
> model with a dynamically created 'struct page' / memmap array for any
> physical address range. It enables coordination and control of the many
> code paths in the kernel built to interact with memory via 'struct page'
> objects. With HMM the integration goes even deeper by allowing device
> drivers to hook and manipulate page fault and page free events.
> 
> One interpretation of when EXPORT_SYMBOL is suitable is when it is
> exporting stable and generic leaf functionality.  The
> devm_memremap_pages() facility continues to see expanding use cases,
> peer-to-peer DMA being the most recent, with no clear end date when it
> will stop attracting reworks and semantic changes. It is not suitable to
> export devm_memremap_pages() as a stable 3rd party driver API due to the
> fact that it is still changing and manipulates core behavior. Moreover,
> it is not in the best interest of the long term development of the core
> memory management subsystem to permit any external driver to effectively
> define its own system-wide memory management policies with no
> encouragement to engage with upstream.
> 
> I am also concerned that HMM was designed in a way to minimize further
> engagement with the core-MM. That, with these hooks in place,
> device-drivers are free to implement their own policies without much
> consideration for whether and how the core-MM could grow to meet that
> need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
> core-MM should be allowed the opportunity and stimulus to change and
> address these new use cases as first class functionality.
> 

The arguments are compelling.  I apologize for not thinking of and/or
not being made aware of them at the time.

I'll take [7/7] (with all the above added to the changelog) with a view
to a 4.21-rc1 merge.  That gives us a couple of months for further
discussion.  Public discussion, please.

It should be noted that [7/7] has a cc:stable.


* Re: [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  2018-11-20 23:12 ` [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
@ 2018-11-22 13:30   ` Michal Hocko
  2018-11-22 16:38     ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2018-11-22 13:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Jérôme Glisse, Christoph Hellwig, torvalds,
	linux-mm, linux-kernel, dri-devel

On Tue 20-11-18 15:12:54, Dan Williams wrote:
> devm_memremap_pages() is a facility that can create struct page entries
> for any arbitrary range and give drivers the ability to subvert core
> aspects of page management.
> 
> Specifically the facility is tightly integrated with the kernel's memory
> hotplug functionality. It injects an altmap argument deep into the
> architecture specific vmemmap implementation to allow allocating from
> specific reserved pages, and it has Linux specific assumptions about
> page structure reference counting relative to get_user_pages() and
> get_user_pages_fast(). It was an oversight and a mistake that this was
> not marked EXPORT_SYMBOL_GPL from the outset.
> 
> Again, devm_memremap_pages() exposes and relies upon core kernel
> internal assumptions and will continue to evolve along with 'struct
> page', memory hotplug, and support for new memory types / topologies.
> Only an in-kernel GPL-only driver is expected to keep up with this
> ongoing evolution. This interface, and functionality derived from this
> interface, is not suitable for kernel-external drivers.

As I've said earlier, I do not buy this justification because there is
simply no stable API for modules by definition
(Documentation/process/stable-api-nonsense.rst). I do understand
your reasoning that you, as the author, never intended to export the
symbol this way. That is a fair and justified reason for this patch.

Whoever needs a wrapper around arch_add_memory can do so because this
symbol has no restriction for the usage. It will still be the same
fiddling with struct page and deep mm internals. Do we care? I am not
convinced, because once we grow any in-tree user we have to cope with
any potential abuse, as we have had to in other areas in the past. And
out-of-tree modules? Who cares. Those are completely on their own and
have their ways to work around it.

> Cc: Michal Hocko <mhocko@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

That being said
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  kernel/memremap.c                 |    2 +-
>  tools/testing/nvdimm/test/iomap.c |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 9eced2cc9f94..61dbcaa95530 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -233,7 +233,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>   err_array:
>  	return ERR_PTR(error);
>  }
> -EXPORT_SYMBOL(devm_memremap_pages);
> +EXPORT_SYMBOL_GPL(devm_memremap_pages);
>  
>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
>  {
> diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
> index ff9d3a5825e1..ed18a0cbc0c8 100644
> --- a/tools/testing/nvdimm/test/iomap.c
> +++ b/tools/testing/nvdimm/test/iomap.c
> @@ -113,7 +113,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  		return nfit_res->buf + offset - nfit_res->res.start;
>  	return devm_memremap_pages(dev, pgmap);
>  }
> -EXPORT_SYMBOL(__wrap_devm_memremap_pages);
> +EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);
>  
>  pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags)
>  {

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  2018-11-22 13:30   ` Michal Hocko
@ 2018-11-22 16:38     ` Christoph Hellwig
  2018-11-22 16:40       ` Christoph Hellwig
  2018-11-23  8:47       ` Michal Hocko
  0 siblings, 2 replies; 26+ messages in thread
From: Christoph Hellwig @ 2018-11-22 16:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, akpm, Jérôme Glisse, Christoph Hellwig,
	torvalds, linux-mm, linux-kernel, dri-devel

On Thu, Nov 22, 2018 at 02:30:13PM +0100, Michal Hocko wrote:
> Whoever needs a wrapper around arch_add_memory can do so because this
> symbol has no restriction for the usage.

arch_add_memory is not exported, and it really should not be.


* Re: [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  2018-11-22 16:38     ` Christoph Hellwig
@ 2018-11-22 16:40       ` Christoph Hellwig
  2018-11-23  8:47       ` Michal Hocko
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2018-11-22 16:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, akpm, Jérôme Glisse, Christoph Hellwig,
	torvalds, linux-mm, linux-kernel, dri-devel

On Thu, Nov 22, 2018 at 05:38:58PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 22, 2018 at 02:30:13PM +0100, Michal Hocko wrote:
> > Whoever needs a wrapper around arch_add_memory can do so because this
> > symbol has no restriction for the usage.
> 
> arch_add_memory is not exported, and it really should not be.

And in some older trees it oddly has an EXPORT_SYMBOL_GPL on x86
and sh, but no actual modular users.


* Re: [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL
  2018-11-22 16:38     ` Christoph Hellwig
  2018-11-22 16:40       ` Christoph Hellwig
@ 2018-11-23  8:47       ` Michal Hocko
  1 sibling, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2018-11-23  8:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, akpm, Jérôme Glisse, torvalds, linux-mm,
	linux-kernel, dri-devel

On Thu 22-11-18 17:38:58, Christoph Hellwig wrote:
> On Thu, Nov 22, 2018 at 02:30:13PM +0100, Michal Hocko wrote:
> > Whoever needs a wrapper around arch_add_memory can do so because this
> > symbol has no restriction for the usage.
> 
> arch_add_memory is not exported, and it really should not be.

It is not, but nothing really prevents anyone from wrapping it and
exporting the wrapper. I am definitely not arguing for that and I would
even agree with you that it shouldn't be exported at all unless there is
a _very_ good reason for that, because use cases are what we care about
here.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v8 4/7] mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support
  2018-11-20 23:13 ` [PATCH v8 4/7] mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support Dan Williams
@ 2018-11-23 10:48   ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2018-11-23 10:48 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: Jérôme Glisse, Christoph Hellwig, Logan Gunthorpe,
	torvalds, linux-mm, linux-kernel, dri-devel

On 21.11.18 00:13, Dan Williams wrote:
> In preparation for consolidating all ZONE_DEVICE enabling via
> devm_memremap_pages(), teach it how to handle the constraints of
> MEMORY_DEVICE_PRIVATE ranges.
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> [jglisse: call move_pfn_range_to_zone for MEMORY_DEVICE_PRIVATE]
> Acked-by: Christoph Hellwig <hch@lst.de>
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  kernel/memremap.c |   53 +++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 41 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 5e45f0c327a5..3eef989ef035 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -98,9 +98,15 @@ static void devm_memremap_pages_release(void *data)
>  		- align_start;
>  
>  	mem_hotplug_begin();
> -	arch_remove_memory(align_start, align_size, pgmap->altmap_valid ?
> -			&pgmap->altmap : NULL);
> -	kasan_remove_zero_shadow(__va(align_start), align_size);
> +	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> +		pfn = align_start >> PAGE_SHIFT;
> +		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
> +				align_size >> PAGE_SHIFT, NULL);
> +	} else {
> +		arch_remove_memory(align_start, align_size,
> +				pgmap->altmap_valid ? &pgmap->altmap : NULL);
> +		kasan_remove_zero_shadow(__va(align_start), align_size);
> +	}
>  	mem_hotplug_done();
>  
>  	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
> @@ -187,17 +193,40 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  		goto err_pfn_remap;
>  
>  	mem_hotplug_begin();
> -	error = kasan_add_zero_shadow(__va(align_start), align_size);
> -	if (error) {
> -		mem_hotplug_done();
> -		goto err_kasan;
> +
> +	/*
> +	 * For device private memory we call add_pages() as we only need to
> +	 * allocate and initialize struct page for the device memory. More-
> +	 * over the device memory is un-accessible thus we do not want to
> +	 * create a linear mapping for the memory like arch_add_memory()
> +	 * would do.
> +	 *
> +	 * For all other device memory types, which are accessible by
> +	 * the CPU, we do want the linear mapping and thus use
> +	 * arch_add_memory().
> +	 */

I consider this comment really useful. :)

Short question: Right now, MEMORY_DEVICE_PRIVATE always indicates HMM,
correct? (I am just confused by the naming but I assume
MEMORY_DEVICE_PRIVATE is more generic than HMM)


-- 

Thanks,

David / dhildenb


* Re: [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only
  2018-11-22  1:20 ` [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Andrew Morton
@ 2018-11-25 22:04   ` Pavel Machek
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Machek @ 2018-11-25 22:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, stable, Balbir Singh, Logan Gunthorpe,
	Christoph Hellwig, Jérôme Glisse, Michal Hocko,
	torvalds, linux-mm, linux-kernel, dri-devel

Hi!

> > Changes since v7 [1]:
> > At Maintainer Summit, Greg brought up a topic I proposed around
> > EXPORT_SYMBOL_GPL usage. The motivation was considerations for when
> > EXPORT_SYMBOL_GPL is warranted and the criteria for taking the
> > exceptional step of reclassifying an existing export. Specifically, I
> > wanted to make the case that although the line is fuzzy and hard to
> > specify in abstract terms, it is nonetheless clear that
> > devm_memremap_pages() and HMM (Heterogeneous Memory Management) have
> > crossed it. The devm_memremap_pages() facility should have been
> > EXPORT_SYMBOL_GPL from the beginning, and HMM as a derivative of that
> > functionality should have naturally picked up that designation as well.
> > 
> > Contrary to typical rules, the HMM infrastructure was merged upstream
> > with zero in-tree consumers. There was a promise at the time that those
> > users would be merged "soon", but it has been over a year with no drivers
> > arriving. While the Nouveau driver is about to belatedly make good on
> > that promise it is clear that HMM was targeted first and foremost at an
> > out-of-tree consumer.

Ok, so who is this consumer and is he GPLed?

> It should be noted that [7/7] has a cc:stable.

That is a pretty evil thing to do, right?

The aim here is not to fix "a real bug that hits people", AFAICT. The
aim is to break existing configurations for users.

Political games are sometimes necessary, but should not really be
played with stable.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-20 23:13 ` [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling Dan Williams
@ 2018-11-27 21:43   ` Logan Gunthorpe
  2018-11-29  3:10     ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2018-11-27 21:43 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: stable, Jérôme Glisse, Christoph Hellwig, torvalds,
	linux-mm, linux-kernel, dri-devel, Bjorn Helgaas, Stephen Bates

Hey Dan,

On 2018-11-20 4:13 p.m., Dan Williams wrote:
> The last step before devm_memremap_pages() returns success is to
> allocate a release action, devm_memremap_pages_release(), to tear the
> entire setup down. However, the result from devm_add_action() is not
> checked.
> 
> Checking the error from devm_add_action() is not enough. The api
> currently relies on the fact that the percpu_ref it is using is killed
> by the time the devm_memremap_pages_release() is run. Rather than
> continue this awkward situation, offload the responsibility of killing
> the percpu_ref to devm_memremap_pages_release() directly. This allows
> devm_memremap_pages() to do the right thing  relative to init failures
> and shutdown.
> 
> Without this change we could fail to register the teardown of
> devm_memremap_pages(). The likelihood of hitting this failure is tiny as
> small memory allocations almost always succeed. However, the impact of
> the failure is large given any future reconfiguration, or
> disable/enable, of an nvdimm namespace will fail forever as subsequent
> calls to devm_memremap_pages() will fail to setup the pgmap_radix since
> there will be stale entries for the physical address range.
> 
> An argument could be made to require that the ->kill() operation be set
> in the @pgmap arg rather than passed in separately. However, it helps
> code readability, tracking the lifetime of a given instance, to be able
> to grep the kill routine directly at the devm_memremap_pages() call
> site.
> 
> Cc: <stable@vger.kernel.org>
> Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
> Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I recently realized this patch, which was recently added to the mm tree,
will break p2pdma. This is largely because the patch was written and
reviewed before p2pdma was merged (in 4.20). Originally, I think we both
expected this patch would be merged before p2pdma but that's not what
happened.

Also, while testing this, I found the teardown is still not quite
correct. In p2pdma, the struct pages will be removed before all of the
percpu references have been released, and if the device is unbound while
pages are in use, there will be a kernel panic. This is because we wait
on the completion that indicates all references have been freed only
after devm_memremap_pages_release() has been called and the pages have
been removed. This is fairly easily fixed by waiting for the completion
in the kill function and moving the kill call after the last put_page().
I suspect device DAX also has this problem, but I'm not entirely certain
whether something else might be preventing us from hitting this bug.

Ideally, as part of this patch we need to update the p2pdma call site
for devm_memremap_pages() and fix the completion issue. The diff for all
this is below, but if you'd like I can send a proper patch.

Thanks,

Logan

--


diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ae3c5b25dcc7..1df7bdb45eab 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -82,9 +82,10 @@ static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
        complete_all(&p2p->devmap_ref_done);
 }

-static void pci_p2pdma_percpu_kill(void *data)
+static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
 {
-       struct percpu_ref *ref = data;
+       struct pci_p2pdma *p2p =
+               container_of(ref, struct pci_p2pdma, devmap_ref);

        /*
         * pci_p2pdma_add_resource() may be called multiple times
@@ -96,6 +97,7 @@ static void pci_p2pdma_percpu_kill(void *data)
                return;

        percpu_ref_kill(ref);
+       wait_for_completion(&p2p->devmap_ref_done);
 }

 static void pci_p2pdma_release(void *data)
@@ -105,7 +107,6 @@ static void pci_p2pdma_release(void *data)
        if (!pdev->p2pdma)
                return;

-       wait_for_completion(&pdev->p2pdma->devmap_ref_done);
        percpu_ref_exit(&pdev->p2pdma->devmap_ref);

        gen_pool_destroy(pdev->p2pdma->pool);
@@ -198,6 +199,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
        pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
        pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
                pci_resource_start(pdev, bar);
+       pgmap->kill = pci_p2pdma_percpu_kill;

        addr = devm_memremap_pages(&pdev->dev, pgmap);
        if (IS_ERR(addr)) {
@@ -211,11 +213,6 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
        if (error)
                goto pgmap_free;

-       error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
-                                         &pdev->p2pdma->devmap_ref);
-       if (error)
-               goto pgmap_free;
-
        pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
                 &pgmap->res);

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 5e45f0c327a5..dd9a953e796a 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -88,9 +88,9 @@ static void devm_memremap_pages_release(void *data)
        resource_size_t align_start, align_size;
        unsigned long pfn;

-       pgmap->kill(pgmap->ref);
        for_each_device_pfn(pfn, pgmap)
                put_page(pfn_to_page(pfn));
+       pgmap->kill(pgmap->ref);

        /* pages are dead and unused, undo the arch mapping */
        align_start = res->start & ~(SECTION_SIZE - 1);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-27 21:43   ` Logan Gunthorpe
@ 2018-11-29  3:10     ` Dan Williams
  2018-11-29 17:06       ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-11-29  3:10 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, stable, Jérôme Glisse,
	Christoph Hellwig, Linus Torvalds, Linux MM,
	Linux Kernel Mailing List, Maling list - DRI developers,
	Bjorn Helgaas, Stephen Bates

On Tue, Nov 27, 2018 at 1:44 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> Hey Dan,
>
> On 2018-11-20 4:13 p.m., Dan Williams wrote:
> > The last step before devm_memremap_pages() returns success is to
> > allocate a release action, devm_memremap_pages_release(), to tear the
> > entire setup down. However, the result from devm_add_action() is not
> > checked.
> >
> > Checking the error from devm_add_action() is not enough. The api
> > currently relies on the fact that the percpu_ref it is using is killed
> > by the time the devm_memremap_pages_release() is run. Rather than
> > continue this awkward situation, offload the responsibility of killing
> > the percpu_ref to devm_memremap_pages_release() directly. This allows
> > devm_memremap_pages() to do the right thing  relative to init failures
> > and shutdown.
> >
> > Without this change we could fail to register the teardown of
> > devm_memremap_pages(). The likelihood of hitting this failure is tiny as
> > small memory allocations almost always succeed. However, the impact of
> > the failure is large given any future reconfiguration, or
> > disable/enable, of an nvdimm namespace will fail forever as subsequent
> > calls to devm_memremap_pages() will fail to setup the pgmap_radix since
> > there will be stale entries for the physical address range.
> >
> > An argument could be made to require that the ->kill() operation be set
> > in the @pgmap arg rather than passed in separately. However, it helps
> > code readability, tracking the lifetime of a given instance, to be able
> > to grep the kill routine directly at the devm_memremap_pages() call
> > site.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
> > Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
> > Reported-by: Logan Gunthorpe <logang@deltatee.com>
> > Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> I recently realized this patch, which was recently added to the mm tree,
> will break p2pdma. This is largely because the patch was written and
> reviewed before p2pdma was merged (in 4.20). Originally, I think we both
> expected this patch would be merged before p2pdma but that's not what
> happened.

Indeed, sorry I missed this.

>
> Also, while testing this, I found the teardown is still not quite
> correct. In p2pdma, the struct pages will be removed before all of the
> percpu references have been released, and if the device is unbound while
> pages are in use, there will be a kernel panic. This is because we wait
> on the completion that indicates all references have been freed only
> after devm_memremap_pages_release() has been called and the pages have
> been removed. This is fairly easily fixed by waiting for the completion
> in the kill function and moving the kill call after the last put_page().
> I suspect device DAX also has this problem, but I'm not entirely certain
> whether something else might be preventing us from hitting this bug.
>
> Ideally, as part of this patch we need to update the p2pdma call site
> for devm_memremap_pages() and fix the completion issue. The diff for all
> this is below, but if you'd like I can send a proper patch.

Yes, please send a proper patch. Although, I'm still not sure I see
the problem with the order of the percpu-ref kill. It's likely more
efficient to put the kill after the put_page() loop because the
percpu-ref will still be in "fast" per-cpu mode, but the kernel panic
should not be possible as long as there is a wait_for_completion()
before the exit, unless something else is wrong.

Certainly you can't move the wait_for_completion() into your ->kill()
callback without switching the ordering, but I'm not on board with
that change until I understand a bit more about why you think
device-dax might be broken?

I took a look at the p2pdma shutdown path and the:

        if (percpu_ref_is_dying(ref))
                return;

...looks fishy. If multiple agents can overlap their requests for the
same range why not track that simply as additional refs? Could it be
the crash that you are seeing is a result of mis-accounting when it is
safe to assume the page allocation can be freed?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-29  3:10     ` Dan Williams
@ 2018-11-29 17:06       ` Logan Gunthorpe
  2018-11-29 17:30         ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2018-11-29 17:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, stable, Jérôme Glisse,
	Christoph Hellwig, Linus Torvalds, Linux MM,
	Linux Kernel Mailing List, Maling list - DRI developers,
	Bjorn Helgaas, Stephen Bates



On 2018-11-28 8:10 p.m., Dan Williams wrote:
> Yes, please send a proper patch. 

Ok, I'll send one shortly.

> Although, I'm still not sure I see
> the problem with the order of the percpu-ref kill. It's likely more
> efficient to put the kill after the put_page() loop because the
> percpu-ref will still be in "fast" per-cpu mode, but the kernel panic
> should not be possible as long as there is a wait_for_completion()
> before the exit, unless something else is wrong.

The series of events looks something like this:

1) Some p2pdma user calls pci_alloc_p2pmem() to get some memory to DMA
to, taking a reference to the pgmap.
2) Another process unbinds the underlying p2pdma driver and the devm
chain starts to unwind.
3) devm_memremap_pages_release() is called and it kills the reference
and drops its last reference.
4) arch_remove_memory() is called which will remove all the struct pages.
5) We eventually get to pci_p2pdma_release() where we wait for the
completion indicating all the pages have been freed.
6) The user in (1) tries to use the page that has been removed,
typically by calling pci_p2pdma_map_sg(), but the page doesn't exist so
the kernel panics.

So we really need the wait in (5) to occur before (4) but after (3) so
that the pages continue to exist until the last reference is dropped.
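
To make the ordering concrete, here is a toy user-space model in C. It
is only a sketch: struct toy_pgmap and every toy_* function are invented
for illustration and are not kernel interfaces, a plain counter stands in
for the percpu_ref, and "waiting" is modelled as a function that returns
false where real code would sleep in wait_for_completion().

```c
#include <assert.h>   /* only needed by callers that check the scenarios */
#include <stdbool.h>

/* Toy model only: none of these names are kernel APIs. */
struct toy_pgmap {
	int refs;            /* outstanding references, base reference included */
	bool base_dropped;   /* has teardown already dropped the base reference? */
	bool pages_present;  /* stands in for the struct pages backing the range */
};

static void toy_init(struct toy_pgmap *p)
{
	p->refs = 1;             /* base reference owned by the mapping */
	p->base_dropped = false;
	p->pages_present = true;
}

static void toy_get(struct toy_pgmap *p) { p->refs++; }
static void toy_put(struct toy_pgmap *p) { p->refs--; }

/* Ordering from steps (3)-(5): the pages go away before anyone waits. */
static void toy_release_racy(struct toy_pgmap *p)
{
	toy_put(p);                /* step (3): drop the base reference */
	p->base_dropped = true;
	p->pages_present = false;  /* step (4): arch_remove_memory() analogue */
	/* step (5), the wait, would only happen after this point */
}

/*
 * Proposed ordering: drop the base reference, then refuse to remove the
 * pages until every other reference is gone. Returns false where real
 * code would still be sleeping in wait_for_completion().
 */
static bool toy_release_ordered(struct toy_pgmap *p)
{
	if (!p->base_dropped) {
		toy_put(p);
		p->base_dropped = true;
	}
	if (p->refs > 0)
		return false;
	p->pages_present = false;
	return true;
}

/* True if a user still holds a reference to pages that no longer exist. */
static bool toy_racy_leaves_dangling_user(void)
{
	struct toy_pgmap p;

	toy_init(&p);
	toy_get(&p);               /* step (1): user takes a reference */
	toy_release_racy(&p);      /* steps (2)-(4) */
	return p.refs > 0 && !p.pages_present;
}

/* True if the pages outlive the user's reference and are removed after. */
static bool toy_ordered_keeps_pages_for_user(void)
{
	struct toy_pgmap p;
	bool alive_while_held, removed_after_put;

	toy_init(&p);
	toy_get(&p);               /* step (1): user takes a reference */
	alive_while_held = !toy_release_ordered(&p) && p.pages_present;
	toy_put(&p);               /* user drops the last reference */
	removed_after_put = toy_release_ordered(&p) && !p.pages_present;
	return alive_while_held && removed_after_put;
}
```

In the racy scenario the user ends up holding a reference to pages that
are already gone, which corresponds to step (6). In the ordered scenario
the pages are only removed once the user has dropped its reference.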

> Certainly you can't move the wait_for_completion() into your ->kill()
> callback without switching the ordering, but I'm not on board with
> that change until I understand a bit more about why you think
> device-dax might be broken?
> 
> I took a look at the p2pdma shutdown path and the:
> 
>         if (percpu_ref_is_dying(ref))
>                 return;
> ...looks fishy. If multiple agents can overlap their requests for the
> same range why not track that simply as additional refs? Could it be
> the crash that you are seeing is a result of mis-accounting when it is
> safe to assume the page allocation can be freed?

Yeah, someone else mentioned the same thing during review but if I
remove it, there can be a double kill() on a hypothetical driver that
might call pci_p2pdma_add_resource() twice. The issue is we only have
one percpu_ref per device not one per range/BAR.

Though, now that I look at it, the current change in question will be
wrong if there are two devm_memremap_pages_release()s to call. Both need
to drop their references before we can wait_for_completion() ;(. I guess
I need multiple percpu_refs or more complex changes to
devm_memremap_pages_release().

Thanks

Logan


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-29 17:06       ` Logan Gunthorpe
@ 2018-11-29 17:30         ` Dan Williams
  2018-11-29 17:50           ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-11-29 17:30 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, stable, Jérôme Glisse,
	Christoph Hellwig, Linus Torvalds, Linux MM,
	Linux Kernel Mailing List, Maling list - DRI developers,
	Bjorn Helgaas, Stephen Bates

On Thu, Nov 29, 2018 at 9:07 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2018-11-28 8:10 p.m., Dan Williams wrote:
> > Yes, please send a proper patch.
>
> Ok, I'll send one shortly.
>
> > Although, I'm still not sure I see
> > the problem with the order of the percpu-ref kill. It's likely more
> > efficient to put the kill after the put_page() loop because the
> > percpu-ref will still be in "fast" per-cpu mode, but the kernel panic
> > should not be possible as long as there is a wait_for_completion()
> > before the exit, unless something else is wrong.
>
> The series of events looks something like this:
>
> 1) Some p2pdma user calls pci_alloc_p2pmem() to get some memory to DMA
> to, taking a reference to the pgmap.
> 2) Another process unbinds the underlying p2pdma driver and the devm
> chain starts to unwind.
> 3) devm_memremap_pages_release() is called and it kills the reference
> and drops its last reference.

Oh! Yes, nice find. We need to wait for the percpu-ref to be dead and
all outstanding references dropped before we can proceed to
arch_remove_memory(), and I think this problem has been there since
day one because the final exit was always after devm_memremap_pages()
release which means arch_remove_memory() was always racing any final
put_page(). I'll take a look; it seems the arch_remove_memory() call
needs to be moved out-of-line to its own context and wait for the
final exit of the percpu-ref.

> 4) arch_remove_memory() is called which will remove all the struct pages.
> 5) We eventually get to pci_p2pdma_release() where we wait for the
> completion indicating all the pages have been freed.
> 6) The user in (1) tries to use the page that has been removed,
> typically by calling pci_p2pdma_map_sg(), but the page doesn't exist so
> the kernel panics.
>
> So we really need the wait in (5) to occur before (4) but after (3) so
> that the pages continue to exist until the last reference is dropped.
>
> > Certainly you can't move the wait_for_completion() into your ->kill()
> > callback without switching the ordering, but I'm not on board with
> > that change until I understand a bit more about why you think
> > device-dax might be broken?
> >
> > I took a look at the p2pdma shutdown path and the:
> >
> >         if (percpu_ref_is_dying(ref))
> >                 return;
> > ...looks fishy. If multiple agents can overlap their requests for the
> > same range why not track that simply as additional refs? Could it be
> > the crash that you are seeing is a result of mis-accounting when it is
> > safe to assume the page allocation can be freed?
>
> Yeah, someone else mentioned the same thing during review but if I
> remove it, there can be a double kill() on a hypothetical driver that
> might call pci_p2pdma_add_resource() twice. The issue is we only have
> one percpu_ref per device not one per range/BAR.
>
> Though, now that I look at it, the current change in question will be
> wrong if there are two devm_memremap_pages_release()s to call. Both need
> to drop their references before we can wait_for_completion() ;(. I guess
> I need multiple percpu_refs or more complex changes to
> devm_memremap_pages_release().

Can you just have a normal device-level kref for this case? On final
device-level kref_put then kill the percpu_ref? I guess the problem is
devm semantics where p2pdma only gets one callback on a driver
->remove() event. I'm not sure how to support multiple references of
the same pages without creating a non-devm version of
devm_memremap_pages(). I'm not opposed to that, but afaiu I don't
think p2pdma is compatible with devm as long as it supports N>1:1
mappings of the same range.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-29 17:30         ` Dan Williams
@ 2018-11-29 17:50           ` Logan Gunthorpe
  2018-11-29 18:51             ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2018-11-29 17:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, stable, Jérôme Glisse,
	Christoph Hellwig, Linus Torvalds, Linux MM,
	Linux Kernel Mailing List, Maling list - DRI developers,
	Bjorn Helgaas, Stephen Bates



On 2018-11-29 10:30 a.m., Dan Williams wrote:
> Oh! Yes, nice find. We need to wait for the percpu-ref to be dead and
> all outstanding references dropped before we can proceed to
> arch_remove_memory(), and I think this problem has been there since
> day one because the final exit was always after devm_memremap_pages()
> release which means arch_remove_memory() was always racing any final
> put_page(). I'll take a look; it seems the arch_remove_memory() call
> needs to be moved out-of-line to its own context and wait for the
> final exit of the percpu-ref.

Ok, well I thought moving the wait_for_completion() into the kill() call
was a pretty good solution to this. Though, if we move the
arch_remove_memory() into a different context, it *may* help with the
problem below...

>> Though, now that I look at it, the current change in question will be
>> wrong if there are two devm_memremap_pages_release()s to call. Both need
>> to drop their references before we can wait_for_completion() ;(. I guess
>> I need multiple percpu_refs or more complex changes to
>> devm_memremap_pages_release().
> 
> Can you just have a normal device-level kref for this case? On final
> device-level kref_put then kill the percpu_ref? I guess the problem is
> devm semantics where p2pdma only gets one callback on a driver
> ->remove() event. I'm not sure how to support multiple references of
> the same pages without creating a non-devm version of
> devm_memremap_pages(). I'm not opposed to that, but afaiu I don't
> think p2pdma is compatible with devm as long as it supports N>1:1
> mappings of the same range.

Hmm, no I think you misunderstood what I said. I'm saying I need to have
exactly one percpu_ref per call to devm_memremap_pages() and this is
doable, just slightly annoying. Right now I have one percpu_ref for
multiple calls to devm_memremap_pages() which doesn't work with the
above fix because there will always be a wait_for_completion() before
the last references are dropped in this way:

1) First devm_memremap_pages_release() is called which drops its
reference and calls wait_for_completion().

2) The second devm_memremap_pages_release() needs to be called to drop
its reference, but can't, seeing as the first is waiting, and therefore
the percpu_ref never goes to zero and the wait_for_completion() never
returns.
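
The same stall can be shown with a small user-space sketch in C. It makes
several simplifying assumptions: the toy_* names are made up and are not
kernel APIs, each mapping holds exactly one reference (the real code takes
a reference per page), and the release callbacks run strictly one after
another, so a callback that would block forever stops the unwind.

```c
#include <assert.h>   /* only needed by callers that check the results */

#define TOY_NR_MAPPINGS 2

/*
 * One counter shared by every mapping. Returns how many release
 * callbacks manage to finish during a sequential unwind.
 */
static int toy_unwind_shared_ref(void)
{
	int shared_refs = TOY_NR_MAPPINGS;  /* one reference per mapping */
	int completed = 0;

	for (int i = 0; i < TOY_NR_MAPPINGS; i++) {
		shared_refs--;          /* this release drops its reference */
		if (shared_refs > 0)
			break;          /* wait never returns; later releases never run */
		completed++;
	}
	return completed;
}

/* One counter per mapping: each release only waits on its own reference. */
static int toy_unwind_per_mapping_ref(void)
{
	int refs[TOY_NR_MAPPINGS];
	int completed = 0;

	for (int i = 0; i < TOY_NR_MAPPINGS; i++)
		refs[i] = 1;

	for (int i = 0; i < TOY_NR_MAPPINGS; i++) {
		refs[i]--;              /* this release drops its own reference */
		if (refs[i] > 0)
			break;          /* would block here if a user still held one */
		completed++;
	}
	return completed;
}
```

With the shared counter no release completes, because the first one waits
on a reference that only the second could drop. With one counter per
mapping, both releases complete.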

Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-29 17:50           ` Logan Gunthorpe
@ 2018-11-29 18:51             ` Dan Williams
  2018-11-30 22:19               ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-11-29 18:51 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, stable, Jérôme Glisse,
	Christoph Hellwig, Linus Torvalds, Linux MM,
	Linux Kernel Mailing List, Maling list - DRI developers,
	Bjorn Helgaas, Stephen Bates

On Thu, Nov 29, 2018 at 9:51 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2018-11-29 10:30 a.m., Dan Williams wrote:
> > Oh! Yes, nice find. We need to wait for the percpu-ref to be dead and
> > all outstanding references dropped before we can proceed to
> > arch_remove_memory(), and I think this problem has been there since
> > day one because the final exit was always after devm_memremap_pages()
> > release which means arch_remove_memory() was always racing any final
> > put_page(). I'll take a look; it seems the arch_remove_memory() call
> > needs to be moved out-of-line to its own context and wait for the
> > final exit of the percpu-ref.
>
> Ok, well I thought moving the wait_for_completion() into the kill() call
> was a pretty good solution to this.

True, it is...

> Though, if we move the
> arch_remove_memory() into a different context, it *may* help with the
> problem below...

Glad to see my over-engineered proposal in this case might be good for
something...

>
> >> Though, now that I look at it, the current change in question will be
> >> wrong if there are two devm_memremap_pages_release()s to call. Both need
> >> to drop their references before we can wait_for_completion() ;(. I guess
> >> I need multiple percpu_refs or more complex changes to
> >> devm_memremap_pages_release().
> >
> > Can you just have a normal device-level kref for this case? On final
> > device-level kref_put then kill the percpu_ref? I guess the problem is
> > devm semantics where p2pdma only gets one callback on a driver
> > ->remove() event. I'm not sure how to support multiple references of
> > the same pages without creating a non-devm version of
> > devm_memremap_pages(). I'm not opposed to that, but afaiu I don't
> > think p2pdma is compatible with devm as long as it supports N>1:1
> > mappings of the same range.
>
> Hmm, no I think you misunderstood what I said. I'm saying I need to have
> exactly one percpu_ref per call to devm_memremap_pages() and this is
> doable, just slightly annoying. Right now I have one percpu_ref for
> multiple calls to devm_memremap_pages() which doesn't work with the
> above fix because there will always be a wait_for_completion() before
> the last references are dropped in this way:
>
> 1) First devm_memremap_pages_release() is called which drops its
> reference and calls wait_for_completion().
>
> 2) The second devm_memremap_pages_release() needs to be called to drop
> its reference, but can't, seeing as the first is waiting, and therefore
> the percpu_ref never goes to zero and the wait_for_completion() never
> returns.
>

Got it, let me see how bad moving arch_remove_memory() turns out,
sounds like a decent approach to coordinate multiple users of a single
ref.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-29 18:51             ` Dan Williams
@ 2018-11-30 22:19               ` Logan Gunthorpe
  2018-11-30 22:28                 ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2018-11-30 22:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, stable, Jérôme Glisse,
	Christoph Hellwig, Linus Torvalds, Linux MM,
	Linux Kernel Mailing List, Maling list - DRI developers,
	Bjorn Helgaas, Stephen Bates

Hey,

On 2018-11-29 11:51 a.m., Dan Williams wrote:
> Got it, let me see how bad moving arch_remove_memory() turns out,
> sounds like a decent approach to coordinate multiple users of a single
> ref.

I've put together a patch set[1] that fixes all the users of
devm_memremap_pages() without moving arch_remove_memory(). It's pretty
clean except for the p2pdma case which is fairly tricky but I don't
think there's an easy way around that.

If you come up with a better solution that's great, otherwise let me
know and I'll do some clean up and more testing and send this set to the
lists. Though, we might need to wait for your patch to land before we
can properly send the fix to it (the first patch in my series)...

Logan

[1] https://github.com/sbates130272/linux-p2pmem/ memremap_fix


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-30 22:19               ` Logan Gunthorpe
@ 2018-11-30 22:28                 ` Dan Williams
  2018-11-30 22:34                   ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2018-11-30 22:28 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, stable, Jérôme Glisse,
	Christoph Hellwig, Linus Torvalds, Linux MM,
	Linux Kernel Mailing List, Maling list - DRI developers,
	Bjorn Helgaas, Stephen Bates

On Fri, Nov 30, 2018 at 2:19 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> Hey,
>
> On 2018-11-29 11:51 a.m., Dan Williams wrote:
> > Got it, let me see how bad moving arch_remove_memory() turns out,
> > sounds like a decent approach to coordinate multiple users of a single
> > ref.
>
> I've put together a patch set[1] that fixes all the users of
> devm_memremap_pages() without moving arch_remove_memory(). It's pretty
> clean except for the p2pdma case which is fairly tricky but I don't
> think there's an easy way around that.

The solution I'm trying is to introduce a devm_memremap_pages_remove()
that each user can call after they have called percpu_ref_exit(), it's
just crashing for me currently...

> If you come up with a better solution that's great, otherwise let me
> know and I'll do some clean up and more testing and send this set to the
> lists. Though, we might need to wait for your patch to land before we
> can properly send the fix to it (the first patch in my series)...

I'd say go ahead and send it. We can fix p2pdma as a follow-on. Send
it to Andrew as a patch relative to the current -next tree.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-30 22:28                 ` Dan Williams
@ 2018-11-30 22:34                   ` Logan Gunthorpe
  2018-11-30 22:47                     ` Dan Williams
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2018-11-30 22:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, stable, Jérôme Glisse,
	Christoph Hellwig, Linus Torvalds, Linux MM,
	Linux Kernel Mailing List, Maling list - DRI developers,
	Bjorn Helgaas, Stephen Bates



On 2018-11-30 3:28 p.m., Dan Williams wrote:
> On Fri, Nov 30, 2018 at 2:19 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>> Hey,
>>
>> On 2018-11-29 11:51 a.m., Dan Williams wrote:
>>> Got it, let me see how bad moving arch_remove_memory() turns out,
>>> sounds like a decent approach to coordinate multiple users of a single
>>> ref.
>>
>> I've put together a patch set[1] that fixes all the users of
>> devm_memremap_pages() without moving arch_remove_memory(). It's pretty
>> clean except for the p2pdma case which is fairly tricky but I don't
>> think there's an easy way around that.
> 
> The solution I'm trying is to introduce a devm_memremap_pages_remove()
> that each user can call after they have called percpu_ref_exit(), it's
> just crashing for me currently...

Ok, that's probably less of a clean up for other users, but sounds like
it would be less tricky for p2pdma. I'd have to create a list of all
pgmaps, but that's not so hard and doesn't create any nasty races to
consider like my current solution.

>> If you come up with a better solution that's great, otherwise let me
>> know and I'll do some clean up and more testing and send this set to the
>> lists. Though, we might need to wait for your patch to land before we
>> can properly send the fix to it (the first patch in my series)...
> 
> I'd say go ahead and send it. We can fix p2pdma as a follow-on. Send
> it to Andrew as a patch relative to the current -next tree.

Ok, though, how do I reference the current patch in Andrew's tree? Or
does it matter?

Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling
  2018-11-30 22:34                   ` Logan Gunthorpe
@ 2018-11-30 22:47                     ` Dan Williams
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Williams @ 2018-11-30 22:47 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andrew Morton, stable, Jérôme Glisse,
	Christoph Hellwig, Linus Torvalds, Linux MM,
	Linux Kernel Mailing List, Maling list - DRI developers,
	Bjorn Helgaas, Stephen Bates

On Fri, Nov 30, 2018 at 2:34 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2018-11-30 3:28 p.m., Dan Williams wrote:
> > On Fri, Nov 30, 2018 at 2:19 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> >>
> >> Hey,
> >>
> >> On 2018-11-29 11:51 a.m., Dan Williams wrote:
> >>> Got it, let me see how bad moving arch_remove_memory() turns out,
> >>> sounds like a decent approach to coordinate multiple users of a single
> >>> ref.
> >>
> >> I've put together a patch set[1] that fixes all the users of
> >> devm_memremap_pages() without moving arch_remove_memory(). It's pretty
> >> clean except for the p2pdma case which is fairly tricky but I don't
> >> think there's an easy way around that.
> >
> > The solution I'm trying is to introduce a devm_memremap_pages_remove()
> > that each user can call after they have called percpu_ref_exit(), it's
> > just crashing for me currently...
>
> Ok, that's probably less of a clean up for other users, but sounds like
> it would be less tricky for p2pdma. I'd have to create a list of all
> pgmaps, but that's not so hard and doesn't create any nasty races to
> consider like my current solution.
>
> >> If you come up with a better solution that's great, otherwise let me
> >> know and I'll do some clean up and more testing and send this set to the
> >> lists. Though, we might need to wait for your patch to land before we
> >> can properly send the fix to it (the first patch in my series)...
> >
> > I'd say go ahead and send it. We can fix p2pdma as a follow-on. Send
> > it to Andrew as a patch relative to the current -next tree.
>
> Ok, though, how do I reference the current patch in Andrew's tree? Or
> does it matter?

I would just let Andrew know that this applies incrementally to
"mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl.patch" in
his tree. You can't specify Fixes: tags for pending patches in -mm.
Andrew may choose to squash the change into the existing patch, which
may be the best outcome for not exposing a bisect regression point for
p2pdma.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only
  2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
                   ` (7 preceding siblings ...)
  2018-11-22  1:20 ` [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Andrew Morton
@ 2018-12-03 23:37 ` Jerome Glisse
  8 siblings, 0 replies; 26+ messages in thread
From: Jerome Glisse @ 2018-12-03 23:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, stable, Balbir Singh, Logan Gunthorpe, Christoph Hellwig,
	Michal Hocko, torvalds, linux-mm, linux-kernel, dri-devel

On Wed, Nov 21, 2018 at 05:20:55PM -0800, Andrew Morton wrote:
> On Tue, 20 Nov 2018 15:12:49 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

[...]

> > I am also concerned that HMM was designed in a way to minimize further
> > engagement with the core-MM. That, with these hooks in place,
> > device-drivers are free to implement their own policies without much
> > consideration for whether and how the core-MM could grow to meet that
> > need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
> > core-MM should be allowed the opportunity and stimulus to change and
> > address these new use cases as first class functionality.
> > 
> 
> The arguments are compelling.  I apologize for not thinking of and/or
> not being made aware of them at the time.

So I wanted to comment on that part. Yes, HMM is an impedance layer
that goes both ways, i.e. device drivers are shielded from core mm and
core mm folks do not need to understand individual drivers to modify
mm; they only need to understand what HMM provides to the driver
(and keep HMM's promises intact from the driver's POV, no matter how
that is achieved). So this is intentional.

Nonetheless I want to grow core mm involvement in managing this
memory (see the patchset I just posted about hbind() and heterogeneous
memory systems). But I do not expect that core mm will be in full
control, at least not for some time. The historical reason is that
devices like GPUs are not only used for compute (which is where HMM
gets used) but also for graphics (simple desktop or even games).
Those are two different workloads using different APIs (CUDA/OpenCL
for compute, OpenGL/Vulkan for graphics) on the same underlying
hardware.

Those APIs expose the hardware in incompatible ways when it comes to
memory management (especially APIs like Vulkan). Managing memory
page-wise is not well suited for graphics. The issue comes from the
fact that we do not want to exclude either workload from happening
concurrently (running your desktop while some compute job is running
in the background). So for this to work we need to keep the device
driver in control of its memory (hence the callback when pages are
freed, for instance). We also need to forbid things like pinning any
device memory pages ...


I still expect some commonality to emerge across different hardware
so that we can grow more things and share more code in core mm, but
I want to get there organically, not by forcing everyone into a design
today. I expect this will happen by starting from high-level concepts,
how things get used in userspace from the end user's POV, and working
backward from there to see what common API (if any) we can provide to
cater to those common use cases.

Cheers,
Jérôme



Thread overview: 26+ messages
2018-11-20 23:12 [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Dan Williams
2018-11-20 23:12 ` [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL Dan Williams
2018-11-22 13:30   ` Michal Hocko
2018-11-22 16:38     ` Christoph Hellwig
2018-11-22 16:40       ` Christoph Hellwig
2018-11-23  8:47       ` Michal Hocko
2018-11-20 23:13 ` [PATCH v8 2/7] mm, devm_memremap_pages: Kill mapping "System RAM" support Dan Williams
2018-11-20 23:13 ` [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling Dan Williams
2018-11-27 21:43   ` Logan Gunthorpe
2018-11-29  3:10     ` Dan Williams
2018-11-29 17:06       ` Logan Gunthorpe
2018-11-29 17:30         ` Dan Williams
2018-11-29 17:50           ` Logan Gunthorpe
2018-11-29 18:51             ` Dan Williams
2018-11-30 22:19               ` Logan Gunthorpe
2018-11-30 22:28                 ` Dan Williams
2018-11-30 22:34                   ` Logan Gunthorpe
2018-11-30 22:47                     ` Dan Williams
2018-11-20 23:13 ` [PATCH v8 4/7] mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support Dan Williams
2018-11-23 10:48   ` David Hildenbrand
2018-11-20 23:13 ` [PATCH v8 5/7] mm, hmm: Use devm semantics for hmm_devmem_{add, remove} Dan Williams
2018-11-20 23:13 ` [PATCH v8 6/7] mm, hmm: Replace hmm_devmem_pages_create() with devm_memremap_pages() Dan Williams
2018-11-20 23:13 ` [PATCH v8 7/7] mm, hmm: Mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL Dan Williams
2018-11-22  1:20 ` [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only Andrew Morton
2018-11-25 22:04   ` Pavel Machek
2018-12-03 23:37 ` Jerome Glisse
