linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/10] HMM updates for 5.1
@ 2019-01-29 16:54 jglisse
  2019-01-29 16:54 ` [PATCH 01/10] mm/hmm: use reference counting for HMM struct jglisse
                   ` (12 more replies)
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

This patchset improves the HMM driver API and adds support for hugetlbfs
and DAX mirroring. The motivation for the improvements was to make the ODP
to HMM conversion easier [1]. Because we have nouveau bits scheduled for
5.1, and to avoid any multi-tree synchronization, this patchset adds a few
inline functions that wrap the existing HMM driver API onto the improved
API. The nouveau driver was tested before and after this patchset and it
builds and works in both cases, so there is no merging issue [2]. The
nouveau bits are queued up for 5.1, which is why I added those inlines.
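
For illustration, the wrappers are small static inlines that map the old
entry points onto the new ones; a simplified sketch, based on the
hmm_vma_fault() helper that patch 04 adds (the -EAGAIN/-EBUSY return code
mapping is elided here), looks like this:

    /* Old driver entry point kept as a thin wrapper over the new API. */
    static inline int hmm_vma_fault(struct hmm_range *range, bool block)
    {
        long ret = hmm_range_fault(range, block);

        /* The old API reported 0 on success, a negative errno otherwise. */
        return ret < 0 ? ret : 0;
    }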

If this gets merged in 5.1, the plan is to merge the HMM to ODP conversion
in 5.2, or 5.3 if testing shows any issues (so far no issues have been
found with limited testing, but Mellanox will be running heavier testing
for a longer time).

To avoid spamming mm I would like to not cc mm on ODP or nouveau patches;
however, if people prefer to see those on the mm mailing list then I can
keep it cced.

This is also what I intend to use as a base for the AMD and Intel patches
(a v2, with more added, of some RFCs which were already posted in the past).

[1] https://cgit.freedesktop.org/~glisse/linux/log/?h=odp-hmm
[2] https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-for-5.1

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Dan Williams <dan.j.williams@intel.com>

Jérôme Glisse (10):
  mm/hmm: use reference counting for HMM struct
  mm/hmm: do not erase snapshot when a range is invalidated
  mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot()
  mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault()
  mm/hmm: improve driver API to work and wait over a range
  mm/hmm: add default fault flags to avoid the need to pre-fill pfns
    arrays.
  mm/hmm: add an helper function that fault pages and map them to a
    device
  mm/hmm: support hugetlbfs (snap shoting, faulting and DMA mapping)
  mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  mm/hmm: add helpers for driver to safely take the mmap_sem

 include/linux/hmm.h |  290 ++++++++++--
 mm/hmm.c            | 1060 +++++++++++++++++++++++++++++--------------
 2 files changed, 983 insertions(+), 367 deletions(-)

-- 
2.17.2



* [PATCH 01/10] mm/hmm: use reference counting for HMM struct
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-02-20 23:47   ` John Hubbard
  2019-01-29 16:54 ` [PATCH 02/10] mm/hmm: do not erase snapshot when a range is invalidated jglisse
                   ` (11 subsequent siblings)
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Ralph Campbell,
	John Hubbard, Andrew Morton

From: Jérôme Glisse <jglisse@redhat.com>

Every time I read the code to check that the HMM structure does not
vanish before it should, thanks to the many locks protecting its removal,
I get a headache. Switch to reference counting instead: it is much easier
to follow and harder to break. This also removes some code that is no
longer needed with refcounting.
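
Condensed, the lifetime pattern the diff below introduces is the usual
kref one: every lookup takes a reference and every exit path drops it. A
minimal sketch (hmm_do_something() is a made-up placeholder, not part of
this patch):

    static int hmm_do_something(struct mm_struct *mm)
    {
        struct hmm *hmm = hmm_get(mm);  /* kref_get_unless_zero() inside */

        if (!hmm)
            return -EINVAL;             /* struct hmm already torn down */

        down_read(&hmm->mirrors_sem);
        /* hmm cannot vanish here, no matter what other threads do. */
        up_read(&hmm->mirrors_sem);

        hmm_put(hmm);                   /* kref_put(&hmm->kref, hmm_free) */
        return 0;
    }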

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/hmm.h |   2 +
 mm/hmm.c            | 178 +++++++++++++++++++++++++++++---------------
 2 files changed, 120 insertions(+), 60 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 66f9ebbb1df3..bd6e058597a6 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -131,6 +131,7 @@ enum hmm_pfn_value_e {
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+ * @hmm: the core HMM structure this range is active against
  * @vma: the vm area struct for the range
  * @list: all range lock are on a list
  * @start: range virtual start address (inclusive)
@@ -142,6 +143,7 @@ enum hmm_pfn_value_e {
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
+	struct hmm		*hmm;
 	struct vm_area_struct	*vma;
 	struct list_head	list;
 	unsigned long		start;
diff --git a/mm/hmm.c b/mm/hmm.c
index a04e4b810610..b9f384ea15e9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -50,6 +50,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
  */
 struct hmm {
 	struct mm_struct	*mm;
+	struct kref		kref;
 	spinlock_t		lock;
 	struct list_head	ranges;
 	struct list_head	mirrors;
@@ -57,6 +58,16 @@ struct hmm {
 	struct rw_semaphore	mirrors_sem;
 };
 
+static inline struct hmm *hmm_get(struct mm_struct *mm)
+{
+	struct hmm *hmm = READ_ONCE(mm->hmm);
+
+	if (hmm && kref_get_unless_zero(&hmm->kref))
+		return hmm;
+
+	return NULL;
+}
+
 /*
  * hmm_register - register HMM against an mm (HMM internal)
  *
@@ -67,14 +78,9 @@ struct hmm {
  */
 static struct hmm *hmm_register(struct mm_struct *mm)
 {
-	struct hmm *hmm = READ_ONCE(mm->hmm);
+	struct hmm *hmm = hmm_get(mm);
 	bool cleanup = false;
 
-	/*
-	 * The hmm struct can only be freed once the mm_struct goes away,
-	 * hence we should always have pre-allocated an new hmm struct
-	 * above.
-	 */
 	if (hmm)
 		return hmm;
 
@@ -86,6 +92,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	hmm->mmu_notifier.ops = NULL;
 	INIT_LIST_HEAD(&hmm->ranges);
 	spin_lock_init(&hmm->lock);
+	kref_init(&hmm->kref);
 	hmm->mm = mm;
 
 	spin_lock(&mm->page_table_lock);
@@ -106,7 +113,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
 		goto error_mm;
 
-	return mm->hmm;
+	return hmm;
 
 error_mm:
 	spin_lock(&mm->page_table_lock);
@@ -118,9 +125,41 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	return NULL;
 }
 
+static void hmm_free(struct kref *kref)
+{
+	struct hmm *hmm = container_of(kref, struct hmm, kref);
+	struct mm_struct *mm = hmm->mm;
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
+
+	spin_lock(&mm->page_table_lock);
+	if (mm->hmm == hmm)
+		mm->hmm = NULL;
+	spin_unlock(&mm->page_table_lock);
+
+	kfree(hmm);
+}
+
+static inline void hmm_put(struct hmm *hmm)
+{
+	kref_put(&hmm->kref, hmm_free);
+}
+
 void hmm_mm_destroy(struct mm_struct *mm)
 {
-	kfree(mm->hmm);
+	struct hmm *hmm;
+
+	spin_lock(&mm->page_table_lock);
+	hmm = hmm_get(mm);
+	mm->hmm = NULL;
+	if (hmm) {
+		hmm->mm = NULL;
+		spin_unlock(&mm->page_table_lock);
+		hmm_put(hmm);
+		return;
+	}
+
+	spin_unlock(&mm->page_table_lock);
 }
 
 static int hmm_invalidate_range(struct hmm *hmm, bool device,
@@ -165,7 +204,7 @@ static int hmm_invalidate_range(struct hmm *hmm, bool device,
 static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct hmm_mirror *mirror;
-	struct hmm *hmm = mm->hmm;
+	struct hmm *hmm = hmm_get(mm);
 
 	down_write(&hmm->mirrors_sem);
 	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
@@ -186,36 +225,50 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 						  struct hmm_mirror, list);
 	}
 	up_write(&hmm->mirrors_sem);
+
+	hmm_put(hmm);
 }
 
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *range)
 {
 	struct hmm_update update;
-	struct hmm *hmm = range->mm->hmm;
+	struct hmm *hmm = hmm_get(range->mm);
+	int ret;
 
 	VM_BUG_ON(!hmm);
 
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL)
+		return 0;
+
 	update.start = range->start;
 	update.end = range->end;
 	update.event = HMM_UPDATE_INVALIDATE;
 	update.blockable = range->blockable;
-	return hmm_invalidate_range(hmm, true, &update);
+	ret = hmm_invalidate_range(hmm, true, &update);
+	hmm_put(hmm);
+	return ret;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *range)
 {
 	struct hmm_update update;
-	struct hmm *hmm = range->mm->hmm;
+	struct hmm *hmm = hmm_get(range->mm);
 
 	VM_BUG_ON(!hmm);
 
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL)
+		return;
+
 	update.start = range->start;
 	update.end = range->end;
 	update.event = HMM_UPDATE_INVALIDATE;
 	update.blockable = true;
 	hmm_invalidate_range(hmm, false, &update);
+	hmm_put(hmm);
 }
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
@@ -241,24 +294,13 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 	if (!mm || !mirror || !mirror->ops)
 		return -EINVAL;
 
-again:
 	mirror->hmm = hmm_register(mm);
 	if (!mirror->hmm)
 		return -ENOMEM;
 
 	down_write(&mirror->hmm->mirrors_sem);
-	if (mirror->hmm->mm == NULL) {
-		/*
-		 * A racing hmm_mirror_unregister() is about to destroy the hmm
-		 * struct. Try again to allocate a new one.
-		 */
-		up_write(&mirror->hmm->mirrors_sem);
-		mirror->hmm = NULL;
-		goto again;
-	} else {
-		list_add(&mirror->list, &mirror->hmm->mirrors);
-		up_write(&mirror->hmm->mirrors_sem);
-	}
+	list_add(&mirror->list, &mirror->hmm->mirrors);
+	up_write(&mirror->hmm->mirrors_sem);
 
 	return 0;
 }
@@ -273,33 +315,18 @@ EXPORT_SYMBOL(hmm_mirror_register);
  */
 void hmm_mirror_unregister(struct hmm_mirror *mirror)
 {
-	bool should_unregister = false;
-	struct mm_struct *mm;
-	struct hmm *hmm;
+	struct hmm *hmm = READ_ONCE(mirror->hmm);
 
-	if (mirror->hmm == NULL)
+	if (hmm == NULL)
 		return;
 
-	hmm = mirror->hmm;
 	down_write(&hmm->mirrors_sem);
 	list_del_init(&mirror->list);
-	should_unregister = list_empty(&hmm->mirrors);
+	/* To protect us against double unregister ... */
 	mirror->hmm = NULL;
-	mm = hmm->mm;
-	hmm->mm = NULL;
 	up_write(&hmm->mirrors_sem);
 
-	if (!should_unregister || mm == NULL)
-		return;
-
-	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
-
-	spin_lock(&mm->page_table_lock);
-	if (mm->hmm == hmm)
-		mm->hmm = NULL;
-	spin_unlock(&mm->page_table_lock);
-
-	kfree(hmm);
+	hmm_put(hmm);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
@@ -708,6 +735,8 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
+	range->hmm = NULL;
+
 	/* Sanity check, this really should not happen ! */
 	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
@@ -717,14 +746,18 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm)
 		return -ENOMEM;
-	/* Caller must have registered a mirror, via hmm_mirror_register() ! */
-	if (!hmm->mmu_notifier.ops)
+
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL) {
+		hmm_put(hmm);
 		return -EINVAL;
+	}
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
 			vma_is_dax(vma)) {
 		hmm_pfns_special(range);
+		hmm_put(hmm);
 		return -EINVAL;
 	}
 
@@ -736,6 +769,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 		 * operations such has atomic access would not work.
 		 */
 		hmm_pfns_clear(range, range->pfns, range->start, range->end);
+		hmm_put(hmm);
 		return -EPERM;
 	}
 
@@ -758,6 +792,12 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
 	walk_page_range(range->start, range->end, &mm_walk);
+	/*
+	 * Transfer the hmm reference to the range struct; it will be dropped
+	 * inside the hmm_vma_range_done() function (which _must_ be called if
+	 * this function returns 0).
+	 */
+	range->hmm = hmm;
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -802,25 +842,27 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  */
 bool hmm_vma_range_done(struct hmm_range *range)
 {
-	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
-	struct hmm *hmm;
+	bool ret = false;
 
-	if (range->end <= range->start) {
+	/* Sanity check this really should not happen. */
+	if (range->hmm == NULL || range->end <= range->start) {
 		BUG();
 		return false;
 	}
 
-	hmm = hmm_register(range->vma->vm_mm);
-	if (!hmm) {
-		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
-		return false;
-	}
-
-	spin_lock(&hmm->lock);
+	spin_lock(&range->hmm->lock);
 	list_del_rcu(&range->list);
-	spin_unlock(&hmm->lock);
+	ret = range->valid;
+	spin_unlock(&range->hmm->lock);
 
-	return range->valid;
+	/* Is the mm still alive ? */
+	if (range->hmm->mm == NULL)
+		ret = false;
+
+	/* Drop reference taken by hmm_vma_fault() or hmm_vma_get_pfns() */
+	hmm_put(range->hmm);
+	range->hmm = NULL;
+	return ret;
 }
 EXPORT_SYMBOL(hmm_vma_range_done);
 
@@ -880,6 +922,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 	struct hmm *hmm;
 	int ret;
 
+	range->hmm = NULL;
+
 	/* Sanity check, this really should not happen ! */
 	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
@@ -891,14 +935,18 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
-	/* Caller must have registered a mirror using hmm_mirror_register() */
-	if (!hmm->mmu_notifier.ops)
+
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL) {
+		hmm_put(hmm);
 		return -EINVAL;
+	}
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
 			vma_is_dax(vma)) {
 		hmm_pfns_special(range);
+		hmm_put(hmm);
 		return -EINVAL;
 	}
 
@@ -910,6 +958,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		 * operations such has atomic access would not work.
 		 */
 		hmm_pfns_clear(range, range->pfns, range->start, range->end);
+		hmm_put(hmm);
 		return -EPERM;
 	}
 
@@ -945,7 +994,16 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
 			       range->end);
 		hmm_vma_range_done(range);
+		hmm_put(hmm);
+	} else {
+		/*
+		 * Transfer hmm reference to the range struct it will be drop
+		 * inside the hmm_vma_range_done() function (which _must_ be
+		 * call if this function return 0).
+		 */
+		range->hmm = hmm;
 	}
+
 	return ret;
 }
 EXPORT_SYMBOL(hmm_vma_fault);
-- 
2.17.2



* [PATCH 02/10] mm/hmm: do not erase snapshot when a range is invalidated
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
  2019-01-29 16:54 ` [PATCH 01/10] mm/hmm: use reference counting for HMM struct jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-02-20 23:58   ` John Hubbard
  2019-01-29 16:54 ` [PATCH 03/10] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() jglisse
                   ` (10 subsequent siblings)
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Ralph Campbell, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Users of HMM might be using the snapshot information to do preparatory
steps, like DMA mapping pages to a device, before checking for
invalidation through hmm_vma_range_done(), so do not erase that
information and assume users will do the right thing.
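
A minimal sketch of the driver flow this preserves (the mydev_* helpers
are hypothetical, only the hmm_* calls come from the existing API; the
caller is assumed to hold mmap_sem for read):

    static int mydev_mirror(struct mydev *dev, struct hmm_range *range)
    {
        unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
        int ret;

    again:
        ret = hmm_vma_get_pfns(range);
        if (ret)
            return ret;

        /* Preparatory work based on the snapshot, e.g. DMA mapping. */
        mydev_dma_map_pfns(dev, range->pfns, npages);

        mydev_page_table_lock(dev);
        if (!hmm_vma_range_done(range)) {
            /* Invalidated: pfns are kept intact, only validity is lost. */
            mydev_page_table_unlock(dev);
            mydev_dma_unmap_pfns(dev, range->pfns, npages);
            goto again;
        }
        mydev_populate_page_table(dev, range);
        mydev_page_table_unlock(dev);
        return 0;
    }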

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index b9f384ea15e9..74d69812d6be 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -170,16 +170,10 @@ static int hmm_invalidate_range(struct hmm *hmm, bool device,
 
 	spin_lock(&hmm->lock);
 	list_for_each_entry(range, &hmm->ranges, list) {
-		unsigned long addr, idx, npages;
-
 		if (update->end < range->start || update->start >= range->end)
 			continue;
 
 		range->valid = false;
-		addr = max(update->start, range->start);
-		idx = (addr - range->start) >> PAGE_SHIFT;
-		npages = (min(range->end, update->end) - addr) >> PAGE_SHIFT;
-		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
 	}
 	spin_unlock(&hmm->lock);
 
-- 
2.17.2



* [PATCH 03/10] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot()
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
  2019-01-29 16:54 ` [PATCH 01/10] mm/hmm: use reference counting for HMM struct jglisse
  2019-01-29 16:54 ` [PATCH 02/10] mm/hmm: do not erase snapshot when a range is invalidated jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-02-21  0:25   ` John Hubbard
  2019-01-29 16:54 ` [PATCH 04/10] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() jglisse
                   ` (9 subsequent siblings)
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Ralph Campbell, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Rename for consistency between code, comments and documentation. Also
improve the comments on all the possible return values. Improve the
function by returning the number of populated entries in the pfns array.
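
A hypothetical caller illustrating the new return value (the mydev_*
names are made up for this example; the caller is assumed to hold
mmap_sem for read):

    static long mydev_snapshot(struct mydev *dev, struct hmm_range *range)
    {
        long npages = hmm_range_snapshot(range);

        if (npages < 0)
            return npages;              /* -EINVAL, -EPERM, -EAGAIN, ... */

        mydev_page_table_lock(dev);
        if (!hmm_vma_range_done(range)) {
            mydev_page_table_unlock(dev);
            return -EAGAIN;             /* snapshot was invalidated */
        }
        /* Only the first npages entries of range->pfns[] are valid. */
        mydev_populate_page_table(dev, range->pfns, npages);
        mydev_page_table_unlock(dev);
        return npages;
    }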

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h |  4 ++--
 mm/hmm.c            | 23 ++++++++++-------------
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index bd6e058597a6..ddf49c1b1f5e 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -365,11 +365,11 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  * table invalidation serializes on it.
  *
  * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
- * hmm_vma_get_pfns() WITHOUT ERROR !
+ * hmm_range_snapshot() WITHOUT ERROR !
  *
  * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
  */
-int hmm_vma_get_pfns(struct hmm_range *range);
+long hmm_range_snapshot(struct hmm_range *range);
 bool hmm_vma_range_done(struct hmm_range *range);
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 74d69812d6be..0d9ecd3337e5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -706,23 +706,19 @@ static void hmm_pfns_special(struct hmm_range *range)
 }
 
 /*
- * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
- * @range: range being snapshotted
+ * hmm_range_snapshot() - snapshot CPU page table for a range
+ * @range: range
  * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
- *          vma permission, 0 success
+ *          permission (for instance asking for write and range is read only),
+ *          -EAGAIN if you need to retry, -EFAULT invalid (ie either no valid
+ *          vma or it is illegal to access that range), number of valid pages
+ *          in range->pfns[] (from range start address).
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
  * validity is tracked by range struct. See hmm_vma_range_done() for further
  * information.
- *
- * The range struct is initialized here. It tracks the CPU page table, but only
- * if the function returns success (0), in which case the caller must then call
- * hmm_vma_range_done() to stop CPU page table update tracking on this range.
- *
- * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
- * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
  */
-int hmm_vma_get_pfns(struct hmm_range *range)
+long hmm_range_snapshot(struct hmm_range *range)
 {
 	struct vm_area_struct *vma = range->vma;
 	struct hmm_vma_walk hmm_vma_walk;
@@ -776,6 +772,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	hmm_vma_walk.fault = false;
 	hmm_vma_walk.range = range;
 	mm_walk.private = &hmm_vma_walk;
+	hmm_vma_walk.last = range->start;
 
 	mm_walk.vma = vma;
 	mm_walk.mm = vma->vm_mm;
@@ -792,9 +789,9 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	 * function return 0).
 	 */
 	range->hmm = hmm;
-	return 0;
+	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
-EXPORT_SYMBOL(hmm_vma_get_pfns);
+EXPORT_SYMBOL(hmm_range_snapshot);
 
 /*
  * hmm_vma_range_done() - stop tracking change to CPU page table over a range
-- 
2.17.2



* [PATCH 04/10] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault()
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (2 preceding siblings ...)
  2019-01-29 16:54 ` [PATCH 03/10] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-01-29 16:54 ` [PATCH 05/10] mm/hmm: improve driver API to work and wait over a range jglisse
                   ` (8 subsequent siblings)
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Ralph Campbell, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Rename for consistency between code, comments and documentation. Also
improve the comments on all the possible return values. Improve the
function by returning the number of populated entries in the pfns array.
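
A hypothetical caller illustrating the new return value and the -EBUSY
retry case (the mydev_* names are made up for this example; the caller is
assumed to hold mmap_sem for read and to have filled range->pfns[] with
the desired per-page fault flags):

    static long mydev_fault(struct mydev *dev, struct hmm_range *range)
    {
        long npages;

    again:
        npages = hmm_range_fault(range, true /* block */);
        if (npages == -EBUSY)
            goto again;                 /* range is being invalidated */
        if (npages < 0)
            return npages;              /* -EINVAL, -EPERM, -EFAULT, ... */

        mydev_page_table_lock(dev);
        if (!hmm_vma_range_done(range)) {
            /* CPU page table changed under us, start over. */
            mydev_page_table_unlock(dev);
            goto again;
        }
        /* Only the first npages entries of range->pfns[] are valid. */
        mydev_program_page_table(dev, range->pfns, npages);
        mydev_page_table_unlock(dev);
        return npages;
    }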

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 13 ++++++-
 mm/hmm.c            | 93 ++++++++++++++++++++-------------------------
 2 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index ddf49c1b1f5e..ccf2b630447e 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -391,7 +391,18 @@ bool hmm_vma_range_done(struct hmm_range *range);
  *
  * See the function description in mm/hmm.c for further documentation.
  */
-int hmm_vma_fault(struct hmm_range *range, bool block);
+long hmm_range_fault(struct hmm_range *range, bool block);
+
+/* This is a temporary helper to avoid merge conflict between trees. */
+static inline int hmm_vma_fault(struct hmm_range *range, bool block)
+{
+	long ret = hmm_range_fault(range, block);
+	if (ret == -EBUSY)
+		ret = -EAGAIN;
+	else if (ret == -EAGAIN)
+		ret = -EBUSY;
+	return ret < 0 ? ret : 0;
+}
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);
diff --git a/mm/hmm.c b/mm/hmm.c
index 0d9ecd3337e5..04235455b4d2 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -344,13 +344,13 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 	flags |= write_fault ? FAULT_FLAG_WRITE : 0;
 	ret = handle_mm_fault(vma, addr, flags);
 	if (ret & VM_FAULT_RETRY)
-		return -EBUSY;
+		return -EAGAIN;
 	if (ret & VM_FAULT_ERROR) {
 		*pfn = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
-	return -EAGAIN;
+	return -EBUSY;
 }
 
 static int hmm_pfns_bad(unsigned long addr,
@@ -376,7 +376,7 @@ static int hmm_pfns_bad(unsigned long addr,
  * @fault: should we fault or not ?
  * @write_fault: write fault ?
  * @walk: mm_walk structure
- * Returns: 0 on success, -EAGAIN after page fault, or page fault error
+ * Returns: 0 on success, -EBUSY after page fault, or page fault error
  *
  * This function will be called whenever pmd_none() or pte_none() returns true,
  * or whenever there is no page directory covering the virtual address range.
@@ -399,12 +399,12 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
 
 			ret = hmm_vma_do_fault(walk, addr, write_fault,
 					       &pfns[i]);
-			if (ret != -EAGAIN)
+			if (ret != -EBUSY)
 				return ret;
 		}
 	}
 
-	return (fault || write_fault) ? -EAGAIN : 0;
+	return (fault || write_fault) ? -EBUSY : 0;
 }
 
 static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
@@ -535,11 +535,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	uint64_t orig_pfn = *pfn;
 
 	*pfn = range->values[HMM_PFN_NONE];
-	cpu_flags = pte_to_hmm_pfn_flags(range, pte);
-	hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
-			   &fault, &write_fault);
+	fault = write_fault = false;
 
 	if (pte_none(pte)) {
+		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0,
+				   &fault, &write_fault);
 		if (fault || write_fault)
 			goto fault;
 		return 0;
@@ -578,7 +578,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 				hmm_vma_walk->last = addr;
 				migration_entry_wait(vma->vm_mm,
 						     pmdp, addr);
-				return -EAGAIN;
+				return -EBUSY;
 			}
 			return 0;
 		}
@@ -586,6 +586,10 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		/* Report error for everything else */
 		*pfn = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
+	} else {
+		cpu_flags = pte_to_hmm_pfn_flags(range, pte);
+		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
+				   &fault, &write_fault);
 	}
 
 	if (fault || write_fault)
@@ -636,7 +640,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (fault || write_fault) {
 			hmm_vma_walk->last = addr;
 			pmd_migration_entry_wait(vma->vm_mm, pmdp);
-			return -EAGAIN;
+			return -EBUSY;
 		}
 		return 0;
 	} else if (!pmd_present(pmd))
@@ -858,53 +862,36 @@ bool hmm_vma_range_done(struct hmm_range *range)
 EXPORT_SYMBOL(hmm_vma_range_done);
 
 /*
- * hmm_vma_fault() - try to fault some address in a virtual address range
+ * hmm_range_fault() - try to fault some address in a virtual address range
  * @range: range being faulted
  * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
- * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
+ * Returns: 0 on success, otherwise:
+ *      -EINVAL:
+ *              Invalid argument
+ *      -ENOMEM:
+ *              Out of memory.
+ *      -EPERM:
+ *              Invalid permission (for instance asking for write and range
+ *              is read only).
+ *      -EAGAIN:
+ *              If you need to retry and mmap_sem was dropped. This can only
+ *              happen if the block argument is false.
+ *      -EBUSY:
+ *              If the range is being invalidated and you should wait for
+ *              invalidation to finish.
+ *      -EFAULT:
+ *              Invalid (ie either no valid vma or it is illegal to access that
+ *              range), number of valid pages in range->pfns[] (from range start
+ *              address).
  *
  * This is similar to a regular CPU page fault except that it will not trigger
- * any memory migration if the memory being faulted is not accessible by CPUs.
+ * any memory migration if the memory being faulted is not accessible by CPUs
+ * and caller does not ask for migration.
  *
  * On error, for one virtual address in the range, the function will mark the
  * corresponding HMM pfn entry with an error flag.
- *
- * Expected use pattern:
- * retry:
- *   down_read(&mm->mmap_sem);
- *   // Find vma and address device wants to fault, initialize hmm_pfn_t
- *   // array accordingly
- *   ret = hmm_vma_fault(range, write, block);
- *   switch (ret) {
- *   case -EAGAIN:
- *     hmm_vma_range_done(range);
- *     // You might want to rate limit or yield to play nicely, you may
- *     // also commit any valid pfn in the array assuming that you are
- *     // getting true from hmm_vma_range_monitor_end()
- *     goto retry;
- *   case 0:
- *     break;
- *   case -ENOMEM:
- *   case -EINVAL:
- *   case -EPERM:
- *   default:
- *     // Handle error !
- *     up_read(&mm->mmap_sem)
- *     return;
- *   }
- *   // Take device driver lock that serialize device page table update
- *   driver_lock_device_page_table_update();
- *   hmm_vma_range_done(range);
- *   // Commit pfns we got from hmm_vma_fault()
- *   driver_unlock_device_page_table_update();
- *   up_read(&mm->mmap_sem)
- *
- * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURN SUCCESS (0)
- * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
- *
- * YOU HAVE BEEN WARNED !
  */
-int hmm_vma_fault(struct hmm_range *range, bool block)
+long hmm_range_fault(struct hmm_range *range, bool block)
 {
 	struct vm_area_struct *vma = range->vma;
 	unsigned long start = range->start;
@@ -976,7 +963,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 	do {
 		ret = walk_page_range(start, range->end, &mm_walk);
 		start = hmm_vma_walk.last;
-	} while (ret == -EAGAIN);
+		/* Keep trying while the range is valid. */
+	} while (ret == -EBUSY && range->valid);
 
 	if (ret) {
 		unsigned long i;
@@ -986,6 +974,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 			       range->end);
 		hmm_vma_range_done(range);
 		hmm_put(hmm);
+		return ret;
 	} else {
 		/*
 		 * Transfer hmm reference to the range struct it will be drop
@@ -995,9 +984,9 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		range->hmm = hmm;
 	}
 
-	return ret;
+	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
-EXPORT_SYMBOL(hmm_vma_fault);
+EXPORT_SYMBOL(hmm_range_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-- 
2.17.2



* [PATCH 05/10] mm/hmm: improve driver API to work and wait over a range
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (3 preceding siblings ...)
  2019-01-29 16:54 ` [PATCH 04/10] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-01-29 16:54 ` [PATCH 06/10] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays jglisse
                   ` (7 subsequent siblings)
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Ralph Campbell, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

A common use case for an HMM mirror is a user trying to mirror a range
and, before they can program the hardware, it gets invalidated by some
core mm event. Instead of having the user retry right away to mirror the
range, provide a completion mechanism for them to wait for any active
invalidation affecting the range.

This also changes how hmm_range_snapshot() and hmm_range_fault() work by
not relying on the vma, so that we can drop the mmap_sem when waiting and
look up the vma again on retry.
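
A condensed sketch of the resulting flow (a shorter version of the pseudo
code added to include/linux/hmm.h below; the mydev_* helpers are
hypothetical):

    static long mydev_mirror(struct mydev *dev, struct mm_struct *mm,
                             struct hmm_range *range,
                             unsigned long start, unsigned long end)
    {
        long npages;
        int ret;

        ret = hmm_range_register(range, mm, start, end);
        if (ret)
            return ret;

    again:
        if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
            hmm_range_unregister(range);
            return -EBUSY;              /* timed out waiting on invalidation */
        }

        down_read(&mm->mmap_sem);
        npages = hmm_range_snapshot(range);
        up_read(&mm->mmap_sem);
        if (npages == -EAGAIN)
            goto again;                 /* invalidated while walking */
        if (npages < 0) {
            hmm_range_unregister(range);
            return npages;
        }

        mydev_page_table_lock(dev);
        if (!hmm_range_valid(range)) {
            mydev_page_table_unlock(dev);
            goto again;                 /* changed since the snapshot */
        }
        mydev_populate_page_table(dev, range, npages);
        mydev_page_table_unlock(dev);

        hmm_range_unregister(range);
        return npages;
    }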

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 208 +++++++++++++++---
 mm/hmm.c            | 526 +++++++++++++++++++++-----------------------
 2 files changed, 430 insertions(+), 304 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index ccf2b630447e..93dc88edc293 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -77,8 +77,34 @@
 #include <linux/migrate.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
+#include <linux/mmu_notifier.h>
 
-struct hmm;
+
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting ranges list
+ * @ranges: list of range being snapshotted
+ * @mirrors: list of mirrors for this mm
+ * @mmu_notifier: mmu notifier to track updates to CPU page table
+ * @mirrors_sem: read/write semaphore protecting the mirrors list
+ * @wq: wait queue for user waiting on a range invalidation
+ * @notifiers: count of active mmu notifiers
+ * @dead: is the mm dead ?
+ */
+struct hmm {
+	struct mm_struct	*mm;
+	struct kref		kref;
+	struct mutex		lock;
+	struct list_head	ranges;
+	struct list_head	mirrors;
+	struct mmu_notifier	mmu_notifier;
+	struct rw_semaphore	mirrors_sem;
+	wait_queue_head_t	wq;
+	long			notifiers;
+	bool			dead;
+};
 
 /*
  * hmm_pfn_flag_e - HMM flag enums
@@ -155,6 +181,38 @@ struct hmm_range {
 	bool			valid;
 };
 
+/*
+ * hmm_range_wait_until_valid() - wait for range to be valid
+ * @range: range affected by invalidation to wait on
+ * @timeout: time out for wait in ms (ie abort wait after that period of time)
+ * Returns: true if the range is valid, false otherwise.
+ */
+static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
+					      unsigned long timeout)
+{
+	/* Check if mm is dead ? */
+	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
+		range->valid = false;
+		return false;
+	}
+	if (range->valid)
+		return true;
+	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
+			   msecs_to_jiffies(timeout));
+	/* Return current valid status just in case we get lucky */
+	return range->valid;
+}
+
+/*
+ * hmm_range_valid() - test if a range is valid or not
+ * @range: range
+ * Returns: true if the range is valid, false otherwise.
+ */
+static inline bool hmm_range_valid(struct hmm_range *range)
+{
+	return range->valid;
+}
+
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
  * @range: range use to decode HMM pfn value
@@ -357,51 +415,133 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
 
 /*
- * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
- * driver lock that serializes device page table updates, then call
- * hmm_vma_range_done(), to check if the snapshot is still valid. The same
- * device driver page table update lock must also be used in the
- * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
- * table invalidation serializes on it.
+ * To snapshot the CPU page table you first have to call hmm_range_register()
+ * to register the range. If hmm_range_register() return an error then some-
+ * thing is horribly wrong and you should fail loudly. If it returned true then
+ * you can wait for the range to be stable with hmm_range_wait_until_valid()
+ * function, a range is valid when there are no concurrent changes to the CPU
+ * page table for the range.
+ *
+ * Once the range is valid you can call hmm_range_snapshot() if that returns
+ * without error then you can take your device page table lock (the same lock
+ * you use in the HMM mirror sync_cpu_device_pagetables() callback). After
+ * taking that lock you have to check the range validity, if it is still valid
+ * (ie hmm_range_valid() returns true) then you can program the device page
+ * table, otherwise you have to start again. Pseudo code:
+ *
+ *      mydevice_prefault(mydevice, mm, start, end)
+ *      {
+ *          struct hmm_range range;
+ *          ...
  *
- * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
- * hmm_range_snapshot() WITHOUT ERROR !
+ *          ret = hmm_range_register(&range, mm, start, end);
+ *          if (ret)
+ *              return ret;
  *
- * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
- */
-long hmm_range_snapshot(struct hmm_range *range);
-bool hmm_vma_range_done(struct hmm_range *range);
-
-
-/*
- * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
- * not migrate any device memory back to system memory. The HMM pfn array will
- * be updated with the fault result and current snapshot of the CPU page table
- * for the range.
+ *          down_read(mm->mmap_sem);
+ *      again:
+ *
+ *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
+ *              up_read(&mm->mmap_sem);
+ *              hmm_range_unregister(range);
+ *              // Handle time out, either sleep or retry or something else
+ *              ...
+ *              return -ESOMETHING; || goto again;
+ *          }
+ *
+ *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
+ *          if (ret == -EAGAIN) {
+ *              down_read(mm->mmap_sem);
+ *              goto again;
+ *          } else if (ret == -EBUSY) {
+ *              goto again;
+ *          }
+ *
+ *          up_read(&mm->mmap_sem);
+ *          if (ret) {
+ *              hmm_range_unregister(range);
+ *              return ret;
+ *          }
+ *
+ *          // It might not have snap-shoted the whole range but only the first
+ *          // npages, the return values is the number of valid pages from the
+ *          // start of the range.
+ *          npages = ret;
  *
- * The mmap_sem must be taken in read mode before entering and it might be
- * dropped by the function if the block argument is false. In that case, the
- * function returns -EAGAIN.
+ *          ...
  *
- * Return value does not reflect if the fault was successful for every single
- * address or not. Therefore, the caller must to inspect the HMM pfn array to
- * determine fault status for each address.
+ *          mydevice_page_table_lock(mydevice);
+ *          if (!hmm_range_valid(range)) {
+ *              mydevice_page_table_unlock(mydevice);
+ *              goto again;
+ *          }
  *
- * Trying to fault inside an invalid vma will result in -EINVAL.
+ *          mydevice_populate_page_table(mydevice, range, npages);
+ *          ...
+ *          mydevice_take_page_table_unlock(mydevice);
+ *          hmm_range_unregister(range);
  *
- * See the function description in mm/hmm.c for further documentation.
+ *          return 0;
+ *      }
+ *
+ * The same scheme apply to hmm_range_fault() (ie replace hmm_range_snapshot()
+ * with hmm_range_fault() in above pseudo code).
+ *
+ * YOU MUST CALL hmm_range_unregister() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_range_register() AND hmm_range_register() RETURNED TRUE ! IF YOU DO NOT
+ * FOLLOW THIS RULE MEMORY CORRUPTION WILL ENSUE !
  */
+int hmm_range_register(struct hmm_range *range,
+		       struct mm_struct *mm,
+		       unsigned long start,
+		       unsigned long end);
+void hmm_range_unregister(struct hmm_range *range);
+long hmm_range_snapshot(struct hmm_range *range);
 long hmm_range_fault(struct hmm_range *range, bool block);
 
+/*
+ * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
+ *
+ * When waiting for mmu notifiers we need some kind of time out otherwise we
+ * could potentially wait forever, 1000ms ie 1s sounds like a long time to
+ * wait already.
+ */
+#define HMM_RANGE_DEFAULT_TIMEOUT 1000
+
 /* This is a temporary helper to avoid merge conflict between trees. */
+static inline bool hmm_vma_range_done(struct hmm_range *range)
+{
+	bool ret = hmm_range_valid(range);
+
+	hmm_range_unregister(range);
+	return ret;
+}
+
 static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 {
-	long ret = hmm_range_fault(range, block);
-	if (ret == -EBUSY)
-		ret = -EAGAIN;
-	else if (ret == -EAGAIN)
-		ret = -EBUSY;
-	return ret < 0 ? ret : 0;
+	long ret;
+
+	ret = hmm_range_register(range, range->vma->vm_mm,
+				 range->start, range->end);
+	if (ret)
+		return (int)ret;
+
+	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
+		up_read(&range->vma->vm_mm->mmap_sem);
+		return -EAGAIN;
+	}
+
+	ret = hmm_range_fault(range, block);
+	if (ret <= 0) {
+		if (ret == -EBUSY || !ret) {
+			up_read(&range->vma->vm_mm->mmap_sem);
+			ret = -EBUSY;
+		} else if (ret == -EAGAIN)
+			ret = -EBUSY;
+		hmm_range_unregister(range);
+		return ret;
+	}
+	return 0;
 }
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
diff --git a/mm/hmm.c b/mm/hmm.c
index 04235455b4d2..860ebe5d4b07 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -38,26 +38,6 @@
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
-/*
- * struct hmm - HMM per mm struct
- *
- * @mm: mm struct this HMM struct is bound to
- * @lock: lock protecting ranges list
- * @ranges: list of range being snapshotted
- * @mirrors: list of mirrors for this mm
- * @mmu_notifier: mmu notifier to track updates to CPU page table
- * @mirrors_sem: read/write semaphore protecting the mirrors list
- */
-struct hmm {
-	struct mm_struct	*mm;
-	struct kref		kref;
-	spinlock_t		lock;
-	struct list_head	ranges;
-	struct list_head	mirrors;
-	struct mmu_notifier	mmu_notifier;
-	struct rw_semaphore	mirrors_sem;
-};
-
 static inline struct hmm *hmm_get(struct mm_struct *mm)
 {
 	struct hmm *hmm = READ_ONCE(mm->hmm);
@@ -87,12 +67,15 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 	if (!hmm)
 		return NULL;
+	init_waitqueue_head(&hmm->wq);
 	INIT_LIST_HEAD(&hmm->mirrors);
 	init_rwsem(&hmm->mirrors_sem);
 	hmm->mmu_notifier.ops = NULL;
 	INIT_LIST_HEAD(&hmm->ranges);
-	spin_lock_init(&hmm->lock);
+	mutex_init(&hmm->lock);
 	kref_init(&hmm->kref);
+	hmm->notifiers = 0;
+	hmm->dead = false;
 	hmm->mm = mm;
 
 	spin_lock(&mm->page_table_lock);
@@ -154,6 +137,7 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	mm->hmm = NULL;
 	if (hmm) {
 		hmm->mm = NULL;
+		hmm->dead = true;
 		spin_unlock(&mm->page_table_lock);
 		hmm_put(hmm);
 		return;
@@ -162,43 +146,22 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	spin_unlock(&mm->page_table_lock);
 }
 
-static int hmm_invalidate_range(struct hmm *hmm, bool device,
-				const struct hmm_update *update)
+static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
+	struct hmm *hmm = hmm_get(mm);
 	struct hmm_mirror *mirror;
 	struct hmm_range *range;
 
-	spin_lock(&hmm->lock);
-	list_for_each_entry(range, &hmm->ranges, list) {
-		if (update->end < range->start || update->start >= range->end)
-			continue;
+	/* Report this HMM as dying. */
+	hmm->dead = true;
 
+	/* Wake-up everyone waiting on any range. */
+	mutex_lock(&hmm->lock);
+	list_for_each_entry(range, &hmm->ranges, list) {
 		range->valid = false;
 	}
-	spin_unlock(&hmm->lock);
-
-	if (!device)
-		return 0;
-
-	down_read(&hmm->mirrors_sem);
-	list_for_each_entry(mirror, &hmm->mirrors, list) {
-		int ret;
-
-		ret = mirror->ops->sync_cpu_device_pagetables(mirror, update);
-		if (!update->blockable && ret == -EAGAIN) {
-			up_read(&hmm->mirrors_sem);
-			return -EAGAIN;
-		}
-	}
-	up_read(&hmm->mirrors_sem);
-
-	return 0;
-}
-
-static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
-{
-	struct hmm_mirror *mirror;
-	struct hmm *hmm = hmm_get(mm);
+	wake_up_all(&hmm->wq);
+	mutex_unlock(&hmm->lock);
 
 	down_write(&hmm->mirrors_sem);
 	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
@@ -224,44 +187,88 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 }
 
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
-			const struct mmu_notifier_range *range)
+			const struct mmu_notifier_range *nrange)
 {
+	struct hmm *hmm = hmm_get(nrange->mm);
+	struct hmm_mirror *mirror;
 	struct hmm_update update;
-	struct hmm *hmm = hmm_get(range->mm);
-	int ret;
+	struct hmm_range *range;
+	int ret = 0;
 
 	VM_BUG_ON(!hmm);
 
 	/* Check if hmm_mm_destroy() was call. */
 	if (hmm->mm == NULL)
-		return 0;
+		goto out;
 
-	update.start = range->start;
-	update.end = range->end;
+	update.start = nrange->start;
+	update.end = nrange->end;
 	update.event = HMM_UPDATE_INVALIDATE;
-	update.blockable = range->blockable;
-	ret = hmm_invalidate_range(hmm, true, &update);
+	update.blockable = nrange->blockable;
+
+	if (!nrange->blockable && !mutex_trylock(&hmm->lock)) {
+		ret = -EAGAIN;
+		goto out;
+	} else
+		mutex_lock(&hmm->lock);
+	hmm->notifiers++;
+	list_for_each_entry(range, &hmm->ranges, list) {
+		if (update.end < range->start || update.start >= range->end)
+			continue;
+
+		range->valid = false;
+	}
+	mutex_unlock(&hmm->lock);
+
+
+	if (!nrange->blockable && !down_read_trylock(&hmm->mirrors_sem)) {
+		ret = -EAGAIN;
+		goto out;
+	} else
+		down_read(&hmm->mirrors_sem);
+	list_for_each_entry(mirror, &hmm->mirrors, list) {
+		int ret;
+
+		ret = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
+		if (!update.blockable && ret == -EAGAIN) {
+			up_read(&hmm->mirrors_sem);
+			ret = -EAGAIN;
+			goto out;
+		}
+	}
+	up_read(&hmm->mirrors_sem);
+
+out:
 	hmm_put(hmm);
 	return ret;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
-			const struct mmu_notifier_range *range)
+			const struct mmu_notifier_range *nrange)
 {
-	struct hmm_update update;
-	struct hmm *hmm = hmm_get(range->mm);
+	struct hmm *hmm = hmm_get(nrange->mm);
 
 	VM_BUG_ON(!hmm);
 
 	/* Check if hmm_mm_destroy() was call. */
 	if (hmm->mm == NULL)
-		return;
+		goto out;
 
-	update.start = range->start;
-	update.end = range->end;
-	update.event = HMM_UPDATE_INVALIDATE;
-	update.blockable = true;
-	hmm_invalidate_range(hmm, false, &update);
+	mutex_lock(&hmm->lock);
+	hmm->notifiers--;
+	if (!hmm->notifiers) {
+		struct hmm_range *range;
+
+		list_for_each_entry(range, &hmm->ranges, list) {
+			if (range->valid)
+				continue;
+			range->valid = true;
+		}
+		wake_up_all(&hmm->wq);
+	}
+	mutex_unlock(&hmm->lock);
+
+out:
 	hmm_put(hmm);
 }
 
@@ -413,7 +420,6 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 {
 	struct hmm_range *range = hmm_vma_walk->range;
 
-	*fault = *write_fault = false;
 	if (!hmm_vma_walk->fault)
 		return;
 
@@ -452,10 +458,11 @@ static void hmm_range_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 		return;
 	}
 
+	*fault = *write_fault = false;
 	for (i = 0; i < npages; ++i) {
 		hmm_pte_need_fault(hmm_vma_walk, pfns[i], cpu_flags,
 				   fault, write_fault);
-		if ((*fault) || (*write_fault))
+		if ((*write_fault))
 			return;
 	}
 }
@@ -710,156 +717,152 @@ static void hmm_pfns_special(struct hmm_range *range)
 }
 
 /*
- * hmm_range_snapshot() - snapshot CPU page table for a range
+ * hmm_range_register() - start tracking change to CPU page table over a range
  * @range: range
- * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
- *          permission (for instance asking for write and range is read only),
- *          -EAGAIN if you need to retry, -EFAULT invalid (ie either no valid
- *          vma or it is illegal to access that range), number of valid pages
- *          in range->pfns[] (from range start address).
+ * @mm: the mm struct for the range of virtual address
+ * @start: start virtual address (inclusive)
+ * @end: end virtual address (exclusive)
+ * Returns 0 on success, -EFAULT if the address space is no longer valid
  *
- * This snapshots the CPU page table for a range of virtual addresses. Snapshot
- * validity is tracked by range struct. See hmm_vma_range_done() for further
- * information.
+ * Track updates to the CPU page table see include/linux/hmm.h
  */
-long hmm_range_snapshot(struct hmm_range *range)
+int hmm_range_register(struct hmm_range *range,
+		       struct mm_struct *mm,
+		       unsigned long start,
+		       unsigned long end)
 {
-	struct vm_area_struct *vma = range->vma;
-	struct hmm_vma_walk hmm_vma_walk;
-	struct mm_walk mm_walk;
-	struct hmm *hmm;
-
+	range->start = start & PAGE_MASK;
+	range->end = end & PAGE_MASK;
+	range->valid = false;
 	range->hmm = NULL;
 
-	/* Sanity check, this really should not happen ! */
-	if (range->start < vma->vm_start || range->start >= vma->vm_end)
-		return -EINVAL;
-	if (range->end < vma->vm_start || range->end > vma->vm_end)
+	if (range->start >= range->end)
 		return -EINVAL;
 
-	hmm = hmm_register(vma->vm_mm);
-	if (!hmm)
-		return -ENOMEM;
+	range->hmm = hmm_register(mm);
+	if (!range->hmm)
+		return -EFAULT;
 
 	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL) {
-		hmm_put(hmm);
-		return -EINVAL;
+	if (range->hmm->mm == NULL || range->hmm->dead) {
+		hmm_put(range->hmm);
+		return -EFAULT;
 	}
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
-			vma_is_dax(vma)) {
-		hmm_pfns_special(range);
-		hmm_put(hmm);
-		return -EINVAL;
-	}
+	/* Initialize range to track CPU page table update */
+	mutex_lock(&range->hmm->lock);
 
-	if (!(vma->vm_flags & VM_READ)) {
-		/*
-		 * If vma do not allow read access, then assume that it does
-		 * not allow write access, either. Architecture that allow
-		 * write without read access are not supported by HMM, because
-		 * operations such has atomic access would not work.
-		 */
-		hmm_pfns_clear(range, range->pfns, range->start, range->end);
-		hmm_put(hmm);
-		return -EPERM;
-	}
+	list_add_rcu(&range->list, &range->hmm->ranges);
 
-	/* Initialize range to track CPU page table update */
-	spin_lock(&hmm->lock);
-	range->valid = true;
-	list_add_rcu(&range->list, &hmm->ranges);
-	spin_unlock(&hmm->lock);
-
-	hmm_vma_walk.fault = false;
-	hmm_vma_walk.range = range;
-	mm_walk.private = &hmm_vma_walk;
-	hmm_vma_walk.last = range->start;
-
-	mm_walk.vma = vma;
-	mm_walk.mm = vma->vm_mm;
-	mm_walk.pte_entry = NULL;
-	mm_walk.test_walk = NULL;
-	mm_walk.hugetlb_entry = NULL;
-	mm_walk.pmd_entry = hmm_vma_walk_pmd;
-	mm_walk.pte_hole = hmm_vma_walk_hole;
-
-	walk_page_range(range->start, range->end, &mm_walk);
 	/*
-	 * Transfer hmm reference to the range struct it will be drop inside
-	 * the hmm_vma_range_done() function (which _must_ be call if this
-	 * function return 0).
+	 * If there are any concurrent notifiers we have to wait for them for
+	 * the range to be valid (see hmm_range_wait_until_valid()).
 	 */
-	range->hmm = hmm;
-	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+	if (!range->hmm->notifiers)
+		range->valid = true;
+	mutex_unlock(&range->hmm->lock);
+
+	return 0;
 }
-EXPORT_SYMBOL(hmm_range_snapshot);
+EXPORT_SYMBOL(hmm_range_register);
 
 /*
- * hmm_vma_range_done() - stop tracking change to CPU page table over a range
- * @range: range being tracked
- * Returns: false if range data has been invalidated, true otherwise
+ * hmm_range_unregister() - stop tracking change to CPU page table over a range
+ * @range: range
  *
  * Range struct is used to track updates to the CPU page table after a call to
- * either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is done
- * using the data,  or wants to lock updates to the data it got from those
- * functions, it must call the hmm_vma_range_done() function, which will then
- * stop tracking CPU page table updates.
- *
- * Note that device driver must still implement general CPU page table update
- * tracking either by using hmm_mirror (see hmm_mirror_register()) or by using
- * the mmu_notifier API directly.
- *
- * CPU page table update tracking done through hmm_range is only temporary and
- * to be used while trying to duplicate CPU page table contents for a range of
- * virtual addresses.
- *
- * There are two ways to use this :
- * again:
- *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
- *   trans = device_build_page_table_update_transaction(pfns);
- *   device_page_table_lock();
- *   if (!hmm_vma_range_done(range)) {
- *     device_page_table_unlock();
- *     goto again;
- *   }
- *   device_commit_transaction(trans);
- *   device_page_table_unlock();
- *
- * Or:
- *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
- *   device_page_table_lock();
- *   hmm_vma_range_done(range);
- *   device_update_page_table(range->pfns);
- *   device_page_table_unlock();
+ * hmm_range_register(). See include/linux/hmm.h for how to use it.
  */
-bool hmm_vma_range_done(struct hmm_range *range)
+void hmm_range_unregister(struct hmm_range *range)
 {
-	bool ret = false;
-
 	/* Sanity check this really should not happen. */
-	if (range->hmm == NULL || range->end <= range->start) {
-		BUG();
-		return false;
-	}
+	if (range->hmm == NULL || range->end <= range->start)
+		return;
 
-	spin_lock(&range->hmm->lock);
+	mutex_lock(&range->hmm->lock);
 	list_del_rcu(&range->list);
-	ret = range->valid;
-	spin_unlock(&range->hmm->lock);
-
-	/* Is the mm still alive ? */
-	if (range->hmm->mm == NULL)
-		ret = false;
+	mutex_unlock(&range->hmm->lock);
 
-	/* Drop reference taken by hmm_vma_fault() or hmm_vma_get_pfns() */
+	/* Drop reference taken by hmm_range_register() */
+	range->valid = false;
 	hmm_put(range->hmm);
 	range->hmm = NULL;
-	return ret;
 }
-EXPORT_SYMBOL(hmm_vma_range_done);
+EXPORT_SYMBOL(hmm_range_unregister);
+
+/*
+ * hmm_range_snapshot() - snapshot CPU page table for a range
+ * @range: range
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
+ *          permission (for instance asking for write and range is read only),
+ *          -EAGAIN if you need to retry, -EFAULT invalid (ie either no valid
+ *          vma or it is illegal to access that range), number of valid pages
+ *          in range->pfns[] (from range start address).
+ *
+ * This snapshots the CPU page table for a range of virtual addresses. Snapshot
+ * validity is tracked by range struct. See in include/linux/hmm.h for example
+ * on how to use.
+ */
+long hmm_range_snapshot(struct hmm_range *range)
+{
+	unsigned long start = range->start, end;
+	struct hmm_vma_walk hmm_vma_walk;
+	struct hmm *hmm = range->hmm;
+	struct vm_area_struct *vma;
+	struct mm_walk mm_walk;
+
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL || hmm->dead)
+		return -EFAULT;
+
+	do {
+		/* If range is no longer valid force retry. */
+		if (!range->valid)
+			return -EAGAIN;
+
+		vma = find_vma(hmm->mm, start);
+		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
+			return -EFAULT;
+
+		/* FIXME support hugetlb fs/dax */
+		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
+			hmm_pfns_special(range);
+			return -EINVAL;
+		}
+
+		if (!(vma->vm_flags & VM_READ)) {
+			/*
+			 * If vma do not allow read access, then assume that it
+			 * does not allow write access, either. HMM does not
+			 * support architecture that allow write without read.
+			 */
+			hmm_pfns_clear(range, range->pfns,
+				range->start, range->end);
+			return -EPERM;
+		}
+
+		range->vma = vma;
+		hmm_vma_walk.last = start;
+		hmm_vma_walk.fault = false;
+		hmm_vma_walk.range = range;
+		mm_walk.private = &hmm_vma_walk;
+		end = min(range->end, vma->vm_end);
+
+		mm_walk.vma = vma;
+		mm_walk.mm = vma->vm_mm;
+		mm_walk.pte_entry = NULL;
+		mm_walk.test_walk = NULL;
+		mm_walk.hugetlb_entry = NULL;
+		mm_walk.pmd_entry = hmm_vma_walk_pmd;
+		mm_walk.pte_hole = hmm_vma_walk_hole;
+
+		walk_page_range(start, end, &mm_walk);
+		start = end;
+	} while (start < range->end);
+
+	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+}
+EXPORT_SYMBOL(hmm_range_snapshot);
 
 /*
  * hmm_range_fault() - try to fault some address in a virtual address range
@@ -893,96 +896,79 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  */
 long hmm_range_fault(struct hmm_range *range, bool block)
 {
-	struct vm_area_struct *vma = range->vma;
-	unsigned long start = range->start;
+	unsigned long start = range->start, end;
 	struct hmm_vma_walk hmm_vma_walk;
+	struct hmm *hmm = range->hmm;
+	struct vm_area_struct *vma;
 	struct mm_walk mm_walk;
-	struct hmm *hmm;
 	int ret;
 
-	range->hmm = NULL;
-
-	/* Sanity check, this really should not happen ! */
-	if (range->start < vma->vm_start || range->start >= vma->vm_end)
-		return -EINVAL;
-	if (range->end < vma->vm_start || range->end > vma->vm_end)
-		return -EINVAL;
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL || hmm->dead)
+		return -EFAULT;
 
-	hmm = hmm_register(vma->vm_mm);
-	if (!hmm) {
-		hmm_pfns_clear(range, range->pfns, range->start, range->end);
-		return -ENOMEM;
-	}
+	do {
+		/* If range is no longer valid force retry. */
+		if (!range->valid) {
+			up_read(&hmm->mm->mmap_sem);
+			return -EAGAIN;
+		}
 
-	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL) {
-		hmm_put(hmm);
-		return -EINVAL;
-	}
+		vma = find_vma(hmm->mm, start);
+		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
+			return -EFAULT;
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
-			vma_is_dax(vma)) {
-		hmm_pfns_special(range);
-		hmm_put(hmm);
-		return -EINVAL;
-	}
+		/* FIXME support hugetlb fs/dax */
+		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
+			hmm_pfns_special(range);
+			return -EINVAL;
+		}
 
-	if (!(vma->vm_flags & VM_READ)) {
-		/*
-		 * If vma do not allow read access, then assume that it does
-		 * not allow write access, either. Architecture that allow
-		 * write without read access are not supported by HMM, because
-		 * operations such has atomic access would not work.
-		 */
-		hmm_pfns_clear(range, range->pfns, range->start, range->end);
-		hmm_put(hmm);
-		return -EPERM;
-	}
+		if (!(vma->vm_flags & VM_READ)) {
+			/*
+			 * If vma do not allow read access, then assume that it
+			 * does not allow write access, either. HMM does not
+			 * support architecture that allow write without read.
+			 */
+			hmm_pfns_clear(range, range->pfns,
+				range->start, range->end);
+			return -EPERM;
+		}
 
-	/* Initialize range to track CPU page table update */
-	spin_lock(&hmm->lock);
-	range->valid = true;
-	list_add_rcu(&range->list, &hmm->ranges);
-	spin_unlock(&hmm->lock);
-
-	hmm_vma_walk.fault = true;
-	hmm_vma_walk.block = block;
-	hmm_vma_walk.range = range;
-	mm_walk.private = &hmm_vma_walk;
-	hmm_vma_walk.last = range->start;
-
-	mm_walk.vma = vma;
-	mm_walk.mm = vma->vm_mm;
-	mm_walk.pte_entry = NULL;
-	mm_walk.test_walk = NULL;
-	mm_walk.hugetlb_entry = NULL;
-	mm_walk.pmd_entry = hmm_vma_walk_pmd;
-	mm_walk.pte_hole = hmm_vma_walk_hole;
+		range->vma = vma;
+		hmm_vma_walk.last = start;
+		hmm_vma_walk.fault = true;
+		hmm_vma_walk.block = block;
+		hmm_vma_walk.range = range;
+		mm_walk.private = &hmm_vma_walk;
+		end = min(range->end, vma->vm_end);
+
+		mm_walk.vma = vma;
+		mm_walk.mm = vma->vm_mm;
+		mm_walk.pte_entry = NULL;
+		mm_walk.test_walk = NULL;
+		mm_walk.hugetlb_entry = NULL;
+		mm_walk.pmd_entry = hmm_vma_walk_pmd;
+		mm_walk.pte_hole = hmm_vma_walk_hole;
+
+		do {
+			ret = walk_page_range(start, end, &mm_walk);
+			start = hmm_vma_walk.last;
+
+			/* Keep trying while the range is valid. */
+		} while (ret == -EBUSY && range->valid);
+
+		if (ret) {
+			unsigned long i;
+
+			i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+			hmm_pfns_clear(range, &range->pfns[i],
+				hmm_vma_walk.last, range->end);
+			return ret;
+		}
+		start = end;
 
-	do {
-		ret = walk_page_range(start, range->end, &mm_walk);
-		start = hmm_vma_walk.last;
-		/* Keep trying while the range is valid. */
-	} while (ret == -EBUSY && range->valid);
-
-	if (ret) {
-		unsigned long i;
-
-		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
-			       range->end);
-		hmm_vma_range_done(range);
-		hmm_put(hmm);
-		return ret;
-	} else {
-		/*
-		 * Transfer hmm reference to the range struct it will be drop
-		 * inside the hmm_vma_range_done() function (which _must_ be
-		 * call if this function return 0).
-		 */
-		range->hmm = hmm;
-	}
+	} while (start < range->end);
 
 	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 06/10] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (4 preceding siblings ...)
  2019-01-29 16:54 ` [PATCH 05/10] mm/hmm: improve driver API to work and wait over a range jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-01-29 16:54 ` [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device jglisse
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 98+ messages in thread
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Ralph Campbell, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

The HMM mirror API can be used in two fashions. In the first, the HMM
user coalesces multiple page faults into one request and sets flags per
pfn for each of those faults. In the second, the HMM user wants to
pre-fault a range with specific flags. For the latter it is a waste to
have the user pre-fill the pfn array with a default flags value.

This patch adds a default flags value (and a pfn flags mask) allowing
the user to set them for a whole range without having to pre-fill the
pfn array.
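
As a rough illustration of the second fashion (this snippet is not part
of the patch and the function name is made up), a driver that wants
every page of a range faulted with read/write permission can now do:

        static long example_prefault_rw(struct hmm_range *range)
        {
                /*
                 * Sketch only: assumes the range was already registered
                 * and the mmap_sem is held as documented in hmm.h.
                 */

                /* Ask for read/write on every page of the range ... */
                range->default_flags = range->flags[HMM_PFN_VALID] |
                                       range->flags[HMM_PFN_WRITE];
                /* ... and ignore whatever was left in the pfns array. */
                range->pfn_flags_mask = 0;

                return hmm_range_fault(range, true);
        }

instead of looping over range->pfns[] to set the same flags on every
entry before the call.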

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h |  7 +++++++
 mm/hmm.c            | 12 ++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 93dc88edc293..4263f8fb32e5 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
  * @pfns: array of pfns (big enough for the range)
  * @flags: pfn flags to match device driver page table
  * @values: pfn value for some special case (none, special, error, ...)
+ * @default_flags: default flags for the range (write, read, ...)
+ * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
  * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
@@ -177,6 +179,8 @@ struct hmm_range {
 	uint64_t		*pfns;
 	const uint64_t		*flags;
 	const uint64_t		*values;
+	uint64_t		default_flags;
+	uint64_t		pfn_flags_mask;
 	uint8_t			pfn_shift;
 	bool			valid;
 };
@@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 {
 	long ret;
 
+	range->default_flags = 0;
+	range->pfn_flags_mask = -1UL;
+
 	ret = hmm_range_register(range, range->vma->vm_mm,
 				 range->start, range->end);
 	if (ret)
diff --git a/mm/hmm.c b/mm/hmm.c
index 860ebe5d4b07..0a4ff31e9d7a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -423,6 +423,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 	if (!hmm_vma_walk->fault)
 		return;
 
+	/*
+	 * So we not only consider the individual per page request we also
+	 * consider the default flags requested for the range. The API can
+	 * be use in 2 fashions. The first one where the HMM user coalesce
+	 * multiple page fault into one request and set flags per pfns for
+	 * of those faults. The second one where the HMM user want to pre-
+	 * fault a range with specific flags. For the latter one it is a
+	 * waste to have the user pre-fill the pfn arrays with a default
+	 * flags value.
+	 */
+	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
+
 	/* We aren't ask to do anything ... */
 	if (!(pfns & range->flags[HMM_PFN_VALID]))
 		return;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (5 preceding siblings ...)
  2019-01-29 16:54 ` [PATCH 06/10] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-03-18 20:21   ` Dan Williams
  2019-01-29 16:54 ` [PATCH 08/10] mm/hmm: support hugetlbfs (snap shoting, faulting and DMA mapping) jglisse
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 98+ messages in thread
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Ralph Campbell, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This is an all-in-one helper that faults pages in a range and maps them
to a device, so that every single device driver does not have to
re-implement this common pattern.
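
As a hedged sketch of the intended usage (the function below is made up
for illustration; locking, the retry on -EAGAIN/-EBUSY and the device
page table programming are elided), a driver allocates one dma_addr_t
per page and lets the helper do the fault plus dma_map_page() loop:

        static long example_map_range(struct device *dev,
                                      struct hmm_range *range)
        {
                unsigned long npages;
                dma_addr_t *daddrs;
                long ret;

                npages = (range->end - range->start) >> PAGE_SHIFT;
                daddrs = kvmalloc_array(npages, sizeof(*daddrs), GFP_KERNEL);
                if (!daddrs)
                        return -ENOMEM;

                /* Fault the pages and dma map them in one call. */
                ret = hmm_range_dma_map(range, dev, daddrs, true);
                if (ret < 0)
                        goto out;

                /* ... program the device page table from daddrs[] ... */

                /* When done (teardown or invalidation), undo the mapping. */
                ret = hmm_range_dma_unmap(range, NULL, dev, daddrs, true);
        out:
                kvfree(daddrs);
                return ret;
        }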

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h |   9 +++
 mm/hmm.c            | 152 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 161 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4263f8fb32e5..fc3630d0bbfd 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -502,6 +502,15 @@ int hmm_range_register(struct hmm_range *range,
 void hmm_range_unregister(struct hmm_range *range);
 long hmm_range_snapshot(struct hmm_range *range);
 long hmm_range_fault(struct hmm_range *range, bool block);
+long hmm_range_dma_map(struct hmm_range *range,
+		       struct device *device,
+		       dma_addr_t *daddrs,
+		       bool block);
+long hmm_range_dma_unmap(struct hmm_range *range,
+			 struct vm_area_struct *vma,
+			 struct device *device,
+			 dma_addr_t *daddrs,
+			 bool dirty);
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
diff --git a/mm/hmm.c b/mm/hmm.c
index 0a4ff31e9d7a..9cd68334a759 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -30,6 +30,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memremap.h>
 #include <linux/jump_label.h>
+#include <linux/dma-mapping.h>
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
@@ -985,6 +986,157 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
 EXPORT_SYMBOL(hmm_range_fault);
+
+/*
+ * hmm_range_dma_map() - hmm_range_fault() and dma map page all in one.
+ * @range: range being faulted
+ * @device: device against to dma map page to
+ * @daddrs: dma address of mapped pages
+ * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
+ * Returns: number of pages mapped on success, -EAGAIN if mmap_sem have been
+ *          drop and you need to try again, some other error value otherwise
+ *
+ * Note same usage pattern as hmm_range_fault().
+ */
+long hmm_range_dma_map(struct hmm_range *range,
+		       struct device *device,
+		       dma_addr_t *daddrs,
+		       bool block)
+{
+	unsigned long i, npages, mapped;
+	long ret;
+
+	ret = hmm_range_fault(range, block);
+	if (ret <= 0)
+		return ret ? ret : -EBUSY;
+
+	npages = (range->end - range->start) >> PAGE_SHIFT;
+	for (i = 0, mapped = 0; i < npages; ++i) {
+		enum dma_data_direction dir = DMA_FROM_DEVICE;
+		struct page *page;
+
+		/*
+		 * FIXME need to update DMA API to provide invalid DMA address
+		 * value instead of a function to test dma address value. This
+		 * would remove lot of dumb code duplicated accross many arch.
+		 *
+		 * For now setting it to 0 here is good enough as the pfns[]
+		 * value is what is use to check what is valid and what isn't.
+		 */
+		daddrs[i] = 0;
+
+		page = hmm_pfn_to_page(range, range->pfns[i]);
+		if (page == NULL)
+			continue;
+
+		/* Check if range is being invalidated */
+		if (!range->valid) {
+			ret = -EBUSY;
+			goto unmap;
+		}
+
+		/* If it is read and write than map bi-directional. */
+		if (range->pfns[i] & range->values[HMM_PFN_WRITE])
+			dir = DMA_BIDIRECTIONAL;
+
+		daddrs[i] = dma_map_page(device, page, 0, PAGE_SIZE, dir);
+		if (dma_mapping_error(device, daddrs[i])) {
+			ret = -EFAULT;
+			goto unmap;
+		}
+
+		mapped++;
+	}
+
+	return mapped;
+
+unmap:
+	for (npages = i, i = 0; (i < npages) && mapped; ++i) {
+		enum dma_data_direction dir = DMA_FROM_DEVICE;
+		struct page *page;
+
+		page = hmm_pfn_to_page(range, range->pfns[i]);
+		if (page == NULL)
+			continue;
+
+		if (dma_mapping_error(device, daddrs[i]))
+			continue;
+
+		/* If it is read and write than map bi-directional. */
+		if (range->pfns[i] & range->values[HMM_PFN_WRITE])
+			dir = DMA_BIDIRECTIONAL;
+
+		dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
+		mapped--;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(hmm_range_dma_map);
+
+/*
+ * hmm_range_dma_unmap() - unmap range of that was map with hmm_range_dma_map()
+ * @range: range being unmapped
+ * @vma: the vma against which the range (optional)
+ * @device: device against which dma map was done
+ * @daddrs: dma address of mapped pages
+ * @dirty: dirty page if it had the write flag set
+ * Returns: number of page unmapped on success, -EINVAL otherwise
+ *
+ * Note that caller MUST abide by mmu notifier or use HMM mirror and abide
+ * to the sync_cpu_device_pagetables() callback so that it is safe here to
+ * call set_page_dirty(). Caller must also take appropriate locks to avoid
+ * concurrent mmu notifier or sync_cpu_device_pagetables() to make progress.
+ */
+long hmm_range_dma_unmap(struct hmm_range *range,
+			 struct vm_area_struct *vma,
+			 struct device *device,
+			 dma_addr_t *daddrs,
+			 bool dirty)
+{
+	unsigned long i, npages;
+	long cpages = 0;
+
+	/* Sanity check. */
+	if (range->end <= range->start)
+		return -EINVAL;
+	if (!daddrs)
+		return -EINVAL;
+	if (!range->pfns)
+		return -EINVAL;
+
+	npages = (range->end - range->start) >> PAGE_SHIFT;
+	for (i = 0; i < npages; ++i) {
+		enum dma_data_direction dir = DMA_FROM_DEVICE;
+		struct page *page;
+
+		page = hmm_pfn_to_page(range, range->pfns[i]);
+		if (page == NULL)
+			continue;
+
+		/* If it is read and write than map bi-directional. */
+		if (range->pfns[i] & range->values[HMM_PFN_WRITE]) {
+			dir = DMA_BIDIRECTIONAL;
+
+			/*
+			 * See comments in function description on why it is
+			 * safe here to call set_page_dirty()
+			 */
+			if (dirty)
+				set_page_dirty(page);
+		}
+
+		/* Unmap and clear pfns/dma address */
+		dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
+		range->pfns[i] = range->values[HMM_PFN_NONE];
+		/* FIXME see comments in hmm_vma_dma_map() */
+		daddrs[i] = 0;
+		cpages++;
+	}
+
+	return cpages;
+}
+EXPORT_SYMBOL(hmm_range_dma_unmap);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 08/10] mm/hmm: support hugetlbfs (snap shoting, faulting and DMA mapping)
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (6 preceding siblings ...)
  2019-01-29 16:54 ` [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-01-29 16:54 ` [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem jglisse
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 98+ messages in thread
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Ralph Campbell, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This adds support for hugetlbfs so that an HMM user can mirror a range
of virtual addresses backed by hugetlbfs. Note that the range now allows
the user to optimize DMA mapping of such pages so that a huge page can
be mapped as one chunk.
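
As a hedged example (not part of this patch), a driver that knows the
mirrored vma is backed by 2MB hugetlbfs pages could register the range
with the matching page shift, so that each entry of the pfns array then
covers a whole huge page:

        /* Sketch only: PMD_SHIFT is the 2MB huge page shift on x86-64. */
        ret = hmm_range_register(range, mm, start, end, PMD_SHIFT);
        if (ret)
                return ret;

        ret = hmm_range_fault(range, true);

A driver that does not care keeps passing PAGE_SHIFT, which is what the
legacy hmm_vma_fault() wrapper does, and gets one entry per base page as
before.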

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h |  29 ++++++++-
 mm/hmm.c            | 141 +++++++++++++++++++++++++++++++++++++-------
 2 files changed, 147 insertions(+), 23 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index fc3630d0bbfd..b3850297352f 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -181,10 +181,31 @@ struct hmm_range {
 	const uint64_t		*values;
 	uint64_t		default_flags;
 	uint64_t		pfn_flags_mask;
+	uint8_t			page_shift;
 	uint8_t			pfn_shift;
 	bool			valid;
 };
 
+/*
+ * hmm_range_page_shift() - return the page shift for the range
+ * @range: range being queried
+ * Returns: page shift (page size = 1 << page shift) for the range
+ */
+static inline unsigned hmm_range_page_shift(const struct hmm_range *range)
+{
+	return range->page_shift;
+}
+
+/*
+ * hmm_range_page_size() - return the page size for the range
+ * @range: range being queried
+ * Returns: page size for the range in bytes
+ */
+static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
+{
+	return 1UL << hmm_range_page_shift(range);
+}
+
 /*
  * hmm_range_wait_until_valid() - wait for range to be valid
  * @range: range affected by invalidation to wait on
@@ -438,7 +459,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  *          struct hmm_range range;
  *          ...
  *
- *          ret = hmm_range_register(&range, mm, start, end);
+ *          ret = hmm_range_register(&range, mm, start, end, page_shift);
  *          if (ret)
  *              return ret;
  *
@@ -498,7 +519,8 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 int hmm_range_register(struct hmm_range *range,
 		       struct mm_struct *mm,
 		       unsigned long start,
-		       unsigned long end);
+		       unsigned long end,
+		       unsigned page_shift);
 void hmm_range_unregister(struct hmm_range *range);
 long hmm_range_snapshot(struct hmm_range *range);
 long hmm_range_fault(struct hmm_range *range, bool block);
@@ -538,7 +560,8 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	range->pfn_flags_mask = -1UL;
 
 	ret = hmm_range_register(range, range->vma->vm_mm,
-				 range->start, range->end);
+				 range->start, range->end,
+				 PAGE_SHIFT);
 	if (ret)
 		return (int)ret;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 9cd68334a759..8b87e1813313 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -396,11 +396,13 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	uint64_t *pfns = range->pfns;
-	unsigned long i;
+	unsigned long i, page_size;
 
 	hmm_vma_walk->last = addr;
-	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++) {
+	page_size = 1UL << range->page_shift;
+	i = (addr - range->start) >> range->page_shift;
+
+	for (; addr < end; addr += page_size, i++) {
 		pfns[i] = range->values[HMM_PFN_NONE];
 		if (fault || write_fault) {
 			int ret;
@@ -712,6 +714,69 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
+static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
+				      unsigned long start, unsigned long end,
+				      struct mm_walk *walk)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	unsigned long addr = start, i, pfn, mask, size, pfn_inc;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	struct vm_area_struct *vma = walk->vma;
+	struct hstate *h = hstate_vma(vma);
+	uint64_t orig_pfn, cpu_flags;
+	bool fault, write_fault;
+	spinlock_t *ptl;
+	pte_t entry;
+	int ret = 0;
+
+	size = 1UL << huge_page_shift(h);
+	mask = size - 1;
+	if (range->page_shift != PAGE_SHIFT) {
+		/* Make sure we are looking at full page. */
+		if (start & mask)
+			return -EINVAL;
+		if (end < (start + size))
+			return -EINVAL;
+		pfn_inc = size >> PAGE_SHIFT;
+	} else {
+		pfn_inc = 1;
+		size = PAGE_SIZE;
+	}
+
+
+	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
+	entry = huge_ptep_get(pte);
+
+	i = (start - range->start) >> range->page_shift;
+	orig_pfn = range->pfns[i];
+	range->pfns[i] = range->values[HMM_PFN_NONE];
+	cpu_flags = pte_to_hmm_pfn_flags(range, entry);
+	fault = write_fault = false;
+	hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
+			   &fault, &write_fault);
+	if (fault || write_fault) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	pfn = pte_pfn(entry) + (start & mask);
+	for (; addr < end; addr += size, i++, pfn += pfn_inc)
+		range->pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
+	hmm_vma_walk->last = end;
+
+unlock:
+	spin_unlock(ptl);
+
+	if (ret == -ENOENT)
+		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
+
+	return ret;
+#else /* CONFIG_HUGETLB_PAGE */
+	return -EINVAL;
+#endif
+}
+
 static void hmm_pfns_clear(struct hmm_range *range,
 			   uint64_t *pfns,
 			   unsigned long addr,
@@ -735,6 +800,7 @@ static void hmm_pfns_special(struct hmm_range *range)
  * @mm: the mm struct for the range of virtual address
  * @start: start virtual address (inclusive)
  * @end: end virtual address (exclusive)
+ * @page_shift: expect page shift for the range
  * Returns 0 on success, -EFAULT if the address space is no longer valid
  *
  * Track updates to the CPU page table see include/linux/hmm.h
@@ -742,15 +808,22 @@ static void hmm_pfns_special(struct hmm_range *range)
 int hmm_range_register(struct hmm_range *range,
 		       struct mm_struct *mm,
 		       unsigned long start,
-		       unsigned long end)
+		       unsigned long end,
+		       unsigned page_shift)
 {
-	range->start = start & PAGE_MASK;
-	range->end = end & PAGE_MASK;
+	unsigned long mask = ((1UL << page_shift) - 1UL);
+
 	range->valid = false;
 	range->hmm = NULL;
 
-	if (range->start >= range->end)
+	if ((start & mask) || (end & mask))
 		return -EINVAL;
+	if (start >= end)
+		return -EINVAL;
+
+	range->page_shift = page_shift;
+	range->start = start;
+	range->end = end;
 
 	range->hmm = hmm_register(mm);
 	if (!range->hmm)
@@ -818,6 +891,7 @@ EXPORT_SYMBOL(hmm_range_unregister);
  */
 long hmm_range_snapshot(struct hmm_range *range)
 {
+	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
 	unsigned long start = range->start, end;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct hmm *hmm = range->hmm;
@@ -834,15 +908,26 @@ long hmm_range_snapshot(struct hmm_range *range)
 			return -EAGAIN;
 
 		vma = find_vma(hmm->mm, start);
-		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
+		if (vma == NULL || (vma->vm_flags & device_vma))
 			return -EFAULT;
 
-		/* FIXME support hugetlb fs/dax */
-		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
+		/* FIXME support dax */
+		if (vma_is_dax(vma)) {
 			hmm_pfns_special(range);
 			return -EINVAL;
 		}
 
+		if (is_vm_hugetlb_page(vma)) {
+			struct hstate *h = hstate_vma(vma);
+
+			if (huge_page_shift(h) != range->page_shift &&
+			    range->page_shift != PAGE_SHIFT)
+				return -EINVAL;
+		} else {
+			if (range->page_shift != PAGE_SHIFT)
+				return -EINVAL;
+		}
+
 		if (!(vma->vm_flags & VM_READ)) {
 			/*
 			 * If vma do not allow read access, then assume that it
@@ -868,6 +953,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 		mm_walk.hugetlb_entry = NULL;
 		mm_walk.pmd_entry = hmm_vma_walk_pmd;
 		mm_walk.pte_hole = hmm_vma_walk_hole;
+		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
 
 		walk_page_range(start, end, &mm_walk);
 		start = end;
@@ -909,6 +995,7 @@ EXPORT_SYMBOL(hmm_range_snapshot);
  */
 long hmm_range_fault(struct hmm_range *range, bool block)
 {
+	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
 	unsigned long start = range->start, end;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct hmm *hmm = range->hmm;
@@ -928,15 +1015,26 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		}
 
 		vma = find_vma(hmm->mm, start);
-		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
+		if (vma == NULL || (vma->vm_flags & device_vma))
 			return -EFAULT;
 
-		/* FIXME support hugetlb fs/dax */
-		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
+		/* FIXME support dax */
+		if (vma_is_dax(vma)) {
 			hmm_pfns_special(range);
 			return -EINVAL;
 		}
 
+		if (is_vm_hugetlb_page(vma)) {
+			struct hstate *h = hstate_vma(vma);
+
+			if (huge_page_shift(h) != range->page_shift &&
+			    range->page_shift != PAGE_SHIFT)
+				return -EINVAL;
+		} else {
+			if (range->page_shift != PAGE_SHIFT)
+				return -EINVAL;
+		}
+
 		if (!(vma->vm_flags & VM_READ)) {
 			/*
 			 * If vma do not allow read access, then assume that it
@@ -963,6 +1061,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		mm_walk.hugetlb_entry = NULL;
 		mm_walk.pmd_entry = hmm_vma_walk_pmd;
 		mm_walk.pte_hole = hmm_vma_walk_hole;
+		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
 
 		do {
 			ret = walk_page_range(start, end, &mm_walk);
@@ -1003,14 +1102,15 @@ long hmm_range_dma_map(struct hmm_range *range,
 		       dma_addr_t *daddrs,
 		       bool block)
 {
-	unsigned long i, npages, mapped;
+	unsigned long i, npages, mapped, page_size;
 	long ret;
 
 	ret = hmm_range_fault(range, block);
 	if (ret <= 0)
 		return ret ? ret : -EBUSY;
 
-	npages = (range->end - range->start) >> PAGE_SHIFT;
+	page_size = hmm_range_page_size(range);
+	npages = (range->end - range->start) >> range->page_shift;
 	for (i = 0, mapped = 0; i < npages; ++i) {
 		enum dma_data_direction dir = DMA_FROM_DEVICE;
 		struct page *page;
@@ -1039,7 +1139,7 @@ long hmm_range_dma_map(struct hmm_range *range,
 		if (range->pfns[i] & range->values[HMM_PFN_WRITE])
 			dir = DMA_BIDIRECTIONAL;
 
-		daddrs[i] = dma_map_page(device, page, 0, PAGE_SIZE, dir);
+		daddrs[i] = dma_map_page(device, page, 0, page_size, dir);
 		if (dma_mapping_error(device, daddrs[i])) {
 			ret = -EFAULT;
 			goto unmap;
@@ -1066,7 +1166,7 @@ long hmm_range_dma_map(struct hmm_range *range,
 		if (range->pfns[i] & range->values[HMM_PFN_WRITE])
 			dir = DMA_BIDIRECTIONAL;
 
-		dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
+		dma_unmap_page(device, daddrs[i], page_size, dir);
 		mapped--;
 	}
 
@@ -1094,7 +1194,7 @@ long hmm_range_dma_unmap(struct hmm_range *range,
 			 dma_addr_t *daddrs,
 			 bool dirty)
 {
-	unsigned long i, npages;
+	unsigned long i, npages, page_size;
 	long cpages = 0;
 
 	/* Sanity check. */
@@ -1105,7 +1205,8 @@ long hmm_range_dma_unmap(struct hmm_range *range,
 	if (!range->pfns)
 		return -EINVAL;
 
-	npages = (range->end - range->start) >> PAGE_SHIFT;
+	page_size = hmm_range_page_size(range);
+	npages = (range->end - range->start) >> range->page_shift;
 	for (i = 0; i < npages; ++i) {
 		enum dma_data_direction dir = DMA_FROM_DEVICE;
 		struct page *page;
@@ -1127,7 +1228,7 @@ long hmm_range_dma_unmap(struct hmm_range *range,
 		}
 
 		/* Unmap and clear pfns/dma address */
-		dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
+		dma_unmap_page(device, daddrs[i], page_size, dir);
 		range->pfns[i] = range->values[HMM_PFN_NONE];
 		/* FIXME see comments in hmm_vma_dma_map() */
 		daddrs[i] = 0;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (7 preceding siblings ...)
  2019-01-29 16:54 ` [PATCH 08/10] mm/hmm: support hugetlbfs (snap shoting, faulting and DMA mapping) jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-01-29 18:41   ` Dan Williams
  2019-01-29 16:54 ` [PATCH 10/10] mm/hmm: add helpers for driver to safely take the mmap_sem jglisse
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 98+ messages in thread
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Dan Williams, Ralph Campbell, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This adds support for mirroring a vma which is an mmap of a file on a
filesystem that uses a DAX block device. There is no reason not to
support that case.

Note that unlike the GUP code we do not take a page reference, hence
when we back off we have nothing to undo.
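
As a rough illustration (sketch only, not from the patch), from the
mirror's point of view a DAX backed range now simply produces ordinary
entries in the pfns array, so a driver can post-process a snapshot
without any DAX specific casing:

        /* Sketch: iterate the snapshot, DAX pages need no special case. */
        npages = (range->end - range->start) >> PAGE_SHIFT;
        for (i = 0; i < npages; ++i) {
                struct page *page = hmm_pfn_to_page(range, range->pfns[i]);

                if (!page)
                        continue;       /* hole, or a fault is needed */
                /* ... program the device page table for this page ... */
        }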

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 112 insertions(+), 21 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 8b87e1813313..1a444885404e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -334,6 +334,7 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
 
 struct hmm_vma_walk {
 	struct hmm_range	*range;
+	struct dev_pagemap	*pgmap;
 	unsigned long		last;
 	bool			fault;
 	bool			block;
@@ -508,6 +509,15 @@ static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
 				range->flags[HMM_PFN_VALID];
 }
 
+static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud)
+{
+	if (!pud_present(pud))
+		return 0;
+	return pud_write(pud) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
+}
+
 static int hmm_vma_handle_pmd(struct mm_walk *walk,
 			      unsigned long addr,
 			      unsigned long end,
@@ -529,8 +539,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 
 	pfn = pmd_pfn(pmd) + pte_index(addr);
-	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
+	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+		if (pmd_devmap(pmd)) {
+			hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
+					      hmm_vma_walk->pgmap);
+			if (unlikely(!hmm_vma_walk->pgmap))
+				return -EBUSY;
+		}
 		pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
+	}
+	if (hmm_vma_walk->pgmap) {
+		put_dev_pagemap(hmm_vma_walk->pgmap);
+		hmm_vma_walk->pgmap = NULL;
+	}
 	hmm_vma_walk->last = end;
 	return 0;
 }
@@ -617,10 +638,24 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	if (fault || write_fault)
 		goto fault;
 
+	if (pte_devmap(pte)) {
+		hmm_vma_walk->pgmap = get_dev_pagemap(pte_pfn(pte),
+					      hmm_vma_walk->pgmap);
+		if (unlikely(!hmm_vma_walk->pgmap))
+			return -EBUSY;
+	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
+		*pfn = range->values[HMM_PFN_SPECIAL];
+		return -EFAULT;
+	}
+
 	*pfn = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;
 	return 0;
 
 fault:
+	if (hmm_vma_walk->pgmap) {
+		put_dev_pagemap(hmm_vma_walk->pgmap);
+		hmm_vma_walk->pgmap = NULL;
+	}
 	pte_unmap(ptep);
 	/* Fault any virtual address we were asked to fault */
 	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
@@ -708,12 +743,84 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			return r;
 		}
 	}
+	if (hmm_vma_walk->pgmap) {
+		put_dev_pagemap(hmm_vma_walk->pgmap);
+		hmm_vma_walk->pgmap = NULL;
+	}
 	pte_unmap(ptep - 1);
 
 	hmm_vma_walk->last = addr;
 	return 0;
 }
 
+static int hmm_vma_walk_pud(pud_t *pudp,
+			    unsigned long start,
+			    unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = start, next;
+	pmd_t *pmdp;
+	pud_t pud;
+	int ret;
+
+again:
+	pud = READ_ONCE(*pudp);
+	if (pud_none(pud))
+		return hmm_vma_walk_hole(start, end, walk);
+
+	if (pud_huge(pud) && pud_devmap(pud)) {
+		unsigned long i, npages, pfn;
+		uint64_t *pfns, cpu_flags;
+		bool fault, write_fault;
+
+		if (!pud_present(pud))
+			return hmm_vma_walk_hole(start, end, walk);
+
+		i = (addr - range->start) >> PAGE_SHIFT;
+		npages = (end - addr) >> PAGE_SHIFT;
+		pfns = &range->pfns[i];
+
+		cpu_flags = pud_to_hmm_pfn_flags(range, pud);
+		hmm_range_need_fault(hmm_vma_walk, pfns, npages,
+				     cpu_flags, &fault, &write_fault);
+		if (fault || write_fault)
+			return hmm_vma_walk_hole_(addr, end, fault,
+						write_fault, walk);
+
+		pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+		for (i = 0; i < npages; ++i, ++pfn) {
+			hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
+					      hmm_vma_walk->pgmap);
+			if (unlikely(!hmm_vma_walk->pgmap))
+				return -EBUSY;
+			pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
+		}
+		if (hmm_vma_walk->pgmap) {
+			put_dev_pagemap(hmm_vma_walk->pgmap);
+			hmm_vma_walk->pgmap = NULL;
+		}
+		hmm_vma_walk->last = end;
+		return 0;
+	}
+
+	split_huge_pud(vma, pudp, addr);
+	if (pud_none(*pudp))
+		goto again;
+
+	pmdp = pmd_offset(pudp, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		ret = hmm_vma_walk_pmd(pmdp, addr, next, walk);
+		if (ret)
+			return ret;
+	} while (pmdp++, addr = next, addr != end);
+
+	return 0;
+}
+
 static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 				      unsigned long start, unsigned long end,
 				      struct mm_walk *walk)
@@ -786,14 +893,6 @@ static void hmm_pfns_clear(struct hmm_range *range,
 		*pfns = range->values[HMM_PFN_NONE];
 }
 
-static void hmm_pfns_special(struct hmm_range *range)
-{
-	unsigned long addr = range->start, i = 0;
-
-	for (; addr < range->end; addr += PAGE_SIZE, i++)
-		range->pfns[i] = range->values[HMM_PFN_SPECIAL];
-}
-
 /*
  * hmm_range_register() - start tracking change to CPU page table over a range
  * @range: range
@@ -911,12 +1010,6 @@ long hmm_range_snapshot(struct hmm_range *range)
 		if (vma == NULL || (vma->vm_flags & device_vma))
 			return -EFAULT;
 
-		/* FIXME support dax */
-		if (vma_is_dax(vma)) {
-			hmm_pfns_special(range);
-			return -EINVAL;
-		}
-
 		if (is_vm_hugetlb_page(vma)) {
 			struct hstate *h = hstate_vma(vma);
 
@@ -940,6 +1033,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 		}
 
 		range->vma = vma;
+		hmm_vma_walk.pgmap = NULL;
 		hmm_vma_walk.last = start;
 		hmm_vma_walk.fault = false;
 		hmm_vma_walk.range = range;
@@ -951,6 +1045,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 		mm_walk.pte_entry = NULL;
 		mm_walk.test_walk = NULL;
 		mm_walk.hugetlb_entry = NULL;
+		mm_walk.pud_entry = hmm_vma_walk_pud;
 		mm_walk.pmd_entry = hmm_vma_walk_pmd;
 		mm_walk.pte_hole = hmm_vma_walk_hole;
 		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
@@ -1018,12 +1113,6 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		if (vma == NULL || (vma->vm_flags & device_vma))
 			return -EFAULT;
 
-		/* FIXME support dax */
-		if (vma_is_dax(vma)) {
-			hmm_pfns_special(range);
-			return -EINVAL;
-		}
-
 		if (is_vm_hugetlb_page(vma)) {
 			struct hstate *h = hstate_vma(vma);
 
@@ -1047,6 +1136,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		}
 
 		range->vma = vma;
+		hmm_vma_walk.pgmap = NULL;
 		hmm_vma_walk.last = start;
 		hmm_vma_walk.fault = true;
 		hmm_vma_walk.block = block;
@@ -1059,6 +1149,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		mm_walk.pte_entry = NULL;
 		mm_walk.test_walk = NULL;
 		mm_walk.hugetlb_entry = NULL;
+		mm_walk.pud_entry = hmm_vma_walk_pud;
 		mm_walk.pmd_entry = hmm_vma_walk_pmd;
 		mm_walk.pte_hole = hmm_vma_walk_hole;
 		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH 10/10] mm/hmm: add helpers for driver to safely take the mmap_sem
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (8 preceding siblings ...)
  2019-01-29 16:54 ` [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem jglisse
@ 2019-01-29 16:54 ` jglisse
  2019-02-20 21:59   ` John Hubbard
  2019-02-20 23:17 ` [PATCH 00/10] HMM updates for 5.1 John Hubbard
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 98+ messages in thread
From: jglisse @ 2019-01-29 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Ralph Campbell, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

The device driver context which holds a reference to the mirror, and
thus to the core hmm struct, might outlive the mm against which it was
created. To avoid having every driver check for that case, provide a
helper that checks whether the mm is still alive and takes the mmap_sem
in read mode if so. If the mm has been destroyed (the mmu_notifier
release callback has run) then we return -EINVAL so that the calling
code knows it is trying to do something against an mm that is no longer
valid.
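
As a hedged illustration of the intended call pattern (the function
below is made up; the full register/wait/retry loop documented in hmm.h
is elided):

        static long example_snapshot(struct hmm_mirror *mirror,
                                     struct hmm_range *range)
        {
                long ret;

                /* Returns -EINVAL if the mm is already dead. */
                ret = hmm_mirror_mm_down_read(mirror);
                if (ret)
                        return ret;

                ret = hmm_range_snapshot(range);

                hmm_mirror_mm_up_read(mirror);
                return ret;
        }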

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b3850297352f..4a1454e3efba 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -438,6 +438,50 @@ struct hmm_mirror {
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
+/*
+ * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
+ * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
+ * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
+ *
+ * The device driver context which holds reference to mirror and thus to core
+ * hmm struct might outlive the mm against which it was created. To avoid every
+ * driver to check for that case provide an helper that check if mm is still
+ * alive and take the mmap_sem in read mode if so. If the mm have been destroy
+ * (mmu_notifier release call back did happen) then we return -EINVAL so that
+ * calling code knows that it is trying to do something against a mm that is
+ * no longer valid.
+ */
+static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
+{
+	struct mm_struct *mm;
+
+	/* Sanity check ... */
+	if (!mirror || !mirror->hmm)
+		return -EINVAL;
+	/*
+	 * Before trying to take the mmap_sem make sure the mm is still
+	 * alive as device driver context might outlive the mm lifetime.
+	 *
+	 * FIXME: should we also check for mm that outlive its owning
+	 * task ?
+	 */
+	mm = READ_ONCE(mirror->hmm->mm);
+	if (mirror->hmm->dead || !mm)
+		return -EINVAL;
+
+	down_read(&mm->mmap_sem);
+	return 0;
+}
+
+/*
+ * hmm_mirror_mm_up_read() - unlock the mmap_sem from read mode
+ * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
+ */
+static inline void hmm_mirror_mm_up_read(struct hmm_mirror *mirror)
+{
+	up_read(&mirror->hmm->mm->mmap_sem);
+}
+
 
 /*
  * To snapshot the CPU page table you first have to call hmm_range_register()
@@ -463,7 +507,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  *          if (ret)
  *              return ret;
  *
- *          down_read(mm->mmap_sem);
+ *          hmm_mirror_mm_down_read(mirror);
  *      again:
  *
  *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
@@ -476,13 +520,13 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  *
  *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
  *          if (ret == -EAGAIN) {
- *              down_read(mm->mmap_sem);
+ *              hmm_mirror_mm_down_read(mirror);
  *              goto again;
  *          } else if (ret == -EBUSY) {
  *              goto again;
  *          }
  *
- *          up_read(&mm->mmap_sem);
+ *          hmm_mirror_mm_up_read(mirror);
  *          if (ret) {
  *              hmm_range_unregister(range);
  *              return ret;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-29 16:54 ` [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem jglisse
@ 2019-01-29 18:41   ` Dan Williams
  2019-01-29 19:31     ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-01-29 18:41 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Jan 29, 2019 at 8:54 AM <jglisse@redhat.com> wrote:
>
> From: Jérôme Glisse <jglisse@redhat.com>
>
> This add support to mirror vma which is an mmap of a file which is on
> a filesystem that using a DAX block device. There is no reason not to
> support that case.
>

The reason not to support it would be if it gets in the way of future
DAX development. How does this interact with MAP_SYNC? I'm also
concerned if this complicates DAX reflink support. In general I'd
rather prioritize fixing the places where DAX is broken today before
adding more cross-subsystem entanglements. The unit tests for
filesystems (xfstests) are readily accessible. How would I go about
regression testing DAX + HMM interactions?


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-29 18:41   ` Dan Williams
@ 2019-01-29 19:31     ` Jerome Glisse
  2019-01-29 20:51       ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-01-29 19:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Jan 29, 2019 at 10:41:23AM -0800, Dan Williams wrote:
> On Tue, Jan 29, 2019 at 8:54 AM <jglisse@redhat.com> wrote:
> >
> > From: Jérôme Glisse <jglisse@redhat.com>
> >
> > This add support to mirror vma which is an mmap of a file which is on
> > a filesystem that using a DAX block device. There is no reason not to
> > support that case.
> >
> 
> The reason not to support it would be if it gets in the way of future
> DAX development. How does this interact with MAP_SYNC? I'm also
> concerned if this complicates DAX reflink support. In general I'd
> rather prioritize fixing the places where DAX is broken today before
> adding more cross-subsystem entanglements. The unit tests for
> filesystems (xfstests) are readily accessible. How would I go about
> regression testing DAX + HMM interactions?

HMM mirrors the CPU page table, so anything you do to the CPU page
table will be reflected to all HMM mirror users. MAP_SYNC has no
bearing here whatsoever, as all HMM mirror users must do cache coherent
access to the ranges they mirror, so from the DAX point of view this is
_exactly_ the same as CPU access.

Note that you can not migrate DAX memory to GPU memory, and thus for an
mmap of a file on a filesystem that uses a DAX block device you can not
do migration to device memory. Also, at this time migration of file
backed pages is only supported for cache coherent device memory, for
instance on an OpenCAPI platform.

Bottom line is you just have to worry about the CPU page table:
whatever you do there will be reflected properly. It does not add any
burden to people working on DAX, unless you want to modify the CPU page
table without calling the mmu notifiers, but in that case you would not
only break HMM mirror users but other things like KVM as well ...


For testing, the question is what you want to test. Do you want to test
that a device properly mirrors an mmap of a file backed by DAX, i.e.
that device drivers which use HMM mirror keep working after changes are
made to DAX?

Or do you want to run the filesystem test suite using the GPU to access
the mmap of the file (read or write) instead of the CPU? In that case
any such test suite would need to be updated to be able to use
something like OpenCL for the access. At this time I do not see much
need for that, but maybe this is something people would like to see.

Cheers,
Jérôme


> 
> > Note that unlike GUP code we do not take page reference hence when we
> > back-off we have nothing to undo.
> >
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > ---
> >  mm/hmm.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 112 insertions(+), 21 deletions(-)
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 8b87e1813313..1a444885404e 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -334,6 +334,7 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
> >
> >  struct hmm_vma_walk {
> >         struct hmm_range        *range;
> > +       struct dev_pagemap      *pgmap;
> >         unsigned long           last;
> >         bool                    fault;
> >         bool                    block;
> > @@ -508,6 +509,15 @@ static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
> >                                 range->flags[HMM_PFN_VALID];
> >  }
> >
> > +static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud)
> > +{
> > +       if (!pud_present(pud))
> > +               return 0;
> > +       return pud_write(pud) ? range->flags[HMM_PFN_VALID] |
> > +                               range->flags[HMM_PFN_WRITE] :
> > +                               range->flags[HMM_PFN_VALID];
> > +}
> > +
> >  static int hmm_vma_handle_pmd(struct mm_walk *walk,
> >                               unsigned long addr,
> >                               unsigned long end,
> > @@ -529,8 +539,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
> >                 return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> >
> >         pfn = pmd_pfn(pmd) + pte_index(addr);
> > -       for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
> > +       for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > +               if (pmd_devmap(pmd)) {
> > +                       hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
> > +                                             hmm_vma_walk->pgmap);
> > +                       if (unlikely(!hmm_vma_walk->pgmap))
> > +                               return -EBUSY;
> > +               }
> >                 pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
> > +       }
> > +       if (hmm_vma_walk->pgmap) {
> > +               put_dev_pagemap(hmm_vma_walk->pgmap);
> > +               hmm_vma_walk->pgmap = NULL;
> > +       }
> >         hmm_vma_walk->last = end;
> >         return 0;
> >  }
> > @@ -617,10 +638,24 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> >         if (fault || write_fault)
> >                 goto fault;
> >
> > +       if (pte_devmap(pte)) {
> > +               hmm_vma_walk->pgmap = get_dev_pagemap(pte_pfn(pte),
> > +                                             hmm_vma_walk->pgmap);
> > +               if (unlikely(!hmm_vma_walk->pgmap))
> > +                       return -EBUSY;
> > +       } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > +               *pfn = range->values[HMM_PFN_SPECIAL];
> > +               return -EFAULT;
> > +       }
> > +
> >         *pfn = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;
> >         return 0;
> >
> >  fault:
> > +       if (hmm_vma_walk->pgmap) {
> > +               put_dev_pagemap(hmm_vma_walk->pgmap);
> > +               hmm_vma_walk->pgmap = NULL;
> > +       }
> >         pte_unmap(ptep);
> >         /* Fault any virtual address we were asked to fault */
> >         return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> > @@ -708,12 +743,84 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >                         return r;
> >                 }
> >         }
> > +       if (hmm_vma_walk->pgmap) {
> > +               put_dev_pagemap(hmm_vma_walk->pgmap);
> > +               hmm_vma_walk->pgmap = NULL;
> > +       }
> >         pte_unmap(ptep - 1);
> >
> >         hmm_vma_walk->last = addr;
> >         return 0;
> >  }
> >
> > +static int hmm_vma_walk_pud(pud_t *pudp,
> > +                           unsigned long start,
> > +                           unsigned long end,
> > +                           struct mm_walk *walk)
> > +{
> > +       struct hmm_vma_walk *hmm_vma_walk = walk->private;
> > +       struct hmm_range *range = hmm_vma_walk->range;
> > +       struct vm_area_struct *vma = walk->vma;
> > +       unsigned long addr = start, next;
> > +       pmd_t *pmdp;
> > +       pud_t pud;
> > +       int ret;
> > +
> > +again:
> > +       pud = READ_ONCE(*pudp);
> > +       if (pud_none(pud))
> > +               return hmm_vma_walk_hole(start, end, walk);
> > +
> > +       if (pud_huge(pud) && pud_devmap(pud)) {
> > +               unsigned long i, npages, pfn;
> > +               uint64_t *pfns, cpu_flags;
> > +               bool fault, write_fault;
> > +
> > +               if (!pud_present(pud))
> > +                       return hmm_vma_walk_hole(start, end, walk);
> > +
> > +               i = (addr - range->start) >> PAGE_SHIFT;
> > +               npages = (end - addr) >> PAGE_SHIFT;
> > +               pfns = &range->pfns[i];
> > +
> > +               cpu_flags = pud_to_hmm_pfn_flags(range, pud);
> > +               hmm_range_need_fault(hmm_vma_walk, pfns, npages,
> > +                                    cpu_flags, &fault, &write_fault);
> > +               if (fault || write_fault)
> > +                       return hmm_vma_walk_hole_(addr, end, fault,
> > +                                               write_fault, walk);
> > +
> > +               pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> > +               for (i = 0; i < npages; ++i, ++pfn) {
> > +                       hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
> > +                                             hmm_vma_walk->pgmap);
> > +                       if (unlikely(!hmm_vma_walk->pgmap))
> > +                               return -EBUSY;
> > +                       pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
> > +               }
> > +               if (hmm_vma_walk->pgmap) {
> > +                       put_dev_pagemap(hmm_vma_walk->pgmap);
> > +                       hmm_vma_walk->pgmap = NULL;
> > +               }
> > +               hmm_vma_walk->last = end;
> > +               return 0;
> > +       }
> > +
> > +       split_huge_pud(vma, pudp, addr);
> > +       if (pud_none(*pudp))
> > +               goto again;
> > +
> > +       pmdp = pmd_offset(pudp, addr);
> > +       do {
> > +               next = pmd_addr_end(addr, end);
> > +               ret = hmm_vma_walk_pmd(pmdp, addr, next, walk);
> > +               if (ret)
> > +                       return ret;
> > +       } while (pmdp++, addr = next, addr != end);
> > +
> > +       return 0;
> > +}
> > +
> >  static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
> >                                       unsigned long start, unsigned long end,
> >                                       struct mm_walk *walk)
> > @@ -786,14 +893,6 @@ static void hmm_pfns_clear(struct hmm_range *range,
> >                 *pfns = range->values[HMM_PFN_NONE];
> >  }
> >
> > -static void hmm_pfns_special(struct hmm_range *range)
> > -{
> > -       unsigned long addr = range->start, i = 0;
> > -
> > -       for (; addr < range->end; addr += PAGE_SIZE, i++)
> > -               range->pfns[i] = range->values[HMM_PFN_SPECIAL];
> > -}
> > -
> >  /*
> >   * hmm_range_register() - start tracking change to CPU page table over a range
> >   * @range: range
> > @@ -911,12 +1010,6 @@ long hmm_range_snapshot(struct hmm_range *range)
> >                 if (vma == NULL || (vma->vm_flags & device_vma))
> >                         return -EFAULT;
> >
> > -               /* FIXME support dax */
> > -               if (vma_is_dax(vma)) {
> > -                       hmm_pfns_special(range);
> > -                       return -EINVAL;
> > -               }
> > -
> >                 if (is_vm_hugetlb_page(vma)) {
> >                         struct hstate *h = hstate_vma(vma);
> >
> > @@ -940,6 +1033,7 @@ long hmm_range_snapshot(struct hmm_range *range)
> >                 }
> >
> >                 range->vma = vma;
> > +               hmm_vma_walk.pgmap = NULL;
> >                 hmm_vma_walk.last = start;
> >                 hmm_vma_walk.fault = false;
> >                 hmm_vma_walk.range = range;
> > @@ -951,6 +1045,7 @@ long hmm_range_snapshot(struct hmm_range *range)
> >                 mm_walk.pte_entry = NULL;
> >                 mm_walk.test_walk = NULL;
> >                 mm_walk.hugetlb_entry = NULL;
> > +               mm_walk.pud_entry = hmm_vma_walk_pud;
> >                 mm_walk.pmd_entry = hmm_vma_walk_pmd;
> >                 mm_walk.pte_hole = hmm_vma_walk_hole;
> >                 mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
> > @@ -1018,12 +1113,6 @@ long hmm_range_fault(struct hmm_range *range, bool block)
> >                 if (vma == NULL || (vma->vm_flags & device_vma))
> >                         return -EFAULT;
> >
> > -               /* FIXME support dax */
> > -               if (vma_is_dax(vma)) {
> > -                       hmm_pfns_special(range);
> > -                       return -EINVAL;
> > -               }
> > -
> >                 if (is_vm_hugetlb_page(vma)) {
> >                         struct hstate *h = hstate_vma(vma);
> >
> > @@ -1047,6 +1136,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
> >                 }
> >
> >                 range->vma = vma;
> > +               hmm_vma_walk.pgmap = NULL;
> >                 hmm_vma_walk.last = start;
> >                 hmm_vma_walk.fault = true;
> >                 hmm_vma_walk.block = block;
> > @@ -1059,6 +1149,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
> >                 mm_walk.pte_entry = NULL;
> >                 mm_walk.test_walk = NULL;
> >                 mm_walk.hugetlb_entry = NULL;
> > +               mm_walk.pud_entry = hmm_vma_walk_pud;
> >                 mm_walk.pmd_entry = hmm_vma_walk_pmd;
> >                 mm_walk.pte_hole = hmm_vma_walk_hole;
> >                 mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
> > --
> > 2.17.2
> >

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-29 19:31     ` Jerome Glisse
@ 2019-01-29 20:51       ` Dan Williams
  2019-01-29 21:21         ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-01-29 20:51 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Jan 29, 2019 at 11:32 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Jan 29, 2019 at 10:41:23AM -0800, Dan Williams wrote:
> > On Tue, Jan 29, 2019 at 8:54 AM <jglisse@redhat.com> wrote:
> > >
> > > From: Jérôme Glisse <jglisse@redhat.com>
> > >
> > > This add support to mirror vma which is an mmap of a file which is on
> > > a filesystem that using a DAX block device. There is no reason not to
> > > support that case.
> > >
> >
> > The reason not to support it would be if it gets in the way of future
> > DAX development. How does this interact with MAP_SYNC? I'm also
> > concerned if this complicates DAX reflink support. In general I'd
> > rather prioritize fixing the places where DAX is broken today before
> > adding more cross-subsystem entanglements. The unit tests for
> > filesystems (xfstests) are readily accessible. How would I go about
> > regression testing DAX + HMM interactions?
>
> HMM mirror CPU page table so anything you do to CPU page table will
> be reflected to all HMM mirror user. So MAP_SYNC has no bearing here
> whatsoever as all HMM mirror user must do cache coherent access to
> range they mirror so from DAX point of view this is just _exactly_
> the same as CPU access.
>
> Note that you can not migrate DAX memory to GPU memory and thus for a
> mmap of a file on a filesystem that use a DAX block device then you can
> not do migration to device memory. Also at this time migration of file
> back page is only supported for cache coherent device memory so for
> instance on OpenCAPI platform.

Ok, this addresses the primary concern about maintenance burden. Thanks.

However the changelog still amounts to a justification of "change
this, because we can". At least, that's how it reads to me. Is there
any positive benefit to merging this patch? Can you spell that out in
the changelog?

> Bottom line is you just have to worry about the CPU page table. What
> ever you do there will be reflected properly. It does not add any
> burden to people working on DAX. Unless you want to modify CPU page
> table without calling mmu notifier but in that case you would not
> only break HMM mirror user but other thing like KVM ...
>
>
> For testing the issue is what do you want to test ? Do you want to test
> that a device properly mirror some mmap of a file back by DAX ? ie
> device driver which use HMM mirror keep working after changes made to
> DAX.
>
> Or do you want to run filesystem test suite using the GPU to access
> mmap of the file (read or write) instead of the CPU ? In that case any
> such test suite would need to be updated to be able to use something
> like OpenCL for. At this time i do not see much need for that but maybe
> this is something people would like to see.

In general, as HMM grows intercept points throughout the mm it would
be helpful to be able to sanity check the implementation.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-29 20:51       ` Dan Williams
@ 2019-01-29 21:21         ` Jerome Glisse
  2019-01-30  2:32           ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-01-29 21:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Jan 29, 2019 at 12:51:25PM -0800, Dan Williams wrote:
> On Tue, Jan 29, 2019 at 11:32 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Jan 29, 2019 at 10:41:23AM -0800, Dan Williams wrote:
> > > On Tue, Jan 29, 2019 at 8:54 AM <jglisse@redhat.com> wrote:
> > > >
> > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > >
> > > > This add support to mirror vma which is an mmap of a file which is on
> > > > a filesystem that using a DAX block device. There is no reason not to
> > > > support that case.
> > > >
> > >
> > > The reason not to support it would be if it gets in the way of future
> > > DAX development. How does this interact with MAP_SYNC? I'm also
> > > concerned if this complicates DAX reflink support. In general I'd
> > > rather prioritize fixing the places where DAX is broken today before
> > > adding more cross-subsystem entanglements. The unit tests for
> > > filesystems (xfstests) are readily accessible. How would I go about
> > > regression testing DAX + HMM interactions?
> >
> > HMM mirror CPU page table so anything you do to CPU page table will
> > be reflected to all HMM mirror user. So MAP_SYNC has no bearing here
> > whatsoever as all HMM mirror user must do cache coherent access to
> > range they mirror so from DAX point of view this is just _exactly_
> > the same as CPU access.
> >
> > Note that you can not migrate DAX memory to GPU memory and thus for a
> > mmap of a file on a filesystem that use a DAX block device then you can
> > not do migration to device memory. Also at this time migration of file
> > back page is only supported for cache coherent device memory so for
> > instance on OpenCAPI platform.
> 
> Ok, this addresses the primary concern about maintenance burden. Thanks.
> 
> However the changelog still amounts to a justification of "change
> this, because we can". At least, that's how it reads to me. Is there
> any positive benefit to merging this patch? Can you spell that out in
> the changelog?

There are 3 reasons for this:
    1) Convert ODP to use HMM underneath so that we share code between
    infiniband ODP and GPU drivers. ODP does support DAX today, so I
    cannot convert ODP to HMM without also supporting DAX in HMM;
    otherwise I would regress the ODP features.

    2) I expect people will be running GPGPU on computers with files
    that use DAX, and they will want to use HMM there too. In fact,
    from the user-space point of view, whether a file is DAX or not
    should only change one thing: for a DAX file you will never be
    able to use GPU memory.

    3) I want to convert as many users of GUP to HMM as possible (I
    already posted several patchsets to the GPU mailing list for that
    and intend to post a v2 of those later on). Using HMM avoids GUP,
    and in particular the long-lived GUP pin (sketched below), because
    here we abide by mmu notifiers and hence do not inhibit any of the
    filesystem's regular operations. Some of those GPU drivers do allow
    GUP on DAX files, so again I cannot regress them.
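
To make that GUP pin concrete, the pattern those drivers use today looks
roughly like the sketch below (standard get_user_pages() API; the
my_pin_user_range() name is made up and error handling is trimmed):

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Sketch: pin a range of user pages for device access. Every returned
 * page keeps an elevated refcount until put_page(), and it is that
 * long-lived pin which gets in the way of truncate/hole-punch on a DAX
 * file. The HMM mirror path never takes such a pin; it drops the device
 * mapping from the mmu notifier instead.
 */
static long my_pin_user_range(unsigned long addr, unsigned long npages,
			      struct page **pages)
{
	long npinned;

	down_read(&current->mm->mmap_sem);
	npinned = get_user_pages(addr, npages, FOLL_WRITE, pages, NULL);
	up_read(&current->mm->mmap_sem);

	return npinned;	/* caller does put_page() on each entry when done */
}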


> > Bottom line is you just have to worry about the CPU page table. What
> > ever you do there will be reflected properly. It does not add any
> > burden to people working on DAX. Unless you want to modify CPU page
> > table without calling mmu notifier but in that case you would not
> > only break HMM mirror user but other thing like KVM ...
> >
> >
> > For testing the issue is what do you want to test ? Do you want to test
> > that a device properly mirror some mmap of a file back by DAX ? ie
> > device driver which use HMM mirror keep working after changes made to
> > DAX.
> >
> > Or do you want to run filesystem test suite using the GPU to access
> > mmap of the file (read or write) instead of the CPU ? In that case any
> > such test suite would need to be updated to be able to use something
> > like OpenCL for. At this time i do not see much need for that but maybe
> > this is something people would like to see.
> 
> In general, as HMM grows intercept points throughout the mm it would
> be helpful to be able to sanity check the implementation.

I usually use a combination of simple OpenCL programs and hand-tailored
direct ioctl hacks to force specific code paths to happen. I should
probably create a repository with a set of OpenCL tests so that others
can also use them. I need to clean those up into something not too ugly
so I am not ashamed of them.

Also, at this time the OpenCL bits are not in any distro; most of the
bits are in Mesa, and Karol and others are doing a great job of
polishing things and getting all the bits in. I expect that in a couple
of months the mainline of all the projects (LLVM, Mesa, libdrm, ...)
will have all the bits, and then it will trickle down to your favorite
distribution (assuming they build Mesa with OpenCL enabled).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-29 21:21         ` Jerome Glisse
@ 2019-01-30  2:32           ` Dan Williams
  2019-01-30  3:03             ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-01-30  2:32 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Jan 29, 2019 at 1:21 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Jan 29, 2019 at 12:51:25PM -0800, Dan Williams wrote:
> > On Tue, Jan 29, 2019 at 11:32 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Jan 29, 2019 at 10:41:23AM -0800, Dan Williams wrote:
> > > > On Tue, Jan 29, 2019 at 8:54 AM <jglisse@redhat.com> wrote:
> > > > >
> > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > >
> > > > > This add support to mirror vma which is an mmap of a file which is on
> > > > > a filesystem that using a DAX block device. There is no reason not to
> > > > > support that case.
> > > > >
> > > >
> > > > The reason not to support it would be if it gets in the way of future
> > > > DAX development. How does this interact with MAP_SYNC? I'm also
> > > > concerned if this complicates DAX reflink support. In general I'd
> > > > rather prioritize fixing the places where DAX is broken today before
> > > > adding more cross-subsystem entanglements. The unit tests for
> > > > filesystems (xfstests) are readily accessible. How would I go about
> > > > regression testing DAX + HMM interactions?
> > >
> > > HMM mirror CPU page table so anything you do to CPU page table will
> > > be reflected to all HMM mirror user. So MAP_SYNC has no bearing here
> > > whatsoever as all HMM mirror user must do cache coherent access to
> > > range they mirror so from DAX point of view this is just _exactly_
> > > the same as CPU access.
> > >
> > > Note that you can not migrate DAX memory to GPU memory and thus for a
> > > mmap of a file on a filesystem that use a DAX block device then you can
> > > not do migration to device memory. Also at this time migration of file
> > > back page is only supported for cache coherent device memory so for
> > > instance on OpenCAPI platform.
> >
> > Ok, this addresses the primary concern about maintenance burden. Thanks.
> >
> > However the changelog still amounts to a justification of "change
> > this, because we can". At least, that's how it reads to me. Is there
> > any positive benefit to merging this patch? Can you spell that out in
> > the changelog?
>
> There is 3 reasons for this:

Thanks for this.

>     1) Convert ODP to use HMM underneath so that we share code between
>     infiniband ODP and GPU drivers. ODP do support DAX today so i can
>     not convert ODP to HMM without also supporting DAX in HMM otherwise
>     i would regress the ODP features.
>
>     2) I expect people will be running GPGPU on computer with file that
>     use DAX and they will want to use HMM there too, in fact from user-
>     space point of view wether the file is DAX or not should only change
>     one thing ie for DAX file you will never be able to use GPU memory.
>
>     3) I want to convert as many user of GUP to HMM (already posted
>     several patchset to GPU mailing list for that and i intend to post
>     a v2 of those latter on). Using HMM avoids GUP and it will avoid
>     the GUP pin as here we abide by mmu notifier hence we do not want to
>     inhibit any of the filesystem regular operation. Some of those GPU
>     driver do allow GUP on DAX file. So again i can not regress them.

Is this really a GUP to HMM conversion, or a GUP to mmu_notifier
solution? It would be good to boil this conversion down to the base
building blocks. It seems "HMM" can mean several distinct pieces of
infrastructure. Is it possible to replace some GUP usage with an
mmu_notifier based solution without pulling in all of HMM?

> > > Bottom line is you just have to worry about the CPU page table. What
> > > ever you do there will be reflected properly. It does not add any
> > > burden to people working on DAX. Unless you want to modify CPU page
> > > table without calling mmu notifier but in that case you would not
> > > only break HMM mirror user but other thing like KVM ...
> > >
> > >
> > > For testing the issue is what do you want to test ? Do you want to test
> > > that a device properly mirror some mmap of a file back by DAX ? ie
> > > device driver which use HMM mirror keep working after changes made to
> > > DAX.
> > >
> > > Or do you want to run filesystem test suite using the GPU to access
> > > mmap of the file (read or write) instead of the CPU ? In that case any
> > > such test suite would need to be updated to be able to use something
> > > like OpenCL for. At this time i do not see much need for that but maybe
> > > this is something people would like to see.
> >
> > In general, as HMM grows intercept points throughout the mm it would
> > be helpful to be able to sanity check the implementation.
>
> I usualy use a combination of simple OpenCL programs and hand tailor direct
> ioctl hack to force specific code path to happen. I should probably create
> a repository with a set of OpenCL tests so that other can also use them.
> I need to clean those up into something not too ugly so i am not ashame
> of them.

That would be great, even if it is messy.

> Also at this time the OpenCL bits are not in any distro, most of the bits
> are in mesa and Karol and others are doing a great jobs at polishing things
> and getting all the bits in. I do expect that in couple months the mainline
> of all projects (LLVM, Mesa, libdrm, ...) will have all the bits and then it
> will trickle down to your favorite distribution (assuming they build mesa
> with OpenCL enabled).

Ok.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-30  2:32           ` Dan Williams
@ 2019-01-30  3:03             ` Jerome Glisse
  2019-01-30 17:25               ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-01-30  3:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Jan 29, 2019 at 06:32:56PM -0800, Dan Williams wrote:
> On Tue, Jan 29, 2019 at 1:21 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Jan 29, 2019 at 12:51:25PM -0800, Dan Williams wrote:
> > > On Tue, Jan 29, 2019 at 11:32 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Tue, Jan 29, 2019 at 10:41:23AM -0800, Dan Williams wrote:
> > > > > On Tue, Jan 29, 2019 at 8:54 AM <jglisse@redhat.com> wrote:
> > > > > >
> > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > >
> > > > > > This add support to mirror vma which is an mmap of a file which is on
> > > > > > a filesystem that using a DAX block device. There is no reason not to
> > > > > > support that case.
> > > > > >
> > > > >
> > > > > The reason not to support it would be if it gets in the way of future
> > > > > DAX development. How does this interact with MAP_SYNC? I'm also
> > > > > concerned if this complicates DAX reflink support. In general I'd
> > > > > rather prioritize fixing the places where DAX is broken today before
> > > > > adding more cross-subsystem entanglements. The unit tests for
> > > > > filesystems (xfstests) are readily accessible. How would I go about
> > > > > regression testing DAX + HMM interactions?
> > > >
> > > > HMM mirror CPU page table so anything you do to CPU page table will
> > > > be reflected to all HMM mirror user. So MAP_SYNC has no bearing here
> > > > whatsoever as all HMM mirror user must do cache coherent access to
> > > > range they mirror so from DAX point of view this is just _exactly_
> > > > the same as CPU access.
> > > >
> > > > Note that you can not migrate DAX memory to GPU memory and thus for a
> > > > mmap of a file on a filesystem that use a DAX block device then you can
> > > > not do migration to device memory. Also at this time migration of file
> > > > back page is only supported for cache coherent device memory so for
> > > > instance on OpenCAPI platform.
> > >
> > > Ok, this addresses the primary concern about maintenance burden. Thanks.
> > >
> > > However the changelog still amounts to a justification of "change
> > > this, because we can". At least, that's how it reads to me. Is there
> > > any positive benefit to merging this patch? Can you spell that out in
> > > the changelog?
> >
> > There is 3 reasons for this:
> 
> Thanks for this.
> 
> >     1) Convert ODP to use HMM underneath so that we share code between
> >     infiniband ODP and GPU drivers. ODP do support DAX today so i can
> >     not convert ODP to HMM without also supporting DAX in HMM otherwise
> >     i would regress the ODP features.
> >
> >     2) I expect people will be running GPGPU on computer with file that
> >     use DAX and they will want to use HMM there too, in fact from user-
> >     space point of view wether the file is DAX or not should only change
> >     one thing ie for DAX file you will never be able to use GPU memory.
> >
> >     3) I want to convert as many user of GUP to HMM (already posted
> >     several patchset to GPU mailing list for that and i intend to post
> >     a v2 of those latter on). Using HMM avoids GUP and it will avoid
> >     the GUP pin as here we abide by mmu notifier hence we do not want to
> >     inhibit any of the filesystem regular operation. Some of those GPU
> >     driver do allow GUP on DAX file. So again i can not regress them.
> 
> Is this really a GUP to HMM conversion, or a GUP to mmu_notifier
> solution? It would be good to boil this conversion down to the base
> building blocks. It seems "HMM" can mean several distinct pieces of
> infrastructure. Is it possible to replace some GUP usage with an
> mmu_notifier based solution without pulling in all of HMM?

Kind of both; some of the GPU drivers I am converting will use HMM for
more than just this GUP reason. But when, and for what hardware, they
will use HMM is not something I can share (it is up to each vendor to
announce their hardware and features on their own timeline).

So yes, you could do the mmu notifier solution without pulling in HMM
mirror (note that you do not need to pull in all of HMM; HMM has many
kernel config options and for this you only need the HMM mirror part).
But if you are not using HMM then you will just be duplicating the same
code as HMM mirror. So I believe it is better to share this code: if we
want to change core mm then we only have to update HMM while keeping
the API/contract with the device drivers intact. This is one of the
motivations behind HMM, i.e. to have it as an impedance layer between
mm and device drivers so that mm folks do not have to understand every
single device driver but only have to understand the contract HMM has
with all the device drivers that use it.

Also, having each driver duplicate this code increases the risk of one
of them getting a little detail wrong. The hope is that by sharing the
same HMM code with all the drivers, everyone benefits from debugging
the same code (I am hoping I do not have many bugs left :))
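
To make "the same code" concrete, this is roughly what the mirror side
looks like from a driver's point of view. It is only a sketch: my_mirror
and my_invalidate_range are made-up driver names, and the exact callback
signature should be checked against the current include/linux/hmm.h.

#include <linux/hmm.h>

struct my_mirror {
	struct hmm_mirror mirror;
	/* ... device page table state ... */
};

/* Hypothetical driver hook that shoots down device mappings. */
void my_invalidate_range(struct my_mirror *m,
			 unsigned long start, unsigned long end);

/* Called from the mmu notifier path when the CPU page table changes. */
static int my_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
					 const struct hmm_update *update)
{
	struct my_mirror *m = container_of(mirror, struct my_mirror, mirror);

	my_invalidate_range(m, update->start, update->end);
	return 0;
}

static const struct hmm_mirror_ops my_mirror_ops = {
	.sync_cpu_device_pagetables = my_sync_cpu_device_pagetables,
};

static int my_context_init(struct my_mirror *m)
{
	m->mirror.ops = &my_mirror_ops;
	return hmm_mirror_register(&m->mirror, current->mm);
}

Without HMM mirror, each driver would carry its own mmu_notifier
registration and its own version of the bookkeeping above.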


> > > > Bottom line is you just have to worry about the CPU page table. What
> > > > ever you do there will be reflected properly. It does not add any
> > > > burden to people working on DAX. Unless you want to modify CPU page
> > > > table without calling mmu notifier but in that case you would not
> > > > only break HMM mirror user but other thing like KVM ...
> > > >
> > > >
> > > > For testing the issue is what do you want to test ? Do you want to test
> > > > that a device properly mirror some mmap of a file back by DAX ? ie
> > > > device driver which use HMM mirror keep working after changes made to
> > > > DAX.
> > > >
> > > > Or do you want to run filesystem test suite using the GPU to access
> > > > mmap of the file (read or write) instead of the CPU ? In that case any
> > > > such test suite would need to be updated to be able to use something
> > > > like OpenCL for. At this time i do not see much need for that but maybe
> > > > this is something people would like to see.
> > >
> > > In general, as HMM grows intercept points throughout the mm it would
> > > be helpful to be able to sanity check the implementation.
> >
> > I usualy use a combination of simple OpenCL programs and hand tailor direct
> > ioctl hack to force specific code path to happen. I should probably create
> > a repository with a set of OpenCL tests so that other can also use them.
> > I need to clean those up into something not too ugly so i am not ashame
> > of them.
> 
> That would be great, even it is messy.

I will clean them up and put something together that I am not too
ashamed to push :) I am on PTO for the next couple of weeks so it will
probably not happen before I am back. I should still have email access.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-30  3:03             ` Jerome Glisse
@ 2019-01-30 17:25               ` Dan Williams
  2019-01-30 18:36                 ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-01-30 17:25 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Jan 29, 2019 at 7:03 PM Jerome Glisse <jglisse@redhat.com> wrote:
[..]
> > >     1) Convert ODP to use HMM underneath so that we share code between
> > >     infiniband ODP and GPU drivers. ODP do support DAX today so i can
> > >     not convert ODP to HMM without also supporting DAX in HMM otherwise
> > >     i would regress the ODP features.
> > >
> > >     2) I expect people will be running GPGPU on computer with file that
> > >     use DAX and they will want to use HMM there too, in fact from user-
> > >     space point of view wether the file is DAX or not should only change
> > >     one thing ie for DAX file you will never be able to use GPU memory.
> > >
> > >     3) I want to convert as many user of GUP to HMM (already posted
> > >     several patchset to GPU mailing list for that and i intend to post
> > >     a v2 of those latter on). Using HMM avoids GUP and it will avoid
> > >     the GUP pin as here we abide by mmu notifier hence we do not want to
> > >     inhibit any of the filesystem regular operation. Some of those GPU
> > >     driver do allow GUP on DAX file. So again i can not regress them.
> >
> > Is this really a GUP to HMM conversion, or a GUP to mmu_notifier
> > solution? It would be good to boil this conversion down to the base
> > building blocks. It seems "HMM" can mean several distinct pieces of
> > infrastructure. Is it possible to replace some GUP usage with an
> > mmu_notifier based solution without pulling in all of HMM?
>
> Kind of both, some of the GPU driver i am converting will use HMM for
> more than just this GUP reason. But when and for what hardware they
> will use HMM for is not something i can share (it is up to each vendor
> to announce their hardware and feature on their own timeline).

Typically a spec document precedes specific hardware announcement and
Linux enabling is gated on public spec availability.

> So yes you could do the mmu notifier solution without pulling HMM
> mirror (note that you do not need to pull all of HMM, HMM as many
> kernel config option and for this you only need to use HMM mirror).
> But if you are not using HMM then you will just be duplicating the
> same code as HMM mirror. So i believe it is better to share this
> code and if we want to change core mm then we only have to update
> HMM while keeping the API/contract with device driver intact.

No. Linux should not end up with the HMM-mm as distinct from the
Core-mm. For long term maintainability of Linux, the target should be
one mm.

> This
> is one of the motivation behind HMM ie have it as an impedence layer
> between mm and device drivers so that mm folks do not have to under-
> stand every single device driver but only have to understand the
> contract HMM has with all device driver that uses it.

This gets to the heart of my critique of the approach taken with HMM. The
above statement is antithetical to
Documentation/process/stable-api-nonsense.rst. If HMM is trying to set
expectations that device-driver projects can write to a "stable" HMM
api then HMM is setting those device-drivers up for failure.

The possibility of refactoring driver code *across* vendors is a core
tenet of Linux maintainability. If the refactoring requires the API
exported to drivers to change then so be it. The expectation that all
out-of-tree device-drivers should have is that the API they are using
in kernel version X may not be there in version X+1. Having the driver
upstream is the only surefire insurance against that thrash.

HMM seems a bold experiment in trying to violate Linux development norms.

> Also having each driver duplicating this code increase the risk of
> one getting a little detail wrong. The hope is that sharing same
> HMM code with all the driver then everyone benefit from debugging
> the same code (i am hopping i do not have many bugs left :))

"each driver duplicating code" begs for refactoring driver code to
common code and this refactoring is hindered if it must adhere to an
"HMM" api.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-30 17:25               ` Dan Williams
@ 2019-01-30 18:36                 ` Jerome Glisse
  2019-01-31  3:28                   ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-01-30 18:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, Jan 30, 2019 at 09:25:21AM -0800, Dan Williams wrote:
> On Tue, Jan 29, 2019 at 7:03 PM Jerome Glisse <jglisse@redhat.com> wrote:
> [..]
> > > >     1) Convert ODP to use HMM underneath so that we share code between
> > > >     infiniband ODP and GPU drivers. ODP do support DAX today so i can
> > > >     not convert ODP to HMM without also supporting DAX in HMM otherwise
> > > >     i would regress the ODP features.
> > > >
> > > >     2) I expect people will be running GPGPU on computer with file that
> > > >     use DAX and they will want to use HMM there too, in fact from user-
> > > >     space point of view wether the file is DAX or not should only change
> > > >     one thing ie for DAX file you will never be able to use GPU memory.
> > > >
> > > >     3) I want to convert as many user of GUP to HMM (already posted
> > > >     several patchset to GPU mailing list for that and i intend to post
> > > >     a v2 of those latter on). Using HMM avoids GUP and it will avoid
> > > >     the GUP pin as here we abide by mmu notifier hence we do not want to
> > > >     inhibit any of the filesystem regular operation. Some of those GPU
> > > >     driver do allow GUP on DAX file. So again i can not regress them.
> > >
> > > Is this really a GUP to HMM conversion, or a GUP to mmu_notifier
> > > solution? It would be good to boil this conversion down to the base
> > > building blocks. It seems "HMM" can mean several distinct pieces of
> > > infrastructure. Is it possible to replace some GUP usage with an
> > > mmu_notifier based solution without pulling in all of HMM?
> >
> > Kind of both, some of the GPU driver i am converting will use HMM for
> > more than just this GUP reason. But when and for what hardware they
> > will use HMM for is not something i can share (it is up to each vendor
> > to announce their hardware and feature on their own timeline).
> 
> Typically a spec document precedes specific hardware announcement and
> Linux enabling is gated on public spec availability.
> 
> > So yes you could do the mmu notifier solution without pulling HMM
> > mirror (note that you do not need to pull all of HMM, HMM as many
> > kernel config option and for this you only need to use HMM mirror).
> > But if you are not using HMM then you will just be duplicating the
> > same code as HMM mirror. So i believe it is better to share this
> > code and if we want to change core mm then we only have to update
> > HMM while keeping the API/contract with device driver intact.
> 
> No. Linux should not end up with the HMM-mm as distinct from the
> Core-mm. For long term maintainability of Linux, the target should be
> one mm.

Huh? I do not follow here. Maybe I am unclear and we are talking past
each other.

> 
> > This
> > is one of the motivation behind HMM ie have it as an impedence layer
> > between mm and device drivers so that mm folks do not have to under-
> > stand every single device driver but only have to understand the
> > contract HMM has with all device driver that uses it.
> 
> This gets to heart of my critique of the approach taken with HMM. The
> above statement is antithetical to
> Documentation/process/stable-api-nonsense.rst. If HMM is trying to set
> expectations that device-driver projects can write to a "stable" HMM
> api then HMM is setting those device-drivers up for failure.

So I am not expressing myself correctly. If someone wants to change mm
in any way that would affect HMM users, they can, and it is welcome too
(assuming those changes are wanted by the community and motivated by
good reasons). Here, by "understanding the HMM contract and preserving
it", what I mean is that all you have to do is update the HMM API in
any way that delivers the same result to the device driver, instead of
having to understand each device driver. For instance, HMM provides X
so that a driver can do Y; the question is then what Z can replace X
while still allowing the driver to do Y. The point here is that HMM
defines what Y is and provides X for the current kernel mm code. If X
ever needs to change so that core mm can evolve, then you can, and are
more than welcome, to do it. With HMM, Y is defined and you only need
to figure out how to achieve the same end result for the device driver.

The point is that you do not have to go read each device driver to
figure out Y.driver_foo, Y.driver_bar, ...; you only have HMM, which
defines what Y means, i.e. what the device drivers are trying to do.

Obviously here I assume that we do not want to regress features, i.e.
we want to keep device driver features intact when we modify anything.

> 
> The possibility of refactoring driver code *across* vendors is a core
> tenet of Linux maintainability. If the refactoring requires the API
> exported to drivers to change then so be it. The expectation that all
> out-of-tree device-drivers should have is that the API they are using
> in kernel version X may not be there in version X+1. Having the driver
> upstream is the only surefire insurance against that thrash.
> 
> HMM seems a bold experiment in trying to violate Linux development norms.

We are definitely talking past each other. HMM is _not_ a stable API or
any kind of contract with anybody outside upstream. If you want to
change the API HMM exposes to device drivers you are free to do so,
provided that the change still allows the device drivers to achieve
their objective.

HMM is not here to hinder that; quite the opposite in fact, it is here
to help, by not requiring the people who want to evolve mm to
understand every single device driver.


> 
> > Also having each driver duplicating this code increase the risk of
> > one getting a little detail wrong. The hope is that sharing same
> > HMM code with all the driver then everyone benefit from debugging
> > the same code (i am hopping i do not have many bugs left :))
> 
> "each driver duplicating code" begs for refactoring driver code to
> common code and this refactoring is hindered if it must adhere to an
> "HMM" api.

Again, the HMM API can evolve; I am happy to help with any such change,
given that it provides a benefit to either mm or a device driver (i.e.
changing the HMM API just for the sake of changing it would not make
much sense to me).

So if, after converting drivers A, B and C, we see that it would be
nicer to change HMM in some way, then I will definitely do that, and
this patchset is a testimony to that. Converting ODP to use HMM is
easier after this patchset, and this patchset changes the HMM API. I
will be updating the nouveau driver to the new API and using the new
API for the other driver patchsets I am working on.

If I bump again into something that would be better done differently, I
will definitely change the HMM API and update all the upstream drivers
accordingly.

I am a strong believer in full freedom for internal kernel API changes,
and my intention has always been to help and facilitate such a process.
I am sorry this was unclear to anybody :( and I hope that this email
makes my intention clear.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-30 18:36                 ` Jerome Glisse
@ 2019-01-31  3:28                   ` Dan Williams
  2019-01-31  4:16                     ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-01-31  3:28 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, Jan 30, 2019 at 10:36 AM Jerome Glisse <jglisse@redhat.com> wrote:
[..]
> > > This
> > > is one of the motivation behind HMM ie have it as an impedence layer
> > > between mm and device drivers so that mm folks do not have to under-
> > > stand every single device driver but only have to understand the
> > > contract HMM has with all device driver that uses it.
> >
> > This gets to heart of my critique of the approach taken with HMM. The
> > above statement is antithetical to
> > Documentation/process/stable-api-nonsense.rst. If HMM is trying to set
> > expectations that device-driver projects can write to a "stable" HMM
> > api then HMM is setting those device-drivers up for failure.
>
> So i am not expressing myself correctly. If someone want to change mm
> in anyway that would affect HMM user, it can and it is welcome too
> (assuming that those change are wanted by the community and motivated
> for good reasons). Here by understanding HMM contract and preserving it
> what i mean is that all you have to do is update the HMM API in anyway
> that deliver the same result to the device driver. So what i means is
> that instead of having to understand each device driver. For instance
> you have HMM provide X so that driver can do Y; then what can be Z a
> replacement for X that allow driver to do Y. The point here is that
> HMM define what Y is and provide X for current kernel mm code. If X
> ever need to change so that core mm can evolve than you can and are
> more than welcome to do it. With HMM Y is defined and you only need to
> figure out how to achieve the same end result for the device driver.
>
> The point is that you do not have to go read each device driver to
> figure out Y.driver_foo, Y.driver_bar, ... you only have HMM that
> define what Y means and is ie this what device driver are trying to
> do.
>
> Obviously here i assume that we do not want to regress features ie
> we want to keep device driver features intact when we modify anything.

The specific concern is HMM attempting to expand the regression
boundary beyond drivers that exist in the kernel. The regression
contract that has priority is the one established for in-tree users.
If an in-tree change to mm semantics is fine for in-tree mm users, but
breaks out of tree users the question to those out of tree users is
"why isn't your use case upstream?". HMM is not that use case in and
of itself.

[..]
> Again HMM API can evolve, i am happy to help with any such change, given
> it provides benefit to either mm or device driver (ie changing the HMM
> just for the sake of changing the HMM API would not make much sense to
> me).
>
> So if after converting driver A, B and C we see that it would be nicer
> to change HMM in someway then i will definitly do that and this patchset
> is a testimony of that. Converting ODP to use HMM is easier after this
> patchset and this patchset changes the HMM API. I will be updating the
> nouveau driver to the new API and use the new API for the other driver
> patchset i am working on.
>
> If i bump again into something that would be better done any differently
> i will definitly change the HMM API and update all upstream driver
> accordingly.
>
> I am a strong believer in full freedom for internal kernel API changes
> and my intention have always been to help and facilitate such process.
> I am sorry this was unclear to any body :( and i am hopping that this
> email make my intention clear.''

A simple way to ensure that out-of-tree consumers don't come beat us
up over a backwards incompatible HMM change is to mark all the exports
with _GPL. I'm not requiring that, the devm_memremap_pages() fight was
hard enough, but the pace of new exports vs arrival of consumers for
those exports has me worried that this arrangement will fall over at
some point.

Another way to help allay these worries is to commit to no new exports
without in-tree users. In general, that should go without saying for
any core changes for new or future hardware.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-31  3:28                   ` Dan Williams
@ 2019-01-31  4:16                     ` Jerome Glisse
  2019-01-31  5:44                       ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-01-31  4:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, Jan 30, 2019 at 07:28:12PM -0800, Dan Williams wrote:
> On Wed, Jan 30, 2019 at 10:36 AM Jerome Glisse <jglisse@redhat.com> wrote:
> [..]
> > > > This
> > > > is one of the motivation behind HMM ie have it as an impedence layer
> > > > between mm and device drivers so that mm folks do not have to under-
> > > > stand every single device driver but only have to understand the
> > > > contract HMM has with all device driver that uses it.
> > >
> > > This gets to heart of my critique of the approach taken with HMM. The
> > > above statement is antithetical to
> > > Documentation/process/stable-api-nonsense.rst. If HMM is trying to set
> > > expectations that device-driver projects can write to a "stable" HMM
> > > api then HMM is setting those device-drivers up for failure.
> >
> > So i am not expressing myself correctly. If someone want to change mm
> > in anyway that would affect HMM user, it can and it is welcome too
> > (assuming that those change are wanted by the community and motivated
> > for good reasons). Here by understanding HMM contract and preserving it
> > what i mean is that all you have to do is update the HMM API in anyway
> > that deliver the same result to the device driver. So what i means is
> > that instead of having to understand each device driver. For instance
> > you have HMM provide X so that driver can do Y; then what can be Z a
> > replacement for X that allow driver to do Y. The point here is that
> > HMM define what Y is and provide X for current kernel mm code. If X
> > ever need to change so that core mm can evolve than you can and are
> > more than welcome to do it. With HMM Y is defined and you only need to
> > figure out how to achieve the same end result for the device driver.
> >
> > The point is that you do not have to go read each device driver to
> > figure out Y.driver_foo, Y.driver_bar, ... you only have HMM that
> > define what Y means and is ie this what device driver are trying to
> > do.
> >
> > Obviously here i assume that we do not want to regress features ie
> > we want to keep device driver features intact when we modify anything.
> 
> The specific concern is HMM attempting to expand the regression
> boundary beyond drivers that exist in the kernel. The regression
> contract that has priority is the one established for in-tree users.
> If an in-tree change to mm semantics is fine for in-tree mm users, but
> breaks out of tree users the question to those out of tree users is
> "why isn't your use case upstream?". HMM is not that use case in and
> of itself.

I do not worry about out-of-tree users and we should not worry about
them. I care only about upstream drivers (AMD, Intel, NVidia) and I
will not do an HMM feature if I do not intend to use it in at least one
of those upstream drivers. Yes, I have worked with NVidia on the
design, simply because they are the market leader in GPU compute and
have talented engineers who know a little about what would work well.
Not working with them to get their input on the design just because
their driver is closed source seems radical to me. I believe I
benefited from their valuable input. But in the end my aim is, and has
always been, to make the upstream kernel drivers the best possible. I
will talk with anyone who can help in achieving that objective.

So do not worry about non-upstream drivers.


> [..]
> > Again HMM API can evolve, i am happy to help with any such change, given
> > it provides benefit to either mm or device driver (ie changing the HMM
> > just for the sake of changing the HMM API would not make much sense to
> > me).
> >
> > So if after converting driver A, B and C we see that it would be nicer
> > to change HMM in someway then i will definitly do that and this patchset
> > is a testimony of that. Converting ODP to use HMM is easier after this
> > patchset and this patchset changes the HMM API. I will be updating the
> > nouveau driver to the new API and use the new API for the other driver
> > patchset i am working on.
> >
> > If i bump again into something that would be better done any differently
> > i will definitly change the HMM API and update all upstream driver
> > accordingly.
> >
> > I am a strong believer in full freedom for internal kernel API changes
> > and my intention have always been to help and facilitate such process.
> > I am sorry this was unclear to any body :( and i am hopping that this
> > email make my intention clear.''
> 
> A simple way to ensure that out-of-tree consumers don't come beat us
> up over a backwards incompatible HMM change is to mark all the exports
> with _GPL. I'm not requiring that, the devm_memremap_pages() fight was
> hard enough, but the pace of new exports vs arrival of consumers for
> those exports has me worried that this arrangement will fall over at
> some point.

I was reluctant about the devm_memremap_pages() GPL changes because I
think we should not change a symbol's export type after an initial
choice has been made.

I don't think a GPL or non-GPL export changes one bit with respect to
out-of-tree users. They know they cannot make any legitimate regression
claim, nor should we care. So I fail to see how a GPL export would make
it any different.

> Another way to help allay these worries is commit to no new exports
> without in-tree users. In general, that should go without saying for
> any core changes for new or future hardware.

I always intend to have an upstream user; the issue is that the device
driver trees and the mm tree move at different paces and there is
always a chicken-and-egg problem. I do not think Andrew wants to have
to merge driver patches through his tree, nor does Linus want to have
to merge the driver and mm trees in a specific order. So it is easier
to introduce the mm change in one release and the driver change in the
next. This is what I am doing with ODP: adding the necessary bits in
5.1 and working with Mellanox to have the ODP HMM patch fully tested
and ready to go in 5.2 (the patch is available today and Mellanox has
begun testing it AFAIK). So this is the guideline I will be following:
post the mm bits with the driver patches, push to merge the mm bits in
one release, and have the driver bits in the next. I do hope this
sounds fine to everyone.

It is also easier for the driver folks, as then they do not need to
have a special tree just to test my changes. They can integrate it into
their regular workflow, i.e. merge the new kernel release into their
tree and then start piling up changes to their driver for the next
kernel release.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-31  4:16                     ` Jerome Glisse
@ 2019-01-31  5:44                       ` Dan Williams
  2019-03-05 22:16                         ` Andrew Morton
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-01-31  5:44 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linux MM, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, Jan 30, 2019 at 8:17 PM Jerome Glisse <jglisse@redhat.com> wrote:
> On Wed, Jan 30, 2019 at 07:28:12PM -0800, Dan Williams wrote:
[..]
> > > Again HMM API can evolve, i am happy to help with any such change, given
> > > it provides benefit to either mm or device driver (ie changing the HMM
> > > just for the sake of changing the HMM API would not make much sense to
> > > me).
> > >
> > > So if after converting driver A, B and C we see that it would be nicer
> > > to change HMM in someway then i will definitly do that and this patchset
> > > is a testimony of that. Converting ODP to use HMM is easier after this
> > > patchset and this patchset changes the HMM API. I will be updating the
> > > nouveau driver to the new API and use the new API for the other driver
> > > patchset i am working on.
> > >
> > > If i bump again into something that would be better done any differently
> > > i will definitly change the HMM API and update all upstream driver
> > > accordingly.
> > >
> > > I am a strong believer in full freedom for internal kernel API changes
> > > and my intention have always been to help and facilitate such process.
> > > I am sorry this was unclear to any body :( and i am hopping that this
> > > email make my intention clear.''
> >
> > A simple way to ensure that out-of-tree consumers don't come beat us
> > up over a backwards incompatible HMM change is to mark all the exports
> > with _GPL. I'm not requiring that, the devm_memremap_pages() fight was
> > hard enough, but the pace of new exports vs arrival of consumers for
> > those exports has me worried that this arrangement will fall over at
> > some point.
>
> I was reluctant with the devm_memremap_pages() GPL changes because i
> think we should not change symbol export after an initial choice have
> been made on those.
>
> I don't think GPL or non GPL export change one bit in respect to out
> of tree user. They know they can not make any legitimate regression
> claim, nor should we care. So i fail to see how GPL export would make
> it any different.

It does matter. It's a perennial fight. For a recent example see the
discussion around: "x86/fpu: Don't export __kernel_fpu_{begin,end}()".
If you're not sure you can keep an api trivially stable it should have
a GPL export to minimize the exposure surface of out-of-tree users
that might grow attached to it.

>
> > Another way to help allay these worries is commit to no new exports
> > without in-tree users. In general, that should go without saying for
> > any core changes for new or future hardware.
>
> I always intend to have an upstream user the issue is that the device
> driver tree and the mm tree move a different pace and there is always
> a chicken and egg problem. I do not think Andrew wants to have to
> merge driver patches through its tree, nor Linus want to have to merge
> drivers and mm trees in specific order. So it is easier to introduce
> mm change in one release and driver change in the next. This is what
> i am doing with ODP. Adding things necessary in 5.1 and working with
> Mellanox to have the ODP HMM patch fully tested and ready to go in
> 5.2 (the patch is available today and Mellanox have begin testing it
> AFAIK). So this is the guideline i will be following. Post mm bits
> with driver patches, push to merge mm bits one release and have the
> driver bits in the next. I do hope this sound fine to everyone.

The track record to date has not been "merge HMM patch in one release
and merge the driver updates the next". If that is the plan going
forward that's great, and I do appreciate that this set came with
driver changes, and maintain hope the existing exports don't go
user-less for too much longer.

> It is also easier for the driver folks as then they do not need to
> have a special tree just to test my changes. They can integrate it
> in their regular workflow ie merge the new kernel release in their
> tree and then start pilling up changes to their driver for the next
> kernel release.

Everyone agrees that coordinating cross-tree updates is hard, but it's
manageable. HMM, as far as I can see, is taking an unprecedented
approach to early merging of core infrastructure.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/10] mm/hmm: add helpers for driver to safely take the mmap_sem
  2019-01-29 16:54 ` [PATCH 10/10] mm/hmm: add helpers for driver to safely take the mmap_sem jglisse
@ 2019-02-20 21:59   ` John Hubbard
  2019-02-20 22:19     ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: John Hubbard @ 2019-02-20 21:59 UTC (permalink / raw)
  To: jglisse, linux-mm; +Cc: linux-kernel, Andrew Morton, Ralph Campbell

On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> The device driver context which holds reference to mirror and thus to
> core hmm struct might outlive the mm against which it was created. To
> avoid every driver to check for that case provide an helper that check
> if mm is still alive and take the mmap_sem in read mode if so. If the
> mm have been destroy (mmu_notifier release call back did happen) then
> we return -EINVAL so that calling code knows that it is trying to do
> something against a mm that is no longer valid.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>   include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index b3850297352f..4a1454e3efba 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -438,6 +438,50 @@ struct hmm_mirror {
>   int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
>   void hmm_mirror_unregister(struct hmm_mirror *mirror);
>   
> +/*
> + * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
> + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> + * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
> + *
> + * The device driver context which holds reference to mirror and thus to core
> + * hmm struct might outlive the mm against which it was created. To avoid every
> + * driver to check for that case provide an helper that check if mm is still
> + * alive and take the mmap_sem in read mode if so. If the mm have been destroy
> + * (mmu_notifier release call back did happen) then we return -EINVAL so that
> + * calling code knows that it is trying to do something against a mm that is
> + * no longer valid.
> + */

Hi Jerome,

Are you thinking that, throughout the HMM API, there is a problem that
the mm may have gone away, and so driver code needs to be littered with
checks to ensure that mm is non-NULL? If so, why doesn't HMM take a
reference on mm->count?

This solution here cannot work. I think you'd need refcounting in order
to avoid this kind of problem. Just doing a check will always be open to
races (see below).


> +static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
> +{
> +	struct mm_struct *mm;
> +
> +	/* Sanity check ... */
> +	if (!mirror || !mirror->hmm)
> +		return -EINVAL;
> +	/*
> +	 * Before trying to take the mmap_sem make sure the mm is still
> +	 * alive as device driver context might outlive the mm lifetime.
> +	 *
> +	 * FIXME: should we also check for mm that outlive its owning
> +	 * task ?
> +	 */
> +	mm = READ_ONCE(mirror->hmm->mm);
> +	if (mirror->hmm->dead || !mm)
> +		return -EINVAL;
> +

Nothing really prevents mirror->hmm->mm from changing to NULL right here.

> +	down_read(&mm->mmap_sem);
> +	return 0;
> +}
> +

...maybe better to just drop this patch from the series, until we see a
pattern of uses in the calling code.
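
If the helper does stay, I'd expect it to actually pin the mm across
the check, roughly along these lines (purely illustrative, not code
from this series; mmget_not_zero()/mmput() is just one way to do it,
and the struct layout is the one from this patch):

static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
{
	struct mm_struct *mm = READ_ONCE(mirror->hmm->mm);

	/*
	 * mmget_not_zero() fails once the mm has no users left, which
	 * closes the window between the liveness check and down_read().
	 */
	if (!mm || !mmget_not_zero(mm))
		return -EINVAL;
	down_read(&mm->mmap_sem);
	return 0;
}

static inline void hmm_mirror_mm_up_read(struct hmm_mirror *mirror)
{
	struct mm_struct *mm = mirror->hmm->mm;

	up_read(&mm->mmap_sem);
	mmput(mm);	/* drop the reference taken in _down_read() */
}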

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/10] mm/hmm: add helpers for driver to safely take the mmap_sem
  2019-02-20 21:59   ` John Hubbard
@ 2019-02-20 22:19     ` Jerome Glisse
  2019-02-20 22:40       ` John Hubbard
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-02-20 22:19 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Ralph Campbell

On Wed, Feb 20, 2019 at 01:59:13PM -0800, John Hubbard wrote:
> On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > The device driver context which holds reference to mirror and thus to
> > core hmm struct might outlive the mm against which it was created. To
> > avoid every driver to check for that case provide an helper that check
> > if mm is still alive and take the mmap_sem in read mode if so. If the
> > mm have been destroy (mmu_notifier release call back did happen) then
> > we return -EINVAL so that calling code knows that it is trying to do
> > something against a mm that is no longer valid.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > ---
> >   include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
> >   1 file changed, 47 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index b3850297352f..4a1454e3efba 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -438,6 +438,50 @@ struct hmm_mirror {
> >   int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
> >   void hmm_mirror_unregister(struct hmm_mirror *mirror);
> > +/*
> > + * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
> > + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> > + * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
> > + *
> > + * The device driver context which holds reference to mirror and thus to core
> > + * hmm struct might outlive the mm against which it was created. To avoid every
> > + * driver to check for that case provide an helper that check if mm is still
> > + * alive and take the mmap_sem in read mode if so. If the mm have been destroy
> > + * (mmu_notifier release call back did happen) then we return -EINVAL so that
> > + * calling code knows that it is trying to do something against a mm that is
> > + * no longer valid.
> > + */
> 
> Hi Jerome,
> 
> Are you thinking that, throughout the HMM API, there is a problem that
> the mm may have gone away, and so driver code needs to be littered with
> checks to ensure that mm is non-NULL? If so, why doesn't HMM take a
> reference on mm->count?
> 
> This solution here cannot work. I think you'd need refcounting in order
> to avoid this kind of problem. Just doing a check will always be open to
> races (see below).
> 
> 
> > +static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
> > +{
> > +	struct mm_struct *mm;
> > +
> > +	/* Sanity check ... */
> > +	if (!mirror || !mirror->hmm)
> > +		return -EINVAL;
> > +	/*
> > +	 * Before trying to take the mmap_sem make sure the mm is still
> > +	 * alive as device driver context might outlive the mm lifetime.
> > +	 *
> > +	 * FIXME: should we also check for mm that outlive its owning
> > +	 * task ?
> > +	 */
> > +	mm = READ_ONCE(mirror->hmm->mm);
> > +	if (mirror->hmm->dead || !mm)
> > +		return -EINVAL;
> > +
> 
> Nothing really prevents mirror->hmm->mm from changing to NULL right here.

This is really just to catch driver mistakes: as long as the driver
has not called hmm_mirror_unregister(), the !mm test will never be
true, i.e. mirror->hmm->mm cannot go NULL until the last reference to
the hmm_mirror is gone.

> 
> > +	down_read(&mm->mmap_sem);
> > +	return 0;
> > +}
> > +
> 
> ...maybe better to just drop this patch from the series, until we see a
> pattern of uses in the calling code.

It is used by nouveau now.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/10] mm/hmm: add helpers for driver to safely take the mmap_sem
  2019-02-20 22:19     ` Jerome Glisse
@ 2019-02-20 22:40       ` John Hubbard
  2019-02-20 23:09         ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: John Hubbard @ 2019-02-20 22:40 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Ralph Campbell

On 2/20/19 2:19 PM, Jerome Glisse wrote:
> On Wed, Feb 20, 2019 at 01:59:13PM -0800, John Hubbard wrote:
>> On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>
>>> The device driver context which holds reference to mirror and thus to
>>> core hmm struct might outlive the mm against which it was created. To
>>> avoid every driver to check for that case provide an helper that check
>>> if mm is still alive and take the mmap_sem in read mode if so. If the
>>> mm have been destroy (mmu_notifier release call back did happen) then
>>> we return -EINVAL so that calling code knows that it is trying to do
>>> something against a mm that is no longer valid.
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>> ---
>>>    include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
>>>    1 file changed, 47 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>> index b3850297352f..4a1454e3efba 100644
>>> --- a/include/linux/hmm.h
>>> +++ b/include/linux/hmm.h
>>> @@ -438,6 +438,50 @@ struct hmm_mirror {
>>>    int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
>>>    void hmm_mirror_unregister(struct hmm_mirror *mirror);
>>> +/*
>>> + * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
>>> + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
>>> + * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
>>> + *
>>> + * The device driver context which holds reference to mirror and thus to core
>>> + * hmm struct might outlive the mm against which it was created. To avoid every
>>> + * driver to check for that case provide an helper that check if mm is still
>>> + * alive and take the mmap_sem in read mode if so. If the mm have been destroy
>>> + * (mmu_notifier release call back did happen) then we return -EINVAL so that
>>> + * calling code knows that it is trying to do something against a mm that is
>>> + * no longer valid.
>>> + */
>>
>> Hi Jerome,
>>
>> Are you thinking that, throughout the HMM API, there is a problem that
>> the mm may have gone away, and so driver code needs to be littered with
>> checks to ensure that mm is non-NULL? If so, why doesn't HMM take a
>> reference on mm->count?
>>
>> This solution here cannot work. I think you'd need refcounting in order
>> to avoid this kind of problem. Just doing a check will always be open to
>> races (see below).
>>
>>
>>> +static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
>>> +{
>>> +	struct mm_struct *mm;
>>> +
>>> +	/* Sanity check ... */
>>> +	if (!mirror || !mirror->hmm)
>>> +		return -EINVAL;
>>> +	/*
>>> +	 * Before trying to take the mmap_sem make sure the mm is still
>>> +	 * alive as device driver context might outlive the mm lifetime.
>>> +	 *
>>> +	 * FIXME: should we also check for mm that outlive its owning
>>> +	 * task ?
>>> +	 */
>>> +	mm = READ_ONCE(mirror->hmm->mm);
>>> +	if (mirror->hmm->dead || !mm)
>>> +		return -EINVAL;
>>> +
>>
>> Nothing really prevents mirror->hmm->mm from changing to NULL right here.
> 
> This is really just to catch driver mistake, if driver does not call
> hmm_mirror_unregister() then the !mm will never be true ie the
> mirror->hmm->mm can not go NULL until the last reference to hmm_mirror
> is gone.

In that case, this again seems unnecessary, and in fact undesirable.
If the driver code has a bug, then let's let the backtrace from a NULL
dereference just happen, loud and clear.

This patch, at best, hides bugs. And it adds code that should simply be
unnecessary, so I don't like it. :)  Let's make it go away.

> 
>>
>>> +	down_read(&mm->mmap_sem);
>>> +	return 0;
>>> +}
>>> +
>>
>> ...maybe better to just drop this patch from the series, until we see a
>> pattern of uses in the calling code.
> 
> It use by nouveau now.

Maybe you'd have to remove that use case in a couple steps, depending on the
order that patches are going in.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 10/10] mm/hmm: add helpers for driver to safely take the mmap_sem
  2019-02-20 22:40       ` John Hubbard
@ 2019-02-20 23:09         ` Jerome Glisse
  0 siblings, 0 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-02-20 23:09 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Ralph Campbell

On Wed, Feb 20, 2019 at 02:40:20PM -0800, John Hubbard wrote:
> On 2/20/19 2:19 PM, Jerome Glisse wrote:
> > On Wed, Feb 20, 2019 at 01:59:13PM -0800, John Hubbard wrote:
> > > On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > 
> > > > The device driver context which holds reference to mirror and thus to
> > > > core hmm struct might outlive the mm against which it was created. To
> > > > avoid every driver to check for that case provide an helper that check
> > > > if mm is still alive and take the mmap_sem in read mode if so. If the
> > > > mm have been destroy (mmu_notifier release call back did happen) then
> > > > we return -EINVAL so that calling code knows that it is trying to do
> > > > something against a mm that is no longer valid.
> > > > 
> > > > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > > > Cc: John Hubbard <jhubbard@nvidia.com>
> > > > ---
> > > >    include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
> > > >    1 file changed, 47 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > > > index b3850297352f..4a1454e3efba 100644
> > > > --- a/include/linux/hmm.h
> > > > +++ b/include/linux/hmm.h
> > > > @@ -438,6 +438,50 @@ struct hmm_mirror {
> > > >    int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
> > > >    void hmm_mirror_unregister(struct hmm_mirror *mirror);
> > > > +/*
> > > > + * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
> > > > + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> > > > + * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
> > > > + *
> > > > + * The device driver context which holds reference to mirror and thus to core
> > > > + * hmm struct might outlive the mm against which it was created. To avoid every
> > > > + * driver to check for that case provide an helper that check if mm is still
> > > > + * alive and take the mmap_sem in read mode if so. If the mm have been destroy
> > > > + * (mmu_notifier release call back did happen) then we return -EINVAL so that
> > > > + * calling code knows that it is trying to do something against a mm that is
> > > > + * no longer valid.
> > > > + */
> > > 
> > > Hi Jerome,
> > > 
> > > Are you thinking that, throughout the HMM API, there is a problem that
> > > the mm may have gone away, and so driver code needs to be littered with
> > > checks to ensure that mm is non-NULL? If so, why doesn't HMM take a
> > > reference on mm->count?
> > > 
> > > This solution here cannot work. I think you'd need refcounting in order
> > > to avoid this kind of problem. Just doing a check will always be open to
> > > races (see below).
> > > 
> > > 
> > > > +static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
> > > > +{
> > > > +	struct mm_struct *mm;
> > > > +
> > > > +	/* Sanity check ... */
> > > > +	if (!mirror || !mirror->hmm)
> > > > +		return -EINVAL;
> > > > +	/*
> > > > +	 * Before trying to take the mmap_sem make sure the mm is still
> > > > +	 * alive as device driver context might outlive the mm lifetime.
> > > > +	 *
> > > > +	 * FIXME: should we also check for mm that outlive its owning
> > > > +	 * task ?
> > > > +	 */
> > > > +	mm = READ_ONCE(mirror->hmm->mm);
> > > > +	if (mirror->hmm->dead || !mm)
> > > > +		return -EINVAL;
> > > > +
> > > 
> > > Nothing really prevents mirror->hmm->mm from changing to NULL right here.
> > 
> > This is really just to catch driver mistake, if driver does not call
> > hmm_mirror_unregister() then the !mm will never be true ie the
> > mirror->hmm->mm can not go NULL until the last reference to hmm_mirror
> > is gone.
> 
> In that case, then this again seems unnecessary, and in fact undesirable.
> If the driver code has a bug, then let's let the backtrace from a NULL
> dereference just happen, loud and clear.
> 
> This patch, at best, hides bugs. And it adds code that should simply be
> unnecessary, so I don't like it. :)  Let's make it go away.
> 
> > 
> > > 
> > > > +	down_read(&mm->mmap_sem);
> > > > +	return 0;
> > > > +}
> > > > +
> > > 
> > > ...maybe better to just drop this patch from the series, until we see a
> > > pattern of uses in the calling code.
> > 
> > It use by nouveau now.
> 
> Maybe you'd have to remove that use case in a couple steps, depending on the
> order that patches are going in.

Well, all that is needed is to remove the
if (mirror->hmm->dead || !mm) return -EINVAL; check from the functions,
so it does not create any ordering conflict with anything, really.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (9 preceding siblings ...)
  2019-01-29 16:54 ` [PATCH 10/10] mm/hmm: add helpers for driver to safely take the mmap_sem jglisse
@ 2019-02-20 23:17 ` John Hubbard
  2019-02-20 23:36   ` Jerome Glisse
  2019-02-22 23:31 ` Ralph Campbell
  2019-03-13  1:27 ` Jerome Glisse
  12 siblings, 1 reply; 98+ messages in thread
From: John Hubbard @ 2019-02-20 23:17 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: linux-kernel, Andrew Morton, Felix Kuehling,
	Christian König, Ralph Campbell, Jason Gunthorpe,
	Dan Williams

On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> This patchset improves the HMM driver API and add support for hugetlbfs
> and DAX mirroring. The improvement motivation was to make the ODP to HMM
> conversion easier [1]. Because we have nouveau bits schedule for 5.1 and
> to avoid any multi-tree synchronization this patchset adds few lines of
> inline function that wrap the existing HMM driver API to the improved
> API. The nouveau driver was tested before and after this patchset and it
> builds and works on both case so there is no merging issue [2]. The
> nouveau bit are queue up for 5.1 so this is why i added those inline.
> 
> If this get merge in 5.1 the plans is to merge the HMM to ODP in 5.2 or
> 5.3 if testing shows any issues (so far no issues has been found with
> limited testing but Mellanox will be running heavier testing for longer
> time).
> 
> To avoid spamming mm i would like to not cc mm on ODP or nouveau patches,
> however if people prefer to see those on mm mailing list then i can keep
> it cced.
> 
> This is also what i intend to use as a base for AMD and Intel patches
> (v2 with more thing of some rfc which were already posted in the past).
> 

Hi Jerome,

Although Ralph has been testing and looking at this patchset, I just now
noticed that there hasn't been much public review of it, so I'm doing
a bit of that now. I don't think it's *quite* too late, because we're
still not at the 5.1 merge window...sorry for taking so long to get to
this.

Ralph, you might want to add ACKs or Tested-by's to some of these
patches (or even Reviewed-by, if you went that deep, which I suspect you
did in some cases), according to what you feel comfortable with?


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-02-20 23:17 ` [PATCH 00/10] HMM updates for 5.1 John Hubbard
@ 2019-02-20 23:36   ` Jerome Glisse
  0 siblings, 0 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-02-20 23:36 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, linux-kernel, Andrew Morton, Felix Kuehling,
	Christian König, Ralph Campbell, Jason Gunthorpe,
	Dan Williams

On Wed, Feb 20, 2019 at 03:17:58PM -0800, John Hubbard wrote:
> On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > This patchset improves the HMM driver API and add support for hugetlbfs
> > and DAX mirroring. The improvement motivation was to make the ODP to HMM
> > conversion easier [1]. Because we have nouveau bits schedule for 5.1 and
> > to avoid any multi-tree synchronization this patchset adds few lines of
> > inline function that wrap the existing HMM driver API to the improved
> > API. The nouveau driver was tested before and after this patchset and it
> > builds and works on both case so there is no merging issue [2]. The
> > nouveau bit are queue up for 5.1 so this is why i added those inline.
> > 
> > If this get merge in 5.1 the plans is to merge the HMM to ODP in 5.2 or
> > 5.3 if testing shows any issues (so far no issues has been found with
> > limited testing but Mellanox will be running heavier testing for longer
> > time).
> > 
> > To avoid spamming mm i would like to not cc mm on ODP or nouveau patches,
> > however if people prefer to see those on mm mailing list then i can keep
> > it cced.
> > 
> > This is also what i intend to use as a base for AMD and Intel patches
> > (v2 with more thing of some rfc which were already posted in the past).
> > 
> 
> Hi Jerome,
> 
> Although Ralph has been testing and looking at this patchset, I just now
> noticed that there hasn't been much public review of it, so I'm doing
> a bit of that now. I don't think it's *quite* too late, because we're
> still not at the 5.1 merge window...sorry for taking so long to get to
> this.
> 
> Ralph, you might want to add ACKs or Tested-by's to some of these
> patches (or even Reviewed-by, if you went that deep, which I suspect you
> did in some cases), according to what you feel comfortable with?

More eyes are always welcome. I tested with nouveau and with
InfiniBand mlx5. It seemed to work properly in my testing but I might
have missed something.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/10] mm/hmm: use reference counting for HMM struct
  2019-01-29 16:54 ` [PATCH 01/10] mm/hmm: use reference counting for HMM struct jglisse
@ 2019-02-20 23:47   ` John Hubbard
  2019-02-20 23:59     ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: John Hubbard @ 2019-02-20 23:47 UTC (permalink / raw)
  To: jglisse, linux-mm; +Cc: linux-kernel, Ralph Campbell, Andrew Morton

On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Every time i read the code to check that the HMM structure does not
> vanish before it should thanks to the many lock protecting its removal
> i get a headache. Switch to reference counting instead it is much
> easier to follow and harder to break. This also remove some code that
> is no longer needed with refcounting.

Hi Jerome,

That is an excellent idea. Some review comments below:

[snip]

>   static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   			const struct mmu_notifier_range *range)
>   {
>   	struct hmm_update update;
> -	struct hmm *hmm = range->mm->hmm;
> +	struct hmm *hmm = hmm_get(range->mm);
> +	int ret;
>   
>   	VM_BUG_ON(!hmm);
>   
> +	/* Check if hmm_mm_destroy() was call. */
> +	if (hmm->mm == NULL)
> +		return 0;

Let's delete that NULL check. It can't provide true protection. If there
is a way for that to race, we need to take another look at refcounting.

Is there a need for mmgrab()/mmdrop(), to keep the mm around while HMM
is using it?


> +
>   	update.start = range->start;
>   	update.end = range->end;
>   	update.event = HMM_UPDATE_INVALIDATE;
>   	update.blockable = range->blockable;
> -	return hmm_invalidate_range(hmm, true, &update);
> +	ret = hmm_invalidate_range(hmm, true, &update);
> +	hmm_put(hmm);
> +	return ret;
>   }
>   
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>   			const struct mmu_notifier_range *range)
>   {
>   	struct hmm_update update;
> -	struct hmm *hmm = range->mm->hmm;
> +	struct hmm *hmm = hmm_get(range->mm);
>   
>   	VM_BUG_ON(!hmm);
>   
> +	/* Check if hmm_mm_destroy() was call. */
> +	if (hmm->mm == NULL)
> +		return;
> +

Another one to delete, same reasoning as above.

[snip]

> @@ -717,14 +746,18 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>   	hmm = hmm_register(vma->vm_mm);
>   	if (!hmm)
>   		return -ENOMEM;
> -	/* Caller must have registered a mirror, via hmm_mirror_register() ! */
> -	if (!hmm->mmu_notifier.ops)
> +
> +	/* Check if hmm_mm_destroy() was call. */
> +	if (hmm->mm == NULL) {
> +		hmm_put(hmm);
>   		return -EINVAL;
> +	}
>   

Another hmm->mm NULL check to remove.

[snip]
> @@ -802,25 +842,27 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
>    */
>   bool hmm_vma_range_done(struct hmm_range *range)
>   {
> -	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
> -	struct hmm *hmm;
> +	bool ret = false;
>   
> -	if (range->end <= range->start) {
> +	/* Sanity check this really should not happen. */
> +	if (range->hmm == NULL || range->end <= range->start) {
>   		BUG();
>   		return false;
>   	}
>   
> -	hmm = hmm_register(range->vma->vm_mm);
> -	if (!hmm) {
> -		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
> -		return false;
> -	}
> -
> -	spin_lock(&hmm->lock);
> +	spin_lock(&range->hmm->lock);
>   	list_del_rcu(&range->list);
> -	spin_unlock(&hmm->lock);
> +	ret = range->valid;
> +	spin_unlock(&range->hmm->lock);
>   
> -	return range->valid;
> +	/* Is the mm still alive ? */
> +	if (range->hmm->mm == NULL)
> +		ret = false;


And another one here.


> +
> +	/* Drop reference taken by hmm_vma_fault() or hmm_vma_get_pfns() */
> +	hmm_put(range->hmm);
> +	range->hmm = NULL;
> +	return ret;
>   }
>   EXPORT_SYMBOL(hmm_vma_range_done);
>   
> @@ -880,6 +922,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>   	struct hmm *hmm;
>   	int ret;
>   
> +	range->hmm = NULL;
> +
>   	/* Sanity check, this really should not happen ! */
>   	if (range->start < vma->vm_start || range->start >= vma->vm_end)
>   		return -EINVAL;
> @@ -891,14 +935,18 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>   		hmm_pfns_clear(range, range->pfns, range->start, range->end);
>   		return -ENOMEM;
>   	}
> -	/* Caller must have registered a mirror using hmm_mirror_register() */
> -	if (!hmm->mmu_notifier.ops)
> +
> +	/* Check if hmm_mm_destroy() was call. */
> +	if (hmm->mm == NULL) {
> +		hmm_put(hmm);
>   		return -EINVAL;
> +	}

And here.

>   
>   	/* FIXME support hugetlb fs */
>   	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
>   			vma_is_dax(vma)) {
>   		hmm_pfns_special(range);
> +		hmm_put(hmm);
>   		return -EINVAL;
>   	}
>   
> @@ -910,6 +958,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>   		 * operations such has atomic access would not work.
>   		 */
>   		hmm_pfns_clear(range, range->pfns, range->start, range->end);
> +		hmm_put(hmm);
>   		return -EPERM;
>   	}
>   
> @@ -945,7 +994,16 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>   		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
>   			       range->end);
>   		hmm_vma_range_done(range);
> +		hmm_put(hmm);
> +	} else {
> +		/*
> +		 * Transfer hmm reference to the range struct it will be drop
> +		 * inside the hmm_vma_range_done() function (which _must_ be
> +		 * call if this function return 0).
> +		 */
> +		range->hmm = hmm;

Is that thread-safe? Is there anything preventing two or more threads from
changing range->hmm at the same time?



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 02/10] mm/hmm: do not erase snapshot when a range is invalidated
  2019-01-29 16:54 ` [PATCH 02/10] mm/hmm: do not erase snapshot when a range is invalidated jglisse
@ 2019-02-20 23:58   ` John Hubbard
  0 siblings, 0 replies; 98+ messages in thread
From: John Hubbard @ 2019-02-20 23:58 UTC (permalink / raw)
  To: jglisse, linux-mm; +Cc: linux-kernel, Andrew Morton, Ralph Campbell

On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Users of HMM might be using the snapshot information to do
> preparatory step like dma mapping pages to a device before
> checking for invalidation through hmm_vma_range_done() so
> do not erase that information and assume users will do the
> right thing.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>   mm/hmm.c | 6 ------
>   1 file changed, 6 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index b9f384ea15e9..74d69812d6be 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -170,16 +170,10 @@ static int hmm_invalidate_range(struct hmm *hmm, bool device,
>   
>   	spin_lock(&hmm->lock);
>   	list_for_each_entry(range, &hmm->ranges, list) {
> -		unsigned long addr, idx, npages;
> -
>   		if (update->end < range->start || update->start >= range->end)
>   			continue;
>   
>   		range->valid = false;
> -		addr = max(update->start, range->start);
> -		idx = (addr - range->start) >> PAGE_SHIFT;
> -		npages = (min(range->end, update->end) - addr) >> PAGE_SHIFT;
> -		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
>   	}
>   	spin_unlock(&hmm->lock);
>   
> 

Seems harmless to me. I really cannot see how this could cause a problem,
so you can add:

	Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/10] mm/hmm: use reference counting for HMM struct
  2019-02-20 23:47   ` John Hubbard
@ 2019-02-20 23:59     ` Jerome Glisse
  2019-02-21  0:06       ` John Hubbard
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-02-20 23:59 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Ralph Campbell, Andrew Morton

On Wed, Feb 20, 2019 at 03:47:50PM -0800, John Hubbard wrote:
> On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > Every time i read the code to check that the HMM structure does not
> > vanish before it should thanks to the many lock protecting its removal
> > i get a headache. Switch to reference counting instead it is much
> > easier to follow and harder to break. This also remove some code that
> > is no longer needed with refcounting.
> 
> Hi Jerome,
> 
> That is an excellent idea. Some review comments below:
> 
> [snip]
> 
> >   static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >   			const struct mmu_notifier_range *range)
> >   {
> >   	struct hmm_update update;
> > -	struct hmm *hmm = range->mm->hmm;
> > +	struct hmm *hmm = hmm_get(range->mm);
> > +	int ret;
> >   	VM_BUG_ON(!hmm);
> > +	/* Check if hmm_mm_destroy() was call. */
> > +	if (hmm->mm == NULL)
> > +		return 0;
> 
> Let's delete that NULL check. It can't provide true protection. If there
> is a way for that to race, we need to take another look at refcounting.

I will do a patch to delete the NULL check so that it is easier for
Andrew. No need to respin.

> Is there a need for mmgrab()/mmdrop(), to keep the mm around while HMM
> is using it?

It is already the case. The hmm struct holds a reference on the mm
struct, and the mirror struct holds a reference on the hmm struct;
hence the mirror struct holds a reference on the mm through the hmm
struct.


[...]

> >   	/* FIXME support hugetlb fs */
> >   	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
> >   			vma_is_dax(vma)) {
> >   		hmm_pfns_special(range);
> > +		hmm_put(hmm);
> >   		return -EINVAL;
> >   	}
> > @@ -910,6 +958,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
> >   		 * operations such has atomic access would not work.
> >   		 */
> >   		hmm_pfns_clear(range, range->pfns, range->start, range->end);
> > +		hmm_put(hmm);
> >   		return -EPERM;
> >   	}
> > @@ -945,7 +994,16 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
> >   		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
> >   			       range->end);
> >   		hmm_vma_range_done(range);
> > +		hmm_put(hmm);
> > +	} else {
> > +		/*
> > +		 * Transfer hmm reference to the range struct it will be drop
> > +		 * inside the hmm_vma_range_done() function (which _must_ be
> > +		 * call if this function return 0).
> > +		 */
> > +		range->hmm = hmm;
> 
> Is that thread-safe? Is there anything preventing two or more threads from
> changing range->hmm at the same time?

The range is provided by the driver, and the driver should not change
the hmm field, nor should it use the range struct from multiple
threads. If the driver does stupid things there is nothing I can do.
Note that this code is removed later in the series.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/10] mm/hmm: use reference counting for HMM struct
  2019-02-20 23:59     ` Jerome Glisse
@ 2019-02-21  0:06       ` John Hubbard
  2019-02-21  0:15         ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: John Hubbard @ 2019-02-21  0:06 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Ralph Campbell, Andrew Morton

On 2/20/19 3:59 PM, Jerome Glisse wrote:
> On Wed, Feb 20, 2019 at 03:47:50PM -0800, John Hubbard wrote:
>> On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>
>>> Every time i read the code to check that the HMM structure does not
>>> vanish before it should thanks to the many lock protecting its removal
>>> i get a headache. Switch to reference counting instead it is much
>>> easier to follow and harder to break. This also remove some code that
>>> is no longer needed with refcounting.
>>
>> Hi Jerome,
>>
>> That is an excellent idea. Some review comments below:
>>
>> [snip]
>>
>>>    static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>>>    			const struct mmu_notifier_range *range)
>>>    {
>>>    	struct hmm_update update;
>>> -	struct hmm *hmm = range->mm->hmm;
>>> +	struct hmm *hmm = hmm_get(range->mm);
>>> +	int ret;
>>>    	VM_BUG_ON(!hmm);
>>> +	/* Check if hmm_mm_destroy() was call. */
>>> +	if (hmm->mm == NULL)
>>> +		return 0;
>>
>> Let's delete that NULL check. It can't provide true protection. If there
>> is a way for that to race, we need to take another look at refcounting.
> 
> I will do a patch to delete the NULL check so that it is easier for
> Andrew. No need to respin.

(Did you miss my request to make hmm_get/hmm_put symmetric, though?)

> 
>> Is there a need for mmgrab()/mmdrop(), to keep the mm around while HMM
>> is using it?
> 
> It is already the case. The hmm struct holds a reference on the mm struct
> and the mirror struct holds a reference on the hmm struct hence the mirror
> struct holds a reference on the mm through the hmm struct.
> 
> 

OK, good. Yes, I guess the __mmu_notifier_register() call in hmm_register()
should get an mm_struct reference for us.
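
(IIRC the registration path does an mmgrab() internally, i.e. roughly:

	/* somewhere inside mmu_notifier registration -- from memory: */
	mmgrab(mm);	/* elevate mm_count so the mm_struct stays allocated */

so the chain would be: the mirror holds a kref on the hmm, and the
hmm's notifier registration keeps the mm_struct storage around. The
address space itself can of course still be torn down at exit; only
the struct is pinned.)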

> 
>>>    	/* FIXME support hugetlb fs */
>>>    	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
>>>    			vma_is_dax(vma)) {
>>>    		hmm_pfns_special(range);
>>> +		hmm_put(hmm);
>>>    		return -EINVAL;
>>>    	}
>>> @@ -910,6 +958,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>>>    		 * operations such has atomic access would not work.
>>>    		 */
>>>    		hmm_pfns_clear(range, range->pfns, range->start, range->end);
>>> +		hmm_put(hmm);
>>>    		return -EPERM;
>>>    	}
>>> @@ -945,7 +994,16 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>>>    		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
>>>    			       range->end);
>>>    		hmm_vma_range_done(range);
>>> +		hmm_put(hmm);
>>> +	} else {
>>> +		/*
>>> +		 * Transfer hmm reference to the range struct it will be drop
>>> +		 * inside the hmm_vma_range_done() function (which _must_ be
>>> +		 * call if this function return 0).
>>> +		 */
>>> +		range->hmm = hmm;
>>
>> Is that thread-safe? Is there anything preventing two or more threads from
>> changing range->hmm at the same time?
> 
> The range is provided by the driver and the driver should not change
> the hmm field nor should it use the range struct in multiple threads.
> If the driver do stupid things there is nothing i can do. Note that
> this code is removed latter in the serie.
> 
> Cheers,
> Jérôme
> 

OK, I see. That sounds good.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/10] mm/hmm: use reference counting for HMM struct
  2019-02-21  0:06       ` John Hubbard
@ 2019-02-21  0:15         ` Jerome Glisse
  2019-02-21  0:32           ` John Hubbard
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-02-21  0:15 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Ralph Campbell, Andrew Morton

On Wed, Feb 20, 2019 at 04:06:50PM -0800, John Hubbard wrote:
> On 2/20/19 3:59 PM, Jerome Glisse wrote:
> > On Wed, Feb 20, 2019 at 03:47:50PM -0800, John Hubbard wrote:
> > > On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > 
> > > > Every time i read the code to check that the HMM structure does not
> > > > vanish before it should thanks to the many lock protecting its removal
> > > > i get a headache. Switch to reference counting instead it is much
> > > > easier to follow and harder to break. This also remove some code that
> > > > is no longer needed with refcounting.
> > > 
> > > Hi Jerome,
> > > 
> > > That is an excellent idea. Some review comments below:
> > > 
> > > [snip]
> > > 
> > > >    static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> > > >    			const struct mmu_notifier_range *range)
> > > >    {
> > > >    	struct hmm_update update;
> > > > -	struct hmm *hmm = range->mm->hmm;
> > > > +	struct hmm *hmm = hmm_get(range->mm);
> > > > +	int ret;
> > > >    	VM_BUG_ON(!hmm);
> > > > +	/* Check if hmm_mm_destroy() was call. */
> > > > +	if (hmm->mm == NULL)
> > > > +		return 0;
> > > 
> > > Let's delete that NULL check. It can't provide true protection. If there
> > > is a way for that to race, we need to take another look at refcounting.
> > 
> > I will do a patch to delete the NULL check so that it is easier for
> > Andrew. No need to respin.
> 
> (Did you miss my request to make hmm_get/hmm_put symmetric, though?)

I went over my mail and I do not see anything about symmetric; what
do you mean?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 03/10] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot()
  2019-01-29 16:54 ` [PATCH 03/10] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() jglisse
@ 2019-02-21  0:25   ` John Hubbard
  2019-02-21  0:28     ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: John Hubbard @ 2019-02-21  0:25 UTC (permalink / raw)
  To: jglisse, linux-mm; +Cc: linux-kernel, Andrew Morton, Ralph Campbell

On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Rename for consistency between code, comments and documentation. Also
> improves the comments on all the possible returns values. Improve the
> function by returning the number of populated entries in pfns array.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>   include/linux/hmm.h |  4 ++--
>   mm/hmm.c            | 23 ++++++++++-------------
>   2 files changed, 12 insertions(+), 15 deletions(-)
> 

Hi Jerome,

After applying the entire patchset, I still see a few hits of the old name,
in Documentation:

$ git grep -n hmm_vma_get_pfns
Documentation/vm/hmm.rst:192:  int hmm_vma_get_pfns(struct vm_area_struct *vma,
Documentation/vm/hmm.rst:205:The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
Documentation/vm/hmm.rst:224:      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
include/linux/hmm.h:145: * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:


> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index bd6e058597a6..ddf49c1b1f5e 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -365,11 +365,11 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>    * table invalidation serializes on it.
>    *
>    * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
> - * hmm_vma_get_pfns() WITHOUT ERROR !
> + * hmm_range_snapshot() WITHOUT ERROR !
>    *
>    * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
>    */
> -int hmm_vma_get_pfns(struct hmm_range *range);
> +long hmm_range_snapshot(struct hmm_range *range);
>   bool hmm_vma_range_done(struct hmm_range *range);
>   
>   
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 74d69812d6be..0d9ecd3337e5 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -706,23 +706,19 @@ static void hmm_pfns_special(struct hmm_range *range)
>   }
>   
>   /*
> - * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
> - * @range: range being snapshotted
> + * hmm_range_snapshot() - snapshot CPU page table for a range
> + * @range: range
>    * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid

Channeling Mike Rapoport, that should be @Return: instead of Returns: , but...


> - *          vma permission, 0 success
> + *          permission (for instance asking for write and range is read only),
> + *          -EAGAIN if you need to retry, -EFAULT invalid (ie either no valid
> + *          vma or it is illegal to access that range), number of valid pages
> + *          in range->pfns[] (from range start address).

...actually, it's a little hard to spot that we're returning the
number of valid pages. How about:

  * @Returns: number of valid pages in range->pfns[] (from range start
  *           address). This may be zero. If the return value is negative,
  *           then one of the following values may be returned:
  *
  *           -EINVAL  range->invalid is set, or range->start or range->end
  *                    are not valid.
  *           -EPERM   For example, asking for write, when the range is
  *      	      read-only
  *           -EAGAIN  Caller needs to retry
  *           -EFAULT  Either no valid vma exists for this range, or it is
  *                    illegal to access the range

(caution: my white space might be wrong with respect to tabs)
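
And just to double-check my reading of the new convention, a caller
would interpret the result roughly like this (illustrative sketch
only, with 'npages' standing for whatever the caller asked for):

	long ret = hmm_range_snapshot(&range);

	if (ret < 0)
		return ret;	/* one of the negative errnos above */
	if (ret < npages) {
		/* only the first 'ret' entries of range.pfns[] are valid */
	}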

>    *
>    * This snapshots the CPU page table for a range of virtual addresses. Snapshot
>    * validity is tracked by range struct. See hmm_vma_range_done() for further
>    * information.
> - *
> - * The range struct is initialized here. It tracks the CPU page table, but only
> - * if the function returns success (0), in which case the caller must then call
> - * hmm_vma_range_done() to stop CPU page table update tracking on this range.
> - *
> - * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
> - * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
>    */
> -int hmm_vma_get_pfns(struct hmm_range *range)
> +long hmm_range_snapshot(struct hmm_range *range)
>   {
>   	struct vm_area_struct *vma = range->vma;
>   	struct hmm_vma_walk hmm_vma_walk;
> @@ -776,6 +772,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>   	hmm_vma_walk.fault = false;
>   	hmm_vma_walk.range = range;
>   	mm_walk.private = &hmm_vma_walk;
> +	hmm_vma_walk.last = range->start;
>   
>   	mm_walk.vma = vma;
>   	mm_walk.mm = vma->vm_mm;
> @@ -792,9 +789,9 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>   	 * function return 0).
>   	 */
>   	range->hmm = hmm;
> -	return 0;
> +	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
>   }
> -EXPORT_SYMBOL(hmm_vma_get_pfns);
> +EXPORT_SYMBOL(hmm_range_snapshot);
>   
>   /*
>    * hmm_vma_range_done() - stop tracking change to CPU page table over a range
> 

Otherwise, looks good.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 03/10] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot()
  2019-02-21  0:25   ` John Hubbard
@ 2019-02-21  0:28     ` Jerome Glisse
  0 siblings, 0 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-02-21  0:28 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Ralph Campbell

On Wed, Feb 20, 2019 at 04:25:07PM -0800, John Hubbard wrote:
> On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > Rename for consistency between code, comments and documentation. Also
> > improves the comments on all the possible returns values. Improve the
> > function by returning the number of populated entries in pfns array.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > ---
> >   include/linux/hmm.h |  4 ++--
> >   mm/hmm.c            | 23 ++++++++++-------------
> >   2 files changed, 12 insertions(+), 15 deletions(-)
> > 
> 
> Hi Jerome,
> 
> After applying the entire patchset, I still see a few hits of the old name,
> in Documentation:
> 
> $ git grep -n hmm_vma_get_pfns
> Documentation/vm/hmm.rst:192:  int hmm_vma_get_pfns(struct vm_area_struct *vma,
> Documentation/vm/hmm.rst:205:The first one (hmm_vma_get_pfns()) will only
> fetch present CPU page table
> Documentation/vm/hmm.rst:224:      ret = hmm_vma_get_pfns(vma, &range,
> start, end, pfns);
> include/linux/hmm.h:145: * HMM pfn value returned by hmm_vma_get_pfns() or
> hmm_vma_fault() will be:
> 
> 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index bd6e058597a6..ddf49c1b1f5e 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -365,11 +365,11 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
> >    * table invalidation serializes on it.
> >    *
> >    * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
> > - * hmm_vma_get_pfns() WITHOUT ERROR !
> > + * hmm_range_snapshot() WITHOUT ERROR !
> >    *
> >    * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
> >    */
> > -int hmm_vma_get_pfns(struct hmm_range *range);
> > +long hmm_range_snapshot(struct hmm_range *range);
> >   bool hmm_vma_range_done(struct hmm_range *range);
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 74d69812d6be..0d9ecd3337e5 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -706,23 +706,19 @@ static void hmm_pfns_special(struct hmm_range *range)
> >   }
> >   /*
> > - * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
> > - * @range: range being snapshotted
> > + * hmm_range_snapshot() - snapshot CPU page table for a range
> > + * @range: range
> >    * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
> 
> Channeling Mike Rapoport, that should be @Return: instead of Returns: , but...
> 
> 
> > - *          vma permission, 0 success
> > + *          permission (for instance asking for write and range is read only),
> > + *          -EAGAIN if you need to retry, -EFAULT invalid (ie either no valid
> > + *          vma or it is illegal to access that range), number of valid pages
> > + *          in range->pfns[] (from range start address).
> 
> ...actually, that's a little hard to spot that we're returning number of
> valid pages. How about:
> 
>  * @Returns: number of valid pages in range->pfns[] (from range start
>  *           address). This may be zero. If the return value is negative,
>  *           then one of the following values may be returned:
>  *
>  *           -EINVAL  range->invalid is set, or range->start or range->end
>  *                    are not valid.
>  *           -EPERM   For example, asking for write, when the range is
>  *      	      read-only
>  *           -EAGAIN  Caller needs to retry
>  *           -EFAULT  Either no valid vma exists for this range, or it is
>  *                    illegal to access the range
> 
> (caution: my white space might be wrong with respect to tabs)

Will do a documentation patch to improve things and remove the
leftovers.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/10] mm/hmm: use reference counting for HMM struct
  2019-02-21  0:15         ` Jerome Glisse
@ 2019-02-21  0:32           ` John Hubbard
  2019-02-21  0:37             ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: John Hubbard @ 2019-02-21  0:32 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Ralph Campbell, Andrew Morton

On 2/20/19 4:15 PM, Jerome Glisse wrote:
> On Wed, Feb 20, 2019 at 04:06:50PM -0800, John Hubbard wrote:
>> On 2/20/19 3:59 PM, Jerome Glisse wrote:
>>> On Wed, Feb 20, 2019 at 03:47:50PM -0800, John Hubbard wrote:
>>>> On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>>>
>>>>> Every time i read the code to check that the HMM structure does not
>>>>> vanish before it should thanks to the many lock protecting its removal
>>>>> i get a headache. Switch to reference counting instead it is much
>>>>> easier to follow and harder to break. This also remove some code that
>>>>> is no longer needed with refcounting.
>>>>
>>>> Hi Jerome,
>>>>
>>>> That is an excellent idea. Some review comments below:
>>>>
>>>> [snip]
>>>>
>>>>>     static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>>>>>     			const struct mmu_notifier_range *range)
>>>>>     {
>>>>>     	struct hmm_update update;
>>>>> -	struct hmm *hmm = range->mm->hmm;
>>>>> +	struct hmm *hmm = hmm_get(range->mm);
>>>>> +	int ret;
>>>>>     	VM_BUG_ON(!hmm);
>>>>> +	/* Check if hmm_mm_destroy() was call. */
>>>>> +	if (hmm->mm == NULL)
>>>>> +		return 0;
>>>>
>>>> Let's delete that NULL check. It can't provide true protection. If there
>>>> is a way for that to race, we need to take another look at refcounting.
>>>
>>> I will do a patch to delete the NULL check so that it is easier for
>>> Andrew. No need to respin.
>>
>> (Did you miss my request to make hmm_get/hmm_put symmetric, though?)
> 
> Went over my mail i do not see anything about symmetric, what do you
> mean ?
> 
> Cheers,
> Jérôme

I meant the comment that I accidentally deleted, before sending the email!
doh. Sorry about that. :) Here is the recreated comment:

diff --git a/mm/hmm.c b/mm/hmm.c
index a04e4b810610..b9f384ea15e9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -50,6 +50,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
  */
 struct hmm {
 	struct mm_struct	*mm;
+	struct kref		kref;
 	spinlock_t		lock;
 	struct list_head	ranges;
 	struct list_head	mirrors;
@@ -57,6 +58,16 @@ struct hmm {
 	struct rw_semaphore	mirrors_sem;
 };

+static inline struct hmm *hmm_get(struct mm_struct *mm)
+{
+	struct hmm *hmm = READ_ONCE(mm->hmm);
+
+	if (hmm && kref_get_unless_zero(&hmm->kref))
+		return hmm;
+
+	return NULL;
+}
+

So for this, hmm_get() really ought to be symmetric with
hmm_put(), by taking a struct hmm*. And the null check is
not helping here, so let's just go with this smaller version:

static inline struct hmm *hmm_get(struct hmm *hmm)
{
	if (kref_get_unless_zero(&hmm->kref))
		return hmm;

	return NULL;
}

...and change the few callers accordingly.
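
For the callers that would mean something roughly like (just a sketch):

	struct hmm *hmm = READ_ONCE(mm->hmm);

	if (!hmm)
		return -EINVAL;
	hmm = hmm_get(hmm);
	if (!hmm)
		return -EINVAL;
	/* ... use hmm ... */
	hmm_put(hmm);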

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/10] mm/hmm: use reference counting for HMM struct
  2019-02-21  0:32           ` John Hubbard
@ 2019-02-21  0:37             ` Jerome Glisse
  2019-02-21  0:42               ` John Hubbard
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-02-21  0:37 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Ralph Campbell, Andrew Morton

On Wed, Feb 20, 2019 at 04:32:09PM -0800, John Hubbard wrote:
> On 2/20/19 4:15 PM, Jerome Glisse wrote:
> > On Wed, Feb 20, 2019 at 04:06:50PM -0800, John Hubbard wrote:
> > > On 2/20/19 3:59 PM, Jerome Glisse wrote:
> > > > On Wed, Feb 20, 2019 at 03:47:50PM -0800, John Hubbard wrote:
> > > > > On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > 
> > > > > > Every time i read the code to check that the HMM structure does not
> > > > > > vanish before it should thanks to the many lock protecting its removal
> > > > > > i get a headache. Switch to reference counting instead it is much
> > > > > > easier to follow and harder to break. This also remove some code that
> > > > > > is no longer needed with refcounting.
> > > > > 
> > > > > Hi Jerome,
> > > > > 
> > > > > That is an excellent idea. Some review comments below:
> > > > > 
> > > > > [snip]
> > > > > 
> > > > > >     static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> > > > > >     			const struct mmu_notifier_range *range)
> > > > > >     {
> > > > > >     	struct hmm_update update;
> > > > > > -	struct hmm *hmm = range->mm->hmm;
> > > > > > +	struct hmm *hmm = hmm_get(range->mm);
> > > > > > +	int ret;
> > > > > >     	VM_BUG_ON(!hmm);
> > > > > > +	/* Check if hmm_mm_destroy() was call. */
> > > > > > +	if (hmm->mm == NULL)
> > > > > > +		return 0;
> > > > > 
> > > > > Let's delete that NULL check. It can't provide true protection. If there
> > > > > is a way for that to race, we need to take another look at refcounting.
> > > > 
> > > > I will do a patch to delete the NULL check so that it is easier for
> > > > Andrew. No need to respin.
> > > 
> > > (Did you miss my request to make hmm_get/hmm_put symmetric, though?)
> > 
> > Went over my mail i do not see anything about symmetric, what do you
> > mean ?
> > 
> > Cheers,
> > Jérôme
> 
> I meant the comment that I accidentally deleted, before sending the email!
> doh. Sorry about that. :) Here is the recreated comment:
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index a04e4b810610..b9f384ea15e9 100644
> 
> --- a/mm/hmm.c
> 
> +++ b/mm/hmm.c
> 
> @@ -50,6 +50,7 @@
> 
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
> 
>   */
>  struct hmm {
>  	struct mm_struct	*mm;
> +	struct kref		kref;
>  	spinlock_t		lock;
>  	struct list_head	ranges;
>  	struct list_head	mirrors;
> 
> @@ -57,6 +58,16 @@
> 
>  struct hmm {
> 
>  	struct rw_semaphore	mirrors_sem;
>  };
> 
> +static inline struct hmm *hmm_get(struct mm_struct *mm)
> +{
> +	struct hmm *hmm = READ_ONCE(mm->hmm);
> +
> +	if (hmm && kref_get_unless_zero(&hmm->kref))
> +		return hmm;
> +
> +	return NULL;
> +}
> +
> 
> So for this, hmm_get() really ought to be symmetric with
> hmm_put(), by taking a struct hmm*. And the null check is
> not helping here, so let's just go with this smaller version:
> 
> static inline struct hmm *hmm_get(struct hmm *hmm)
> {
> 	if (kref_get_unless_zero(&hmm->kref))
> 		return hmm;
> 
> 	return NULL;
> }
> 
> ...and change the few callers accordingly.
> 

What about renaming hmm_get() to mm_get_hmm() instead?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 01/10] mm/hmm: use reference counting for HMM struct
  2019-02-21  0:37             ` Jerome Glisse
@ 2019-02-21  0:42               ` John Hubbard
  0 siblings, 0 replies; 98+ messages in thread
From: John Hubbard @ 2019-02-21  0:42 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Ralph Campbell, Andrew Morton

On 2/20/19 4:37 PM, Jerome Glisse wrote:
> On Wed, Feb 20, 2019 at 04:32:09PM -0800, John Hubbard wrote:
>> On 2/20/19 4:15 PM, Jerome Glisse wrote:
>>> On Wed, Feb 20, 2019 at 04:06:50PM -0800, John Hubbard wrote:
>>>> On 2/20/19 3:59 PM, Jerome Glisse wrote:
>>>>> On Wed, Feb 20, 2019 at 03:47:50PM -0800, John Hubbard wrote:
>>>>>> On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>>>>>
>>>>>>> Every time i read the code to check that the HMM structure does not
>>>>>>> vanish before it should thanks to the many lock protecting its removal
>>>>>>> i get a headache. Switch to reference counting instead it is much
>>>>>>> easier to follow and harder to break. This also remove some code that
>>>>>>> is no longer needed with refcounting.
>>>>>>
>>>>>> Hi Jerome,
>>>>>>
>>>>>> That is an excellent idea. Some review comments below:
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>>      static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>>>>>>>      			const struct mmu_notifier_range *range)
>>>>>>>      {
>>>>>>>      	struct hmm_update update;
>>>>>>> -	struct hmm *hmm = range->mm->hmm;
>>>>>>> +	struct hmm *hmm = hmm_get(range->mm);
>>>>>>> +	int ret;
>>>>>>>      	VM_BUG_ON(!hmm);
>>>>>>> +	/* Check if hmm_mm_destroy() was call. */
>>>>>>> +	if (hmm->mm == NULL)
>>>>>>> +		return 0;
>>>>>>
>>>>>> Let's delete that NULL check. It can't provide true protection. If there
>>>>>> is a way for that to race, we need to take another look at refcounting.
>>>>>
>>>>> I will do a patch to delete the NULL check so that it is easier for
>>>>> Andrew. No need to respin.
>>>>
>>>> (Did you miss my request to make hmm_get/hmm_put symmetric, though?)
>>>
>>> Went over my mail i do not see anything about symmetric, what do you
>>> mean ?
>>>
>>> Cheers,
>>> Jérôme
>>
>> I meant the comment that I accidentally deleted, before sending the email!
>> doh. Sorry about that. :) Here is the recreated comment:
>>
>> diff --git a/mm/hmm.c b/mm/hmm.c
>> index a04e4b810610..b9f384ea15e9 100644
>>
>> --- a/mm/hmm.c
>>
>> +++ b/mm/hmm.c
>>
>> @@ -50,6 +50,7 @@
>>
>>   static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>>
>>    */
>>   struct hmm {
>>   	struct mm_struct	*mm;
>> +	struct kref		kref;
>>   	spinlock_t		lock;
>>   	struct list_head	ranges;
>>   	struct list_head	mirrors;
>>
>> @@ -57,6 +58,16 @@
>>
>>   struct hmm {
>>
>>   	struct rw_semaphore	mirrors_sem;
>>   };
>>
>> +static inline struct hmm *hmm_get(struct mm_struct *mm)
>> +{
>> +	struct hmm *hmm = READ_ONCE(mm->hmm);
>> +
>> +	if (hmm && kref_get_unless_zero(&hmm->kref))
>> +		return hmm;
>> +
>> +	return NULL;
>> +}
>> +
>>
>> So for this, hmm_get() really ought to be symmetric with
>> hmm_put(), by taking a struct hmm*. And the null check is
>> not helping here, so let's just go with this smaller version:
>>
>> static inline struct hmm *hmm_get(struct hmm *hmm)
>> {
>> 	if (kref_get_unless_zero(&hmm->kref))
>> 		return hmm;
>>
>> 	return NULL;
>> }
>>
>> ...and change the few callers accordingly.
>>
> 
> What about renaning hmm_get() to mm_get_hmm() instead ?
> 

For a get/put pair of functions, it would be ideal to pass
the same argument type to each. It looks like we are passing
around hmm*, and hmm retains a reference count on hmm->mm,
so I think you have a choice of using either mm* or hmm* as
the argument. I'm not sure that one is better than the other
here, as the lifetimes appear to be linked pretty tightly.

Whichever one is used, I think it would be best to use it
in both the _get() and _put() calls.
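
In other words, a symmetric pair could look roughly like the sketch
below (hmm_free() is a placeholder name here and the release logic is
only illustrative; the real teardown path has more to do):

static void hmm_free(struct kref *kref)
{
	struct hmm *hmm = container_of(kref, struct hmm, kref);

	/* Placeholder release: free the structure once the last
	 * reference is dropped. */
	kfree(hmm);
}

static inline struct hmm *hmm_get(struct hmm *hmm)
{
	if (kref_get_unless_zero(&hmm->kref))
		return hmm;

	return NULL;
}

static inline void hmm_put(struct hmm *hmm)
{
	kref_put(&hmm->kref, hmm_free);
}

That way both calls take and return the same struct hmm * type.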

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (10 preceding siblings ...)
  2019-02-20 23:17 ` [PATCH 00/10] HMM updates for 5.1 John Hubbard
@ 2019-02-22 23:31 ` Ralph Campbell
  2019-03-13  1:27 ` Jerome Glisse
  12 siblings, 0 replies; 98+ messages in thread
From: Ralph Campbell @ 2019-02-22 23:31 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: linux-kernel, Andrew Morton, Felix Kuehling,
	Christian König, John Hubbard, Jason Gunthorpe,
	Dan Williams


On 1/29/19 8:54 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> This patchset improves the HMM driver API and add support for hugetlbfs
> and DAX mirroring. The improvement motivation was to make the ODP to HMM
> conversion easier [1]. Because we have nouveau bits schedule for 5.1 and
> to avoid any multi-tree synchronization this patchset adds few lines of
> inline function that wrap the existing HMM driver API to the improved
> API. The nouveau driver was tested before and after this patchset and it
> builds and works on both case so there is no merging issue [2]. The
> nouveau bit are queue up for 5.1 so this is why i added those inline.
> 
> If this get merge in 5.1 the plans is to merge the HMM to ODP in 5.2 or
> 5.3 if testing shows any issues (so far no issues has been found with
> limited testing but Mellanox will be running heavier testing for longer
> time).
> 
> To avoid spamming mm i would like to not cc mm on ODP or nouveau patches,
> however if people prefer to see those on mm mailing list then i can keep
> it cced.
> 
> This is also what i intend to use as a base for AMD and Intel patches
> (v2 with more thing of some rfc which were already posted in the past).
> 
> [1] https://cgit.freedesktop.org/~glisse/linux/log/?h=odp-hmm
> [2] https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-for-5.1
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> 
> Jérôme Glisse (10):
>    mm/hmm: use reference counting for HMM struct
>    mm/hmm: do not erase snapshot when a range is invalidated
>    mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot()
>    mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault()
>    mm/hmm: improve driver API to work and wait over a range
>    mm/hmm: add default fault flags to avoid the need to pre-fill pfns
>      arrays.
>    mm/hmm: add an helper function that fault pages and map them to a
>      device
>    mm/hmm: support hugetlbfs (snap shoting, faulting and DMA mapping)
>    mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
>    mm/hmm: add helpers for driver to safely take the mmap_sem
> 
>   include/linux/hmm.h |  290 ++++++++++--
>   mm/hmm.c            | 1060 +++++++++++++++++++++++++++++--------------
>   2 files changed, 983 insertions(+), 367 deletions(-)
> 

I have been testing this patch series in addition to [1] with some
success. I wouldn't go as far as saying it is thoroughly tested
but you can add:

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>


[1] https://marc.info/?l=linux-mm&m=155060669514459&w=2
     ("[PATCH v5 0/9] mmu notifier provide context informations")

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-01-31  5:44                       ` Dan Williams
@ 2019-03-05 22:16                         ` Andrew Morton
  2019-03-06  4:20                           ` Dan Williams
  2019-03-06 15:49                           ` Jerome Glisse
  0 siblings, 2 replies; 98+ messages in thread
From: Andrew Morton @ 2019-03-05 22:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jerome Glisse, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> >
> > > Another way to help allay these worries is commit to no new exports
> > > without in-tree users. In general, that should go without saying for
> > > any core changes for new or future hardware.
> >
> > I always intend to have an upstream user the issue is that the device
> > driver tree and the mm tree move a different pace and there is always
> > a chicken and egg problem. I do not think Andrew wants to have to
> > merge driver patches through its tree, nor Linus want to have to merge
> > drivers and mm trees in specific order. So it is easier to introduce
> > mm change in one release and driver change in the next. This is what
> > i am doing with ODP. Adding things necessary in 5.1 and working with
> > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > 5.2 (the patch is available today and Mellanox have begin testing it
> > AFAIK). So this is the guideline i will be following. Post mm bits
> > with driver patches, push to merge mm bits one release and have the
> > driver bits in the next. I do hope this sound fine to everyone.
> 
> The track record to date has not been "merge HMM patch in one release
> and merge the driver updates the next". If that is the plan going
> forward that's great, and I do appreciate that this set came with
> driver changes, and maintain hope the existing exports don't go
> user-less for too much longer.

Decision time.  Jerome, how are things looking for getting these driver
changes merged in the next cycle?

Dan, what's your overall take on this series for a 5.1-rc1 merge?

Jerome, what would be the risks in skipping just this [09/10] patch?

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-05 22:16                         ` Andrew Morton
@ 2019-03-06  4:20                           ` Dan Williams
  2019-03-06 15:51                             ` Jerome Glisse
  2019-03-07 17:46                             ` Andrew Morton
  2019-03-06 15:49                           ` Jerome Glisse
  1 sibling, 2 replies; 98+ messages in thread
From: Dan Williams @ 2019-03-06  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jerome Glisse, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Mar 5, 2019 at 2:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
>
> > >
> > > > Another way to help allay these worries is commit to no new exports
> > > > without in-tree users. In general, that should go without saying for
> > > > any core changes for new or future hardware.
> > >
> > > I always intend to have an upstream user the issue is that the device
> > > driver tree and the mm tree move a different pace and there is always
> > > a chicken and egg problem. I do not think Andrew wants to have to
> > > merge driver patches through its tree, nor Linus want to have to merge
> > > drivers and mm trees in specific order. So it is easier to introduce
> > > mm change in one release and driver change in the next. This is what
> > > i am doing with ODP. Adding things necessary in 5.1 and working with
> > > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > > 5.2 (the patch is available today and Mellanox have begin testing it
> > > AFAIK). So this is the guideline i will be following. Post mm bits
> > > with driver patches, push to merge mm bits one release and have the
> > > driver bits in the next. I do hope this sound fine to everyone.
> >
> > The track record to date has not been "merge HMM patch in one release
> > and merge the driver updates the next". If that is the plan going
> > forward that's great, and I do appreciate that this set came with
> > driver changes, and maintain hope the existing exports don't go
> > user-less for too much longer.
>
> Decision time.  Jerome, how are things looking for getting these driver
> changes merged in the next cycle?
>
> Dan, what's your overall take on this series for a 5.1-rc1 merge?

My hesitation would be drastically reduced if there was a plan to
avoid dangling unconsumed symbols and functionality. Specifically one
or more of the following suggestions:

* EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
surface for out-of-tree consumers to come grumble at us when we
continue to refactor the kernel as we are wont to do.

* A commitment to consume newly exported symbols in the same merge
window, or the following merge window. When that goal is missed revert
the functionality until such time that it can be consumed, or
otherwise abandoned.

* No new symbol exports and functionality while existing symbols go unconsumed.

These are the minimum requirements I would expect my work, or any
core-mm work for that matter, to be held to; I see no reason why HMM
could not meet the same.

On this specific patch I would ask that the changelog incorporate the
motivation that was teased out of our follow-on discussion, not "There
is no reason not to support that case." which isn't a justification.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-05 22:16                         ` Andrew Morton
  2019-03-06  4:20                           ` Dan Williams
@ 2019-03-06 15:49                           ` Jerome Glisse
  2019-03-06 22:18                             ` Andrew Morton
  1 sibling, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-06 15:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Mar 05, 2019 at 02:16:35PM -0800, Andrew Morton wrote:
> On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > >
> > > > Another way to help allay these worries is commit to no new exports
> > > > without in-tree users. In general, that should go without saying for
> > > > any core changes for new or future hardware.
> > >
> > > I always intend to have an upstream user the issue is that the device
> > > driver tree and the mm tree move a different pace and there is always
> > > a chicken and egg problem. I do not think Andrew wants to have to
> > > merge driver patches through its tree, nor Linus want to have to merge
> > > drivers and mm trees in specific order. So it is easier to introduce
> > > mm change in one release and driver change in the next. This is what
> > > i am doing with ODP. Adding things necessary in 5.1 and working with
> > > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > > 5.2 (the patch is available today and Mellanox have begin testing it
> > > AFAIK). So this is the guideline i will be following. Post mm bits
> > > with driver patches, push to merge mm bits one release and have the
> > > driver bits in the next. I do hope this sound fine to everyone.
> > 
> > The track record to date has not been "merge HMM patch in one release
> > and merge the driver updates the next". If that is the plan going
> > forward that's great, and I do appreciate that this set came with
> > driver changes, and maintain hope the existing exports don't go
> > user-less for too much longer.
> 
> Decision time.  Jerome, how are things looking for getting these driver
> changes merged in the next cycle?

nouveau is merged already.

> 
> Dan, what's your overall take on this series for a 5.1-rc1 merge?
> 
> Jerome, what would be the risks in skipping just this [09/10] patch?

As nouveau is a new user it does not regress anything, but for RDMA
mlx5 (which i expect to merge next window) it would regress that
driver.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-06  4:20                           ` Dan Williams
@ 2019-03-06 15:51                             ` Jerome Glisse
  2019-03-06 15:57                               ` Dan Williams
  2019-03-07 17:46                             ` Andrew Morton
  1 sibling, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-06 15:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Mar 05, 2019 at 08:20:10PM -0800, Dan Williams wrote:
> On Tue, Mar 5, 2019 at 2:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > > >
> > > > > Another way to help allay these worries is commit to no new exports
> > > > > without in-tree users. In general, that should go without saying for
> > > > > any core changes for new or future hardware.
> > > >
> > > > I always intend to have an upstream user the issue is that the device
> > > > driver tree and the mm tree move a different pace and there is always
> > > > a chicken and egg problem. I do not think Andrew wants to have to
> > > > merge driver patches through its tree, nor Linus want to have to merge
> > > > drivers and mm trees in specific order. So it is easier to introduce
> > > > mm change in one release and driver change in the next. This is what
> > > > i am doing with ODP. Adding things necessary in 5.1 and working with
> > > > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > > > 5.2 (the patch is available today and Mellanox have begin testing it
> > > > AFAIK). So this is the guideline i will be following. Post mm bits
> > > > with driver patches, push to merge mm bits one release and have the
> > > > driver bits in the next. I do hope this sound fine to everyone.
> > >
> > > The track record to date has not been "merge HMM patch in one release
> > > and merge the driver updates the next". If that is the plan going
> > > forward that's great, and I do appreciate that this set came with
> > > driver changes, and maintain hope the existing exports don't go
> > > user-less for too much longer.
> >
> > Decision time.  Jerome, how are things looking for getting these driver
> > changes merged in the next cycle?
> >
> > Dan, what's your overall take on this series for a 5.1-rc1 merge?
> 
> My hesitation would be drastically reduced if there was a plan to
> avoid dangling unconsumed symbols and functionality. Specifically one
> or more of the following suggestions:
> 
> * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> surface for out-of-tree consumers to come grumble at us when we
> continue to refactor the kernel as we are wont to do.
> 
> * A commitment to consume newly exported symbols in the same merge
> window, or the following merge window. When that goal is missed revert
> the functionality until such time that it can be consumed, or
> otherwise abandoned.
> 
> * No new symbol exports and functionality while existing symbols go unconsumed.
> 
> These are the minimum requirements I would expect my work, or any
> core-mm work for that matter, to be held to, I see no reason why HMM
> could not meet the same.

nouveau uses all of this and other driver patchsets have been posted
to also use this API.

> On this specific patch I would ask that the changelog incorporate the
> motivation that was teased out of our follow-on discussion, not "There
> is no reason not to support that case." which isn't a justification.

mlx5 wants to use HMM; without DAX support it would regress mlx5. Other
drivers like nouveau also want to access DAX filesystems. So yes, there
is no reason not to support DAX filesystems. Why do you not want DAX
with mirroring? You want to cripple HMM? Why?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-06 15:51                             ` Jerome Glisse
@ 2019-03-06 15:57                               ` Dan Williams
  2019-03-06 16:03                                 ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-06 15:57 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, Mar 6, 2019 at 7:51 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Mar 05, 2019 at 08:20:10PM -0800, Dan Williams wrote:
> > On Tue, Mar 5, 2019 at 2:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > > >
> > > > > > Another way to help allay these worries is commit to no new exports
> > > > > > without in-tree users. In general, that should go without saying for
> > > > > > any core changes for new or future hardware.
> > > > >
> > > > > I always intend to have an upstream user the issue is that the device
> > > > > driver tree and the mm tree move a different pace and there is always
> > > > > a chicken and egg problem. I do not think Andrew wants to have to
> > > > > merge driver patches through its tree, nor Linus want to have to merge
> > > > > drivers and mm trees in specific order. So it is easier to introduce
> > > > > mm change in one release and driver change in the next. This is what
> > > > > i am doing with ODP. Adding things necessary in 5.1 and working with
> > > > > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > > > > 5.2 (the patch is available today and Mellanox have begin testing it
> > > > > AFAIK). So this is the guideline i will be following. Post mm bits
> > > > > with driver patches, push to merge mm bits one release and have the
> > > > > driver bits in the next. I do hope this sound fine to everyone.
> > > >
> > > > The track record to date has not been "merge HMM patch in one release
> > > > and merge the driver updates the next". If that is the plan going
> > > > forward that's great, and I do appreciate that this set came with
> > > > driver changes, and maintain hope the existing exports don't go
> > > > user-less for too much longer.
> > >
> > > Decision time.  Jerome, how are things looking for getting these driver
> > > changes merged in the next cycle?
> > >
> > > Dan, what's your overall take on this series for a 5.1-rc1 merge?
> >
> > My hesitation would be drastically reduced if there was a plan to
> > avoid dangling unconsumed symbols and functionality. Specifically one
> > or more of the following suggestions:
> >
> > * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> > surface for out-of-tree consumers to come grumble at us when we
> > continue to refactor the kernel as we are wont to do.
> >
> > * A commitment to consume newly exported symbols in the same merge
> > window, or the following merge window. When that goal is missed revert
> > the functionality until such time that it can be consumed, or
> > otherwise abandoned.
> >
> > * No new symbol exports and functionality while existing symbols go unconsumed.
> >
> > These are the minimum requirements I would expect my work, or any
> > core-mm work for that matter, to be held to, I see no reason why HMM
> > could not meet the same.
>
> nouveau use all of this and other driver patchset have been posted to
> also use this API.
>
> > On this specific patch I would ask that the changelog incorporate the
> > motivation that was teased out of our follow-on discussion, not "There
> > is no reason not to support that case." which isn't a justification.
>
> mlx5 wants to use HMM without DAX support it would regress mlx5. Other
> driver like nouveau also want to access DAX filesystem. So yes there is
> no reason not to support DAX filesystem. Why do you not want DAX with
> mirroring ? You want to cripple HMM ? Why ?

There is a misunderstanding... my request for this patch was to update
the changelog to describe the merits of DAX mirroring to replace the
"There is no reason not to support that case." Otherwise someone
reading this changelog in a year will wonder what the motivation was.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-06 15:57                               ` Dan Williams
@ 2019-03-06 16:03                                 ` Jerome Glisse
  2019-03-06 16:06                                   ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-06 16:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, Mar 06, 2019 at 07:57:30AM -0800, Dan Williams wrote:
> On Wed, Mar 6, 2019 at 7:51 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Mar 05, 2019 at 08:20:10PM -0800, Dan Williams wrote:
> > > On Tue, Mar 5, 2019 at 2:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > > >
> > > > > >
> > > > > > > Another way to help allay these worries is commit to no new exports
> > > > > > > without in-tree users. In general, that should go without saying for
> > > > > > > any core changes for new or future hardware.
> > > > > >
> > > > > > I always intend to have an upstream user the issue is that the device
> > > > > > driver tree and the mm tree move a different pace and there is always
> > > > > > a chicken and egg problem. I do not think Andrew wants to have to
> > > > > > merge driver patches through its tree, nor Linus want to have to merge
> > > > > > drivers and mm trees in specific order. So it is easier to introduce
> > > > > > mm change in one release and driver change in the next. This is what
> > > > > > i am doing with ODP. Adding things necessary in 5.1 and working with
> > > > > > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > > > > > 5.2 (the patch is available today and Mellanox have begin testing it
> > > > > > AFAIK). So this is the guideline i will be following. Post mm bits
> > > > > > with driver patches, push to merge mm bits one release and have the
> > > > > > driver bits in the next. I do hope this sound fine to everyone.
> > > > >
> > > > > The track record to date has not been "merge HMM patch in one release
> > > > > and merge the driver updates the next". If that is the plan going
> > > > > forward that's great, and I do appreciate that this set came with
> > > > > driver changes, and maintain hope the existing exports don't go
> > > > > user-less for too much longer.
> > > >
> > > > Decision time.  Jerome, how are things looking for getting these driver
> > > > changes merged in the next cycle?
> > > >
> > > > Dan, what's your overall take on this series for a 5.1-rc1 merge?
> > >
> > > My hesitation would be drastically reduced if there was a plan to
> > > avoid dangling unconsumed symbols and functionality. Specifically one
> > > or more of the following suggestions:
> > >
> > > * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> > > surface for out-of-tree consumers to come grumble at us when we
> > > continue to refactor the kernel as we are wont to do.
> > >
> > > * A commitment to consume newly exported symbols in the same merge
> > > window, or the following merge window. When that goal is missed revert
> > > the functionality until such time that it can be consumed, or
> > > otherwise abandoned.
> > >
> > > * No new symbol exports and functionality while existing symbols go unconsumed.
> > >
> > > These are the minimum requirements I would expect my work, or any
> > > core-mm work for that matter, to be held to, I see no reason why HMM
> > > could not meet the same.
> >
> > nouveau use all of this and other driver patchset have been posted to
> > also use this API.
> >
> > > On this specific patch I would ask that the changelog incorporate the
> > > motivation that was teased out of our follow-on discussion, not "There
> > > is no reason not to support that case." which isn't a justification.
> >
> > mlx5 wants to use HMM without DAX support it would regress mlx5. Other
> > driver like nouveau also want to access DAX filesystem. So yes there is
> > no reason not to support DAX filesystem. Why do you not want DAX with
> > mirroring ? You want to cripple HMM ? Why ?
> 
> There is a misunderstanding... my request for this patch was to update
> the changelog to describe the merits of DAX mirroring to replace the
> "There is no reason not to support that case." Otherwise someone
> reading this changelog in a year will wonder what the motivation was.

So what about:

HMM mirroring allows a device to mirror a process address space onto
the device; there is no reason for that mirroring not to work when the
virtual addresses are the result of an mmap of a file on a DAX enabled
filesystem.
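
(To be concrete about the case this covers, the mapping in question is
just an ordinary shared file mmap where the file happens to live on a
DAX mounted filesystem; hypothetical userspace sketch, the path is made
up:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Made-up path to a file on a filesystem mounted with -o dax. */
	int fd = open("/mnt/pmem0/data.bin", O_RDWR);
	void *addr;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* A device mirroring this process address space is expected to
	 * handle faults on this range just like anonymous memory. */
	addr = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	munmap(addr, 1 << 20);
	close(fd);
	return 0;
}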

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-06 16:03                                 ` Jerome Glisse
@ 2019-03-06 16:06                                   ` Dan Williams
  0 siblings, 0 replies; 98+ messages in thread
From: Dan Williams @ 2019-03-06 16:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, Mar 6, 2019 at 8:03 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Mar 06, 2019 at 07:57:30AM -0800, Dan Williams wrote:
> > On Wed, Mar 6, 2019 at 7:51 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Mar 05, 2019 at 08:20:10PM -0800, Dan Williams wrote:
> > > > On Tue, Mar 5, 2019 at 2:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > >
> > > > > On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > > > >
> > > > > > >
> > > > > > > > Another way to help allay these worries is commit to no new exports
> > > > > > > > without in-tree users. In general, that should go without saying for
> > > > > > > > any core changes for new or future hardware.
> > > > > > >
> > > > > > > I always intend to have an upstream user the issue is that the device
> > > > > > > driver tree and the mm tree move a different pace and there is always
> > > > > > > a chicken and egg problem. I do not think Andrew wants to have to
> > > > > > > merge driver patches through its tree, nor Linus want to have to merge
> > > > > > > drivers and mm trees in specific order. So it is easier to introduce
> > > > > > > mm change in one release and driver change in the next. This is what
> > > > > > > i am doing with ODP. Adding things necessary in 5.1 and working with
> > > > > > > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > > > > > > 5.2 (the patch is available today and Mellanox have begin testing it
> > > > > > > AFAIK). So this is the guideline i will be following. Post mm bits
> > > > > > > with driver patches, push to merge mm bits one release and have the
> > > > > > > driver bits in the next. I do hope this sound fine to everyone.
> > > > > >
> > > > > > The track record to date has not been "merge HMM patch in one release
> > > > > > and merge the driver updates the next". If that is the plan going
> > > > > > forward that's great, and I do appreciate that this set came with
> > > > > > driver changes, and maintain hope the existing exports don't go
> > > > > > user-less for too much longer.
> > > > >
> > > > > Decision time.  Jerome, how are things looking for getting these driver
> > > > > changes merged in the next cycle?
> > > > >
> > > > > Dan, what's your overall take on this series for a 5.1-rc1 merge?
> > > >
> > > > My hesitation would be drastically reduced if there was a plan to
> > > > avoid dangling unconsumed symbols and functionality. Specifically one
> > > > or more of the following suggestions:
> > > >
> > > > * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> > > > surface for out-of-tree consumers to come grumble at us when we
> > > > continue to refactor the kernel as we are wont to do.
> > > >
> > > > * A commitment to consume newly exported symbols in the same merge
> > > > window, or the following merge window. When that goal is missed revert
> > > > the functionality until such time that it can be consumed, or
> > > > otherwise abandoned.
> > > >
> > > > * No new symbol exports and functionality while existing symbols go unconsumed.
> > > >
> > > > These are the minimum requirements I would expect my work, or any
> > > > core-mm work for that matter, to be held to, I see no reason why HMM
> > > > could not meet the same.
> > >
> > > nouveau use all of this and other driver patchset have been posted to
> > > also use this API.
> > >
> > > > On this specific patch I would ask that the changelog incorporate the
> > > > motivation that was teased out of our follow-on discussion, not "There
> > > > is no reason not to support that case." which isn't a justification.
> > >
> > > mlx5 wants to use HMM without DAX support it would regress mlx5. Other
> > > driver like nouveau also want to access DAX filesystem. So yes there is
> > > no reason not to support DAX filesystem. Why do you not want DAX with
> > > mirroring ? You want to cripple HMM ? Why ?
> >
> > There is a misunderstanding... my request for this patch was to update
> > the changelog to describe the merits of DAX mirroring to replace the
> > "There is no reason not to support that case." Otherwise someone
> > reading this changelog in a year will wonder what the motivation was.
>
> So what about:
>
> HMM mirroring allow device to mirror process address onto device
> there is no reason for that mirroring to not work if the virtual
> address are the result of an mmap of a file on DAX enabled file-
> system.

Looks like an improvement to me.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-06 15:49                           ` Jerome Glisse
@ 2019-03-06 22:18                             ` Andrew Morton
  2019-03-07  0:36                               ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Andrew Morton @ 2019-03-06 22:18 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, 6 Mar 2019 10:49:04 -0500 Jerome Glisse <jglisse@redhat.com> wrote:

> On Tue, Mar 05, 2019 at 02:16:35PM -0800, Andrew Morton wrote:
> > On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > >
> > > > > Another way to help allay these worries is commit to no new exports
> > > > > without in-tree users. In general, that should go without saying for
> > > > > any core changes for new or future hardware.
> > > >
> > > > I always intend to have an upstream user the issue is that the device
> > > > driver tree and the mm tree move a different pace and there is always
> > > > a chicken and egg problem. I do not think Andrew wants to have to
> > > > merge driver patches through its tree, nor Linus want to have to merge
> > > > drivers and mm trees in specific order. So it is easier to introduce
> > > > mm change in one release and driver change in the next. This is what
> > > > i am doing with ODP. Adding things necessary in 5.1 and working with
> > > > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > > > 5.2 (the patch is available today and Mellanox have begin testing it
> > > > AFAIK). So this is the guideline i will be following. Post mm bits
> > > > with driver patches, push to merge mm bits one release and have the
> > > > driver bits in the next. I do hope this sound fine to everyone.
> > > 
> > > The track record to date has not been "merge HMM patch in one release
> > > and merge the driver updates the next". If that is the plan going
> > > forward that's great, and I do appreciate that this set came with
> > > driver changes, and maintain hope the existing exports don't go
> > > user-less for too much longer.
> > 
> > Decision time.  Jerome, how are things looking for getting these driver
> > changes merged in the next cycle?
> 
> nouveau is merge already.

Confused.  Nouveau in mainline is dependent upon "mm/hmm: allow to
mirror vma of a file on a DAX backed filesystem"?  That can't be the
case?

> > 
> > Dan, what's your overall take on this series for a 5.1-rc1 merge?
> > 
> > Jerome, what would be the risks in skipping just this [09/10] patch?
> 
> As nouveau is a new user it does not regress anything but for RDMA
> mlx5 (which i expect to merge new window) it would regress that
> driver.

Also confused.  How can omitting "mm/hmm: allow to mirror vma of a file
on a DAX backed filesystem" from 5.1-rc1 cause an mlx5 regression?


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-06 22:18                             ` Andrew Morton
@ 2019-03-07  0:36                               ` Jerome Glisse
  0 siblings, 0 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-03-07  0:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, Mar 06, 2019 at 02:18:20PM -0800, Andrew Morton wrote:
> On Wed, 6 Mar 2019 10:49:04 -0500 Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > On Tue, Mar 05, 2019 at 02:16:35PM -0800, Andrew Morton wrote:
> > > On Wed, 30 Jan 2019 21:44:46 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > > 
> > > > >
> > > > > > Another way to help allay these worries is commit to no new exports
> > > > > > without in-tree users. In general, that should go without saying for
> > > > > > any core changes for new or future hardware.
> > > > >
> > > > > I always intend to have an upstream user the issue is that the device
> > > > > driver tree and the mm tree move a different pace and there is always
> > > > > a chicken and egg problem. I do not think Andrew wants to have to
> > > > > merge driver patches through its tree, nor Linus want to have to merge
> > > > > drivers and mm trees in specific order. So it is easier to introduce
> > > > > mm change in one release and driver change in the next. This is what
> > > > > i am doing with ODP. Adding things necessary in 5.1 and working with
> > > > > Mellanox to have the ODP HMM patch fully tested and ready to go in
> > > > > 5.2 (the patch is available today and Mellanox have begin testing it
> > > > > AFAIK). So this is the guideline i will be following. Post mm bits
> > > > > with driver patches, push to merge mm bits one release and have the
> > > > > driver bits in the next. I do hope this sound fine to everyone.
> > > > 
> > > > The track record to date has not been "merge HMM patch in one release
> > > > and merge the driver updates the next". If that is the plan going
> > > > forward that's great, and I do appreciate that this set came with
> > > > driver changes, and maintain hope the existing exports don't go
> > > > user-less for too much longer.
> > > 
> > > Decision time.  Jerome, how are things looking for getting these driver
> > > changes merged in the next cycle?
> > 
> > nouveau is merge already.
> 
> Confused.  Nouveau in mainline is dependent upon "mm/hmm: allow to
> mirror vma of a file on a DAX backed filesystem"?  That can't be the
> case?

Not really, HMM mirror is about mirroring an address space onto the
device, so if mirroring does not work for files that are on a filesystem
that uses DAX it fails in an unexpected way from the user's point of
view. But as nouveau is just getting upstream you can argue that no one
previously depended on that working for file backed pages on a DAX
filesystem.

Now the ODP RDMA case is different: what is upstream today works on DAX,
so if that patch is not upstream in 5.1 then i can not merge HMM ODP in
5.2 as it would regress, and the ODP people would not take the risk of
regression, ie ODP folks want the DAX support to be upstream first.

> 
> > > 
> > > Dan, what's your overall take on this series for a 5.1-rc1 merge?
> > > 
> > > Jerome, what would be the risks in skipping just this [09/10] patch?
> > 
> > As nouveau is a new user it does not regress anything but for RDMA
> > mlx5 (which i expect to merge new window) it would regress that
> > driver.
> 
> Also confused.  How can omitting "mm/hmm: allow to mirror vma of a file
> on a DAX backed filesystem" from 5.1-rc1 cause an mlx5 regression?

Not in 5.1 but i can not merge HMM ODP in 5.2 if that is not in 5.1.
I know this circular dependency between sub-system is painful but i
do not see any simpler way.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-06  4:20                           ` Dan Williams
  2019-03-06 15:51                             ` Jerome Glisse
@ 2019-03-07 17:46                             ` Andrew Morton
  2019-03-07 18:56                               ` Jerome Glisse
  1 sibling, 1 reply; 98+ messages in thread
From: Andrew Morton @ 2019-03-07 17:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jerome Glisse, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, 5 Mar 2019 20:20:10 -0800 Dan Williams <dan.j.williams@intel.com> wrote:

> My hesitation would be drastically reduced if there was a plan to
> avoid dangling unconsumed symbols and functionality. Specifically one
> or more of the following suggestions:
> 
> * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> surface for out-of-tree consumers to come grumble at us when we
> continue to refactor the kernel as we are wont to do.

The existing patches use EXPORT_SYMBOL() so that's a sticking point. 
Jerome, what would happen if we made these EXPORT_SYMBOL_GPL()?

> * A commitment to consume newly exported symbols in the same merge
> window, or the following merge window. When that goal is missed revert
> the functionality until such time that it can be consumed, or
> otherwise abandoned.

It sounds like we can tick this box.

> * No new symbol exports and functionality while existing symbols go unconsumed.

Unsure about this one?

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-07 17:46                             ` Andrew Morton
@ 2019-03-07 18:56                               ` Jerome Glisse
  2019-03-12  3:13                                 ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-07 18:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Thu, Mar 07, 2019 at 09:46:54AM -0800, Andrew Morton wrote:
> On Tue, 5 Mar 2019 20:20:10 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > My hesitation would be drastically reduced if there was a plan to
> > avoid dangling unconsumed symbols and functionality. Specifically one
> > or more of the following suggestions:
> > 
> > * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> > surface for out-of-tree consumers to come grumble at us when we
> > continue to refactor the kernel as we are wont to do.
> 
> The existing patches use EXPORT_SYMBOL() so that's a sticking point. 
> Jerome, what would happen is we made these EXPORT_SYMBOL_GPL()?

So Dan argues that GPL exports solve the problem of out-of-tree users
and my personal experience is that it does not. The GPU sub-system has
tons of GPL drivers that are not upstream and we never felt that we were
bound to support them in any way. We always were very clear that if you
are not upstream then you do not have any voice on the changes we do.

So my experience is that GPL does not help here. It is just about being
clear and ignoring anyone who does not have an upstream driver, ie we
have free hands to update HMM in any way as long as we keep supporting
the upstream user.

That being said, if the GPL aspect is that important to some then fine,
let's switch all HMM symbols to GPL.
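
(For reference, the change being discussed is a one-line annotation per
exported function; hmm_range_fault is just used as an example symbol
here:)

/* Current form: any module, whatever its license, may use the symbol. */
EXPORT_SYMBOL(hmm_range_fault);

/* GPL-only form: only modules declaring a GPL-compatible license
 * may link against it. */
EXPORT_SYMBOL_GPL(hmm_range_fault);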

> 
> > * A commitment to consume newly exported symbols in the same merge
> > window, or the following merge window. When that goal is missed revert
> > the functionality until such time that it can be consumed, or
> > otherwise abandoned.
> 
> It sounds like we can tick this box.

I wouldn't be too strict either; when adding something in release N,
the driver change in N+1 can miss N+1 because of a bug or regression
and be pushed to N+2.

I think a better stance here is that if we do not get any sign-off
on the feature from the driver maintainer for which the feature is
intended then we just do not merge. If after a few releases we still
can not get the driver to use it then we revert.

It just feels dumb to revert at N+1 just to get it back in N+2 as
the driver bits get fixed.

> 
> > * No new symbol exports and functionality while existing symbols go unconsumed.
> 
> Unsure about this one?

With nouveau upstream now everything is used. ODP will use some of the
symbols too. PPC has patchsets posted to use a lot of HMM too. I have
been working with other vendors that have patchsets being worked on to
use HMM too.

I have not done all those functions just for the fun of it :) They do
have real uses and users. It took a long time to get nouveau because of
userspace; we had a lot of catching up to do in mesa and llvm and we are
still very rough there.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-07 18:56                               ` Jerome Glisse
@ 2019-03-12  3:13                                 ` Dan Williams
  2019-03-12 15:25                                   ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-12  3:13 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Thu, Mar 7, 2019 at 10:56 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Thu, Mar 07, 2019 at 09:46:54AM -0800, Andrew Morton wrote:
> > On Tue, 5 Mar 2019 20:20:10 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > > My hesitation would be drastically reduced if there was a plan to
> > > avoid dangling unconsumed symbols and functionality. Specifically one
> > > or more of the following suggestions:
> > >
> > > * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> > > surface for out-of-tree consumers to come grumble at us when we
> > > continue to refactor the kernel as we are wont to do.
> >
> > The existing patches use EXPORT_SYMBOL() so that's a sticking point.
> > Jerome, what would happen is we made these EXPORT_SYMBOL_GPL()?
>
> So Dan argue that GPL export solve the problem of out of tree user and
> my personnal experience is that it does not. The GPU sub-system has tons
> of GPL drivers that are not upstream and we never felt that we were bound
> to support them in anyway. We always were very clear that if you are not
> upstream that you do not have any voice on changes we do.
>
> So my exeperience is that GPL does not help here. It is just about being
> clear and ignoring anyone who does not have an upstream driver ie we have
> free hands to update HMM in anyway as long as we keep supporting the
> upstream user.
>
> That being said if the GPL aspect is that much important to some then
> fine let switch all HMM symbol to GPL.

I should add that I would not be opposed to moving symbols to
non-GPL-only over time, but that should be based on our experience
with the stability and utility of the implementation. For brand new
symbols there's just no data to argue that we can / should keep the
interface stable, or that the interface exposes something fragile that
we'd rather not export at all. That experience gathering and thrash is
best constrained to upstream GPL-only drivers that are signing up to
participate in that maturation process.

So I think it is important from a practical perspective and is a lower
risk way to run this HMM experiment of "merge infrastructure way in
advance of an upstream user".

> > > * A commitment to consume newly exported symbols in the same merge
> > > window, or the following merge window. When that goal is missed revert
> > > the functionality until such time that it can be consumed, or
> > > otherwise abandoned.
> >
> > It sounds like we can tick this box.
>
> I wouldn't be too strick either, when adding something in release N
> the driver change in N+1 can miss N+1 because of bug or regression
> and be push to N+2.
>
> I think a better stance here is that if we do not get any sign-off
> on the feature from driver maintainer for which the feature is intended
> then we just do not merge.

Agree, no driver maintainer sign-off then no merge.

> If after few release we still can not get
> the driver to use it then we revert.

As long as it is made clear to the driver maintainer that they have
one cycle to consume it then we can have a conversation if it is too
early to merge the infrastructure. If no one has time to consume the
feature, why rush dead code into the kernel? Also, waiting 2 cycles
means the infrastructure that was hard to review without a user is now
even harder to review because any review momentum has been lost by the
time the user show up, so we're better off keeping them close together
in time.


> It just feels dumb to revert at N+1 just to get it back in N+2 as
> the driver bit get fix.

No, I think it just means the infrastructure went in too early if a
driver can't consume it in a development cycle. Let's revisit if it
becomes a problem in practice.

> > > * No new symbol exports and functionality while existing symbols go unconsumed.
> >
> > Unsure about this one?
>
> With nouveau upstream now everything is use. ODP will use some of the
> symbol too. PPC has patchset posted to use lot of HMM too. I have been
> working with other vendor that have patchset being work on to use HMM
> too.
>
> I have not done all those function just for the fun of it :) They do
> have real use and user. It took a longtime to get nouveau because of
> userspace we had a lot of catchup to do in mesa and llvm and we are
> still very rough there.

Sure, this one is less of a concern if we can stick to tighter
timelines between infrastructure and driver consumer merge.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-12  3:13                                 ` Dan Williams
@ 2019-03-12 15:25                                   ` Jerome Glisse
  2019-03-12 16:06                                     ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-12 15:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Mon, Mar 11, 2019 at 08:13:53PM -0700, Dan Williams wrote:
> On Thu, Mar 7, 2019 at 10:56 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Thu, Mar 07, 2019 at 09:46:54AM -0800, Andrew Morton wrote:
> > > On Tue, 5 Mar 2019 20:20:10 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > > My hesitation would be drastically reduced if there was a plan to
> > > > avoid dangling unconsumed symbols and functionality. Specifically one
> > > > or more of the following suggestions:
> > > >
> > > > * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> > > > surface for out-of-tree consumers to come grumble at us when we
> > > > continue to refactor the kernel as we are wont to do.
> > >
> > > The existing patches use EXPORT_SYMBOL() so that's a sticking point.
> > > Jerome, what would happen is we made these EXPORT_SYMBOL_GPL()?
> >
> > So Dan argue that GPL export solve the problem of out of tree user and
> > my personnal experience is that it does not. The GPU sub-system has tons
> > of GPL drivers that are not upstream and we never felt that we were bound
> > to support them in anyway. We always were very clear that if you are not
> > upstream that you do not have any voice on changes we do.
> >
> > So my exeperience is that GPL does not help here. It is just about being
> > clear and ignoring anyone who does not have an upstream driver ie we have
> > free hands to update HMM in anyway as long as we keep supporting the
> > upstream user.
> >
> > That being said if the GPL aspect is that much important to some then
> > fine let switch all HMM symbol to GPL.
> 
> I should add that I would not be opposed to moving symbols to
> non-GPL-only over time, but that should be based on our experience
> with the stability and utility of the implementation. For brand new
> symbols there's just no data to argue that we can / should keep the
> interface stable, or that the interface exposes something fragile that
> we'd rather not export at all. That experience gathering and thrash is
> best constrained to upstream GPL-only drivers that are signing up to
> participate in that maturation process.
> 
> So I think it is important from a practical perspective and is a lower
> risk way to run this HMM experiment of "merge infrastructure way in
> advance of an upstream user".
> 
> > > > * A commitment to consume newly exported symbols in the same merge
> > > > window, or the following merge window. When that goal is missed revert
> > > > the functionality until such time that it can be consumed, or
> > > > otherwise abandoned.
> > >
> > > It sounds like we can tick this box.
> >
> > I wouldn't be too strick either, when adding something in release N
> > the driver change in N+1 can miss N+1 because of bug or regression
> > and be push to N+2.
> >
> > I think a better stance here is that if we do not get any sign-off
> > on the feature from driver maintainer for which the feature is intended
> > then we just do not merge.
> 
> Agree, no driver maintainer sign-off then no merge.
> 
> > If after few release we still can not get
> > the driver to use it then we revert.
> 
> As long as it is made clear to the driver maintainer that they have
> one cycle to consume it then we can have a conversation if it is too
> early to merge the infrastructure. If no one has time to consume the
> feature, why rush dead code into the kernel? Also, waiting 2 cycles
> means the infrastructure that was hard to review without a user is now
> even harder to review because any review momentum has been lost by the
> time the user show up, so we're better off keeping them close together
> in time.

Misunderstanding here: in the first post the infrastructure and the
driver bits get posted together, just like i have been doing lately. So
you know that you have a working user for the feature and what is left
is pushing the driver bits through the appropriate tree. So driver
maintainer support is about knowing that they want the feature and
having some confidence that it looks ready.

It also means you can review the infrastructure alongside the user of it.

> 
> 
> > It just feels dumb to revert at N+1 just to get it back in N+2 as
> > the driver bit get fix.
> 
> No, I think it just means the infrastructure went in too early if a
> driver can't consume it in a development cycle. Lets revisit if it
> becomes a problem in practice.

Well, that's just dumb, to have a hard guideline like that. Many things
can lead to missing a deadline. For instance the bug i am referring to
might have nothing to do with the feature; it can be something related
to integrating the feature, an unforeseen side effect. So i believe a
better guideline is the driver maintainer rejecting the feature rather
than just failure to meet one deadline.


> > > > * No new symbol exports and functionality while existing symbols go unconsumed.
> > >
> > > Unsure about this one?
> >
> > With nouveau upstream now everything is use. ODP will use some of the
> > symbol too. PPC has patchset posted to use lot of HMM too. I have been
> > working with other vendor that have patchset being work on to use HMM
> > too.
> >
> > I have not done all those function just for the fun of it :) They do
> > have real use and user. It took a longtime to get nouveau because of
> > userspace we had a lot of catchup to do in mesa and llvm and we are
> > still very rough there.
> 
> Sure, this one is less of a concern if we can stick to tighter
> timelines between infrastructure and driver consumer merge.

The issue is that the consumer timeline can be hard to know; sometimes
the consumer goes through a few revisions (like ppc for instance) and
not because of the infrastructure but for other reasons. So reverting
the infrastructure just because the user had its timeline change is not
productive. A user missing one cycle means they would get delayed for 2
cycles, ie reupstreaming the infrastructure in the next cycle and
repushing the user the cycle after. This sounds like a total waste of
everyone's time, while keeping the infrastructure would allow the
timeline to slip by just one cycle.

The spirit of the rule is better than blind application of the rule.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-12 15:25                                   ` Jerome Glisse
@ 2019-03-12 16:06                                     ` Dan Williams
  2019-03-12 19:06                                       ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-12 16:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Mon, Mar 11, 2019 at 08:13:53PM -0700, Dan Williams wrote:
> > On Thu, Mar 7, 2019 at 10:56 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Thu, Mar 07, 2019 at 09:46:54AM -0800, Andrew Morton wrote:
> > > > On Tue, 5 Mar 2019 20:20:10 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > > >
> > > > > My hesitation would be drastically reduced if there was a plan to
> > > > > avoid dangling unconsumed symbols and functionality. Specifically one
> > > > > or more of the following suggestions:
> > > > >
> > > > > * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> > > > > surface for out-of-tree consumers to come grumble at us when we
> > > > > continue to refactor the kernel as we are wont to do.
> > > >
> > > > The existing patches use EXPORT_SYMBOL() so that's a sticking point.
> > > > Jerome, what would happen is we made these EXPORT_SYMBOL_GPL()?
> > >
> > > So Dan argue that GPL export solve the problem of out of tree user and
> > > my personnal experience is that it does not. The GPU sub-system has tons
> > > of GPL drivers that are not upstream and we never felt that we were bound
> > > to support them in anyway. We always were very clear that if you are not
> > > upstream that you do not have any voice on changes we do.
> > >
> > > So my exeperience is that GPL does not help here. It is just about being
> > > clear and ignoring anyone who does not have an upstream driver ie we have
> > > free hands to update HMM in anyway as long as we keep supporting the
> > > upstream user.
> > >
> > > That being said if the GPL aspect is that much important to some then
> > > fine let switch all HMM symbol to GPL.
> >
> > I should add that I would not be opposed to moving symbols to
> > non-GPL-only over time, but that should be based on our experience
> > with the stability and utility of the implementation. For brand new
> > symbols there's just no data to argue that we can / should keep the
> > interface stable, or that the interface exposes something fragile that
> > we'd rather not export at all. That experience gathering and thrash is
> > best constrained to upstream GPL-only drivers that are signing up to
> > participate in that maturation process.
> >
> > So I think it is important from a practical perspective and is a lower
> > risk way to run this HMM experiment of "merge infrastructure way in
> > advance of an upstream user".
> >
> > > > > * A commitment to consume newly exported symbols in the same merge
> > > > > window, or the following merge window. When that goal is missed revert
> > > > > the functionality until such time that it can be consumed, or
> > > > > otherwise abandoned.
> > > >
> > > > It sounds like we can tick this box.
> > >
> > > I wouldn't be too strick either, when adding something in release N
> > > the driver change in N+1 can miss N+1 because of bug or regression
> > > and be push to N+2.
> > >
> > > I think a better stance here is that if we do not get any sign-off
> > > on the feature from driver maintainer for which the feature is intended
> > > then we just do not merge.
> >
> > Agree, no driver maintainer sign-off then no merge.
> >
> > > If after few release we still can not get
> > > the driver to use it then we revert.
> >
> > As long as it is made clear to the driver maintainer that they have
> > one cycle to consume it then we can have a conversation if it is too
> > early to merge the infrastructure. If no one has time to consume the
> > feature, why rush dead code into the kernel? Also, waiting 2 cycles
> > means the infrastructure that was hard to review without a user is now
> > even harder to review because any review momentum has been lost by the
> > time the user show up, so we're better off keeping them close together
> > in time.
>
> Miss-understanding here, in first post the infrastructure and the driver
> bit get posted just like have been doing lately. So that you know that
> you have working user with the feature and what is left is pushing the
> driver bits throught the appropriate tree. So driver maintainer support
> is about knowing that they want the feature and have some confidence
> that it looks ready.
>
> It also means you can review the infrastructure along side user of it.

Sounds good.

> > > It just feels dumb to revert at N+1 just to get it back in N+2 as
> > > the driver bit get fix.
> >
> > No, I think it just means the infrastructure went in too early if a
> > driver can't consume it in a development cycle. Lets revisit if it
> > becomes a problem in practice.
>
> Well that's just dumb to have hard guideline like that. Many things
> can lead to missing deadline. For instance bug i am refering too might
> have nothing to do with the feature, it can be something related to
> integrating the feature an unforseen side effect. So i believe a better
> guideline is that driver maintainer rejecting the feature rather than
> just failure to meet one deadline.

The history of the Linux kernel disagrees with this statement. It's
only HMM that has recently ignored precedent and pushed to land
infrastructure in advance of consumers; a one-cycle constraint is
already generous in that light.

> > > > > * No new symbol exports and functionality while existing symbols go unconsumed.
> > > >
> > > > Unsure about this one?
> > >
> > > With nouveau upstream now everything is use. ODP will use some of the
> > > symbol too. PPC has patchset posted to use lot of HMM too. I have been
> > > working with other vendor that have patchset being work on to use HMM
> > > too.
> > >
> > > I have not done all those function just for the fun of it :) They do
> > > have real use and user. It took a longtime to get nouveau because of
> > > userspace we had a lot of catchup to do in mesa and llvm and we are
> > > still very rough there.
> >
> > Sure, this one is less of a concern if we can stick to tighter
> > timelines between infrastructure and driver consumer merge.
>
> Issue is that consumer timeline can be hard to know, sometimes
> the consumer go over few revision (like ppc for instance) and
> not because of the infrastructure but for other reasons. So
> reverting the infrastructure just because user had its timeline
> change is not productive. User missing one cycle means they would
> get delayed for 2 cycles ie reupstreaming the infrastructure in
> next cycle and repushing the user the cycle after. This sounds
> like a total wastage of everyone times. While keeping the infra-
> structure would allow the timeline to slip by just one cycle.
>
> Spirit of the rule is better than blind application of rule.

Again, I fail to see why HMM is suddenly unable to make forward
progress when the infrastructure that came before it was merged with
consumers in the same development cycle.

A gate to upstream merge is about the only lever a reviewer has to
push for change, and these requests to uncouple the consumer only
serve to weaken that review tool in my mind.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-12 16:06                                     ` Dan Williams
@ 2019-03-12 19:06                                       ` Jerome Glisse
  2019-03-12 19:30                                         ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-12 19:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote:
> On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Mon, Mar 11, 2019 at 08:13:53PM -0700, Dan Williams wrote:
> > > On Thu, Mar 7, 2019 at 10:56 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Thu, Mar 07, 2019 at 09:46:54AM -0800, Andrew Morton wrote:
> > > > > On Tue, 5 Mar 2019 20:20:10 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > > > >
> > > > > > My hesitation would be drastically reduced if there was a plan to
> > > > > > avoid dangling unconsumed symbols and functionality. Specifically one
> > > > > > or more of the following suggestions:
> > > > > >
> > > > > > * EXPORT_SYMBOL_GPL on all exports to avoid a growing liability
> > > > > > surface for out-of-tree consumers to come grumble at us when we
> > > > > > continue to refactor the kernel as we are wont to do.
> > > > >
> > > > > The existing patches use EXPORT_SYMBOL() so that's a sticking point.
> > > > > Jerome, what would happen is we made these EXPORT_SYMBOL_GPL()?
> > > >
> > > > So Dan argue that GPL export solve the problem of out of tree user and
> > > > my personnal experience is that it does not. The GPU sub-system has tons
> > > > of GPL drivers that are not upstream and we never felt that we were bound
> > > > to support them in anyway. We always were very clear that if you are not
> > > > upstream that you do not have any voice on changes we do.
> > > >
> > > > So my exeperience is that GPL does not help here. It is just about being
> > > > clear and ignoring anyone who does not have an upstream driver ie we have
> > > > free hands to update HMM in anyway as long as we keep supporting the
> > > > upstream user.
> > > >
> > > > That being said if the GPL aspect is that much important to some then
> > > > fine let switch all HMM symbol to GPL.
> > >
> > > I should add that I would not be opposed to moving symbols to
> > > non-GPL-only over time, but that should be based on our experience
> > > with the stability and utility of the implementation. For brand new
> > > symbols there's just no data to argue that we can / should keep the
> > > interface stable, or that the interface exposes something fragile that
> > > we'd rather not export at all. That experience gathering and thrash is
> > > best constrained to upstream GPL-only drivers that are signing up to
> > > participate in that maturation process.
> > >
> > > So I think it is important from a practical perspective and is a lower
> > > risk way to run this HMM experiment of "merge infrastructure way in
> > > advance of an upstream user".
> > >
> > > > > > * A commitment to consume newly exported symbols in the same merge
> > > > > > window, or the following merge window. When that goal is missed revert
> > > > > > the functionality until such time that it can be consumed, or
> > > > > > otherwise abandoned.
> > > > >
> > > > > It sounds like we can tick this box.
> > > >
> > > > I wouldn't be too strick either, when adding something in release N
> > > > the driver change in N+1 can miss N+1 because of bug or regression
> > > > and be push to N+2.
> > > >
> > > > I think a better stance here is that if we do not get any sign-off
> > > > on the feature from driver maintainer for which the feature is intended
> > > > then we just do not merge.
> > >
> > > Agree, no driver maintainer sign-off then no merge.
> > >
> > > > If after few release we still can not get
> > > > the driver to use it then we revert.
> > >
> > > As long as it is made clear to the driver maintainer that they have
> > > one cycle to consume it then we can have a conversation if it is too
> > > early to merge the infrastructure. If no one has time to consume the
> > > feature, why rush dead code into the kernel? Also, waiting 2 cycles
> > > means the infrastructure that was hard to review without a user is now
> > > even harder to review because any review momentum has been lost by the
> > > time the user show up, so we're better off keeping them close together
> > > in time.
> >
> > Miss-understanding here, in first post the infrastructure and the driver
> > bit get posted just like have been doing lately. So that you know that
> > you have working user with the feature and what is left is pushing the
> > driver bits throught the appropriate tree. So driver maintainer support
> > is about knowing that they want the feature and have some confidence
> > that it looks ready.
> >
> > It also means you can review the infrastructure along side user of it.
> 
> Sounds good.
> 
> > > > It just feels dumb to revert at N+1 just to get it back in N+2 as
> > > > the driver bit get fix.
> > >
> > > No, I think it just means the infrastructure went in too early if a
> > > driver can't consume it in a development cycle. Lets revisit if it
> > > becomes a problem in practice.
> >
> > Well that's just dumb to have hard guideline like that. Many things
> > can lead to missing deadline. For instance bug i am refering too might
> > have nothing to do with the feature, it can be something related to
> > integrating the feature an unforseen side effect. So i believe a better
> > guideline is that driver maintainer rejecting the feature rather than
> > just failure to meet one deadline.
> 
> The history of the Linux kernel disagrees with this statement. It's
> only HMM that has recently ignored precedent and pushed to land
> infrastructure in advance of consumers, a one cycle constraint is
> already generous in that light.
> 
> > > > > > * No new symbol exports and functionality while existing symbols go unconsumed.
> > > > >
> > > > > Unsure about this one?
> > > >
> > > > With nouveau upstream now everything is use. ODP will use some of the
> > > > symbol too. PPC has patchset posted to use lot of HMM too. I have been
> > > > working with other vendor that have patchset being work on to use HMM
> > > > too.
> > > >
> > > > I have not done all those function just for the fun of it :) They do
> > > > have real use and user. It took a longtime to get nouveau because of
> > > > userspace we had a lot of catchup to do in mesa and llvm and we are
> > > > still very rough there.
> > >
> > > Sure, this one is less of a concern if we can stick to tighter
> > > timelines between infrastructure and driver consumer merge.
> >
> > Issue is that consumer timeline can be hard to know, sometimes
> > the consumer go over few revision (like ppc for instance) and
> > not because of the infrastructure but for other reasons. So
> > reverting the infrastructure just because user had its timeline
> > change is not productive. User missing one cycle means they would
> > get delayed for 2 cycles ie reupstreaming the infrastructure in
> > next cycle and repushing the user the cycle after. This sounds
> > like a total wastage of everyone times. While keeping the infra-
> > structure would allow the timeline to slip by just one cycle.
> >
> > Spirit of the rule is better than blind application of rule.
> 
> Again, I fail to see why HMM is suddenly unable to make forward
> progress when the infrastructure that came before it was merged with
> consumers in the same development cycle.
> 
> A gate to upstream merge is about the only lever a reviewer has to
> push for change, and these requests to uncouple the consumer only
> serve to weaken that review tool in my mind.

Well, let's just agree to disagree, leave it at that, and stop
wasting each other's time.

Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-12 19:06                                       ` Jerome Glisse
@ 2019-03-12 19:30                                         ` Dan Williams
  2019-03-12 20:34                                           ` Dave Chinner
  2019-03-12 21:52                                           ` Andrew Morton
  0 siblings, 2 replies; 98+ messages in thread
From: Dan Williams @ 2019-03-12 19:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Mar 12, 2019 at 12:06 PM Jerome Glisse <jglisse@redhat.com> wrote:
> On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote:
> > On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse <jglisse@redhat.com> wrote:
[..]
> > > Spirit of the rule is better than blind application of rule.
> >
> > Again, I fail to see why HMM is suddenly unable to make forward
> > progress when the infrastructure that came before it was merged with
> > consumers in the same development cycle.
> >
> > A gate to upstream merge is about the only lever a reviewer has to
> > push for change, and these requests to uncouple the consumer only
> > serve to weaken that review tool in my mind.
>
> Well let just agree to disagree and leave it at that and stop
> wasting each other time

I'm fine to continue this discussion if you are. Please be specific
about where we disagree, and which aspects of the proposed rules about
merge staging are acceptable, painful-but-doable, or show-stoppers.
Do you agree that HMM is doing something novel with merge staging, or
am I off base there? I expect I can find folks who would balk at even
a one-cycle deferment of consumers, but can we start with that
concession and see how it goes? I'm missing where I've proposed
something that is untenable for the future of HMM, which is
addressing real gaps in the kernel's support for new hardware.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-12 19:30                                         ` Dan Williams
@ 2019-03-12 20:34                                           ` Dave Chinner
  2019-03-13  1:06                                             ` Dan Williams
  2019-03-12 21:52                                           ` Andrew Morton
  1 sibling, 1 reply; 98+ messages in thread
From: Dave Chinner @ 2019-03-12 20:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jerome Glisse, Andrew Morton, Linux MM,
	Linux Kernel Mailing List, Ralph Campbell, John Hubbard,
	linux-fsdevel

On Tue, Mar 12, 2019 at 12:30:52PM -0700, Dan Williams wrote:
> On Tue, Mar 12, 2019 at 12:06 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote:
> > > On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse <jglisse@redhat.com> wrote:
> [..]
> > > > Spirit of the rule is better than blind application of rule.
> > >
> > > Again, I fail to see why HMM is suddenly unable to make forward
> > > progress when the infrastructure that came before it was merged with
> > > consumers in the same development cycle.
> > >
> > > A gate to upstream merge is about the only lever a reviewer has to
> > > push for change, and these requests to uncouple the consumer only
> > > serve to weaken that review tool in my mind.
> >
> > Well let just agree to disagree and leave it at that and stop
> > wasting each other time
> 
> I'm fine to continue this discussion if you are. Please be specific
> about where we disagree and what aspect of the proposed rules about
> merge staging are either acceptable, painful-but-doable, or
> show-stoppers. Do you agree that HMM is doing something novel with
> merge staging, am I off base there? I expect I can find folks that
> would balk with even a one cycle deferment of consumers, but can we
> start with that concession and see how it goes? I'm missing where I've
> proposed something that is untenable for the future of HMM which is
> addressing some real needs in gaps in the kernel's support for new
> hardware.

/me quietly wonders why the hmm infrastructure can't be staged in a
maintainer-tree development branch on kernel.org and then
all merged in one go when that branch has both the infrastructure and
the drivers merged into it...

i.e. everyone doing hmm driver work gets the infrastructure from the
dev tree, not mainline. That's a pretty standard procedure for
developing complex features, and it avoids all the issues being
argued over right now...

Cheers,

Dave/
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-12 19:30                                         ` Dan Williams
  2019-03-12 20:34                                           ` Dave Chinner
@ 2019-03-12 21:52                                           ` Andrew Morton
  2019-03-13  0:10                                             ` Jerome Glisse
  1 sibling, 1 reply; 98+ messages in thread
From: Andrew Morton @ 2019-03-12 21:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jerome Glisse, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, 12 Mar 2019 12:30:52 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Mar 12, 2019 at 12:06 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote:
> > > On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse <jglisse@redhat.com> wrote:
> [..]
> > > > Spirit of the rule is better than blind application of rule.
> > >
> > > Again, I fail to see why HMM is suddenly unable to make forward
> > > progress when the infrastructure that came before it was merged with
> > > consumers in the same development cycle.
> > >
> > > A gate to upstream merge is about the only lever a reviewer has to
> > > push for change, and these requests to uncouple the consumer only
> > > serve to weaken that review tool in my mind.
> >
> > Well let just agree to disagree and leave it at that and stop
> > wasting each other time
> 
> I'm fine to continue this discussion if you are. Please be specific
> about where we disagree and what aspect of the proposed rules about
> merge staging are either acceptable, painful-but-doable, or
> show-stoppers. Do you agree that HMM is doing something novel with
> merge staging, am I off base there?

You're correct.  We chose to go this way because the HMM code is so
large and all-over-the-place that developing it in a standalone tree
seemed impractical - better to feed it into mainline piecewise.

This decision very much assumed that HMM users would definitely be
merged, and that it would happen soon.  I was skeptical for a long time
and was eventually persuaded by quite a few conversations with various
architecture and driver maintainers indicating that these HMM users
would be forthcoming.

In retrospect, the arrival of HMM clients took quite a lot longer than
was anticipated and I'm not sure that all of the anticipated usage
sites will actually be using it.  I wish I'd kept records of
who-said-what, but I didn't and the info is now all rather dissipated.

So the plan didn't really work out as hoped.  Lesson learned, I would
now very much prefer that new HMM feature work's changelogs include
links to the driver patchsets which will be using those features and
acks and review input from the developers of those driver patchsets.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-12 21:52                                           ` Andrew Morton
@ 2019-03-13  0:10                                             ` Jerome Glisse
  2019-03-13  0:46                                               ` Dan Williams
  2019-03-13 16:06                                               ` Andrew Morton
  0 siblings, 2 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-03-13  0:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Mar 12, 2019 at 02:52:14PM -0700, Andrew Morton wrote:
> On Tue, 12 Mar 2019 12:30:52 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > On Tue, Mar 12, 2019 at 12:06 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote:
> > > > On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > [..]
> > > > > Spirit of the rule is better than blind application of rule.
> > > >
> > > > Again, I fail to see why HMM is suddenly unable to make forward
> > > > progress when the infrastructure that came before it was merged with
> > > > consumers in the same development cycle.
> > > >
> > > > A gate to upstream merge is about the only lever a reviewer has to
> > > > push for change, and these requests to uncouple the consumer only
> > > > serve to weaken that review tool in my mind.
> > >
> > > Well let just agree to disagree and leave it at that and stop
> > > wasting each other time
> > 
> > I'm fine to continue this discussion if you are. Please be specific
> > about where we disagree and what aspect of the proposed rules about
> > merge staging are either acceptable, painful-but-doable, or
> > show-stoppers. Do you agree that HMM is doing something novel with
> > merge staging, am I off base there?
> 
> You're correct.  We chose to go this way because the HMM code is so
> large and all-over-the-place that developing it in a standalone tree
> seemed impractical - better to feed it into mainline piecewise.
> 
> This decision very much assumed that HMM users would definitely be
> merged, and that it would happen soon.  I was skeptical for a long time
> and was eventually persuaded by quite a few conversations with various
> architecture and driver maintainers indicating that these HMM users
> would be forthcoming.
> 
> In retrospect, the arrival of HMM clients took quite a lot longer than
> was anticipated and I'm not sure that all of the anticipated usage
> sites will actually be using it.  I wish I'd kept records of
> who-said-what, but I didn't and the info is now all rather dissipated.
> 
> So the plan didn't really work out as hoped.  Lesson learned, I would
> now very much prefer that new HMM feature work's changelogs include
> links to the driver patchsets which will be using those features and
> acks and review input from the developers of those driver patchsets.

This is what I am doing now, and this patchset falls into that. I did
post the ODP and nouveau bits that use the 2 new functions (dma map and
unmap). I expect to merge both the ODP and nouveau bits for that during
the next merge window.

Also, with 5.1 everything that is upstream is used by nouveau at least.
There are posted patches to use HMM for AMD, Intel, Radeon, ODP, PPC.
Some are going through several revisions, so I do not know exactly when
each will make it upstream, but I keep working on all of this.

So the guidelines we agree on:
    - no new infrastructure without a user
    - the device driver maintainer for which new infrastructure is done
      must either sign off, review, or explicitly say that they want
      the feature. I do not expect all driver maintainers will have
      the bandwidth to do a proper review of the mm part of the
      infrastructure, and it would not be fair to ask that of them.
      They can still provide feedback on the API exposed to the
      device driver.
    - driver bits must be posted at the same time as the new
      infrastructure, even if they target the next release cycle, to
      avoid inter-tree dependencies
    - driver bits must be merged as soon as possible

The thing we do not agree on:
    - If the driver bits miss the +1 target for any reason, directly
      revert the new infrastructure. I think it should not be black
      and white; the reasons why the driver bits missed the merge
      window should be taken into account. If the feature is still
      wanted and the driver bits missed the window for simple reasons,
      then it means that we push everything back by 2 releases, ie the
      revert is done in +1, then we re-upload the infrastructure in
      +2 and finally re-push the driver bits in +3, so we lose 1 cycle.
      Hence why I would rather the revert only happen if it is clear
      that the infrastructure is not ready or cannot be used in a
      timely fashion (over a couple of kernel releases) by any
      drivers.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-13  0:10                                             ` Jerome Glisse
@ 2019-03-13  0:46                                               ` Dan Williams
  2019-03-13  1:00                                                 ` Jerome Glisse
  2019-03-13 16:06                                               ` Andrew Morton
  1 sibling, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-13  0:46 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Mar 12, 2019 at 5:10 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Mar 12, 2019 at 02:52:14PM -0700, Andrew Morton wrote:
> > On Tue, 12 Mar 2019 12:30:52 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > > On Tue, Mar 12, 2019 at 12:06 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote:
> > > > > On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > [..]
> > > > > > Spirit of the rule is better than blind application of rule.
> > > > >
> > > > > Again, I fail to see why HMM is suddenly unable to make forward
> > > > > progress when the infrastructure that came before it was merged with
> > > > > consumers in the same development cycle.
> > > > >
> > > > > A gate to upstream merge is about the only lever a reviewer has to
> > > > > push for change, and these requests to uncouple the consumer only
> > > > > serve to weaken that review tool in my mind.
> > > >
> > > > Well let just agree to disagree and leave it at that and stop
> > > > wasting each other time
> > >
> > > I'm fine to continue this discussion if you are. Please be specific
> > > about where we disagree and what aspect of the proposed rules about
> > > merge staging are either acceptable, painful-but-doable, or
> > > show-stoppers. Do you agree that HMM is doing something novel with
> > > merge staging, am I off base there?
> >
> > You're correct.  We chose to go this way because the HMM code is so
> > large and all-over-the-place that developing it in a standalone tree
> > seemed impractical - better to feed it into mainline piecewise.
> >
> > This decision very much assumed that HMM users would definitely be
> > merged, and that it would happen soon.  I was skeptical for a long time
> > and was eventually persuaded by quite a few conversations with various
> > architecture and driver maintainers indicating that these HMM users
> > would be forthcoming.
> >
> > In retrospect, the arrival of HMM clients took quite a lot longer than
> > was anticipated and I'm not sure that all of the anticipated usage
> > sites will actually be using it.  I wish I'd kept records of
> > who-said-what, but I didn't and the info is now all rather dissipated.
> >
> > So the plan didn't really work out as hoped.  Lesson learned, I would
> > now very much prefer that new HMM feature work's changelogs include
> > links to the driver patchsets which will be using those features and
> > acks and review input from the developers of those driver patchsets.
>
> This is what i am doing now and this patchset falls into that. I did
> post the ODP and nouveau bits to use the 2 new functions (dma map and
> unmap). I expect to merge both ODP and nouveau bits for that during
> the next merge window.
>
> Also with 5.1 everything that is upstream is use by nouveau at least.
> They are posted patches to use HMM for AMD, Intel, Radeon, ODP, PPC.
> Some are going through several revisions so i do not know exactly when
> each will make it upstream but i keep working on all this.
>
> So the guideline we agree on:
>     - no new infrastructure without user
>     - device driver maintainer for which new infrastructure is done
>       must either sign off or review of explicitly say that they want
>       the feature I do not expect all driver maintainer will have
>       the bandwidth to do proper review of the mm part of the infra-
>       structure and it would not be fair to ask that from them. They
>       can still provide feedback on the API expose to the device
>       driver.
>     - driver bits must be posted at the same time as the new infra-
>       structure even if they target the next release cycle to avoid
>       inter-tree dependency
>     - driver bits must be merge as soon as possible

What about EXPORT_SYMBOL_GPL?

>
> Thing we do not agree on:
>     - If driver bits miss for any reason the +1 target directly
>       revert the new infra-structure. I think it should not be black
>       and white and the reasons why the driver bit missed the merge
>       window should be taken into account. If the feature is still
>       wanted and the driver bits missed the window for simple reasons
>       then it means that we push everything by 2 release ie the
>       revert is done in +1 then we reupload the infra-structure in
>       +2 and finaly repush the driver bit in +3 so we loose 1 cycle.

I think that pain is reasonable.

>       Hence why i would rather that the revert would only happen if
>       it is clear that the infrastructure is not ready or can not
>       be use in timely (over couple kernel release) fashion by any
>       drivers.

This seems too generous to me, but in the interest of moving this
discussion forward let's cross that bridge if/when it happens.
Hopefully the threat of this debate recurring means consumers put in
the due diligence to get things merged at infrastructure + 1 time.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-13  0:46                                               ` Dan Williams
@ 2019-03-13  1:00                                                 ` Jerome Glisse
  0 siblings, 0 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-03-13  1:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, Mar 12, 2019 at 05:46:51PM -0700, Dan Williams wrote:
> On Tue, Mar 12, 2019 at 5:10 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Mar 12, 2019 at 02:52:14PM -0700, Andrew Morton wrote:
> > > On Tue, 12 Mar 2019 12:30:52 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > > On Tue, Mar 12, 2019 at 12:06 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote:
> > > > > > On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > [..]
> > > > > > > Spirit of the rule is better than blind application of rule.
> > > > > >
> > > > > > Again, I fail to see why HMM is suddenly unable to make forward
> > > > > > progress when the infrastructure that came before it was merged with
> > > > > > consumers in the same development cycle.
> > > > > >
> > > > > > A gate to upstream merge is about the only lever a reviewer has to
> > > > > > push for change, and these requests to uncouple the consumer only
> > > > > > serve to weaken that review tool in my mind.
> > > > >
> > > > > Well let just agree to disagree and leave it at that and stop
> > > > > wasting each other time
> > > >
> > > > I'm fine to continue this discussion if you are. Please be specific
> > > > about where we disagree and what aspect of the proposed rules about
> > > > merge staging are either acceptable, painful-but-doable, or
> > > > show-stoppers. Do you agree that HMM is doing something novel with
> > > > merge staging, am I off base there?
> > >
> > > You're correct.  We chose to go this way because the HMM code is so
> > > large and all-over-the-place that developing it in a standalone tree
> > > seemed impractical - better to feed it into mainline piecewise.
> > >
> > > This decision very much assumed that HMM users would definitely be
> > > merged, and that it would happen soon.  I was skeptical for a long time
> > > and was eventually persuaded by quite a few conversations with various
> > > architecture and driver maintainers indicating that these HMM users
> > > would be forthcoming.
> > >
> > > In retrospect, the arrival of HMM clients took quite a lot longer than
> > > was anticipated and I'm not sure that all of the anticipated usage
> > > sites will actually be using it.  I wish I'd kept records of
> > > who-said-what, but I didn't and the info is now all rather dissipated.
> > >
> > > So the plan didn't really work out as hoped.  Lesson learned, I would
> > > now very much prefer that new HMM feature work's changelogs include
> > > links to the driver patchsets which will be using those features and
> > > acks and review input from the developers of those driver patchsets.
> >
> > This is what i am doing now and this patchset falls into that. I did
> > post the ODP and nouveau bits to use the 2 new functions (dma map and
> > unmap). I expect to merge both ODP and nouveau bits for that during
> > the next merge window.
> >
> > Also with 5.1 everything that is upstream is use by nouveau at least.
> > They are posted patches to use HMM for AMD, Intel, Radeon, ODP, PPC.
> > Some are going through several revisions so i do not know exactly when
> > each will make it upstream but i keep working on all this.
> >
> > So the guideline we agree on:
> >     - no new infrastructure without user
> >     - device driver maintainer for which new infrastructure is done
> >       must either sign off or review of explicitly say that they want
> >       the feature I do not expect all driver maintainer will have
> >       the bandwidth to do proper review of the mm part of the infra-
> >       structure and it would not be fair to ask that from them. They
> >       can still provide feedback on the API expose to the device
> >       driver.
> >     - driver bits must be posted at the same time as the new infra-
> >       structure even if they target the next release cycle to avoid
> >       inter-tree dependency
> >     - driver bits must be merge as soon as possible
> 
> What about EXPORT_SYMBOL_GPL?

I explained why I do not see value in changing the exports, but I will
not oppose that change either.
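
For reference, the change being discussed is mechanical: each HMM
export in mm/hmm.c would switch from EXPORT_SYMBOL() to
EXPORT_SYMBOL_GPL(). A minimal illustration (hmm_range_fault is used
here only as an example symbol):

/* current export, usable by any module */
EXPORT_SYMBOL(hmm_range_fault);

/* what is being asked for: restrict the export to GPL modules */
EXPORT_SYMBOL_GPL(hmm_range_fault);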


> > Thing we do not agree on:
> >     - If driver bits miss for any reason the +1 target directly
> >       revert the new infra-structure. I think it should not be black
> >       and white and the reasons why the driver bit missed the merge
> >       window should be taken into account. If the feature is still
> >       wanted and the driver bits missed the window for simple reasons
> >       then it means that we push everything by 2 release ie the
> >       revert is done in +1 then we reupload the infra-structure in
> >       +2 and finaly repush the driver bit in +3 so we loose 1 cycle.
> 
> I think that pain is reasonable.
> 
> >       Hence why i would rather that the revert would only happen if
> >       it is clear that the infrastructure is not ready or can not
> >       be use in timely (over couple kernel release) fashion by any
> >       drivers.
> 
> This seems too generous to me, but in the interest of moving this
> discussion forward let's cross that bridge if/when it happens.
> Hopefully the threat of this debate recurring means consumers put in
> the due diligence to get things merged at infrastructure + 1 time.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-12 20:34                                           ` Dave Chinner
@ 2019-03-13  1:06                                             ` Dan Williams
  0 siblings, 0 replies; 98+ messages in thread
From: Dan Williams @ 2019-03-13  1:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jerome Glisse, Andrew Morton, Linux MM,
	Linux Kernel Mailing List, Ralph Campbell, John Hubbard,
	linux-fsdevel

On Tue, Mar 12, 2019 at 1:34 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Mar 12, 2019 at 12:30:52PM -0700, Dan Williams wrote:
> > On Tue, Mar 12, 2019 at 12:06 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > On Tue, Mar 12, 2019 at 09:06:12AM -0700, Dan Williams wrote:
> > > > On Tue, Mar 12, 2019 at 8:26 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > [..]
> > > > > Spirit of the rule is better than blind application of rule.
> > > >
> > > > Again, I fail to see why HMM is suddenly unable to make forward
> > > > progress when the infrastructure that came before it was merged with
> > > > consumers in the same development cycle.
> > > >
> > > > A gate to upstream merge is about the only lever a reviewer has to
> > > > push for change, and these requests to uncouple the consumer only
> > > > serve to weaken that review tool in my mind.
> > >
> > > Well let just agree to disagree and leave it at that and stop
> > > wasting each other time
> >
> > I'm fine to continue this discussion if you are. Please be specific
> > about where we disagree and what aspect of the proposed rules about
> > merge staging are either acceptable, painful-but-doable, or
> > show-stoppers. Do you agree that HMM is doing something novel with
> > merge staging, am I off base there? I expect I can find folks that
> > would balk with even a one cycle deferment of consumers, but can we
> > start with that concession and see how it goes? I'm missing where I've
> > proposed something that is untenable for the future of HMM which is
> > addressing some real needs in gaps in the kernel's support for new
> > hardware.
>
> /me quietly wonders why the hmm infrastructure can't be staged in a
> maintainer tree development branch on a kernel.org and then
> all merged in one go when that branch has both infrastructure and
> drivers merged into it...
>
> i.e. everyone doing hmm driver work gets the infrastructure from the
> dev tree, not mainline. That's a pretty standard procedure for
> developing complex features, and it avoids all the issues being
> argued over right now...

True, but I wasn't considering that because the mm tree does not do
stable topic branches. This kind of staging does not seem amenable to
a quilt workflow, and it needs to keep pace with the rest of mm.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
                   ` (11 preceding siblings ...)
  2019-02-22 23:31 ` Ralph Campbell
@ 2019-03-13  1:27 ` Jerome Glisse
  2019-03-13 16:10   ` Andrew Morton
  12 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-13  1:27 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Felix Kuehling, Christian König,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Dan Williams

Andrew, you will not be pushing this patchset in 5.1?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-13  0:10                                             ` Jerome Glisse
  2019-03-13  0:46                                               ` Dan Williams
@ 2019-03-13 16:06                                               ` Andrew Morton
  2019-03-13 18:39                                                 ` Jerome Glisse
  1 sibling, 1 reply; 98+ messages in thread
From: Andrew Morton @ 2019-03-13 16:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Tue, 12 Mar 2019 20:10:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

> > You're correct.  We chose to go this way because the HMM code is so
> > large and all-over-the-place that developing it in a standalone tree
> > seemed impractical - better to feed it into mainline piecewise.
> > 
> > This decision very much assumed that HMM users would definitely be
> > merged, and that it would happen soon.  I was skeptical for a long time
> > and was eventually persuaded by quite a few conversations with various
> > architecture and driver maintainers indicating that these HMM users
> > would be forthcoming.
> > 
> > In retrospect, the arrival of HMM clients took quite a lot longer than
> > was anticipated and I'm not sure that all of the anticipated usage
> > sites will actually be using it.  I wish I'd kept records of
> > who-said-what, but I didn't and the info is now all rather dissipated.
> > 
> > So the plan didn't really work out as hoped.  Lesson learned, I would
> > now very much prefer that new HMM feature work's changelogs include
> > links to the driver patchsets which will be using those features and
> > acks and review input from the developers of those driver patchsets.
> 
> This is what i am doing now and this patchset falls into that. I did
> post the ODP and nouveau bits to use the 2 new functions (dma map and
> unmap). I expect to merge both ODP and nouveau bits for that during
> the next merge window.
> 
> Also with 5.1 everything that is upstream is use by nouveau at least.
> They are posted patches to use HMM for AMD, Intel, Radeon, ODP, PPC.
> Some are going through several revisions so i do not know exactly when
> each will make it upstream but i keep working on all this.
> 
> So the guideline we agree on:
>     - no new infrastructure without user
>     - device driver maintainer for which new infrastructure is done
>       must either sign off or review of explicitly say that they want
>       the feature I do not expect all driver maintainer will have
>       the bandwidth to do proper review of the mm part of the infra-
>       structure and it would not be fair to ask that from them. They
>       can still provide feedback on the API expose to the device
>       driver.

The patchset in -mm ("HMM updates for 5.1") has review from Ralph
Campbell @ nvidia.  Are there any other maintainers who we should have
feedback from?

>     - driver bits must be posted at the same time as the new infra-
>       structure even if they target the next release cycle to avoid
>       inter-tree dependency
>     - driver bits must be merge as soon as possible

Are there links to driver patchsets which we can add to the changelogs?

> Thing we do not agree on:
>     - If driver bits miss for any reason the +1 target directly
>       revert the new infra-structure. I think it should not be black
>       and white and the reasons why the driver bit missed the merge
>       window should be taken into account. If the feature is still
>       wanted and the driver bits missed the window for simple reasons
>       then it means that we push everything by 2 release ie the
>       revert is done in +1 then we reupload the infra-structure in
>       +2 and finaly repush the driver bit in +3 so we loose 1 cycle.
>       Hence why i would rather that the revert would only happen if
>       it is clear that the infrastructure is not ready or can not
>       be use in timely (over couple kernel release) fashion by any
>       drivers.

I agree that this should be more a philosophy than a set of hard rules.


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-13  1:27 ` Jerome Glisse
@ 2019-03-13 16:10   ` Andrew Morton
  2019-03-13 18:01     ` Jason Gunthorpe
                       ` (3 more replies)
  0 siblings, 4 replies; 98+ messages in thread
From: Andrew Morton @ 2019-03-13 16:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Felix Kuehling, Christian König,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Dan Williams

On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

> Andrew you will not be pushing this patchset in 5.1 ?

I'd like to.  It sounds like we're converging on a plan.

It would be good to hear more from the driver developers who will be
consuming these new features - links to patchsets, review feedback,
etc.  Which individuals should we be asking?  Felix, Christian and
Jason, perhaps?


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-13 16:10   ` Andrew Morton
@ 2019-03-13 18:01     ` Jason Gunthorpe
  2019-03-13 18:33     ` Jerome Glisse
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 98+ messages in thread
From: Jason Gunthorpe @ 2019-03-13 18:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jerome Glisse, linux-mm, linux-kernel, Felix Kuehling,
	Christian König, Ralph Campbell, John Hubbard, Dan Williams

On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > Andrew you will not be pushing this patchset in 5.1 ?
> 
> I'd like to.  It sounds like we're converging on a plan.
> 
> It would be good to hear more from the driver developers who will be
> consuming these new features - links to patchsets, review feedback,
> etc.  Which individuals should we be asking?  Felix, Christian and
> Jason, perhaps?

At least the Mellanox driver patch looks like a good improvement:

https://patchwork.kernel.org/patch/10786625/
 5 files changed, 202 insertions(+), 452 deletions(-)

In fact it hollows out the 'umem_odp' driver abstraction we already
had in the RDMA core.

So, I fully expect to see this API used in mlx5 and RDMA-core after it
is merged.

We've done some testing now, and there are still some outstanding
questions on the driver parts, but I haven't seen anything
fundamentally wrong with HMM mirror come up.

Jason

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-13 16:10   ` Andrew Morton
  2019-03-13 18:01     ` Jason Gunthorpe
@ 2019-03-13 18:33     ` Jerome Glisse
  2019-03-18 17:00     ` Kuehling, Felix
  2019-03-18 17:04     ` Jerome Glisse
  3 siblings, 0 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-03-13 18:33 UTC (permalink / raw)
  To: Andrew Morton, Ben Skeggs
  Cc: linux-mm, linux-kernel, Felix Kuehling, Christian König,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Dan Williams

On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > Andrew you will not be pushing this patchset in 5.1 ?
> 
> I'd like to.  It sounds like we're converging on a plan.
> 
> It would be good to hear more from the driver developers who will be
> consuming these new features - links to patchsets, review feedback,
> etc.  Which individuals should we be asking?  Felix, Christian and
> Jason, perhaps?
> 

Adding Ben as nouveau maintainer. Note that this patchset only adds
2 new functions; the rest is just refactoring to allow RDMA ODP. The
2 new functions will both be used by ODP and nouveau. Ben, this is
the dma map function we discussed previously. If they get into 5.1 then
I will push their user in nouveau in 5.2 at least. I will soon repost
the ODP v2 patchset on top of the RDMA tree, so if this does not get
into 5.1 then ODP will have to be pushed to 5.3, or to when this gets
upstream.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
  2019-03-13 16:06                                               ` Andrew Morton
@ 2019-03-13 18:39                                                 ` Jerome Glisse
  0 siblings, 0 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-03-13 18:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, Linux MM, Linux Kernel Mailing List,
	Ralph Campbell, John Hubbard, linux-fsdevel

On Wed, Mar 13, 2019 at 09:06:04AM -0700, Andrew Morton wrote:
> On Tue, 12 Mar 2019 20:10:19 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > > You're correct.  We chose to go this way because the HMM code is so
> > > large and all-over-the-place that developing it in a standalone tree
> > > seemed impractical - better to feed it into mainline piecewise.
> > > 
> > > This decision very much assumed that HMM users would definitely be
> > > merged, and that it would happen soon.  I was skeptical for a long time
> > > and was eventually persuaded by quite a few conversations with various
> > > architecture and driver maintainers indicating that these HMM users
> > > would be forthcoming.
> > > 
> > > In retrospect, the arrival of HMM clients took quite a lot longer than
> > > was anticipated and I'm not sure that all of the anticipated usage
> > > sites will actually be using it.  I wish I'd kept records of
> > > who-said-what, but I didn't and the info is now all rather dissipated.
> > > 
> > > So the plan didn't really work out as hoped.  Lesson learned, I would
> > > now very much prefer that new HMM feature work's changelogs include
> > > links to the driver patchsets which will be using those features and
> > > acks and review input from the developers of those driver patchsets.
> > 
> > This is what i am doing now and this patchset falls into that. I did
> > post the ODP and nouveau bits to use the 2 new functions (dma map and
> > unmap). I expect to merge both ODP and nouveau bits for that during
> > the next merge window.
> > 
> > Also with 5.1 everything that is upstream is use by nouveau at least.
> > They are posted patches to use HMM for AMD, Intel, Radeon, ODP, PPC.
> > Some are going through several revisions so i do not know exactly when
> > each will make it upstream but i keep working on all this.
> > 
> > So the guideline we agree on:
> >     - no new infrastructure without user
> >     - device driver maintainer for which new infrastructure is done
> >       must either sign off or review of explicitly say that they want
> >       the feature I do not expect all driver maintainer will have
> >       the bandwidth to do proper review of the mm part of the infra-
> >       structure and it would not be fair to ask that from them. They
> >       can still provide feedback on the API expose to the device
> >       driver.
> 
> The patchset in -mm ("HMM updates for 5.1") has review from Ralph
> Campbell @ nvidia.  Are there any other maintainers who we should have
> feedback from?

John Hubbard also gave his review on a couple of them, iirc.

> 
> >     - driver bits must be posted at the same time as the new infra-
> >       structure even if they target the next release cycle to avoid
> >       inter-tree dependency
> >     - driver bits must be merge as soon as possible
> 
> Are there links to driver patchsets which we can add to the changelogs?
> 

The issue with that is that I often post the infrastructure bits first
and then the driver bits, so I have a circular email dependency :) I
can always post the driver bits first and then add links to them. Or I
can reply after posting so that I can cross-link both.

Or I can post the driver bits on mm the first time around and mark them
with "not for Andrew" or any tag that makes it clear that those patches
will be merged through the appropriate driver tree.

In any case, for this patchset there is:

https://patchwork.kernel.org/patch/10786625/

Also this patchset refactors some of the hmm internals for a better
API, so it is getting used by nouveau too, which is already upstream.


> > Thing we do not agree on:
> >     - If driver bits miss for any reason the +1 target directly
> >       revert the new infra-structure. I think it should not be black
> >       and white and the reasons why the driver bit missed the merge
> >       window should be taken into account. If the feature is still
> >       wanted and the driver bits missed the window for simple reasons
> >       then it means that we push everything by 2 release ie the
> >       revert is done in +1 then we reupload the infra-structure in
> >       +2 and finaly repush the driver bit in +3 so we loose 1 cycle.
> >       Hence why i would rather that the revert would only happen if
> >       it is clear that the infrastructure is not ready or can not
> >       be use in timely (over couple kernel release) fashion by any
> >       drivers.
> 
> I agree that this should be more a philosophy than a set of hard rules.
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-13 16:10   ` Andrew Morton
  2019-03-13 18:01     ` Jason Gunthorpe
  2019-03-13 18:33     ` Jerome Glisse
@ 2019-03-18 17:00     ` Kuehling, Felix
  2019-03-18 17:04     ` Jerome Glisse
  3 siblings, 0 replies; 98+ messages in thread
From: Kuehling, Felix @ 2019-03-18 17:00 UTC (permalink / raw)
  To: Andrew Morton, Jerome Glisse
  Cc: linux-mm, linux-kernel, Koenig, Christian, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Dan Williams, Yang, Philip

For amdgpu I looked over the changes and they look reasonable to me. 
Philip Yang (CCed) already rebased amdgpu on top of Jerome's patches and 
is looking forward to using the new helpers and simplifying our driver code.

Feel free to add my Acked-by to the patches.

Regards,
   Felix

On 3/13/2019 12:10 PM, Andrew Morton wrote:
> On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
>
>> Andrew you will not be pushing this patchset in 5.1 ?
> I'd like to.  It sounds like we're converging on a plan.
>
> It would be good to hear more from the driver developers who will be
> consuming these new features - links to patchsets, review feedback,
> etc.  Which individuals should we be asking?  Felix, Christian and
> Jason, perhaps?
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-13 16:10   ` Andrew Morton
                       ` (2 preceding siblings ...)
  2019-03-18 17:00     ` Kuehling, Felix
@ 2019-03-18 17:04     ` Jerome Glisse
  2019-03-18 18:30       ` Dan Williams
  2019-03-19 16:40       ` Andrew Morton
  3 siblings, 2 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-03-18 17:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Felix Kuehling, Christian König,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Dan Williams

On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > Andrew you will not be pushing this patchset in 5.1 ?
> 
> I'd like to.  It sounds like we're converging on a plan.
> 
> It would be good to hear more from the driver developers who will be
> consuming these new features - links to patchsets, review feedback,
> etc.  Which individuals should we be asking?  Felix, Christian and
> Jason, perhaps?
> 

So I am guessing you will not send this to Linus? Should I repost?
This patchset has 2 sides. The first side is just reworking the HMM API
to make it behave better with respect to process lifetime. AMD folks
did find that helpful [1]. This rework is also necessary to ease
the conversion of ODP to HMM [2], and Jason already said that he is
interested in seeing that happen [3]. By missing 5.1 it now means
that I cannot push ODP to HMM in 5.2; it will be postponed to 5.3,
which also postpones other work ...

The second side is it adds 2 new helper dma map and dma unmap both
are gonna be use by ODP and latter by nouveau (after some other
nouveau changes are done). This new functions just do dma_map ie:
    hmm_dma_map() {
        existing_hmm_api()
        for_each_page() {
            dma_map_page()
        }
    }

Do you want to see anymore justification than that ?
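
(For illustration, a rough sketch of what a driver call site would look
like with this helper. The mydriver_map_range() name and the allocation
strategy are made up; only the hmm_range_dma_map() call follows the
signature added in patch 7 of this series.)

    #include <linux/hmm.h>
    #include <linux/mm.h>
    #include <linux/device.h>

    static long mydriver_map_range(struct device *dev,
                                   struct hmm_range *range,
                                   dma_addr_t **out_daddrs)
    {
        unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
        dma_addr_t *daddrs;
        long ret;

        daddrs = kvcalloc(npages, sizeof(*daddrs), GFP_KERNEL);
        if (!daddrs)
            return -ENOMEM;

        /* Fault missing pages and dma map the whole range in one call. */
        ret = hmm_range_dma_map(range, dev, daddrs, true);
        if (ret < 0) {
            /* -EAGAIN means the mmap_sem was dropped; caller retries. */
            kvfree(daddrs);
            return ret;
        }

        *out_daddrs = daddrs;
        return ret; /* number of pages mapped */
    }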

[1] https://www.spinics.net/lists/amd-gfx/msg31048.html
[2] https://patchwork.kernel.org/patch/10786625/
[3] https://lkml.org/lkml/2019/3/13/591

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-18 17:04     ` Jerome Glisse
@ 2019-03-18 18:30       ` Dan Williams
  2019-03-18 18:54         ` Jerome Glisse
  2019-03-19 16:40       ` Andrew Morton
  1 sibling, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-18 18:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Mon, Mar 18, 2019 at 10:04 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> > On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > > Andrew you will not be pushing this patchset in 5.1 ?
> >
> > I'd like to.  It sounds like we're converging on a plan.
> >
> > It would be good to hear more from the driver developers who will be
> > consuming these new features - links to patchsets, review feedback,
> > etc.  Which individuals should we be asking?  Felix, Christian and
> > Jason, perhaps?
> >
>
> So i am guessing you will not send this to Linus ? Should i repost ?
> This patchset has 2 sides, first side is just reworking the HMM API
> to make something better in respect to process lifetime. AMD folks
> did find that helpful [1]. This rework is also necessary to ease up
> the convertion of ODP to HMM [2] and Jason already said that he is
> interested in seing that happening [3]. By missing 5.1 it means now
> that i can not push ODP to HMM in 5.2 and it will be postpone to 5.3
> which is also postoning other work ...
>
> The second side is it adds 2 new helper dma map and dma unmap both
> are gonna be use by ODP and latter by nouveau (after some other
> nouveau changes are done). This new functions just do dma_map ie:
>     hmm_dma_map() {
>         existing_hmm_api()
>         for_each_page() {
>             dma_map_page()
>         }
>     }
>
> Do you want to see anymore justification than that ?

Yes, why does hmm need its own dma mapping apis? It seems to
perpetuate the perception that hmm is something bolted onto the side
of the core-mm rather than a native capability.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-18 18:30       ` Dan Williams
@ 2019-03-18 18:54         ` Jerome Glisse
  2019-03-18 19:18           ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-18 18:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Mon, Mar 18, 2019 at 11:30:15AM -0700, Dan Williams wrote:
> On Mon, Mar 18, 2019 at 10:04 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> > > On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > > Andrew you will not be pushing this patchset in 5.1 ?
> > >
> > > I'd like to.  It sounds like we're converging on a plan.
> > >
> > > It would be good to hear more from the driver developers who will be
> > > consuming these new features - links to patchsets, review feedback,
> > > etc.  Which individuals should we be asking?  Felix, Christian and
> > > Jason, perhaps?
> > >
> >
> > So i am guessing you will not send this to Linus ? Should i repost ?
> > This patchset has 2 sides, first side is just reworking the HMM API
> > to make something better in respect to process lifetime. AMD folks
> > did find that helpful [1]. This rework is also necessary to ease up
> > the convertion of ODP to HMM [2] and Jason already said that he is
> > interested in seing that happening [3]. By missing 5.1 it means now
> > that i can not push ODP to HMM in 5.2 and it will be postpone to 5.3
> > which is also postoning other work ...
> >
> > The second side is it adds 2 new helper dma map and dma unmap both
> > are gonna be use by ODP and latter by nouveau (after some other
> > nouveau changes are done). This new functions just do dma_map ie:
> >     hmm_dma_map() {
> >         existing_hmm_api()
> >         for_each_page() {
> >             dma_map_page()
> >         }
> >     }
> >
> > Do you want to see anymore justification than that ?
> 
> Yes, why does hmm needs its own dma mapping apis? It seems to
> perpetuate the perception that hmm is something bolted onto the side
> of the core-mm rather than a native capability.

Seriously ?

The kernel is full of examples where a common code pattern that is not
device specific is turned into a helper, and here this is exactly what
it is: a common pattern that every device driver would otherwise
open code, turned into a common helper.

Moreover this allows sharing the same error handling across drivers
when mapping one page fails. So this avoids the need to duplicate the
same boiler plate code across different drivers.

Is code factorization not a good thing ? Should I duplicate everything
in every single driver ?


If that's not enough, this will also allow handling peer to peer; I
posted patches for that [1], and again this is to avoid duplicating
common code across different drivers.


It does feel that you oppose everything with HMM in its name just
because you do not like it. It is your prerogative not to like
something, but you should propose something that achieves the same
result instead of constantly questioning every single comma.

Cheers,
Jérôme

[1] https://lwn.net/ml/linux-kernel/20190129174728.6430-1-jglisse@redhat.com/

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-18 18:54         ` Jerome Glisse
@ 2019-03-18 19:18           ` Dan Williams
  2019-03-18 19:28             ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-18 19:18 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Mon, Mar 18, 2019 at 11:55 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Mon, Mar 18, 2019 at 11:30:15AM -0700, Dan Williams wrote:
> > On Mon, Mar 18, 2019 at 10:04 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> > > > On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > > Andrew you will not be pushing this patchset in 5.1 ?
> > > >
> > > > I'd like to.  It sounds like we're converging on a plan.
> > > >
> > > > It would be good to hear more from the driver developers who will be
> > > > consuming these new features - links to patchsets, review feedback,
> > > > etc.  Which individuals should we be asking?  Felix, Christian and
> > > > Jason, perhaps?
> > > >
> > >
> > > So i am guessing you will not send this to Linus ? Should i repost ?
> > > This patchset has 2 sides, first side is just reworking the HMM API
> > > to make something better in respect to process lifetime. AMD folks
> > > did find that helpful [1]. This rework is also necessary to ease up
> > > the convertion of ODP to HMM [2] and Jason already said that he is
> > > interested in seing that happening [3]. By missing 5.1 it means now
> > > that i can not push ODP to HMM in 5.2 and it will be postpone to 5.3
> > > which is also postoning other work ...
> > >
> > > The second side is it adds 2 new helper dma map and dma unmap both
> > > are gonna be use by ODP and latter by nouveau (after some other
> > > nouveau changes are done). This new functions just do dma_map ie:
> > >     hmm_dma_map() {
> > >         existing_hmm_api()
> > >         for_each_page() {
> > >             dma_map_page()
> > >         }
> > >     }
> > >
> > > Do you want to see anymore justification than that ?
> >
> > Yes, why does hmm needs its own dma mapping apis? It seems to
> > perpetuate the perception that hmm is something bolted onto the side
> > of the core-mm rather than a native capability.
>
> Seriously ?

Yes.

> Kernel is fill with example where common code pattern that are not
> device specific are turn into helpers and here this is exactly what
> it is. A common pattern that all device driver will do which is turn
> into a common helper.

Yes, but we also try not to introduce thin wrappers around existing
apis. If the current dma api does not understand some hmm constraint,
I'm questioning why we would not teach the dma api that constraint and
make it a native capability, rather than asking the driver developer
to understand the rules about when to use dma_map_page() vs
hmm_dma_map().

For example I don't think we want to end up with more headers like
include/linux/pci-dma-compat.h.

> Moreover this allow to share the same error code handling accross
> driver when mapping one page fails. So this avoid the needs to
> duplicate same boiler plate code accross different drivers.
>
> Is code factorization not a good thing ? Should i duplicate every-
> thing in every single driver ?

I did not ask for duplication, I asked why it is not more deeply integrated.

> If that's not enough, this will also allow to handle peer to peer
> and i posted patches for that [1] and again this is to avoid
> duplicating common code accross different drivers.

I went looking for the hmm_dma_map() patches on the list but could not
find them, so I was reacting to the "This new functions just do
dma_map", and wondered if that was the full extent of the
justification.

> It does feel that you oppose everything with HMM in its name just
> because you do not like it. It is your prerogative to not like some-
> thing but you should propose something that achieve the same result
> instead of constantly questioning every single comma.

I respect what you're trying to do, if I didn't I wouldn't bother
responding. Please don't put words in my mouth. I think it was
Churchill who said "if two people agree all the time, one of them is
redundant". You're raising questions with HMM that identify real gaps
in Linux memory management relative to new hardware capabilities, I
also think it is reasonable to question how the gaps are filled.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-18 19:18           ` Dan Williams
@ 2019-03-18 19:28             ` Jerome Glisse
  2019-03-18 19:36               ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-18 19:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Mon, Mar 18, 2019 at 12:18:38PM -0700, Dan Williams wrote:
> On Mon, Mar 18, 2019 at 11:55 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Mon, Mar 18, 2019 at 11:30:15AM -0700, Dan Williams wrote:
> > > On Mon, Mar 18, 2019 at 10:04 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> > > > > On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > >
> > > > > > Andrew you will not be pushing this patchset in 5.1 ?
> > > > >
> > > > > I'd like to.  It sounds like we're converging on a plan.
> > > > >
> > > > > It would be good to hear more from the driver developers who will be
> > > > > consuming these new features - links to patchsets, review feedback,
> > > > > etc.  Which individuals should we be asking?  Felix, Christian and
> > > > > Jason, perhaps?
> > > > >
> > > >
> > > > So i am guessing you will not send this to Linus ? Should i repost ?
> > > > This patchset has 2 sides, first side is just reworking the HMM API
> > > > to make something better in respect to process lifetime. AMD folks
> > > > did find that helpful [1]. This rework is also necessary to ease up
> > > > the convertion of ODP to HMM [2] and Jason already said that he is
> > > > interested in seing that happening [3]. By missing 5.1 it means now
> > > > that i can not push ODP to HMM in 5.2 and it will be postpone to 5.3
> > > > which is also postoning other work ...
> > > >
> > > > The second side is it adds 2 new helper dma map and dma unmap both
> > > > are gonna be use by ODP and latter by nouveau (after some other
> > > > nouveau changes are done). This new functions just do dma_map ie:
> > > >     hmm_dma_map() {
> > > >         existing_hmm_api()
> > > >         for_each_page() {
> > > >             dma_map_page()
> > > >         }
> > > >     }
> > > >
> > > > Do you want to see anymore justification than that ?
> > >
> > > Yes, why does hmm needs its own dma mapping apis? It seems to
> > > perpetuate the perception that hmm is something bolted onto the side
> > > of the core-mm rather than a native capability.
> >
> > Seriously ?
> 
> Yes.
> 
> > Kernel is fill with example where common code pattern that are not
> > device specific are turn into helpers and here this is exactly what
> > it is. A common pattern that all device driver will do which is turn
> > into a common helper.
> 
> Yes, but we also try not to introduce thin wrappers around existing
> apis. If the current dma api does not understand some hmm constraint
> I'm questioning why not teach the dma api that constraint and make it
> a native capability rather than asking the driver developer to
> understand the rules about when to use dma_map_page() vs
> hmm_dma_map().

There is nothing special here: existing_hmm_api() returns an array of
pages and the new helper just calls dma_map_page() for each entry in
that array. If anything fails it undoes everything, so the error
handling is shared.

So I am not playing tricks with the DMA API, I am just providing a
helper for a common pattern. Maybe the name confuses you but the
pseudocode should be self explanatory:
    Before
        mydriver_mirror_range() {
            err = existing_hmm_mirror_api(pages)
            if (err) {...}
            for_each_page(pages) {
                pas[i] = dma_map_page()
                if (dma_error(pas[i])) { ... }
            }
            // use pas[]
        }

    After
        mydriver_mirror_range() {
            err = hmm_range_dma_map(pas)
            if (err) { ... }
            // use pas[]
        }

So there is no rule about using one or the other. In the end it is the
same code, but instead of being duplicated in multiple drivers it is
shared.
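
(Spelled out a little more, the open coded "Before" pattern each driver
would otherwise carry looks roughly like the sketch below. The
mydriver_* names and the pas[] array are illustrative; hmm_range_fault()
and hmm_pfn_to_page() are the APIs from this series, and the direction
is hard coded to keep the sketch short, where the real helper picks it
from the write permission.)

    /* Sketch of the open coded fault + dma map pattern being factored out. */
    static long mydriver_mirror_range(struct device *dev,
                                      struct hmm_range *range,
                                      dma_addr_t *pas)
    {
        unsigned long i, npages = (range->end - range->start) >> PAGE_SHIFT;
        long ret;

        ret = hmm_range_fault(range, true);
        if (ret <= 0)
            return ret ? ret : -EBUSY;

        for (i = 0; i < npages; ++i) {
            struct page *page = hmm_pfn_to_page(range, range->pfns[i]);

            if (!page)
                continue;

            pas[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
                                  DMA_BIDIRECTIONAL);
            if (dma_mapping_error(dev, pas[i]))
                goto unmap; /* this unwind is what every driver duplicates */
        }
        return npages;

    unmap:
        while (i--) {
            if (hmm_pfn_to_page(range, range->pfns[i]))
                dma_unmap_page(dev, pas[i], PAGE_SIZE, DMA_BIDIRECTIONAL);
        }
        return -EFAULT;
    }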

> 
> For example I don't think we want to end up with more headers like
> include/linux/pci-dma-compat.h.
> 
> > Moreover this allow to share the same error code handling accross
> > driver when mapping one page fails. So this avoid the needs to
> > duplicate same boiler plate code accross different drivers.
> >
> > Is code factorization not a good thing ? Should i duplicate every-
> > thing in every single driver ?
> 
> I did not ask for duplication, I asked why is it not more deeply integrated.

Because it is a common code pattern for HMM users, not for DMA users.

> > If that's not enough, this will also allow to handle peer to peer
> > and i posted patches for that [1] and again this is to avoid
> > duplicating common code accross different drivers.
> 
> I went looking for the hmm_dma_map() patches on the list but could not
> find them, so I was reacting to the "This new functions just do
> dma_map", and wondered if that was the full extent of the
> justification.

They are here [1], patch 7 in this patch series.

Cheers,
Jérôme

[1] https://lkml.org/lkml/2019/1/29/1016

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-18 19:28             ` Jerome Glisse
@ 2019-03-18 19:36               ` Dan Williams
  0 siblings, 0 replies; 98+ messages in thread
From: Dan Williams @ 2019-03-18 19:36 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Mon, Mar 18, 2019 at 12:29 PM Jerome Glisse <jglisse@redhat.com> wrote:
[..]
> > I went looking for the hmm_dma_map() patches on the list but could not
> > find them, so I was reacting to the "This new functions just do
> > dma_map", and wondered if that was the full extent of the
> > justification.
>
> They are here [1] patch 7 in this patch serie

Ah, I was missing the _range in my search. I'll take my comments over
to that patch. Thanks.

> [1] https://lkml.org/lkml/2019/1/29/1016

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-01-29 16:54 ` [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device jglisse
@ 2019-03-18 20:21   ` Dan Williams
  2019-03-18 20:41     ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-18 20:21 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard

On Tue, Jan 29, 2019 at 8:55 AM <jglisse@redhat.com> wrote:
>
> From: Jérôme Glisse <jglisse@redhat.com>
>
> This is a all in one helper that fault pages in a range and map them to
> a device so that every single device driver do not have to re-implement
> this common pattern.

Ok, correct me if I am wrong but these seem to effectively be the typical
"get_user_pages() + dma_map_page()" pattern that non-HMM drivers would
follow. Could we just teach get_user_pages() to take an HMM shortcut
based on the range?
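
(For context, a minimal sketch of the classic pattern referred to here,
roughly as a non-HMM driver might open code it today. The pin_and_map()
name is made up and error handling is reduced to the essentials.)

    /* Classic get_user_pages() + dma_map_page() pattern. Caller holds
     * current->mm->mmap_sem for read. */
    static long pin_and_map(struct device *dev, unsigned long start,
                            unsigned long npages, struct page **pages,
                            dma_addr_t *daddrs)
    {
        long i, got;

        got = get_user_pages(start, npages, FOLL_WRITE, pages, NULL);
        if (got <= 0)
            return got ? got : -EFAULT;

        for (i = 0; i < got; i++) {
            daddrs[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
                                     DMA_BIDIRECTIONAL);
            if (dma_mapping_error(dev, daddrs[i])) {
                while (i--)
                    dma_unmap_page(dev, daddrs[i], PAGE_SIZE,
                                   DMA_BIDIRECTIONAL);
                while (got--)
                    put_page(pages[got]);
                return -EFAULT;
            }
        }
        return got;
    }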

I'm interested in being able to share code across drivers and not have
to worry about the HMM special case at the api level.

And to be clear this isn't an anti-HMM critique this is a "yes, let's
do this, but how about a more fundamental change".

>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h |   9 +++
>  mm/hmm.c            | 152 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 161 insertions(+)
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4263f8fb32e5..fc3630d0bbfd 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -502,6 +502,15 @@ int hmm_range_register(struct hmm_range *range,
>  void hmm_range_unregister(struct hmm_range *range);
>  long hmm_range_snapshot(struct hmm_range *range);
>  long hmm_range_fault(struct hmm_range *range, bool block);
> +long hmm_range_dma_map(struct hmm_range *range,
> +                      struct device *device,
> +                      dma_addr_t *daddrs,
> +                      bool block);
> +long hmm_range_dma_unmap(struct hmm_range *range,
> +                        struct vm_area_struct *vma,
> +                        struct device *device,
> +                        dma_addr_t *daddrs,
> +                        bool dirty);
>
>  /*
>   * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 0a4ff31e9d7a..9cd68334a759 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -30,6 +30,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memremap.h>
>  #include <linux/jump_label.h>
> +#include <linux/dma-mapping.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/memory_hotplug.h>
>
> @@ -985,6 +986,157 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>         return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
>  }
>  EXPORT_SYMBOL(hmm_range_fault);
> +
> +/*
> + * hmm_range_dma_map() - hmm_range_fault() and dma map page all in one.
> + * @range: range being faulted
> + * @device: device against to dma map page to
> + * @daddrs: dma address of mapped pages
> + * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
> + * Returns: number of pages mapped on success, -EAGAIN if mmap_sem have been
> + *          drop and you need to try again, some other error value otherwise
> + *
> + * Note same usage pattern as hmm_range_fault().
> + */
> +long hmm_range_dma_map(struct hmm_range *range,
> +                      struct device *device,
> +                      dma_addr_t *daddrs,
> +                      bool block)
> +{
> +       unsigned long i, npages, mapped;
> +       long ret;
> +
> +       ret = hmm_range_fault(range, block);
> +       if (ret <= 0)
> +               return ret ? ret : -EBUSY;
> +
> +       npages = (range->end - range->start) >> PAGE_SHIFT;
> +       for (i = 0, mapped = 0; i < npages; ++i) {
> +               enum dma_data_direction dir = DMA_FROM_DEVICE;
> +               struct page *page;
> +
> +               /*
> +                * FIXME need to update DMA API to provide invalid DMA address
> +                * value instead of a function to test dma address value. This
> +                * would remove lot of dumb code duplicated accross many arch.
> +                *
> +                * For now setting it to 0 here is good enough as the pfns[]
> +                * value is what is use to check what is valid and what isn't.
> +                */
> +               daddrs[i] = 0;
> +
> +               page = hmm_pfn_to_page(range, range->pfns[i]);
> +               if (page == NULL)
> +                       continue;
> +
> +               /* Check if range is being invalidated */
> +               if (!range->valid) {
> +                       ret = -EBUSY;
> +                       goto unmap;
> +               }
> +
> +               /* If it is read and write than map bi-directional. */
> +               if (range->pfns[i] & range->values[HMM_PFN_WRITE])
> +                       dir = DMA_BIDIRECTIONAL;
> +
> +               daddrs[i] = dma_map_page(device, page, 0, PAGE_SIZE, dir);
> +               if (dma_mapping_error(device, daddrs[i])) {
> +                       ret = -EFAULT;
> +                       goto unmap;
> +               }
> +
> +               mapped++;
> +       }
> +
> +       return mapped;
> +
> +unmap:
> +       for (npages = i, i = 0; (i < npages) && mapped; ++i) {
> +               enum dma_data_direction dir = DMA_FROM_DEVICE;
> +               struct page *page;
> +
> +               page = hmm_pfn_to_page(range, range->pfns[i]);
> +               if (page == NULL)
> +                       continue;
> +
> +               if (dma_mapping_error(device, daddrs[i]))
> +                       continue;
> +
> +               /* If it is read and write than map bi-directional. */
> +               if (range->pfns[i] & range->values[HMM_PFN_WRITE])
> +                       dir = DMA_BIDIRECTIONAL;
> +
> +               dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
> +               mapped--;
> +       }
> +
> +       return ret;
> +}
> +EXPORT_SYMBOL(hmm_range_dma_map);
> +
> +/*
> + * hmm_range_dma_unmap() - unmap range of that was map with hmm_range_dma_map()
> + * @range: range being unmapped
> + * @vma: the vma against which the range (optional)
> + * @device: device against which dma map was done
> + * @daddrs: dma address of mapped pages
> + * @dirty: dirty page if it had the write flag set
> + * Returns: number of page unmapped on success, -EINVAL otherwise
> + *
> + * Note that caller MUST abide by mmu notifier or use HMM mirror and abide
> + * to the sync_cpu_device_pagetables() callback so that it is safe here to
> + * call set_page_dirty(). Caller must also take appropriate locks to avoid
> + * concurrent mmu notifier or sync_cpu_device_pagetables() to make progress.
> + */
> +long hmm_range_dma_unmap(struct hmm_range *range,
> +                        struct vm_area_struct *vma,
> +                        struct device *device,
> +                        dma_addr_t *daddrs,
> +                        bool dirty)
> +{
> +       unsigned long i, npages;
> +       long cpages = 0;
> +
> +       /* Sanity check. */
> +       if (range->end <= range->start)
> +               return -EINVAL;
> +       if (!daddrs)
> +               return -EINVAL;
> +       if (!range->pfns)
> +               return -EINVAL;
> +
> +       npages = (range->end - range->start) >> PAGE_SHIFT;
> +       for (i = 0; i < npages; ++i) {
> +               enum dma_data_direction dir = DMA_FROM_DEVICE;
> +               struct page *page;
> +
> +               page = hmm_pfn_to_page(range, range->pfns[i]);
> +               if (page == NULL)
> +                       continue;
> +
> +               /* If it is read and write than map bi-directional. */
> +               if (range->pfns[i] & range->values[HMM_PFN_WRITE]) {
> +                       dir = DMA_BIDIRECTIONAL;
> +
> +                       /*
> +                        * See comments in function description on why it is
> +                        * safe here to call set_page_dirty()
> +                        */
> +                       if (dirty)
> +                               set_page_dirty(page);
> +               }
> +
> +               /* Unmap and clear pfns/dma address */
> +               dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
> +               range->pfns[i] = range->values[HMM_PFN_NONE];
> +               /* FIXME see comments in hmm_vma_dma_map() */
> +               daddrs[i] = 0;
> +               cpages++;
> +       }
> +
> +       return cpages;
> +}
> +EXPORT_SYMBOL(hmm_range_dma_unmap);
>  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>
>
> --
> 2.17.2
>

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-03-18 20:21   ` Dan Williams
@ 2019-03-18 20:41     ` Jerome Glisse
  2019-03-18 21:30       ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-18 20:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard

On Mon, Mar 18, 2019 at 01:21:00PM -0700, Dan Williams wrote:
> On Tue, Jan 29, 2019 at 8:55 AM <jglisse@redhat.com> wrote:
> >
> > From: Jérôme Glisse <jglisse@redhat.com>
> >
> > This is a all in one helper that fault pages in a range and map them to
> > a device so that every single device driver do not have to re-implement
> > this common pattern.
> 
> Ok, correct me if I am wrong but these seem effectively be the typical
> "get_user_pages() + dma_map_page()" pattern that non-HMM drivers would
> follow. Could we just teach get_user_pages() to take an HMM shortcut
> based on the range?
> 
> I'm interested in being able to share code across drivers and not have
> to worry about the HMM special case at the api level.
> 
> And to be clear this isn't an anti-HMM critique this is a "yes, let's
> do this, but how about a more fundamental change".

It is a yes and no. HMM has the synchronization with mmu notifiers,
which is not common to all device drivers, ie there are device drivers
that do not synchronize with mmu notifiers and use GUP. For instance
see the range->valid test in the code below: this is HMM specific and
it would not apply to GUP users.

Nonetheless I want to remove more HMM code and grow GUP to do some of
this too, so that HMM and non-HMM drivers can share the common part
(under GUP). But right now updating GUP is too big an endeavor. I need
to make progress on more drivers with HMM before thinking of messing
with GUP code. Making that code HMM only for now will make the GUP
factorization easier and smaller down the road (it should only need an
update to the HMM helpers and not to each individual driver using HMM).
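
(To illustrate that difference, a rough sketch of the synchronization a
mirroring driver does around hmm_range_fault(). The struct mydriver and
its mirror_lock are hypothetical; range->valid is the test mentioned
above, and the same lock would be taken by the driver's mmu notifier
callback. GUP users have no equivalent step.)

    struct mydriver {
        struct mutex mirror_lock; /* also taken by our mmu notifier callback */
        /* ... device page table state ... */
    };

    static long mydriver_fault_and_program(struct mydriver *drv,
                                           struct hmm_range *range)
    {
        long ret;

    again:
        ret = hmm_range_fault(range, true);
        if (ret <= 0)
            return ret ? ret : -EBUSY;

        mutex_lock(&drv->mirror_lock);
        if (!range->valid) {
            /* A concurrent invalidation raced with us, start over. */
            mutex_unlock(&drv->mirror_lock);
            goto again;
        }
        /* ... program the device page tables from range->pfns[] ... */
        mutex_unlock(&drv->mirror_lock);
        return ret;
    }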

FYI here is my todo list:
    - this patchset
    - HMM ODP
    - mmu notifier changes for optimization and device range binding
    - device range binding (amdgpu/nouveau/...)
    - factor out some nouveau deep inner-layer code to outer-layer for
      more code sharing
    - page->mapping endeavor for generic page protection, for instance
      KSM with file backed pages
    - grow GUP to remove HMM code and consolidate with GUP code
    ...

Cheers,
Jérôme

> 
> >
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > ---
> >  include/linux/hmm.h |   9 +++
> >  mm/hmm.c            | 152 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 161 insertions(+)
> >
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 4263f8fb32e5..fc3630d0bbfd 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -502,6 +502,15 @@ int hmm_range_register(struct hmm_range *range,
> >  void hmm_range_unregister(struct hmm_range *range);
> >  long hmm_range_snapshot(struct hmm_range *range);
> >  long hmm_range_fault(struct hmm_range *range, bool block);
> > +long hmm_range_dma_map(struct hmm_range *range,
> > +                      struct device *device,
> > +                      dma_addr_t *daddrs,
> > +                      bool block);
> > +long hmm_range_dma_unmap(struct hmm_range *range,
> > +                        struct vm_area_struct *vma,
> > +                        struct device *device,
> > +                        dma_addr_t *daddrs,
> > +                        bool dirty);
> >
> >  /*
> >   * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 0a4ff31e9d7a..9cd68334a759 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -30,6 +30,7 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/memremap.h>
> >  #include <linux/jump_label.h>
> > +#include <linux/dma-mapping.h>
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/memory_hotplug.h>
> >
> > @@ -985,6 +986,157 @@ long hmm_range_fault(struct hmm_range *range, bool block)
> >         return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
> >  }
> >  EXPORT_SYMBOL(hmm_range_fault);
> > +
> > +/*
> > + * hmm_range_dma_map() - hmm_range_fault() and dma map page all in one.
> > + * @range: range being faulted
> > + * @device: device against to dma map page to
> > + * @daddrs: dma address of mapped pages
> > + * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
> > + * Returns: number of pages mapped on success, -EAGAIN if mmap_sem have been
> > + *          drop and you need to try again, some other error value otherwise
> > + *
> > + * Note same usage pattern as hmm_range_fault().
> > + */
> > +long hmm_range_dma_map(struct hmm_range *range,
> > +                      struct device *device,
> > +                      dma_addr_t *daddrs,
> > +                      bool block)
> > +{
> > +       unsigned long i, npages, mapped;
> > +       long ret;
> > +
> > +       ret = hmm_range_fault(range, block);
> > +       if (ret <= 0)
> > +               return ret ? ret : -EBUSY;
> > +
> > +       npages = (range->end - range->start) >> PAGE_SHIFT;
> > +       for (i = 0, mapped = 0; i < npages; ++i) {
> > +               enum dma_data_direction dir = DMA_FROM_DEVICE;
> > +               struct page *page;
> > +
> > +               /*
> > +                * FIXME need to update DMA API to provide invalid DMA address
> > +                * value instead of a function to test dma address value. This
> > +                * would remove lot of dumb code duplicated accross many arch.
> > +                *
> > +                * For now setting it to 0 here is good enough as the pfns[]
> > +                * value is what is use to check what is valid and what isn't.
> > +                */
> > +               daddrs[i] = 0;
> > +
> > +               page = hmm_pfn_to_page(range, range->pfns[i]);
> > +               if (page == NULL)
> > +                       continue;
> > +
> > +               /* Check if range is being invalidated */
> > +               if (!range->valid) {
> > +                       ret = -EBUSY;
> > +                       goto unmap;
> > +               }
> > +
> > +               /* If it is read and write than map bi-directional. */
> > +               if (range->pfns[i] & range->values[HMM_PFN_WRITE])
> > +                       dir = DMA_BIDIRECTIONAL;
> > +
> > +               daddrs[i] = dma_map_page(device, page, 0, PAGE_SIZE, dir);
> > +               if (dma_mapping_error(device, daddrs[i])) {
> > +                       ret = -EFAULT;
> > +                       goto unmap;
> > +               }
> > +
> > +               mapped++;
> > +       }
> > +
> > +       return mapped;
> > +
> > +unmap:
> > +       for (npages = i, i = 0; (i < npages) && mapped; ++i) {
> > +               enum dma_data_direction dir = DMA_FROM_DEVICE;
> > +               struct page *page;
> > +
> > +               page = hmm_pfn_to_page(range, range->pfns[i]);
> > +               if (page == NULL)
> > +                       continue;
> > +
> > +               if (dma_mapping_error(device, daddrs[i]))
> > +                       continue;
> > +
> > +               /* If it is read and write than map bi-directional. */
> > +               if (range->pfns[i] & range->values[HMM_PFN_WRITE])
> > +                       dir = DMA_BIDIRECTIONAL;
> > +
> > +               dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
> > +               mapped--;
> > +       }
> > +
> > +       return ret;
> > +}
> > +EXPORT_SYMBOL(hmm_range_dma_map);
> > +
> > +/*
> > + * hmm_range_dma_unmap() - unmap range of that was map with hmm_range_dma_map()
> > + * @range: range being unmapped
> > + * @vma: the vma against which the range (optional)
> > + * @device: device against which dma map was done
> > + * @daddrs: dma address of mapped pages
> > + * @dirty: dirty page if it had the write flag set
> > + * Returns: number of page unmapped on success, -EINVAL otherwise
> > + *
> > + * Note that caller MUST abide by mmu notifier or use HMM mirror and abide
> > + * to the sync_cpu_device_pagetables() callback so that it is safe here to
> > + * call set_page_dirty(). Caller must also take appropriate locks to avoid
> > + * concurrent mmu notifier or sync_cpu_device_pagetables() to make progress.
> > + */
> > +long hmm_range_dma_unmap(struct hmm_range *range,
> > +                        struct vm_area_struct *vma,
> > +                        struct device *device,
> > +                        dma_addr_t *daddrs,
> > +                        bool dirty)
> > +{
> > +       unsigned long i, npages;
> > +       long cpages = 0;
> > +
> > +       /* Sanity check. */
> > +       if (range->end <= range->start)
> > +               return -EINVAL;
> > +       if (!daddrs)
> > +               return -EINVAL;
> > +       if (!range->pfns)
> > +               return -EINVAL;
> > +
> > +       npages = (range->end - range->start) >> PAGE_SHIFT;
> > +       for (i = 0; i < npages; ++i) {
> > +               enum dma_data_direction dir = DMA_FROM_DEVICE;
> > +               struct page *page;
> > +
> > +               page = hmm_pfn_to_page(range, range->pfns[i]);
> > +               if (page == NULL)
> > +                       continue;
> > +
> > +               /* If it is read and write than map bi-directional. */
> > +               if (range->pfns[i] & range->values[HMM_PFN_WRITE]) {
> > +                       dir = DMA_BIDIRECTIONAL;
> > +
> > +                       /*
> > +                        * See comments in function description on why it is
> > +                        * safe here to call set_page_dirty()
> > +                        */
> > +                       if (dirty)
> > +                               set_page_dirty(page);
> > +               }
> > +
> > +               /* Unmap and clear pfns/dma address */
> > +               dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
> > +               range->pfns[i] = range->values[HMM_PFN_NONE];
> > +               /* FIXME see comments in hmm_vma_dma_map() */
> > +               daddrs[i] = 0;
> > +               cpages++;
> > +       }
> > +
> > +       return cpages;
> > +}
> > +EXPORT_SYMBOL(hmm_range_dma_unmap);
> >  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> >
> >
> > --
> > 2.17.2
> >

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-03-18 20:41     ` Jerome Glisse
@ 2019-03-18 21:30       ` Dan Williams
  2019-03-18 22:15         ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-18 21:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard

On Mon, Mar 18, 2019 at 1:41 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Mon, Mar 18, 2019 at 01:21:00PM -0700, Dan Williams wrote:
> > On Tue, Jan 29, 2019 at 8:55 AM <jglisse@redhat.com> wrote:
> > >
> > > From: Jérôme Glisse <jglisse@redhat.com>
> > >
> > > This is a all in one helper that fault pages in a range and map them to
> > > a device so that every single device driver do not have to re-implement
> > > this common pattern.
> >
> > Ok, correct me if I am wrong but these seem effectively be the typical
> > "get_user_pages() + dma_map_page()" pattern that non-HMM drivers would
> > follow. Could we just teach get_user_pages() to take an HMM shortcut
> > based on the range?
> >
> > I'm interested in being able to share code across drivers and not have
> > to worry about the HMM special case at the api level.
> >
> > And to be clear this isn't an anti-HMM critique this is a "yes, let's
> > do this, but how about a more fundamental change".
>
> It is a yes and no, HMM have the synchronization with mmu notifier
> which is not common to all device driver ie you have device driver
> that do not synchronize with mmu notifier and use GUP. For instance
> see the range->valid test in below code this is HMM specific and it
> would not apply to GUP user.
>
> Nonetheless i want to remove more HMM code and grow GUP to do some
> of this too so that HMM and non HMM driver can share the common part
> (under GUP). But right now updating GUP is a too big endeavor.

I'm open to that argument, but that statement then seems to indicate
that these apis are indeed temporary. If the end game is common api
between HMM and non-HMM drivers then I think these should at least
come with /* TODO: */ comments about what might change in the future,
and should then be EXPORT_SYMBOL_GPL since they're already planned to
be deprecated. They are a point-in-time export for a work-in-progress
interface.

> I need
> to make progress on more driver with HMM before thinking of messing
> with GUP code. Making that code HMM only for now will make the GUP
> factorization easier and smaller down the road (should only need to
> update HMM helper and not each individual driver which use HMM).
>
> FYI here is my todo list:
>     - this patchset
>     - HMM ODP
>     - mmu notifier changes for optimization and device range binding
>     - device range binding (amdgpu/nouveau/...)
>     - factor out some nouveau deep inner-layer code to outer-layer for
>       more code sharing
>     - page->mapping endeavor for generic page protection for instance
>       KSM with file back page
>     - grow GUP to remove HMM code and consolidate with GUP code

Sounds workable as a plan.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-03-18 21:30       ` Dan Williams
@ 2019-03-18 22:15         ` Jerome Glisse
  2019-03-19  3:29           ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-18 22:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard

On Mon, Mar 18, 2019 at 02:30:15PM -0700, Dan Williams wrote:
> On Mon, Mar 18, 2019 at 1:41 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Mon, Mar 18, 2019 at 01:21:00PM -0700, Dan Williams wrote:
> > > On Tue, Jan 29, 2019 at 8:55 AM <jglisse@redhat.com> wrote:
> > > >
> > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > >
> > > > This is a all in one helper that fault pages in a range and map them to
> > > > a device so that every single device driver do not have to re-implement
> > > > this common pattern.
> > >
> > > Ok, correct me if I am wrong but these seem effectively be the typical
> > > "get_user_pages() + dma_map_page()" pattern that non-HMM drivers would
> > > follow. Could we just teach get_user_pages() to take an HMM shortcut
> > > based on the range?
> > >
> > > I'm interested in being able to share code across drivers and not have
> > > to worry about the HMM special case at the api level.
> > >
> > > And to be clear this isn't an anti-HMM critique this is a "yes, let's
> > > do this, but how about a more fundamental change".
> >
> > It is a yes and no, HMM have the synchronization with mmu notifier
> > which is not common to all device driver ie you have device driver
> > that do not synchronize with mmu notifier and use GUP. For instance
> > see the range->valid test in below code this is HMM specific and it
> > would not apply to GUP user.
> >
> > Nonetheless i want to remove more HMM code and grow GUP to do some
> > of this too so that HMM and non HMM driver can share the common part
> > (under GUP). But right now updating GUP is a too big endeavor.
> 
> I'm open to that argument, but that statement then seems to indicate
> that these apis are indeed temporary. If the end game is common api
> between HMM and non-HMM drivers then I think these should at least
> come with /* TODO: */ comments about what might change in the future,
> and then should be EXPORT_SYMBOL_GPL since they're already planning to
> be deprecated. They are a point in time export for a work-in-progress
> interface.

The API is not temporary, it will stay the same, ie the device drivers
using HMM would not need further modification. Only the inner workings
of HMM would be ported over to use an improved common GUP. But GUP has
a few shortcomings today that would be regressions for HMM:
    - huge page handling (ie dma mapping the huge page, not 4k chunks
      of the huge page)
    - not incrementing the page refcount for HMM (other users like
      userfaultfd also want a GUP without FOLL_GET because they abide
      by mmu notifiers)
    - support for device memory without leaking it, ie restricting such
      memory to callers that can handle it properly and are fully
      aware of the gotchas that come with it
    ...

So before converting HMM to use common GUP code underneath, those GUP
shortcomings (from the HMM POV) need to be addressed, and at the same
time the common dma map pattern can be added as an extra GUP helper.

The issue is that some of the above changes need to be done carefully
so as not to impact existing GUP users. So I would rather clear some of
my plate before starting to chew on this carefully.

Also, doing this patch first and then the GUP work solves the first
user problem you have been asking about. With this code in first, the
first user of the GUP conversion will be all the devices that use
those two HMM functions. In turn the first user of that code is the
ODP RDMA patch I already posted. Second will be nouveau once I sort
out some nouveau changes. I expect amdgpu to come in a close third as
a user, and other device drivers who are working on HMM integration to
come shortly after.



> > I need
> > to make progress on more driver with HMM before thinking of messing
> > with GUP code. Making that code HMM only for now will make the GUP
> > factorization easier and smaller down the road (should only need to
> > update HMM helper and not each individual driver which use HMM).
> >
> > FYI here is my todo list:
> >     - this patchset
> >     - HMM ODP
> >     - mmu notifier changes for optimization and device range binding
> >     - device range binding (amdgpu/nouveau/...)
> >     - factor out some nouveau deep inner-layer code to outer-layer for
> >       more code sharing
> >     - page->mapping endeavor for generic page protection for instance
> >       KSM with file back page
> >     - grow GUP to remove HMM code and consolidate with GUP code
> 
> Sounds workable as a plan.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-03-18 22:15         ` Jerome Glisse
@ 2019-03-19  3:29           ` Dan Williams
  2019-03-19 13:30             ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-19  3:29 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard

On Mon, Mar 18, 2019 at 3:15 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Mon, Mar 18, 2019 at 02:30:15PM -0700, Dan Williams wrote:
> > On Mon, Mar 18, 2019 at 1:41 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Mon, Mar 18, 2019 at 01:21:00PM -0700, Dan Williams wrote:
> > > > On Tue, Jan 29, 2019 at 8:55 AM <jglisse@redhat.com> wrote:
> > > > >
> > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > >
> > > > > This is a all in one helper that fault pages in a range and map them to
> > > > > a device so that every single device driver do not have to re-implement
> > > > > this common pattern.
> > > >
> > > > Ok, correct me if I am wrong but these seem effectively be the typical
> > > > "get_user_pages() + dma_map_page()" pattern that non-HMM drivers would
> > > > follow. Could we just teach get_user_pages() to take an HMM shortcut
> > > > based on the range?
> > > >
> > > > I'm interested in being able to share code across drivers and not have
> > > > to worry about the HMM special case at the api level.
> > > >
> > > > And to be clear this isn't an anti-HMM critique this is a "yes, let's
> > > > do this, but how about a more fundamental change".
> > >
> > > It is a yes and no, HMM have the synchronization with mmu notifier
> > > which is not common to all device driver ie you have device driver
> > > that do not synchronize with mmu notifier and use GUP. For instance
> > > see the range->valid test in below code this is HMM specific and it
> > > would not apply to GUP user.
> > >
> > > Nonetheless i want to remove more HMM code and grow GUP to do some
> > > of this too so that HMM and non HMM driver can share the common part
> > > (under GUP). But right now updating GUP is a too big endeavor.
> >
> > I'm open to that argument, but that statement then seems to indicate
> > that these apis are indeed temporary. If the end game is common api
> > between HMM and non-HMM drivers then I think these should at least
> > come with /* TODO: */ comments about what might change in the future,
> > and then should be EXPORT_SYMBOL_GPL since they're already planning to
> > be deprecated. They are a point in time export for a work-in-progress
> > interface.
>
> The API is not temporary it will stay the same ie the device driver
> using HMM would not need further modification. Only the inner working
> of HMM would be ported over to use improved common GUP. But GUP has
> few shortcoming today that would be a regression for HMM:
>     - huge page handling (ie dma mapping huge page not 4k chunk of
>       huge page)
>     - not incrementing page refcount for HMM (other user like user-
>       faultd also want a GUP without FOLL_GET because they abide by
>       mmu notifier)
>     - support for device memory without leaking it ie restrict such
>       memory to caller that can handle it properly and are fully
>       aware of the gotcha that comes with it
>     ...

...but this is backwards because the end state is 2 driver interfaces
for dealing with page mappings instead of one. My primary critique of
HMM is that it creates a parallel universe of HMM apis rather than
evolving the existing core apis.

> So before converting HMM to use common GUP code under-neath those GUP
> shortcoming (from HMM POV) need to be addressed and at the same time
> the common dma map pattern can be added as an extra GUP helper.

If the HMM special cases are not being absorbed into the core-mm over
time then I think this is going in the wrong direction. Specifically a
direction that increases the long term maintenance burden over time as
HMM drivers stay needlessly separated.

> The issue is that some of the above changes need to be done carefully
> to not impact existing GUP users. So i rather clear some of my plate
> before starting chewing on this carefully.

I urge you to put this kind of consideration first and not "merge
first, ask hard questions later".

> Also doing this patch first and then the GUP thing solve the first user
> problem you have been asking for. With that code in first the first user
> of the GUP convertion will be all the devices that use those two HMM
> functions. In turn the first user of that code is the ODP RDMA patch
> i already posted. Second will be nouveau once i tackle out some nouveau
> changes. I expect amdgpu to come close third as a user and other device
> driver who are working on HMM integration to come shortly after.

I appreciate that it has users, but the point of having users is so
that the code review can actually be fruitful to see if the
infrastructure makes sense, and in this case it seems to be
duplicating an existing common pattern in the kernel.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-03-19 13:30             ` Jerome Glisse
@ 2019-03-19  8:44               ` Ira Weiny
  2019-03-19 17:10                 ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Ira Weiny @ 2019-03-19  8:44 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, linux-mm, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard

On Tue, Mar 19, 2019 at 09:30:05AM -0400, Jerome Glisse wrote:
> On Mon, Mar 18, 2019 at 08:29:45PM -0700, Dan Williams wrote:
> > On Mon, Mar 18, 2019 at 3:15 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Mon, Mar 18, 2019 at 02:30:15PM -0700, Dan Williams wrote:
> > > > On Mon, Mar 18, 2019 at 1:41 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > >
> > > > > On Mon, Mar 18, 2019 at 01:21:00PM -0700, Dan Williams wrote:
> > > > > > On Tue, Jan 29, 2019 at 8:55 AM <jglisse@redhat.com> wrote:
> > > > > > >
> > > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > >
> > > > > > > This is a all in one helper that fault pages in a range and map them to
> > > > > > > a device so that every single device driver do not have to re-implement
> > > > > > > this common pattern.
> > > > > >
> > > > > > Ok, correct me if I am wrong but these seem effectively be the typical
> > > > > > "get_user_pages() + dma_map_page()" pattern that non-HMM drivers would
> > > > > > follow. Could we just teach get_user_pages() to take an HMM shortcut
> > > > > > based on the range?
> > > > > >
> > > > > > I'm interested in being able to share code across drivers and not have
> > > > > > to worry about the HMM special case at the api level.
> > > > > >
> > > > > > And to be clear this isn't an anti-HMM critique this is a "yes, let's
> > > > > > do this, but how about a more fundamental change".
> > > > >
> > > > > It is a yes and no, HMM have the synchronization with mmu notifier
> > > > > which is not common to all device driver ie you have device driver
> > > > > that do not synchronize with mmu notifier and use GUP. For instance
> > > > > see the range->valid test in below code this is HMM specific and it
> > > > > would not apply to GUP user.
> > > > >
> > > > > Nonetheless i want to remove more HMM code and grow GUP to do some
> > > > > of this too so that HMM and non HMM driver can share the common part
> > > > > (under GUP). But right now updating GUP is a too big endeavor.
> > > >
> > > > I'm open to that argument, but that statement then seems to indicate
> > > > that these apis are indeed temporary. If the end game is common api
> > > > between HMM and non-HMM drivers then I think these should at least
> > > > come with /* TODO: */ comments about what might change in the future,
> > > > and then should be EXPORT_SYMBOL_GPL since they're already planning to
> > > > be deprecated. They are a point in time export for a work-in-progress
> > > > interface.
> > >
> > > The API is not temporary it will stay the same ie the device driver
> > > using HMM would not need further modification. Only the inner working
> > > of HMM would be ported over to use improved common GUP. But GUP has
> > > few shortcoming today that would be a regression for HMM:
> > >     - huge page handling (ie dma mapping huge page not 4k chunk of
> > >       huge page)

Agreed.

> > >     - not incrementing page refcount for HMM (other user like user-
> > >       faultd also want a GUP without FOLL_GET because they abide by
> > >       mmu notifier)
> > >     - support for device memory without leaking it ie restrict such
> > >       memory to caller that can handle it properly and are fully
> > >       aware of the gotcha that comes with it
> > >     ...
> > 
> > ...but this is backwards because the end state is 2 driver interfaces
> > for dealing with page mappings instead of one. My primary critique of
> > HMM is that it creates a parallel universe of HMM apis rather than
> > evolving the existing core apis.
> 
> Just to make it clear here is pseudo code:
>     gup_range_dma_map() {...}
> 
>     hmm_range_dma_map() {
>         hmm_specific_prep_step();
>         gup_range_dma_map();

Does this GUP use FOLL_GET and then a put after the mmu_notifier is setup?

>         hmm_specific_post_step();
>     }
> 
> Like I said, HMM does have the synchronization with mmu notifiers to
> take care of, and other users of the GUP + dma map pattern do not care
> about that. Hence not everything can be shared between device drivers
> that can not do mmu notifiers and the others.
> 
> Is that not acceptable to you ? Should every driver duplicate the code
> HMM factorizes ?
> 

In the final API you envision, will drivers be able to call gup_range_dma_map()
_or_ hmm_range_dma_map()?

If so, at that time how will drivers know which to call, and what parameters
control those calls?

> 
> > > So before converting HMM to use common GUP code under-neath those GUP
> > > shortcoming (from HMM POV) need to be addressed and at the same time
> > > the common dma map pattern can be added as an extra GUP helper.
> > 
> > If the HMM special cases are not being absorbed into the core-mm over
> > time then I think this is going in the wrong direction. Specifically a
> > direction that increases the long term maintenance burden over time as
> > HMM drivers stay needlessly separated.
> 
> HMM is core mm, and other things like GUP do not need to absorb all of
> HMM, as that would force mmu notifiers down on users that can not
> leverage mmu notifiers. So forcing down something that is useless to
> others is pointless, don't you agree ?
> 
> > 
> > > The issue is that some of the above changes need to be done carefully
> > > to not impact existing GUP users. So i rather clear some of my plate
> > > before starting chewing on this carefully.
> > 
> > I urge you to put this kind of consideration first and not "merge
> > first, ask hard questions later".
> 
> There is no hard question here. GUP does not handle the THP optimization
> and other things HMM and ODP have. Adding this to GUP needs to be done
> carefully so as not to break existing GUP users. So I am taking a
> small-step approach; since when is that a bad thing ? First merge HMM and
> ODP together, then push the common things down into GUP. It is a lot
> safer than a huge jump.

FWIW I think a new interface which allows new features during a
transition is a good thing.  But if that comes at the price of leaving
the old "deficient" interface sitting around, that presents confusion
for driver writers, and we get users calling GUP when perhaps they
should be calling HMM.

I think having GPL exports helps to ensure we can later merge these, to
make it clear to driver writers what the right thing to do is.

> 
> > 
> > > Also doing this patch first and then the GUP thing solve the first user
> > > problem you have been asking for. With that code in first the first user
> > > of the GUP convertion will be all the devices that use those two HMM
> > > functions. In turn the first user of that code is the ODP RDMA patch i
> > > already posted. Second will be nouveau once i tackle out some nouveau
> > > changes. I expect amdgpu to come close third as a user and other device
> > > driver who are working on HMM integration to come shortly after.
> > 
> > I appreciate that it has users, but the point of having users is so that
> > the code review can actually be fruitful to see if the infrastructure makes
> > sense, and in this case it seems to be duplicating an existing common
> > pattern in the kernel.
> 
> It is not duplicating anything i am removing code at the end if you include
           ^^^^^^^^^^^^^

The duplication is in how drivers indicate to the core that a set of pages is
being used by the hardware the driver is controlling, what the rules for those
pages are and how the use by that hardware is going to be coordinated with the
other hardware vying for those pages.  There are differences, true, but
fundamentally it would be nice for drivers to not have to care about the
details.

Maybe that is a dream we will never realize but if there are going to be
different ways for drivers to "get pages" then we need to make it clear what it
means when those pages come to the driver and how they can be used safely.

Ira

> the odp convertion patch and i will remove more code once i am done with
> nouveau changes, and again more code once other driver catchup.
> 
> Cheers, Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-03-19  3:29           ` Dan Williams
@ 2019-03-19 13:30             ` Jerome Glisse
  2019-03-19  8:44               ` Ira Weiny
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-19 13:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-mm, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard

On Mon, Mar 18, 2019 at 08:29:45PM -0700, Dan Williams wrote:
> On Mon, Mar 18, 2019 at 3:15 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Mon, Mar 18, 2019 at 02:30:15PM -0700, Dan Williams wrote:
> > > On Mon, Mar 18, 2019 at 1:41 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Mon, Mar 18, 2019 at 01:21:00PM -0700, Dan Williams wrote:
> > > > > On Tue, Jan 29, 2019 at 8:55 AM <jglisse@redhat.com> wrote:
> > > > > >
> > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > >
> > > > > > This is a all in one helper that fault pages in a range and map them to
> > > > > > a device so that every single device driver do not have to re-implement
> > > > > > this common pattern.
> > > > >
> > > > > Ok, correct me if I am wrong but these seem effectively be the typical
> > > > > "get_user_pages() + dma_map_page()" pattern that non-HMM drivers would
> > > > > follow. Could we just teach get_user_pages() to take an HMM shortcut
> > > > > based on the range?
> > > > >
> > > > > I'm interested in being able to share code across drivers and not have
> > > > > to worry about the HMM special case at the api level.
> > > > >
> > > > > And to be clear this isn't an anti-HMM critique this is a "yes, let's
> > > > > do this, but how about a more fundamental change".
> > > >
> > > > It is a yes and no, HMM have the synchronization with mmu notifier
> > > > which is not common to all device driver ie you have device driver
> > > > that do not synchronize with mmu notifier and use GUP. For instance
> > > > see the range->valid test in below code this is HMM specific and it
> > > > would not apply to GUP user.
> > > >
> > > > Nonetheless i want to remove more HMM code and grow GUP to do some
> > > > of this too so that HMM and non HMM driver can share the common part
> > > > (under GUP). But right now updating GUP is a too big endeavor.
> > >
> > > I'm open to that argument, but that statement then seems to indicate
> > > that these apis are indeed temporary. If the end game is common api
> > > between HMM and non-HMM drivers then I think these should at least
> > > come with /* TODO: */ comments about what might change in the future,
> > > and then should be EXPORT_SYMBOL_GPL since they're already planning to
> > > be deprecated. They are a point in time export for a work-in-progress
> > > interface.
> >
> > The API is not temporary it will stay the same ie the device driver
> > using HMM would not need further modification. Only the inner working
> > of HMM would be ported over to use improved common GUP. But GUP has
> > few shortcoming today that would be a regression for HMM:
> >     - huge page handling (ie dma mapping huge page not 4k chunk of
> >       huge page)
> >     - not incrementing page refcount for HMM (other user like user-
> >       faultd also want a GUP without FOLL_GET because they abide by
> >       mmu notifier)
> >     - support for device memory without leaking it ie restrict such
> >       memory to caller that can handle it properly and are fully
> >       aware of the gotcha that comes with it
> >     ...
> 
> ...but this is backwards because the end state is 2 driver interfaces
> for dealing with page mappings instead of one. My primary critique of
> HMM is that it creates a parallel universe of HMM apis rather than
> evolving the existing core apis.

Just to make it clear here is pseudo code:
    gup_range_dma_map() {...}

    hmm_range_dma_map() {
        hmm_specific_prep_step();
        gup_range_dma_map();
        hmm_specific_post_step();
    }

Like I said, HMM has the synchronization with the mmu notifier to take
care of, and other users of the GUP and dma map pattern do not care about
that. Hence not everything can be shared between device drivers that
cannot do mmu notifier and the others.

Is that not acceptable to you? Should every driver duplicate the code
HMM factors out?
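
Spelling that layering out as a rough C sketch (gup_range_dma_map() and the
prep/post helpers are hypothetical names taken from the pseudo code above, and
the argument lists are illustrative only, not the real signature):

    /*
     * Rough sketch only: gup_range_dma_map(), hmm_specific_prep_step() and
     * hmm_specific_post_step() are hypothetical names from the pseudo code
     * above; argument lists are illustrative, not the real signature.
     */
    long hmm_range_dma_map(struct hmm_range *range, struct device *device,
                           dma_addr_t *daddrs)
    {
        long ret;

        /* HMM-only step: fault/snapshot the range without taking a page
         * reference (no FOLL_GET), relying on mmu notifier synchronization. */
        ret = hmm_specific_prep_step(range);
        if (ret)
            return ret;

        /* Part that could be shared with pinning (GUP) drivers: walk the
         * pages of the range and dma map them for the device. */
        ret = gup_range_dma_map(range, device, daddrs);

        /* HMM-only step: recheck range->valid so the caller can retry if an
         * invalidation raced with the mapping, and release HMM state. */
        return hmm_specific_post_step(range, ret);
    }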


> > So before converting HMM to use common GUP code under-neath those GUP
> > shortcoming (from HMM POV) need to be addressed and at the same time
> > the common dma map pattern can be added as an extra GUP helper.
> 
> If the HMM special cases are not being absorbed into the core-mm over
> time then I think this is going in the wrong direction. Specifically a
> direction that increases the long term maintenance burden over time as
> HMM drivers stay needlessly separated.

HMM is core mm, and other things like GUP do not need to absorb all of HMM,
as that would force mmu notifier down on users that cannot leverage mmu
notifier. Forcing something that is useless on others is pointless, don't
you agree?

> 
> > The issue is that some of the above changes need to be done carefully
> > to not impact existing GUP users. So i rather clear some of my plate
> > before starting chewing on this carefully.
> 
> I urge you to put this kind of consideration first and not "merge
> first, ask hard questions later".

There is no hard question here. GUP does not handle the THP optimization and
other things that HMM and ODP have. Adding this to GUP needs to be done
carefully so as not to break existing GUP users. So I am taking a small-step
approach; since when is that a bad thing? First merge HMM and ODP together,
then push the common parts down into GUP. It is a lot safer than a huge jump.

> 
> > Also doing this patch first and then the GUP thing solve the first user
> > problem you have been asking for. With that code in first the first user
> > of the GUP convertion will be all the devices that use those two HMM
> > functions. In turn the first user of that code is the ODP RDMA patch
> > i already posted. Second will be nouveau once i tackle out some nouveau
> > changes. I expect amdgpu to come close third as a user and other device
> > driver who are working on HMM integration to come shortly after.
> 
> I appreciate that it has users, but the point of having users is so
> that the code review can actually be fruitful to see if the
> infrastructure makes sense, and in this case it seems to be
> duplicating an existing common pattern in the kernel.

It is not duplicating anything: I am removing code in the end if you
include the ODP conversion patch, and I will remove more code once
I am done with the nouveau changes, and again more code once other drivers
catch up.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-03-19 17:10                 ` Jerome Glisse
@ 2019-03-19 14:10                   ` Ira Weiny
  0 siblings, 0 replies; 98+ messages in thread
From: Ira Weiny @ 2019-03-19 14:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, linux-mm, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard

On Tue, Mar 19, 2019 at 01:10:43PM -0400, Jerome Glisse wrote:
> On Tue, Mar 19, 2019 at 01:44:57AM -0700, Ira Weiny wrote:
> > On Tue, Mar 19, 2019 at 09:30:05AM -0400, Jerome Glisse wrote:
> > > On Mon, Mar 18, 2019 at 08:29:45PM -0700, Dan Williams wrote:
> > > > On Mon, Mar 18, 2019 at 3:15 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > >
> > > > > On Mon, Mar 18, 2019 at 02:30:15PM -0700, Dan Williams wrote:
> > > > > > On Mon, Mar 18, 2019 at 1:41 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, Mar 18, 2019 at 01:21:00PM -0700, Dan Williams wrote:
> > > > > > > > On Tue, Jan 29, 2019 at 8:55 AM <jglisse@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > > > >

[snip]

> > > > >
> > > > > The API is not temporary it will stay the same ie the device driver
> > > > > using HMM would not need further modification. Only the inner working
> > > > > of HMM would be ported over to use improved common GUP. But GUP has
> > > > > few shortcoming today that would be a regression for HMM:
> > > > >     - huge page handling (ie dma mapping huge page not 4k chunk of
> > > > >       huge page)
> > 
> > Agreed.
> > 
> > > > >     - not incrementing page refcount for HMM (other user like user-
> > > > >       faultd also want a GUP without FOLL_GET because they abide by
> > > > >       mmu notifier)
> > > > >     - support for device memory without leaking it ie restrict such
> > > > >       memory to caller that can handle it properly and are fully
> > > > >       aware of the gotcha that comes with it
> > > > >     ...
> > > > 
> > > > ...but this is backwards because the end state is 2 driver interfaces
> > > > for dealing with page mappings instead of one. My primary critique of
> > > > HMM is that it creates a parallel universe of HMM apis rather than
> > > > evolving the existing core apis.
> > > 
> > > Just to make it clear here is pseudo code:
> > >     gup_range_dma_map() {...}
> > > 
> > >     hmm_range_dma_map() {
> > >         hmm_specific_prep_step();
> > >         gup_range_dma_map();
> > 
> > Does this GUP use FOLL_GET and then a put after the mmu_notifier is setup?
> 
> No it avoids incrementing page refcount all together and use mmu notifier
> synchronization to garantee that it is fine to do so. Hence we need a way
> to do GUP without incrementing the page refcount (ie no FOLL_GET but still
> returning page).

Isn't this follow_page?  I'll admit it may be broken, and I'll further admit
that fixing it may have unintended consequences on drivers using GUP, but some
of the code in this series looks a lot like the code there.

> 
> > 
> > >         hmm_specific_post_step();
> > >     }
> > > 
> > > Like i said HMM do have the synchronization with mmu notifier to take
> > > care of and other user of GUP and dma map pattern do not care about
> > > that. Hence why not everything can be share between device driver that
> > > can not do mmu notifier and other.
> > > 
> > > Is that not acceptable to you ? Should every driver duplicate the code
> > > HMM factorize ?
> > > 
> > 
> > In the final API you envision will drivers be able to call gup_range_dma_map()
> > _or_ hmm_range_dma_map()?
> > 
> > If so, at that time how will drivers know which to call and parameters control
> > those calls?
> 
> Device that can do invalidation at anytime and thus that can support
> mmu notifier will use HMM and thus the HMM version of it and they will
> always stick with the HMM version.
> 
> Device that can not do invalidation at anytime and thus require pin
> will use the GUP version and always the GUP version.
> 
> What the HMM version does is extra synchronization with mmu notifier
> to ensure that not incrementing page refcount is fine. You can think
> of HMM mirror as an helper than handle mmu notifier common device
> driver pattern.

ok sounds fair.

> 
> > 
> > > 
> > > > > So before converting HMM to use common GUP code under-neath those GUP
> > > > > shortcoming (from HMM POV) need to be addressed and at the same time
> > > > > the common dma map pattern can be added as an extra GUP helper.
> > > > 
> > > > If the HMM special cases are not being absorbed into the core-mm over
> > > > time then I think this is going in the wrong direction. Specifically a
> > > > direction that increases the long term maintenance burden over time as
> > > > HMM drivers stay needlessly separated.
> > > 
> > > HMM is core mm and other thing like GUP do not need to absord all of HMM
> > > as it would be forcing down on them mmu notifier and those other user can
> > > not leverage mmu notifier. So forcing down something that is useless on
> > > other is pointless, don't you agree ?
> > > 
> > > > 
> > > > > The issue is that some of the above changes need to be done carefully
> > > > > to not impact existing GUP users. So i rather clear some of my plate
> > > > > before starting chewing on this carefully.
> > > > 
> > > > I urge you to put this kind of consideration first and not "merge
> > > > first, ask hard questions later".
> > > 
> > > There is no hard question here. GUP does not handle THP optimization and
> > > other thing HMM and ODP has. Adding this to GUP need to be done carefully
> > > to not break existing GUP user. So i taking a small step approach since
> > > when that is a bad thing. First merge HMM and ODP together then push down
> > > common thing into GUP. It is a lot safer than a huge jump.
> > 
> > FWIW I think it is fine to have a new interface which allows new features
> > during a transition is a good thing.  But if that comes at the price of leaving
> > the old "deficient" interface sitting around that presents confusion for driver
> > writers and we get users calling GUP when perhaps they should be calling HMM.
> 
> This is not the intention here, i am converting device driver that can use
> HMM to HMM. Those device driver do not need GUP in the sense that they do
> not need the page refcount increment and this is the short path the HMM does
> provide today. Now i want to convert all device that can follow that to use
> HMM (i posted patchset for amdgpu, radeon, nouveau, i915 and odp rdma for
> that already).
> 
> Device driver that can not do mmu notifier will never use HMM and stick to
> the GUP/dma map pattern. But i want to share the same underlying code for
> both API latter on.

Great!  We agree on something!  :-D

> 
> So i do not see how it would confuse anyone. I am probably bad at expressing
> intent but HMM is not for all device driver it is only for device driver that
> would be able to do mmu notifier but instead of doing mmu notifier directly
> and duplicating common code they can use HMM which has all the common code
> they would need.

I guess I see HMM being bigger than that _eventually_.  I see it being a "one
stop shop" for devices to get pages from the system...  But I think what you
have limited it to is good for now.

Basic pseudocode:

hmm_get_pages()
	if (!mmu_capability)
		do_gup_stuff
	else
		do_hmm_stuff

	return pages;
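
A slightly more concrete sketch of that dispatch (only get_user_pages() is a
real interface here; the helper, the device structure and do_hmm_stuff() are
hypothetical placeholders):

    /* Hypothetical one-stop helper; only get_user_pages() is real. */
    long hmm_get_pages(struct my_device *dev, unsigned long start,
                       unsigned long npages, struct page **pages)
    {
        if (!dev->has_mmu_notifier)
            /* Device cannot handle invalidation: take a real pin
             * (FOLL_GET is implied by get_user_pages()). */
            return get_user_pages(start, npages, FOLL_WRITE, pages, NULL);

        /* Device can invalidate at any time: no pin, use the mmu
         * notifier / HMM mirror machinery instead. */
        return do_hmm_stuff(dev, start, npages, pages);
    }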

> 
> > 
> > I think having GPL exports helps to ensure we can later merge these to make it
> > clear to driver writers what the right thing to do is.
> 
> I am fine with GPL export but i stress agains this does not help in the GPU
> world we had tons of GPL driver that are not upstream. GPL was not the issue.
> So i fail to see how GPL helps device driver writer in anyway.

GPL exports ensure we can change the interfaces of HMM at will and have a good
chance of getting all the in-tree drivers fixed.  There are a couple of patches
in this series which change the interface of exported symbols.  I think this is
fine, but it shows we are not ready to export this interface to out-of-tree users.

> 
> > > 
> > > > 
> > > > > Also doing this patch first and then the GUP thing solve the first user
> > > > > problem you have been asking for. With that code in first the first user
> > > > > of the GUP convertion will be all the devices that use those two HMM
> > > > > functions. In turn the first user of that code is the ODP RDMA patch i
> > > > > already posted. Second will be nouveau once i tackle out some nouveau
> > > > > changes. I expect amdgpu to come close third as a user and other device
> > > > > driver who are working on HMM integration to come shortly after.
> > > > 
> > > > I appreciate that it has users, but the point of having users is so that
> > > > the code review can actually be fruitful to see if the infrastructure makes
> > > > sense, and in this case it seems to be duplicating an existing common
> > > > pattern in the kernel.
> > > 
> > > It is not duplicating anything i am removing code at the end if you include
> >            ^^^^^^^^^^^^^
> > 
> > The duplication is in how drivers indicate to the core that a set of pages is
> > being used by the hardware the driver is controlling, what the rules for those
> > pages are and how the use by that hardware is going to be coordinated with the
> > other hardware vying for those pages.  There are differences, true, but
> > fundamentally it would be nice for drivers to not have to care about the
> > details.
> > 
> > Maybe that is a dream we will never realize but if there are going to be
> > different ways for drivers to "get pages" then we need to make it clear what it
> > means when those pages come to the driver and how they can be used safely.
> 
> This is exactly what HMM mirror is. Device driver do not have to care about
> mm gory details or about mmu notifier subtilities, HMM provide an abstracted
> API easy to understand for device driver and takes care of the sublte details.

If the device supports MMU notification.  ;-)

> 
> Please read the HMM documentation and provide feedback if that is not clear.

FWIW I also want to be clear that having some common code to handle MMU
notification would be great.  I've had to fix mmu notifier code in the past
because it can be tricky.  So I'm not against HMM helping
out there.  But I also don't want to leave drivers which don't do MMU
notification with a broken GUP interface.

Ira

> 
> Cheers,
> Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 19:13                       ` Dan Williams
@ 2019-03-19 14:18                         ` Ira Weiny
  2019-03-19 22:24                           ` Jerome Glisse
  2019-03-19 19:18                         ` Jerome Glisse
  1 sibling, 1 reply; 98+ messages in thread
From: Ira Weiny @ 2019-03-19 14:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jerome Glisse, Andrew Morton, Linux MM,
	Linux Kernel Mailing List, Felix Kuehling, Christian König,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Alex Deucher

On Tue, Mar 19, 2019 at 12:13:40PM -0700, Dan Williams wrote:
> On Tue, Mar 19, 2019 at 12:05 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Mar 19, 2019 at 11:42:00AM -0700, Dan Williams wrote:
> > > On Tue, Mar 19, 2019 at 10:45 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Tue, Mar 19, 2019 at 10:33:57AM -0700, Dan Williams wrote:
> > > > > On Tue, Mar 19, 2019 at 10:19 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> > > > > > > On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

[snip]

> >
> > Right now i am trying to unify driver for device that have can support
> > the mmu notifier approach through HMM. Unify to a superset of driver
> > that can not abide by mmu notifier is on my todo list like i said but
> > it comes after. I do not want to make the big jump in just one go. So
> > i doing thing under HMM and thus in HMM namespace, but once i tackle
> > the larger set i will move to generic namespace what make sense.
> >
> > This exact approach did happen several time already in the kernel. In
> > the GPU sub-system we did it several time. First do something for couple
> > devices that are very similar then grow to a bigger set of devices and
> > generalise along the way.
> >
> > So i do not see what is the problem of me repeating that same pattern
> > here again. Do something for a smaller set before tackling it on for
> > a bigger set.
> 
> All of that is fine, but when I asked about the ultimate trajectory
> that replaces hmm_range_dma_map() with an updated / HMM-aware GUP
> implementation, the response was that hmm_range_dma_map() is here to
> stay. The issue is not with forking off a small side effort, it's the
> plan to absorb that capability into a common implementation across
> non-HMM drivers where possible.

Just to get on the record in this thread.

+1

I think having an interface which handles the MMU notifier stuff for drivers is
awesome but we need to agree that the trajectory is to help more drivers if
possible.

Ira


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-18 17:04     ` Jerome Glisse
  2019-03-18 18:30       ` Dan Williams
@ 2019-03-19 16:40       ` Andrew Morton
  2019-03-19 16:58         ` Jerome Glisse
  1 sibling, 1 reply; 98+ messages in thread
From: Andrew Morton @ 2019-03-19 16:40 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Felix Kuehling, Christian König,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Dan Williams,
	Alex Deucher

On Mon, 18 Mar 2019 13:04:04 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

> On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> > On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > 
> > > Andrew you will not be pushing this patchset in 5.1 ?
> > 
> > I'd like to.  It sounds like we're converging on a plan.
> > 
> > It would be good to hear more from the driver developers who will be
> > consuming these new features - links to patchsets, review feedback,
> > etc.  Which individuals should we be asking?  Felix, Christian and
> > Jason, perhaps?
> > 
> 
> So i am guessing you will not send this to Linus ?

I was waiting to see how the discussion proceeds.  Was also expecting
various changelog updates (at least) - more acks from driver
developers, additional pointers to client driver patchsets, description
of their readiness, etc.

Today I discovered that Alex has cherry-picked "mm/hmm: use reference
counting for HMM struct" into a tree which is fed into linux-next, which
rather messes things up from my end and makes it hard to feed a
(possibly modified) version of that into Linus.

So I think I'll throw up my hands, drop them all and shall await
developments :(


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 16:40       ` Andrew Morton
@ 2019-03-19 16:58         ` Jerome Glisse
  2019-03-19 17:12           ` Andrew Morton
  2019-03-19 18:51           ` Deucher, Alexander
  0 siblings, 2 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-03-19 16:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Felix Kuehling, Christian König,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Dan Williams,
	Alex Deucher

On Tue, Mar 19, 2019 at 09:40:07AM -0700, Andrew Morton wrote:
> On Mon, 18 Mar 2019 13:04:04 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> > > On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > 
> > > > Andrew you will not be pushing this patchset in 5.1 ?
> > > 
> > > I'd like to.  It sounds like we're converging on a plan.
> > > 
> > > It would be good to hear more from the driver developers who will be
> > > consuming these new features - links to patchsets, review feedback,
> > > etc.  Which individuals should we be asking?  Felix, Christian and
> > > Jason, perhaps?
> > > 
> > 
> > So i am guessing you will not send this to Linus ?
> 
> I was waiting to see how the discussion proceeds.  Was also expecting
> various changelog updates (at least) - more acks from driver
> developers, additional pointers to client driver patchsets, description
> of their readiness, etc.

nouveau will benefit from this patchset and is already upstream in 5.1,
so I am not sure what kind of pointer I can give for that; it is already
there. amdgpu will also benefit from it and is queued up AFAICT. ODP RDMA
is the third driver, and I gave a link to the patch that also uses the 2
new functions that this patchset introduces. Do you want more?

I guess I will repost with the updated acks, as Felix, Jason and a few others
told me they were fine with it.

> 
> Today I discover that Alex has cherrypicked "mm/hmm: use reference
> counting for HMM struct" into a tree which is fed into linux-next which
> rather messes things up from my end and makes it hard to feed a
> (possibly modified version of) that into Linus.

:( I did not know the tree they pulled that into was fed into linux-next. I will
discourage them from doing so going forward.

> So I think I'll throw up my hands, drop them all and shall await
> developments :(

What more do you want to see? I can repost with the acks already given
and improved commit wording on some of the patches. But from a user point
of view nouveau is already upstream, ODP RDMA depends on this patchset
and is posted (I have given a link to it), and amdgpu is queued up. What more
do I need?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device
  2019-03-19  8:44               ` Ira Weiny
@ 2019-03-19 17:10                 ` Jerome Glisse
  2019-03-19 14:10                   ` Ira Weiny
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-19 17:10 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, linux-mm, Linux Kernel Mailing List, Andrew Morton,
	Ralph Campbell, John Hubbard

On Tue, Mar 19, 2019 at 01:44:57AM -0700, Ira Weiny wrote:
> On Tue, Mar 19, 2019 at 09:30:05AM -0400, Jerome Glisse wrote:
> > On Mon, Mar 18, 2019 at 08:29:45PM -0700, Dan Williams wrote:
> > > On Mon, Mar 18, 2019 at 3:15 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Mon, Mar 18, 2019 at 02:30:15PM -0700, Dan Williams wrote:
> > > > > On Mon, Mar 18, 2019 at 1:41 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 18, 2019 at 01:21:00PM -0700, Dan Williams wrote:
> > > > > > > On Tue, Jan 29, 2019 at 8:55 AM <jglisse@redhat.com> wrote:
> > > > > > > >
> > > > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > > >
> > > > > > > > This is a all in one helper that fault pages in a range and map them to
> > > > > > > > a device so that every single device driver do not have to re-implement
> > > > > > > > this common pattern.
> > > > > > >
> > > > > > > Ok, correct me if I am wrong but these seem effectively be the typical
> > > > > > > "get_user_pages() + dma_map_page()" pattern that non-HMM drivers would
> > > > > > > follow. Could we just teach get_user_pages() to take an HMM shortcut
> > > > > > > based on the range?
> > > > > > >
> > > > > > > I'm interested in being able to share code across drivers and not have
> > > > > > > to worry about the HMM special case at the api level.
> > > > > > >
> > > > > > > And to be clear this isn't an anti-HMM critique this is a "yes, let's
> > > > > > > do this, but how about a more fundamental change".
> > > > > >
> > > > > > It is a yes and no, HMM have the synchronization with mmu notifier
> > > > > > which is not common to all device driver ie you have device driver
> > > > > > that do not synchronize with mmu notifier and use GUP. For instance
> > > > > > see the range->valid test in below code this is HMM specific and it
> > > > > > would not apply to GUP user.
> > > > > >
> > > > > > Nonetheless i want to remove more HMM code and grow GUP to do some
> > > > > > of this too so that HMM and non HMM driver can share the common part
> > > > > > (under GUP). But right now updating GUP is a too big endeavor.
> > > > >
> > > > > I'm open to that argument, but that statement then seems to indicate
> > > > > that these apis are indeed temporary. If the end game is common api
> > > > > between HMM and non-HMM drivers then I think these should at least
> > > > > come with /* TODO: */ comments about what might change in the future,
> > > > > and then should be EXPORT_SYMBOL_GPL since they're already planning to
> > > > > be deprecated. They are a point in time export for a work-in-progress
> > > > > interface.
> > > >
> > > > The API is not temporary it will stay the same ie the device driver
> > > > using HMM would not need further modification. Only the inner working
> > > > of HMM would be ported over to use improved common GUP. But GUP has
> > > > few shortcoming today that would be a regression for HMM:
> > > >     - huge page handling (ie dma mapping huge page not 4k chunk of
> > > >       huge page)
> 
> Agreed.
> 
> > > >     - not incrementing page refcount for HMM (other user like user-
> > > >       faultd also want a GUP without FOLL_GET because they abide by
> > > >       mmu notifier)
> > > >     - support for device memory without leaking it ie restrict such
> > > >       memory to caller that can handle it properly and are fully
> > > >       aware of the gotcha that comes with it
> > > >     ...
> > > 
> > > ...but this is backwards because the end state is 2 driver interfaces
> > > for dealing with page mappings instead of one. My primary critique of
> > > HMM is that it creates a parallel universe of HMM apis rather than
> > > evolving the existing core apis.
> > 
> > Just to make it clear here is pseudo code:
> >     gup_range_dma_map() {...}
> > 
> >     hmm_range_dma_map() {
> >         hmm_specific_prep_step();
> >         gup_range_dma_map();
> 
> Does this GUP use FOLL_GET and then a put after the mmu_notifier is setup?

No, it avoids incrementing the page refcount altogether and uses mmu notifier
synchronization to guarantee that it is fine to do so. Hence we need a way
to do GUP without incrementing the page refcount (ie no FOLL_GET but still
returning the page).
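
A rough sketch of the driver-side pattern this enables (all helper names and
the driver structure below are placeholders; the series provides the real
equivalents, e.g. hmm_range_snapshot()/hmm_range_fault() plus a range->valid
recheck, whose exact signatures are not reproduced here):

    /* Sketch of the "GUP without FOLL_GET" pattern, placeholder names only. */
    static int mirror_range(struct my_driver *drv, struct hmm_range *range)
    {
        int ret;

    again:
        /* Fault/snapshot the range; no page reference is taken. */
        ret = fault_range_without_pin(range);
        if (ret)
            return ret;

        /* The mmu notifier callback takes the same lock, so the validity
         * check below is meaningful while it is held. */
        mutex_lock(&drv->update_lock);
        if (!range_still_valid(range)) {
            /* An invalidation raced with us: redo the snapshot. */
            mutex_unlock(&drv->update_lock);
            goto again;
        }
        program_device_page_table(drv, range);
        mutex_unlock(&drv->update_lock);
        return 0;
    }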

> 
> >         hmm_specific_post_step();
> >     }
> > 
> > Like i said HMM do have the synchronization with mmu notifier to take
> > care of and other user of GUP and dma map pattern do not care about
> > that. Hence why not everything can be share between device driver that
> > can not do mmu notifier and other.
> > 
> > Is that not acceptable to you ? Should every driver duplicate the code
> > HMM factorize ?
> > 
> 
> In the final API you envision will drivers be able to call gup_range_dma_map()
> _or_ hmm_range_dma_map()?
> 
> If so, at that time how will drivers know which to call and parameters control
> those calls?

Devices that can do invalidation at any time, and thus can support
mmu notifier, will use HMM and thus the HMM version of it, and they will
always stick with the HMM version.

Devices that can not do invalidation at any time, and thus require a pin,
will use the GUP version and always the GUP version.

What the HMM version adds is extra synchronization with the mmu notifier
to ensure that not incrementing the page refcount is fine. You can think
of HMM mirror as a helper that handles the common mmu notifier device
driver pattern.

> 
> > 
> > > > So before converting HMM to use common GUP code under-neath those GUP
> > > > shortcoming (from HMM POV) need to be addressed and at the same time
> > > > the common dma map pattern can be added as an extra GUP helper.
> > > 
> > > If the HMM special cases are not being absorbed into the core-mm over
> > > time then I think this is going in the wrong direction. Specifically a
> > > direction that increases the long term maintenance burden over time as
> > > HMM drivers stay needlessly separated.
> > 
> > HMM is core mm and other thing like GUP do not need to absord all of HMM
> > as it would be forcing down on them mmu notifier and those other user can
> > not leverage mmu notifier. So forcing down something that is useless on
> > other is pointless, don't you agree ?
> > 
> > > 
> > > > The issue is that some of the above changes need to be done carefully
> > > > to not impact existing GUP users. So i rather clear some of my plate
> > > > before starting chewing on this carefully.
> > > 
> > > I urge you to put this kind of consideration first and not "merge
> > > first, ask hard questions later".
> > 
> > There is no hard question here. GUP does not handle THP optimization and
> > other thing HMM and ODP has. Adding this to GUP need to be done carefully
> > to not break existing GUP user. So i taking a small step approach since
> > when that is a bad thing. First merge HMM and ODP together then push down
> > common thing into GUP. It is a lot safer than a huge jump.
> 
> FWIW I think it is fine to have a new interface which allows new features
> during a transition is a good thing.  But if that comes at the price of leaving
> the old "deficient" interface sitting around that presents confusion for driver
> writers and we get users calling GUP when perhaps they should be calling HMM.

This is not the intention here; I am converting device drivers that can use
HMM to HMM. Those device drivers do not need GUP in the sense that they do
not need the page refcount increment, and this is the short path that HMM does
provide today. Now I want to convert all devices that can follow that to use
HMM (I posted patchsets for amdgpu, radeon, nouveau, i915 and ODP RDMA for
that already).

Device drivers that can not do mmu notifier will never use HMM and will stick
to the GUP/dma map pattern (sketched below). But I want to share the same
underlying code for both APIs later on.
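
For contrast, a minimal sketch of that classic pin-and-map pattern (error
handling trimmed; get_user_pages() and dma_map_page() are the real interfaces,
the wrapper function is illustrative and assumes the caller holds the mmap_sem
of current->mm):

    /* Illustrative pin-and-map path for a driver without mmu notifier. */
    static long pin_and_map(struct device *dev, unsigned long start,
                            unsigned long npages, struct page **pages,
                            dma_addr_t *daddrs)
    {
        long i, ret;

        /* Takes a reference on each page; released later with put_page(). */
        ret = get_user_pages(start, npages, FOLL_WRITE, pages, NULL);
        if (ret <= 0)
            return ret;

        for (i = 0; i < ret; i++)
            daddrs[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
                                     DMA_BIDIRECTIONAL);
        return ret;
    }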

So I do not see how it would confuse anyone. I am probably bad at expressing
intent, but HMM is not for all device drivers; it is only for device drivers
that would be able to do mmu notifier, but instead of doing mmu notifier
directly and duplicating common code they can use HMM, which has all the
common code they would need.

> 
> I think having GPL exports helps to ensure we can later merge these to make it
> clear to driver writers what the right thing to do is.

I am fine with GPL exports, but I stress again that this does not help: in the
GPU world we had tons of GPL drivers that are not upstream. GPL was not the
issue. So I fail to see how GPL helps device driver writers in any way.

> > 
> > > 
> > > > Also doing this patch first and then the GUP thing solve the first user
> > > > problem you have been asking for. With that code in first the first user
> > > > of the GUP convertion will be all the devices that use those two HMM
> > > > functions. In turn the first user of that code is the ODP RDMA patch i
> > > > already posted. Second will be nouveau once i tackle out some nouveau
> > > > changes. I expect amdgpu to come close third as a user and other device
> > > > driver who are working on HMM integration to come shortly after.
> > > 
> > > I appreciate that it has users, but the point of having users is so that
> > > the code review can actually be fruitful to see if the infrastructure makes
> > > sense, and in this case it seems to be duplicating an existing common
> > > pattern in the kernel.
> > 
> > It is not duplicating anything i am removing code at the end if you include
>            ^^^^^^^^^^^^^
> 
> The duplication is in how drivers indicate to the core that a set of pages is
> being used by the hardware the driver is controlling, what the rules for those
> pages are and how the use by that hardware is going to be coordinated with the
> other hardware vying for those pages.  There are differences, true, but
> fundamentally it would be nice for drivers to not have to care about the
> details.
> 
> Maybe that is a dream we will never realize but if there are going to be
> different ways for drivers to "get pages" then we need to make it clear what it
> means when those pages come to the driver and how they can be used safely.

This is exactly what HMM mirror is. Device drivers do not have to care about
the gory mm details or about mmu notifier subtleties; HMM provides an
abstracted API that is easy to understand for device drivers and takes care
of the subtle details.

Please read the HMM documentation and provide feedback if that is not clear.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 16:58         ` Jerome Glisse
@ 2019-03-19 17:12           ` Andrew Morton
  2019-03-19 17:18             ` Jerome Glisse
  2019-03-19 21:51             ` Stephen Rothwell
  2019-03-19 18:51           ` Deucher, Alexander
  1 sibling, 2 replies; 98+ messages in thread
From: Andrew Morton @ 2019-03-19 17:12 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Felix Kuehling, Christian König,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Dan Williams,
	Alex Deucher

On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

> > So I think I'll throw up my hands, drop them all and shall await
> > developments :(
> 
> What more do you want to see ? I can repost with the ack already given
> and the improve commit wording on some of the patch. But from user point
> of view nouveau is already upstream, ODP RDMA depends on this patchset
> and is posted and i have given link to it. amdgpu is queue up. What more
> do i need ?

I guess I can ignore linux-next for a few days.  

Yes, a resend against mainline with those various updates will be
helpful.  Please go through the various fixes which we had as well:


mm-hmm-use-reference-counting-for-hmm-struct.patch
mm-hmm-do-not-erase-snapshot-when-a-range-is-invalidated.patch
mm-hmm-improve-and-rename-hmm_vma_get_pfns-to-hmm_range_snapshot.patch
mm-hmm-improve-and-rename-hmm_vma_fault-to-hmm_range_fault.patch
mm-hmm-improve-driver-api-to-work-and-wait-over-a-range.patch
mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix.patch
mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix-fix.patch
mm-hmm-add-default-fault-flags-to-avoid-the-need-to-pre-fill-pfns-arrays.patch
mm-hmm-add-an-helper-function-that-fault-pages-and-map-them-to-a-device.patch
mm-hmm-support-hugetlbfs-snap-shoting-faulting-and-dma-mapping.patch
mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem.patch
mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix.patch
mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix-2.patch
mm-hmm-add-helpers-for-driver-to-safely-take-the-mmap_sem.patch

Also, the discussion regarding [07/10] is substantial and is ongoing so
please let's push along with that.

What is the review/discussion status of "[PATCH 09/10] mm/hmm: allow to
mirror vma of a file on a DAX backed filesystem"?


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 17:12           ` Andrew Morton
@ 2019-03-19 17:18             ` Jerome Glisse
  2019-03-19 17:33               ` Dan Williams
  2019-03-19 21:51             ` Stephen Rothwell
  1 sibling, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-19 17:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Felix Kuehling, Christian König,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Dan Williams,
	Alex Deucher

On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > > So I think I'll throw up my hands, drop them all and shall await
> > > developments :(
> > 
> > What more do you want to see ? I can repost with the ack already given
> > and the improve commit wording on some of the patch. But from user point
> > of view nouveau is already upstream, ODP RDMA depends on this patchset
> > and is posted and i have given link to it. amdgpu is queue up. What more
> > do i need ?
> 
> I guess I can ignore linux-next for a few days.  
> 
> Yes, a resend against mainline with those various updates will be
> helpful.  Please go through the various fixes which we had as well:

Yes, I will not forget them, and I will try to get more config builds
to be sure there is no issue. I need to register a tree with the
rand-config builder, but I lack a place where I can host an https tree
(I believe this is a requirement).

> 
> mm-hmm-use-reference-counting-for-hmm-struct.patch
> mm-hmm-do-not-erase-snapshot-when-a-range-is-invalidated.patch
> mm-hmm-improve-and-rename-hmm_vma_get_pfns-to-hmm_range_snapshot.patch
> mm-hmm-improve-and-rename-hmm_vma_fault-to-hmm_range_fault.patch
> mm-hmm-improve-driver-api-to-work-and-wait-over-a-range.patch
> mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix.patch
> mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix-fix.patch
> mm-hmm-add-default-fault-flags-to-avoid-the-need-to-pre-fill-pfns-arrays.patch
> mm-hmm-add-an-helper-function-that-fault-pages-and-map-them-to-a-device.patch
> mm-hmm-support-hugetlbfs-snap-shoting-faulting-and-dma-mapping.patch
> mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem.patch
> mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix.patch
> mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix-2.patch
> mm-hmm-add-helpers-for-driver-to-safely-take-the-mmap_sem.patch
> 
> Also, the discussion regarding [07/10] is substantial and is ongoing so
> please let's push along wth that.

I can move it to be the last patch in the series, but it is needed for the
ODP RDMA conversion too. Otherwise I will just move that code into the ODP RDMA
code and will have to move it again into HMM code once I am done with
the nouveau changes, and in the meantime I expect other drivers will want
to use these 2 helpers too.

> 
> What is the review/discussion status of "[PATCH 09/10] mm/hmm: allow to
> mirror vma of a file on a DAX backed filesystem"?

I explained that this is needed for the ODP RDMA conversion, as ODP RDMA
does support DAX today, and thus I can not push that conversion without
that support, as otherwise I would regress ODP RDMA.

Also this is to be used by nouveau, which is upstream, and there is no
reason not to support a vma that happens to be a mmap of a file on a file-
system that is using a DAX block device.

I do not think Dan had any comment code wise; I think he was complaining
about the wording of the commit not being clear, and I proposed an updated
wording that he seemed to like.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 17:18             ` Jerome Glisse
@ 2019-03-19 17:33               ` Dan Williams
  2019-03-19 17:45                 ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-19 17:33 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Alex Deucher

On Tue, Mar 19, 2019 at 10:19 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> > On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
[..]
> > Also, the discussion regarding [07/10] is substantial and is ongoing so
> > please let's push along wth that.
>
> I can move it as last patch in the serie but it is needed for ODP RDMA
> convertion too. Otherwise i will just move that code into the ODP RDMA
> code and will have to move it again into HMM code once i am done with
> the nouveau changes and in the meantime i expect other driver will want
> to use this 2 helpers too.

I still hold out hope that we can find a way to have productive
discussions about the implementation of this infrastructure.
Threatening to move the code elsewhere to bypass the feedback is not
productive.

>
> >
> > What is the review/discussion status of "[PATCH 09/10] mm/hmm: allow to
> > mirror vma of a file on a DAX backed filesystem"?
>
> I explained that this is needed for the ODP RDMA convertion as ODP RDMA
> does supported DAX today and thus i can not push that convertion without
> that support as otherwise i would regress RDMA ODP.
>
> Also this is to be use by nouveau which is upstream and there is no
> reasons to not support vma that happens to be mmap of a file on a file-
> system that is using a DAX block device.
>
> I do not think Dan had any comment code wise, i think he was complaining
> about the wording of the commit not being clear and i proposed an updated
> wording that he seemed to like.

Yes, please resend with the updated changelog and I'll ack.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 17:33               ` Dan Williams
@ 2019-03-19 17:45                 ` Jerome Glisse
  2019-03-19 18:42                   ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-19 17:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Alex Deucher

On Tue, Mar 19, 2019 at 10:33:57AM -0700, Dan Williams wrote:
> On Tue, Mar 19, 2019 at 10:19 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> > > On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> [..]
> > > Also, the discussion regarding [07/10] is substantial and is ongoing so
> > > please let's push along wth that.
> >
> > I can move it as last patch in the serie but it is needed for ODP RDMA
> > convertion too. Otherwise i will just move that code into the ODP RDMA
> > code and will have to move it again into HMM code once i am done with
> > the nouveau changes and in the meantime i expect other driver will want
> > to use this 2 helpers too.
> 
> I still hold out hope that we can find a way to have productive
> discussions about the implementation of this infrastructure.
> Threatening to move the code elsewhere to bypass the feedback is not
> productive.

I am not threatening anything; that code is in ODP _today_, and with that
patchset I was factoring it out so that I could also use it in nouveau.
nouveau is built in such a way that right now I can not use it directly.
But I wanted to factor it out now in the hope that I can get the nouveau
changes in 5.2 and then convert nouveau in 5.3.

So when I said that the code will be in ODP, it just means that instead of
removing it from ODP I will keep it there, and that will just delay more
code sharing for everyone.


> 
> >
> > >
> > > What is the review/discussion status of "[PATCH 09/10] mm/hmm: allow to
> > > mirror vma of a file on a DAX backed filesystem"?
> >
> > I explained that this is needed for the ODP RDMA convertion as ODP RDMA
> > does supported DAX today and thus i can not push that convertion without
> > that support as otherwise i would regress RDMA ODP.
> >
> > Also this is to be use by nouveau which is upstream and there is no
> > reasons to not support vma that happens to be mmap of a file on a file-
> > system that is using a DAX block device.
> >
> > I do not think Dan had any comment code wise, i think he was complaining
> > about the wording of the commit not being clear and i proposed an updated
> > wording that he seemed to like.
> 
> Yes, please resend with the updated changelog and I'll ack.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 17:45                 ` Jerome Glisse
@ 2019-03-19 18:42                   ` Dan Williams
  2019-03-19 19:05                     ` Jerome Glisse
  0 siblings, 1 reply; 98+ messages in thread
From: Dan Williams @ 2019-03-19 18:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Alex Deucher

On Tue, Mar 19, 2019 at 10:45 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Mar 19, 2019 at 10:33:57AM -0700, Dan Williams wrote:
> > On Tue, Mar 19, 2019 at 10:19 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> > > > On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > [..]
> > > > Also, the discussion regarding [07/10] is substantial and is ongoing so
> > > > please let's push along wth that.
> > >
> > > I can move it as last patch in the serie but it is needed for ODP RDMA
> > > convertion too. Otherwise i will just move that code into the ODP RDMA
> > > code and will have to move it again into HMM code once i am done with
> > > the nouveau changes and in the meantime i expect other driver will want
> > > to use this 2 helpers too.
> >
> > I still hold out hope that we can find a way to have productive
> > discussions about the implementation of this infrastructure.
> > Threatening to move the code elsewhere to bypass the feedback is not
> > productive.
>
> I am not threatening anything that code is in ODP _today_ with that
> patchset i was factering it out so that i could also use it in nouveau.
> nouveau is built in such way that right now i can not use it directly.
> But i wanted to factor out now in hope that i can get the nouveau
> changes in 5.2 and then convert nouveau in 5.3.
>
> So when i said that code will be in ODP it just means that instead of
> removing it from ODP i will keep it there and it will just delay more
> code sharing for everyone.

The point I'm trying to make is that code sharing for everyone means
moving the implementation closer to canonical kernel code and using
existing infrastructure. For example, I look at 'struct hmm_range' and
see nothing hmm-specific in it. I think we can make that generic and
not build up more apis and data structures in the "hmm" namespace.
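
For reference, the range structure being discussed looks roughly like the
following in this series (a from-memory paraphrase, not a verbatim copy of
include/linux/hmm.h; field names and layout are approximate):

    /* Approximate shape of struct hmm_range in this series, not verbatim. */
    struct hmm_range {
        struct vm_area_struct   *vma;       /* vma the range covers */
        struct list_head         list;      /* tracked by the hmm/mirror state */
        unsigned long            start;     /* range start, page aligned */
        unsigned long            end;       /* range end, page aligned */
        uint64_t                *pfns;      /* per-page result array */
        const uint64_t          *flags;     /* driver encoding of pfn flags */
        const uint64_t          *values;    /* driver encoding of special values */
        uint8_t                  pfn_shift; /* shift used to encode pfns */
        bool                     valid;     /* cleared by an invalidation */
    };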

^ permalink raw reply	[flat|nested] 98+ messages in thread

* RE: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 16:58         ` Jerome Glisse
  2019-03-19 17:12           ` Andrew Morton
@ 2019-03-19 18:51           ` Deucher, Alexander
  1 sibling, 0 replies; 98+ messages in thread
From: Deucher, Alexander @ 2019-03-19 18:51 UTC (permalink / raw)
  To: Jerome Glisse, Andrew Morton
  Cc: linux-mm, linux-kernel, Kuehling, Felix, Koenig, Christian,
	Ralph Campbell, John Hubbard, Jason Gunthorpe, Dan Williams

> -----Original Message-----
> From: Jerome Glisse <jglisse@redhat.com>
> Sent: Tuesday, March 19, 2019 12:58 PM
> To: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Kuehling, Felix
> <Felix.Kuehling@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; Ralph Campbell <rcampbell@nvidia.com>;
> John Hubbard <jhubbard@nvidia.com>; Jason Gunthorpe
> <jgg@mellanox.com>; Dan Williams <dan.j.williams@intel.com>; Deucher,
> Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH 00/10] HMM updates for 5.1
> 
> On Tue, Mar 19, 2019 at 09:40:07AM -0700, Andrew Morton wrote:
> > On Mon, 18 Mar 2019 13:04:04 -0400 Jerome Glisse <jglisse@redhat.com>
> wrote:
> >
> > > On Wed, Mar 13, 2019 at 09:10:04AM -0700, Andrew Morton wrote:
> > > > On Tue, 12 Mar 2019 21:27:06 -0400 Jerome Glisse
> <jglisse@redhat.com> wrote:
> > > >
> > > > > Andrew you will not be pushing this patchset in 5.1 ?
> > > >
> > > > I'd like to.  It sounds like we're converging on a plan.
> > > >
> > > > It would be good to hear more from the driver developers who will
> > > > be consuming these new features - links to patchsets, review
> > > > feedback, etc.  Which individuals should we be asking?  Felix,
> > > > Christian and Jason, perhaps?
> > > >
> > >
> > > So i am guessing you will not send this to Linus ?
> >
> > I was waiting to see how the discussion proceeds.  Was also expecting
> > various changelog updates (at least) - more acks from driver
> > developers, additional pointers to client driver patchsets,
> > description of their readiness, etc.
> 
> nouveau will benefit from this patchset and is already upstream in 5.1 so i am
> not sure what kind of pointer i can give for that, it is already there. amdgpu
> will also benefit from it and is queue up AFAICT. ODP RDMA is the third driver
> and i gave link to the patch that also use the 2 new functions that this
> patchset introduce. Do you want more ?
> 
> I guess i will repost with updated ack as Felix, Jason and few others told me
> they were fine with it.
> 
> >
> > Today I discover that Alex has cherrypicked "mm/hmm: use reference
> > counting for HMM struct" into a tree which is fed into linux-next
> > which rather messes things up from my end and makes it hard to feed a
> > (possibly modified version of) that into Linus.
> 
> :( i did not know the tree they pull that in was fed into next. I will discourage
> them from doing so going forward.
> 

I can drop it.  I included it because it fixes an issue with HMM as used by amdgpu in our current -next tree, so users testing my drm-next branch will run into the issue without it.  I don't plan to include it in the actual -next PR.  What is the recommended way to deal with this?

Alex

> > So I think I'll throw up my hands, drop them all and shall await
> > developments :(
> 
> What more do you want to see ? I can repost with the ack already given and
> the improve commit wording on some of the patch. But from user point of
> view nouveau is already upstream, ODP RDMA depends on this patchset and
> is posted and i have given link to it. amdgpu is queue up. What more do i
> need ?
> 
> Cheers,
> Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 18:42                   ` Dan Williams
@ 2019-03-19 19:05                     ` Jerome Glisse
  2019-03-19 19:13                       ` Dan Williams
  0 siblings, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-19 19:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Alex Deucher

On Tue, Mar 19, 2019 at 11:42:00AM -0700, Dan Williams wrote:
> On Tue, Mar 19, 2019 at 10:45 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Mar 19, 2019 at 10:33:57AM -0700, Dan Williams wrote:
> > > On Tue, Mar 19, 2019 at 10:19 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> > > > > On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > [..]
> > > > > Also, the discussion regarding [07/10] is substantial and is ongoing so
> > > > > please let's push along wth that.
> > > >
> > > > I can move it as last patch in the serie but it is needed for ODP RDMA
> > > > convertion too. Otherwise i will just move that code into the ODP RDMA
> > > > code and will have to move it again into HMM code once i am done with
> > > > the nouveau changes and in the meantime i expect other driver will want
> > > > to use this 2 helpers too.
> > >
> > > I still hold out hope that we can find a way to have productive
> > > discussions about the implementation of this infrastructure.
> > > Threatening to move the code elsewhere to bypass the feedback is not
> > > productive.
> >
> > I am not threatening anything that code is in ODP _today_ with that
> > patchset i was factering it out so that i could also use it in nouveau.
> > nouveau is built in such way that right now i can not use it directly.
> > But i wanted to factor out now in hope that i can get the nouveau
> > changes in 5.2 and then convert nouveau in 5.3.
> >
> > So when i said that code will be in ODP it just means that instead of
> > removing it from ODP i will keep it there and it will just delay more
> > code sharing for everyone.
> 
> The point I'm trying to make is that the code sharing for everyone is
> moving the implementation closer to canonical kernel code and use
> existing infrastructure. For example, I look at 'struct hmm_range' and
> see nothing hmm specific in it. I think we can make that generic and
> not build up more apis and data structures in the "hmm" namespace.

Right now I am trying to unify the drivers for devices that can support
the mmu notifier approach through HMM. Unifying with the superset of
drivers that can not abide by mmu notifiers is on my todo list like I
said, but it comes after. I do not want to make the big jump in just one
go. So I am doing things under HMM and thus in the HMM namespace, but
once I tackle the larger set I will move what makes sense to a generic
namespace.

This exact approach has happened several times already in the kernel. In
the GPU sub-system we did it several times: first do something for a
couple of devices that are very similar, then grow to a bigger set of
devices and generalise along the way.

So I do not see what the problem is with me repeating that same pattern
here again: do something for a smaller set before tackling it for a
bigger set.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 19:05                     ` Jerome Glisse
@ 2019-03-19 19:13                       ` Dan Williams
  2019-03-19 14:18                         ` Ira Weiny
  2019-03-19 19:18                         ` Jerome Glisse
  0 siblings, 2 replies; 98+ messages in thread
From: Dan Williams @ 2019-03-19 19:13 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Alex Deucher

On Tue, Mar 19, 2019 at 12:05 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Mar 19, 2019 at 11:42:00AM -0700, Dan Williams wrote:
> > On Tue, Mar 19, 2019 at 10:45 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Mar 19, 2019 at 10:33:57AM -0700, Dan Williams wrote:
> > > > On Tue, Mar 19, 2019 at 10:19 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > >
> > > > > On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> > > > > > On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > [..]
> > > > > > Also, the discussion regarding [07/10] is substantial and is ongoing so
> > > > > > please let's push along wth that.
> > > > >
> > > > > I can move it as last patch in the serie but it is needed for ODP RDMA
> > > > > convertion too. Otherwise i will just move that code into the ODP RDMA
> > > > > code and will have to move it again into HMM code once i am done with
> > > > > the nouveau changes and in the meantime i expect other driver will want
> > > > > to use this 2 helpers too.
> > > >
> > > > I still hold out hope that we can find a way to have productive
> > > > discussions about the implementation of this infrastructure.
> > > > Threatening to move the code elsewhere to bypass the feedback is not
> > > > productive.
> > >
> > > I am not threatening anything that code is in ODP _today_ with that
> > > patchset i was factering it out so that i could also use it in nouveau.
> > > nouveau is built in such way that right now i can not use it directly.
> > > But i wanted to factor out now in hope that i can get the nouveau
> > > changes in 5.2 and then convert nouveau in 5.3.
> > >
> > > So when i said that code will be in ODP it just means that instead of
> > > removing it from ODP i will keep it there and it will just delay more
> > > code sharing for everyone.
> >
> > The point I'm trying to make is that the code sharing for everyone is
> > moving the implementation closer to canonical kernel code and use
> > existing infrastructure. For example, I look at 'struct hmm_range' and
> > see nothing hmm specific in it. I think we can make that generic and
> > not build up more apis and data structures in the "hmm" namespace.
>
> Right now i am trying to unify driver for device that have can support
> the mmu notifier approach through HMM. Unify to a superset of driver
> that can not abide by mmu notifier is on my todo list like i said but
> it comes after. I do not want to make the big jump in just one go. So
> i doing thing under HMM and thus in HMM namespace, but once i tackle
> the larger set i will move to generic namespace what make sense.
>
> This exact approach did happen several time already in the kernel. In
> the GPU sub-system we did it several time. First do something for couple
> devices that are very similar then grow to a bigger set of devices and
> generalise along the way.
>
> So i do not see what is the problem of me repeating that same pattern
> here again. Do something for a smaller set before tackling it on for
> a bigger set.

All of that is fine, but when I asked about the ultimate trajectory
that replaces hmm_range_dma_map() with an updated / HMM-aware GUP
implementation, the response was that hmm_range_dma_map() is here to
stay. The issue is not with forking off a small side effort, it's the
plan to absorb that capability into a common implementation across
non-HMM drivers where possible.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 19:13                       ` Dan Williams
  2019-03-19 14:18                         ` Ira Weiny
@ 2019-03-19 19:18                         ` Jerome Glisse
  2019-03-19 20:25                           ` Jerome Glisse
  1 sibling, 1 reply; 98+ messages in thread
From: Jerome Glisse @ 2019-03-19 19:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Alex Deucher

On Tue, Mar 19, 2019 at 12:13:40PM -0700, Dan Williams wrote:
> On Tue, Mar 19, 2019 at 12:05 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Mar 19, 2019 at 11:42:00AM -0700, Dan Williams wrote:
> > > On Tue, Mar 19, 2019 at 10:45 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Tue, Mar 19, 2019 at 10:33:57AM -0700, Dan Williams wrote:
> > > > > On Tue, Mar 19, 2019 at 10:19 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> > > > > > > On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > [..]
> > > > > > > Also, the discussion regarding [07/10] is substantial and is ongoing so
> > > > > > > please let's push along wth that.
> > > > > >
> > > > > > I can move it as last patch in the serie but it is needed for ODP RDMA
> > > > > > convertion too. Otherwise i will just move that code into the ODP RDMA
> > > > > > code and will have to move it again into HMM code once i am done with
> > > > > > the nouveau changes and in the meantime i expect other driver will want
> > > > > > to use this 2 helpers too.
> > > > >
> > > > > I still hold out hope that we can find a way to have productive
> > > > > discussions about the implementation of this infrastructure.
> > > > > Threatening to move the code elsewhere to bypass the feedback is not
> > > > > productive.
> > > >
> > > > I am not threatening anything that code is in ODP _today_ with that
> > > > patchset i was factering it out so that i could also use it in nouveau.
> > > > nouveau is built in such way that right now i can not use it directly.
> > > > But i wanted to factor out now in hope that i can get the nouveau
> > > > changes in 5.2 and then convert nouveau in 5.3.
> > > >
> > > > So when i said that code will be in ODP it just means that instead of
> > > > removing it from ODP i will keep it there and it will just delay more
> > > > code sharing for everyone.
> > >
> > > The point I'm trying to make is that the code sharing for everyone is
> > > moving the implementation closer to canonical kernel code and use
> > > existing infrastructure. For example, I look at 'struct hmm_range' and
> > > see nothing hmm specific in it. I think we can make that generic and
> > > not build up more apis and data structures in the "hmm" namespace.
> >
> > Right now i am trying to unify driver for device that have can support
> > the mmu notifier approach through HMM. Unify to a superset of driver
> > that can not abide by mmu notifier is on my todo list like i said but
> > it comes after. I do not want to make the big jump in just one go. So
> > i doing thing under HMM and thus in HMM namespace, but once i tackle
> > the larger set i will move to generic namespace what make sense.
> >
> > This exact approach did happen several time already in the kernel. In
> > the GPU sub-system we did it several time. First do something for couple
> > devices that are very similar then grow to a bigger set of devices and
> > generalise along the way.
> >
> > So i do not see what is the problem of me repeating that same pattern
> > here again. Do something for a smaller set before tackling it on for
> > a bigger set.
> 
> All of that is fine, but when I asked about the ultimate trajectory
> that replaces hmm_range_dma_map() with an updated / HMM-aware GUP
> implementation, the response was that hmm_range_dma_map() is here to
> stay. The issue is not with forking off a small side effort, it's the
> plan to absorb that capability into a common implementation across
> non-HMM drivers where possible.

hmm_range_dma_map() is a superset of gup_range_dma_map() because on
top of gup_range_dma_map() the hmm version deals with mmu notifiers.

But everything that is not mmu notifier related can be shared through
gup_range_dma_map(), so the plan is to end up with:
    hmm_range_dma_map(hmm_struct) {
        hmm_mmu_notifier_specific_prep_step();
        gup_range_dma_map(hmm_struct->common_base_struct);
        hmm_mmu_notifier_specific_post_step();
    }

i.e. share as much as possible. Does that not make sense? To get
there I will need to make non-trivial additions to GUP, so I went
first to get the HMM bits working and will then work on a common GUP API.
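
To make that split a little more concrete, here is a minimal sketch of
what the wrapper could look like. This is illustrative only:
gup_range_dma_map(), hmm_range_track(), hmm_range_untrack() and
hmm_range_still_valid() are hypothetical names that do not exist today,
and error handling is stripped to the bare minimum:

    long hmm_range_dma_map(struct hmm_range *range, struct device *dev,
                           dma_addr_t *daddrs, bool write)
    {
        long ret;

        /* HMM specific prep step: register the range so a concurrent
         * mmu notifier invalidation can mark it as no longer valid.
         */
        hmm_range_track(range);

        /* Common part: walk the CPU page table, look up the pages and
         * DMA map them, exactly what a non mmu notifier driver needs.
         */
        ret = gup_range_dma_map(range->vma, range->start, range->end,
                                range->pfns, daddrs, dev, write);

        /* HMM specific post step: if an invalidation raced with the
         * walk, the result can not be trusted and the caller retries.
         */
        if (ret >= 0 && !hmm_range_still_valid(range))
            ret = -EAGAIN;

        hmm_range_untrack(range);
        return ret;
    }

A driver that can not use mmu notifiers would call gup_range_dma_map()
directly and skip the tracking steps.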

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 19:18                         ` Jerome Glisse
@ 2019-03-19 20:25                           ` Jerome Glisse
  0 siblings, 0 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-03-19 20:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Alex Deucher

On Tue, Mar 19, 2019 at 03:18:49PM -0400, Jerome Glisse wrote:
> On Tue, Mar 19, 2019 at 12:13:40PM -0700, Dan Williams wrote:
> > On Tue, Mar 19, 2019 at 12:05 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Mar 19, 2019 at 11:42:00AM -0700, Dan Williams wrote:
> > > > On Tue, Mar 19, 2019 at 10:45 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > >
> > > > > On Tue, Mar 19, 2019 at 10:33:57AM -0700, Dan Williams wrote:
> > > > > > On Tue, Mar 19, 2019 at 10:19 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> > > > > > > > On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > [..]
> > > > > > > > Also, the discussion regarding [07/10] is substantial and is ongoing so
> > > > > > > > please let's push along wth that.
> > > > > > >
> > > > > > > I can move it as last patch in the serie but it is needed for ODP RDMA
> > > > > > > convertion too. Otherwise i will just move that code into the ODP RDMA
> > > > > > > code and will have to move it again into HMM code once i am done with
> > > > > > > the nouveau changes and in the meantime i expect other driver will want
> > > > > > > to use this 2 helpers too.
> > > > > >
> > > > > > I still hold out hope that we can find a way to have productive
> > > > > > discussions about the implementation of this infrastructure.
> > > > > > Threatening to move the code elsewhere to bypass the feedback is not
> > > > > > productive.
> > > > >
> > > > > I am not threatening anything that code is in ODP _today_ with that
> > > > > patchset i was factering it out so that i could also use it in nouveau.
> > > > > nouveau is built in such way that right now i can not use it directly.
> > > > > But i wanted to factor out now in hope that i can get the nouveau
> > > > > changes in 5.2 and then convert nouveau in 5.3.
> > > > >
> > > > > So when i said that code will be in ODP it just means that instead of
> > > > > removing it from ODP i will keep it there and it will just delay more
> > > > > code sharing for everyone.
> > > >
> > > > The point I'm trying to make is that the code sharing for everyone is
> > > > moving the implementation closer to canonical kernel code and use
> > > > existing infrastructure. For example, I look at 'struct hmm_range' and
> > > > see nothing hmm specific in it. I think we can make that generic and
> > > > not build up more apis and data structures in the "hmm" namespace.
> > >
> > > Right now i am trying to unify driver for device that have can support
> > > the mmu notifier approach through HMM. Unify to a superset of driver
> > > that can not abide by mmu notifier is on my todo list like i said but
> > > it comes after. I do not want to make the big jump in just one go. So
> > > i doing thing under HMM and thus in HMM namespace, but once i tackle
> > > the larger set i will move to generic namespace what make sense.
> > >
> > > This exact approach did happen several time already in the kernel. In
> > > the GPU sub-system we did it several time. First do something for couple
> > > devices that are very similar then grow to a bigger set of devices and
> > > generalise along the way.
> > >
> > > So i do not see what is the problem of me repeating that same pattern
> > > here again. Do something for a smaller set before tackling it on for
> > > a bigger set.
> > 
> > All of that is fine, but when I asked about the ultimate trajectory
> > that replaces hmm_range_dma_map() with an updated / HMM-aware GUP
> > implementation, the response was that hmm_range_dma_map() is here to
> > stay. The issue is not with forking off a small side effort, it's the
> > plan to absorb that capability into a common implementation across
> > non-HMM drivers where possible.
> 
> hmm_range_dma_map() is a superset of gup_range_dma_map() because on
> top of gup_range_dma_map() the hmm version deals with mmu notifier.
> 
> But everything that is not mmu notifier related can be share through
> gup_range_dma_map() so plan is to end up with:
>     hmm_range_dma_map(hmm_struct) {
>         hmm_mmu_notifier_specific_prep_step();
>         gup_range_dma_map(hmm_struct->common_base_struct);
>         hmm_mmu_notifier_specific_post_step();
>     }
> 
> ie share as much as possible. Does that not make sense ? To get
> there i will need to do non trivial addition to GUP and so i went
> first to get HMM bits working and then work on common gup API.
> 

And more on the hmm_range struct:

struct hmm_range {
    struct vm_area_struct *vma;       // Common
    struct list_head      list;       // HMM specific: this is only useful
                                      // to track the valid range if a mmu
                                      // notifier fires while we look up
                                      // the CPU page table
    unsigned long         start;      // Common
    unsigned long         end;        // Common
    uint64_t              *pfns;      // Common
    const uint64_t        *flags;     // Some flags would be HMM specific
    const uint64_t        *values;    // HMM specific
    uint8_t               pfn_shift;  // Common
    bool                  valid;      // HMM specific
};

So it is not all common; there are things that just do not make sense
outside an HMM capable driver.
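
For what it is worth, one possible way to carve out the common part,
purely as a sketch: struct range_snapshot is a made-up name, and where
exactly the flags end up is an open question.

    /* Hypothetical common part, usable outside of HMM as well. */
    struct range_snapshot {
        struct vm_area_struct *vma;
        unsigned long          start;
        unsigned long          end;
        uint64_t              *pfns;
        uint8_t                pfn_shift;
    };

    /* HMM keeps only what ties a range to mmu notifier tracking. */
    struct hmm_range {
        struct range_snapshot  snap;    /* common */
        struct list_head       list;    /* tracked for invalidation */
        const uint64_t        *flags;   /* some flags are HMM specific */
        const uint64_t        *values;  /* HMM specific */
        bool                   valid;   /* cleared by mmu notifier */
    };

A driver that does not use mmu notifiers would then only ever see
struct range_snapshot.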

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 17:12           ` Andrew Morton
  2019-03-19 17:18             ` Jerome Glisse
@ 2019-03-19 21:51             ` Stephen Rothwell
  1 sibling, 0 replies; 98+ messages in thread
From: Stephen Rothwell @ 2019-03-19 21:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jerome Glisse, linux-mm, linux-kernel, Felix Kuehling,
	Christian König, Ralph Campbell, John Hubbard,
	Jason Gunthorpe, Dan Williams, Alex Deucher

Hi Andrew,

On Tue, 19 Mar 2019 10:12:49 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Yes, a resend against mainline with those various updates will be
> helpful.  Please go through the various fixes which we had as well:
> 
> mm-hmm-use-reference-counting-for-hmm-struct.patch
> mm-hmm-do-not-erase-snapshot-when-a-range-is-invalidated.patch
> mm-hmm-improve-and-rename-hmm_vma_get_pfns-to-hmm_range_snapshot.patch
> mm-hmm-improve-and-rename-hmm_vma_fault-to-hmm_range_fault.patch
> mm-hmm-improve-driver-api-to-work-and-wait-over-a-range.patch
> mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix.patch
> mm-hmm-improve-driver-api-to-work-and-wait-over-a-range-fix-fix.patch
> mm-hmm-add-default-fault-flags-to-avoid-the-need-to-pre-fill-pfns-arrays.patch
> mm-hmm-add-an-helper-function-that-fault-pages-and-map-them-to-a-device.patch
> mm-hmm-support-hugetlbfs-snap-shoting-faulting-and-dma-mapping.patch
> mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem.patch
> mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix.patch
> mm-hmm-allow-to-mirror-vma-of-a-file-on-a-dax-backed-filesystem-fix-2.patch
> mm-hmm-add-helpers-for-driver-to-safely-take-the-mmap_sem.patch

I have dropped all those from linux-next today (along with
mm-hmm-fix-unused-variable-warnings.patch).

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH 00/10] HMM updates for 5.1
  2019-03-19 14:18                         ` Ira Weiny
@ 2019-03-19 22:24                           ` Jerome Glisse
  0 siblings, 0 replies; 98+ messages in thread
From: Jerome Glisse @ 2019-03-19 22:24 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Andrew Morton, Linux MM, Linux Kernel Mailing List,
	Felix Kuehling, Christian König, Ralph Campbell,
	John Hubbard, Jason Gunthorpe, Alex Deucher

On Tue, Mar 19, 2019 at 07:18:26AM -0700, Ira Weiny wrote:
> On Tue, Mar 19, 2019 at 12:13:40PM -0700, Dan Williams wrote:
> > On Tue, Mar 19, 2019 at 12:05 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Mar 19, 2019 at 11:42:00AM -0700, Dan Williams wrote:
> > > > On Tue, Mar 19, 2019 at 10:45 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > >
> > > > > On Tue, Mar 19, 2019 at 10:33:57AM -0700, Dan Williams wrote:
> > > > > > On Tue, Mar 19, 2019 at 10:19 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Mar 19, 2019 at 10:12:49AM -0700, Andrew Morton wrote:
> > > > > > > > On Tue, 19 Mar 2019 12:58:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> 
> [snip]
> 
> > >
> > > Right now i am trying to unify driver for device that have can support
> > > the mmu notifier approach through HMM. Unify to a superset of driver
> > > that can not abide by mmu notifier is on my todo list like i said but
> > > it comes after. I do not want to make the big jump in just one go. So
> > > i doing thing under HMM and thus in HMM namespace, but once i tackle
> > > the larger set i will move to generic namespace what make sense.
> > >
> > > This exact approach did happen several time already in the kernel. In
> > > the GPU sub-system we did it several time. First do something for couple
> > > devices that are very similar then grow to a bigger set of devices and
> > > generalise along the way.
> > >
> > > So i do not see what is the problem of me repeating that same pattern
> > > here again. Do something for a smaller set before tackling it on for
> > > a bigger set.
> > 
> > All of that is fine, but when I asked about the ultimate trajectory
> > that replaces hmm_range_dma_map() with an updated / HMM-aware GUP
> > implementation, the response was that hmm_range_dma_map() is here to
> > stay. The issue is not with forking off a small side effort, it's the
> > plan to absorb that capability into a common implementation across
> > non-HMM drivers where possible.
> 
> Just to get on the record in this thread.
> 
> +1
> 
> I think having an interface which handles the MMU notifier stuff for drivers is
> awesome but we need to agree that the trajectory is to help more drivers if
> possible.
> 

Yes, and I want to get there step by step, not just in one giant leap.
It seems Dan would like to see this all in one step, and I believe this
is too risky and makes the patchset much bigger and harder to review.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2019-03-19 22:24 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-29 16:54 [PATCH 00/10] HMM updates for 5.1 jglisse
2019-01-29 16:54 ` [PATCH 01/10] mm/hmm: use reference counting for HMM struct jglisse
2019-02-20 23:47   ` John Hubbard
2019-02-20 23:59     ` Jerome Glisse
2019-02-21  0:06       ` John Hubbard
2019-02-21  0:15         ` Jerome Glisse
2019-02-21  0:32           ` John Hubbard
2019-02-21  0:37             ` Jerome Glisse
2019-02-21  0:42               ` John Hubbard
2019-01-29 16:54 ` [PATCH 02/10] mm/hmm: do not erase snapshot when a range is invalidated jglisse
2019-02-20 23:58   ` John Hubbard
2019-01-29 16:54 ` [PATCH 03/10] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() jglisse
2019-02-21  0:25   ` John Hubbard
2019-02-21  0:28     ` Jerome Glisse
2019-01-29 16:54 ` [PATCH 04/10] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() jglisse
2019-01-29 16:54 ` [PATCH 05/10] mm/hmm: improve driver API to work and wait over a range jglisse
2019-01-29 16:54 ` [PATCH 06/10] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays jglisse
2019-01-29 16:54 ` [PATCH 07/10] mm/hmm: add an helper function that fault pages and map them to a device jglisse
2019-03-18 20:21   ` Dan Williams
2019-03-18 20:41     ` Jerome Glisse
2019-03-18 21:30       ` Dan Williams
2019-03-18 22:15         ` Jerome Glisse
2019-03-19  3:29           ` Dan Williams
2019-03-19 13:30             ` Jerome Glisse
2019-03-19  8:44               ` Ira Weiny
2019-03-19 17:10                 ` Jerome Glisse
2019-03-19 14:10                   ` Ira Weiny
2019-01-29 16:54 ` [PATCH 08/10] mm/hmm: support hugetlbfs (snap shoting, faulting and DMA mapping) jglisse
2019-01-29 16:54 ` [PATCH 09/10] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem jglisse
2019-01-29 18:41   ` Dan Williams
2019-01-29 19:31     ` Jerome Glisse
2019-01-29 20:51       ` Dan Williams
2019-01-29 21:21         ` Jerome Glisse
2019-01-30  2:32           ` Dan Williams
2019-01-30  3:03             ` Jerome Glisse
2019-01-30 17:25               ` Dan Williams
2019-01-30 18:36                 ` Jerome Glisse
2019-01-31  3:28                   ` Dan Williams
2019-01-31  4:16                     ` Jerome Glisse
2019-01-31  5:44                       ` Dan Williams
2019-03-05 22:16                         ` Andrew Morton
2019-03-06  4:20                           ` Dan Williams
2019-03-06 15:51                             ` Jerome Glisse
2019-03-06 15:57                               ` Dan Williams
2019-03-06 16:03                                 ` Jerome Glisse
2019-03-06 16:06                                   ` Dan Williams
2019-03-07 17:46                             ` Andrew Morton
2019-03-07 18:56                               ` Jerome Glisse
2019-03-12  3:13                                 ` Dan Williams
2019-03-12 15:25                                   ` Jerome Glisse
2019-03-12 16:06                                     ` Dan Williams
2019-03-12 19:06                                       ` Jerome Glisse
2019-03-12 19:30                                         ` Dan Williams
2019-03-12 20:34                                           ` Dave Chinner
2019-03-13  1:06                                             ` Dan Williams
2019-03-12 21:52                                           ` Andrew Morton
2019-03-13  0:10                                             ` Jerome Glisse
2019-03-13  0:46                                               ` Dan Williams
2019-03-13  1:00                                                 ` Jerome Glisse
2019-03-13 16:06                                               ` Andrew Morton
2019-03-13 18:39                                                 ` Jerome Glisse
2019-03-06 15:49                           ` Jerome Glisse
2019-03-06 22:18                             ` Andrew Morton
2019-03-07  0:36                               ` Jerome Glisse
2019-01-29 16:54 ` [PATCH 10/10] mm/hmm: add helpers for driver to safely take the mmap_sem jglisse
2019-02-20 21:59   ` John Hubbard
2019-02-20 22:19     ` Jerome Glisse
2019-02-20 22:40       ` John Hubbard
2019-02-20 23:09         ` Jerome Glisse
2019-02-20 23:17 ` [PATCH 00/10] HMM updates for 5.1 John Hubbard
2019-02-20 23:36   ` Jerome Glisse
2019-02-22 23:31 ` Ralph Campbell
2019-03-13  1:27 ` Jerome Glisse
2019-03-13 16:10   ` Andrew Morton
2019-03-13 18:01     ` Jason Gunthorpe
2019-03-13 18:33     ` Jerome Glisse
2019-03-18 17:00     ` Kuehling, Felix
2019-03-18 17:04     ` Jerome Glisse
2019-03-18 18:30       ` Dan Williams
2019-03-18 18:54         ` Jerome Glisse
2019-03-18 19:18           ` Dan Williams
2019-03-18 19:28             ` Jerome Glisse
2019-03-18 19:36               ` Dan Williams
2019-03-19 16:40       ` Andrew Morton
2019-03-19 16:58         ` Jerome Glisse
2019-03-19 17:12           ` Andrew Morton
2019-03-19 17:18             ` Jerome Glisse
2019-03-19 17:33               ` Dan Williams
2019-03-19 17:45                 ` Jerome Glisse
2019-03-19 18:42                   ` Dan Williams
2019-03-19 19:05                     ` Jerome Glisse
2019-03-19 19:13                       ` Dan Williams
2019-03-19 14:18                         ` Ira Weiny
2019-03-19 22:24                           ` Jerome Glisse
2019-03-19 19:18                         ` Jerome Glisse
2019-03-19 20:25                           ` Jerome Glisse
2019-03-19 21:51             ` Stephen Rothwell
2019-03-19 18:51           ` Deucher, Alexander
