* [PATCH v2 00/11] Improve HMM driver API v2
@ 2019-03-25 14:40 jglisse
  2019-03-25 14:40 ` [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM jglisse
                   ` (10 more replies)
  0 siblings, 11 replies; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Balbir Singh,
	Ralph Campbell, Andrew Morton, John Hubbard, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

This patchset improves the HMM driver API and adds support for mirroring
virtual addresses that are mmaps of hugetlbfs or of a file in a filesystem
on a DAX block device. You can find a tree with all the patches at [1].

This patchset is necessary for converting ODP to HMM, and the patch to do so
has been posted [2]. All new functions introduced by this patchset are used
by the ODP patch. The ODP patch will be pushed through the RDMA tree in the
release after this patchset is merged.

Moreover, all HMM functions are used by the nouveau driver starting in 5.1.

The last patch in the series adds helpers to directly dma map/unmap pages
for virtual addresses that are mirrored on behalf of a device driver. This
has been extracted from the ODP code as it is a common pattern across HMM
device drivers. It will first be used by the ODP RDMA code and will later
be used by nouveau and other drivers that are working on adding HMM
support.

[1] https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-for-5.1-v2
[2] https://cgit.freedesktop.org/~glisse/linux/log/?h=odp-hmm
[3] https://lkml.org/lkml/2019/1/29/1008
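
The dma map/unmap helpers of the last patch replace a pattern that, open-coded,
looks roughly like the sketch below. mydev_dma_map_range() and its error
handling are illustrative only; hmm_vma_fault() and hmm_pfn_to_page() are the
existing HMM calls used elsewhere in this series, and dma_map_page() /
dma_unmap_page() are the usual DMA API:

static int mydev_dma_map_range(struct device *dev, struct hmm_range *range,
			       dma_addr_t *daddrs)
{
	unsigned long i, npages = (range->end - range->start) >> PAGE_SHIFT;
	int ret;

	/* Fault in and snapshot the range on behalf of the device. */
	ret = hmm_vma_fault(range, true);
	if (ret)
		return ret;

	/* Dma map every page backing the range for the device. */
	for (i = 0; i < npages; i++) {
		struct page *page = hmm_pfn_to_page(range, range->pfns[i]);

		daddrs[i] = 0;
		if (!page)
			continue;
		daddrs[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
					 DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, daddrs[i]))
			goto unmap;
	}
	return 0;

unmap:
	while (i--)
		if (daddrs[i])
			dma_unmap_page(dev, daddrs[i], PAGE_SIZE,
				       DMA_BIDIRECTIONAL);
	return -EFAULT;
}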

Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>

Jérôme Glisse (11):
  mm/hmm: select mmu notifier when selecting HMM
  mm/hmm: use reference counting for HMM struct v2
  mm/hmm: do not erase snapshot when a range is invalidated
  mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot()
    v2
  mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() v2
  mm/hmm: improve driver API to work and wait over a range v2
  mm/hmm: add default fault flags to avoid the need to pre-fill pfns
    arrays.
  mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping) v2
  mm/hmm: allow to mirror vma of a file on a DAX backed filesystem v2
  mm/hmm: add helpers for driver to safely take the mmap_sem v2
  mm/hmm: add an helper function that fault pages and map them to a
    device v2

 Documentation/vm/hmm.rst |   36 +-
 include/linux/hmm.h      |  290 ++++++++++-
 mm/Kconfig               |    1 +
 mm/hmm.c                 | 1046 +++++++++++++++++++++++++-------------
 4 files changed, 990 insertions(+), 383 deletions(-)

-- 
2.17.2



* [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-28 20:33   ` John Hubbard
  2019-03-25 14:40 ` [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2 jglisse
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Ralph Campbell,
	Andrew Morton, John Hubbard, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

To avoid random config build issues, select the mmu notifier when HMM is
selected. In any case, when HMM gets selected it will be by users that
will also want the mmu notifier.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 25c71eb8a7db..0d2944278d80 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -694,6 +694,7 @@ config DEV_PAGEMAP_OPS
 
 config HMM
 	bool
+	select MMU_NOTIFIER
 	select MIGRATE_VMA_HELPER
 
 config HMM_MIRROR
-- 
2.17.2



* [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
  2019-03-25 14:40 ` [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-28 11:07   ` Ira Weiny
  2019-03-25 14:40 ` [PATCH v2 03/11] mm/hmm: do not erase snapshot when a range is invalidated jglisse
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, John Hubbard,
	Andrew Morton, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

Every time I read the code to check that the HMM structure does not
vanish before it should, thanks to the many locks protecting its removal,
I get a headache. Switch to reference counting instead; it is much
easier to follow and harder to break. This also removes some code that
is no longer needed with refcounting.
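
In short, every lookup of mm->hmm now goes through a reference-taking helper
and is paired with a put, roughly as in the sketch below (the real helpers are
in the diff; some_hmm_path() is just a placeholder):

static void some_hmm_path(struct mm_struct *mm)
{
	struct hmm *hmm = mm_get_hmm(mm);	/* NULL, or hmm with a reference held */

	if (!hmm)
		return;
	/* hmm cannot vanish here, no other lock is needed to keep it alive */
	hmm_put(hmm);				/* drop the reference when done */
}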

Changes since v1:
    - removed a bunch of useless checks (if the API is used with bogus
      arguments it is better to fail loudly so users can fix their code)
    - s/hmm_get/mm_get_hmm/

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/hmm.h |   2 +
 mm/hmm.c            | 170 ++++++++++++++++++++++++++++----------------
 2 files changed, 112 insertions(+), 60 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index ad50b7b4f141..716fc61fa6d4 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -131,6 +131,7 @@ enum hmm_pfn_value_e {
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+ * @hmm: the core HMM structure this range is active against
  * @vma: the vm area struct for the range
  * @list: all range lock are on a list
  * @start: range virtual start address (inclusive)
@@ -142,6 +143,7 @@ enum hmm_pfn_value_e {
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
+	struct hmm		*hmm;
 	struct vm_area_struct	*vma;
 	struct list_head	list;
 	unsigned long		start;
diff --git a/mm/hmm.c b/mm/hmm.c
index fe1cd87e49ac..306e57f7cded 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -50,6 +50,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
  */
 struct hmm {
 	struct mm_struct	*mm;
+	struct kref		kref;
 	spinlock_t		lock;
 	struct list_head	ranges;
 	struct list_head	mirrors;
@@ -57,6 +58,16 @@ struct hmm {
 	struct rw_semaphore	mirrors_sem;
 };
 
+static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
+{
+	struct hmm *hmm = READ_ONCE(mm->hmm);
+
+	if (hmm && kref_get_unless_zero(&hmm->kref))
+		return hmm;
+
+	return NULL;
+}
+
 /*
  * hmm_register - register HMM against an mm (HMM internal)
  *
@@ -67,14 +78,9 @@ struct hmm {
  */
 static struct hmm *hmm_register(struct mm_struct *mm)
 {
-	struct hmm *hmm = READ_ONCE(mm->hmm);
+	struct hmm *hmm = mm_get_hmm(mm);
 	bool cleanup = false;
 
-	/*
-	 * The hmm struct can only be freed once the mm_struct goes away,
-	 * hence we should always have pre-allocated an new hmm struct
-	 * above.
-	 */
 	if (hmm)
 		return hmm;
 
@@ -86,6 +92,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	hmm->mmu_notifier.ops = NULL;
 	INIT_LIST_HEAD(&hmm->ranges);
 	spin_lock_init(&hmm->lock);
+	kref_init(&hmm->kref);
 	hmm->mm = mm;
 
 	spin_lock(&mm->page_table_lock);
@@ -106,7 +113,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
 		goto error_mm;
 
-	return mm->hmm;
+	return hmm;
 
 error_mm:
 	spin_lock(&mm->page_table_lock);
@@ -118,9 +125,41 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	return NULL;
 }
 
+static void hmm_free(struct kref *kref)
+{
+	struct hmm *hmm = container_of(kref, struct hmm, kref);
+	struct mm_struct *mm = hmm->mm;
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
+
+	spin_lock(&mm->page_table_lock);
+	if (mm->hmm == hmm)
+		mm->hmm = NULL;
+	spin_unlock(&mm->page_table_lock);
+
+	kfree(hmm);
+}
+
+static inline void hmm_put(struct hmm *hmm)
+{
+	kref_put(&hmm->kref, hmm_free);
+}
+
 void hmm_mm_destroy(struct mm_struct *mm)
 {
-	kfree(mm->hmm);
+	struct hmm *hmm;
+
+	spin_lock(&mm->page_table_lock);
+	hmm = mm_get_hmm(mm);
+	mm->hmm = NULL;
+	if (hmm) {
+		hmm->mm = NULL;
+		spin_unlock(&mm->page_table_lock);
+		hmm_put(hmm);
+		return;
+	}
+
+	spin_unlock(&mm->page_table_lock);
 }
 
 static int hmm_invalidate_range(struct hmm *hmm, bool device,
@@ -165,7 +204,7 @@ static int hmm_invalidate_range(struct hmm *hmm, bool device,
 static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct hmm_mirror *mirror;
-	struct hmm *hmm = mm->hmm;
+	struct hmm *hmm = mm_get_hmm(mm);
 
 	down_write(&hmm->mirrors_sem);
 	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
@@ -186,13 +225,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 						  struct hmm_mirror, list);
 	}
 	up_write(&hmm->mirrors_sem);
+
+	hmm_put(hmm);
 }
 
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *range)
 {
+	struct hmm *hmm = mm_get_hmm(range->mm);
 	struct hmm_update update;
-	struct hmm *hmm = range->mm->hmm;
+	int ret;
 
 	VM_BUG_ON(!hmm);
 
@@ -200,14 +242,16 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 	update.end = range->end;
 	update.event = HMM_UPDATE_INVALIDATE;
 	update.blockable = range->blockable;
-	return hmm_invalidate_range(hmm, true, &update);
+	ret = hmm_invalidate_range(hmm, true, &update);
+	hmm_put(hmm);
+	return ret;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *range)
 {
+	struct hmm *hmm = mm_get_hmm(range->mm);
 	struct hmm_update update;
-	struct hmm *hmm = range->mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
@@ -216,6 +260,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 	update.event = HMM_UPDATE_INVALIDATE;
 	update.blockable = true;
 	hmm_invalidate_range(hmm, false, &update);
+	hmm_put(hmm);
 }
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
@@ -241,24 +286,13 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 	if (!mm || !mirror || !mirror->ops)
 		return -EINVAL;
 
-again:
 	mirror->hmm = hmm_register(mm);
 	if (!mirror->hmm)
 		return -ENOMEM;
 
 	down_write(&mirror->hmm->mirrors_sem);
-	if (mirror->hmm->mm == NULL) {
-		/*
-		 * A racing hmm_mirror_unregister() is about to destroy the hmm
-		 * struct. Try again to allocate a new one.
-		 */
-		up_write(&mirror->hmm->mirrors_sem);
-		mirror->hmm = NULL;
-		goto again;
-	} else {
-		list_add(&mirror->list, &mirror->hmm->mirrors);
-		up_write(&mirror->hmm->mirrors_sem);
-	}
+	list_add(&mirror->list, &mirror->hmm->mirrors);
+	up_write(&mirror->hmm->mirrors_sem);
 
 	return 0;
 }
@@ -273,33 +307,18 @@ EXPORT_SYMBOL(hmm_mirror_register);
  */
 void hmm_mirror_unregister(struct hmm_mirror *mirror)
 {
-	bool should_unregister = false;
-	struct mm_struct *mm;
-	struct hmm *hmm;
+	struct hmm *hmm = READ_ONCE(mirror->hmm);
 
-	if (mirror->hmm == NULL)
+	if (hmm == NULL)
 		return;
 
-	hmm = mirror->hmm;
 	down_write(&hmm->mirrors_sem);
 	list_del_init(&mirror->list);
-	should_unregister = list_empty(&hmm->mirrors);
+	/* To protect us against double unregister ... */
 	mirror->hmm = NULL;
-	mm = hmm->mm;
-	hmm->mm = NULL;
 	up_write(&hmm->mirrors_sem);
 
-	if (!should_unregister || mm == NULL)
-		return;
-
-	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
-
-	spin_lock(&mm->page_table_lock);
-	if (mm->hmm == hmm)
-		mm->hmm = NULL;
-	spin_unlock(&mm->page_table_lock);
-
-	kfree(hmm);
+	hmm_put(hmm);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
@@ -708,6 +727,8 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
+	range->hmm = NULL;
+
 	/* Sanity check, this really should not happen ! */
 	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
@@ -717,14 +738,18 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm)
 		return -ENOMEM;
-	/* Caller must have registered a mirror, via hmm_mirror_register() ! */
-	if (!hmm->mmu_notifier.ops)
+
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL) {
+		hmm_put(hmm);
 		return -EINVAL;
+	}
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
 			vma_is_dax(vma)) {
 		hmm_pfns_special(range);
+		hmm_put(hmm);
 		return -EINVAL;
 	}
 
@@ -736,6 +761,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 		 * operations such has atomic access would not work.
 		 */
 		hmm_pfns_clear(range, range->pfns, range->start, range->end);
+		hmm_put(hmm);
 		return -EPERM;
 	}
 
@@ -758,6 +784,12 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
 	walk_page_range(range->start, range->end, &mm_walk);
+	/*
+	 * Transfer hmm reference to the range struct it will be drop inside
+	 * the hmm_vma_range_done() function (which _must_ be call if this
+	 * function return 0).
+	 */
+	range->hmm = hmm;
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -802,25 +834,27 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  */
 bool hmm_vma_range_done(struct hmm_range *range)
 {
-	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
-	struct hmm *hmm;
+	bool ret = false;
 
-	if (range->end <= range->start) {
+	/* Sanity check this really should not happen. */
+	if (range->hmm == NULL || range->end <= range->start) {
 		BUG();
 		return false;
 	}
 
-	hmm = hmm_register(range->vma->vm_mm);
-	if (!hmm) {
-		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
-		return false;
-	}
-
-	spin_lock(&hmm->lock);
+	spin_lock(&range->hmm->lock);
 	list_del_rcu(&range->list);
-	spin_unlock(&hmm->lock);
+	ret = range->valid;
+	spin_unlock(&range->hmm->lock);
 
-	return range->valid;
+	/* Is the mm still alive ? */
+	if (range->hmm->mm == NULL)
+		ret = false;
+
+	/* Drop reference taken by hmm_vma_fault() or hmm_vma_get_pfns() */
+	hmm_put(range->hmm);
+	range->hmm = NULL;
+	return ret;
 }
 EXPORT_SYMBOL(hmm_vma_range_done);
 
@@ -880,6 +914,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 	struct hmm *hmm;
 	int ret;
 
+	range->hmm = NULL;
+
 	/* Sanity check, this really should not happen ! */
 	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
@@ -891,14 +927,18 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
-	/* Caller must have registered a mirror using hmm_mirror_register() */
-	if (!hmm->mmu_notifier.ops)
+
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL) {
+		hmm_put(hmm);
 		return -EINVAL;
+	}
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
 			vma_is_dax(vma)) {
 		hmm_pfns_special(range);
+		hmm_put(hmm);
 		return -EINVAL;
 	}
 
@@ -910,6 +950,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		 * operations such has atomic access would not work.
 		 */
 		hmm_pfns_clear(range, range->pfns, range->start, range->end);
+		hmm_put(hmm);
 		return -EPERM;
 	}
 
@@ -945,7 +986,16 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
 			       range->end);
 		hmm_vma_range_done(range);
+		hmm_put(hmm);
+	} else {
+		/*
+		 * Transfer hmm reference to the range struct it will be drop
+		 * inside the hmm_vma_range_done() function (which _must_ be
+		 * call if this function return 0).
+		 */
+		range->hmm = hmm;
 	}
+
 	return ret;
 }
 EXPORT_SYMBOL(hmm_vma_fault);
-- 
2.17.2



* [PATCH v2 03/11] mm/hmm: do not erase snapshot when a range is invalidated
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
  2019-03-25 14:40 ` [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM jglisse
  2019-03-25 14:40 ` [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2 jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-25 14:40 ` [PATCH v2 04/11] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() v2 jglisse
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

Users of HMM might be using the snapshot information to do
preparatory steps like dma mapping pages to a device before
checking for invalidation through hmm_vma_range_done(), so
do not erase that information and assume users will do the
right thing.
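
A rough sketch of the driver pattern this refers to (take_lock()/release_lock()
follow the pseudo code style of Documentation/vm/hmm.rst, and
dma_map_driver_pages()/dma_unmap_driver_pages() are placeholders):

again:
	ret = hmm_vma_get_pfns(&range);		/* snapshot the CPU page table */
	if (ret)
		return ret;

	dma_map_driver_pages(dev, &range);	/* preparatory work on the snapshot */

	take_lock(driver->update);
	if (!hmm_vma_range_done(&range)) {
		/*
		 * Raced with an invalidation: the pfns array is left intact,
		 * so the driver can cleanly undo its dma mapping and retry.
		 */
		release_lock(driver->update);
		dma_unmap_driver_pages(dev, &range);
		goto again;
	}
	/* ... commit range.pfns to the device page table ... */
	release_lock(driver->update);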

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 mm/hmm.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 306e57f7cded..213b0beee8d3 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -170,16 +170,10 @@ static int hmm_invalidate_range(struct hmm *hmm, bool device,
 
 	spin_lock(&hmm->lock);
 	list_for_each_entry(range, &hmm->ranges, list) {
-		unsigned long addr, idx, npages;
-
 		if (update->end < range->start || update->start >= range->end)
 			continue;
 
 		range->valid = false;
-		addr = max(update->start, range->start);
-		idx = (addr - range->start) >> PAGE_SHIFT;
-		npages = (min(range->end, update->end) - addr) >> PAGE_SHIFT;
-		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
 	}
 	spin_unlock(&hmm->lock);
 
-- 
2.17.2



* [PATCH v2 04/11] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() v2
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
                   ` (2 preceding siblings ...)
  2019-03-25 14:40 ` [PATCH v2 03/11] mm/hmm: do not erase snapshot when a range is invalidated jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-28 13:30   ` Ira Weiny
  2019-03-25 14:40 ` [PATCH v2 05/11] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() v2 jglisse
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

Rename for consistency between code, comments and documentation. Also
improve the comments on all the possible return values. Improve the
function by returning the number of populated entries in the pfns array.
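
With this change a caller checks the return value roughly as sketched below
(following the return convention documented in the patch):

	long ret;

	ret = hmm_range_snapshot(&range);
	if (ret < 0)
		return ret;	/* -EINVAL, -EPERM, -EAGAIN, -EFAULT, ... */

	/* ret entries of range.pfns[] were populated, starting at range.start */
	npages = ret;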

Changes since v1:
    - updated documentation
    - reformatted some comments

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/vm/hmm.rst | 26 ++++++++++++++++++--------
 include/linux/hmm.h      |  4 ++--
 mm/hmm.c                 | 31 +++++++++++++++++--------------
 3 files changed, 37 insertions(+), 24 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 44205f0b671f..d9b27bdadd1b 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -189,11 +189,7 @@ the driver callback returns.
 When the device driver wants to populate a range of virtual addresses, it can
 use either::
 
-  int hmm_vma_get_pfns(struct vm_area_struct *vma,
-                      struct hmm_range *range,
-                      unsigned long start,
-                      unsigned long end,
-                      hmm_pfn_t *pfns);
+  long hmm_range_snapshot(struct hmm_range *range);
   int hmm_vma_fault(struct vm_area_struct *vma,
                     struct hmm_range *range,
                     unsigned long start,
@@ -202,7 +198,7 @@ When the device driver wants to populate a range of virtual addresses, it can
                     bool write,
                     bool block);
 
-The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+The first one (hmm_range_snapshot()) will only fetch present CPU page table
 entries and will not trigger a page fault on missing or non-present entries.
 The second one does trigger a page fault on missing or read-only entry if the
 write parameter is true. Page faults use the generic mm page fault code path
@@ -220,19 +216,33 @@ Locking with the update() callback is the most important aspect the driver must
  {
       struct hmm_range range;
       ...
+
+      range.start = ...;
+      range.end = ...;
+      range.pfns = ...;
+      range.flags = ...;
+      range.values = ...;
+      range.pfn_shift = ...;
+
  again:
-      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
-      if (ret)
+      down_read(&mm->mmap_sem);
+      range.vma = ...;
+      ret = hmm_range_snapshot(&range);
+      if (ret) {
+          up_read(&mm->mmap_sem);
           return ret;
+      }
       take_lock(driver->update);
       if (!hmm_vma_range_done(vma, &range)) {
           release_lock(driver->update);
+          up_read(&mm->mmap_sem);
           goto again;
       }
 
       // Use pfns array content to update device page table
 
       release_lock(driver->update);
+      up_read(&mm->mmap_sem);
       return 0;
  }
 
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 716fc61fa6d4..32206b0b1bfd 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -365,11 +365,11 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  * table invalidation serializes on it.
  *
  * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
- * hmm_vma_get_pfns() WITHOUT ERROR !
+ * hmm_range_snapshot() WITHOUT ERROR !
  *
  * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
  */
-int hmm_vma_get_pfns(struct hmm_range *range);
+long hmm_range_snapshot(struct hmm_range *range);
 bool hmm_vma_range_done(struct hmm_range *range);
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 213b0beee8d3..91361aa74b8b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -698,23 +698,25 @@ static void hmm_pfns_special(struct hmm_range *range)
 }
 
 /*
- * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
- * @range: range being snapshotted
- * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
- *          vma permission, 0 success
+ * hmm_range_snapshot() - snapshot CPU page table for a range
+ * @range: range
+ * Returns: number of valid pages in range->pfns[] (from range start
+ *          address). This may be zero. If the return value is negative,
+ *          then one of the following values may be returned:
+ *
+ *           -EINVAL  invalid arguments or mm or virtual address are in an
+ *                    invalid vma (ie either hugetlbfs or device file vma).
+ *           -EPERM   For example, asking for write, when the range is
+ *                    read-only
+ *           -EAGAIN  Caller needs to retry
+ *           -EFAULT  Either no valid vma exists for this range, or it is
+ *                    illegal to access the range
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
  * validity is tracked by range struct. See hmm_vma_range_done() for further
  * information.
- *
- * The range struct is initialized here. It tracks the CPU page table, but only
- * if the function returns success (0), in which case the caller must then call
- * hmm_vma_range_done() to stop CPU page table update tracking on this range.
- *
- * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
- * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
  */
-int hmm_vma_get_pfns(struct hmm_range *range)
+long hmm_range_snapshot(struct hmm_range *range)
 {
 	struct vm_area_struct *vma = range->vma;
 	struct hmm_vma_walk hmm_vma_walk;
@@ -768,6 +770,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	hmm_vma_walk.fault = false;
 	hmm_vma_walk.range = range;
 	mm_walk.private = &hmm_vma_walk;
+	hmm_vma_walk.last = range->start;
 
 	mm_walk.vma = vma;
 	mm_walk.mm = vma->vm_mm;
@@ -784,9 +787,9 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	 * function return 0).
 	 */
 	range->hmm = hmm;
-	return 0;
+	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
-EXPORT_SYMBOL(hmm_vma_get_pfns);
+EXPORT_SYMBOL(hmm_range_snapshot);
 
 /*
  * hmm_vma_range_done() - stop tracking change to CPU page table over a range
-- 
2.17.2



* [PATCH v2 05/11] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() v2
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
                   ` (3 preceding siblings ...)
  2019-03-25 14:40 ` [PATCH v2 04/11] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() v2 jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-28 13:43   ` Ira Weiny
  2019-03-25 14:40 ` [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2 jglisse
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	John Hubbard, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

Rename for consistency between code, comments and documentation. Also
improve the comments on all the possible return values. Improve the
function by returning the number of populated entries in the pfns array.
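
With this change a caller handles the return value and the swapped
-EAGAIN/-EBUSY meanings roughly as sketched below (assuming the range has
been set up as in the documentation; the retry policy is illustrative only):

retry:
	down_read(&mm->mmap_sem);
	ret = hmm_range_fault(&range, true /* block */);
	if (ret == -EBUSY) {
		/* The range was invalidated while faulting, try again. */
		up_read(&mm->mmap_sem);
		goto retry;
	}
	if (ret < 0) {
		up_read(&mm->mmap_sem);
		return ret;
	}
	/* ret entries of range.pfns[] are valid, starting at range.start */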

Changes since v1:
    - updated documentation
    - reformatted some comments

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/vm/hmm.rst |  8 +---
 include/linux/hmm.h      | 13 +++++-
 mm/hmm.c                 | 91 +++++++++++++++++-----------------------
 3 files changed, 52 insertions(+), 60 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index d9b27bdadd1b..61f073215a8d 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -190,13 +190,7 @@ When the device driver wants to populate a range of virtual addresses, it can
 use either::
 
   long hmm_range_snapshot(struct hmm_range *range);
-  int hmm_vma_fault(struct vm_area_struct *vma,
-                    struct hmm_range *range,
-                    unsigned long start,
-                    unsigned long end,
-                    hmm_pfn_t *pfns,
-                    bool write,
-                    bool block);
+  long hmm_range_fault(struct hmm_range *range, bool block);
 
 The first one (hmm_range_snapshot()) will only fetch present CPU page table
 entries and will not trigger a page fault on missing or non-present entries.
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 32206b0b1bfd..e9afd23c2eac 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -391,7 +391,18 @@ bool hmm_vma_range_done(struct hmm_range *range);
  *
  * See the function description in mm/hmm.c for further documentation.
  */
-int hmm_vma_fault(struct hmm_range *range, bool block);
+long hmm_range_fault(struct hmm_range *range, bool block);
+
+/* This is a temporary helper to avoid merge conflict between trees. */
+static inline int hmm_vma_fault(struct hmm_range *range, bool block)
+{
+	long ret = hmm_range_fault(range, block);
+	if (ret == -EBUSY)
+		ret = -EAGAIN;
+	else if (ret == -EAGAIN)
+		ret = -EBUSY;
+	return ret < 0 ? ret : 0;
+}
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);
diff --git a/mm/hmm.c b/mm/hmm.c
index 91361aa74b8b..7860e63c3ba7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -336,13 +336,13 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 	flags |= write_fault ? FAULT_FLAG_WRITE : 0;
 	ret = handle_mm_fault(vma, addr, flags);
 	if (ret & VM_FAULT_RETRY)
-		return -EBUSY;
+		return -EAGAIN;
 	if (ret & VM_FAULT_ERROR) {
 		*pfn = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
-	return -EAGAIN;
+	return -EBUSY;
 }
 
 static int hmm_pfns_bad(unsigned long addr,
@@ -368,7 +368,7 @@ static int hmm_pfns_bad(unsigned long addr,
  * @fault: should we fault or not ?
  * @write_fault: write fault ?
  * @walk: mm_walk structure
- * Returns: 0 on success, -EAGAIN after page fault, or page fault error
+ * Returns: 0 on success, -EBUSY after page fault, or page fault error
  *
  * This function will be called whenever pmd_none() or pte_none() returns true,
  * or whenever there is no page directory covering the virtual address range.
@@ -391,12 +391,12 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
 
 			ret = hmm_vma_do_fault(walk, addr, write_fault,
 					       &pfns[i]);
-			if (ret != -EAGAIN)
+			if (ret != -EBUSY)
 				return ret;
 		}
 	}
 
-	return (fault || write_fault) ? -EAGAIN : 0;
+	return (fault || write_fault) ? -EBUSY : 0;
 }
 
 static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
@@ -527,11 +527,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	uint64_t orig_pfn = *pfn;
 
 	*pfn = range->values[HMM_PFN_NONE];
-	cpu_flags = pte_to_hmm_pfn_flags(range, pte);
-	hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
-			   &fault, &write_fault);
+	fault = write_fault = false;
 
 	if (pte_none(pte)) {
+		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0,
+				   &fault, &write_fault);
 		if (fault || write_fault)
 			goto fault;
 		return 0;
@@ -570,7 +570,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 				hmm_vma_walk->last = addr;
 				migration_entry_wait(vma->vm_mm,
 						     pmdp, addr);
-				return -EAGAIN;
+				return -EBUSY;
 			}
 			return 0;
 		}
@@ -578,6 +578,10 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		/* Report error for everything else */
 		*pfn = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
+	} else {
+		cpu_flags = pte_to_hmm_pfn_flags(range, pte);
+		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
+				   &fault, &write_fault);
 	}
 
 	if (fault || write_fault)
@@ -628,7 +632,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (fault || write_fault) {
 			hmm_vma_walk->last = addr;
 			pmd_migration_entry_wait(vma->vm_mm, pmdp);
-			return -EAGAIN;
+			return -EBUSY;
 		}
 		return 0;
 	} else if (!pmd_present(pmd))
@@ -856,53 +860,34 @@ bool hmm_vma_range_done(struct hmm_range *range)
 EXPORT_SYMBOL(hmm_vma_range_done);
 
 /*
- * hmm_vma_fault() - try to fault some address in a virtual address range
+ * hmm_range_fault() - try to fault some address in a virtual address range
  * @range: range being faulted
  * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
- * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
+ * Returns: number of valid pages in range->pfns[] (from range start
+ *          address). This may be zero. If the return value is negative,
+ *          then one of the following values may be returned:
+ *
+ *           -EINVAL  invalid arguments or mm or virtual address are in an
+ *                    invalid vma (ie either hugetlbfs or device file vma).
+ *           -ENOMEM: Out of memory.
+ *           -EPERM:  Invalid permission (for instance asking for write and
+ *                    range is read only).
+ *           -EAGAIN: If you need to retry and mmap_sem was drop. This can only
+ *                    happens if block argument is false.
+ *           -EBUSY:  If the the range is being invalidated and you should wait
+ *                    for invalidation to finish.
+ *           -EFAULT: Invalid (ie either no valid vma or it is illegal to access
+ *                    that range), number of valid pages in range->pfns[] (from
+ *                    range start address).
  *
  * This is similar to a regular CPU page fault except that it will not trigger
- * any memory migration if the memory being faulted is not accessible by CPUs.
+ * any memory migration if the memory being faulted is not accessible by CPUs
+ * and caller does not ask for migration.
  *
  * On error, for one virtual address in the range, the function will mark the
  * corresponding HMM pfn entry with an error flag.
- *
- * Expected use pattern:
- * retry:
- *   down_read(&mm->mmap_sem);
- *   // Find vma and address device wants to fault, initialize hmm_pfn_t
- *   // array accordingly
- *   ret = hmm_vma_fault(range, write, block);
- *   switch (ret) {
- *   case -EAGAIN:
- *     hmm_vma_range_done(range);
- *     // You might want to rate limit or yield to play nicely, you may
- *     // also commit any valid pfn in the array assuming that you are
- *     // getting true from hmm_vma_range_monitor_end()
- *     goto retry;
- *   case 0:
- *     break;
- *   case -ENOMEM:
- *   case -EINVAL:
- *   case -EPERM:
- *   default:
- *     // Handle error !
- *     up_read(&mm->mmap_sem)
- *     return;
- *   }
- *   // Take device driver lock that serialize device page table update
- *   driver_lock_device_page_table_update();
- *   hmm_vma_range_done(range);
- *   // Commit pfns we got from hmm_vma_fault()
- *   driver_unlock_device_page_table_update();
- *   up_read(&mm->mmap_sem)
- *
- * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURN SUCCESS (0)
- * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
- *
- * YOU HAVE BEEN WARNED !
  */
-int hmm_vma_fault(struct hmm_range *range, bool block)
+long hmm_range_fault(struct hmm_range *range, bool block)
 {
 	struct vm_area_struct *vma = range->vma;
 	unsigned long start = range->start;
@@ -974,7 +959,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 	do {
 		ret = walk_page_range(start, range->end, &mm_walk);
 		start = hmm_vma_walk.last;
-	} while (ret == -EAGAIN);
+		/* Keep trying while the range is valid. */
+	} while (ret == -EBUSY && range->valid);
 
 	if (ret) {
 		unsigned long i;
@@ -984,6 +970,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 			       range->end);
 		hmm_vma_range_done(range);
 		hmm_put(hmm);
+		return ret;
 	} else {
 		/*
 		 * Transfer hmm reference to the range struct it will be drop
@@ -993,9 +980,9 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		range->hmm = hmm;
 	}
 
-	return ret;
+	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
-EXPORT_SYMBOL(hmm_vma_fault);
+EXPORT_SYMBOL(hmm_range_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-- 
2.17.2



* [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
                   ` (4 preceding siblings ...)
  2019-03-25 14:40 ` [PATCH v2 05/11] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() v2 jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-28 13:11   ` Ira Weiny
  2019-03-28 16:12   ` Ira Weiny
  2019-03-25 14:40 ` [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays jglisse
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	John Hubbard, Dan Williams, Dan Carpenter, Matthew Wilcox

From: Jérôme Glisse <jglisse@redhat.com>

A common use case for an HMM mirror is a user trying to mirror a range
and, before they can program the hardware, it gets invalidated by
some core mm event. Instead of having the user retry right away to
mirror the range, provide a completion mechanism for them to wait
for any active invalidation affecting the range.

This also changes how hmm_range_snapshot() and hmm_range_fault()
work by not relying on the vma, so that we can drop the mmap_sem
when waiting and look up the vma again on retry.
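
The resulting register/wait/snapshot flow, condensed from the pseudo code this
patch adds to include/linux/hmm.h (mydevice_* names are placeholders):

	ret = hmm_range_register(&range, mm, start, end);
	if (ret)
		return ret;

again:
	if (!hmm_range_wait_until_valid(&range, HMM_RANGE_DEFAULT_TIMEOUT)) {
		hmm_range_unregister(&range);
		return -EBUSY;	/* or retry; the policy is up to the driver */
	}

	down_read(&mm->mmap_sem);
	ret = hmm_range_snapshot(&range);
	up_read(&mm->mmap_sem);
	if (ret == -EAGAIN)
		goto again;
	if (ret < 0) {
		hmm_range_unregister(&range);
		return ret;
	}

	mydevice_page_table_lock(mydevice);
	if (!hmm_range_valid(&range)) {
		mydevice_page_table_unlock(mydevice);
		goto again;
	}
	/* ... program the device page table from range.pfns ... */
	mydevice_page_table_unlock(mydevice);
	hmm_range_unregister(&range);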

Changes since v1:
    - squashed: Dan Carpenter: potential deadlock in nonblocking code

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
---
 include/linux/hmm.h | 208 ++++++++++++++---
 mm/hmm.c            | 528 +++++++++++++++++++++-----------------------
 2 files changed, 428 insertions(+), 308 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index e9afd23c2eac..79671036cb5f 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -77,8 +77,34 @@
 #include <linux/migrate.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
+#include <linux/mmu_notifier.h>
 
-struct hmm;
+
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting ranges list
+ * @ranges: list of range being snapshotted
+ * @mirrors: list of mirrors for this mm
+ * @mmu_notifier: mmu notifier to track updates to CPU page table
+ * @mirrors_sem: read/write semaphore protecting the mirrors list
+ * @wq: wait queue for user waiting on a range invalidation
+ * @notifiers: count of active mmu notifiers
+ * @dead: is the mm dead ?
+ */
+struct hmm {
+	struct mm_struct	*mm;
+	struct kref		kref;
+	struct mutex		lock;
+	struct list_head	ranges;
+	struct list_head	mirrors;
+	struct mmu_notifier	mmu_notifier;
+	struct rw_semaphore	mirrors_sem;
+	wait_queue_head_t	wq;
+	long			notifiers;
+	bool			dead;
+};
 
 /*
  * hmm_pfn_flag_e - HMM flag enums
@@ -155,6 +181,38 @@ struct hmm_range {
 	bool			valid;
 };
 
+/*
+ * hmm_range_wait_until_valid() - wait for range to be valid
+ * @range: range affected by invalidation to wait on
+ * @timeout: time out for wait in ms (ie abort wait after that period of time)
+ * Returns: true if the range is valid, false otherwise.
+ */
+static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
+					      unsigned long timeout)
+{
+	/* Check if mm is dead ? */
+	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
+		range->valid = false;
+		return false;
+	}
+	if (range->valid)
+		return true;
+	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
+			   msecs_to_jiffies(timeout));
+	/* Return current valid status just in case we get lucky */
+	return range->valid;
+}
+
+/*
+ * hmm_range_valid() - test if a range is valid or not
+ * @range: range
+ * Returns: true if the range is valid, false otherwise.
+ */
+static inline bool hmm_range_valid(struct hmm_range *range)
+{
+	return range->valid;
+}
+
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
  * @range: range use to decode HMM pfn value
@@ -357,51 +415,133 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
 
 /*
- * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
- * driver lock that serializes device page table updates, then call
- * hmm_vma_range_done(), to check if the snapshot is still valid. The same
- * device driver page table update lock must also be used in the
- * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
- * table invalidation serializes on it.
+ * To snapshot the CPU page table you first have to call hmm_range_register()
+ * to register the range. If hmm_range_register() return an error then some-
+ * thing is horribly wrong and you should fail loudly. If it returned true then
+ * you can wait for the range to be stable with hmm_range_wait_until_valid()
+ * function, a range is valid when there are no concurrent changes to the CPU
+ * page table for the range.
+ *
+ * Once the range is valid you can call hmm_range_snapshot() if that returns
+ * without error then you can take your device page table lock (the same lock
+ * you use in the HMM mirror sync_cpu_device_pagetables() callback). After
+ * taking that lock you have to check the range validity, if it is still valid
+ * (ie hmm_range_valid() returns true) then you can program the device page
+ * table, otherwise you have to start again. Pseudo code:
+ *
+ *      mydevice_prefault(mydevice, mm, start, end)
+ *      {
+ *          struct hmm_range range;
+ *          ...
  *
- * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
- * hmm_range_snapshot() WITHOUT ERROR !
+ *          ret = hmm_range_register(&range, mm, start, end);
+ *          if (ret)
+ *              return ret;
  *
- * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
- */
-long hmm_range_snapshot(struct hmm_range *range);
-bool hmm_vma_range_done(struct hmm_range *range);
-
-
-/*
- * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
- * not migrate any device memory back to system memory. The HMM pfn array will
- * be updated with the fault result and current snapshot of the CPU page table
- * for the range.
+ *          down_read(mm->mmap_sem);
+ *      again:
+ *
+ *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
+ *              up_read(&mm->mmap_sem);
+ *              hmm_range_unregister(range);
+ *              // Handle time out, either sleep or retry or something else
+ *              ...
+ *              return -ESOMETHING; || goto again;
+ *          }
+ *
+ *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
+ *          if (ret == -EAGAIN) {
+ *              down_read(mm->mmap_sem);
+ *              goto again;
+ *          } else if (ret == -EBUSY) {
+ *              goto again;
+ *          }
+ *
+ *          up_read(&mm->mmap_sem);
+ *          if (ret) {
+ *              hmm_range_unregister(range);
+ *              return ret;
+ *          }
+ *
+ *          // It might not have snap-shoted the whole range but only the first
+ *          // npages, the return values is the number of valid pages from the
+ *          // start of the range.
+ *          npages = ret;
  *
- * The mmap_sem must be taken in read mode before entering and it might be
- * dropped by the function if the block argument is false. In that case, the
- * function returns -EAGAIN.
+ *          ...
  *
- * Return value does not reflect if the fault was successful for every single
- * address or not. Therefore, the caller must to inspect the HMM pfn array to
- * determine fault status for each address.
+ *          mydevice_page_table_lock(mydevice);
+ *          if (!hmm_range_valid(range)) {
+ *              mydevice_page_table_unlock(mydevice);
+ *              goto again;
+ *          }
  *
- * Trying to fault inside an invalid vma will result in -EINVAL.
+ *          mydevice_populate_page_table(mydevice, range, npages);
+ *          ...
+ *          mydevice_take_page_table_unlock(mydevice);
+ *          hmm_range_unregister(range);
  *
- * See the function description in mm/hmm.c for further documentation.
+ *          return 0;
+ *      }
+ *
+ * The same scheme apply to hmm_range_fault() (ie replace hmm_range_snapshot()
+ * with hmm_range_fault() in above pseudo code).
+ *
+ * YOU MUST CALL hmm_range_unregister() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_range_register() AND hmm_range_register() RETURNED TRUE ! IF YOU DO NOT
+ * FOLLOW THIS RULE MEMORY CORRUPTION WILL ENSUE !
  */
+int hmm_range_register(struct hmm_range *range,
+		       struct mm_struct *mm,
+		       unsigned long start,
+		       unsigned long end);
+void hmm_range_unregister(struct hmm_range *range);
+long hmm_range_snapshot(struct hmm_range *range);
 long hmm_range_fault(struct hmm_range *range, bool block);
 
+/*
+ * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
+ *
+ * When waiting for mmu notifiers we need some kind of time out otherwise we
+ * could potentialy wait for ever, 1000ms ie 1s sounds like a long time to
+ * wait already.
+ */
+#define HMM_RANGE_DEFAULT_TIMEOUT 1000
+
 /* This is a temporary helper to avoid merge conflict between trees. */
+static inline bool hmm_vma_range_done(struct hmm_range *range)
+{
+	bool ret = hmm_range_valid(range);
+
+	hmm_range_unregister(range);
+	return ret;
+}
+
 static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 {
-	long ret = hmm_range_fault(range, block);
-	if (ret == -EBUSY)
-		ret = -EAGAIN;
-	else if (ret == -EAGAIN)
-		ret = -EBUSY;
-	return ret < 0 ? ret : 0;
+	long ret;
+
+	ret = hmm_range_register(range, range->vma->vm_mm,
+				 range->start, range->end);
+	if (ret)
+		return (int)ret;
+
+	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
+		up_read(&range->vma->vm_mm->mmap_sem);
+		return -EAGAIN;
+	}
+
+	ret = hmm_range_fault(range, block);
+	if (ret <= 0) {
+		if (ret == -EBUSY || !ret) {
+			up_read(&range->vma->vm_mm->mmap_sem);
+			ret = -EBUSY;
+		} else if (ret == -EAGAIN)
+			ret = -EBUSY;
+		hmm_range_unregister(range);
+		return ret;
+	}
+	return 0;
 }
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
diff --git a/mm/hmm.c b/mm/hmm.c
index 7860e63c3ba7..fa9498eeb9b6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -38,26 +38,6 @@
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
-/*
- * struct hmm - HMM per mm struct
- *
- * @mm: mm struct this HMM struct is bound to
- * @lock: lock protecting ranges list
- * @ranges: list of range being snapshotted
- * @mirrors: list of mirrors for this mm
- * @mmu_notifier: mmu notifier to track updates to CPU page table
- * @mirrors_sem: read/write semaphore protecting the mirrors list
- */
-struct hmm {
-	struct mm_struct	*mm;
-	struct kref		kref;
-	spinlock_t		lock;
-	struct list_head	ranges;
-	struct list_head	mirrors;
-	struct mmu_notifier	mmu_notifier;
-	struct rw_semaphore	mirrors_sem;
-};
-
 static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
 {
 	struct hmm *hmm = READ_ONCE(mm->hmm);
@@ -87,12 +67,15 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 	if (!hmm)
 		return NULL;
+	init_waitqueue_head(&hmm->wq);
 	INIT_LIST_HEAD(&hmm->mirrors);
 	init_rwsem(&hmm->mirrors_sem);
 	hmm->mmu_notifier.ops = NULL;
 	INIT_LIST_HEAD(&hmm->ranges);
-	spin_lock_init(&hmm->lock);
+	mutex_init(&hmm->lock);
 	kref_init(&hmm->kref);
+	hmm->notifiers = 0;
+	hmm->dead = false;
 	hmm->mm = mm;
 
 	spin_lock(&mm->page_table_lock);
@@ -154,6 +137,7 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	mm->hmm = NULL;
 	if (hmm) {
 		hmm->mm = NULL;
+		hmm->dead = true;
 		spin_unlock(&mm->page_table_lock);
 		hmm_put(hmm);
 		return;
@@ -162,43 +146,22 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	spin_unlock(&mm->page_table_lock);
 }
 
-static int hmm_invalidate_range(struct hmm *hmm, bool device,
-				const struct hmm_update *update)
+static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
+	struct hmm *hmm = mm_get_hmm(mm);
 	struct hmm_mirror *mirror;
 	struct hmm_range *range;
 
-	spin_lock(&hmm->lock);
-	list_for_each_entry(range, &hmm->ranges, list) {
-		if (update->end < range->start || update->start >= range->end)
-			continue;
+	/* Report this HMM as dying. */
+	hmm->dead = true;
 
+	/* Wake-up everyone waiting on any range. */
+	mutex_lock(&hmm->lock);
+	list_for_each_entry(range, &hmm->ranges, list) {
 		range->valid = false;
 	}
-	spin_unlock(&hmm->lock);
-
-	if (!device)
-		return 0;
-
-	down_read(&hmm->mirrors_sem);
-	list_for_each_entry(mirror, &hmm->mirrors, list) {
-		int ret;
-
-		ret = mirror->ops->sync_cpu_device_pagetables(mirror, update);
-		if (!update->blockable && ret == -EAGAIN) {
-			up_read(&hmm->mirrors_sem);
-			return -EAGAIN;
-		}
-	}
-	up_read(&hmm->mirrors_sem);
-
-	return 0;
-}
-
-static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
-{
-	struct hmm_mirror *mirror;
-	struct hmm *hmm = mm_get_hmm(mm);
+	wake_up_all(&hmm->wq);
+	mutex_unlock(&hmm->lock);
 
 	down_write(&hmm->mirrors_sem);
 	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
@@ -224,36 +187,80 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 }
 
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
-			const struct mmu_notifier_range *range)
+			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(range->mm);
+	struct hmm *hmm = mm_get_hmm(nrange->mm);
+	struct hmm_mirror *mirror;
 	struct hmm_update update;
-	int ret;
+	struct hmm_range *range;
+	int ret = 0;
 
 	VM_BUG_ON(!hmm);
 
-	update.start = range->start;
-	update.end = range->end;
+	update.start = nrange->start;
+	update.end = nrange->end;
 	update.event = HMM_UPDATE_INVALIDATE;
-	update.blockable = range->blockable;
-	ret = hmm_invalidate_range(hmm, true, &update);
+	update.blockable = nrange->blockable;
+
+	if (nrange->blockable)
+		mutex_lock(&hmm->lock);
+	else if (!mutex_trylock(&hmm->lock)) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	hmm->notifiers++;
+	list_for_each_entry(range, &hmm->ranges, list) {
+		if (update.end < range->start || update.start >= range->end)
+			continue;
+
+		range->valid = false;
+	}
+	mutex_unlock(&hmm->lock);
+
+	if (nrange->blockable)
+		down_read(&hmm->mirrors_sem);
+	else if (!down_read_trylock(&hmm->mirrors_sem)) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	list_for_each_entry(mirror, &hmm->mirrors, list) {
+		int ret;
+
+		ret = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
+		if (!update.blockable && ret == -EAGAIN) {
+			up_read(&hmm->mirrors_sem);
+			ret = -EAGAIN;
+			goto out;
+		}
+	}
+	up_read(&hmm->mirrors_sem);
+
+out:
 	hmm_put(hmm);
 	return ret;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
-			const struct mmu_notifier_range *range)
+			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(range->mm);
-	struct hmm_update update;
+	struct hmm *hmm = mm_get_hmm(nrange->mm);
 
 	VM_BUG_ON(!hmm);
 
-	update.start = range->start;
-	update.end = range->end;
-	update.event = HMM_UPDATE_INVALIDATE;
-	update.blockable = true;
-	hmm_invalidate_range(hmm, false, &update);
+	mutex_lock(&hmm->lock);
+	hmm->notifiers--;
+	if (!hmm->notifiers) {
+		struct hmm_range *range;
+
+		list_for_each_entry(range, &hmm->ranges, list) {
+			if (range->valid)
+				continue;
+			range->valid = true;
+		}
+		wake_up_all(&hmm->wq);
+	}
+	mutex_unlock(&hmm->lock);
+
 	hmm_put(hmm);
 }
 
@@ -405,7 +412,6 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 {
 	struct hmm_range *range = hmm_vma_walk->range;
 
-	*fault = *write_fault = false;
 	if (!hmm_vma_walk->fault)
 		return;
 
@@ -444,10 +450,11 @@ static void hmm_range_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 		return;
 	}
 
+	*fault = *write_fault = false;
 	for (i = 0; i < npages; ++i) {
 		hmm_pte_need_fault(hmm_vma_walk, pfns[i], cpu_flags,
 				   fault, write_fault);
-		if ((*fault) || (*write_fault))
+		if ((*write_fault))
 			return;
 	}
 }
@@ -702,162 +709,152 @@ static void hmm_pfns_special(struct hmm_range *range)
 }
 
 /*
- * hmm_range_snapshot() - snapshot CPU page table for a range
+ * hmm_range_register() - start tracking change to CPU page table over a range
  * @range: range
- * Returns: number of valid pages in range->pfns[] (from range start
- *          address). This may be zero. If the return value is negative,
- *          then one of the following values may be returned:
+ * @mm: the mm struct for the range of virtual address
+ * @start: start virtual address (inclusive)
+ * @end: end virtual address (exclusive)
+ * Returns 0 on success, -EFAULT if the address space is no longer valid
  *
- *           -EINVAL  invalid arguments or mm or virtual address are in an
- *                    invalid vma (ie either hugetlbfs or device file vma).
- *           -EPERM   For example, asking for write, when the range is
- *                    read-only
- *           -EAGAIN  Caller needs to retry
- *           -EFAULT  Either no valid vma exists for this range, or it is
- *                    illegal to access the range
- *
- * This snapshots the CPU page table for a range of virtual addresses. Snapshot
- * validity is tracked by range struct. See hmm_vma_range_done() for further
- * information.
+ * Track updates to the CPU page table see include/linux/hmm.h
  */
-long hmm_range_snapshot(struct hmm_range *range)
+int hmm_range_register(struct hmm_range *range,
+		       struct mm_struct *mm,
+		       unsigned long start,
+		       unsigned long end)
 {
-	struct vm_area_struct *vma = range->vma;
-	struct hmm_vma_walk hmm_vma_walk;
-	struct mm_walk mm_walk;
-	struct hmm *hmm;
-
+	range->start = start & PAGE_MASK;
+	range->end = end & PAGE_MASK;
+	range->valid = false;
 	range->hmm = NULL;
 
-	/* Sanity check, this really should not happen ! */
-	if (range->start < vma->vm_start || range->start >= vma->vm_end)
-		return -EINVAL;
-	if (range->end < vma->vm_start || range->end > vma->vm_end)
+	if (range->start >= range->end)
 		return -EINVAL;
 
-	hmm = hmm_register(vma->vm_mm);
-	if (!hmm)
-		return -ENOMEM;
+	range->hmm = hmm_register(mm);
+	if (!range->hmm)
+		return -EFAULT;
 
 	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL) {
-		hmm_put(hmm);
-		return -EINVAL;
+	if (range->hmm->mm == NULL || range->hmm->dead) {
+		hmm_put(range->hmm);
+		return -EFAULT;
 	}
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
-			vma_is_dax(vma)) {
-		hmm_pfns_special(range);
-		hmm_put(hmm);
-		return -EINVAL;
-	}
+	/* Initialize range to track CPU page table update */
+	mutex_lock(&range->hmm->lock);
 
-	if (!(vma->vm_flags & VM_READ)) {
-		/*
-		 * If vma do not allow read access, then assume that it does
-		 * not allow write access, either. Architecture that allow
-		 * write without read access are not supported by HMM, because
-		 * operations such has atomic access would not work.
-		 */
-		hmm_pfns_clear(range, range->pfns, range->start, range->end);
-		hmm_put(hmm);
-		return -EPERM;
-	}
+	list_add_rcu(&range->list, &range->hmm->ranges);
 
-	/* Initialize range to track CPU page table update */
-	spin_lock(&hmm->lock);
-	range->valid = true;
-	list_add_rcu(&range->list, &hmm->ranges);
-	spin_unlock(&hmm->lock);
-
-	hmm_vma_walk.fault = false;
-	hmm_vma_walk.range = range;
-	mm_walk.private = &hmm_vma_walk;
-	hmm_vma_walk.last = range->start;
-
-	mm_walk.vma = vma;
-	mm_walk.mm = vma->vm_mm;
-	mm_walk.pte_entry = NULL;
-	mm_walk.test_walk = NULL;
-	mm_walk.hugetlb_entry = NULL;
-	mm_walk.pmd_entry = hmm_vma_walk_pmd;
-	mm_walk.pte_hole = hmm_vma_walk_hole;
-
-	walk_page_range(range->start, range->end, &mm_walk);
 	/*
-	 * Transfer hmm reference to the range struct it will be drop inside
-	 * the hmm_vma_range_done() function (which _must_ be call if this
-	 * function return 0).
+	 * If there are any concurrent notifiers we have to wait for them for
+	 * the range to be valid (see hmm_range_wait_until_valid()).
 	 */
-	range->hmm = hmm;
-	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+	if (!range->hmm->notifiers)
+		range->valid = true;
+	mutex_unlock(&range->hmm->lock);
+
+	return 0;
 }
-EXPORT_SYMBOL(hmm_range_snapshot);
+EXPORT_SYMBOL(hmm_range_register);
 
 /*
- * hmm_vma_range_done() - stop tracking change to CPU page table over a range
- * @range: range being tracked
- * Returns: false if range data has been invalidated, true otherwise
+ * hmm_range_unregister() - stop tracking change to CPU page table over a range
+ * @range: range
  *
  * Range struct is used to track updates to the CPU page table after a call to
- * either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is done
- * using the data,  or wants to lock updates to the data it got from those
- * functions, it must call the hmm_vma_range_done() function, which will then
- * stop tracking CPU page table updates.
- *
- * Note that device driver must still implement general CPU page table update
- * tracking either by using hmm_mirror (see hmm_mirror_register()) or by using
- * the mmu_notifier API directly.
- *
- * CPU page table update tracking done through hmm_range is only temporary and
- * to be used while trying to duplicate CPU page table contents for a range of
- * virtual addresses.
- *
- * There are two ways to use this :
- * again:
- *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
- *   trans = device_build_page_table_update_transaction(pfns);
- *   device_page_table_lock();
- *   if (!hmm_vma_range_done(range)) {
- *     device_page_table_unlock();
- *     goto again;
- *   }
- *   device_commit_transaction(trans);
- *   device_page_table_unlock();
- *
- * Or:
- *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
- *   device_page_table_lock();
- *   hmm_vma_range_done(range);
- *   device_update_page_table(range->pfns);
- *   device_page_table_unlock();
+ * hmm_range_register(). See include/linux/hmm.h for how to use it.
  */
-bool hmm_vma_range_done(struct hmm_range *range)
+void hmm_range_unregister(struct hmm_range *range)
 {
-	bool ret = false;
-
 	/* Sanity check this really should not happen. */
-	if (range->hmm == NULL || range->end <= range->start) {
-		BUG();
-		return false;
-	}
+	if (range->hmm == NULL || range->end <= range->start)
+		return;
 
-	spin_lock(&range->hmm->lock);
+	mutex_lock(&range->hmm->lock);
 	list_del_rcu(&range->list);
-	ret = range->valid;
-	spin_unlock(&range->hmm->lock);
-
-	/* Is the mm still alive ? */
-	if (range->hmm->mm == NULL)
-		ret = false;
+	mutex_unlock(&range->hmm->lock);
 
-	/* Drop reference taken by hmm_vma_fault() or hmm_vma_get_pfns() */
+	/* Drop reference taken by hmm_range_register() */
+	range->valid = false;
 	hmm_put(range->hmm);
 	range->hmm = NULL;
-	return ret;
 }
-EXPORT_SYMBOL(hmm_vma_range_done);
+EXPORT_SYMBOL(hmm_range_unregister);
+
+/*
+ * hmm_range_snapshot() - snapshot CPU page table for a range
+ * @range: range
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
+ *          permission (for instance asking for write and range is read only),
+ *          -EAGAIN if you need to retry, -EFAULT invalid (ie either no valid
+ *          vma or it is illegal to access that range), number of valid pages
+ *          in range->pfns[] (from range start address).
+ *
+ * This snapshots the CPU page table for a range of virtual addresses. Snapshot
+ * validity is tracked by the range struct. See include/linux/hmm.h for an
+ * example of how to use it.
+ */
+long hmm_range_snapshot(struct hmm_range *range)
+{
+	unsigned long start = range->start, end;
+	struct hmm_vma_walk hmm_vma_walk;
+	struct hmm *hmm = range->hmm;
+	struct vm_area_struct *vma;
+	struct mm_walk mm_walk;
+
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL || hmm->dead)
+		return -EFAULT;
+
+	do {
+		/* If range is no longer valid force retry. */
+		if (!range->valid)
+			return -EAGAIN;
+
+		vma = find_vma(hmm->mm, start);
+		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
+			return -EFAULT;
+
+		/* FIXME support hugetlb fs/dax */
+		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
+			hmm_pfns_special(range);
+			return -EINVAL;
+		}
+
+		if (!(vma->vm_flags & VM_READ)) {
+			/*
+			 * If the vma does not allow read access, then assume
+			 * that it does not allow write access either. HMM does
+			 * not support architectures that allow write without read.
+			 */
+			hmm_pfns_clear(range, range->pfns,
+				range->start, range->end);
+			return -EPERM;
+		}
+
+		range->vma = vma;
+		hmm_vma_walk.last = start;
+		hmm_vma_walk.fault = false;
+		hmm_vma_walk.range = range;
+		mm_walk.private = &hmm_vma_walk;
+		end = min(range->end, vma->vm_end);
+
+		mm_walk.vma = vma;
+		mm_walk.mm = vma->vm_mm;
+		mm_walk.pte_entry = NULL;
+		mm_walk.test_walk = NULL;
+		mm_walk.hugetlb_entry = NULL;
+		mm_walk.pmd_entry = hmm_vma_walk_pmd;
+		mm_walk.pte_hole = hmm_vma_walk_hole;
+
+		walk_page_range(start, end, &mm_walk);
+		start = end;
+	} while (start < range->end);
+
+	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+}
+EXPORT_SYMBOL(hmm_range_snapshot);
 
 /*
  * hmm_range_fault() - try to fault some address in a virtual address range
@@ -889,96 +886,79 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  */
 long hmm_range_fault(struct hmm_range *range, bool block)
 {
-	struct vm_area_struct *vma = range->vma;
-	unsigned long start = range->start;
+	unsigned long start = range->start, end;
 	struct hmm_vma_walk hmm_vma_walk;
+	struct hmm *hmm = range->hmm;
+	struct vm_area_struct *vma;
 	struct mm_walk mm_walk;
-	struct hmm *hmm;
 	int ret;
 
-	range->hmm = NULL;
-
-	/* Sanity check, this really should not happen ! */
-	if (range->start < vma->vm_start || range->start >= vma->vm_end)
-		return -EINVAL;
-	if (range->end < vma->vm_start || range->end > vma->vm_end)
-		return -EINVAL;
+	/* Check if hmm_mm_destroy() was call. */
+	if (hmm->mm == NULL || hmm->dead)
+		return -EFAULT;
 
-	hmm = hmm_register(vma->vm_mm);
-	if (!hmm) {
-		hmm_pfns_clear(range, range->pfns, range->start, range->end);
-		return -ENOMEM;
-	}
+	do {
+		/* If range is no longer valid force retry. */
+		if (!range->valid) {
+			up_read(&hmm->mm->mmap_sem);
+			return -EAGAIN;
+		}
 
-	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL) {
-		hmm_put(hmm);
-		return -EINVAL;
-	}
+		vma = find_vma(hmm->mm, start);
+		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
+			return -EFAULT;
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
-			vma_is_dax(vma)) {
-		hmm_pfns_special(range);
-		hmm_put(hmm);
-		return -EINVAL;
-	}
+		/* FIXME support hugetlb fs/dax */
+		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
+			hmm_pfns_special(range);
+			return -EINVAL;
+		}
 
-	if (!(vma->vm_flags & VM_READ)) {
-		/*
-		 * If vma do not allow read access, then assume that it does
-		 * not allow write access, either. Architecture that allow
-		 * write without read access are not supported by HMM, because
-		 * operations such has atomic access would not work.
-		 */
-		hmm_pfns_clear(range, range->pfns, range->start, range->end);
-		hmm_put(hmm);
-		return -EPERM;
-	}
+		if (!(vma->vm_flags & VM_READ)) {
+			/*
+			 * If the vma does not allow read access, then assume
+			 * that it does not allow write access either. HMM does
+			 * not support architectures that allow write without read.
+			 */
+			hmm_pfns_clear(range, range->pfns,
+				range->start, range->end);
+			return -EPERM;
+		}
 
-	/* Initialize range to track CPU page table update */
-	spin_lock(&hmm->lock);
-	range->valid = true;
-	list_add_rcu(&range->list, &hmm->ranges);
-	spin_unlock(&hmm->lock);
-
-	hmm_vma_walk.fault = true;
-	hmm_vma_walk.block = block;
-	hmm_vma_walk.range = range;
-	mm_walk.private = &hmm_vma_walk;
-	hmm_vma_walk.last = range->start;
-
-	mm_walk.vma = vma;
-	mm_walk.mm = vma->vm_mm;
-	mm_walk.pte_entry = NULL;
-	mm_walk.test_walk = NULL;
-	mm_walk.hugetlb_entry = NULL;
-	mm_walk.pmd_entry = hmm_vma_walk_pmd;
-	mm_walk.pte_hole = hmm_vma_walk_hole;
+		range->vma = vma;
+		hmm_vma_walk.last = start;
+		hmm_vma_walk.fault = true;
+		hmm_vma_walk.block = block;
+		hmm_vma_walk.range = range;
+		mm_walk.private = &hmm_vma_walk;
+		end = min(range->end, vma->vm_end);
+
+		mm_walk.vma = vma;
+		mm_walk.mm = vma->vm_mm;
+		mm_walk.pte_entry = NULL;
+		mm_walk.test_walk = NULL;
+		mm_walk.hugetlb_entry = NULL;
+		mm_walk.pmd_entry = hmm_vma_walk_pmd;
+		mm_walk.pte_hole = hmm_vma_walk_hole;
+
+		do {
+			ret = walk_page_range(start, end, &mm_walk);
+			start = hmm_vma_walk.last;
+
+			/* Keep trying while the range is valid. */
+		} while (ret == -EBUSY && range->valid);
+
+		if (ret) {
+			unsigned long i;
+
+			i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+			hmm_pfns_clear(range, &range->pfns[i],
+				hmm_vma_walk.last, range->end);
+			return ret;
+		}
+		start = end;
 
-	do {
-		ret = walk_page_range(start, range->end, &mm_walk);
-		start = hmm_vma_walk.last;
-		/* Keep trying while the range is valid. */
-	} while (ret == -EBUSY && range->valid);
-
-	if (ret) {
-		unsigned long i;
-
-		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
-			       range->end);
-		hmm_vma_range_done(range);
-		hmm_put(hmm);
-		return ret;
-	} else {
-		/*
-		 * Transfer hmm reference to the range struct it will be drop
-		 * inside the hmm_vma_range_done() function (which _must_ be
-		 * call if this function return 0).
-		 */
-		range->hmm = hmm;
-	}
+	} while (start < range->end);
 
 	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
                   ` (5 preceding siblings ...)
  2019-03-25 14:40 ` [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2 jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-28 21:59   ` John Hubbard
  2019-03-25 14:40 ` [PATCH v2 08/11] mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping) v2 jglisse
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	John Hubbard, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

The HMM mirror API can be used in two fashions. In the first one, the HMM
user coalesces multiple page faults into one request and sets flags per pfn
for each of those faults. In the second one, the HMM user wants to pre-fault
a range with specific flags. For the latter it is a waste to have the user
pre-fill the pfn array with a default flags value.

This patch adds a default flags value allowing the user to set flags for a
whole range without having to pre-fill the pfn array.
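
For the pre-fault case a driver can now do something along these lines
(a minimal sketch only; the drv_hmm_flags/drv_hmm_values tables, the pfns
array and the mm/start/end variables are assumptions of the example, not
part of this patch):

    struct hmm_range range;
    long ret;

    range.start = start;
    range.end = end;
    range.pfns = pfns;                 /* no need to pre-fill this anymore */
    range.flags = drv_hmm_flags;       /* hypothetical driver flags table */
    range.values = drv_hmm_values;     /* hypothetical driver values table */
    range.pfn_shift = DRV_PFN_SHIFT;   /* hypothetical */
    /* Fault every page of the range with read and write permission ... */
    range.default_flags = drv_hmm_flags[HMM_PFN_VALID] |
                          drv_hmm_flags[HMM_PFN_WRITE];
    /* ... and ignore whatever is in pfns[]; only default_flags matter. */
    range.pfn_flags_mask = 0;

    ret = hmm_range_register(&range, mm, range.start, range.end);
    if (ret)
        return ret;
    /* take the mmap_sem, wait for the range to be valid, then: */
    ret = hmm_range_fault(&range, true);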

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/hmm.h |  7 +++++++
 mm/hmm.c            | 12 ++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 79671036cb5f..13bc2c72f791 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
  * @pfns: array of pfns (big enough for the range)
  * @flags: pfn flags to match device driver page table
  * @values: pfn value for some special case (none, special, error, ...)
+ * @default_flags: default flags for the range (write, read, ...)
+ * @pfn_flags_mask: allows masking pfn flags so that only default_flags matter
  * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
@@ -177,6 +179,8 @@ struct hmm_range {
 	uint64_t		*pfns;
 	const uint64_t		*flags;
 	const uint64_t		*values;
+	uint64_t		default_flags;
+	uint64_t		pfn_flags_mask;
 	uint8_t			pfn_shift;
 	bool			valid;
 };
@@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 {
 	long ret;
 
+	range->default_flags = 0;
+	range->pfn_flags_mask = -1UL;
+
 	ret = hmm_range_register(range, range->vma->vm_mm,
 				 range->start, range->end);
 	if (ret)
diff --git a/mm/hmm.c b/mm/hmm.c
index fa9498eeb9b6..4fe88a196d17 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 	if (!hmm_vma_walk->fault)
 		return;
 
+	/*
+	 * So we not only consider the individual per page request, we also
+	 * consider the default flags requested for the range. The API can
+	 * be used in two fashions. In the first one the HMM user coalesces
+	 * multiple page faults into one request and sets flags per pfn for
+	 * each of those faults. In the second one the HMM user wants to
+	 * pre-fault a range with specific flags. For the latter it is a
+	 * waste to have the user pre-fill the pfn array with a default
+	 * flags value.
+	 */
+	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
+
 	/* We aren't ask to do anything ... */
 	if (!(pfns & range->flags[HMM_PFN_VALID]))
 		return;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 08/11] mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping) v2
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
                   ` (6 preceding siblings ...)
  2019-03-25 14:40 ` [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-28 16:53   ` Ira Weiny
  2019-03-25 14:40 ` [PATCH v2 09/11] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem v2 jglisse
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	John Hubbard, Dan Williams, Arnd Bergmann

From: Jérôme Glisse <jglisse@redhat.com>

HMM mirror is a device driver helper to mirror a range of virtual addresses.
It means that the process jobs running on the device can access the same
virtual addresses as the CPU threads of that process. This patch adds support
for hugetlbfs mappings (ie ranges of virtual addresses that are an mmap of a
hugetlbfs file).
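
With this change a driver that knows the vma is backed by hugetlbfs can ask
for one pfns[] entry per huge page by registering the range with the huge
page shift, for example (a sketch; PMD_SHIFT assumes 2MB huge pages, and
passing PAGE_SHIFT keeps the old one-entry-per-4KB-page behaviour):

    /* One pfns[] entry per 2MB huge page instead of one per 4KB page. */
    ret = hmm_range_register(&range, mm, start, end, PMD_SHIFT);
    if (ret)
        return ret;
    /* take the mmap_sem, wait for the range to be valid, then: */
    ret = hmm_range_snapshot(&range);   /* or hmm_range_fault(&range, true) */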

Changes since v1:
    - improved commit message
    - squashed: Arnd Bergmann: fix unused variable warnings

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
---
 include/linux/hmm.h |  29 ++++++++--
 mm/hmm.c            | 126 +++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 138 insertions(+), 17 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 13bc2c72f791..f3b919b04eda 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -181,10 +181,31 @@ struct hmm_range {
 	const uint64_t		*values;
 	uint64_t		default_flags;
 	uint64_t		pfn_flags_mask;
+	uint8_t			page_shift;
 	uint8_t			pfn_shift;
 	bool			valid;
 };
 
+/*
+ * hmm_range_page_shift() - return the page shift for the range
+ * @range: range being queried
+ * Returns: page shift (page size = 1 << page shift) for the range
+ */
+static inline unsigned hmm_range_page_shift(const struct hmm_range *range)
+{
+	return range->page_shift;
+}
+
+/*
+ * hmm_range_page_size() - return the page size for the range
+ * @range: range being queried
+ * Returns: page size for the range in bytes
+ */
+static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
+{
+	return 1UL << hmm_range_page_shift(range);
+}
+
 /*
  * hmm_range_wait_until_valid() - wait for range to be valid
  * @range: range affected by invalidation to wait on
@@ -438,7 +459,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  *          struct hmm_range range;
  *          ...
  *
- *          ret = hmm_range_register(&range, mm, start, end);
+ *          ret = hmm_range_register(&range, mm, start, end, page_shift);
  *          if (ret)
  *              return ret;
  *
@@ -498,7 +519,8 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 int hmm_range_register(struct hmm_range *range,
 		       struct mm_struct *mm,
 		       unsigned long start,
-		       unsigned long end);
+		       unsigned long end,
+		       unsigned page_shift);
 void hmm_range_unregister(struct hmm_range *range);
 long hmm_range_snapshot(struct hmm_range *range);
 long hmm_range_fault(struct hmm_range *range, bool block);
@@ -529,7 +551,8 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	range->pfn_flags_mask = -1UL;
 
 	ret = hmm_range_register(range, range->vma->vm_mm,
-				 range->start, range->end);
+				 range->start, range->end,
+				 PAGE_SHIFT);
 	if (ret)
 		return (int)ret;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 4fe88a196d17..64a33770813b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -387,11 +387,13 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	uint64_t *pfns = range->pfns;
-	unsigned long i;
+	unsigned long i, page_size;
 
 	hmm_vma_walk->last = addr;
-	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++) {
+	page_size = 1UL << range->page_shift;
+	i = (addr - range->start) >> range->page_shift;
+
+	for (; addr < end; addr += page_size, i++) {
 		pfns[i] = range->values[HMM_PFN_NONE];
 		if (fault || write_fault) {
 			int ret;
@@ -703,6 +705,69 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
+static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
+				      unsigned long start, unsigned long end,
+				      struct mm_walk *walk)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	unsigned long addr = start, i, pfn, mask, size, pfn_inc;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	struct vm_area_struct *vma = walk->vma;
+	struct hstate *h = hstate_vma(vma);
+	uint64_t orig_pfn, cpu_flags;
+	bool fault, write_fault;
+	spinlock_t *ptl;
+	pte_t entry;
+	int ret = 0;
+
+	size = 1UL << huge_page_shift(h);
+	mask = size - 1;
+	if (range->page_shift != PAGE_SHIFT) {
+		/* Make sure we are looking at full page. */
+		if (start & mask)
+			return -EINVAL;
+		if (end < (start + size))
+			return -EINVAL;
+		pfn_inc = size >> PAGE_SHIFT;
+	} else {
+		pfn_inc = 1;
+		size = PAGE_SIZE;
+	}
+
+
+	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
+	entry = huge_ptep_get(pte);
+
+	i = (start - range->start) >> range->page_shift;
+	orig_pfn = range->pfns[i];
+	range->pfns[i] = range->values[HMM_PFN_NONE];
+	cpu_flags = pte_to_hmm_pfn_flags(range, entry);
+	fault = write_fault = false;
+	hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
+			   &fault, &write_fault);
+	if (fault || write_fault) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	pfn = pte_pfn(entry) + (start & mask);
+	for (; addr < end; addr += size, i++, pfn += pfn_inc)
+		range->pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
+	hmm_vma_walk->last = end;
+
+unlock:
+	spin_unlock(ptl);
+
+	if (ret == -ENOENT)
+		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
+
+	return ret;
+#else /* CONFIG_HUGETLB_PAGE */
+	return -EINVAL;
+#endif
+}
+
 static void hmm_pfns_clear(struct hmm_range *range,
 			   uint64_t *pfns,
 			   unsigned long addr,
@@ -726,6 +791,7 @@ static void hmm_pfns_special(struct hmm_range *range)
  * @mm: the mm struct for the range of virtual address
  * @start: start virtual address (inclusive)
  * @end: end virtual address (exclusive)
+ * @page_shift: expected page shift for the range
  * Returns 0 on success, -EFAULT if the address space is no longer valid
  *
  * Track updates to the CPU page table see include/linux/hmm.h
@@ -733,16 +799,23 @@ static void hmm_pfns_special(struct hmm_range *range)
 int hmm_range_register(struct hmm_range *range,
 		       struct mm_struct *mm,
 		       unsigned long start,
-		       unsigned long end)
+		       unsigned long end,
+		       unsigned page_shift)
 {
-	range->start = start & PAGE_MASK;
-	range->end = end & PAGE_MASK;
+	unsigned long mask = ((1UL << page_shift) - 1UL);
+
 	range->valid = false;
 	range->hmm = NULL;
 
-	if (range->start >= range->end)
+	if ((start & mask) || (end & mask))
+		return -EINVAL;
+	if (start >= end)
 		return -EINVAL;
 
+	range->page_shift = page_shift;
+	range->start = start;
+	range->end = end;
+
 	range->hmm = hmm_register(mm);
 	if (!range->hmm)
 		return -EFAULT;
@@ -809,6 +882,7 @@ EXPORT_SYMBOL(hmm_range_unregister);
  */
 long hmm_range_snapshot(struct hmm_range *range)
 {
+	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
 	unsigned long start = range->start, end;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct hmm *hmm = range->hmm;
@@ -825,15 +899,26 @@ long hmm_range_snapshot(struct hmm_range *range)
 			return -EAGAIN;
 
 		vma = find_vma(hmm->mm, start);
-		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
+		if (vma == NULL || (vma->vm_flags & device_vma))
 			return -EFAULT;
 
-		/* FIXME support hugetlb fs/dax */
-		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
+		/* FIXME support dax */
+		if (vma_is_dax(vma)) {
 			hmm_pfns_special(range);
 			return -EINVAL;
 		}
 
+		if (is_vm_hugetlb_page(vma)) {
+			struct hstate *h = hstate_vma(vma);
+
+			if (huge_page_shift(h) != range->page_shift &&
+			    range->page_shift != PAGE_SHIFT)
+				return -EINVAL;
+		} else {
+			if (range->page_shift != PAGE_SHIFT)
+				return -EINVAL;
+		}
+
 		if (!(vma->vm_flags & VM_READ)) {
 			/*
 			 * If vma do not allow read access, then assume that it
@@ -859,6 +944,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 		mm_walk.hugetlb_entry = NULL;
 		mm_walk.pmd_entry = hmm_vma_walk_pmd;
 		mm_walk.pte_hole = hmm_vma_walk_hole;
+		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
 
 		walk_page_range(start, end, &mm_walk);
 		start = end;
@@ -877,7 +963,7 @@ EXPORT_SYMBOL(hmm_range_snapshot);
  *          then one of the following values may be returned:
  *
  *           -EINVAL  invalid arguments or mm or virtual address are in an
- *                    invalid vma (ie either hugetlbfs or device file vma).
+ *                    invalid vma (for instance device file vma).
  *           -ENOMEM: Out of memory.
  *           -EPERM:  Invalid permission (for instance asking for write and
  *                    range is read only).
@@ -898,6 +984,7 @@ EXPORT_SYMBOL(hmm_range_snapshot);
  */
 long hmm_range_fault(struct hmm_range *range, bool block)
 {
+	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
 	unsigned long start = range->start, end;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct hmm *hmm = range->hmm;
@@ -917,15 +1004,25 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		}
 
 		vma = find_vma(hmm->mm, start);
-		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
+		if (vma == NULL || (vma->vm_flags & device_vma))
 			return -EFAULT;
 
-		/* FIXME support hugetlb fs/dax */
-		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
+		/* FIXME support dax */
+		if (vma_is_dax(vma)) {
 			hmm_pfns_special(range);
 			return -EINVAL;
 		}
 
+		if (is_vm_hugetlb_page(vma)) {
+			if (huge_page_shift(hstate_vma(vma)) !=
+			    range->page_shift &&
+			    range->page_shift != PAGE_SHIFT)
+				return -EINVAL;
+		} else {
+			if (range->page_shift != PAGE_SHIFT)
+				return -EINVAL;
+		}
+
 		if (!(vma->vm_flags & VM_READ)) {
 			/*
 			 * If vma do not allow read access, then assume that it
@@ -952,6 +1049,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		mm_walk.hugetlb_entry = NULL;
 		mm_walk.pmd_entry = hmm_vma_walk_pmd;
 		mm_walk.pte_hole = hmm_vma_walk_hole;
+		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
 
 		do {
 			ret = walk_page_range(start, end, &mm_walk);
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 09/11] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem v2
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
                   ` (7 preceding siblings ...)
  2019-03-25 14:40 ` [PATCH v2 08/11] mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping) v2 jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-28 18:04   ` Ira Weiny
  2019-03-25 14:40 ` [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2 jglisse
  2019-03-25 14:40 ` [PATCH v2 11/11] mm/hmm: add an helper function that fault pages and map them to a device v2 jglisse
  10 siblings, 1 reply; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Dan Williams, John Hubbard, Arnd Bergmann

From: Jérôme Glisse <jglisse@redhat.com>

HMM mirror is a device driver helper to mirror a range of virtual addresses.
It means that the process jobs running on the device can access the same
virtual addresses as the CPU threads of that process. This patch adds support
for mirroring mappings of files that are on a DAX block device (ie ranges of
virtual addresses that are an mmap of a file in a filesystem on a DAX block
device). There is no reason not to support such a case when mirroring virtual
addresses on a device.

Note that unlike the GUP code we do not take a page reference, hence when we
back off we have nothing to undo.
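
In practice nothing special is needed on the driver side: the same flow
used for anonymous memory now works when the range covers an mmap of a file
on a DAX filesystem, instead of the range being rejected with -EINVAL and
HMM_PFN_SPECIAL entries (a sketch, same assumptions as the usage example in
include/linux/hmm.h):

    /* The vma covering [start, end) is an mmap of a file on a DAX fs. */
    ret = hmm_range_register(&range, mm, start, end, PAGE_SHIFT);
    if (ret)
        return ret;
    ret = hmm_range_fault(&range, true);   /* or hmm_range_snapshot(&range) */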

Changes since v1:
    - improved commit message
    - squashed: Arnd Bergmann: fix unused variable warning in hmm_vma_walk_pud

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
---
 mm/hmm.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 111 insertions(+), 21 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 64a33770813b..ce33151c6832 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -325,6 +325,7 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
 
 struct hmm_vma_walk {
 	struct hmm_range	*range;
+	struct dev_pagemap	*pgmap;
 	unsigned long		last;
 	bool			fault;
 	bool			block;
@@ -499,6 +500,15 @@ static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
 				range->flags[HMM_PFN_VALID];
 }
 
+static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud)
+{
+	if (!pud_present(pud))
+		return 0;
+	return pud_write(pud) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
+}
+
 static int hmm_vma_handle_pmd(struct mm_walk *walk,
 			      unsigned long addr,
 			      unsigned long end,
@@ -520,8 +530,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 
 	pfn = pmd_pfn(pmd) + pte_index(addr);
-	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
+	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+		if (pmd_devmap(pmd)) {
+			hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
+					      hmm_vma_walk->pgmap);
+			if (unlikely(!hmm_vma_walk->pgmap))
+				return -EBUSY;
+		}
 		pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
+	}
+	if (hmm_vma_walk->pgmap) {
+		put_dev_pagemap(hmm_vma_walk->pgmap);
+		hmm_vma_walk->pgmap = NULL;
+	}
 	hmm_vma_walk->last = end;
 	return 0;
 }
@@ -608,10 +629,24 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	if (fault || write_fault)
 		goto fault;
 
+	if (pte_devmap(pte)) {
+		hmm_vma_walk->pgmap = get_dev_pagemap(pte_pfn(pte),
+					      hmm_vma_walk->pgmap);
+		if (unlikely(!hmm_vma_walk->pgmap))
+			return -EBUSY;
+	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
+		*pfn = range->values[HMM_PFN_SPECIAL];
+		return -EFAULT;
+	}
+
 	*pfn = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;
 	return 0;
 
 fault:
+	if (hmm_vma_walk->pgmap) {
+		put_dev_pagemap(hmm_vma_walk->pgmap);
+		hmm_vma_walk->pgmap = NULL;
+	}
 	pte_unmap(ptep);
 	/* Fault any virtual address we were asked to fault */
 	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
@@ -699,12 +734,83 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			return r;
 		}
 	}
+	if (hmm_vma_walk->pgmap) {
+		put_dev_pagemap(hmm_vma_walk->pgmap);
+		hmm_vma_walk->pgmap = NULL;
+	}
 	pte_unmap(ptep - 1);
 
 	hmm_vma_walk->last = addr;
 	return 0;
 }
 
+static int hmm_vma_walk_pud(pud_t *pudp,
+			    unsigned long start,
+			    unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	unsigned long addr = start, next;
+	pmd_t *pmdp;
+	pud_t pud;
+	int ret;
+
+again:
+	pud = READ_ONCE(*pudp);
+	if (pud_none(pud))
+		return hmm_vma_walk_hole(start, end, walk);
+
+	if (pud_huge(pud) && pud_devmap(pud)) {
+		unsigned long i, npages, pfn;
+		uint64_t *pfns, cpu_flags;
+		bool fault, write_fault;
+
+		if (!pud_present(pud))
+			return hmm_vma_walk_hole(start, end, walk);
+
+		i = (addr - range->start) >> PAGE_SHIFT;
+		npages = (end - addr) >> PAGE_SHIFT;
+		pfns = &range->pfns[i];
+
+		cpu_flags = pud_to_hmm_pfn_flags(range, pud);
+		hmm_range_need_fault(hmm_vma_walk, pfns, npages,
+				     cpu_flags, &fault, &write_fault);
+		if (fault || write_fault)
+			return hmm_vma_walk_hole_(addr, end, fault,
+						write_fault, walk);
+
+		pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+		for (i = 0; i < npages; ++i, ++pfn) {
+			hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
+					      hmm_vma_walk->pgmap);
+			if (unlikely(!hmm_vma_walk->pgmap))
+				return -EBUSY;
+			pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
+		}
+		if (hmm_vma_walk->pgmap) {
+			put_dev_pagemap(hmm_vma_walk->pgmap);
+			hmm_vma_walk->pgmap = NULL;
+		}
+		hmm_vma_walk->last = end;
+		return 0;
+	}
+
+	split_huge_pud(walk->vma, pudp, addr);
+	if (pud_none(*pudp))
+		goto again;
+
+	pmdp = pmd_offset(pudp, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		ret = hmm_vma_walk_pmd(pmdp, addr, next, walk);
+		if (ret)
+			return ret;
+	} while (pmdp++, addr = next, addr != end);
+
+	return 0;
+}
+
 static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 				      unsigned long start, unsigned long end,
 				      struct mm_walk *walk)
@@ -777,14 +883,6 @@ static void hmm_pfns_clear(struct hmm_range *range,
 		*pfns = range->values[HMM_PFN_NONE];
 }
 
-static void hmm_pfns_special(struct hmm_range *range)
-{
-	unsigned long addr = range->start, i = 0;
-
-	for (; addr < range->end; addr += PAGE_SIZE, i++)
-		range->pfns[i] = range->values[HMM_PFN_SPECIAL];
-}
-
 /*
  * hmm_range_register() - start tracking change to CPU page table over a range
  * @range: range
@@ -902,12 +1000,6 @@ long hmm_range_snapshot(struct hmm_range *range)
 		if (vma == NULL || (vma->vm_flags & device_vma))
 			return -EFAULT;
 
-		/* FIXME support dax */
-		if (vma_is_dax(vma)) {
-			hmm_pfns_special(range);
-			return -EINVAL;
-		}
-
 		if (is_vm_hugetlb_page(vma)) {
 			struct hstate *h = hstate_vma(vma);
 
@@ -931,6 +1023,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 		}
 
 		range->vma = vma;
+		hmm_vma_walk.pgmap = NULL;
 		hmm_vma_walk.last = start;
 		hmm_vma_walk.fault = false;
 		hmm_vma_walk.range = range;
@@ -942,6 +1035,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 		mm_walk.pte_entry = NULL;
 		mm_walk.test_walk = NULL;
 		mm_walk.hugetlb_entry = NULL;
+		mm_walk.pud_entry = hmm_vma_walk_pud;
 		mm_walk.pmd_entry = hmm_vma_walk_pmd;
 		mm_walk.pte_hole = hmm_vma_walk_hole;
 		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
@@ -1007,12 +1101,6 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		if (vma == NULL || (vma->vm_flags & device_vma))
 			return -EFAULT;
 
-		/* FIXME support dax */
-		if (vma_is_dax(vma)) {
-			hmm_pfns_special(range);
-			return -EINVAL;
-		}
-
 		if (is_vm_hugetlb_page(vma)) {
 			if (huge_page_shift(hstate_vma(vma)) !=
 			    range->page_shift &&
@@ -1035,6 +1123,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		}
 
 		range->vma = vma;
+		hmm_vma_walk.pgmap = NULL;
 		hmm_vma_walk.last = start;
 		hmm_vma_walk.fault = true;
 		hmm_vma_walk.block = block;
@@ -1047,6 +1136,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 		mm_walk.pte_entry = NULL;
 		mm_walk.test_walk = NULL;
 		mm_walk.hugetlb_entry = NULL;
+		mm_walk.pud_entry = hmm_vma_walk_pud;
 		mm_walk.pmd_entry = hmm_vma_walk_pmd;
 		mm_walk.pte_hole = hmm_vma_walk_hole;
 		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
                   ` (8 preceding siblings ...)
  2019-03-25 14:40 ` [PATCH v2 09/11] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem v2 jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-03-28 20:54   ` John Hubbard
  2019-03-25 14:40 ` [PATCH v2 11/11] mm/hmm: add an helper function that fault pages and map them to a device v2 jglisse
  10 siblings, 1 reply; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	John Hubbard, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

The device driver context which holds a reference to the mirror, and thus to
the core hmm struct, might outlive the mm against which it was created. To
avoid every driver having to check for that case, provide a helper that checks
if the mm is still alive and takes the mmap_sem in read mode if so. If the
mm has been destroyed (the mmu_notifier release callback did happen) then
we return -EINVAL so that the calling code knows that it is trying to do
something against an mm that is no longer valid.
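
A typical caller would then look something like this (a sketch only; what
is done under the lock and the retry policy are up to the driver, the full
retry pattern for hmm_range_fault() is in the include/linux/hmm.h example):

    ret = hmm_mirror_mm_down_read(mirror);
    if (ret)
        return ret;     /* mm is dead, nothing left to mirror */

    ret = hmm_range_snapshot(&range);

    hmm_mirror_mm_up_read(mirror);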

Changes since v1:
    - removed bunch of useless check (if API is use with bogus argument
      better to fail loudly so user fix their code)

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f3b919b04eda..5f9deaeb9d77 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -438,6 +438,50 @@ struct hmm_mirror {
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
+/*
+ * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
+ * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
+ * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
+ *
+ * The device driver context which holds a reference to the mirror, and thus
+ * to the core hmm struct, might outlive the mm against which it was created.
+ * To avoid every driver having to check for that case, provide a helper that
+ * checks if the mm is still alive and takes the mmap_sem in read mode if so.
+ * If the mm has been destroyed (the mmu_notifier release callback did happen)
+ * then we return -EINVAL so that the calling code knows it is trying to do
+ * something against an mm that is no longer valid.
+ */
+static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
+{
+	struct mm_struct *mm;
+
+	/* Sanity check ... */
+	if (!mirror || !mirror->hmm)
+		return -EINVAL;
+	/*
+	 * Before trying to take the mmap_sem make sure the mm is still
+	 * alive as device driver context might outlive the mm lifetime.
+	 *
+	 * FIXME: should we also check for mm that outlive its owning
+	 * task ?
+	 */
+	mm = READ_ONCE(mirror->hmm->mm);
+	if (mirror->hmm->dead || !mm)
+		return -EINVAL;
+
+	down_read(&mm->mmap_sem);
+	return 0;
+}
+
+/*
+ * hmm_mirror_mm_up_read() - unlock the mmap_sem from read mode
+ * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
+ */
+static inline void hmm_mirror_mm_up_read(struct hmm_mirror *mirror)
+{
+	up_read(&mirror->hmm->mm->mmap_sem);
+}
+
 
 /*
  * To snapshot the CPU page table you first have to call hmm_range_register()
@@ -463,7 +507,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  *          if (ret)
  *              return ret;
  *
- *          down_read(mm->mmap_sem);
+ *          hmm_mirror_mm_down_read(mirror);
  *      again:
  *
  *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
@@ -476,13 +520,13 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  *
  *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
  *          if (ret == -EAGAIN) {
- *              down_read(mm->mmap_sem);
+ *              hmm_mirror_mm_down_read(mirror);
  *              goto again;
  *          } else if (ret == -EBUSY) {
  *              goto again;
  *          }
  *
- *          up_read(&mm->mmap_sem);
+ *          hmm_mirror_mm_up_read(mirror);
  *          if (ret) {
  *              hmm_range_unregister(range);
  *              return ret;
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v2 11/11] mm/hmm: add an helper function that fault pages and map them to a device v2
  2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
                   ` (9 preceding siblings ...)
  2019-03-25 14:40 ` [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2 jglisse
@ 2019-03-25 14:40 ` jglisse
  2019-04-01 11:59     ` Souptick Joarder
  10 siblings, 1 reply; 69+ messages in thread
From: jglisse @ 2019-03-25 14:40 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrew Morton,
	Ralph Campbell, John Hubbard, Dan Williams

From: Jérôme Glisse <jglisse@redhat.com>

This is an all in one helper that faults pages in a range and maps them to
a device so that every single device driver does not have to re-implement
this common pattern.

This is taken from ODP RDMA in preparation for the ODP RDMA conversion. It
will be used by nouveau and other drivers.
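
The expected usage is roughly the following (a sketch only; range
registration, waiting for the range to be valid and the retry loop follow
the usual pattern documented in include/linux/hmm.h, daddrs must hold one
dma_addr_t per page of the range, and passing NULL for the optional vma
argument is an assumption of the sketch):

    ret = hmm_range_dma_map(&range, device, daddrs, true);
    if (ret < 0)
        goto err;       /* -EAGAIN means mmap_sem was dropped, retry */

    /* ... program the device page table from daddrs[] ... */

    hmm_range_dma_unmap(&range, NULL, device, daddrs, true /* dirty */);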

Changes since v1:
    - improved commit message

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/hmm.h |   9 +++
 mm/hmm.c            | 152 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 161 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 5f9deaeb9d77..7aadf18b29cb 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -568,6 +568,15 @@ int hmm_range_register(struct hmm_range *range,
 void hmm_range_unregister(struct hmm_range *range);
 long hmm_range_snapshot(struct hmm_range *range);
 long hmm_range_fault(struct hmm_range *range, bool block);
+long hmm_range_dma_map(struct hmm_range *range,
+		       struct device *device,
+		       dma_addr_t *daddrs,
+		       bool block);
+long hmm_range_dma_unmap(struct hmm_range *range,
+			 struct vm_area_struct *vma,
+			 struct device *device,
+			 dma_addr_t *daddrs,
+			 bool dirty);
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
diff --git a/mm/hmm.c b/mm/hmm.c
index ce33151c6832..fd143251b157 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -30,6 +30,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memremap.h>
 #include <linux/jump_label.h>
+#include <linux/dma-mapping.h>
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
@@ -1163,6 +1164,157 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
 EXPORT_SYMBOL(hmm_range_fault);
+
+/*
+ * hmm_range_dma_map() - hmm_range_fault() and dma map page all in one.
+ * @range: range being faulted
+ * @device: device against which to dma map pages
+ * @daddrs: dma address of mapped pages
+ * @block: allow blocking on fault (if true it sleeps and does not drop mmap_sem)
+ * Returns: number of pages mapped on success, -EAGAIN if mmap_sem has been
+ *          dropped and you need to try again, some other error value otherwise
+ *
+ * Note same usage pattern as hmm_range_fault().
+ */
+long hmm_range_dma_map(struct hmm_range *range,
+		       struct device *device,
+		       dma_addr_t *daddrs,
+		       bool block)
+{
+	unsigned long i, npages, mapped;
+	long ret;
+
+	ret = hmm_range_fault(range, block);
+	if (ret <= 0)
+		return ret ? ret : -EBUSY;
+
+	npages = (range->end - range->start) >> PAGE_SHIFT;
+	for (i = 0, mapped = 0; i < npages; ++i) {
+		enum dma_data_direction dir = DMA_FROM_DEVICE;
+		struct page *page;
+
+		/*
+		 * FIXME need to update DMA API to provide invalid DMA address
+		 * value instead of a function to test dma address value. This
+		 * would remove lot of dumb code duplicated accross many arch.
+		 *
+		 * For now setting it to 0 here is good enough as the pfns[]
+		 * value is what is use to check what is valid and what isn't.
+		 */
+		daddrs[i] = 0;
+
+		page = hmm_pfn_to_page(range, range->pfns[i]);
+		if (page == NULL)
+			continue;
+
+		/* Check if range is being invalidated */
+		if (!range->valid) {
+			ret = -EBUSY;
+			goto unmap;
+		}
+
+		/* If it is read and write then map bi-directional. */
+		if (range->pfns[i] & range->values[HMM_PFN_WRITE])
+			dir = DMA_BIDIRECTIONAL;
+
+		daddrs[i] = dma_map_page(device, page, 0, PAGE_SIZE, dir);
+		if (dma_mapping_error(device, daddrs[i])) {
+			ret = -EFAULT;
+			goto unmap;
+		}
+
+		mapped++;
+	}
+
+	return mapped;
+
+unmap:
+	for (npages = i, i = 0; (i < npages) && mapped; ++i) {
+		enum dma_data_direction dir = DMA_FROM_DEVICE;
+		struct page *page;
+
+		page = hmm_pfn_to_page(range, range->pfns[i]);
+		if (page == NULL)
+			continue;
+
+		if (dma_mapping_error(device, daddrs[i]))
+			continue;
+
+		/* If it is read and write then map bi-directional. */
+		if (range->pfns[i] & range->values[HMM_PFN_WRITE])
+			dir = DMA_BIDIRECTIONAL;
+
+		dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
+		mapped--;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(hmm_range_dma_map);
+
+/*
+ * hmm_range_dma_unmap() - unmap a range that was mapped with hmm_range_dma_map()
+ * @range: range being unmapped
+ * @vma: the vma against which the range was mapped (optional)
+ * @device: device against which dma map was done
+ * @daddrs: dma address of mapped pages
+ * @dirty: dirty pages if they had the write flag set
+ * Returns: number of pages unmapped on success, -EINVAL otherwise
+ *
+ * Note that the caller MUST abide by the mmu notifier rules, or use an HMM
+ * mirror and abide by its sync_cpu_device_pagetables() callback, so that it
+ * is safe to call set_page_dirty() here. The caller must also take locks that
+ * keep concurrent mmu notifier or sync_cpu_device_pagetables() from progressing.
+ */
+long hmm_range_dma_unmap(struct hmm_range *range,
+			 struct vm_area_struct *vma,
+			 struct device *device,
+			 dma_addr_t *daddrs,
+			 bool dirty)
+{
+	unsigned long i, npages;
+	long cpages = 0;
+
+	/* Sanity check. */
+	if (range->end <= range->start)
+		return -EINVAL;
+	if (!daddrs)
+		return -EINVAL;
+	if (!range->pfns)
+		return -EINVAL;
+
+	npages = (range->end - range->start) >> PAGE_SHIFT;
+	for (i = 0; i < npages; ++i) {
+		enum dma_data_direction dir = DMA_FROM_DEVICE;
+		struct page *page;
+
+		page = hmm_pfn_to_page(range, range->pfns[i]);
+		if (page == NULL)
+			continue;
+
+		/* If it is read and write then map bi-directional. */
+		if (range->pfns[i] & range->values[HMM_PFN_WRITE]) {
+			dir = DMA_BIDIRECTIONAL;
+
+			/*
+			 * See comments in function description on why it is
+			 * safe here to call set_page_dirty()
+			 */
+			if (dirty)
+				set_page_dirty(page);
+		}
+
+		/* Unmap and clear pfns/dma address */
+		dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
+		range->pfns[i] = range->values[HMM_PFN_NONE];
+		/* FIXME see comments in hmm_vma_dma_map() */
+		daddrs[i] = 0;
+		cpages++;
+	}
+
+	return cpages;
+}
+EXPORT_SYMBOL(hmm_range_dma_unmap);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-25 14:40 ` [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2 jglisse
@ 2019-03-28 11:07   ` Ira Weiny
  2019-03-28 19:11     ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 11:07 UTC (permalink / raw)
  To: jglisse; +Cc: linux-mm, linux-kernel, John Hubbard, Andrew Morton, Dan Williams

On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Every time i read the code to check that the HMM structure does not
> vanish before it should thanks to the many lock protecting its removal
> i get a headache. Switch to reference counting instead it is much
> easier to follow and harder to break. This also remove some code that
> is no longer needed with refcounting.
> 
> Changes since v1:
>     - removed bunch of useless check (if API is use with bogus argument
>       better to fail loudly so user fix their code)
>     - s/hmm_get/mm_get_hmm/
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/hmm.h |   2 +
>  mm/hmm.c            | 170 ++++++++++++++++++++++++++++----------------
>  2 files changed, 112 insertions(+), 60 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index ad50b7b4f141..716fc61fa6d4 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -131,6 +131,7 @@ enum hmm_pfn_value_e {
>  /*
>   * struct hmm_range - track invalidation lock on virtual address range
>   *
> + * @hmm: the core HMM structure this range is active against
>   * @vma: the vm area struct for the range
>   * @list: all range lock are on a list
>   * @start: range virtual start address (inclusive)
> @@ -142,6 +143,7 @@ enum hmm_pfn_value_e {
>   * @valid: pfns array did not change since it has been fill by an HMM function
>   */
>  struct hmm_range {
> +	struct hmm		*hmm;
>  	struct vm_area_struct	*vma;
>  	struct list_head	list;
>  	unsigned long		start;
> diff --git a/mm/hmm.c b/mm/hmm.c
> index fe1cd87e49ac..306e57f7cded 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -50,6 +50,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>   */
>  struct hmm {
>  	struct mm_struct	*mm;
> +	struct kref		kref;
>  	spinlock_t		lock;
>  	struct list_head	ranges;
>  	struct list_head	mirrors;
> @@ -57,6 +58,16 @@ struct hmm {
>  	struct rw_semaphore	mirrors_sem;
>  };
>  
> +static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
> +{
> +	struct hmm *hmm = READ_ONCE(mm->hmm);
> +
> +	if (hmm && kref_get_unless_zero(&hmm->kref))
> +		return hmm;
> +
> +	return NULL;
> +}
> +
>  /*
>   * hmm_register - register HMM against an mm (HMM internal)
>   *
> @@ -67,14 +78,9 @@ struct hmm {
>   */
>  static struct hmm *hmm_register(struct mm_struct *mm)
>  {
> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> +	struct hmm *hmm = mm_get_hmm(mm);

FWIW: having hmm_register == "hmm get" is a bit confusing...

Ira

>  	bool cleanup = false;
>  
> -	/*
> -	 * The hmm struct can only be freed once the mm_struct goes away,
> -	 * hence we should always have pre-allocated an new hmm struct
> -	 * above.
> -	 */
>  	if (hmm)
>  		return hmm;
>  
> @@ -86,6 +92,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
>  	hmm->mmu_notifier.ops = NULL;
>  	INIT_LIST_HEAD(&hmm->ranges);
>  	spin_lock_init(&hmm->lock);
> +	kref_init(&hmm->kref);
>  	hmm->mm = mm;
>  
>  	spin_lock(&mm->page_table_lock);
> @@ -106,7 +113,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
>  	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
>  		goto error_mm;
>  
> -	return mm->hmm;
> +	return hmm;
>  
>  error_mm:
>  	spin_lock(&mm->page_table_lock);
> @@ -118,9 +125,41 @@ static struct hmm *hmm_register(struct mm_struct *mm)
>  	return NULL;
>  }
>  
> +static void hmm_free(struct kref *kref)
> +{
> +	struct hmm *hmm = container_of(kref, struct hmm, kref);
> +	struct mm_struct *mm = hmm->mm;
> +
> +	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (mm->hmm == hmm)
> +		mm->hmm = NULL;
> +	spin_unlock(&mm->page_table_lock);
> +
> +	kfree(hmm);
> +}
> +
> +static inline void hmm_put(struct hmm *hmm)
> +{
> +	kref_put(&hmm->kref, hmm_free);
> +}
> +
>  void hmm_mm_destroy(struct mm_struct *mm)
>  {
> -	kfree(mm->hmm);
> +	struct hmm *hmm;
> +
> +	spin_lock(&mm->page_table_lock);
> +	hmm = mm_get_hmm(mm);
> +	mm->hmm = NULL;
> +	if (hmm) {
> +		hmm->mm = NULL;
> +		spin_unlock(&mm->page_table_lock);
> +		hmm_put(hmm);
> +		return;
> +	}
> +
> +	spin_unlock(&mm->page_table_lock);
>  }
>  
>  static int hmm_invalidate_range(struct hmm *hmm, bool device,
> @@ -165,7 +204,7 @@ static int hmm_invalidate_range(struct hmm *hmm, bool device,
>  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
>  	struct hmm_mirror *mirror;
> -	struct hmm *hmm = mm->hmm;
> +	struct hmm *hmm = mm_get_hmm(mm);
>  
>  	down_write(&hmm->mirrors_sem);
>  	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> @@ -186,13 +225,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  						  struct hmm_mirror, list);
>  	}
>  	up_write(&hmm->mirrors_sem);
> +
> +	hmm_put(hmm);
>  }
>  
>  static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  			const struct mmu_notifier_range *range)
>  {
> +	struct hmm *hmm = mm_get_hmm(range->mm);
>  	struct hmm_update update;
> -	struct hmm *hmm = range->mm->hmm;
> +	int ret;
>  
>  	VM_BUG_ON(!hmm);
>  
> @@ -200,14 +242,16 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  	update.end = range->end;
>  	update.event = HMM_UPDATE_INVALIDATE;
>  	update.blockable = range->blockable;
> -	return hmm_invalidate_range(hmm, true, &update);
> +	ret = hmm_invalidate_range(hmm, true, &update);
> +	hmm_put(hmm);
> +	return ret;
>  }
>  
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>  			const struct mmu_notifier_range *range)
>  {
> +	struct hmm *hmm = mm_get_hmm(range->mm);
>  	struct hmm_update update;
> -	struct hmm *hmm = range->mm->hmm;
>  
>  	VM_BUG_ON(!hmm);
>  
> @@ -216,6 +260,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>  	update.event = HMM_UPDATE_INVALIDATE;
>  	update.blockable = true;
>  	hmm_invalidate_range(hmm, false, &update);
> +	hmm_put(hmm);
>  }
>  
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
> @@ -241,24 +286,13 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
>  	if (!mm || !mirror || !mirror->ops)
>  		return -EINVAL;
>  
> -again:
>  	mirror->hmm = hmm_register(mm);
>  	if (!mirror->hmm)
>  		return -ENOMEM;
>  
>  	down_write(&mirror->hmm->mirrors_sem);
> -	if (mirror->hmm->mm == NULL) {
> -		/*
> -		 * A racing hmm_mirror_unregister() is about to destroy the hmm
> -		 * struct. Try again to allocate a new one.
> -		 */
> -		up_write(&mirror->hmm->mirrors_sem);
> -		mirror->hmm = NULL;
> -		goto again;
> -	} else {
> -		list_add(&mirror->list, &mirror->hmm->mirrors);
> -		up_write(&mirror->hmm->mirrors_sem);
> -	}
> +	list_add(&mirror->list, &mirror->hmm->mirrors);
> +	up_write(&mirror->hmm->mirrors_sem);
>  
>  	return 0;
>  }
> @@ -273,33 +307,18 @@ EXPORT_SYMBOL(hmm_mirror_register);
>   */
>  void hmm_mirror_unregister(struct hmm_mirror *mirror)
>  {
> -	bool should_unregister = false;
> -	struct mm_struct *mm;
> -	struct hmm *hmm;
> +	struct hmm *hmm = READ_ONCE(mirror->hmm);
>  
> -	if (mirror->hmm == NULL)
> +	if (hmm == NULL)
>  		return;
>  
> -	hmm = mirror->hmm;
>  	down_write(&hmm->mirrors_sem);
>  	list_del_init(&mirror->list);
> -	should_unregister = list_empty(&hmm->mirrors);
> +	/* To protect us against double unregister ... */
>  	mirror->hmm = NULL;
> -	mm = hmm->mm;
> -	hmm->mm = NULL;
>  	up_write(&hmm->mirrors_sem);
>  
> -	if (!should_unregister || mm == NULL)
> -		return;
> -
> -	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
> -
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -
> -	kfree(hmm);
> +	hmm_put(hmm);
>  }
>  EXPORT_SYMBOL(hmm_mirror_unregister);
>  
> @@ -708,6 +727,8 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>  	struct mm_walk mm_walk;
>  	struct hmm *hmm;
>  
> +	range->hmm = NULL;
> +
>  	/* Sanity check, this really should not happen ! */
>  	if (range->start < vma->vm_start || range->start >= vma->vm_end)
>  		return -EINVAL;
> @@ -717,14 +738,18 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>  	hmm = hmm_register(vma->vm_mm);
>  	if (!hmm)
>  		return -ENOMEM;
> -	/* Caller must have registered a mirror, via hmm_mirror_register() ! */
> -	if (!hmm->mmu_notifier.ops)
> +
> +	/* Check if hmm_mm_destroy() was call. */
> +	if (hmm->mm == NULL) {
> +		hmm_put(hmm);
>  		return -EINVAL;
> +	}
>  
>  	/* FIXME support hugetlb fs */
>  	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
>  			vma_is_dax(vma)) {
>  		hmm_pfns_special(range);
> +		hmm_put(hmm);
>  		return -EINVAL;
>  	}
>  
> @@ -736,6 +761,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>  		 * operations such has atomic access would not work.
>  		 */
>  		hmm_pfns_clear(range, range->pfns, range->start, range->end);
> +		hmm_put(hmm);
>  		return -EPERM;
>  	}
>  
> @@ -758,6 +784,12 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>  	mm_walk.pte_hole = hmm_vma_walk_hole;
>  
>  	walk_page_range(range->start, range->end, &mm_walk);
> +	/*
> +	 * Transfer hmm reference to the range struct it will be drop inside
> +	 * the hmm_vma_range_done() function (which _must_ be call if this
> +	 * function return 0).
> +	 */
> +	range->hmm = hmm;
>  	return 0;
>  }
>  EXPORT_SYMBOL(hmm_vma_get_pfns);
> @@ -802,25 +834,27 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
>   */
>  bool hmm_vma_range_done(struct hmm_range *range)
>  {
> -	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
> -	struct hmm *hmm;
> +	bool ret = false;
>  
> -	if (range->end <= range->start) {
> +	/* Sanity check this really should not happen. */
> +	if (range->hmm == NULL || range->end <= range->start) {
>  		BUG();
>  		return false;
>  	}
>  
> -	hmm = hmm_register(range->vma->vm_mm);
> -	if (!hmm) {
> -		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
> -		return false;
> -	}
> -
> -	spin_lock(&hmm->lock);
> +	spin_lock(&range->hmm->lock);
>  	list_del_rcu(&range->list);
> -	spin_unlock(&hmm->lock);
> +	ret = range->valid;
> +	spin_unlock(&range->hmm->lock);
>  
> -	return range->valid;
> +	/* Is the mm still alive ? */
> +	if (range->hmm->mm == NULL)
> +		ret = false;
> +
> +	/* Drop reference taken by hmm_vma_fault() or hmm_vma_get_pfns() */
> +	hmm_put(range->hmm);
> +	range->hmm = NULL;
> +	return ret;
>  }
>  EXPORT_SYMBOL(hmm_vma_range_done);
>  
> @@ -880,6 +914,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>  	struct hmm *hmm;
>  	int ret;
>  
> +	range->hmm = NULL;
> +
>  	/* Sanity check, this really should not happen ! */
>  	if (range->start < vma->vm_start || range->start >= vma->vm_end)
>  		return -EINVAL;
> @@ -891,14 +927,18 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>  		hmm_pfns_clear(range, range->pfns, range->start, range->end);
>  		return -ENOMEM;
>  	}
> -	/* Caller must have registered a mirror using hmm_mirror_register() */
> -	if (!hmm->mmu_notifier.ops)
> +
> +	/* Check if hmm_mm_destroy() was call. */
> +	if (hmm->mm == NULL) {
> +		hmm_put(hmm);
>  		return -EINVAL;
> +	}
>  
>  	/* FIXME support hugetlb fs */
>  	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
>  			vma_is_dax(vma)) {
>  		hmm_pfns_special(range);
> +		hmm_put(hmm);
>  		return -EINVAL;
>  	}
>  
> @@ -910,6 +950,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>  		 * operations such has atomic access would not work.
>  		 */
>  		hmm_pfns_clear(range, range->pfns, range->start, range->end);
> +		hmm_put(hmm);
>  		return -EPERM;
>  	}
>  
> @@ -945,7 +986,16 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>  		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
>  			       range->end);
>  		hmm_vma_range_done(range);
> +		hmm_put(hmm);
> +	} else {
> +		/*
> +		 * Transfer hmm reference to the range struct it will be drop
> +		 * inside the hmm_vma_range_done() function (which _must_ be
> +		 * call if this function return 0).
> +		 */
> +		range->hmm = hmm;
>  	}
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL(hmm_vma_fault);
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2
  2019-03-25 14:40 ` [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2 jglisse
@ 2019-03-28 13:11   ` Ira Weiny
  2019-03-28 21:39     ` Jerome Glisse
  2019-03-28 16:12   ` Ira Weiny
  1 sibling, 1 reply; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 13:11 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Andrew Morton, John Hubbard,
	Dan Williams, Dan Carpenter, Matthew Wilcox

On Mon, Mar 25, 2019 at 10:40:06AM -0400, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> A common use case for an HMM mirror is a user trying to mirror a range,
> and before they can program the hardware it gets invalidated by some
> core mm event. Instead of having the user retry right away to mirror
> the range, provide a completion mechanism for them to wait for any
> active invalidation affecting the range.
> 
> This also changes how hmm_range_snapshot() and hmm_range_fault() work
> by not relying on the vma, so that we can drop the mmap_sem when
> waiting and look up the vma again on retry.
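
A minimal sketch of the register/wait/retry flow described above, condensed
from the pseudo-code the patch adds to include/linux/hmm.h further down.
mydevice_mirror() is a placeholder name; only hmm_range_snapshot() is shown,
and the device page-table programming plus the final hmm_range_unregister()
on success are omitted:

    int mydevice_mirror(struct mm_struct *mm, struct hmm_range *range)
    {
            long ret;

            ret = hmm_range_register(range, mm, range->start, range->end);
            if (ret)
                    return ret;

            down_read(&mm->mmap_sem);
    again:
            /* Sleep until any invalidation already in flight has finished. */
            if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
                    up_read(&mm->mmap_sem);
                    hmm_range_unregister(range);
                    return -EBUSY;          /* timed out */
            }

            ret = hmm_range_snapshot(range);
            if (ret == -EAGAIN)
                    goto again;             /* a new invalidation hit the range */

            up_read(&mm->mmap_sem);
            if (ret < 0) {
                    hmm_range_unregister(range);
                    return ret;
            }
            /* ret entries of range->pfns[] are valid; program the device now. */
            return 0;
    }
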
> 
> Changes since v1:
>     - squashed: Dan Carpenter: potential deadlock in nonblocking code
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> ---
>  include/linux/hmm.h | 208 ++++++++++++++---
>  mm/hmm.c            | 528 +++++++++++++++++++++-----------------------
>  2 files changed, 428 insertions(+), 308 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index e9afd23c2eac..79671036cb5f 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -77,8 +77,34 @@
>  #include <linux/migrate.h>
>  #include <linux/memremap.h>
>  #include <linux/completion.h>
> +#include <linux/mmu_notifier.h>
>  
> -struct hmm;
> +
> +/*
> + * struct hmm - HMM per mm struct
> + *
> + * @mm: mm struct this HMM struct is bound to
> + * @lock: lock protecting ranges list
> + * @ranges: list of range being snapshotted
> + * @mirrors: list of mirrors for this mm
> + * @mmu_notifier: mmu notifier to track updates to CPU page table
> + * @mirrors_sem: read/write semaphore protecting the mirrors list
> + * @wq: wait queue for user waiting on a range invalidation
> + * @notifiers: count of active mmu notifiers
> + * @dead: is the mm dead ?
> + */
> +struct hmm {
> +	struct mm_struct	*mm;
> +	struct kref		kref;
> +	struct mutex		lock;
> +	struct list_head	ranges;
> +	struct list_head	mirrors;
> +	struct mmu_notifier	mmu_notifier;
> +	struct rw_semaphore	mirrors_sem;
> +	wait_queue_head_t	wq;
> +	long			notifiers;
> +	bool			dead;
> +};
>  
>  /*
>   * hmm_pfn_flag_e - HMM flag enums
> @@ -155,6 +181,38 @@ struct hmm_range {
>  	bool			valid;
>  };
>  
> +/*
> + * hmm_range_wait_until_valid() - wait for range to be valid
> + * @range: range affected by invalidation to wait on
> + * @timeout: time out for wait in ms (ie abort wait after that period of time)
> + * Returns: true if the range is valid, false otherwise.
> + */
> +static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> +					      unsigned long timeout)
> +{
> +	/* Check if mm is dead ? */
> +	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> +		range->valid = false;
> +		return false;
> +	}
> +	if (range->valid)
> +		return true;
> +	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> +			   msecs_to_jiffies(timeout));
> +	/* Return current valid status just in case we get lucky */
> +	return range->valid;
> +}
> +
> +/*
> + * hmm_range_valid() - test if a range is valid or not
> + * @range: range
> + * Returns: true if the range is valid, false otherwise.
> + */
> +static inline bool hmm_range_valid(struct hmm_range *range)
> +{
> +	return range->valid;
> +}
> +
>  /*
>   * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
>   * @range: range use to decode HMM pfn value
> @@ -357,51 +415,133 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  
>  
>  /*
> - * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
> - * driver lock that serializes device page table updates, then call
> - * hmm_vma_range_done(), to check if the snapshot is still valid. The same
> - * device driver page table update lock must also be used in the
> - * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
> - * table invalidation serializes on it.
> + * To snapshot the CPU page table you first have to call hmm_range_register()
> + * to register the range. If hmm_range_register() return an error then some-
> + * thing is horribly wrong and you should fail loudly. If it returned true then
> + * you can wait for the range to be stable with hmm_range_wait_until_valid()
> + * function, a range is valid when there are no concurrent changes to the CPU
> + * page table for the range.
> + *
> + * Once the range is valid you can call hmm_range_snapshot() if that returns
> + * without error then you can take your device page table lock (the same lock
> + * you use in the HMM mirror sync_cpu_device_pagetables() callback). After
> + * taking that lock you have to check the range validity, if it is still valid
> + * (ie hmm_range_valid() returns true) then you can program the device page
> + * table, otherwise you have to start again. Pseudo code:
> + *
> + *      mydevice_prefault(mydevice, mm, start, end)
> + *      {
> + *          struct hmm_range range;
> + *          ...
>   *
> - * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
> - * hmm_range_snapshot() WITHOUT ERROR !
> + *          ret = hmm_range_register(&range, mm, start, end);
> + *          if (ret)
> + *              return ret;
>   *
> - * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
> - */
> -long hmm_range_snapshot(struct hmm_range *range);
> -bool hmm_vma_range_done(struct hmm_range *range);
> -
> -
> -/*
> - * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
> - * not migrate any device memory back to system memory. The HMM pfn array will
> - * be updated with the fault result and current snapshot of the CPU page table
> - * for the range.
> + *          down_read(mm->mmap_sem);
> + *      again:
> + *
> + *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
> + *              up_read(&mm->mmap_sem);
> + *              hmm_range_unregister(range);
> + *              // Handle time out, either sleep or retry or something else
> + *              ...
> + *              return -ESOMETHING; || goto again;
> + *          }
> + *
> + *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
> + *          if (ret == -EAGAIN) {
> + *              down_read(mm->mmap_sem);
> + *              goto again;
> + *          } else if (ret == -EBUSY) {
> + *              goto again;
> + *          }
> + *
> + *          up_read(&mm->mmap_sem);
> + *          if (ret) {
> + *              hmm_range_unregister(range);
> + *              return ret;
> + *          }
> + *
> + *          // It might not have snap-shoted the whole range but only the first
> + *          // npages, the return values is the number of valid pages from the
> + *          // start of the range.
> + *          npages = ret;
>   *
> - * The mmap_sem must be taken in read mode before entering and it might be
> - * dropped by the function if the block argument is false. In that case, the
> - * function returns -EAGAIN.
> + *          ...
>   *
> - * Return value does not reflect if the fault was successful for every single
> - * address or not. Therefore, the caller must to inspect the HMM pfn array to
> - * determine fault status for each address.
> + *          mydevice_page_table_lock(mydevice);
> + *          if (!hmm_range_valid(range)) {
> + *              mydevice_page_table_unlock(mydevice);
> + *              goto again;
> + *          }
>   *
> - * Trying to fault inside an invalid vma will result in -EINVAL.
> + *          mydevice_populate_page_table(mydevice, range, npages);
> + *          ...
> + *          mydevice_take_page_table_unlock(mydevice);
> + *          hmm_range_unregister(range);
>   *
> - * See the function description in mm/hmm.c for further documentation.
> + *          return 0;
> + *      }
> + *
> + * The same scheme apply to hmm_range_fault() (ie replace hmm_range_snapshot()
> + * with hmm_range_fault() in above pseudo code).
> + *
> + * YOU MUST CALL hmm_range_unregister() ONCE AND ONLY ONCE EACH TIME YOU CALL
> + * hmm_range_register() AND hmm_range_register() RETURNED TRUE ! IF YOU DO NOT
> + * FOLLOW THIS RULE MEMORY CORRUPTION WILL ENSUE !
>   */
> +int hmm_range_register(struct hmm_range *range,
> +		       struct mm_struct *mm,
> +		       unsigned long start,
> +		       unsigned long end);
> +void hmm_range_unregister(struct hmm_range *range);
> +long hmm_range_snapshot(struct hmm_range *range);
>  long hmm_range_fault(struct hmm_range *range, bool block);
>  
> +/*
> + * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> + *
> + * When waiting for mmu notifiers we need some kind of time out otherwise we
> + * could potentialy wait for ever, 1000ms ie 1s sounds like a long time to
> + * wait already.
> + */
> +#define HMM_RANGE_DEFAULT_TIMEOUT 1000
> +
>  /* This is a temporary helper to avoid merge conflict between trees. */
> +static inline bool hmm_vma_range_done(struct hmm_range *range)
> +{
> +	bool ret = hmm_range_valid(range);
> +
> +	hmm_range_unregister(range);
> +	return ret;
> +}
> +
>  static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>  {
> -	long ret = hmm_range_fault(range, block);
> -	if (ret == -EBUSY)
> -		ret = -EAGAIN;
> -	else if (ret == -EAGAIN)
> -		ret = -EBUSY;
> -	return ret < 0 ? ret : 0;
> +	long ret;
> +
> +	ret = hmm_range_register(range, range->vma->vm_mm,
> +				 range->start, range->end);
> +	if (ret)
> +		return (int)ret;
> +
> +	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
> +		up_read(&range->vma->vm_mm->mmap_sem);

Where is the down_read() which corresponds to this?

> +		return -EAGAIN;
> +	}
> +
> +	ret = hmm_range_fault(range, block);
> +	if (ret <= 0) {
> +		if (ret == -EBUSY || !ret) {
> +			up_read(&range->vma->vm_mm->mmap_sem);

Or this...?

> +			ret = -EBUSY;
> +		} else if (ret == -EAGAIN)
> +			ret = -EBUSY;
> +		hmm_range_unregister(range);
> +		return ret;
> +	}

And is the side effect of this call that the mmap_sem has been taken?
Or is the side effect that the mmap_sem was released?

I'm not saying this is wrong, but I can't tell, so it seems like a comment on
the function would help.
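
FWIW, reading just the quoted hunk, the contract appears to be: the caller
takes mmap_sem for read before calling, hmm_vma_fault() releases it on the
-EAGAIN and -EBUSY returns, and leaves it held on success and on the other
errors. A sketch of a hypothetical caller under that reading (not from the
series):

    down_read(&mm->mmap_sem);
    ret = hmm_vma_fault(range, true);
    if (ret == -EAGAIN || ret == -EBUSY)
            return ret;                     /* mmap_sem already dropped inside */
    if (ret) {
            up_read(&mm->mmap_sem);         /* other errors leave it held */
            return ret;
    }
    /* Success: mmap_sem still held, range still registered. */
    /* ... program device tables, then hmm_vma_range_done(range) ... */
    up_read(&mm->mmap_sem);
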

Ira

> +	return 0;
>  }
>  
>  /* Below are for HMM internal use only! Not to be used by device driver! */
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 7860e63c3ba7..fa9498eeb9b6 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -38,26 +38,6 @@
>  #if IS_ENABLED(CONFIG_HMM_MIRROR)
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>  
> -/*
> - * struct hmm - HMM per mm struct
> - *
> - * @mm: mm struct this HMM struct is bound to
> - * @lock: lock protecting ranges list
> - * @ranges: list of range being snapshotted
> - * @mirrors: list of mirrors for this mm
> - * @mmu_notifier: mmu notifier to track updates to CPU page table
> - * @mirrors_sem: read/write semaphore protecting the mirrors list
> - */
> -struct hmm {
> -	struct mm_struct	*mm;
> -	struct kref		kref;
> -	spinlock_t		lock;
> -	struct list_head	ranges;
> -	struct list_head	mirrors;
> -	struct mmu_notifier	mmu_notifier;
> -	struct rw_semaphore	mirrors_sem;
> -};
> -
>  static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>  {
>  	struct hmm *hmm = READ_ONCE(mm->hmm);
> @@ -87,12 +67,15 @@ static struct hmm *hmm_register(struct mm_struct *mm)
>  	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
>  	if (!hmm)
>  		return NULL;
> +	init_waitqueue_head(&hmm->wq);
>  	INIT_LIST_HEAD(&hmm->mirrors);
>  	init_rwsem(&hmm->mirrors_sem);
>  	hmm->mmu_notifier.ops = NULL;
>  	INIT_LIST_HEAD(&hmm->ranges);
> -	spin_lock_init(&hmm->lock);
> +	mutex_init(&hmm->lock);
>  	kref_init(&hmm->kref);
> +	hmm->notifiers = 0;
> +	hmm->dead = false;
>  	hmm->mm = mm;
>  
>  	spin_lock(&mm->page_table_lock);
> @@ -154,6 +137,7 @@ void hmm_mm_destroy(struct mm_struct *mm)
>  	mm->hmm = NULL;
>  	if (hmm) {
>  		hmm->mm = NULL;
> +		hmm->dead = true;
>  		spin_unlock(&mm->page_table_lock);
>  		hmm_put(hmm);
>  		return;
> @@ -162,43 +146,22 @@ void hmm_mm_destroy(struct mm_struct *mm)
>  	spin_unlock(&mm->page_table_lock);
>  }
>  
> -static int hmm_invalidate_range(struct hmm *hmm, bool device,
> -				const struct hmm_update *update)
> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
> +	struct hmm *hmm = mm_get_hmm(mm);
>  	struct hmm_mirror *mirror;
>  	struct hmm_range *range;
>  
> -	spin_lock(&hmm->lock);
> -	list_for_each_entry(range, &hmm->ranges, list) {
> -		if (update->end < range->start || update->start >= range->end)
> -			continue;
> +	/* Report this HMM as dying. */
> +	hmm->dead = true;
>  
> +	/* Wake-up everyone waiting on any range. */
> +	mutex_lock(&hmm->lock);
> +	list_for_each_entry(range, &hmm->ranges, list) {
>  		range->valid = false;
>  	}
> -	spin_unlock(&hmm->lock);
> -
> -	if (!device)
> -		return 0;
> -
> -	down_read(&hmm->mirrors_sem);
> -	list_for_each_entry(mirror, &hmm->mirrors, list) {
> -		int ret;
> -
> -		ret = mirror->ops->sync_cpu_device_pagetables(mirror, update);
> -		if (!update->blockable && ret == -EAGAIN) {
> -			up_read(&hmm->mirrors_sem);
> -			return -EAGAIN;
> -		}
> -	}
> -	up_read(&hmm->mirrors_sem);
> -
> -	return 0;
> -}
> -
> -static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> -{
> -	struct hmm_mirror *mirror;
> -	struct hmm *hmm = mm_get_hmm(mm);
> +	wake_up_all(&hmm->wq);
> +	mutex_unlock(&hmm->lock);
>  
>  	down_write(&hmm->mirrors_sem);
>  	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> @@ -224,36 +187,80 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  }
>  
>  static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> -			const struct mmu_notifier_range *range)
> +			const struct mmu_notifier_range *nrange)
>  {
> -	struct hmm *hmm = mm_get_hmm(range->mm);
> +	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm_mirror *mirror;
>  	struct hmm_update update;
> -	int ret;
> +	struct hmm_range *range;
> +	int ret = 0;
>  
>  	VM_BUG_ON(!hmm);
>  
> -	update.start = range->start;
> -	update.end = range->end;
> +	update.start = nrange->start;
> +	update.end = nrange->end;
>  	update.event = HMM_UPDATE_INVALIDATE;
> -	update.blockable = range->blockable;
> -	ret = hmm_invalidate_range(hmm, true, &update);
> +	update.blockable = nrange->blockable;
> +
> +	if (nrange->blockable)
> +		mutex_lock(&hmm->lock);
> +	else if (!mutex_trylock(&hmm->lock)) {
> +		ret = -EAGAIN;
> +		goto out;
> +	}
> +	hmm->notifiers++;
> +	list_for_each_entry(range, &hmm->ranges, list) {
> +		if (update.end < range->start || update.start >= range->end)
> +			continue;
> +
> +		range->valid = false;
> +	}
> +	mutex_unlock(&hmm->lock);
> +
> +	if (nrange->blockable)
> +		down_read(&hmm->mirrors_sem);
> +	else if (!down_read_trylock(&hmm->mirrors_sem)) {
> +		ret = -EAGAIN;
> +		goto out;
> +	}
> +	list_for_each_entry(mirror, &hmm->mirrors, list) {
> +		int ret;
> +
> +		ret = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
> +		if (!update.blockable && ret == -EAGAIN) {
> +			up_read(&hmm->mirrors_sem);
> +			ret = -EAGAIN;
> +			goto out;
> +		}
> +	}
> +	up_read(&hmm->mirrors_sem);
> +
> +out:
>  	hmm_put(hmm);
>  	return ret;
>  }
>  
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> -			const struct mmu_notifier_range *range)
> +			const struct mmu_notifier_range *nrange)
>  {
> -	struct hmm *hmm = mm_get_hmm(range->mm);
> -	struct hmm_update update;
> +	struct hmm *hmm = mm_get_hmm(nrange->mm);
>  
>  	VM_BUG_ON(!hmm);
>  
> -	update.start = range->start;
> -	update.end = range->end;
> -	update.event = HMM_UPDATE_INVALIDATE;
> -	update.blockable = true;
> -	hmm_invalidate_range(hmm, false, &update);
> +	mutex_lock(&hmm->lock);
> +	hmm->notifiers--;
> +	if (!hmm->notifiers) {
> +		struct hmm_range *range;
> +
> +		list_for_each_entry(range, &hmm->ranges, list) {
> +			if (range->valid)
> +				continue;
> +			range->valid = true;
> +		}
> +		wake_up_all(&hmm->wq);
> +	}
> +	mutex_unlock(&hmm->lock);
> +
>  	hmm_put(hmm);
>  }
>  
> @@ -405,7 +412,6 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>  {
>  	struct hmm_range *range = hmm_vma_walk->range;
>  
> -	*fault = *write_fault = false;
>  	if (!hmm_vma_walk->fault)
>  		return;
>  
> @@ -444,10 +450,11 @@ static void hmm_range_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>  		return;
>  	}
>  
> +	*fault = *write_fault = false;
>  	for (i = 0; i < npages; ++i) {
>  		hmm_pte_need_fault(hmm_vma_walk, pfns[i], cpu_flags,
>  				   fault, write_fault);
> -		if ((*fault) || (*write_fault))
> +		if ((*write_fault))
>  			return;
>  	}
>  }
> @@ -702,162 +709,152 @@ static void hmm_pfns_special(struct hmm_range *range)
>  }
>  
>  /*
> - * hmm_range_snapshot() - snapshot CPU page table for a range
> + * hmm_range_register() - start tracking change to CPU page table over a range
>   * @range: range
> - * Returns: number of valid pages in range->pfns[] (from range start
> - *          address). This may be zero. If the return value is negative,
> - *          then one of the following values may be returned:
> + * @mm: the mm struct for the range of virtual address
> + * @start: start virtual address (inclusive)
> + * @end: end virtual address (exclusive)
> + * Returns 0 on success, -EFAULT if the address space is no longer valid
>   *
> - *           -EINVAL  invalid arguments or mm or virtual address are in an
> - *                    invalid vma (ie either hugetlbfs or device file vma).
> - *           -EPERM   For example, asking for write, when the range is
> - *                    read-only
> - *           -EAGAIN  Caller needs to retry
> - *           -EFAULT  Either no valid vma exists for this range, or it is
> - *                    illegal to access the range
> - *
> - * This snapshots the CPU page table for a range of virtual addresses. Snapshot
> - * validity is tracked by range struct. See hmm_vma_range_done() for further
> - * information.
> + * Track updates to the CPU page table see include/linux/hmm.h
>   */
> -long hmm_range_snapshot(struct hmm_range *range)
> +int hmm_range_register(struct hmm_range *range,
> +		       struct mm_struct *mm,
> +		       unsigned long start,
> +		       unsigned long end)
>  {
> -	struct vm_area_struct *vma = range->vma;
> -	struct hmm_vma_walk hmm_vma_walk;
> -	struct mm_walk mm_walk;
> -	struct hmm *hmm;
> -
> +	range->start = start & PAGE_MASK;
> +	range->end = end & PAGE_MASK;
> +	range->valid = false;
>  	range->hmm = NULL;
>  
> -	/* Sanity check, this really should not happen ! */
> -	if (range->start < vma->vm_start || range->start >= vma->vm_end)
> -		return -EINVAL;
> -	if (range->end < vma->vm_start || range->end > vma->vm_end)
> +	if (range->start >= range->end)
>  		return -EINVAL;
>  
> -	hmm = hmm_register(vma->vm_mm);
> -	if (!hmm)
> -		return -ENOMEM;
> +	range->hmm = hmm_register(mm);
> +	if (!range->hmm)
> +		return -EFAULT;
>  
>  	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL) {
> -		hmm_put(hmm);
> -		return -EINVAL;
> +	if (range->hmm->mm == NULL || range->hmm->dead) {
> +		hmm_put(range->hmm);
> +		return -EFAULT;
>  	}
>  
> -	/* FIXME support hugetlb fs */
> -	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
> -			vma_is_dax(vma)) {
> -		hmm_pfns_special(range);
> -		hmm_put(hmm);
> -		return -EINVAL;
> -	}
> +	/* Initialize range to track CPU page table update */
> +	mutex_lock(&range->hmm->lock);
>  
> -	if (!(vma->vm_flags & VM_READ)) {
> -		/*
> -		 * If vma do not allow read access, then assume that it does
> -		 * not allow write access, either. Architecture that allow
> -		 * write without read access are not supported by HMM, because
> -		 * operations such has atomic access would not work.
> -		 */
> -		hmm_pfns_clear(range, range->pfns, range->start, range->end);
> -		hmm_put(hmm);
> -		return -EPERM;
> -	}
> +	list_add_rcu(&range->list, &range->hmm->ranges);
>  
> -	/* Initialize range to track CPU page table update */
> -	spin_lock(&hmm->lock);
> -	range->valid = true;
> -	list_add_rcu(&range->list, &hmm->ranges);
> -	spin_unlock(&hmm->lock);
> -
> -	hmm_vma_walk.fault = false;
> -	hmm_vma_walk.range = range;
> -	mm_walk.private = &hmm_vma_walk;
> -	hmm_vma_walk.last = range->start;
> -
> -	mm_walk.vma = vma;
> -	mm_walk.mm = vma->vm_mm;
> -	mm_walk.pte_entry = NULL;
> -	mm_walk.test_walk = NULL;
> -	mm_walk.hugetlb_entry = NULL;
> -	mm_walk.pmd_entry = hmm_vma_walk_pmd;
> -	mm_walk.pte_hole = hmm_vma_walk_hole;
> -
> -	walk_page_range(range->start, range->end, &mm_walk);
>  	/*
> -	 * Transfer hmm reference to the range struct it will be drop inside
> -	 * the hmm_vma_range_done() function (which _must_ be call if this
> -	 * function return 0).
> +	 * If there are any concurrent notifiers we have to wait for them for
> +	 * the range to be valid (see hmm_range_wait_until_valid()).
>  	 */
> -	range->hmm = hmm;
> -	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
> +	if (!range->hmm->notifiers)
> +		range->valid = true;
> +	mutex_unlock(&range->hmm->lock);
> +
> +	return 0;
>  }
> -EXPORT_SYMBOL(hmm_range_snapshot);
> +EXPORT_SYMBOL(hmm_range_register);
>  
>  /*
> - * hmm_vma_range_done() - stop tracking change to CPU page table over a range
> - * @range: range being tracked
> - * Returns: false if range data has been invalidated, true otherwise
> + * hmm_range_unregister() - stop tracking change to CPU page table over a range
> + * @range: range
>   *
>   * Range struct is used to track updates to the CPU page table after a call to
> - * either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is done
> - * using the data,  or wants to lock updates to the data it got from those
> - * functions, it must call the hmm_vma_range_done() function, which will then
> - * stop tracking CPU page table updates.
> - *
> - * Note that device driver must still implement general CPU page table update
> - * tracking either by using hmm_mirror (see hmm_mirror_register()) or by using
> - * the mmu_notifier API directly.
> - *
> - * CPU page table update tracking done through hmm_range is only temporary and
> - * to be used while trying to duplicate CPU page table contents for a range of
> - * virtual addresses.
> - *
> - * There are two ways to use this :
> - * again:
> - *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
> - *   trans = device_build_page_table_update_transaction(pfns);
> - *   device_page_table_lock();
> - *   if (!hmm_vma_range_done(range)) {
> - *     device_page_table_unlock();
> - *     goto again;
> - *   }
> - *   device_commit_transaction(trans);
> - *   device_page_table_unlock();
> - *
> - * Or:
> - *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
> - *   device_page_table_lock();
> - *   hmm_vma_range_done(range);
> - *   device_update_page_table(range->pfns);
> - *   device_page_table_unlock();
> + * hmm_range_register(). See include/linux/hmm.h for how to use it.
>   */
> -bool hmm_vma_range_done(struct hmm_range *range)
> +void hmm_range_unregister(struct hmm_range *range)
>  {
> -	bool ret = false;
> -
>  	/* Sanity check this really should not happen. */
> -	if (range->hmm == NULL || range->end <= range->start) {
> -		BUG();
> -		return false;
> -	}
> +	if (range->hmm == NULL || range->end <= range->start)
> +		return;
>  
> -	spin_lock(&range->hmm->lock);
> +	mutex_lock(&range->hmm->lock);
>  	list_del_rcu(&range->list);
> -	ret = range->valid;
> -	spin_unlock(&range->hmm->lock);
> -
> -	/* Is the mm still alive ? */
> -	if (range->hmm->mm == NULL)
> -		ret = false;
> +	mutex_unlock(&range->hmm->lock);
>  
> -	/* Drop reference taken by hmm_vma_fault() or hmm_vma_get_pfns() */
> +	/* Drop reference taken by hmm_range_register() */
> +	range->valid = false;
>  	hmm_put(range->hmm);
>  	range->hmm = NULL;
> -	return ret;
>  }
> -EXPORT_SYMBOL(hmm_vma_range_done);
> +EXPORT_SYMBOL(hmm_range_unregister);
> +
> +/*
> + * hmm_range_snapshot() - snapshot CPU page table for a range
> + * @range: range
> + * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
> + *          permission (for instance asking for write and range is read only),
> + *          -EAGAIN if you need to retry, -EFAULT invalid (ie either no valid
> + *          vma or it is illegal to access that range), number of valid pages
> + *          in range->pfns[] (from range start address).
> + *
> + * This snapshots the CPU page table for a range of virtual addresses. Snapshot
> + * validity is tracked by range struct. See in include/linux/hmm.h for example
> + * on how to use.
> + */
> +long hmm_range_snapshot(struct hmm_range *range)
> +{
> +	unsigned long start = range->start, end;
> +	struct hmm_vma_walk hmm_vma_walk;
> +	struct hmm *hmm = range->hmm;
> +	struct vm_area_struct *vma;
> +	struct mm_walk mm_walk;
> +
> +	/* Check if hmm_mm_destroy() was call. */
> +	if (hmm->mm == NULL || hmm->dead)
> +		return -EFAULT;
> +
> +	do {
> +		/* If range is no longer valid force retry. */
> +		if (!range->valid)
> +			return -EAGAIN;
> +
> +		vma = find_vma(hmm->mm, start);
> +		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
> +			return -EFAULT;
> +
> +		/* FIXME support hugetlb fs/dax */
> +		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
> +			hmm_pfns_special(range);
> +			return -EINVAL;
> +		}
> +
> +		if (!(vma->vm_flags & VM_READ)) {
> +			/*
> +			 * If vma do not allow read access, then assume that it
> +			 * does not allow write access, either. HMM does not
> +			 * support architecture that allow write without read.
> +			 */
> +			hmm_pfns_clear(range, range->pfns,
> +				range->start, range->end);
> +			return -EPERM;
> +		}
> +
> +		range->vma = vma;
> +		hmm_vma_walk.last = start;
> +		hmm_vma_walk.fault = false;
> +		hmm_vma_walk.range = range;
> +		mm_walk.private = &hmm_vma_walk;
> +		end = min(range->end, vma->vm_end);
> +
> +		mm_walk.vma = vma;
> +		mm_walk.mm = vma->vm_mm;
> +		mm_walk.pte_entry = NULL;
> +		mm_walk.test_walk = NULL;
> +		mm_walk.hugetlb_entry = NULL;
> +		mm_walk.pmd_entry = hmm_vma_walk_pmd;
> +		mm_walk.pte_hole = hmm_vma_walk_hole;
> +
> +		walk_page_range(start, end, &mm_walk);
> +		start = end;
> +	} while (start < range->end);
> +
> +	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
> +}
> +EXPORT_SYMBOL(hmm_range_snapshot);
>  
>  /*
>   * hmm_range_fault() - try to fault some address in a virtual address range
> @@ -889,96 +886,79 @@ EXPORT_SYMBOL(hmm_vma_range_done);
>   */
>  long hmm_range_fault(struct hmm_range *range, bool block)
>  {
> -	struct vm_area_struct *vma = range->vma;
> -	unsigned long start = range->start;
> +	unsigned long start = range->start, end;
>  	struct hmm_vma_walk hmm_vma_walk;
> +	struct hmm *hmm = range->hmm;
> +	struct vm_area_struct *vma;
>  	struct mm_walk mm_walk;
> -	struct hmm *hmm;
>  	int ret;
>  
> -	range->hmm = NULL;
> -
> -	/* Sanity check, this really should not happen ! */
> -	if (range->start < vma->vm_start || range->start >= vma->vm_end)
> -		return -EINVAL;
> -	if (range->end < vma->vm_start || range->end > vma->vm_end)
> -		return -EINVAL;
> +	/* Check if hmm_mm_destroy() was call. */
> +	if (hmm->mm == NULL || hmm->dead)
> +		return -EFAULT;
>  
> -	hmm = hmm_register(vma->vm_mm);
> -	if (!hmm) {
> -		hmm_pfns_clear(range, range->pfns, range->start, range->end);
> -		return -ENOMEM;
> -	}
> +	do {
> +		/* If range is no longer valid force retry. */
> +		if (!range->valid) {
> +			up_read(&hmm->mm->mmap_sem);
> +			return -EAGAIN;
> +		}
>  
> -	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL) {
> -		hmm_put(hmm);
> -		return -EINVAL;
> -	}
> +		vma = find_vma(hmm->mm, start);
> +		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
> +			return -EFAULT;
>  
> -	/* FIXME support hugetlb fs */
> -	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) ||
> -			vma_is_dax(vma)) {
> -		hmm_pfns_special(range);
> -		hmm_put(hmm);
> -		return -EINVAL;
> -	}
> +		/* FIXME support hugetlb fs/dax */
> +		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
> +			hmm_pfns_special(range);
> +			return -EINVAL;
> +		}
>  
> -	if (!(vma->vm_flags & VM_READ)) {
> -		/*
> -		 * If vma do not allow read access, then assume that it does
> -		 * not allow write access, either. Architecture that allow
> -		 * write without read access are not supported by HMM, because
> -		 * operations such has atomic access would not work.
> -		 */
> -		hmm_pfns_clear(range, range->pfns, range->start, range->end);
> -		hmm_put(hmm);
> -		return -EPERM;
> -	}
> +		if (!(vma->vm_flags & VM_READ)) {
> +			/*
> +			 * If vma do not allow read access, then assume that it
> +			 * does not allow write access, either. HMM does not
> +			 * support architecture that allow write without read.
> +			 */
> +			hmm_pfns_clear(range, range->pfns,
> +				range->start, range->end);
> +			return -EPERM;
> +		}
>  
> -	/* Initialize range to track CPU page table update */
> -	spin_lock(&hmm->lock);
> -	range->valid = true;
> -	list_add_rcu(&range->list, &hmm->ranges);
> -	spin_unlock(&hmm->lock);
> -
> -	hmm_vma_walk.fault = true;
> -	hmm_vma_walk.block = block;
> -	hmm_vma_walk.range = range;
> -	mm_walk.private = &hmm_vma_walk;
> -	hmm_vma_walk.last = range->start;
> -
> -	mm_walk.vma = vma;
> -	mm_walk.mm = vma->vm_mm;
> -	mm_walk.pte_entry = NULL;
> -	mm_walk.test_walk = NULL;
> -	mm_walk.hugetlb_entry = NULL;
> -	mm_walk.pmd_entry = hmm_vma_walk_pmd;
> -	mm_walk.pte_hole = hmm_vma_walk_hole;
> +		range->vma = vma;
> +		hmm_vma_walk.last = start;
> +		hmm_vma_walk.fault = true;
> +		hmm_vma_walk.block = block;
> +		hmm_vma_walk.range = range;
> +		mm_walk.private = &hmm_vma_walk;
> +		end = min(range->end, vma->vm_end);
> +
> +		mm_walk.vma = vma;
> +		mm_walk.mm = vma->vm_mm;
> +		mm_walk.pte_entry = NULL;
> +		mm_walk.test_walk = NULL;
> +		mm_walk.hugetlb_entry = NULL;
> +		mm_walk.pmd_entry = hmm_vma_walk_pmd;
> +		mm_walk.pte_hole = hmm_vma_walk_hole;
> +
> +		do {
> +			ret = walk_page_range(start, end, &mm_walk);
> +			start = hmm_vma_walk.last;
> +
> +			/* Keep trying while the range is valid. */
> +		} while (ret == -EBUSY && range->valid);
> +
> +		if (ret) {
> +			unsigned long i;
> +
> +			i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
> +			hmm_pfns_clear(range, &range->pfns[i],
> +				hmm_vma_walk.last, range->end);
> +			return ret;
> +		}
> +		start = end;
>  
> -	do {
> -		ret = walk_page_range(start, range->end, &mm_walk);
> -		start = hmm_vma_walk.last;
> -		/* Keep trying while the range is valid. */
> -	} while (ret == -EBUSY && range->valid);
> -
> -	if (ret) {
> -		unsigned long i;
> -
> -		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
> -		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
> -			       range->end);
> -		hmm_vma_range_done(range);
> -		hmm_put(hmm);
> -		return ret;
> -	} else {
> -		/*
> -		 * Transfer hmm reference to the range struct it will be drop
> -		 * inside the hmm_vma_range_done() function (which _must_ be
> -		 * call if this function return 0).
> -		 */
> -		range->hmm = hmm;
> -	}
> +	} while (start < range->end);
>  
>  	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
>  }
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/11] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() v2
  2019-03-25 14:40 ` [PATCH v2 04/11] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() v2 jglisse
@ 2019-03-28 13:30   ` Ira Weiny
  0 siblings, 0 replies; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 13:30 UTC (permalink / raw)
  To: jglisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Mon, Mar 25, 2019 at 10:40:04AM -0400, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Rename for consistency between code, comments and documentation. Also
> improve the comments on all the possible return values. Improve the
> function by returning the number of populated entries in the pfns array.
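
In other words, a successful call now returns how many pfns were filled in
rather than 0; a minimal sketch of the new calling convention (error handling
elided):

    long npages;

    npages = hmm_range_snapshot(&range);
    if (npages < 0)
            return npages;          /* -EINVAL, -EPERM, -EAGAIN or -EFAULT */
    /* The first npages entries of range.pfns[], counting from range.start,
     * hold a valid snapshot. */
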
> 
> Changes since v1:
>     - updated documentation
>     - reformated some comments
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>  Documentation/vm/hmm.rst | 26 ++++++++++++++++++--------
>  include/linux/hmm.h      |  4 ++--
>  mm/hmm.c                 | 31 +++++++++++++++++--------------
>  3 files changed, 37 insertions(+), 24 deletions(-)
> 
> diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> index 44205f0b671f..d9b27bdadd1b 100644
> --- a/Documentation/vm/hmm.rst
> +++ b/Documentation/vm/hmm.rst
> @@ -189,11 +189,7 @@ the driver callback returns.
>  When the device driver wants to populate a range of virtual addresses, it can
>  use either::
>  
> -  int hmm_vma_get_pfns(struct vm_area_struct *vma,
> -                      struct hmm_range *range,
> -                      unsigned long start,
> -                      unsigned long end,
> -                      hmm_pfn_t *pfns);
> +  long hmm_range_snapshot(struct hmm_range *range);
>    int hmm_vma_fault(struct vm_area_struct *vma,
>                      struct hmm_range *range,
>                      unsigned long start,
> @@ -202,7 +198,7 @@ When the device driver wants to populate a range of virtual addresses, it can
>                      bool write,
>                      bool block);
>  
> -The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
> +The first one (hmm_range_snapshot()) will only fetch present CPU page table
>  entries and will not trigger a page fault on missing or non-present entries.
>  The second one does trigger a page fault on missing or read-only entry if the
>  write parameter is true. Page faults use the generic mm page fault code path
> @@ -220,19 +216,33 @@ Locking with the update() callback is the most important aspect the driver must
>   {
>        struct hmm_range range;
>        ...
> +
> +      range.start = ...;
> +      range.end = ...;
> +      range.pfns = ...;
> +      range.flags = ...;
> +      range.values = ...;
> +      range.pfn_shift = ...;
> +
>   again:
> -      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
> -      if (ret)
> +      down_read(&mm->mmap_sem);
> +      range.vma = ...;
> +      ret = hmm_range_snapshot(&range);
> +      if (ret) {
> +          up_read(&mm->mmap_sem);
>            return ret;
> +      }
>        take_lock(driver->update);
>        if (!hmm_vma_range_done(vma, &range)) {
>            release_lock(driver->update);
> +          up_read(&mm->mmap_sem);
>            goto again;
>        }
>  
>        // Use pfns array content to update device page table
>  
>        release_lock(driver->update);
> +      up_read(&mm->mmap_sem);
>        return 0;
>   }
>  
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 716fc61fa6d4..32206b0b1bfd 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -365,11 +365,11 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>   * table invalidation serializes on it.
>   *
>   * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
> - * hmm_vma_get_pfns() WITHOUT ERROR !
> + * hmm_range_snapshot() WITHOUT ERROR !
>   *
>   * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
>   */
> -int hmm_vma_get_pfns(struct hmm_range *range);
> +long hmm_range_snapshot(struct hmm_range *range);
>  bool hmm_vma_range_done(struct hmm_range *range);
>  
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 213b0beee8d3..91361aa74b8b 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -698,23 +698,25 @@ static void hmm_pfns_special(struct hmm_range *range)
>  }
>  
>  /*
> - * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
> - * @range: range being snapshotted
> - * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
> - *          vma permission, 0 success
> + * hmm_range_snapshot() - snapshot CPU page table for a range
> + * @range: range
> + * Returns: number of valid pages in range->pfns[] (from range start
> + *          address). This may be zero. If the return value is negative,
> + *          then one of the following values may be returned:
> + *
> + *           -EINVAL  invalid arguments or mm or virtual address are in an
> + *                    invalid vma (ie either hugetlbfs or device file vma).
> + *           -EPERM   For example, asking for write, when the range is
> + *                    read-only
> + *           -EAGAIN  Caller needs to retry
> + *           -EFAULT  Either no valid vma exists for this range, or it is
> + *                    illegal to access the range
>   *
>   * This snapshots the CPU page table for a range of virtual addresses. Snapshot
>   * validity is tracked by range struct. See hmm_vma_range_done() for further
>   * information.
> - *
> - * The range struct is initialized here. It tracks the CPU page table, but only
> - * if the function returns success (0), in which case the caller must then call
> - * hmm_vma_range_done() to stop CPU page table update tracking on this range.
> - *
> - * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
> - * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
>   */
> -int hmm_vma_get_pfns(struct hmm_range *range)
> +long hmm_range_snapshot(struct hmm_range *range)
>  {
>  	struct vm_area_struct *vma = range->vma;
>  	struct hmm_vma_walk hmm_vma_walk;
> @@ -768,6 +770,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>  	hmm_vma_walk.fault = false;
>  	hmm_vma_walk.range = range;
>  	mm_walk.private = &hmm_vma_walk;
> +	hmm_vma_walk.last = range->start;
>  
>  	mm_walk.vma = vma;
>  	mm_walk.mm = vma->vm_mm;
> @@ -784,9 +787,9 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>  	 * function return 0).
>  	 */
>  	range->hmm = hmm;
> -	return 0;
> +	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
>  }
> -EXPORT_SYMBOL(hmm_vma_get_pfns);
> +EXPORT_SYMBOL(hmm_range_snapshot);
>  
>  /*
>   * hmm_vma_range_done() - stop tracking change to CPU page table over a range
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 05/11] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() v2
  2019-03-25 14:40 ` [PATCH v2 05/11] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() v2 jglisse
@ 2019-03-28 13:43   ` Ira Weiny
  2019-03-28 22:03     ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 13:43 UTC (permalink / raw)
  To: jglisse; +Cc: linux-mm, linux-kernel, Andrew Morton, John Hubbard, Dan Williams

On Mon, Mar 25, 2019 at 10:40:05AM -0400, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Rename for consistency between code, comments and documentation. Also
> improve the comments on all the possible return values. Improve the
> function by returning the number of populated entries in the pfns array.
> 
> Changes since v1:
>     - updated documentation
>     - reformated some comments
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>  Documentation/vm/hmm.rst |  8 +---
>  include/linux/hmm.h      | 13 +++++-
>  mm/hmm.c                 | 91 +++++++++++++++++-----------------------
>  3 files changed, 52 insertions(+), 60 deletions(-)
> 
> diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> index d9b27bdadd1b..61f073215a8d 100644
> --- a/Documentation/vm/hmm.rst
> +++ b/Documentation/vm/hmm.rst
> @@ -190,13 +190,7 @@ When the device driver wants to populate a range of virtual addresses, it can
>  use either::
>  
>    long hmm_range_snapshot(struct hmm_range *range);
> -  int hmm_vma_fault(struct vm_area_struct *vma,
> -                    struct hmm_range *range,
> -                    unsigned long start,
> -                    unsigned long end,
> -                    hmm_pfn_t *pfns,
> -                    bool write,
> -                    bool block);
> +  long hmm_range_fault(struct hmm_range *range, bool block);
>  
>  The first one (hmm_range_snapshot()) will only fetch present CPU page table
>  entries and will not trigger a page fault on missing or non-present entries.
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 32206b0b1bfd..e9afd23c2eac 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -391,7 +391,18 @@ bool hmm_vma_range_done(struct hmm_range *range);
>   *
>   * See the function description in mm/hmm.c for further documentation.
>   */
> -int hmm_vma_fault(struct hmm_range *range, bool block);
> +long hmm_range_fault(struct hmm_range *range, bool block);
> +
> +/* This is a temporary helper to avoid merge conflict between trees. */
> +static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> +{
> +	long ret = hmm_range_fault(range, block);
> +	if (ret == -EBUSY)
> +		ret = -EAGAIN;
> +	else if (ret == -EAGAIN)
> +		ret = -EBUSY;
> +	return ret < 0 ? ret : 0;
> +}
>  
>  /* Below are for HMM internal use only! Not to be used by device driver! */
>  void hmm_mm_destroy(struct mm_struct *mm);
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 91361aa74b8b..7860e63c3ba7 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -336,13 +336,13 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
>  	flags |= write_fault ? FAULT_FLAG_WRITE : 0;
>  	ret = handle_mm_fault(vma, addr, flags);
>  	if (ret & VM_FAULT_RETRY)
> -		return -EBUSY;
> +		return -EAGAIN;
>  	if (ret & VM_FAULT_ERROR) {
>  		*pfn = range->values[HMM_PFN_ERROR];
>  		return -EFAULT;
>  	}
>  
> -	return -EAGAIN;
> +	return -EBUSY;
>  }
>  
>  static int hmm_pfns_bad(unsigned long addr,
> @@ -368,7 +368,7 @@ static int hmm_pfns_bad(unsigned long addr,
>   * @fault: should we fault or not ?
>   * @write_fault: write fault ?
>   * @walk: mm_walk structure
> - * Returns: 0 on success, -EAGAIN after page fault, or page fault error
> + * Returns: 0 on success, -EBUSY after page fault, or page fault error
>   *
>   * This function will be called whenever pmd_none() or pte_none() returns true,
>   * or whenever there is no page directory covering the virtual address range.
> @@ -391,12 +391,12 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
>  
>  			ret = hmm_vma_do_fault(walk, addr, write_fault,
>  					       &pfns[i]);
> -			if (ret != -EAGAIN)
> +			if (ret != -EBUSY)
>  				return ret;
>  		}
>  	}
>  
> -	return (fault || write_fault) ? -EAGAIN : 0;
> +	return (fault || write_fault) ? -EBUSY : 0;
>  }
>  
>  static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
> @@ -527,11 +527,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  	uint64_t orig_pfn = *pfn;
>  
>  	*pfn = range->values[HMM_PFN_NONE];
> -	cpu_flags = pte_to_hmm_pfn_flags(range, pte);
> -	hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
> -			   &fault, &write_fault);
> +	fault = write_fault = false;
>  
>  	if (pte_none(pte)) {
> +		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0,
> +				   &fault, &write_fault);

This really threw me until I applied the patches to a tree.  It looks like this
is just optimizing away a pte_none() check.  The only functional change which
was mentioned was returning the number of populated pfns.  So I spent a bit of
time trying to figure out why hmm_pte_need_fault() needed to move _here_ to do
that...  :-(

It would have been nice to have said something about optimizing in the commit
message.

>  		if (fault || write_fault)
>  			goto fault;
>  		return 0;
> @@ -570,7 +570,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  				hmm_vma_walk->last = addr;
>  				migration_entry_wait(vma->vm_mm,
>  						     pmdp, addr);
> -				return -EAGAIN;
> +				return -EBUSY;
>  			}
>  			return 0;
>  		}
> @@ -578,6 +578,10 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  		/* Report error for everything else */
>  		*pfn = range->values[HMM_PFN_ERROR];
>  		return -EFAULT;
> +	} else {
> +		cpu_flags = pte_to_hmm_pfn_flags(range, pte);
> +		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
> +				   &fault, &write_fault);

Looks like the same situation as above.

>  	}
>  
>  	if (fault || write_fault)
> @@ -628,7 +632,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  		if (fault || write_fault) {
>  			hmm_vma_walk->last = addr;
>  			pmd_migration_entry_wait(vma->vm_mm, pmdp);
> -			return -EAGAIN;
> +			return -EBUSY;

While I am at it: why are we swapping EAGAIN and EBUSY everywhere?

Ira

>  		}
>  		return 0;
>  	} else if (!pmd_present(pmd))
> @@ -856,53 +860,34 @@ bool hmm_vma_range_done(struct hmm_range *range)
>  EXPORT_SYMBOL(hmm_vma_range_done);
>  
>  /*
> - * hmm_vma_fault() - try to fault some address in a virtual address range
> + * hmm_range_fault() - try to fault some address in a virtual address range
>   * @range: range being faulted
>   * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
> - * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
> + * Returns: number of valid pages in range->pfns[] (from range start
> + *          address). This may be zero. If the return value is negative,
> + *          then one of the following values may be returned:
> + *
> + *           -EINVAL  invalid arguments or mm or virtual address are in an
> + *                    invalid vma (ie either hugetlbfs or device file vma).
> + *           -ENOMEM: Out of memory.
> + *           -EPERM:  Invalid permission (for instance asking for write and
> + *                    range is read only).
> + *           -EAGAIN: If you need to retry and mmap_sem was drop. This can only
> + *                    happens if block argument is false.
> + *           -EBUSY:  If the the range is being invalidated and you should wait
> + *                    for invalidation to finish.
> + *           -EFAULT: Invalid (ie either no valid vma or it is illegal to access
> + *                    that range), number of valid pages in range->pfns[] (from
> + *                    range start address).
>   *
>   * This is similar to a regular CPU page fault except that it will not trigger
> - * any memory migration if the memory being faulted is not accessible by CPUs.
> + * any memory migration if the memory being faulted is not accessible by CPUs
> + * and caller does not ask for migration.
>   *
>   * On error, for one virtual address in the range, the function will mark the
>   * corresponding HMM pfn entry with an error flag.
> - *
> - * Expected use pattern:
> - * retry:
> - *   down_read(&mm->mmap_sem);
> - *   // Find vma and address device wants to fault, initialize hmm_pfn_t
> - *   // array accordingly
> - *   ret = hmm_vma_fault(range, write, block);
> - *   switch (ret) {
> - *   case -EAGAIN:
> - *     hmm_vma_range_done(range);
> - *     // You might want to rate limit or yield to play nicely, you may
> - *     // also commit any valid pfn in the array assuming that you are
> - *     // getting true from hmm_vma_range_monitor_end()
> - *     goto retry;
> - *   case 0:
> - *     break;
> - *   case -ENOMEM:
> - *   case -EINVAL:
> - *   case -EPERM:
> - *   default:
> - *     // Handle error !
> - *     up_read(&mm->mmap_sem)
> - *     return;
> - *   }
> - *   // Take device driver lock that serialize device page table update
> - *   driver_lock_device_page_table_update();
> - *   hmm_vma_range_done(range);
> - *   // Commit pfns we got from hmm_vma_fault()
> - *   driver_unlock_device_page_table_update();
> - *   up_read(&mm->mmap_sem)
> - *
> - * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURN SUCCESS (0)
> - * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
> - *
> - * YOU HAVE BEEN WARNED !
>   */
> -int hmm_vma_fault(struct hmm_range *range, bool block)
> +long hmm_range_fault(struct hmm_range *range, bool block)
>  {
>  	struct vm_area_struct *vma = range->vma;
>  	unsigned long start = range->start;
> @@ -974,7 +959,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>  	do {
>  		ret = walk_page_range(start, range->end, &mm_walk);
>  		start = hmm_vma_walk.last;
> -	} while (ret == -EAGAIN);
> +		/* Keep trying while the range is valid. */
> +	} while (ret == -EBUSY && range->valid);
>  
>  	if (ret) {
>  		unsigned long i;
> @@ -984,6 +970,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>  			       range->end);
>  		hmm_vma_range_done(range);
>  		hmm_put(hmm);
> +		return ret;
>  	} else {
>  		/*
>  		 * Transfer hmm reference to the range struct it will be drop
> @@ -993,9 +980,9 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
>  		range->hmm = hmm;
>  	}
>  
> -	return ret;
> +	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
>  }
> -EXPORT_SYMBOL(hmm_vma_fault);
> +EXPORT_SYMBOL(hmm_range_fault);
>  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>  
>  
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2
  2019-03-25 14:40 ` [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2 jglisse
  2019-03-28 13:11   ` Ira Weiny
@ 2019-03-28 16:12   ` Ira Weiny
  2019-03-29  0:56     ` Jerome Glisse
  1 sibling, 1 reply; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 16:12 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Andrew Morton, John Hubbard,
	Dan Williams, Dan Carpenter, Matthew Wilcox

On Mon, Mar 25, 2019 at 10:40:06AM -0400, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> A common use case for an HMM mirror is a user trying to mirror a range,
> and before they can program the hardware it gets invalidated by some
> core mm event. Instead of having the user retry right away to mirror
> the range, provide a completion mechanism for them to wait for any
> active invalidation affecting the range.
> 
> This also changes how hmm_range_snapshot() and hmm_range_fault() work
> by not relying on the vma, so that we can drop the mmap_sem when
> waiting and look up the vma again on retry.
> 
> Changes since v1:
>     - squashed: Dan Carpenter: potential deadlock in nonblocking code
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> ---
>  include/linux/hmm.h | 208 ++++++++++++++---
>  mm/hmm.c            | 528 +++++++++++++++++++++-----------------------
>  2 files changed, 428 insertions(+), 308 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index e9afd23c2eac..79671036cb5f 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -77,8 +77,34 @@
>  #include <linux/migrate.h>
>  #include <linux/memremap.h>
>  #include <linux/completion.h>
> +#include <linux/mmu_notifier.h>
>  
> -struct hmm;
> +
> +/*
> + * struct hmm - HMM per mm struct
> + *
> + * @mm: mm struct this HMM struct is bound to
> + * @lock: lock protecting ranges list
> + * @ranges: list of range being snapshotted
> + * @mirrors: list of mirrors for this mm
> + * @mmu_notifier: mmu notifier to track updates to CPU page table
> + * @mirrors_sem: read/write semaphore protecting the mirrors list
> + * @wq: wait queue for user waiting on a range invalidation
> + * @notifiers: count of active mmu notifiers
> + * @dead: is the mm dead ?
> + */
> +struct hmm {
> +	struct mm_struct	*mm;
> +	struct kref		kref;
> +	struct mutex		lock;
> +	struct list_head	ranges;
> +	struct list_head	mirrors;
> +	struct mmu_notifier	mmu_notifier;
> +	struct rw_semaphore	mirrors_sem;
> +	wait_queue_head_t	wq;
> +	long			notifiers;
> +	bool			dead;
> +};
>  
>  /*
>   * hmm_pfn_flag_e - HMM flag enums
> @@ -155,6 +181,38 @@ struct hmm_range {
>  	bool			valid;
>  };
>  
> +/*
> + * hmm_range_wait_until_valid() - wait for range to be valid
> + * @range: range affected by invalidation to wait on
> + * @timeout: time out for wait in ms (ie abort wait after that period of time)
> + * Returns: true if the range is valid, false otherwise.
> + */
> +static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> +					      unsigned long timeout)
> +{
> +	/* Check if mm is dead ? */
> +	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> +		range->valid = false;
> +		return false;
> +	}
> +	if (range->valid)
> +		return true;
> +	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> +			   msecs_to_jiffies(timeout));
> +	/* Return current valid status just in case we get lucky */
> +	return range->valid;
> +}
> +
> +/*
> + * hmm_range_valid() - test if a range is valid or not
> + * @range: range
> + * Returns: true if the range is valid, false otherwise.
> + */
> +static inline bool hmm_range_valid(struct hmm_range *range)
> +{
> +	return range->valid;
> +}
> +
>  /*
>   * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
>   * @range: range use to decode HMM pfn value
> @@ -357,51 +415,133 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  
>  
>  /*
> - * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
> - * driver lock that serializes device page table updates, then call
> - * hmm_vma_range_done(), to check if the snapshot is still valid. The same
> - * device driver page table update lock must also be used in the
> - * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
> - * table invalidation serializes on it.
> + * To snapshot the CPU page table you first have to call hmm_range_register()
> + * to register the range. If hmm_range_register() return an error then some-
> + * thing is horribly wrong and you should fail loudly. If it returned true then
> + * you can wait for the range to be stable with hmm_range_wait_until_valid()
> + * function, a range is valid when there are no concurrent changes to the CPU
> + * page table for the range.
> + *
> + * Once the range is valid you can call hmm_range_snapshot() if that returns
> + * without error then you can take your device page table lock (the same lock
> + * you use in the HMM mirror sync_cpu_device_pagetables() callback). After
> + * taking that lock you have to check the range validity, if it is still valid
> + * (ie hmm_range_valid() returns true) then you can program the device page
> + * table, otherwise you have to start again. Pseudo code:
> + *
> + *      mydevice_prefault(mydevice, mm, start, end)
> + *      {
> + *          struct hmm_range range;
> + *          ...
>   *
> - * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
> - * hmm_range_snapshot() WITHOUT ERROR !
> + *          ret = hmm_range_register(&range, mm, start, end);
> + *          if (ret)
> + *              return ret;
>   *
> - * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
> - */
> -long hmm_range_snapshot(struct hmm_range *range);
> -bool hmm_vma_range_done(struct hmm_range *range);
> -
> -
> -/*
> - * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
> - * not migrate any device memory back to system memory. The HMM pfn array will
> - * be updated with the fault result and current snapshot of the CPU page table
> - * for the range.
> + *          down_read(mm->mmap_sem);
> + *      again:
> + *
> + *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
> + *              up_read(&mm->mmap_sem);
> + *              hmm_range_unregister(range);
> + *              // Handle time out, either sleep or retry or something else
> + *              ...
> + *              return -ESOMETHING; || goto again;
> + *          }
> + *
> + *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
> + *          if (ret == -EAGAIN) {
> + *              down_read(mm->mmap_sem);
> + *              goto again;
> + *          } else if (ret == -EBUSY) {
> + *              goto again;
> + *          }
> + *
> + *          up_read(&mm->mmap_sem);
> + *          if (ret) {
> + *              hmm_range_unregister(range);
> + *              return ret;
> + *          }
> + *
> + *          // It might not have snapshotted the whole range but only the first
> + *          // npages; the return value is the number of valid pages from the
> + *          // start of the range.
> + *          npages = ret;
>   *
> - * The mmap_sem must be taken in read mode before entering and it might be
> - * dropped by the function if the block argument is false. In that case, the
> - * function returns -EAGAIN.
> + *          ...
>   *
> - * Return value does not reflect if the fault was successful for every single
> - * address or not. Therefore, the caller must to inspect the HMM pfn array to
> - * determine fault status for each address.
> + *          mydevice_page_table_lock(mydevice);
> + *          if (!hmm_range_valid(range)) {
> + *              mydevice_page_table_unlock(mydevice);
> + *              goto again;
> + *          }
>   *
> - * Trying to fault inside an invalid vma will result in -EINVAL.
> + *          mydevice_populate_page_table(mydevice, range, npages);
> + *          ...
> + *          mydevice_page_table_unlock(mydevice);
> + *          hmm_range_unregister(range);
>   *
> - * See the function description in mm/hmm.c for further documentation.
> + *          return 0;
> + *      }
> + *
> + * The same scheme applies to hmm_range_fault() (ie replace hmm_range_snapshot()
> + * with hmm_range_fault() in the above pseudo code).
> + *
> + * YOU MUST CALL hmm_range_unregister() ONCE AND ONLY ONCE EACH TIME YOU CALL
> + * hmm_range_register() AND hmm_range_register() RETURNED TRUE ! IF YOU DO NOT
> + * FOLLOW THIS RULE MEMORY CORRUPTION WILL ENSUE !
>   */
> +int hmm_range_register(struct hmm_range *range,
> +		       struct mm_struct *mm,
> +		       unsigned long start,
> +		       unsigned long end);
> +void hmm_range_unregister(struct hmm_range *range);

The above comment is great!  But I think you also need to update
Documentation/vm/hmm.rst:hmm_range_snapshot() to show the use of
hmm_range_[un]register()

> +long hmm_range_snapshot(struct hmm_range *range);
>  long hmm_range_fault(struct hmm_range *range, bool block);
>  
> +/*
> + * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> + *
> + * When waiting for mmu notifiers we need some kind of time out, otherwise we
> + * could potentially wait forever; 1000ms ie 1s already sounds like a long time
> + * to wait.
> + */
> +#define HMM_RANGE_DEFAULT_TIMEOUT 1000
> +
>  /* This is a temporary helper to avoid merge conflict between trees. */
> +static inline bool hmm_vma_range_done(struct hmm_range *range)
> +{
> +	bool ret = hmm_range_valid(range);
> +
> +	hmm_range_unregister(range);
> +	return ret;
> +}
> +
>  static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>  {
> -	long ret = hmm_range_fault(range, block);
> -	if (ret == -EBUSY)
> -		ret = -EAGAIN;
> -	else if (ret == -EAGAIN)
> -		ret = -EBUSY;
> -	return ret < 0 ? ret : 0;
> +	long ret;
> +
> +	ret = hmm_range_register(range, range->vma->vm_mm,
> +				 range->start, range->end);
> +	if (ret)
> +		return (int)ret;
> +
> +	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
> +		up_read(&range->vma->vm_mm->mmap_sem);
> +		return -EAGAIN;
> +	}
> +
> +	ret = hmm_range_fault(range, block);
> +	if (ret <= 0) {
> +		if (ret == -EBUSY || !ret) {
> +			up_read(&range->vma->vm_mm->mmap_sem);
> +			ret = -EBUSY;
> +		} else if (ret == -EAGAIN)
> +			ret = -EBUSY;
> +		hmm_range_unregister(range);
> +		return ret;
> +	}
> +	return 0;

Is hmm_vma_fault() also temporary to keep the nouveau driver working?  It looks
like it to me.

This and hmm_vma_range_done() above are part of the old interface which is in
the Documentation, correct?  As stated above, we should probably change that
documentation with this patch to ensure no new users of these 2 functions
appear.

Ira


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-28 23:28               ` John Hubbard
@ 2019-03-28 16:42                 ` Ira Weiny
  2019-03-29  1:17                   ` Jerome Glisse
  2019-03-28 23:43                 ` Jerome Glisse
  1 sibling, 1 reply; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 16:42 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> [...]
> >> Hi Jerome,
> >>
> >> I think you're talking about flags, but I'm talking about the mask. The 
> >> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
> >> that it uses are still in the same lower 3 bits:
> >>
> >> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> >> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
> >> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
> >> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
> >> +};
> >>
> >> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
> >> mask, that is *also* runtime changeable. 
> > 
> > So the pfn array is using a device driver specific format and we have
> > no idea, nor do we need to know, where the valid, write, ... bits are in
> > that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> > we do not care. There are devices with bits at the top, and for those you
> > need a mask that allows you to mask those bits out or not, depending on
> > what the user wants to do.
> > 
> > The mask here is against an _unknown_ (from HMM POV) format. So we can
> > not presume where the bits will be and thus we can not presume what a
> > proper mask is.
> > 
> > So that's why a full unsigned long mask is used here.
> > 
> > Maybe an example will help. Let's say the device flags are:
> >     VALID (1 << 63)
> >     WRITE (1 << 62)
> > 
> > Now let's say the device wants to fault a range with at least read
> > permission; it sets:
> >     range->default_flags = (1 << 63)
> >     range->pfn_flags_mask = 0;
> > 
> > This will fault in all pages in the range with at least read
> > permission.
> > 
> > Now let's say it wants to do the same, except for one page in the range
> > for which it wants write permission. The driver then sets:
> >     range->default_flags = (1 << 63);
> >     range->pfn_flags_mask = (1 << 62);
> >     range->pfns[index_of_write] = (1 << 62);
> > 
> > With this HMM will fault in all pages with at least read (ie valid),
> > and for the address range->start + (index_of_write << PAGE_SHIFT) it
> > will fault with write permission, ie if the CPU pte does not have
> > write permission set then handle_mm_fault() will be called asking for
> > write permission.
> > 
> > 
> > Note that in the above HMM will populate the pfns array with write
> > permission for any entry that has write permission within the CPU
> > pte, ie default_flags and pfn_flags_mask are only the minimum
> > requirement, but HMM always returns all the flags that are set in the
> > CPU pte.
> > 
> > 
> > Now let's say you are an "old" driver like nouveau upstream; then it
> > means that you are setting each individual entry within range->pfns
> > with the exact flags you want for each address, hence here what you
> > want is:
> >     range->default_flags = 0;
> >     range->pfn_flags_mask = -1UL;
> > 
> > So what we do is (for each entry):
> >     (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> > and we end up with the flags that were set by the driver for each of
> > the individual range->pfns entries.
> > 
> > 
> > Does this help ?
> > 
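To make that concrete for myself, here is a minimal sketch of the combination
rule (the MYDEV_* bit positions are just the example values from the
explanation above, nothing HMM itself defines; all names are made up):

#include <linux/hmm.h>

#define MYDEV_PFN_VALID		(1ULL << 63)	/* assumed device "valid" bit */
#define MYDEV_PFN_WRITE		(1ULL << 62)	/* assumed device "write" bit */

/* The effective input flags HMM derives for entry @i of @range. */
static inline uint64_t mydev_input_flags(const struct hmm_range *range,
					 unsigned long i)
{
	return (range->pfns[i] & range->pfn_flags_mask) |
	       range->default_flags;
}

/* Fault the whole range with at least read, plus write on one page. */
static void mydev_setup_range(struct hmm_range *range,
			      unsigned long index_of_write)
{
	range->default_flags = MYDEV_PFN_VALID;
	range->pfn_flags_mask = MYDEV_PFN_WRITE;
	range->pfns[index_of_write] = MYDEV_PFN_WRITE;
}

With pfn_flags_mask set to 0 instead, whatever happens to be left in
range->pfns is masked away and only default_flags matters.
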
> 
> Yes, the key point for me was that this is an entirely device driver specific
> format. OK. But then we have HMM setting it. So a comment to the effect that
> this is device-specific might be nice, but I'll leave that up to you whether
> it is useful.

Indeed I did not realize there is an hmm "pfn" until I saw this function:

/*
 * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
 * @range: range use to encode HMM pfn value
 * @pfn: pfn value for which to create the HMM pfn
 * Returns: valid HMM pfn for the pfn
 */
static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
                                        unsigned long pfn)

So should this patch contain some sort of helper like this... maybe?

I'm assuming the "hmm_pfn" being returned above is the device pfn being
discussed here?

I'm also thinking that calling it pfn is confusing.  I'm not advocating a new
type, but calling the "device pfns" "hmm_pfn" or "device_pfn" seems like it
would have shortened the discussion here.
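
For reference, the two directions look roughly like this (a sketch only; the
function name is made up, the helpers are the ones quoted in this series):

#include <linux/hmm.h>

/*
 * range->pfns[i] is in the device/HMM encoding: HMM fills it during the walk
 * as hmm_pfn_from_pfn(range, pfn) | cpu_flags, and the driver decodes it back
 * once the snapshot is valid.
 */
static struct page *mydev_entry_to_page(const struct hmm_range *range,
					unsigned long i)
{
	return hmm_pfn_to_page(range, range->pfns[i]);
}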

Ira

> 
> Either way, you can add:
> 
> 	Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 08/11] mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping) v2
  2019-03-25 14:40 ` [PATCH v2 08/11] mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping) v2 jglisse
@ 2019-03-28 16:53   ` Ira Weiny
  0 siblings, 0 replies; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 16:53 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Andrew Morton, John Hubbard,
	Dan Williams, Arnd Bergmann

On Mon, Mar 25, 2019 at 10:40:08AM -0400, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> HMM mirror is a device driver helper to mirror ranges of virtual addresses.
> It means that the process jobs running on the device can access the same
> virtual addresses as the CPU threads of that process. This patch adds support
> for hugetlbfs mappings (ie ranges of virtual addresses that are mmaps of a
> hugetlbfs file).
> 
> Changes since v1:
>     - improved commit message
>     - squashed: Arnd Bergmann: fix unused variable warnings
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> ---
>  include/linux/hmm.h |  29 ++++++++--
>  mm/hmm.c            | 126 +++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 138 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 13bc2c72f791..f3b919b04eda 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -181,10 +181,31 @@ struct hmm_range {
>  	const uint64_t		*values;
>  	uint64_t		default_flags;
>  	uint64_t		pfn_flags_mask;
> +	uint8_t			page_shift;
>  	uint8_t			pfn_shift;
>  	bool			valid;
>  };
>  
> +/*
> + * hmm_range_page_shift() - return the page shift for the range
> + * @range: range being queried
> + * Returns: page shift (page size = 1 << page shift) for the range
> + */
> +static inline unsigned hmm_range_page_shift(const struct hmm_range *range)
> +{
> +	return range->page_shift;
> +}
> +
> +/*
> + * hmm_range_page_size() - return the page size for the range
> + * @range: range being queried
> + * Returns: page size for the range in bytes
> + */
> +static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
> +{
> +	return 1UL << hmm_range_page_shift(range);
> +}
> +
>  /*
>   * hmm_range_wait_until_valid() - wait for range to be valid
>   * @range: range affected by invalidation to wait on
> @@ -438,7 +459,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>   *          struct hmm_range range;
>   *          ...
>   *
> - *          ret = hmm_range_register(&range, mm, start, end);
> + *          ret = hmm_range_register(&range, mm, start, end, page_shift);
>   *          if (ret)
>   *              return ret;
>   *
> @@ -498,7 +519,8 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  int hmm_range_register(struct hmm_range *range,
>  		       struct mm_struct *mm,
>  		       unsigned long start,
> -		       unsigned long end);
> +		       unsigned long end,
> +		       unsigned page_shift);
>  void hmm_range_unregister(struct hmm_range *range);
>  long hmm_range_snapshot(struct hmm_range *range);
>  long hmm_range_fault(struct hmm_range *range, bool block);
> @@ -529,7 +551,8 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>  	range->pfn_flags_mask = -1UL;
>  
>  	ret = hmm_range_register(range, range->vma->vm_mm,
> -				 range->start, range->end);
> +				 range->start, range->end,
> +				 PAGE_SHIFT);
>  	if (ret)
>  		return (int)ret;
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 4fe88a196d17..64a33770813b 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -387,11 +387,13 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
>  	struct hmm_vma_walk *hmm_vma_walk = walk->private;
>  	struct hmm_range *range = hmm_vma_walk->range;
>  	uint64_t *pfns = range->pfns;
> -	unsigned long i;
> +	unsigned long i, page_size;
>  
>  	hmm_vma_walk->last = addr;
> -	i = (addr - range->start) >> PAGE_SHIFT;
> -	for (; addr < end; addr += PAGE_SIZE, i++) {
> +	page_size = 1UL << range->page_shift;

NIT: page_size = hmm_range_page_size(range);

??

Otherwise:

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> +	i = (addr - range->start) >> range->page_shift;
> +
> +	for (; addr < end; addr += page_size, i++) {
>  		pfns[i] = range->values[HMM_PFN_NONE];
>  		if (fault || write_fault) {
>  			int ret;
> @@ -703,6 +705,69 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  	return 0;
>  }
>  
> +static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
> +				      unsigned long start, unsigned long end,
> +				      struct mm_walk *walk)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> +	unsigned long addr = start, i, pfn, mask, size, pfn_inc;
> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> +	struct hmm_range *range = hmm_vma_walk->range;
> +	struct vm_area_struct *vma = walk->vma;
> +	struct hstate *h = hstate_vma(vma);
> +	uint64_t orig_pfn, cpu_flags;
> +	bool fault, write_fault;
> +	spinlock_t *ptl;
> +	pte_t entry;
> +	int ret = 0;
> +
> +	size = 1UL << huge_page_shift(h);
> +	mask = size - 1;
> +	if (range->page_shift != PAGE_SHIFT) {
> +		/* Make sure we are looking at full page. */
> +		if (start & mask)
> +			return -EINVAL;
> +		if (end < (start + size))
> +			return -EINVAL;
> +		pfn_inc = size >> PAGE_SHIFT;
> +	} else {
> +		pfn_inc = 1;
> +		size = PAGE_SIZE;
> +	}
> +
> +
> +	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
> +	entry = huge_ptep_get(pte);
> +
> +	i = (start - range->start) >> range->page_shift;
> +	orig_pfn = range->pfns[i];
> +	range->pfns[i] = range->values[HMM_PFN_NONE];
> +	cpu_flags = pte_to_hmm_pfn_flags(range, entry);
> +	fault = write_fault = false;
> +	hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
> +			   &fault, &write_fault);
> +	if (fault || write_fault) {
> +		ret = -ENOENT;
> +		goto unlock;
> +	}
> +
> +	pfn = pte_pfn(entry) + (start & mask);
> +	for (; addr < end; addr += size, i++, pfn += pfn_inc)
> +		range->pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
> +	hmm_vma_walk->last = end;
> +
> +unlock:
> +	spin_unlock(ptl);
> +
> +	if (ret == -ENOENT)
> +		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> +
> +	return ret;
> +#else /* CONFIG_HUGETLB_PAGE */
> +	return -EINVAL;
> +#endif
> +}
> +
>  static void hmm_pfns_clear(struct hmm_range *range,
>  			   uint64_t *pfns,
>  			   unsigned long addr,
> @@ -726,6 +791,7 @@ static void hmm_pfns_special(struct hmm_range *range)
>   * @mm: the mm struct for the range of virtual address
>   * @start: start virtual address (inclusive)
>   * @end: end virtual address (exclusive)
> + * @page_shift: expect page shift for the range
>   * Returns 0 on success, -EFAULT if the address space is no longer valid
>   *
>   * Track updates to the CPU page table see include/linux/hmm.h
> @@ -733,16 +799,23 @@ static void hmm_pfns_special(struct hmm_range *range)
>  int hmm_range_register(struct hmm_range *range,
>  		       struct mm_struct *mm,
>  		       unsigned long start,
> -		       unsigned long end)
> +		       unsigned long end,
> +		       unsigned page_shift)
>  {
> -	range->start = start & PAGE_MASK;
> -	range->end = end & PAGE_MASK;
> +	unsigned long mask = ((1UL << page_shift) - 1UL);
> +
>  	range->valid = false;
>  	range->hmm = NULL;
>  
> -	if (range->start >= range->end)
> +	if ((start & mask) || (end & mask))
> +		return -EINVAL;
> +	if (start >= end)
>  		return -EINVAL;
>  
> +	range->page_shift = page_shift;
> +	range->start = start;
> +	range->end = end;
> +
>  	range->hmm = hmm_register(mm);
>  	if (!range->hmm)
>  		return -EFAULT;
> @@ -809,6 +882,7 @@ EXPORT_SYMBOL(hmm_range_unregister);
>   */
>  long hmm_range_snapshot(struct hmm_range *range)
>  {
> +	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
>  	unsigned long start = range->start, end;
>  	struct hmm_vma_walk hmm_vma_walk;
>  	struct hmm *hmm = range->hmm;
> @@ -825,15 +899,26 @@ long hmm_range_snapshot(struct hmm_range *range)
>  			return -EAGAIN;
>  
>  		vma = find_vma(hmm->mm, start);
> -		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
> +		if (vma == NULL || (vma->vm_flags & device_vma))
>  			return -EFAULT;
>  
> -		/* FIXME support hugetlb fs/dax */
> -		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
> +		/* FIXME support dax */
> +		if (vma_is_dax(vma)) {
>  			hmm_pfns_special(range);
>  			return -EINVAL;
>  		}
>  
> +		if (is_vm_hugetlb_page(vma)) {
> +			struct hstate *h = hstate_vma(vma);
> +
> +			if (huge_page_shift(h) != range->page_shift &&
> +			    range->page_shift != PAGE_SHIFT)
> +				return -EINVAL;
> +		} else {
> +			if (range->page_shift != PAGE_SHIFT)
> +				return -EINVAL;
> +		}
> +
>  		if (!(vma->vm_flags & VM_READ)) {
>  			/*
>  			 * If vma do not allow read access, then assume that it
> @@ -859,6 +944,7 @@ long hmm_range_snapshot(struct hmm_range *range)
>  		mm_walk.hugetlb_entry = NULL;
>  		mm_walk.pmd_entry = hmm_vma_walk_pmd;
>  		mm_walk.pte_hole = hmm_vma_walk_hole;
> +		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
>  
>  		walk_page_range(start, end, &mm_walk);
>  		start = end;
> @@ -877,7 +963,7 @@ EXPORT_SYMBOL(hmm_range_snapshot);
>   *          then one of the following values may be returned:
>   *
>   *           -EINVAL  invalid arguments or mm or virtual address are in an
> - *                    invalid vma (ie either hugetlbfs or device file vma).
> + *                    invalid vma (for instance device file vma).
>   *           -ENOMEM: Out of memory.
>   *           -EPERM:  Invalid permission (for instance asking for write and
>   *                    range is read only).
> @@ -898,6 +984,7 @@ EXPORT_SYMBOL(hmm_range_snapshot);
>   */
>  long hmm_range_fault(struct hmm_range *range, bool block)
>  {
> +	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
>  	unsigned long start = range->start, end;
>  	struct hmm_vma_walk hmm_vma_walk;
>  	struct hmm *hmm = range->hmm;
> @@ -917,15 +1004,25 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>  		}
>  
>  		vma = find_vma(hmm->mm, start);
> -		if (vma == NULL || (vma->vm_flags & VM_SPECIAL))
> +		if (vma == NULL || (vma->vm_flags & device_vma))
>  			return -EFAULT;
>  
> -		/* FIXME support hugetlb fs/dax */
> -		if (is_vm_hugetlb_page(vma) || vma_is_dax(vma)) {
> +		/* FIXME support dax */
> +		if (vma_is_dax(vma)) {
>  			hmm_pfns_special(range);
>  			return -EINVAL;
>  		}
>  
> +		if (is_vm_hugetlb_page(vma)) {
> +			if (huge_page_shift(hstate_vma(vma)) !=
> +			    range->page_shift &&
> +			    range->page_shift != PAGE_SHIFT)
> +				return -EINVAL;
> +		} else {
> +			if (range->page_shift != PAGE_SHIFT)
> +				return -EINVAL;
> +		}
> +
>  		if (!(vma->vm_flags & VM_READ)) {
>  			/*
>  			 * If vma do not allow read access, then assume that it
> @@ -952,6 +1049,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>  		mm_walk.hugetlb_entry = NULL;
>  		mm_walk.pmd_entry = hmm_vma_walk_pmd;
>  		mm_walk.pte_hole = hmm_vma_walk_hole;
> +		mm_walk.hugetlb_entry = hmm_vma_walk_hugetlb_entry;
>  
>  		do {
>  			ret = walk_page_range(start, end, &mm_walk);
> -- 
> 2.17.2
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-29  0:39           ` John Hubbard
@ 2019-03-28 16:57             ` Ira Weiny
  2019-03-29  1:00               ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 16:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 05:39:26PM -0700, John Hubbard wrote:
> On 3/28/19 2:21 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
> >> On 3/28/19 12:11 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> >>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> >>>>> From: Jérôme Glisse <jglisse@redhat.com>
> [...]
> >>>>> @@ -67,14 +78,9 @@ struct hmm {
> >>>>>   */
> >>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
> >>>>>  {
> >>>>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> >>>>> +	struct hmm *hmm = mm_get_hmm(mm);
> >>>>
> >>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
> >>>
> >>> The thing is that you want only one hmm struct per process and thus
> >>> if there is already one and it is not being destroy then you want to
> >>> reuse it.
> >>>
> >>> Also this is all internal to HMM code and so it should not confuse
> >>> anyone.
> >>>
> >>
> >> Well, it has repeatedly come up, and I'd claim that it is quite 
> >> counter-intuitive. So if there is an easy way to make this internal 
> >> HMM code clearer or better named, I would really love that to happen.
> >>
> >> And we shouldn't ever dismiss feedback based on "this is just internal
> >> xxx subsystem code, no need for it to be as clear as other parts of the
> >> kernel", right?
> > 
> > Yes but i have not seen any better alternative that present code. If
> > there is please submit patch.
> > 
> 
> Ira, do you have any patch you're working on, or a more detailed suggestion there?
> If not, then I might (later, as it's not urgent) propose a small cleanup patch 
> I had in mind for the hmm_register code. But I don't want to duplicate effort 
> if you're already thinking about it.

No I don't have anything.

I was just really digging into these this time around and I was about to
comment on the lack of "get's" for some "puts" when I realized that
"hmm_register" _was_ the get...

:-(

Ira

> 
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
> 
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 09/11] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem v2
  2019-03-25 14:40 ` [PATCH v2 09/11] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem v2 jglisse
@ 2019-03-28 18:04   ` Ira Weiny
  2019-03-29  2:17     ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 18:04 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams,
	John Hubbard, Arnd Bergmann

On Mon, Mar 25, 2019 at 10:40:09AM -0400, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> HMM mirror is a device driver helper to mirror ranges of virtual addresses.
> It means that the process jobs running on the device can access the same
> virtual addresses as the CPU threads of that process. This patch adds support
> for mirroring mappings of files that are on a DAX block device (ie ranges of
> virtual addresses that are mmaps of a file in a filesystem on a DAX block
> device). There is no reason not to support such a case when mirroring virtual
> addresses on a device.
> 
> Note that unlike the GUP code we do not take a page reference, hence when we
> back off we have nothing to undo.
> 
> Changes since v1:
>     - improved commit message
>     - squashed: Arnd Bergmann: fix unused variable warning in hmm_vma_walk_pud
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> ---
>  mm/hmm.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 111 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 64a33770813b..ce33151c6832 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -325,6 +325,7 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
>  
>  struct hmm_vma_walk {
>  	struct hmm_range	*range;
> +	struct dev_pagemap	*pgmap;
>  	unsigned long		last;
>  	bool			fault;
>  	bool			block;
> @@ -499,6 +500,15 @@ static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
>  				range->flags[HMM_PFN_VALID];
>  }
>  
> +static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud)
> +{
> +	if (!pud_present(pud))
> +		return 0;
> +	return pud_write(pud) ? range->flags[HMM_PFN_VALID] |
> +				range->flags[HMM_PFN_WRITE] :
> +				range->flags[HMM_PFN_VALID];
> +}
> +
>  static int hmm_vma_handle_pmd(struct mm_walk *walk,
>  			      unsigned long addr,
>  			      unsigned long end,
> @@ -520,8 +530,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
>  		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
>  
>  	pfn = pmd_pfn(pmd) + pte_index(addr);
> -	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
> +	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> +		if (pmd_devmap(pmd)) {
> +			hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
> +					      hmm_vma_walk->pgmap);
> +			if (unlikely(!hmm_vma_walk->pgmap))
> +				return -EBUSY;
> +		}
>  		pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
> +	}
> +	if (hmm_vma_walk->pgmap) {
> +		put_dev_pagemap(hmm_vma_walk->pgmap);
> +		hmm_vma_walk->pgmap = NULL;
> +	}
>  	hmm_vma_walk->last = end;
>  	return 0;
>  }
> @@ -608,10 +629,24 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  	if (fault || write_fault)
>  		goto fault;
>  
> +	if (pte_devmap(pte)) {
> +		hmm_vma_walk->pgmap = get_dev_pagemap(pte_pfn(pte),
> +					      hmm_vma_walk->pgmap);
> +		if (unlikely(!hmm_vma_walk->pgmap))
> +			return -EBUSY;
> +	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> +		*pfn = range->values[HMM_PFN_SPECIAL];
> +		return -EFAULT;
> +	}
> +
>  	*pfn = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;

	<tag>

>  	return 0;
>  
>  fault:
> +	if (hmm_vma_walk->pgmap) {
> +		put_dev_pagemap(hmm_vma_walk->pgmap);
> +		hmm_vma_walk->pgmap = NULL;
> +	}
>  	pte_unmap(ptep);
>  	/* Fault any virtual address we were asked to fault */
>  	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> @@ -699,12 +734,83 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  			return r;
>  		}
>  	}
> +	if (hmm_vma_walk->pgmap) {
> +		put_dev_pagemap(hmm_vma_walk->pgmap);
> +		hmm_vma_walk->pgmap = NULL;
> +	}


Why is this here and not in hmm_vma_handle_pte()?  Unless I'm just getting
tired, this is the corresponding put for the get taken in hmm_vma_handle_pte()
when it returns 0 at <tag> above.

Ira


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-29  1:50                   ` Jerome Glisse
@ 2019-03-28 18:21                     ` Ira Weiny
  2019-03-29  2:25                       ` Jerome Glisse
  2019-03-29  2:11                     ` John Hubbard
  1 sibling, 1 reply; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 18:21 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 09:50:03PM -0400, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 06:18:35PM -0700, John Hubbard wrote:
> > On 3/28/19 6:00 PM, Jerome Glisse wrote:
> > > On Thu, Mar 28, 2019 at 09:57:09AM -0700, Ira Weiny wrote:
> > >> On Thu, Mar 28, 2019 at 05:39:26PM -0700, John Hubbard wrote:
> > >>> On 3/28/19 2:21 PM, Jerome Glisse wrote:
> > >>>> On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
> > >>>>> On 3/28/19 12:11 PM, Jerome Glisse wrote:
> > >>>>>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> > >>>>>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> > >>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> > >>> [...]
> > >>>>>>>> @@ -67,14 +78,9 @@ struct hmm {
> > >>>>>>>>   */
> > >>>>>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
> > >>>>>>>>  {
> > >>>>>>>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> > >>>>>>>> +	struct hmm *hmm = mm_get_hmm(mm);
> > >>>>>>>
> > >>>>>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
> > >>>>>>
> > >>>>>> The thing is that you want only one hmm struct per process and thus
> > >>>>>> if there is already one and it is not being destroy then you want to
> > >>>>>> reuse it.
> > >>>>>>
> > >>>>>> Also this is all internal to HMM code and so it should not confuse
> > >>>>>> anyone.
> > >>>>>>
> > >>>>>
> > >>>>> Well, it has repeatedly come up, and I'd claim that it is quite 
> > >>>>> counter-intuitive. So if there is an easy way to make this internal 
> > >>>>> HMM code clearer or better named, I would really love that to happen.
> > >>>>>
> > >>>>> And we shouldn't ever dismiss feedback based on "this is just internal
> > >>>>> xxx subsystem code, no need for it to be as clear as other parts of the
> > >>>>> kernel", right?
> > >>>>
> > >>>> Yes but i have not seen any better alternative that present code. If
> > >>>> there is please submit patch.
> > >>>>
> > >>>
> > >>> Ira, do you have any patch you're working on, or a more detailed suggestion there?
> > >>> If not, then I might (later, as it's not urgent) propose a small cleanup patch 
> > >>> I had in mind for the hmm_register code. But I don't want to duplicate effort 
> > >>> if you're already thinking about it.
> > >>
> > >> No I don't have anything.
> > >>
> > >> I was just really digging into these this time around and I was about to
> > >> comment on the lack of "get's" for some "puts" when I realized that
> > >> "hmm_register" _was_ the get...
> > >>
> > >> :-(
> > >>
> > > 
> > > The get is mm_get_hmm(), where you get a reference on HMM from a mm struct.
> > > John in a previous posting complained about me naming that function hmm_get()
> > > and thus in this version I renamed it to mm_get_hmm(), as we are getting
> > > a reference on hmm from a mm struct.
> > 
> > Well, that's not what I recommended, though. The actual conversation went like
> > this [1]:
> > 
> > ---------------------------------------------------------------
> > >> So for this, hmm_get() really ought to be symmetric with
> > >> hmm_put(), by taking a struct hmm*. And the null check is
> > >> not helping here, so let's just go with this smaller version:
> > >>
> > >> static inline struct hmm *hmm_get(struct hmm *hmm)
> > >> {
> > >>     if (kref_get_unless_zero(&hmm->kref))
> > >>         return hmm;
> > >>
> > >>     return NULL;
> > >> }
> > >>
> > >> ...and change the few callers accordingly.
> > >>
> > >
> > > What about renaning hmm_get() to mm_get_hmm() instead ?
> > >
> > 
> > For a get/put pair of functions, it would be ideal to pass
> > the same argument type to each. It looks like we are passing
> > around hmm*, and hmm retains a reference count on hmm->mm,
> > so I think you have a choice of using either mm* or hmm* as
> > the argument. I'm not sure that one is better than the other
> > here, as the lifetimes appear to be linked pretty tightly.
> > 
> > Whichever one is used, I think it would be best to use it
> > in both the _get() and _put() calls. 
> > ---------------------------------------------------------------
> > 
> > Your response was to change the name to mm_get_hmm(), but that's not
> > what I recommended.
> 
> Because I cannot do that: hmm_put() can _only_ take an hmm struct as
> input, while hmm_get() can _only_ take an mm struct as input.
> 
> hmm_put() can only take hmm because the hmm we are un-referencing
> might no longer be associated with any mm struct, and thus I do not
> have an mm struct to use.
> 
> hmm_get() can only take mm as input, as we need to be careful when
> accessing the hmm field within the mm struct, and thus it is better
> to have that code within a function than open coded and duplicated
> all over the place.

The input value is not the problem.  The problem is in the naming.

obj = get_obj( various parameters );
put_obj(obj);


The problem is that the function named hmm_register() either "gets" a
reference to _or_ creates and gets a reference to the hmm object.

What John is probably ready to submit is something like:

struct hmm *get_create_hmm(struct mm *mm);
void put_hmm(struct hmm *hmm);


So when you are reading the code you see...

foo(...) {
	struct hmm *hmm = get_create_hmm(mm);

	if (!hmm)
		error...

	do stuff...

	put_hmm(hmm);
}

Here I can see a very clear get/put pair.  The name also shows that the hmm is
created if need be as well as getting a reference.

Ira

> 
> > 
> > > 
> > > The hmm_put() is just releasing the reference on the hmm struct.
> > > 
> > > Here i feel i am getting contradicting requirement from different people.
> > > I don't think there is a way to please everyone here.
> > > 
> > 
> > That's not a true conflict: you're comparing your actual implementation
> > to Ira's request, rather than comparing my request to Ira's request.
> > 
> > I think there's a way forward. Ira and I are actually both asking for the
> > same thing:
> > 
> > a) clear, concise get/put routines
> > 
> > b) avoiding odd side effects in functions that have one name, but do
> > additional surprising things.
> 
> Please show me code because I do not see any other way to do it than
> how I did.
> 
> Cheers,
> Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 23:34                       ` John Hubbard
@ 2019-03-28 18:44                         ` Ira Weiny
  0 siblings, 0 replies; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 18:44 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 04:34:04PM -0700, John Hubbard wrote:
> On 3/28/19 4:24 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 04:20:37PM -0700, John Hubbard wrote:
> >> On 3/28/19 4:05 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 03:43:33PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 3:40 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
> >>>>>> On 3/28/19 3:08 PM, Jerome Glisse wrote:
> >>>>>>> On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
> >>>>>>>> On 3/28/19 2:30 PM, Jerome Glisse wrote:
> >>>>>>>>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> >>>>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> >>>>>> [...]
> >>>>>> OK, so let's either drop this patch, or if merge windows won't allow that,
> >>>>>> then *eventually* drop this patch. And instead, put in a hmm_sanity_check()
> >>>>>> that does the same checks.
> >>>>>
> >>>>> RDMA depends on this, and so does the nouveau patchset that converts to the
> >>>>> new API. So I do not see a reason to drop this. There are users for this,
> >>>>> they are posted, and I hope I explained the benefit properly.
> >>>>>
> >>>>> It is a common pattern. Yes, it only saves a couple of lines of code, but down
> >>>>> the road it will also help the people working on the mmap_sem patchset.
> >>>>>
> >>>>
> >>>> It *adds* a couple of lines that are misleading, because they look like they
> >>>> make things safer, but they don't actually do so.
> >>>
> >>> It is not about safety, sorry if it confused you, but there is nothing about
> >>> safety here; I can add a big fat comment that explains that there is no safety
> >>> here. The intention is to allow a page fault handler that potentially has
> >>> hundreds of page faults queued up to abort as soon as it sees that it is
> >>> pointless to keep faulting on a dying process.
> >>>
> >>> Again, if we race it is _fine_, nothing bad will happen; we are just doing
> >>> useless work that is going to be thrown on the floor, and we are just slowing
> >>> down the process teardown.
> >>>
> >>
> >> In addition to a comment, how about naming this thing to indicate the above 
> >> intention?  I have a really hard time with this odd down_read() wrapper, which
> >> allows code to proceed without really getting a lock. It's just too wrong-looking.
> >> If it were instead named:
> >>
> >> 	hmm_is_exiting()
> > 
> > What about: hmm_lock_mmap_if_alive() ?
> > 
> 
> That's definitely better, but I want to vote for just doing a check, not 
> taking any locks.
> 
> I'm not super concerned about the exact name, but I really want a routine that
> just checks (and optionally asserts, via WARN or BUG), and that's *all*. Then
> drivers can scatter that around like pixie dust as they see fit. Maybe right before
> taking a lock, maybe in other places. Decoupled from locking.

I agree.  Names matter and any function which is called *_down_read and could
potentially not take the lock should be called try_*_down_read.  Furthermore
users should be checking the return values from any try_*.

It is also odd that we are calling "down/up" on something which is not a
semaphore.  So the user here needs to _know_ that they are really getting the
lock on the mm which sits behind the scenes.  What John is proposing is more
explicit when reading driver code.
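
For illustration, the check-only helper being described could be as small as
this (the name is made up here, it is not something posted in the series; it
relies on the mirror holding a reference on hmm, so only the mm can go away):

#include <linux/hmm.h>

static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
{
	/* The mm may already be torn down; hmm itself is pinned by the mirror. */
	struct mm_struct *mm = READ_ONCE(mirror->hmm->mm);

	return mm && !mirror->hmm->dead;
}

Drivers would then keep calling plain down_read(&mm->mmap_sem) themselves and
sprinkle this check wherever aborting early on a dying process is worth it.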

Ira

> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2
  2019-03-29  0:56     ` Jerome Glisse
@ 2019-03-28 18:49       ` Ira Weiny
  0 siblings, 0 replies; 69+ messages in thread
From: Ira Weiny @ 2019-03-28 18:49 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Andrew Morton, John Hubbard,
	Dan Williams, Dan Carpenter, Matthew Wilcox

On Thu, Mar 28, 2019 at 08:56:54PM -0400, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 09:12:21AM -0700, Ira Weiny wrote:
> > On Mon, Mar 25, 2019 at 10:40:06AM -0400, Jerome Glisse wrote:
> > > From: Jérôme Glisse <jglisse@redhat.com>
> > > 

[snip]

> > > +/*
> > > + * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> > > + *
> > > + * When waiting for mmu notifiers we need some kind of time out, otherwise we
> > > + * could potentially wait forever; 1000ms ie 1s already sounds like a long time
> > > + * to wait.
> > > + */
> > > +#define HMM_RANGE_DEFAULT_TIMEOUT 1000
> > > +
> > >  /* This is a temporary helper to avoid merge conflict between trees. */
> > > +static inline bool hmm_vma_range_done(struct hmm_range *range)
> > > +{
> > > +	bool ret = hmm_range_valid(range);
> > > +
> > > +	hmm_range_unregister(range);
> > > +	return ret;
> > > +}
> > > +
> > >  static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> > >  {
> > > -	long ret = hmm_range_fault(range, block);
> > > -	if (ret == -EBUSY)
> > > -		ret = -EAGAIN;
> > > -	else if (ret == -EAGAIN)
> > > -		ret = -EBUSY;
> > > -	return ret < 0 ? ret : 0;
> > > +	long ret;
> > > +
> > > +	ret = hmm_range_register(range, range->vma->vm_mm,
> > > +				 range->start, range->end);
> > > +	if (ret)
> > > +		return (int)ret;
> > > +
> > > +	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
> > > +		up_read(&range->vma->vm_mm->mmap_sem);
> > > +		return -EAGAIN;
> > > +	}
> > > +
> > > +	ret = hmm_range_fault(range, block);
> > > +	if (ret <= 0) {
> > > +		if (ret == -EBUSY || !ret) {
> > > +			up_read(&range->vma->vm_mm->mmap_sem);
> > > +			ret = -EBUSY;
> > > +		} else if (ret == -EAGAIN)
> > > +			ret = -EBUSY;
> > > +		hmm_range_unregister(range);
> > > +		return ret;
> > > +	}
> > > +	return 0;
> > 
> > Is hmm_vma_fault() also temporary to keep the nouveau driver working?  It looks
> > like it to me.
> > 
> > This and hmm_vma_range_done() above are part of the old interface which is in
> > the Documentation correct?  As stated above we should probably change that
> > documentation with this patch to ensure no new users of these 2 functions
> > appear.
> 
> Ok, will update the documentation; note that I already posted patches that use
> this new API, see the ODP RDMA link in the cover letter.
> 

Thanks,  Sorry for my previous email on this patch.  After looking more I see
that this is the old interface but this was not clear.  And I have not had time
to follow the previous threads.  I'm finding time to do this now...

Sorry,
Ira


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-28 11:07   ` Ira Weiny
@ 2019-03-28 19:11     ` Jerome Glisse
  2019-03-28 20:43       ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 19:11 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-mm, linux-kernel, John Hubbard, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > Every time i read the code to check that the HMM structure does not
> > vanish before it should thanks to the many lock protecting its removal
> > i get a headache. Switch to reference counting instead it is much
> > easier to follow and harder to break. This also remove some code that
> > is no longer needed with refcounting.
> > 
> > Changes since v1:
> >     - removed bunch of useless check (if API is use with bogus argument
> >       better to fail loudly so user fix their code)
> >     - s/hmm_get/mm_get_hmm/
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  include/linux/hmm.h |   2 +
> >  mm/hmm.c            | 170 ++++++++++++++++++++++++++++----------------
> >  2 files changed, 112 insertions(+), 60 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index ad50b7b4f141..716fc61fa6d4 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -131,6 +131,7 @@ enum hmm_pfn_value_e {
> >  /*
> >   * struct hmm_range - track invalidation lock on virtual address range
> >   *
> > + * @hmm: the core HMM structure this range is active against
> >   * @vma: the vm area struct for the range
> >   * @list: all range lock are on a list
> >   * @start: range virtual start address (inclusive)
> > @@ -142,6 +143,7 @@ enum hmm_pfn_value_e {
> >   * @valid: pfns array did not change since it has been fill by an HMM function
> >   */
> >  struct hmm_range {
> > +	struct hmm		*hmm;
> >  	struct vm_area_struct	*vma;
> >  	struct list_head	list;
> >  	unsigned long		start;
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index fe1cd87e49ac..306e57f7cded 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -50,6 +50,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
> >   */
> >  struct hmm {
> >  	struct mm_struct	*mm;
> > +	struct kref		kref;
> >  	spinlock_t		lock;
> >  	struct list_head	ranges;
> >  	struct list_head	mirrors;
> > @@ -57,6 +58,16 @@ struct hmm {
> >  	struct rw_semaphore	mirrors_sem;
> >  };
> >  
> > +static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
> > +{
> > +	struct hmm *hmm = READ_ONCE(mm->hmm);
> > +
> > +	if (hmm && kref_get_unless_zero(&hmm->kref))
> > +		return hmm;
> > +
> > +	return NULL;
> > +}
> > +
> >  /*
> >   * hmm_register - register HMM against an mm (HMM internal)
> >   *
> > @@ -67,14 +78,9 @@ struct hmm {
> >   */
> >  static struct hmm *hmm_register(struct mm_struct *mm)
> >  {
> > -	struct hmm *hmm = READ_ONCE(mm->hmm);
> > +	struct hmm *hmm = mm_get_hmm(mm);
> 
> FWIW: having hmm_register == "hmm get" is a bit confusing...

The thing is that you want only one hmm struct per process and thus
if there is already one and it is not being destroy then you want to
reuse it.

Also this is all internal to HMM code and so it should not confuse
anyone.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM
  2019-03-25 14:40 ` [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM jglisse
@ 2019-03-28 20:33   ` John Hubbard
  2019-03-29 21:15     ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 20:33 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: linux-kernel, Ralph Campbell, Andrew Morton, Dan Williams

On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> To avoid random config build issue, select mmu notifier when HMM is
> selected. In any case when HMM gets selected it will be by users that
> will also want the mmu notifier.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Acked-by: Balbir Singh <bsingharora@gmail.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>  mm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 25c71eb8a7db..0d2944278d80 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -694,6 +694,7 @@ config DEV_PAGEMAP_OPS
>  
>  config HMM
>  	bool
> +	select MMU_NOTIFIER
>  	select MIGRATE_VMA_HELPER
>  
>  config HMM_MIRROR
> 

Yes, this is a good move, given that MMU notifiers are completely,
indispensably part of the HMM design and implementation.

The alternative would also work, but it's not quite as good. I'm
listing it in order to forestall any debate: 

  config HMM
  	bool
 +	depends on MMU_NOTIFIER
  	select MIGRATE_VMA_HELPER

...and "depends on" versus "select" is always a subtle question. But in
this case, I'd say that if someone wants HMM, there's no advantage in
making them know that they must first ensure MMU_NOTIFIER is enabled.
After poking around a bit I don't see any obvious downsides either.

However, given that you're making this change, in order to avoid odd
redundancy, you should also do this:

diff --git a/mm/Kconfig b/mm/Kconfig
index 0d2944278d80..2e6d24d783f7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -700,7 +700,6 @@ config HMM
 config HMM_MIRROR
        bool "HMM mirror CPU page table into a device page table"
        depends on ARCH_HAS_HMM
-       select MMU_NOTIFIER
        select HMM
        help
          Select HMM_MIRROR if you want to mirror range of the CPU page table of a


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-28 19:11     ` Jerome Glisse
@ 2019-03-28 20:43       ` John Hubbard
  2019-03-28 21:21         ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 20:43 UTC (permalink / raw)
  To: Jerome Glisse, Ira Weiny
  Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 12:11 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>
>>> Every time i read the code to check that the HMM structure does not
>>> vanish before it should thanks to the many lock protecting its removal
>>> i get a headache. Switch to reference counting instead it is much
>>> easier to follow and harder to break. This also remove some code that
>>> is no longer needed with refcounting.
>>>
>>> Changes since v1:
>>>     - removed bunch of useless check (if API is use with bogus argument
>>>       better to fail loudly so user fix their code)
>>>     - s/hmm_get/mm_get_hmm/
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>>  include/linux/hmm.h |   2 +
>>>  mm/hmm.c            | 170 ++++++++++++++++++++++++++++----------------
>>>  2 files changed, 112 insertions(+), 60 deletions(-)
>>>
>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>> index ad50b7b4f141..716fc61fa6d4 100644
>>> --- a/include/linux/hmm.h
>>> +++ b/include/linux/hmm.h
>>> @@ -131,6 +131,7 @@ enum hmm_pfn_value_e {
>>>  /*
>>>   * struct hmm_range - track invalidation lock on virtual address range
>>>   *
>>> + * @hmm: the core HMM structure this range is active against
>>>   * @vma: the vm area struct for the range
>>>   * @list: all range lock are on a list
>>>   * @start: range virtual start address (inclusive)
>>> @@ -142,6 +143,7 @@ enum hmm_pfn_value_e {
>>>   * @valid: pfns array did not change since it has been fill by an HMM function
>>>   */
>>>  struct hmm_range {
>>> +	struct hmm		*hmm;
>>>  	struct vm_area_struct	*vma;
>>>  	struct list_head	list;
>>>  	unsigned long		start;
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index fe1cd87e49ac..306e57f7cded 100644
>>> --- a/mm/hmm.c
>>> +++ b/mm/hmm.c
>>> @@ -50,6 +50,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>>>   */
>>>  struct hmm {
>>>  	struct mm_struct	*mm;
>>> +	struct kref		kref;
>>>  	spinlock_t		lock;
>>>  	struct list_head	ranges;
>>>  	struct list_head	mirrors;
>>> @@ -57,6 +58,16 @@ struct hmm {
>>>  	struct rw_semaphore	mirrors_sem;
>>>  };
>>>  
>>> +static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>>> +{
>>> +	struct hmm *hmm = READ_ONCE(mm->hmm);
>>> +
>>> +	if (hmm && kref_get_unless_zero(&hmm->kref))
>>> +		return hmm;
>>> +
>>> +	return NULL;
>>> +}
>>> +
>>>  /*
>>>   * hmm_register - register HMM against an mm (HMM internal)
>>>   *
>>> @@ -67,14 +78,9 @@ struct hmm {
>>>   */
>>>  static struct hmm *hmm_register(struct mm_struct *mm)
>>>  {
>>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
>>> +	struct hmm *hmm = mm_get_hmm(mm);
>>
>> FWIW: having hmm_register == "hmm get" is a bit confusing...
> 
> The thing is that you want only one hmm struct per process and thus
> if there is already one and it is not being destroy then you want to
> reuse it.
> 
> Also this is all internal to HMM code and so it should not confuse
> anyone.
> 

Well, it has repeatedly come up, and I'd claim that it is quite 
counter-intuitive. So if there is an easy way to make this internal 
HMM code clearer or better named, I would really love that to happen.

And we shouldn't ever dismiss feedback based on "this is just internal
xxx subsystem code, no need for it to be as clear as other parts of the
kernel", right?

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-25 14:40 ` [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2 jglisse
@ 2019-03-28 20:54   ` John Hubbard
  2019-03-28 21:30     ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 20:54 UTC (permalink / raw)
  To: jglisse, linux-mm; +Cc: linux-kernel, Andrew Morton, Dan Williams

On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> The device driver context which holds a reference to the mirror and thus to
> the core hmm struct might outlive the mm against which it was created. To
> avoid every driver having to check for that case, provide a helper that
> checks if the mm is still alive and takes the mmap_sem in read mode if so.
> If the mm has been destroyed (the mmu_notifier release callback did happen)
> then we return -EINVAL so that calling code knows that it is trying to do
> something against an mm that is no longer valid.
> 
> Changes since v1:
>     - removed bunch of useless check (if API is use with bogus argument
>       better to fail loudly so user fix their code)
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index f3b919b04eda..5f9deaeb9d77 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -438,6 +438,50 @@ struct hmm_mirror {
>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
>  void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  
> +/*
> + * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
> + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> + * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
> + *
> + * The device driver context which holds a reference to the mirror and thus to
> + * the core hmm struct might outlive the mm against which it was created. To
> + * avoid every driver having to check for that case, provide a helper that
> + * checks if the mm is still alive and takes the mmap_sem in read mode if so.
> + * If the mm has been destroyed (the mmu_notifier release callback did happen)
> + * then we return -EINVAL so that calling code knows that it is trying to do
> + * something against an mm that is no longer valid.
> + */
> +static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)

Hi Jerome,

Let's please not do this. There are at least two problems here:

1. The hmm_mirror_mm_down_read() wrapper around down_read() requires a 
return value. This is counter to how locking is normally done: callers do
not normally have to check the return value of most locks (other than
trylocks). And sure enough, your own code below doesn't check the return value.
That is a pretty good illustration of why not to do this.

2. This is a weird place to randomly check for semi-unrelated state, such 
as "is HMM still alive". By that I mean, if you have to detect a problem
at down_read() time, then the problem could have existed both before and
after the call to this wrapper. So it is providing a false sense of security,
and it is therefore actually undesirable to add the code.

If you insist on having this wrapper, I think it should have approximately 
this form:

void hmm_mirror_mm_down_read(...)
{
	WARN_ON(...)
	down_read(...)
} 

> +{
> +	struct mm_struct *mm;
> +
> +	/* Sanity check ... */
> +	if (!mirror || !mirror->hmm)
> +		return -EINVAL;
> +	/*
> +	 * Before trying to take the mmap_sem make sure the mm is still
> +	 * alive as device driver context might outlive the mm lifetime.

Let's find another way, and a better place, to solve this problem.
Ref counting?

> +	 *
> +	 * FIXME: should we also check for mm that outlive its owning
> +	 * task ?
> +	 */
> +	mm = READ_ONCE(mirror->hmm->mm);
> +	if (mirror->hmm->dead || !mm)
> +		return -EINVAL;
> +
> +	down_read(&mm->mmap_sem);
> +	return 0;
> +}
> +
> +/*
> + * hmm_mirror_mm_up_read() - unlock the mmap_sem from read mode
> + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> + */
> +static inline void hmm_mirror_mm_up_read(struct hmm_mirror *mirror)
> +{
> +	up_read(&mirror->hmm->mm->mmap_sem);
> +}
> +
>  
>  /*
>   * To snapshot the CPU page table you first have to call hmm_range_register()
> @@ -463,7 +507,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>   *          if (ret)
>   *              return ret;
>   *
> - *          down_read(mm->mmap_sem);
> + *          hmm_mirror_mm_down_read(mirror);

See? The normal down_read() code never needs to check a return value, so when
someone does a "simple" upgrade, it introduces a fatal bug here: if the wrapper
returns early, then the caller proceeds without having acquired the mmap_sem.

>   *      again:
>   *
>   *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
> @@ -476,13 +520,13 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>   *
>   *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
>   *          if (ret == -EAGAIN) {
> - *              down_read(mm->mmap_sem);
> + *              hmm_mirror_mm_down_read(mirror);

Same problem here.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-28 20:43       ` John Hubbard
@ 2019-03-28 21:21         ` Jerome Glisse
  2019-03-29  0:39           ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 21:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ira Weiny, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
> On 3/28/19 12:11 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> >> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> >>> From: Jérôme Glisse <jglisse@redhat.com>
> >>>
> >>> Every time i read the code to check that the HMM structure does not
> >>> vanish before it should thanks to the many lock protecting its removal
> >>> i get a headache. Switch to reference counting instead it is much
> >>> easier to follow and harder to break. This also remove some code that
> >>> is no longer needed with refcounting.
> >>>
> >>> Changes since v1:
> >>>     - removed bunch of useless check (if API is use with bogus argument
> >>>       better to fail loudly so user fix their code)
> >>>     - s/hmm_get/mm_get_hmm/
> >>>
> >>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> >>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> >>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> Cc: Dan Williams <dan.j.williams@intel.com>
> >>> ---
> >>>  include/linux/hmm.h |   2 +
> >>>  mm/hmm.c            | 170 ++++++++++++++++++++++++++++----------------
> >>>  2 files changed, 112 insertions(+), 60 deletions(-)
> >>>
> >>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> >>> index ad50b7b4f141..716fc61fa6d4 100644
> >>> --- a/include/linux/hmm.h
> >>> +++ b/include/linux/hmm.h
> >>> @@ -131,6 +131,7 @@ enum hmm_pfn_value_e {
> >>>  /*
> >>>   * struct hmm_range - track invalidation lock on virtual address range
> >>>   *
> >>> + * @hmm: the core HMM structure this range is active against
> >>>   * @vma: the vm area struct for the range
> >>>   * @list: all range lock are on a list
> >>>   * @start: range virtual start address (inclusive)
> >>> @@ -142,6 +143,7 @@ enum hmm_pfn_value_e {
> >>>   * @valid: pfns array did not change since it has been fill by an HMM function
> >>>   */
> >>>  struct hmm_range {
> >>> +	struct hmm		*hmm;
> >>>  	struct vm_area_struct	*vma;
> >>>  	struct list_head	list;
> >>>  	unsigned long		start;
> >>> diff --git a/mm/hmm.c b/mm/hmm.c
> >>> index fe1cd87e49ac..306e57f7cded 100644
> >>> --- a/mm/hmm.c
> >>> +++ b/mm/hmm.c
> >>> @@ -50,6 +50,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
> >>>   */
> >>>  struct hmm {
> >>>  	struct mm_struct	*mm;
> >>> +	struct kref		kref;
> >>>  	spinlock_t		lock;
> >>>  	struct list_head	ranges;
> >>>  	struct list_head	mirrors;
> >>> @@ -57,6 +58,16 @@ struct hmm {
> >>>  	struct rw_semaphore	mirrors_sem;
> >>>  };
> >>>  
> >>> +static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
> >>> +{
> >>> +	struct hmm *hmm = READ_ONCE(mm->hmm);
> >>> +
> >>> +	if (hmm && kref_get_unless_zero(&hmm->kref))
> >>> +		return hmm;
> >>> +
> >>> +	return NULL;
> >>> +}
> >>> +
> >>>  /*
> >>>   * hmm_register - register HMM against an mm (HMM internal)
> >>>   *
> >>> @@ -67,14 +78,9 @@ struct hmm {
> >>>   */
> >>>  static struct hmm *hmm_register(struct mm_struct *mm)
> >>>  {
> >>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> >>> +	struct hmm *hmm = mm_get_hmm(mm);
> >>
> >> FWIW: having hmm_register == "hmm get" is a bit confusing...
> > 
> > The thing is that you want only one hmm struct per process and thus
> > if there is already one and it is not being destroy then you want to
> > reuse it.
> > 
> > Also this is all internal to HMM code and so it should not confuse
> > anyone.
> > 
> 
> Well, it has repeatedly come up, and I'd claim that it is quite 
> counter-intuitive. So if there is an easy way to make this internal 
> HMM code clearer or better named, I would really love that to happen.
> 
> And we shouldn't ever dismiss feedback based on "this is just internal
> xxx subsystem code, no need for it to be as clear as other parts of the
> kernel", right?

Yes, but I have not seen any better alternative to the present code. If
there is one, please submit a patch.
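
To spell out the get-or-reuse pattern under discussion, here is a
stripped-down sketch of what hmm_register() boils down to (list and lock
initialization, and the locking around publishing mm->hmm, are omitted;
see the real patch for those):

static struct hmm *hmm_register(struct mm_struct *mm)
{
	/* Reuse the existing per-mm HMM struct if one is still alive ... */
	struct hmm *hmm = mm_get_hmm(mm);

	if (hmm)
		return hmm;

	/* ... otherwise allocate a new one and publish it on the mm. */
	hmm = kzalloc(sizeof(*hmm), GFP_KERNEL);
	if (!hmm)
		return NULL;
	kref_init(&hmm->kref);
	hmm->mm = mm;
	mm->hmm = hmm;
	return hmm;
}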

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 20:54   ` John Hubbard
@ 2019-03-28 21:30     ` Jerome Glisse
  2019-03-28 21:41       ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 21:30 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > The device driver context which holds reference to mirror and thus to
> > core hmm struct might outlive the mm against which it was created. To
> > avoid every driver to check for that case provide an helper that check
> > if mm is still alive and take the mmap_sem in read mode if so. If the
> > mm have been destroy (mmu_notifier release call back did happen) then
> > we return -EINVAL so that calling code knows that it is trying to do
> > something against a mm that is no longer valid.
> > 
> > Changes since v1:
> >     - removed bunch of useless check (if API is use with bogus argument
> >       better to fail loudly so user fix their code)
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 47 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index f3b919b04eda..5f9deaeb9d77 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -438,6 +438,50 @@ struct hmm_mirror {
> >  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
> >  void hmm_mirror_unregister(struct hmm_mirror *mirror);
> >  
> > +/*
> > + * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
> > + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> > + * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
> > + *
> > + * The device driver context which holds reference to mirror and thus to core
> > + * hmm struct might outlive the mm against which it was created. To avoid every
> > + * driver to check for that case provide an helper that check if mm is still
> > + * alive and take the mmap_sem in read mode if so. If the mm have been destroy
> > + * (mmu_notifier release call back did happen) then we return -EINVAL so that
> > + * calling code knows that it is trying to do something against a mm that is
> > + * no longer valid.
> > + */
> > +static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
> 
> Hi Jerome,
> 
> Let's please not do this. There are at least two problems here:
> 
> 1. The hmm_mirror_mm_down_read() wrapper around down_read() requires a 
> return value. This is counter to how locking is normally done: callers do
> not normally have to check the return value of most locks (other than
> trylocks). And sure enough, your own code below doesn't check the return value.
> That is a pretty good illustration of why not to do this.

Please read the function description: this is not about checking a lock
return value, it is about checking whether we are racing with process
destruction, and avoiding taking the lock in that case, so that the
driver can abort as quickly as possible when a process is being killed.

> 
> 2. This is a weird place to randomly check for semi-unrelated state, such 
> as "is HMM still alive". By that I mean, if you have to detect a problem
> at down_read() time, then the problem could have existed both before and
> after the call to this wrapper. So it is providing a false sense of security,
> and it is therefore actually undesirable to add the code.

It is not. This function is used in the device page fault handler, which
runs asynchronously from CPU events and from the process lifetime. When a
process is killed or is dying we want to avoid useless page fault work and
avoid blocking the page fault queue of the device. This function reports
to the caller that the process is dying so that it can just abort the
page fault and do whatever other device-specific thing needs to happen.
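
To make the intended call site concrete, here is a rough sketch of a
device page fault handler using the helper; the mydevice_* names are
made up for illustration, only the hmm_mirror_mm_*() calls come from
the patch:

static int mydevice_handle_fault(struct hmm_mirror *mirror,
				 unsigned long addr)
{
	int ret;

	/* Bail out early if the mm is already dead or dying. */
	ret = hmm_mirror_mm_down_read(mirror);
	if (ret)
		return ret;	/* abort the device fault, no mm work */

	/* hypothetical device-specific fault and mirroring work */
	ret = mydevice_fault_and_map(mirror, addr);

	hmm_mirror_mm_up_read(mirror);
	return ret;
}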

> 
> If you insist on having this wrapper, I think it should have approximately 
> this form:
> 
> void hmm_mirror_mm_down_read(...)
> {
> 	WARN_ON(...)
> 	down_read(...)
> } 

I do insist, as it is useful and used by both RDMA and nouveau, and the
above would kill the intent. The intent is to not try to take the lock
if the process is dying.


> 
> > +{
> > +	struct mm_struct *mm;
> > +
> > +	/* Sanity check ... */
> > +	if (!mirror || !mirror->hmm)
> > +		return -EINVAL;
> > +	/*
> > +	 * Before trying to take the mmap_sem make sure the mm is still
> > +	 * alive as device driver context might outlive the mm lifetime.
> 
> Let's find another way, and a better place, to solve this problem.
> Ref counting?

This has nothing to do with refcounting or use after free or anything
like that. It is just about checking whether we are about to do
something pointless. If the process is dying then it is pointless
to try to take the lock and it is pointless for the device driver
to trigger handle_mm_fault().

> 
> > +	 *
> > +	 * FIXME: should we also check for mm that outlive its owning
> > +	 * task ?
> > +	 */
> > +	mm = READ_ONCE(mirror->hmm->mm);
> > +	if (mirror->hmm->dead || !mm)
> > +		return -EINVAL;
> > +
> > +	down_read(&mm->mmap_sem);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * hmm_mirror_mm_up_read() - unlock the mmap_sem from read mode
> > + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> > + */
> > +static inline void hmm_mirror_mm_up_read(struct hmm_mirror *mirror)
> > +{
> > +	up_read(&mirror->hmm->mm->mmap_sem);
> > +}
> > +
> >  
> >  /*
> >   * To snapshot the CPU page table you first have to call hmm_range_register()
> > @@ -463,7 +507,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
> >   *          if (ret)
> >   *              return ret;
> >   *
> > - *          down_read(mm->mmap_sem);
> > + *          hmm_mirror_mm_down_read(mirror);
> 
> See? The normal down_read() code never needs to check a return value, so when
> someone does a "simple" upgrade, it introduces a fatal bug here: if the wrapper
> returns early, then the caller proceeds without having acquired the mmap_sem.

That conversion is useless; I can't remember why I did it.

> >   *      again:
> >   *
> >   *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
> > @@ -476,13 +520,13 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
> >   *
> >   *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
> >   *          if (ret == -EAGAIN) {
> > - *              down_read(mm->mmap_sem);
> > + *              hmm_mirror_mm_down_read(mirror);
> 
> Same problem here.

Again useless; I can't remember why I did that one. This helper is
intended to be used by drivers.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2
  2019-03-28 13:11   ` Ira Weiny
@ 2019-03-28 21:39     ` Jerome Glisse
  0 siblings, 0 replies; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 21:39 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-mm, linux-kernel, Andrew Morton, John Hubbard,
	Dan Williams, Dan Carpenter, Matthew Wilcox

On Thu, Mar 28, 2019 at 06:11:54AM -0700, Ira Weiny wrote:
> On Mon, Mar 25, 2019 at 10:40:06AM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > A common use case for HMM mirror is user trying to mirror a range
> > and before they could program the hardware it get invalidated by
> > some core mm event. Instead of having user re-try right away to
> > mirror the range provide a completion mechanism for them to wait
> > for any active invalidation affecting the range.
> > 
> > This also changes how hmm_range_snapshot() and hmm_range_fault()
> > works by not relying on vma so that we can drop the mmap_sem
> > when waiting and lookup the vma again on retry.
> > 
> > Changes since v1:
> >     - squashed: Dan Carpenter: potential deadlock in nonblocking code
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Dan Carpenter <dan.carpenter@oracle.com>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > ---
> >  include/linux/hmm.h | 208 ++++++++++++++---
> >  mm/hmm.c            | 528 +++++++++++++++++++++-----------------------
> >  2 files changed, 428 insertions(+), 308 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index e9afd23c2eac..79671036cb5f 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -77,8 +77,34 @@
> >  #include <linux/migrate.h>
> >  #include <linux/memremap.h>
> >  #include <linux/completion.h>
> > +#include <linux/mmu_notifier.h>
> >  
> > -struct hmm;
> > +
> > +/*
> > + * struct hmm - HMM per mm struct
> > + *
> > + * @mm: mm struct this HMM struct is bound to
> > + * @lock: lock protecting ranges list
> > + * @ranges: list of range being snapshotted
> > + * @mirrors: list of mirrors for this mm
> > + * @mmu_notifier: mmu notifier to track updates to CPU page table
> > + * @mirrors_sem: read/write semaphore protecting the mirrors list
> > + * @wq: wait queue for user waiting on a range invalidation
> > + * @notifiers: count of active mmu notifiers
> > + * @dead: is the mm dead ?
> > + */
> > +struct hmm {
> > +	struct mm_struct	*mm;
> > +	struct kref		kref;
> > +	struct mutex		lock;
> > +	struct list_head	ranges;
> > +	struct list_head	mirrors;
> > +	struct mmu_notifier	mmu_notifier;
> > +	struct rw_semaphore	mirrors_sem;
> > +	wait_queue_head_t	wq;
> > +	long			notifiers;
> > +	bool			dead;
> > +};
> >  
> >  /*
> >   * hmm_pfn_flag_e - HMM flag enums
> > @@ -155,6 +181,38 @@ struct hmm_range {
> >  	bool			valid;
> >  };
> >  
> > +/*
> > + * hmm_range_wait_until_valid() - wait for range to be valid
> > + * @range: range affected by invalidation to wait on
> > + * @timeout: time out for wait in ms (ie abort wait after that period of time)
> > + * Returns: true if the range is valid, false otherwise.
> > + */
> > +static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> > +					      unsigned long timeout)
> > +{
> > +	/* Check if mm is dead ? */
> > +	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> > +		range->valid = false;
> > +		return false;
> > +	}
> > +	if (range->valid)
> > +		return true;
> > +	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> > +			   msecs_to_jiffies(timeout));
> > +	/* Return current valid status just in case we get lucky */
> > +	return range->valid;
> > +}
> > +
> > +/*
> > + * hmm_range_valid() - test if a range is valid or not
> > + * @range: range
> > + * Returns: true if the range is valid, false otherwise.
> > + */
> > +static inline bool hmm_range_valid(struct hmm_range *range)
> > +{
> > +	return range->valid;
> > +}
> > +
> >  /*
> >   * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
> >   * @range: range use to decode HMM pfn value
> > @@ -357,51 +415,133 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
> >  
> >  
> >  /*
> > - * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
> > - * driver lock that serializes device page table updates, then call
> > - * hmm_vma_range_done(), to check if the snapshot is still valid. The same
> > - * device driver page table update lock must also be used in the
> > - * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
> > - * table invalidation serializes on it.
> > + * To snapshot the CPU page table you first have to call hmm_range_register()
> > + * to register the range. If hmm_range_register() return an error then some-
> > + * thing is horribly wrong and you should fail loudly. If it returned true then
> > + * you can wait for the range to be stable with hmm_range_wait_until_valid()
> > + * function, a range is valid when there are no concurrent changes to the CPU
> > + * page table for the range.
> > + *
> > + * Once the range is valid you can call hmm_range_snapshot() if that returns
> > + * without error then you can take your device page table lock (the same lock
> > + * you use in the HMM mirror sync_cpu_device_pagetables() callback). After
> > + * taking that lock you have to check the range validity, if it is still valid
> > + * (ie hmm_range_valid() returns true) then you can program the device page
> > + * table, otherwise you have to start again. Pseudo code:
> > + *
> > + *      mydevice_prefault(mydevice, mm, start, end)
> > + *      {
> > + *          struct hmm_range range;
> > + *          ...
> >   *
> > - * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
> > - * hmm_range_snapshot() WITHOUT ERROR !
> > + *          ret = hmm_range_register(&range, mm, start, end);
> > + *          if (ret)
> > + *              return ret;
> >   *
> > - * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
> > - */
> > -long hmm_range_snapshot(struct hmm_range *range);
> > -bool hmm_vma_range_done(struct hmm_range *range);
> > -
> > -
> > -/*
> > - * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
> > - * not migrate any device memory back to system memory. The HMM pfn array will
> > - * be updated with the fault result and current snapshot of the CPU page table
> > - * for the range.
> > + *          down_read(mm->mmap_sem);
> > + *      again:
> > + *
> > + *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
> > + *              up_read(&mm->mmap_sem);
> > + *              hmm_range_unregister(range);
> > + *              // Handle time out, either sleep or retry or something else
> > + *              ...
> > + *              return -ESOMETHING; || goto again;
> > + *          }
> > + *
> > + *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
> > + *          if (ret == -EAGAIN) {
> > + *              down_read(mm->mmap_sem);
> > + *              goto again;
> > + *          } else if (ret == -EBUSY) {
> > + *              goto again;
> > + *          }
> > + *
> > + *          up_read(&mm->mmap_sem);
> > + *          if (ret) {
> > + *              hmm_range_unregister(range);
> > + *              return ret;
> > + *          }
> > + *
> > + *          // It might not have snap-shoted the whole range but only the first
> > + *          // npages, the return values is the number of valid pages from the
> > + *          // start of the range.
> > + *          npages = ret;
> >   *
> > - * The mmap_sem must be taken in read mode before entering and it might be
> > - * dropped by the function if the block argument is false. In that case, the
> > - * function returns -EAGAIN.
> > + *          ...
> >   *
> > - * Return value does not reflect if the fault was successful for every single
> > - * address or not. Therefore, the caller must to inspect the HMM pfn array to
> > - * determine fault status for each address.
> > + *          mydevice_page_table_lock(mydevice);
> > + *          if (!hmm_range_valid(range)) {
> > + *              mydevice_page_table_unlock(mydevice);
> > + *              goto again;
> > + *          }
> >   *
> > - * Trying to fault inside an invalid vma will result in -EINVAL.
> > + *          mydevice_populate_page_table(mydevice, range, npages);
> > + *          ...
> > + *          mydevice_take_page_table_unlock(mydevice);
> > + *          hmm_range_unregister(range);
> >   *
> > - * See the function description in mm/hmm.c for further documentation.
> > + *          return 0;
> > + *      }
> > + *
> > + * The same scheme apply to hmm_range_fault() (ie replace hmm_range_snapshot()
> > + * with hmm_range_fault() in above pseudo code).
> > + *
> > + * YOU MUST CALL hmm_range_unregister() ONCE AND ONLY ONCE EACH TIME YOU CALL
> > + * hmm_range_register() AND hmm_range_register() RETURNED TRUE ! IF YOU DO NOT
> > + * FOLLOW THIS RULE MEMORY CORRUPTION WILL ENSUE !
> >   */
> > +int hmm_range_register(struct hmm_range *range,
> > +		       struct mm_struct *mm,
> > +		       unsigned long start,
> > +		       unsigned long end);
> > +void hmm_range_unregister(struct hmm_range *range);
> > +long hmm_range_snapshot(struct hmm_range *range);
> >  long hmm_range_fault(struct hmm_range *range, bool block);
> >  
> > +/*
> > + * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> > + *
> > + * When waiting for mmu notifiers we need some kind of time out otherwise we
> > + * could potentialy wait for ever, 1000ms ie 1s sounds like a long time to
> > + * wait already.
> > + */
> > +#define HMM_RANGE_DEFAULT_TIMEOUT 1000
> > +
> >  /* This is a temporary helper to avoid merge conflict between trees. */
> > +static inline bool hmm_vma_range_done(struct hmm_range *range)
> > +{
> > +	bool ret = hmm_range_valid(range);
> > +
> > +	hmm_range_unregister(range);
> > +	return ret;
> > +}
> > +
> >  static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> >  {
> > -	long ret = hmm_range_fault(range, block);
> > -	if (ret == -EBUSY)
> > -		ret = -EAGAIN;
> > -	else if (ret == -EAGAIN)
> > -		ret = -EBUSY;
> > -	return ret < 0 ? ret : 0;
> > +	long ret;
> > +
> > +	ret = hmm_range_register(range, range->vma->vm_mm,
> > +				 range->start, range->end);
> > +	if (ret)
> > +		return (int)ret;
> > +
> > +	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
> > +		up_read(&range->vma->vm_mm->mmap_sem);
> 
> Where is the down_read() which correspond to this?
> 
> > +		return -EAGAIN;
> > +	}
> > +
> > +	ret = hmm_range_fault(range, block);
> > +	if (ret <= 0) {
> > +		if (ret == -EBUSY || !ret) {
> > +			up_read(&range->vma->vm_mm->mmap_sem);
> 
> Or this...?
> 
> > +			ret = -EBUSY;
> > +		} else if (ret == -EAGAIN)
> > +			ret = -EBUSY;
> > +		hmm_range_unregister(range);
> > +		return ret;
> > +	}
> 
> And is the side effect of this call that the mmap_sem has been taken?
> Or is the side effect that the mmap_sem was released?
> 
> I'm not saying this is wrong, but I can't tell so it seems like a comment on
> the function would help.
> 

The comments above:

/* This is a temporary helper to avoid merge conflict between trees. */

do matter: here I am updating the HMM API, so the old API, hmm_vma_fault(),
no longer exists; it is replaced internally by the new hmm_range_register()
and hmm_range_fault() API.

To avoid complex cross-tree merging I introduce this inline function
that reproduces the old API on top of the new API. The intent is that
once this is merged, then in the next merge window I will drop the old
API from the upstream users (ie update the nouveau driver) and convert
them to the new API. Then I will remove this inline function.

This was discussed extensively already. Cross-tree synchronization is
painful and it is best to do this dance: upstream something in release N,
update every user in each independent tree in N+1, and clean up, if
anything is left, in N+2.


If you want to check that the above code is valid then you need to
look at the comment above the old API function, which describes the
old API, and also probably look at the code to understand all the cases.

Then you can compare that against what this _temporary_ wrapper does with
the new API and verify that it does match the old API.


I have tested all this with nouveau, which is upstream, and nouveau
keeps working unmodified before and after this patch.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 21:30     ` Jerome Glisse
@ 2019-03-28 21:41       ` John Hubbard
  2019-03-28 22:08         ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 21:41 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 2:30 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>
>>> The device driver context which holds reference to mirror and thus to
>>> core hmm struct might outlive the mm against which it was created. To
>>> avoid every driver to check for that case provide an helper that check
>>> if mm is still alive and take the mmap_sem in read mode if so. If the
>>> mm have been destroy (mmu_notifier release call back did happen) then
>>> we return -EINVAL so that calling code knows that it is trying to do
>>> something against a mm that is no longer valid.
>>>
>>> Changes since v1:
>>>     - removed bunch of useless check (if API is use with bogus argument
>>>       better to fail loudly so user fix their code)
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>>  include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
>>>  1 file changed, 47 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>> index f3b919b04eda..5f9deaeb9d77 100644
>>> --- a/include/linux/hmm.h
>>> +++ b/include/linux/hmm.h
>>> @@ -438,6 +438,50 @@ struct hmm_mirror {
>>>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
>>>  void hmm_mirror_unregister(struct hmm_mirror *mirror);
>>>  
>>> +/*
>>> + * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
>>> + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
>>> + * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
>>> + *
>>> + * The device driver context which holds reference to mirror and thus to core
>>> + * hmm struct might outlive the mm against which it was created. To avoid every
>>> + * driver to check for that case provide an helper that check if mm is still
>>> + * alive and take the mmap_sem in read mode if so. If the mm have been destroy
>>> + * (mmu_notifier release call back did happen) then we return -EINVAL so that
>>> + * calling code knows that it is trying to do something against a mm that is
>>> + * no longer valid.
>>> + */
>>> +static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
>>
>> Hi Jerome,
>>
>> Let's please not do this. There are at least two problems here:
>>
>> 1. The hmm_mirror_mm_down_read() wrapper around down_read() requires a 
>> return value. This is counter to how locking is normally done: callers do
>> not normally have to check the return value of most locks (other than
>> trylocks). And sure enough, your own code below doesn't check the return value.
>> That is a pretty good illustration of why not to do this.
> 
> Please read the function description this is not about checking lock
> return value it is about checking wether we are racing with process
> destruction and avoid trying to take lock in such cases so that driver
> do abort as quickly as possible when a process is being kill.
> 
>>
>> 2. This is a weird place to randomly check for semi-unrelated state, such 
>> as "is HMM still alive". By that I mean, if you have to detect a problem
>> at down_read() time, then the problem could have existed both before and
>> after the call to this wrapper. So it is providing a false sense of security,
>> and it is therefore actually undesirable to add the code.
> 
> It is not, this function is use in device page fault handler which will
> happens asynchronously from CPU event or process lifetime when a process
> is killed or is dying we do want to avoid useless page fault work and
> we do want to avoid blocking the page fault queue of the device. This
> function reports to the caller that the process is dying and that it
> should just abort the page fault and do whatever other device specific
> thing that needs to happen.
> 

But it's inherently racy, to check for a condition outside of any lock, so again,
it's a false sense of security.

>>
>> If you insist on having this wrapper, I think it should have approximately 
>> this form:
>>
>> void hmm_mirror_mm_down_read(...)
>> {
>> 	WARN_ON(...)
>> 	down_read(...)
>> } 
> 
> I do insist as it is useful and use by both RDMA and nouveau and the
> above would kill the intent. The intent is do not try to take the lock
> if the process is dying.

Could you provide me a link to those examples so I can take a peek? I
am still convinced that this whole thing is a race condition at best.

> 
> 
>>
>>> +{
>>> +	struct mm_struct *mm;
>>> +
>>> +	/* Sanity check ... */
>>> +	if (!mirror || !mirror->hmm)
>>> +		return -EINVAL;
>>> +	/*
>>> +	 * Before trying to take the mmap_sem make sure the mm is still
>>> +	 * alive as device driver context might outlive the mm lifetime.
>>
>> Let's find another way, and a better place, to solve this problem.
>> Ref counting?
> 
> This has nothing to do with refcount or use after free or anthing
> like that. It is just about checking wether we are about to do
> something pointless. If the process is dying then it is pointless
> to try to take the lock and it is pointless for the device driver
> to trigger handle_mm_fault().

Well, what happens if you let such pointless code run anyway? 
Does everything still work? If yes, then we don't need this change.
If no, then we need a race-free version of this change.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-25 14:40 ` [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays jglisse
@ 2019-03-28 21:59   ` John Hubbard
  2019-03-28 22:12     ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 21:59 UTC (permalink / raw)
  To: jglisse, linux-mm; +Cc: linux-kernel, Andrew Morton, Dan Williams

On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> The HMM mirror API can be use in two fashions. The first one where the HMM
> user coalesce multiple page faults into one request and set flags per pfns
> for of those faults. The second one where the HMM user want to pre-fault a
> range with specific flags. For the latter one it is a waste to have the user
> pre-fill the pfn arrays with a default flags value.
> 
> This patch adds a default flags value allowing user to set them for a range
> without having to pre-fill the pfn array.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/hmm.h |  7 +++++++
>  mm/hmm.c            | 12 ++++++++++++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 79671036cb5f..13bc2c72f791 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
>   * @pfns: array of pfns (big enough for the range)
>   * @flags: pfn flags to match device driver page table
>   * @values: pfn value for some special case (none, special, error, ...)
> + * @default_flags: default flags for the range (write, read, ...)
> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
>   * @valid: pfns array did not change since it has been fill by an HMM function
>   */
> @@ -177,6 +179,8 @@ struct hmm_range {
>  	uint64_t		*pfns;
>  	const uint64_t		*flags;
>  	const uint64_t		*values;
> +	uint64_t		default_flags;
> +	uint64_t		pfn_flags_mask;
>  	uint8_t			pfn_shift;
>  	bool			valid;
>  };
> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>  {
>  	long ret;
>  
> +	range->default_flags = 0;
> +	range->pfn_flags_mask = -1UL;

Hi Jerome,

This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
definitely does not need to be a run time value. And we want some assurance that
the mask is 
	a) large enough for the flags, and
	b) small enough to avoid overrunning the pfns field.

Those are less certain with a run-time struct field, and more obviously correct with
something like, approximately:

 	#define PFN_FLAGS_MASK 0xFFFF

or something.

In other words, this is more flexibility than we need--just a touch too much,
IMHO.

> +
>  	ret = hmm_range_register(range, range->vma->vm_mm,
>  				 range->start, range->end);
>  	if (ret)
> diff --git a/mm/hmm.c b/mm/hmm.c
> index fa9498eeb9b6..4fe88a196d17 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>  	if (!hmm_vma_walk->fault)
>  		return;
>  
> +	/*
> +	 * So we not only consider the individual per page request we also
> +	 * consider the default flags requested for the range. The API can
> +	 * be use in 2 fashions. The first one where the HMM user coalesce
> +	 * multiple page fault into one request and set flags per pfns for
> +	 * of those faults. The second one where the HMM user want to pre-
> +	 * fault a range with specific flags. For the latter one it is a
> +	 * waste to have the user pre-fill the pfn arrays with a default
> +	 * flags value.
> +	 */
> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;

Need to verify that the mask isn't too large or too small.

> +
>  	/* We aren't ask to do anything ... */
>  	if (!(pfns & range->flags[HMM_PFN_VALID]))
>  		return;
> 



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 05/11] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() v2
  2019-03-28 13:43   ` Ira Weiny
@ 2019-03-28 22:03     ` Jerome Glisse
  0 siblings, 0 replies; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 22:03 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-mm, linux-kernel, Andrew Morton, John Hubbard, Dan Williams

On Thu, Mar 28, 2019 at 06:43:51AM -0700, Ira Weiny wrote:
> On Mon, Mar 25, 2019 at 10:40:05AM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > Rename for consistency between code, comments and documentation. Also
> > improves the comments on all the possible returns values. Improve the
> > function by returning the number of populated entries in pfns array.
> > 
> > Changes since v1:
> >     - updated documentation
> >     - reformated some comments
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  Documentation/vm/hmm.rst |  8 +---
> >  include/linux/hmm.h      | 13 +++++-
> >  mm/hmm.c                 | 91 +++++++++++++++++-----------------------
> >  3 files changed, 52 insertions(+), 60 deletions(-)
> > 
> > diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> > index d9b27bdadd1b..61f073215a8d 100644
> > --- a/Documentation/vm/hmm.rst
> > +++ b/Documentation/vm/hmm.rst
> > @@ -190,13 +190,7 @@ When the device driver wants to populate a range of virtual addresses, it can
> >  use either::
> >  
> >    long hmm_range_snapshot(struct hmm_range *range);
> > -  int hmm_vma_fault(struct vm_area_struct *vma,
> > -                    struct hmm_range *range,
> > -                    unsigned long start,
> > -                    unsigned long end,
> > -                    hmm_pfn_t *pfns,
> > -                    bool write,
> > -                    bool block);
> > +  long hmm_range_fault(struct hmm_range *range, bool block);
> >  
> >  The first one (hmm_range_snapshot()) will only fetch present CPU page table
> >  entries and will not trigger a page fault on missing or non-present entries.
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 32206b0b1bfd..e9afd23c2eac 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -391,7 +391,18 @@ bool hmm_vma_range_done(struct hmm_range *range);
> >   *
> >   * See the function description in mm/hmm.c for further documentation.
> >   */
> > -int hmm_vma_fault(struct hmm_range *range, bool block);
> > +long hmm_range_fault(struct hmm_range *range, bool block);
> > +
> > +/* This is a temporary helper to avoid merge conflict between trees. */
> > +static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> > +{
> > +	long ret = hmm_range_fault(range, block);
> > +	if (ret == -EBUSY)
> > +		ret = -EAGAIN;
> > +	else if (ret == -EAGAIN)
> > +		ret = -EBUSY;
> > +	return ret < 0 ? ret : 0;
> > +}
> >  
> >  /* Below are for HMM internal use only! Not to be used by device driver! */
> >  void hmm_mm_destroy(struct mm_struct *mm);
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 91361aa74b8b..7860e63c3ba7 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -336,13 +336,13 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
> >  	flags |= write_fault ? FAULT_FLAG_WRITE : 0;
> >  	ret = handle_mm_fault(vma, addr, flags);
> >  	if (ret & VM_FAULT_RETRY)
> > -		return -EBUSY;
> > +		return -EAGAIN;
> >  	if (ret & VM_FAULT_ERROR) {
> >  		*pfn = range->values[HMM_PFN_ERROR];
> >  		return -EFAULT;
> >  	}
> >  
> > -	return -EAGAIN;
> > +	return -EBUSY;
> >  }
> >  
> >  static int hmm_pfns_bad(unsigned long addr,
> > @@ -368,7 +368,7 @@ static int hmm_pfns_bad(unsigned long addr,
> >   * @fault: should we fault or not ?
> >   * @write_fault: write fault ?
> >   * @walk: mm_walk structure
> > - * Returns: 0 on success, -EAGAIN after page fault, or page fault error
> > + * Returns: 0 on success, -EBUSY after page fault, or page fault error
> >   *
> >   * This function will be called whenever pmd_none() or pte_none() returns true,
> >   * or whenever there is no page directory covering the virtual address range.
> > @@ -391,12 +391,12 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
> >  
> >  			ret = hmm_vma_do_fault(walk, addr, write_fault,
> >  					       &pfns[i]);
> > -			if (ret != -EAGAIN)
> > +			if (ret != -EBUSY)
> >  				return ret;
> >  		}
> >  	}
> >  
> > -	return (fault || write_fault) ? -EAGAIN : 0;
> > +	return (fault || write_fault) ? -EBUSY : 0;
> >  }
> >  
> >  static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
> > @@ -527,11 +527,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> >  	uint64_t orig_pfn = *pfn;
> >  
> >  	*pfn = range->values[HMM_PFN_NONE];
> > -	cpu_flags = pte_to_hmm_pfn_flags(range, pte);
> > -	hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
> > -			   &fault, &write_fault);
> > +	fault = write_fault = false;
> >  
> >  	if (pte_none(pte)) {
> > +		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0,
> > +				   &fault, &write_fault);
> 
> This really threw me until I applied the patches to a tree.  It looks like this
> is just optimizing away a pte_none() check.  The only functional change which
> was mentioned was returning the number of populated pfns.  So I spent a bit of
> time trying to figure out why hmm_pte_need_fault() needed to move _here_ to do
> that...  :-(
> 
> It would have been nice to have said something about optimizing in the commit
> message.

Yes, I should have added that to the commit message; I forgot.

> 
> >  		if (fault || write_fault)
> >  			goto fault;
> >  		return 0;
> > @@ -570,7 +570,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> >  				hmm_vma_walk->last = addr;
> >  				migration_entry_wait(vma->vm_mm,
> >  						     pmdp, addr);
> > -				return -EAGAIN;
> > +				return -EBUSY;
> >  			}
> >  			return 0;
> >  		}
> > @@ -578,6 +578,10 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> >  		/* Report error for everything else */
> >  		*pfn = range->values[HMM_PFN_ERROR];
> >  		return -EFAULT;
> > +	} else {
> > +		cpu_flags = pte_to_hmm_pfn_flags(range, pte);
> > +		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
> > +				   &fault, &write_fault);
> 
> Looks like the same situation as above.
> 
> >  	}
> >  
> >  	if (fault || write_fault)
> > @@ -628,7 +632,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  		if (fault || write_fault) {
> >  			hmm_vma_walk->last = addr;
> >  			pmd_migration_entry_wait(vma->vm_mm, pmdp);
> > -			return -EAGAIN;
> > +			return -EBUSY;
> 
> While I am at it.  Why are we swapping EAGAIN and EBUSY everywhere?

It is part of the API change when going from hmm_vma_fault() to
hmm_range_fault(), unifying the return values with the old
hmm_vma_get_pfns() so that all the APIs attach the same meaning to
the same return value.
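
Concretely, as I read the pseudo code in patch 6, the unified convention
is: -EAGAIN means the mmap_sem was dropped and must be retaken before
retrying, while -EBUSY means simply retry. A caller then handles both
functions the same way; the sketch below just restates that pseudo code:

	ret = hmm_range_snapshot(&range); /* or hmm_range_fault(&range, block) */
	if (ret == -EAGAIN) {
		/* mmap_sem was dropped for us: retake it and retry */
		down_read(&mm->mmap_sem);
		goto again;
	} else if (ret == -EBUSY) {
		/* transient condition: just retry */
		goto again;
	}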

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 21:41       ` John Hubbard
@ 2019-03-28 22:08         ` Jerome Glisse
  2019-03-28 22:25           ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 22:08 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
> On 3/28/19 2:30 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> >> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>> From: Jérôme Glisse <jglisse@redhat.com>
> >>>
> >>> The device driver context which holds reference to mirror and thus to
> >>> core hmm struct might outlive the mm against which it was created. To
> >>> avoid every driver to check for that case provide an helper that check
> >>> if mm is still alive and take the mmap_sem in read mode if so. If the
> >>> mm have been destroy (mmu_notifier release call back did happen) then
> >>> we return -EINVAL so that calling code knows that it is trying to do
> >>> something against a mm that is no longer valid.
> >>>
> >>> Changes since v1:
> >>>     - removed bunch of useless check (if API is use with bogus argument
> >>>       better to fail loudly so user fix their code)
> >>>
> >>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> >>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>> Cc: Dan Williams <dan.j.williams@intel.com>
> >>> ---
> >>>  include/linux/hmm.h | 50 ++++++++++++++++++++++++++++++++++++++++++---
> >>>  1 file changed, 47 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> >>> index f3b919b04eda..5f9deaeb9d77 100644
> >>> --- a/include/linux/hmm.h
> >>> +++ b/include/linux/hmm.h
> >>> @@ -438,6 +438,50 @@ struct hmm_mirror {
> >>>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
> >>>  void hmm_mirror_unregister(struct hmm_mirror *mirror);
> >>>  
> >>> +/*
> >>> + * hmm_mirror_mm_down_read() - lock the mmap_sem in read mode
> >>> + * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> >>> + * Returns: -EINVAL if the mm is dead, 0 otherwise (lock taken).
> >>> + *
> >>> + * The device driver context which holds reference to mirror and thus to core
> >>> + * hmm struct might outlive the mm against which it was created. To avoid every
> >>> + * driver to check for that case provide an helper that check if mm is still
> >>> + * alive and take the mmap_sem in read mode if so. If the mm have been destroy
> >>> + * (mmu_notifier release call back did happen) then we return -EINVAL so that
> >>> + * calling code knows that it is trying to do something against a mm that is
> >>> + * no longer valid.
> >>> + */
> >>> +static inline int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
> >>
> >> Hi Jerome,
> >>
> >> Let's please not do this. There are at least two problems here:
> >>
> >> 1. The hmm_mirror_mm_down_read() wrapper around down_read() requires a 
> >> return value. This is counter to how locking is normally done: callers do
> >> not normally have to check the return value of most locks (other than
> >> trylocks). And sure enough, your own code below doesn't check the return value.
> >> That is a pretty good illustration of why not to do this.
> > 
> > Please read the function description this is not about checking lock
> > return value it is about checking wether we are racing with process
> > destruction and avoid trying to take lock in such cases so that driver
> > do abort as quickly as possible when a process is being kill.
> > 
> >>
> >> 2. This is a weird place to randomly check for semi-unrelated state, such 
> >> as "is HMM still alive". By that I mean, if you have to detect a problem
> >> at down_read() time, then the problem could have existed both before and
> >> after the call to this wrapper. So it is providing a false sense of security,
> >> and it is therefore actually undesirable to add the code.
> > 
> > It is not, this function is use in device page fault handler which will
> > happens asynchronously from CPU event or process lifetime when a process
> > is killed or is dying we do want to avoid useless page fault work and
> > we do want to avoid blocking the page fault queue of the device. This
> > function reports to the caller that the process is dying and that it
> > should just abort the page fault and do whatever other device specific
> > thing that needs to happen.
> > 
> 
> But it's inherently racy, to check for a condition outside of any lock, so again,
> it's a false sense of security.

Yes, and the race is fine here; this is just to avoid useless work. If we
are unlucky and we race and fail to see the destruction that is just
happening, then that is fine: we are just going to do useless work. So
we do not care about the race here, we just want to bail out early if we
can witness the process dying.

> 
> >>
> >> If you insist on having this wrapper, I think it should have approximately 
> >> this form:
> >>
> >> void hmm_mirror_mm_down_read(...)
> >> {
> >> 	WARN_ON(...)
> >> 	down_read(...)
> >> } 
> > 
> > I do insist as it is useful and use by both RDMA and nouveau and the
> > above would kill the intent. The intent is do not try to take the lock
> > if the process is dying.
> 
> Could you provide me a link to those examples so I can take a peek? I
> am still convinced that this whole thing is a race condition at best.

The race is fine and OK; see:

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec

which has been posted, and I think I provided a link to that post in the
cover letter. The same patch exists for nouveau; I need to clean up that
tree and push it.

> > 
> > 
> >>
> >>> +{
> >>> +	struct mm_struct *mm;
> >>> +
> >>> +	/* Sanity check ... */
> >>> +	if (!mirror || !mirror->hmm)
> >>> +		return -EINVAL;
> >>> +	/*
> >>> +	 * Before trying to take the mmap_sem make sure the mm is still
> >>> +	 * alive as device driver context might outlive the mm lifetime.
> >>
> >> Let's find another way, and a better place, to solve this problem.
> >> Ref counting?
> > 
> > This has nothing to do with refcount or use after free or anthing
> > like that. It is just about checking wether we are about to do
> > something pointless. If the process is dying then it is pointless
> > to try to take the lock and it is pointless for the device driver
> > to trigger handle_mm_fault().
> 
> Well, what happens if you let such pointless code run anyway? 
> Does everything still work? If yes, then we don't need this change.
> If no, then we need a race-free version of this change.

Yes, everything works; nothing bad can happen from the race, it will just
do useless work, which never hurt anyone.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-28 21:59   ` John Hubbard
@ 2019-03-28 22:12     ` Jerome Glisse
  2019-03-28 22:19       ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 22:12 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > The HMM mirror API can be use in two fashions. The first one where the HMM
> > user coalesce multiple page faults into one request and set flags per pfns
> > for of those faults. The second one where the HMM user want to pre-fault a
> > range with specific flags. For the latter one it is a waste to have the user
> > pre-fill the pfn arrays with a default flags value.
> > 
> > This patch adds a default flags value allowing user to set them for a range
> > without having to pre-fill the pfn array.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  include/linux/hmm.h |  7 +++++++
> >  mm/hmm.c            | 12 ++++++++++++
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 79671036cb5f..13bc2c72f791 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
> >   * @pfns: array of pfns (big enough for the range)
> >   * @flags: pfn flags to match device driver page table
> >   * @values: pfn value for some special case (none, special, error, ...)
> > + * @default_flags: default flags for the range (write, read, ...)
> > + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
> >   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> >   * @valid: pfns array did not change since it has been fill by an HMM function
> >   */
> > @@ -177,6 +179,8 @@ struct hmm_range {
> >  	uint64_t		*pfns;
> >  	const uint64_t		*flags;
> >  	const uint64_t		*values;
> > +	uint64_t		default_flags;
> > +	uint64_t		pfn_flags_mask;
> >  	uint8_t			pfn_shift;
> >  	bool			valid;
> >  };
> > @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> >  {
> >  	long ret;
> >  
> > +	range->default_flags = 0;
> > +	range->pfn_flags_mask = -1UL;
> 
> Hi Jerome,
> 
> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
> definitely does not need to be a run time value. And we want some assurance that
> the mask is 
> 	a) large enough for the flags, and
> 	b) small enough to avoid overrunning the pfns field.
> 
> Those are less certain with a run-time struct field, and more obviously correct with
> something like, approximately:
> 
>  	#define PFN_FLAGS_MASK 0xFFFF
> 
> or something.
> 
> In other words, this is more flexibility than we need--just a touch too much,
> IMHO.

This mirrors the fact that flags are provided as an array and some devices
use the top bits for flags (read, write, ...). So the safe default here is
to set it to -1. If the caller wants to leverage this optimization it can
override the default_flags value.
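
As a sketch of the two usages (the flag indices come from the existing
hmm_pfn_flag_e enum used elsewhere in the patch; the values here are
purely illustrative):

	/* pre-fault an entire range for write: no pfns pre-fill needed */
	range.default_flags = range.flags[HMM_PFN_VALID] |
			      range.flags[HMM_PFN_WRITE];
	range.pfn_flags_mask = 0;	/* only default_flags matter */

	/* coalesced per-page faults: keep per-pfn requests untouched */
	range.default_flags = 0;
	range.pfn_flags_mask = -1UL;	/* pass per-pfn flags through */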

> 
> > +
> >  	ret = hmm_range_register(range, range->vma->vm_mm,
> >  				 range->start, range->end);
> >  	if (ret)
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index fa9498eeb9b6..4fe88a196d17 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
> >  	if (!hmm_vma_walk->fault)
> >  		return;
> >  
> > +	/*
> > +	 * So we not only consider the individual per page request we also
> > +	 * consider the default flags requested for the range. The API can
> > +	 * be use in 2 fashions. The first one where the HMM user coalesce
> > +	 * multiple page fault into one request and set flags per pfns for
> > +	 * of those faults. The second one where the HMM user want to pre-
> > +	 * fault a range with specific flags. For the latter one it is a
> > +	 * waste to have the user pre-fill the pfn arrays with a default
> > +	 * flags value.
> > +	 */
> > +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
> 
> Need to verify that the mask isn't too large or too small.

I need to check again, but the default flags are ANDed somewhere to limit
the bits to the ones we expect.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-28 22:12     ` Jerome Glisse
@ 2019-03-28 22:19       ` John Hubbard
  2019-03-28 22:31         ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 22:19 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 3:12 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>
>>> The HMM mirror API can be use in two fashions. The first one where the HMM
>>> user coalesce multiple page faults into one request and set flags per pfns
>>> for of those faults. The second one where the HMM user want to pre-fault a
>>> range with specific flags. For the latter one it is a waste to have the user
>>> pre-fill the pfn arrays with a default flags value.
>>>
>>> This patch adds a default flags value allowing user to set them for a range
>>> without having to pre-fill the pfn array.
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>>  include/linux/hmm.h |  7 +++++++
>>>  mm/hmm.c            | 12 ++++++++++++
>>>  2 files changed, 19 insertions(+)
>>>
>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>> index 79671036cb5f..13bc2c72f791 100644
>>> --- a/include/linux/hmm.h
>>> +++ b/include/linux/hmm.h
>>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
>>>   * @pfns: array of pfns (big enough for the range)
>>>   * @flags: pfn flags to match device driver page table
>>>   * @values: pfn value for some special case (none, special, error, ...)
>>> + * @default_flags: default flags for the range (write, read, ...)
>>> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
>>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
>>>   * @valid: pfns array did not change since it has been fill by an HMM function
>>>   */
>>> @@ -177,6 +179,8 @@ struct hmm_range {
>>>  	uint64_t		*pfns;
>>>  	const uint64_t		*flags;
>>>  	const uint64_t		*values;
>>> +	uint64_t		default_flags;
>>> +	uint64_t		pfn_flags_mask;
>>>  	uint8_t			pfn_shift;
>>>  	bool			valid;
>>>  };
>>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>>>  {
>>>  	long ret;
>>>  
>>> +	range->default_flags = 0;
>>> +	range->pfn_flags_mask = -1UL;
>>
>> Hi Jerome,
>>
>> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
>> definitely does not need to be a run time value. And we want some assurance that
>> the mask is 
>> 	a) large enough for the flags, and
>> 	b) small enough to avoid overrunning the pfns field.
>>
>> Those are less certain with a run-time struct field, and more obviously correct with
>> something like, approximately:
>>
>>  	#define PFN_FLAGS_MASK 0xFFFF
>>
>> or something.
>>
>> In other words, this is more flexibility than we need--just a touch too much,
>> IMHO.
> 
> This mirror the fact that flags are provided as an array and some devices use
> the top bits for flags (read, write, ...). So here it is the safe default to
> set it to -1. If the caller want to leverage this optimization it can override
> the default_flags value.
> 

Optimization? OK, now I'm a bit lost. Maybe this is another place where I could
use a peek at the calling code. The only flags I've seen so far use the bottom
3 bits and that's it. 

Maybe comments here?

>>
>>> +
>>>  	ret = hmm_range_register(range, range->vma->vm_mm,
>>>  				 range->start, range->end);
>>>  	if (ret)
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index fa9498eeb9b6..4fe88a196d17 100644
>>> --- a/mm/hmm.c
>>> +++ b/mm/hmm.c
>>> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>>>  	if (!hmm_vma_walk->fault)
>>>  		return;
>>>  
>>> +	/*
>>> +	 * So we not only consider the individual per page request we also
>>> +	 * consider the default flags requested for the range. The API can
>>> +	 * be use in 2 fashions. The first one where the HMM user coalesce
>>> +	 * multiple page fault into one request and set flags per pfns for
>>> +	 * of those faults. The second one where the HMM user want to pre-
>>> +	 * fault a range with specific flags. For the latter one it is a
>>> +	 * waste to have the user pre-fill the pfn arrays with a default
>>> +	 * flags value.
>>> +	 */
>>> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
>>
>> Need to verify that the mask isn't too large or too small.
> 
> I need to check agin but default flag is anded somewhere to limit
> the bit to the one we expect.

Right, but in general, the *mask* could be wrong. It would be nice to have
an assert, and/or a comment, or something to verify the mask is proper.

Really, a hardcoded mask is simple and correct--unless it *definitely* must
vary for devices of course.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 22:08         ` Jerome Glisse
@ 2019-03-28 22:25           ` John Hubbard
  2019-03-28 22:40             ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 22:25 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 3:08 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
>> On 3/28/19 2:30 PM, Jerome Glisse wrote:
>>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>> From: Jérôme Glisse <jglisse@redhat.com>
[...]
>>
>>>>
>>>> If you insist on having this wrapper, I think it should have approximately 
>>>> this form:
>>>>
>>>> void hmm_mirror_mm_down_read(...)
>>>> {
>>>> 	WARN_ON(...)
>>>> 	down_read(...)
>>>> } 
>>>
>>> I do insist as it is useful and use by both RDMA and nouveau and the
>>> above would kill the intent. The intent is do not try to take the lock
>>> if the process is dying.
>>
>> Could you provide me a link to those examples so I can take a peek? I
>> am still convinced that this whole thing is a race condition at best.
> 
> The race is fine and ok see:
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
> 
> which has been posted and i think i provided a link in the cover
> letter to that post. The same patch exist for nouveau i need to
> cleanup that tree and push it.

Thanks for that link, and I apologize for not keeping up with that
other review thread.

Looking it over, hmm_mirror_mm_down_read() is only used in one place.
So, what you really want there is not a down_read() wrapper, but rather,
something like

	hmm_sanity_check()

, that ib_umem_odp_map_dma_pages() calls.


> 
>>>
>>>
>>>>
>>>>> +{
>>>>> +	struct mm_struct *mm;
>>>>> +
>>>>> +	/* Sanity check ... */
>>>>> +	if (!mirror || !mirror->hmm)
>>>>> +		return -EINVAL;
>>>>> +	/*
>>>>> +	 * Before trying to take the mmap_sem make sure the mm is still
>>>>> +	 * alive as device driver context might outlive the mm lifetime.
>>>>
>>>> Let's find another way, and a better place, to solve this problem.
>>>> Ref counting?
>>>
>>> This has nothing to do with refcount or use after free or anthing
>>> like that. It is just about checking wether we are about to do
>>> something pointless. If the process is dying then it is pointless
>>> to try to take the lock and it is pointless for the device driver
>>> to trigger handle_mm_fault().
>>
>> Well, what happens if you let such pointless code run anyway? 
>> Does everything still work? If yes, then we don't need this change.
>> If no, then we need a race-free version of this change.
> 
> Yes everything work, nothing bad can happen from a race, it will just
> do useless work which never hurt anyone.
> 

OK, so let's either drop this patch, or if merge windows won't allow that,
then *eventually* drop this patch. And instead, put in a hmm_sanity_check()
that does the same checks.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-28 22:19       ` John Hubbard
@ 2019-03-28 22:31         ` Jerome Glisse
  2019-03-28 22:40           ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 22:31 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>> From: Jérôme Glisse <jglisse@redhat.com>
> >>>
> >>> The HMM mirror API can be use in two fashions. The first one where the HMM
> >>> user coalesce multiple page faults into one request and set flags per pfns
> >>> for of those faults. The second one where the HMM user want to pre-fault a
> >>> range with specific flags. For the latter one it is a waste to have the user
> >>> pre-fill the pfn arrays with a default flags value.
> >>>
> >>> This patch adds a default flags value allowing user to set them for a range
> >>> without having to pre-fill the pfn array.
> >>>
> >>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> >>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>> Cc: Dan Williams <dan.j.williams@intel.com>
> >>> ---
> >>>  include/linux/hmm.h |  7 +++++++
> >>>  mm/hmm.c            | 12 ++++++++++++
> >>>  2 files changed, 19 insertions(+)
> >>>
> >>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> >>> index 79671036cb5f..13bc2c72f791 100644
> >>> --- a/include/linux/hmm.h
> >>> +++ b/include/linux/hmm.h
> >>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
> >>>   * @pfns: array of pfns (big enough for the range)
> >>>   * @flags: pfn flags to match device driver page table
> >>>   * @values: pfn value for some special case (none, special, error, ...)
> >>> + * @default_flags: default flags for the range (write, read, ...)
> >>> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
> >>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> >>>   * @valid: pfns array did not change since it has been fill by an HMM function
> >>>   */
> >>> @@ -177,6 +179,8 @@ struct hmm_range {
> >>>  	uint64_t		*pfns;
> >>>  	const uint64_t		*flags;
> >>>  	const uint64_t		*values;
> >>> +	uint64_t		default_flags;
> >>> +	uint64_t		pfn_flags_mask;
> >>>  	uint8_t			pfn_shift;
> >>>  	bool			valid;
> >>>  };
> >>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> >>>  {
> >>>  	long ret;
> >>>  
> >>> +	range->default_flags = 0;
> >>> +	range->pfn_flags_mask = -1UL;
> >>
> >> Hi Jerome,
> >>
> >> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
> >> definitely does not need to be a run time value. And we want some assurance that
> >> the mask is 
> >> 	a) large enough for the flags, and
> >> 	b) small enough to avoid overrunning the pfns field.
> >>
> >> Those are less certain with a run-time struct field, and more obviously correct with
> >> something like, approximately:
> >>
> >>  	#define PFN_FLAGS_MASK 0xFFFF
> >>
> >> or something.
> >>
> >> In other words, this is more flexibility than we need--just a touch too much,
> >> IMHO.
> > 
> > This mirror the fact that flags are provided as an array and some devices use
> > the top bits for flags (read, write, ...). So here it is the safe default to
> > set it to -1. If the caller want to leverage this optimization it can override
> > the default_flags value.
> > 
> 
> Optimization? OK, now I'm a bit lost. Maybe this is another place where I could
> use a peek at the calling code. The only flags I've seen so far use the bottom
> 3 bits and that's it. 
> 
> Maybe comments here?
> 
> >>
> >>> +
> >>>  	ret = hmm_range_register(range, range->vma->vm_mm,
> >>>  				 range->start, range->end);
> >>>  	if (ret)
> >>> diff --git a/mm/hmm.c b/mm/hmm.c
> >>> index fa9498eeb9b6..4fe88a196d17 100644
> >>> --- a/mm/hmm.c
> >>> +++ b/mm/hmm.c
> >>> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
> >>>  	if (!hmm_vma_walk->fault)
> >>>  		return;
> >>>  
> >>> +	/*
> >>> +	 * So we not only consider the individual per page request we also
> >>> +	 * consider the default flags requested for the range. The API can
> >>> +	 * be use in 2 fashions. The first one where the HMM user coalesce
> >>> +	 * multiple page fault into one request and set flags per pfns for
> >>> +	 * of those faults. The second one where the HMM user want to pre-
> >>> +	 * fault a range with specific flags. For the latter one it is a
> >>> +	 * waste to have the user pre-fill the pfn arrays with a default
> >>> +	 * flags value.
> >>> +	 */
> >>> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
> >>
> >> Need to verify that the mask isn't too large or too small.
> > 
> > I need to check agin but default flag is anded somewhere to limit
> > the bit to the one we expect.
> 
> Right, but in general, the *mask* could be wrong. It would be nice to have
> an assert, and/or a comment, or something to verify the mask is proper.
> 
> Really, a hardcoded mask is simple and correct--unless it *definitely* must
> vary for devices of course.

OK, so I re-read the code and it is correct. The helper for compatibility with
the old API (so that I do not break the upstream nouveau code) initializes
those to the safe defaults, i.e.:

range->default_flags = 0;
range->pfn_flags_mask = -1;

Which means that, in the above comment, we are in the case where it is the
individual entry within the pfn array that determines whether we fault or
not.

A driver using the new API can either use this safe default, or use the
second case from the above comment and set default_flags to something
other than 0.

Note that those default_flags are not set in the final result; they are
used to determine whether we need to do a page fault. For instance, if you
set the write bit in the default flags, then the pfns value computed above
will have the write bit set, and when we compare it against the CPU pte,
if the CPU pte does not have the write bit set then we will fault. What
matters is that in this case the value within the pfns array is totally
irrelevant, i.e. we do not care what it is; it will not affect the
decision, the decision is made by looking at the default flags.
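
As a sketch (simplified, not the exact mm/hmm.c code; cpu_flags stands
for the HMM encoding of the current CPU pte), the per-entry decision
looks roughly like:

	uint64_t want = (range->pfns[i] & range->pfn_flags_mask) |
			range->default_flags;

	if (want & range->flags[HMM_PFN_VALID]) {
		/* Fault if the CPU pte is not valid at all ... */
		fault = !(cpu_flags & range->flags[HMM_PFN_VALID]);
		/* ... or if write is wanted but the CPU pte is read-only. */
		if ((want & range->flags[HMM_PFN_WRITE]) &&
		    !(cpu_flags & range->flags[HMM_PFN_WRITE]))
			fault = write_fault = true;
	}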

Hope this clarifies things. You can look at the ODP patch to see how it
is used:

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 22:25           ` John Hubbard
@ 2019-03-28 22:40             ` Jerome Glisse
  2019-03-28 22:43               ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 22:40 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
> On 3/28/19 3:08 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
> >> On 3/28/19 2:30 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> >>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>> From: Jérôme Glisse <jglisse@redhat.com>
> [...]
> >>
> >>>>
> >>>> If you insist on having this wrapper, I think it should have approximately 
> >>>> this form:
> >>>>
> >>>> void hmm_mirror_mm_down_read(...)
> >>>> {
> >>>> 	WARN_ON(...)
> >>>> 	down_read(...)
> >>>> } 
> >>>
> >>> I do insist as it is useful and use by both RDMA and nouveau and the
> >>> above would kill the intent. The intent is do not try to take the lock
> >>> if the process is dying.
> >>
> >> Could you provide me a link to those examples so I can take a peek? I
> >> am still convinced that this whole thing is a race condition at best.
> > 
> > The race is fine and ok see:
> > 
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
> > 
> > which has been posted and i think i provided a link in the cover
> > letter to that post. The same patch exist for nouveau i need to
> > cleanup that tree and push it.
> 
> Thanks for that link, and I apologize for not keeping up with that
> other review thread.
> 
> Looking it over, hmm_mirror_mm_down_read() is only used in one place.
> So, what you really want there is not a down_read() wrapper, but rather,
> something like
> 
> 	hmm_sanity_check()
> 
> , that ib_umem_odp_map_dma_pages() calls.

Why? The device driver pattern is:
    if (hmm_is_it_dying()) {
        // the process is dying: abort the fault, it is useless to
        // call into HMM for it
    }
    down_read(mmap_sem);

This pattern is common to nouveau, RDMA, and other device drivers in the
works. Hence why I am replacing it with just one helper. It also has the
added benefit that the changes being discussed around the mmap_sem will be
easier to do, as it avoids having to update each driver; instead it can be
done just once in the HMM helpers.
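
For reference, roughly the shape the helper takes (reconstructed from the
hunks quoted earlier in this thread; field names like hmm->mm and
hmm->dead are my reading of the series, not verbatim patch code):

	int hmm_mirror_mm_down_read(struct hmm_mirror *mirror)
	{
		struct mm_struct *mm;

		/* Sanity check ... */
		if (!mirror || !mirror->hmm)
			return -EINVAL;

		/*
		 * Racy by design: if the process is already dying there is
		 * no point queuing more work, and a stale answer here is
		 * harmless.
		 */
		mm = READ_ONCE(mirror->hmm->mm);
		if (!mm || mirror->hmm->dead)
			return -EINVAL;

		down_read(&mm->mmap_sem);
		return 0;
	}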

> 
> 
> > 
> >>>
> >>>
> >>>>
> >>>>> +{
> >>>>> +	struct mm_struct *mm;
> >>>>> +
> >>>>> +	/* Sanity check ... */
> >>>>> +	if (!mirror || !mirror->hmm)
> >>>>> +		return -EINVAL;
> >>>>> +	/*
> >>>>> +	 * Before trying to take the mmap_sem make sure the mm is still
> >>>>> +	 * alive as device driver context might outlive the mm lifetime.
> >>>>
> >>>> Let's find another way, and a better place, to solve this problem.
> >>>> Ref counting?
> >>>
> >>> This has nothing to do with refcount or use after free or anthing
> >>> like that. It is just about checking wether we are about to do
> >>> something pointless. If the process is dying then it is pointless
> >>> to try to take the lock and it is pointless for the device driver
> >>> to trigger handle_mm_fault().
> >>
> >> Well, what happens if you let such pointless code run anyway? 
> >> Does everything still work? If yes, then we don't need this change.
> >> If no, then we need a race-free version of this change.
> > 
> > Yes everything work, nothing bad can happen from a race, it will just
> > do useless work which never hurt anyone.
> > 
> 
> OK, so let's either drop this patch, or if merge windows won't allow that,
> then *eventually* drop this patch. And instead, put in a hmm_sanity_check()
> that does the same checks.

RDMA depends on this, and so does the nouveau patchset that converts to the
new API. So I do not see a reason to drop this. There are users for this,
they are posted, and I hope I explained the benefit properly.

It is a common pattern. Yes, it only saves a couple of lines of code, but
down the road it will also help people working on the mmap_sem patchset.


Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-28 22:31         ` Jerome Glisse
@ 2019-03-28 22:40           ` John Hubbard
  2019-03-28 23:21             ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 22:40 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 3:31 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>>>
>>>>> The HMM mirror API can be use in two fashions. The first one where the HMM
>>>>> user coalesce multiple page faults into one request and set flags per pfns
>>>>> for of those faults. The second one where the HMM user want to pre-fault a
>>>>> range with specific flags. For the latter one it is a waste to have the user
>>>>> pre-fill the pfn arrays with a default flags value.
>>>>>
>>>>> This patch adds a default flags value allowing user to set them for a range
>>>>> without having to pre-fill the pfn array.
>>>>>
>>>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>>>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>> ---
>>>>>  include/linux/hmm.h |  7 +++++++
>>>>>  mm/hmm.c            | 12 ++++++++++++
>>>>>  2 files changed, 19 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>>>> index 79671036cb5f..13bc2c72f791 100644
>>>>> --- a/include/linux/hmm.h
>>>>> +++ b/include/linux/hmm.h
>>>>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
>>>>>   * @pfns: array of pfns (big enough for the range)
>>>>>   * @flags: pfn flags to match device driver page table
>>>>>   * @values: pfn value for some special case (none, special, error, ...)
>>>>> + * @default_flags: default flags for the range (write, read, ...)
>>>>> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
>>>>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
>>>>>   * @valid: pfns array did not change since it has been fill by an HMM function
>>>>>   */
>>>>> @@ -177,6 +179,8 @@ struct hmm_range {
>>>>>  	uint64_t		*pfns;
>>>>>  	const uint64_t		*flags;
>>>>>  	const uint64_t		*values;
>>>>> +	uint64_t		default_flags;
>>>>> +	uint64_t		pfn_flags_mask;
>>>>>  	uint8_t			pfn_shift;
>>>>>  	bool			valid;
>>>>>  };
>>>>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>>>>>  {
>>>>>  	long ret;
>>>>>  
>>>>> +	range->default_flags = 0;
>>>>> +	range->pfn_flags_mask = -1UL;
>>>>
>>>> Hi Jerome,
>>>>
>>>> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
>>>> definitely does not need to be a run time value. And we want some assurance that
>>>> the mask is 
>>>> 	a) large enough for the flags, and
>>>> 	b) small enough to avoid overrunning the pfns field.
>>>>
>>>> Those are less certain with a run-time struct field, and more obviously correct with
>>>> something like, approximately:
>>>>
>>>>  	#define PFN_FLAGS_MASK 0xFFFF
>>>>
>>>> or something.
>>>>
>>>> In other words, this is more flexibility than we need--just a touch too much,
>>>> IMHO.
>>>
>>> This mirror the fact that flags are provided as an array and some devices use
>>> the top bits for flags (read, write, ...). So here it is the safe default to
>>> set it to -1. If the caller want to leverage this optimization it can override
>>> the default_flags value.
>>>
>>
>> Optimization? OK, now I'm a bit lost. Maybe this is another place where I could
>> use a peek at the calling code. The only flags I've seen so far use the bottom
>> 3 bits and that's it. 
>>
>> Maybe comments here?
>>
>>>>
>>>>> +
>>>>>  	ret = hmm_range_register(range, range->vma->vm_mm,
>>>>>  				 range->start, range->end);
>>>>>  	if (ret)
>>>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>>>> index fa9498eeb9b6..4fe88a196d17 100644
>>>>> --- a/mm/hmm.c
>>>>> +++ b/mm/hmm.c
>>>>> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>>>>>  	if (!hmm_vma_walk->fault)
>>>>>  		return;
>>>>>  
>>>>> +	/*
>>>>> +	 * So we not only consider the individual per page request we also
>>>>> +	 * consider the default flags requested for the range. The API can
>>>>> +	 * be use in 2 fashions. The first one where the HMM user coalesce
>>>>> +	 * multiple page fault into one request and set flags per pfns for
>>>>> +	 * of those faults. The second one where the HMM user want to pre-
>>>>> +	 * fault a range with specific flags. For the latter one it is a
>>>>> +	 * waste to have the user pre-fill the pfn arrays with a default
>>>>> +	 * flags value.
>>>>> +	 */
>>>>> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
>>>>
>>>> Need to verify that the mask isn't too large or too small.
>>>
>>> I need to check agin but default flag is anded somewhere to limit
>>> the bit to the one we expect.
>>
>> Right, but in general, the *mask* could be wrong. It would be nice to have
>> an assert, and/or a comment, or something to verify the mask is proper.
>>
>> Really, a hardcoded mask is simple and correct--unless it *definitely* must
>> vary for devices of course.
> 
> Ok so re-read the code and it is correct. The helper for compatibility with
> old API (so that i do not break nouveau upstream code) initialize those to
> the safe default ie:
> 
> range->default_flags = 0;
> range->pfn_flags_mask = -1;
> 
> Which means that in the above comment we are in the case where it is the
> individual entry within the pfn array that will determine if we fault or
> not.
> 
> Driver using the new API can either use this safe default or use the
> second case in the above comment and set default_flags to something
> else than 0.
> 
> Note that those default_flags are not set in the final result they are
> use to determine if we need to do a page fault. For instance if you set
> the write bit in the default flags then the pfns computed above will
> have the write bit set and when we compare with the CPU pte if the CPU
> pte do not have the write bit set then we will fault. What matter is
> that in this case the value within the pfns array is totaly pointless
> ie we do not care what it is, it will not affect the decission ie the
> decision is made by looking at the default flags.
> 
> Hope this clarify thing. You can look at the ODP patch to see how it
> is use:
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
> 

Hi Jerome,

I think you're talking about flags, but I'm talking about the mask. The 
above link doesn't appear to use the pfn_flags_mask, and the default_flags 
that it uses are still in the same lower 3 bits:

+static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
+	ODP_READ_BIT,	/* HMM_PFN_VALID */
+	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
+	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
+};

So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
mask, that is *also* runtime changeable. 

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 22:40             ` Jerome Glisse
@ 2019-03-28 22:43               ` John Hubbard
  2019-03-28 23:05                 ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 22:43 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 3:40 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
>> On 3/28/19 3:08 PM, Jerome Glisse wrote:
>>> On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
>>>> On 3/28/19 2:30 PM, Jerome Glisse wrote:
>>>>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>> [...]
>>>>
>>>>>>
>>>>>> If you insist on having this wrapper, I think it should have approximately 
>>>>>> this form:
>>>>>>
>>>>>> void hmm_mirror_mm_down_read(...)
>>>>>> {
>>>>>> 	WARN_ON(...)
>>>>>> 	down_read(...)
>>>>>> } 
>>>>>
>>>>> I do insist as it is useful and use by both RDMA and nouveau and the
>>>>> above would kill the intent. The intent is do not try to take the lock
>>>>> if the process is dying.
>>>>
>>>> Could you provide me a link to those examples so I can take a peek? I
>>>> am still convinced that this whole thing is a race condition at best.
>>>
>>> The race is fine and ok see:
>>>
>>> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
>>>
>>> which has been posted and i think i provided a link in the cover
>>> letter to that post. The same patch exist for nouveau i need to
>>> cleanup that tree and push it.
>>
>> Thanks for that link, and I apologize for not keeping up with that
>> other review thread.
>>
>> Looking it over, hmm_mirror_mm_down_read() is only used in one place.
>> So, what you really want there is not a down_read() wrapper, but rather,
>> something like
>>
>> 	hmm_sanity_check()
>>
>> , that ib_umem_odp_map_dma_pages() calls.
> 
> Why ? The device driver pattern is:
>     if (hmm_is_it_dying()) {
>         // handle when process die and abort the fault ie useless
>         // to call within HMM
>     }
>     down_read(mmap_sem);
> 
> This pattern is common within nouveau and RDMA and other device driver in
> the work. Hence why i am replacing it with just one helper. Also it has the
> added benefit that changes being discussed around the mmap sem will be easier
> to do as it avoid having to update each driver but instead it can be done
> just once for the HMM helpers.

Yes, and I'm saying that the pattern is broken. Because it's racy. :)

>>>>>>> +{
>>>>>>> +	struct mm_struct *mm;
>>>>>>> +
>>>>>>> +	/* Sanity check ... */
>>>>>>> +	if (!mirror || !mirror->hmm)
>>>>>>> +		return -EINVAL;
>>>>>>> +	/*
>>>>>>> +	 * Before trying to take the mmap_sem make sure the mm is still
>>>>>>> +	 * alive as device driver context might outlive the mm lifetime.
>>>>>>
>>>>>> Let's find another way, and a better place, to solve this problem.
>>>>>> Ref counting?
>>>>>
>>>>> This has nothing to do with refcount or use after free or anthing
>>>>> like that. It is just about checking wether we are about to do
>>>>> something pointless. If the process is dying then it is pointless
>>>>> to try to take the lock and it is pointless for the device driver
>>>>> to trigger handle_mm_fault().
>>>>
>>>> Well, what happens if you let such pointless code run anyway? 
>>>> Does everything still work? If yes, then we don't need this change.
>>>> If no, then we need a race-free version of this change.
>>>
>>> Yes everything work, nothing bad can happen from a race, it will just
>>> do useless work which never hurt anyone.
>>>
>>
>> OK, so let's either drop this patch, or if merge windows won't allow that,
>> then *eventually* drop this patch. And instead, put in a hmm_sanity_check()
>> that does the same checks.
> 
> RDMA depends on this, so does the nouveau patchset that convert to new API.
> So i do not see reason to drop this. They are user for this they are posted
> and i hope i explained properly the benefit.
> 
> It is a common pattern. Yes it only save couple lines of code but down the
> road i will also help for people working on the mmap_sem patchset.
> 

It *adds* a couple of lines that are misleading, because they look like they
make things safer, but they don't actually do so.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 22:43               ` John Hubbard
@ 2019-03-28 23:05                 ` Jerome Glisse
  2019-03-28 23:20                   ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 23:05 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 03:43:33PM -0700, John Hubbard wrote:
> On 3/28/19 3:40 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:08 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 2:30 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> >>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> >> [...]
> >>>>
> >>>>>>
> >>>>>> If you insist on having this wrapper, I think it should have approximately 
> >>>>>> this form:
> >>>>>>
> >>>>>> void hmm_mirror_mm_down_read(...)
> >>>>>> {
> >>>>>> 	WARN_ON(...)
> >>>>>> 	down_read(...)
> >>>>>> } 
> >>>>>
> >>>>> I do insist as it is useful and use by both RDMA and nouveau and the
> >>>>> above would kill the intent. The intent is do not try to take the lock
> >>>>> if the process is dying.
> >>>>
> >>>> Could you provide me a link to those examples so I can take a peek? I
> >>>> am still convinced that this whole thing is a race condition at best.
> >>>
> >>> The race is fine and ok see:
> >>>
> >>> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
> >>>
> >>> which has been posted and i think i provided a link in the cover
> >>> letter to that post. The same patch exist for nouveau i need to
> >>> cleanup that tree and push it.
> >>
> >> Thanks for that link, and I apologize for not keeping up with that
> >> other review thread.
> >>
> >> Looking it over, hmm_mirror_mm_down_read() is only used in one place.
> >> So, what you really want there is not a down_read() wrapper, but rather,
> >> something like
> >>
> >> 	hmm_sanity_check()
> >>
> >> , that ib_umem_odp_map_dma_pages() calls.
> > 
> > Why ? The device driver pattern is:
> >     if (hmm_is_it_dying()) {
> >         // handle when process die and abort the fault ie useless
> >         // to call within HMM
> >     }
> >     down_read(mmap_sem);
> > 
> > This pattern is common within nouveau and RDMA and other device driver in
> > the work. Hence why i am replacing it with just one helper. Also it has the
> > added benefit that changes being discussed around the mmap sem will be easier
> > to do as it avoid having to update each driver but instead it can be done
> > just once for the HMM helpers.
> 
> Yes, and I'm saying that the pattern is broken. Because it's racy. :)

And I explained why it is fine: it is just an optimization. In most cases it
takes time to tear down a process, and the device page fault handler can be
triggered while that happens, so instead of letting it pile more work on we
can detect that, even if the check is racy. It is just about avoiding useless
work. There is nothing about correctness here. It does not need to identify a
dying process with 100% accuracy. The fact that the process is dying will be
identified race-free later on; it just means that in the meantime we are
doing useless work, potentially tons of useless work.

There is hardware that can storm the page fault handler, and we can end up
with hundreds of page faults queued up against a process that might be
dying. It is a big waste to go over all those faults and do work that will
be thrown on the floor later on.

> 
> >>>>>>> +{
> >>>>>>> +	struct mm_struct *mm;
> >>>>>>> +
> >>>>>>> +	/* Sanity check ... */
> >>>>>>> +	if (!mirror || !mirror->hmm)
> >>>>>>> +		return -EINVAL;
> >>>>>>> +	/*
> >>>>>>> +	 * Before trying to take the mmap_sem make sure the mm is still
> >>>>>>> +	 * alive as device driver context might outlive the mm lifetime.
> >>>>>>
> >>>>>> Let's find another way, and a better place, to solve this problem.
> >>>>>> Ref counting?
> >>>>>
> >>>>> This has nothing to do with refcount or use after free or anthing
> >>>>> like that. It is just about checking wether we are about to do
> >>>>> something pointless. If the process is dying then it is pointless
> >>>>> to try to take the lock and it is pointless for the device driver
> >>>>> to trigger handle_mm_fault().
> >>>>
> >>>> Well, what happens if you let such pointless code run anyway? 
> >>>> Does everything still work? If yes, then we don't need this change.
> >>>> If no, then we need a race-free version of this change.
> >>>
> >>> Yes everything work, nothing bad can happen from a race, it will just
> >>> do useless work which never hurt anyone.
> >>>
> >>
> >> OK, so let's either drop this patch, or if merge windows won't allow that,
> >> then *eventually* drop this patch. And instead, put in a hmm_sanity_check()
> >> that does the same checks.
> > 
> > RDMA depends on this, so does the nouveau patchset that convert to new API.
> > So i do not see reason to drop this. They are user for this they are posted
> > and i hope i explained properly the benefit.
> > 
> > It is a common pattern. Yes it only save couple lines of code but down the
> > road i will also help for people working on the mmap_sem patchset.
> > 
> 
> It *adds* a couple of lines that are misleading, because they look like they
> make things safer, but they don't actually do so.

It is not about safety, sorry if that confused you, but there is nothing about
safety here; I can add a big fat comment that explains that there is no safety
here. The intention is to allow a page fault handler that potentially has
hundreds of page faults queued up to abort as soon as it sees that it is
pointless to keep faulting on a dying process.

Again, if we race it is _fine_, nothing bad will happen; we are just doing
useless work that is going to be thrown on the floor, and we are just slowing
down the process tear down.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 23:05                 ` Jerome Glisse
@ 2019-03-28 23:20                   ` John Hubbard
  2019-03-28 23:24                     ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 23:20 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 4:05 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 03:43:33PM -0700, John Hubbard wrote:
>> On 3/28/19 3:40 PM, Jerome Glisse wrote:
>>> On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
>>>> On 3/28/19 3:08 PM, Jerome Glisse wrote:
>>>>> On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
>>>>>> On 3/28/19 2:30 PM, Jerome Glisse wrote:
>>>>>>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
>>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>> [...]
>>>> OK, so let's either drop this patch, or if merge windows won't allow that,
>>>> then *eventually* drop this patch. And instead, put in a hmm_sanity_check()
>>>> that does the same checks.
>>>
>>> RDMA depends on this, so does the nouveau patchset that convert to new API.
>>> So i do not see reason to drop this. They are user for this they are posted
>>> and i hope i explained properly the benefit.
>>>
>>> It is a common pattern. Yes it only save couple lines of code but down the
>>> road i will also help for people working on the mmap_sem patchset.
>>>
>>
>> It *adds* a couple of lines that are misleading, because they look like they
>> make things safer, but they don't actually do so.
> 
> It is not about safety, sorry if it confused you but there is nothing about
> safety here, i can add a big fat comment that explains that there is no safety
> here. The intention is to allow the page fault handler that potential have
> hundred of page fault queue up to abort as soon as it sees that it is pointless
> to keep faulting on a dying process.
> 
> Again if we race it is _fine_ nothing bad will happen, we are just doing use-
> less work that gonna be thrown on the floor and we are just slowing down the
> process tear down.
> 

In addition to a comment, how about naming this thing to indicate the above 
intention?  I have a really hard time with this odd down_read() wrapper, which
allows code to proceed without really getting a lock. It's just too wrong-looking.
If it were instead named:

	hmm_is_exiting()

and had a comment about why racy is OK, then I'd be a lot happier. :)


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-28 22:40           ` John Hubbard
@ 2019-03-28 23:21             ` Jerome Glisse
  2019-03-28 23:28               ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 23:21 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>> From: Jérôme Glisse <jglisse@redhat.com>
> >>>>>
> >>>>> The HMM mirror API can be use in two fashions. The first one where the HMM
> >>>>> user coalesce multiple page faults into one request and set flags per pfns
> >>>>> for of those faults. The second one where the HMM user want to pre-fault a
> >>>>> range with specific flags. For the latter one it is a waste to have the user
> >>>>> pre-fill the pfn arrays with a default flags value.
> >>>>>
> >>>>> This patch adds a default flags value allowing user to set them for a range
> >>>>> without having to pre-fill the pfn array.
> >>>>>
> >>>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> >>>>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> >>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>>>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>>>> Cc: Dan Williams <dan.j.williams@intel.com>
> >>>>> ---
> >>>>>  include/linux/hmm.h |  7 +++++++
> >>>>>  mm/hmm.c            | 12 ++++++++++++
> >>>>>  2 files changed, 19 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> >>>>> index 79671036cb5f..13bc2c72f791 100644
> >>>>> --- a/include/linux/hmm.h
> >>>>> +++ b/include/linux/hmm.h
> >>>>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
> >>>>>   * @pfns: array of pfns (big enough for the range)
> >>>>>   * @flags: pfn flags to match device driver page table
> >>>>>   * @values: pfn value for some special case (none, special, error, ...)
> >>>>> + * @default_flags: default flags for the range (write, read, ...)
> >>>>> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
> >>>>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> >>>>>   * @valid: pfns array did not change since it has been fill by an HMM function
> >>>>>   */
> >>>>> @@ -177,6 +179,8 @@ struct hmm_range {
> >>>>>  	uint64_t		*pfns;
> >>>>>  	const uint64_t		*flags;
> >>>>>  	const uint64_t		*values;
> >>>>> +	uint64_t		default_flags;
> >>>>> +	uint64_t		pfn_flags_mask;
> >>>>>  	uint8_t			pfn_shift;
> >>>>>  	bool			valid;
> >>>>>  };
> >>>>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> >>>>>  {
> >>>>>  	long ret;
> >>>>>  
> >>>>> +	range->default_flags = 0;
> >>>>> +	range->pfn_flags_mask = -1UL;
> >>>>
> >>>> Hi Jerome,
> >>>>
> >>>> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
> >>>> definitely does not need to be a run time value. And we want some assurance that
> >>>> the mask is 
> >>>> 	a) large enough for the flags, and
> >>>> 	b) small enough to avoid overrunning the pfns field.
> >>>>
> >>>> Those are less certain with a run-time struct field, and more obviously correct with
> >>>> something like, approximately:
> >>>>
> >>>>  	#define PFN_FLAGS_MASK 0xFFFF
> >>>>
> >>>> or something.
> >>>>
> >>>> In other words, this is more flexibility than we need--just a touch too much,
> >>>> IMHO.
> >>>
> >>> This mirror the fact that flags are provided as an array and some devices use
> >>> the top bits for flags (read, write, ...). So here it is the safe default to
> >>> set it to -1. If the caller want to leverage this optimization it can override
> >>> the default_flags value.
> >>>
> >>
> >> Optimization? OK, now I'm a bit lost. Maybe this is another place where I could
> >> use a peek at the calling code. The only flags I've seen so far use the bottom
> >> 3 bits and that's it. 
> >>
> >> Maybe comments here?
> >>
> >>>>
> >>>>> +
> >>>>>  	ret = hmm_range_register(range, range->vma->vm_mm,
> >>>>>  				 range->start, range->end);
> >>>>>  	if (ret)
> >>>>> diff --git a/mm/hmm.c b/mm/hmm.c
> >>>>> index fa9498eeb9b6..4fe88a196d17 100644
> >>>>> --- a/mm/hmm.c
> >>>>> +++ b/mm/hmm.c
> >>>>> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
> >>>>>  	if (!hmm_vma_walk->fault)
> >>>>>  		return;
> >>>>>  
> >>>>> +	/*
> >>>>> +	 * So we not only consider the individual per page request we also
> >>>>> +	 * consider the default flags requested for the range. The API can
> >>>>> +	 * be use in 2 fashions. The first one where the HMM user coalesce
> >>>>> +	 * multiple page fault into one request and set flags per pfns for
> >>>>> +	 * of those faults. The second one where the HMM user want to pre-
> >>>>> +	 * fault a range with specific flags. For the latter one it is a
> >>>>> +	 * waste to have the user pre-fill the pfn arrays with a default
> >>>>> +	 * flags value.
> >>>>> +	 */
> >>>>> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
> >>>>
> >>>> Need to verify that the mask isn't too large or too small.
> >>>
> >>> I need to check agin but default flag is anded somewhere to limit
> >>> the bit to the one we expect.
> >>
> >> Right, but in general, the *mask* could be wrong. It would be nice to have
> >> an assert, and/or a comment, or something to verify the mask is proper.
> >>
> >> Really, a hardcoded mask is simple and correct--unless it *definitely* must
> >> vary for devices of course.
> > 
> > Ok so re-read the code and it is correct. The helper for compatibility with
> > old API (so that i do not break nouveau upstream code) initialize those to
> > the safe default ie:
> > 
> > range->default_flags = 0;
> > range->pfn_flags_mask = -1;
> > 
> > Which means that in the above comment we are in the case where it is the
> > individual entry within the pfn array that will determine if we fault or
> > not.
> > 
> > Driver using the new API can either use this safe default or use the
> > second case in the above comment and set default_flags to something
> > else than 0.
> > 
> > Note that those default_flags are not set in the final result they are
> > use to determine if we need to do a page fault. For instance if you set
> > the write bit in the default flags then the pfns computed above will
> > have the write bit set and when we compare with the CPU pte if the CPU
> > pte do not have the write bit set then we will fault. What matter is
> > that in this case the value within the pfns array is totaly pointless
> > ie we do not care what it is, it will not affect the decission ie the
> > decision is made by looking at the default flags.
> > 
> > Hope this clarify thing. You can look at the ODP patch to see how it
> > is use:
> > 
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
> > 
> 
> Hi Jerome,
> 
> I think you're talking about flags, but I'm talking about the mask. The 
> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
> that it uses are still in the same lower 3 bits:
> 
> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
> +};
> 
> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
> mask, that is *also* runtime changeable. 

So the pfn array is using a device driver specific format, and we have
no idea, nor do we need to know, where the valid, write, ... bits are in
that format. Those bits can be in the top bits, like 63, 62, 61, ...
we do not care. There are devices with the flag bits at the top, and for
those you need a mask that allows you to mask out those bits or not,
depending on what the user wants to do.

The mask here is against an _unknown_ (from the HMM POV) format. So we
cannot presume where the bits will be, and thus we cannot presume what a
proper mask is.

So that is why a full unsigned long mask is used here.

Maybe an example will help. Let's say the device flags are:
    VALID (1 << 63)
    WRITE (1 << 62)

Now let's say the device wants to fault a range with at least read
permission; it sets:
    range->default_flags = (1 << 63);
    range->pfn_flags_mask = 0;

This will fault in all pages in the range with at least read
permission.

Now let's say it wants to do the same, except for one page in the range
for which it wants write. Now the driver sets:
    range->default_flags = (1 << 63);
    range->pfn_flags_mask = (1 << 62);
    range->pfns[index_of_write] = (1 << 62);

With this, HMM will fault in all pages with at least read (i.e. valid),
and for the address range->start + (index_of_write << PAGE_SHIFT) it
will fault with write permission, i.e. if the CPU pte does not have the
write bit set then handle_mm_fault() will be called asking for write
permission.


Note that in the above, HMM will populate the pfns array with write
permission for any entry that has write permission within the CPU
pte, i.e. default_flags and pfn_flags_mask are only the minimum
requirement, but HMM always returns all the flags that are set in the
CPU pte.


Now let's say you are an "old" driver like upstream nouveau; then you
are setting each individual entry within range->pfns with the exact
flags you want for each address, hence what you want here is:
    range->default_flags = 0;
    range->pfn_flags_mask = -1UL;

So what we do is (for each entry):
    (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
and we end up with the flags that were set by the driver for each of
the individual range->pfns entries.
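
Put as code, a minimal sketch of those three configurations (the
MY_DEV_* flag names are made up for the example; the struct hmm_range
fields are the ones from this series):

	/* Hypothetical device flag layout from the example above. */
	#define MY_DEV_PFN_VALID	(1ULL << 63)
	#define MY_DEV_PFN_WRITE	(1ULL << 62)

	/* Pre-fault the whole range with at least read permission. */
	static void my_dev_prefault_read(struct hmm_range *range)
	{
		range->default_flags = MY_DEV_PFN_VALID;
		range->pfn_flags_mask = 0;
	}

	/* Same, but one entry additionally requests write permission. */
	static void my_dev_prefault_one_write(struct hmm_range *range,
					      unsigned long index_of_write)
	{
		range->default_flags = MY_DEV_PFN_VALID;
		range->pfn_flags_mask = MY_DEV_PFN_WRITE;
		range->pfns[index_of_write] = MY_DEV_PFN_WRITE;
	}

	/* "Old" style: every pfns[] entry carries its own exact flags. */
	static void my_dev_per_entry_flags(struct hmm_range *range)
	{
		range->default_flags = 0;
		range->pfn_flags_mask = -1UL;
	}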


Does this help ?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 23:20                   ` John Hubbard
@ 2019-03-28 23:24                     ` Jerome Glisse
  2019-03-28 23:34                       ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 23:24 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 04:20:37PM -0700, John Hubbard wrote:
> On 3/28/19 4:05 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:43:33PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:40 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 3:08 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
> >>>>>> On 3/28/19 2:30 PM, Jerome Glisse wrote:
> >>>>>>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
> >>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> >>>> [...]
> >>>> OK, so let's either drop this patch, or if merge windows won't allow that,
> >>>> then *eventually* drop this patch. And instead, put in a hmm_sanity_check()
> >>>> that does the same checks.
> >>>
> >>> RDMA depends on this, so does the nouveau patchset that convert to new API.
> >>> So i do not see reason to drop this. They are user for this they are posted
> >>> and i hope i explained properly the benefit.
> >>>
> >>> It is a common pattern. Yes it only save couple lines of code but down the
> >>> road i will also help for people working on the mmap_sem patchset.
> >>>
> >>
> >> It *adds* a couple of lines that are misleading, because they look like they
> >> make things safer, but they don't actually do so.
> > 
> > It is not about safety, sorry if it confused you but there is nothing about
> > safety here, i can add a big fat comment that explains that there is no safety
> > here. The intention is to allow the page fault handler that potential have
> > hundred of page fault queue up to abort as soon as it sees that it is pointless
> > to keep faulting on a dying process.
> > 
> > Again if we race it is _fine_ nothing bad will happen, we are just doing use-
> > less work that gonna be thrown on the floor and we are just slowing down the
> > process tear down.
> > 
> 
> In addition to a comment, how about naming this thing to indicate the above 
> intention?  I have a really hard time with this odd down_read() wrapper, which
> allows code to proceed without really getting a lock. It's just too wrong-looking.
> If it were instead named:
> 
> 	hmm_is_exiting()

What about: hmm_lock_mmap_if_alive() ?


> 
> and had a comment about why racy is OK, then I'd be a lot happier. :)

Will add fat comment.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-28 23:21             ` Jerome Glisse
@ 2019-03-28 23:28               ` John Hubbard
  2019-03-28 16:42                 ` Ira Weiny
  2019-03-28 23:43                 ` Jerome Glisse
  0 siblings, 2 replies; 69+ messages in thread
From: John Hubbard @ 2019-03-28 23:28 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 4:21 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
[...]
>> Hi Jerome,
>>
>> I think you're talking about flags, but I'm talking about the mask. The 
>> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
>> that it uses are still in the same lower 3 bits:
>>
>> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
>> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
>> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
>> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
>> +};
>>
>> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
>> mask, that is *also* runtime changeable. 
> 
> So the pfn array is using a device driver specific format and we have
> no idea nor do we need to know where the valid, write, ... bit are in
> that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> we do not care. They are device with bit at the top and for those you
> need a mask that allows you to mask out those bits or not depending on
> what the user want to do.
> 
> The mask here is against an _unknown_ (from HMM POV) format. So we can
> not presume where the bits will be and thus we can not presume what a
> proper mask is.
> 
> So that's why a full unsigned long mask is use here.
> 
> Maybe an example will help let say the device flag are:
>     VALID (1 << 63)
>     WRITE (1 << 62)
> 
> Now let say that device wants to fault with at least read a range
> it does set:
>     range->default_flags = (1 << 63)
>     range->pfn_flags_mask = 0;
> 
> This will fill fault all page in the range with at least read
> permission.
> 
> Now let say it wants to do the same except for one page in the range
> for which its want to have write. Now driver set:
>     range->default_flags = (1 << 63);
>     range->pfn_flags_mask = (1 << 62);
>     range->pfns[index_of_write] = (1 << 62);
> 
> With this HMM will fault in all page with at least read (ie valid)
> and for the address: range->start + index_of_write << PAGE_SHIFT it
> will fault with write permission ie if the CPU pte does not have
> write permission set then handle_mm_fault() will be call asking for
> write permission.
> 
> 
> Note that in the above HMM will populate the pfns array with write
> permission for any entry that have write permission within the CPU
> pte ie the default_flags and pfn_flags_mask is only the minimun
> requirement but HMM always returns all the flag that are set in the
> CPU pte.
> 
> 
> Now let say you are an "old" driver like nouveau upstream, then it
> means that you are setting each individual entry within range->pfns
> with the exact flags you want for each address hence here what you
> want is:
>     range->default_flags = 0;
>     range->pfn_flags_mask = -1UL;
> 
> So that what we do is (for each entry):
>     (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> and we end up with the flags that were set by the driver for each of
> the individual range->pfns entries.
> 
> 
> Does this help ?
> 

Yes, the key point for me was that this is an entirely device driver specific
format. OK. But then we have HMM setting it. So a comment to the effect that
this is device-specific might be nice, but I'll leave that up to you whether
it is useful.

Either way, you can add:

	Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2
  2019-03-28 23:24                     ` Jerome Glisse
@ 2019-03-28 23:34                       ` John Hubbard
  2019-03-28 18:44                         ` Ira Weiny
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-28 23:34 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 4:24 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 04:20:37PM -0700, John Hubbard wrote:
>> On 3/28/19 4:05 PM, Jerome Glisse wrote:
>>> On Thu, Mar 28, 2019 at 03:43:33PM -0700, John Hubbard wrote:
>>>> On 3/28/19 3:40 PM, Jerome Glisse wrote:
>>>>> On Thu, Mar 28, 2019 at 03:25:39PM -0700, John Hubbard wrote:
>>>>>> On 3/28/19 3:08 PM, Jerome Glisse wrote:
>>>>>>> On Thu, Mar 28, 2019 at 02:41:02PM -0700, John Hubbard wrote:
>>>>>>>> On 3/28/19 2:30 PM, Jerome Glisse wrote:
>>>>>>>>> On Thu, Mar 28, 2019 at 01:54:01PM -0700, John Hubbard wrote:
>>>>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>>>> [...]
>>>>>> OK, so let's either drop this patch, or if merge windows won't allow that,
>>>>>> then *eventually* drop this patch. And instead, put in a hmm_sanity_check()
>>>>>> that does the same checks.
>>>>>
>>>>> RDMA depends on this, so does the nouveau patchset that convert to new API.
>>>>> So i do not see reason to drop this. They are user for this they are posted
>>>>> and i hope i explained properly the benefit.
>>>>>
>>>>> It is a common pattern. Yes it only save couple lines of code but down the
>>>>> road i will also help for people working on the mmap_sem patchset.
>>>>>
>>>>
>>>> It *adds* a couple of lines that are misleading, because they look like they
>>>> make things safer, but they don't actually do so.
>>>
>>> It is not about safety, sorry if it confused you but there is nothing about
>>> safety here, i can add a big fat comment that explains that there is no safety
>>> here. The intention is to allow the page fault handler that potential have
>>> hundred of page fault queue up to abort as soon as it sees that it is pointless
>>> to keep faulting on a dying process.
>>>
>>> Again if we race it is _fine_ nothing bad will happen, we are just doing use-
>>> less work that gonna be thrown on the floor and we are just slowing down the
>>> process tear down.
>>>
>>
>> In addition to a comment, how about naming this thing to indicate the above 
>> intention?  I have a really hard time with this odd down_read() wrapper, which
>> allows code to proceed without really getting a lock. It's just too wrong-looking.
>> If it were instead named:
>>
>> 	hmm_is_exiting()
> 
> What about: hmm_lock_mmap_if_alive() ?
> 

That's definitely better, but I want to vote for just doing a check, not 
taking any locks.

I'm not super concerned about the exact name, but I really want a routine that
just checks (and optionally asserts, via WARN or BUG), and that's *all*. Then
drivers can scatter that around like pixie dust as they see fit. Maybe right before
taking a lock, maybe in other places. Decoupled from locking.
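
A minimal sketch of such a check-only helper (the name is hypothetical; it
assumes the mirror->hmm pointer and the hmm->dead flag from this series):

    static inline bool hmm_is_exiting(struct hmm_mirror *mirror)
    {
        struct hmm *hmm = READ_ONCE(mirror->hmm);

        /* Advisory only: no locks taken, just report whether the mm
         * this mirror is bound to is being torn down.
         */
        return !hmm || hmm->dead || hmm->mm == NULL;
    }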

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-28 23:28               ` John Hubbard
  2019-03-28 16:42                 ` Ira Weiny
@ 2019-03-28 23:43                 ` Jerome Glisse
  1 sibling, 0 replies; 69+ messages in thread
From: Jerome Glisse @ 2019-03-28 23:43 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> [...]
> >> Hi Jerome,
> >>
> >> I think you're talking about flags, but I'm talking about the mask. The 
> >> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
> >> that it uses are still in the same lower 3 bits:
> >>
> >> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> >> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
> >> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
> >> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
> >> +};
> >>
> >> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
> >> mask, that is *also* runtime changeable. 
> > 
> > So the pfn array is using a device driver specific format and we have
> > no idea nor do we need to know where the valid, write, ... bit are in
> > that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> > we do not care. They are device with bit at the top and for those you
> > need a mask that allows you to mask out those bits or not depending on
> > what the user want to do.
> > 
> > The mask here is against an _unknown_ (from HMM POV) format. So we can
> > not presume where the bits will be and thus we can not presume what a
> > proper mask is.
> > 
> > So that's why a full unsigned long mask is use here.
> > 
> > Maybe an example will help let say the device flag are:
> >     VALID (1 << 63)
> >     WRITE (1 << 62)
> > 
> > Now let say that device wants to fault with at least read a range
> > it does set:
> >     range->default_flags = (1 << 63)
> >     range->pfn_flags_mask = 0;
> > 
> > This will fill fault all page in the range with at least read
> > permission.
> > 
> > Now let say it wants to do the same except for one page in the range
> > for which its want to have write. Now driver set:
> >     range->default_flags = (1 << 63);
> >     range->pfn_flags_mask = (1 << 62);
> >     range->pfns[index_of_write] = (1 << 62);
> > 
> > With this HMM will fault in all page with at least read (ie valid)
> > and for the address: range->start + index_of_write << PAGE_SHIFT it
> > will fault with write permission ie if the CPU pte does not have
> > write permission set then handle_mm_fault() will be call asking for
> > write permission.
> > 
> > 
> > Note that in the above HMM will populate the pfns array with write
> > permission for any entry that have write permission within the CPU
> > pte ie the default_flags and pfn_flags_mask is only the minimun
> > requirement but HMM always returns all the flag that are set in the
> > CPU pte.
> > 
> > 
> > Now let say you are an "old" driver like nouveau upstream, then it
> > means that you are setting each individual entry within range->pfns
> > with the exact flags you want for each address hence here what you
> > want is:
> >     range->default_flags = 0;
> >     range->pfn_flags_mask = -1UL;
> > 
> > So that what we do is (for each entry):
> >     (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> > and we end up with the flags that were set by the driver for each of
> > the individual range->pfns entries.
> > 
> > 
> > Does this help ?
> > 
> 
> Yes, the key point for me was that this is an entirely device driver specific
> format. OK. But then we have HMM setting it. So a comment to the effect that
> this is device-specific might be nice, but I'll leave that up to you whether
> it is useful.

The code you were pointing at is temporary, ie once this gets merged that code
will get removed in release N+2: merge the code in N, update nouveau in N+1 and
remove the temporary code in N+2.

When updating the HMM API it is easier to stage API updates over releases like
that so there is no need to synchronize across multiple trees (mm, drm, rdma, ...)

> Either way, you can add:
> 
> 	Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-28 21:21         ` Jerome Glisse
@ 2019-03-29  0:39           ` John Hubbard
  2019-03-28 16:57             ` Ira Weiny
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-29  0:39 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Ira Weiny, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 2:21 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
>> On 3/28/19 12:11 PM, Jerome Glisse wrote:
>>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
>>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
>>>>> From: Jérôme Glisse <jglisse@redhat.com>
[...]
>>>>> @@ -67,14 +78,9 @@ struct hmm {
>>>>>   */
>>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
>>>>>  {
>>>>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
>>>>> +	struct hmm *hmm = mm_get_hmm(mm);
>>>>
>>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
>>>
>>> The thing is that you want only one hmm struct per process and thus
>>> if there is already one and it is not being destroy then you want to
>>> reuse it.
>>>
>>> Also this is all internal to HMM code and so it should not confuse
>>> anyone.
>>>
>>
>> Well, it has repeatedly come up, and I'd claim that it is quite 
>> counter-intuitive. So if there is an easy way to make this internal 
>> HMM code clearer or better named, I would really love that to happen.
>>
>> And we shouldn't ever dismiss feedback based on "this is just internal
>> xxx subsystem code, no need for it to be as clear as other parts of the
>> kernel", right?
> 
> Yes but i have not seen any better alternative that present code. If
> there is please submit patch.
> 

Ira, do you have any patch you're working on, or a more detailed suggestion there?
If not, then I might (later, as it's not urgent) propose a small cleanup patch 
I had in mind for the hmm_register code. But I don't want to duplicate effort 
if you're already thinking about it.


thanks,
-- 
John Hubbard
NVIDIA



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2
  2019-03-28 16:12   ` Ira Weiny
@ 2019-03-29  0:56     ` Jerome Glisse
  2019-03-28 18:49       ` Ira Weiny
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  0:56 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-mm, linux-kernel, Andrew Morton, John Hubbard,
	Dan Williams, Dan Carpenter, Matthew Wilcox

On Thu, Mar 28, 2019 at 09:12:21AM -0700, Ira Weiny wrote:
> On Mon, Mar 25, 2019 at 10:40:06AM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > A common use case for HMM mirror is user trying to mirror a range
> > and before they could program the hardware it get invalidated by
> > some core mm event. Instead of having user re-try right away to
> > mirror the range provide a completion mechanism for them to wait
> > for any active invalidation affecting the range.
> > 
> > This also changes how hmm_range_snapshot() and hmm_range_fault()
> > works by not relying on vma so that we can drop the mmap_sem
> > when waiting and lookup the vma again on retry.
> > 
> > Changes since v1:
> >     - squashed: Dan Carpenter: potential deadlock in nonblocking code
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Dan Carpenter <dan.carpenter@oracle.com>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > ---
> >  include/linux/hmm.h | 208 ++++++++++++++---
> >  mm/hmm.c            | 528 +++++++++++++++++++++-----------------------
> >  2 files changed, 428 insertions(+), 308 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index e9afd23c2eac..79671036cb5f 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -77,8 +77,34 @@
> >  #include <linux/migrate.h>
> >  #include <linux/memremap.h>
> >  #include <linux/completion.h>
> > +#include <linux/mmu_notifier.h>
> >  
> > -struct hmm;
> > +
> > +/*
> > + * struct hmm - HMM per mm struct
> > + *
> > + * @mm: mm struct this HMM struct is bound to
> > + * @lock: lock protecting ranges list
> > + * @ranges: list of range being snapshotted
> > + * @mirrors: list of mirrors for this mm
> > + * @mmu_notifier: mmu notifier to track updates to CPU page table
> > + * @mirrors_sem: read/write semaphore protecting the mirrors list
> > + * @wq: wait queue for user waiting on a range invalidation
> > + * @notifiers: count of active mmu notifiers
> > + * @dead: is the mm dead ?
> > + */
> > +struct hmm {
> > +	struct mm_struct	*mm;
> > +	struct kref		kref;
> > +	struct mutex		lock;
> > +	struct list_head	ranges;
> > +	struct list_head	mirrors;
> > +	struct mmu_notifier	mmu_notifier;
> > +	struct rw_semaphore	mirrors_sem;
> > +	wait_queue_head_t	wq;
> > +	long			notifiers;
> > +	bool			dead;
> > +};
> >  
> >  /*
> >   * hmm_pfn_flag_e - HMM flag enums
> > @@ -155,6 +181,38 @@ struct hmm_range {
> >  	bool			valid;
> >  };
> >  
> > +/*
> > + * hmm_range_wait_until_valid() - wait for range to be valid
> > + * @range: range affected by invalidation to wait on
> > + * @timeout: time out for wait in ms (ie abort wait after that period of time)
> > + * Returns: true if the range is valid, false otherwise.
> > + */
> > +static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> > +					      unsigned long timeout)
> > +{
> > +	/* Check if mm is dead ? */
> > +	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> > +		range->valid = false;
> > +		return false;
> > +	}
> > +	if (range->valid)
> > +		return true;
> > +	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> > +			   msecs_to_jiffies(timeout));
> > +	/* Return current valid status just in case we get lucky */
> > +	return range->valid;
> > +}
> > +
> > +/*
> > + * hmm_range_valid() - test if a range is valid or not
> > + * @range: range
> > + * Returns: true if the range is valid, false otherwise.
> > + */
> > +static inline bool hmm_range_valid(struct hmm_range *range)
> > +{
> > +	return range->valid;
> > +}
> > +
> >  /*
> >   * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
> >   * @range: range use to decode HMM pfn value
> > @@ -357,51 +415,133 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
> >  
> >  
> >  /*
> > - * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
> > - * driver lock that serializes device page table updates, then call
> > - * hmm_vma_range_done(), to check if the snapshot is still valid. The same
> > - * device driver page table update lock must also be used in the
> > - * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
> > - * table invalidation serializes on it.
> > + * To snapshot the CPU page table you first have to call hmm_range_register()
> > + * to register the range. If hmm_range_register() return an error then some-
> > + * thing is horribly wrong and you should fail loudly. If it returned true then
> > + * you can wait for the range to be stable with hmm_range_wait_until_valid()
> > + * function, a range is valid when there are no concurrent changes to the CPU
> > + * page table for the range.
> > + *
> > + * Once the range is valid you can call hmm_range_snapshot() if that returns
> > + * without error then you can take your device page table lock (the same lock
> > + * you use in the HMM mirror sync_cpu_device_pagetables() callback). After
> > + * taking that lock you have to check the range validity, if it is still valid
> > + * (ie hmm_range_valid() returns true) then you can program the device page
> > + * table, otherwise you have to start again. Pseudo code:
> > + *
> > + *      mydevice_prefault(mydevice, mm, start, end)
> > + *      {
> > + *          struct hmm_range range;
> > + *          ...
> >   *
> > - * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
> > - * hmm_range_snapshot() WITHOUT ERROR !
> > + *          ret = hmm_range_register(&range, mm, start, end);
> > + *          if (ret)
> > + *              return ret;
> >   *
> > - * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
> > - */
> > -long hmm_range_snapshot(struct hmm_range *range);
> > -bool hmm_vma_range_done(struct hmm_range *range);
> > -
> > -
> > -/*
> > - * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
> > - * not migrate any device memory back to system memory. The HMM pfn array will
> > - * be updated with the fault result and current snapshot of the CPU page table
> > - * for the range.
> > + *          down_read(mm->mmap_sem);
> > + *      again:
> > + *
> > + *          if (!hmm_range_wait_until_valid(&range, TIMEOUT)) {
> > + *              up_read(&mm->mmap_sem);
> > + *              hmm_range_unregister(range);
> > + *              // Handle time out, either sleep or retry or something else
> > + *              ...
> > + *              return -ESOMETHING; || goto again;
> > + *          }
> > + *
> > + *          ret = hmm_range_snapshot(&range); or hmm_range_fault(&range);
> > + *          if (ret == -EAGAIN) {
> > + *              down_read(mm->mmap_sem);
> > + *              goto again;
> > + *          } else if (ret == -EBUSY) {
> > + *              goto again;
> > + *          }
> > + *
> > + *          up_read(&mm->mmap_sem);
> > + *          if (ret) {
> > + *              hmm_range_unregister(range);
> > + *              return ret;
> > + *          }
> > + *
> > + *          // It might not have snap-shoted the whole range but only the first
> > + *          // npages, the return values is the number of valid pages from the
> > + *          // start of the range.
> > + *          npages = ret;
> >   *
> > - * The mmap_sem must be taken in read mode before entering and it might be
> > - * dropped by the function if the block argument is false. In that case, the
> > - * function returns -EAGAIN.
> > + *          ...
> >   *
> > - * Return value does not reflect if the fault was successful for every single
> > - * address or not. Therefore, the caller must to inspect the HMM pfn array to
> > - * determine fault status for each address.
> > + *          mydevice_page_table_lock(mydevice);
> > + *          if (!hmm_range_valid(range)) {
> > + *              mydevice_page_table_unlock(mydevice);
> > + *              goto again;
> > + *          }
> >   *
> > - * Trying to fault inside an invalid vma will result in -EINVAL.
> > + *          mydevice_populate_page_table(mydevice, range, npages);
> > + *          ...
> > + *          mydevice_take_page_table_unlock(mydevice);
> > + *          hmm_range_unregister(range);
> >   *
> > - * See the function description in mm/hmm.c for further documentation.
> > + *          return 0;
> > + *      }
> > + *
> > + * The same scheme apply to hmm_range_fault() (ie replace hmm_range_snapshot()
> > + * with hmm_range_fault() in above pseudo code).
> > + *
> > + * YOU MUST CALL hmm_range_unregister() ONCE AND ONLY ONCE EACH TIME YOU CALL
> > + * hmm_range_register() AND hmm_range_register() RETURNED TRUE ! IF YOU DO NOT
> > + * FOLLOW THIS RULE MEMORY CORRUPTION WILL ENSUE !
> >   */
> > +int hmm_range_register(struct hmm_range *range,
> > +		       struct mm_struct *mm,
> > +		       unsigned long start,
> > +		       unsigned long end);
> > +void hmm_range_unregister(struct hmm_range *range);
> 
> The above comment is great!  But I think you also need to update
> Documentation/vm/hmm.rst:hmm_range_snapshot() to show the use of
> hmm_range_[un]register()
> 
> > +long hmm_range_snapshot(struct hmm_range *range);
> >  long hmm_range_fault(struct hmm_range *range, bool block);
> >  
> > +/*
> > + * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> > + *
> > + * When waiting for mmu notifiers we need some kind of time out otherwise we
> > + * could potentialy wait for ever, 1000ms ie 1s sounds like a long time to
> > + * wait already.
> > + */
> > +#define HMM_RANGE_DEFAULT_TIMEOUT 1000
> > +
> >  /* This is a temporary helper to avoid merge conflict between trees. */
> > +static inline bool hmm_vma_range_done(struct hmm_range *range)
> > +{
> > +	bool ret = hmm_range_valid(range);
> > +
> > +	hmm_range_unregister(range);
> > +	return ret;
> > +}
> > +
> >  static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> >  {
> > -	long ret = hmm_range_fault(range, block);
> > -	if (ret == -EBUSY)
> > -		ret = -EAGAIN;
> > -	else if (ret == -EAGAIN)
> > -		ret = -EBUSY;
> > -	return ret < 0 ? ret : 0;
> > +	long ret;
> > +
> > +	ret = hmm_range_register(range, range->vma->vm_mm,
> > +				 range->start, range->end);
> > +	if (ret)
> > +		return (int)ret;
> > +
> > +	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
> > +		up_read(&range->vma->vm_mm->mmap_sem);
> > +		return -EAGAIN;
> > +	}
> > +
> > +	ret = hmm_range_fault(range, block);
> > +	if (ret <= 0) {
> > +		if (ret == -EBUSY || !ret) {
> > +			up_read(&range->vma->vm_mm->mmap_sem);
> > +			ret = -EBUSY;
> > +		} else if (ret == -EAGAIN)
> > +			ret = -EBUSY;
> > +		hmm_range_unregister(range);
> > +		return ret;
> > +	}
> > +	return 0;
> 
> Is hmm_vma_fault() also temporary to keep the nouveau driver working?  It looks
> like it to me.
> 
> This and hmm_vma_range_done() above are part of the old interface which is in
> the Documentation correct?  As stated above we should probably change that
> documentation with this patch to ensure no new users of these 2 functions
> appear.

Ok, I will update the documentation. Note that I already posted patches that use
this new API, see the ODP RDMA link in the cover letter.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-28 16:57             ` Ira Weiny
@ 2019-03-29  1:00               ` Jerome Glisse
  2019-03-29  1:18                 ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  1:00 UTC (permalink / raw)
  To: Ira Weiny
  Cc: John Hubbard, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 09:57:09AM -0700, Ira Weiny wrote:
> On Thu, Mar 28, 2019 at 05:39:26PM -0700, John Hubbard wrote:
> > On 3/28/19 2:21 PM, Jerome Glisse wrote:
> > > On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
> > >> On 3/28/19 12:11 PM, Jerome Glisse wrote:
> > >>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> > >>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> > >>>>> From: Jérôme Glisse <jglisse@redhat.com>
> > [...]
> > >>>>> @@ -67,14 +78,9 @@ struct hmm {
> > >>>>>   */
> > >>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
> > >>>>>  {
> > >>>>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> > >>>>> +	struct hmm *hmm = mm_get_hmm(mm);
> > >>>>
> > >>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
> > >>>
> > >>> The thing is that you want only one hmm struct per process and thus
> > >>> if there is already one and it is not being destroy then you want to
> > >>> reuse it.
> > >>>
> > >>> Also this is all internal to HMM code and so it should not confuse
> > >>> anyone.
> > >>>
> > >>
> > >> Well, it has repeatedly come up, and I'd claim that it is quite 
> > >> counter-intuitive. So if there is an easy way to make this internal 
> > >> HMM code clearer or better named, I would really love that to happen.
> > >>
> > >> And we shouldn't ever dismiss feedback based on "this is just internal
> > >> xxx subsystem code, no need for it to be as clear as other parts of the
> > >> kernel", right?
> > > 
> > > Yes but i have not seen any better alternative that present code. If
> > > there is please submit patch.
> > > 
> > 
> > Ira, do you have any patch you're working on, or a more detailed suggestion there?
> > If not, then I might (later, as it's not urgent) propose a small cleanup patch 
> > I had in mind for the hmm_register code. But I don't want to duplicate effort 
> > if you're already thinking about it.
> 
> No I don't have anything.
> 
> I was just really digging into these this time around and I was about to
> comment on the lack of "get's" for some "puts" when I realized that
> "hmm_register" _was_ the get...
> 
> :-(
> 

The get is mm_get_hmm() where you get a reference on HMM from a mm struct.
John in a previous posting complained about me naming that function hmm_get()
and thus in this version I renamed it to mm_get_hmm() as we are getting
a reference on hmm from a mm struct.

The hmm_put() is just releasing the reference on the hmm struct.

Here I feel I am getting contradictory requirements from different people.
I don't think there is a way to please everyone here.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-28 16:42                 ` Ira Weiny
@ 2019-03-29  1:17                   ` Jerome Glisse
  2019-03-29  1:30                     ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  1:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: John Hubbard, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> > On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > > On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> > >> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> > >>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> > >>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> > >>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> > >>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> > >>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> > [...]
> > >> Hi Jerome,
> > >>
> > >> I think you're talking about flags, but I'm talking about the mask. The 
> > >> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
> > >> that it uses are still in the same lower 3 bits:
> > >>
> > >> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> > >> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
> > >> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
> > >> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
> > >> +};
> > >>
> > >> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
> > >> mask, that is *also* runtime changeable. 
> > > 
> > > So the pfn array is using a device driver specific format and we have
> > > no idea nor do we need to know where the valid, write, ... bit are in
> > > that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> > > we do not care. They are device with bit at the top and for those you
> > > need a mask that allows you to mask out those bits or not depending on
> > > what the user want to do.
> > > 
> > > The mask here is against an _unknown_ (from HMM POV) format. So we can
> > > not presume where the bits will be and thus we can not presume what a
> > > proper mask is.
> > > 
> > > So that's why a full unsigned long mask is use here.
> > > 
> > > Maybe an example will help let say the device flag are:
> > >     VALID (1 << 63)
> > >     WRITE (1 << 62)
> > > 
> > > Now let say that device wants to fault with at least read a range
> > > it does set:
> > >     range->default_flags = (1 << 63)
> > >     range->pfn_flags_mask = 0;
> > > 
> > > This will fill fault all page in the range with at least read
> > > permission.
> > > 
> > > Now let say it wants to do the same except for one page in the range
> > > for which its want to have write. Now driver set:
> > >     range->default_flags = (1 << 63);
> > >     range->pfn_flags_mask = (1 << 62);
> > >     range->pfns[index_of_write] = (1 << 62);
> > > 
> > > With this HMM will fault in all page with at least read (ie valid)
> > > and for the address: range->start + index_of_write << PAGE_SHIFT it
> > > will fault with write permission ie if the CPU pte does not have
> > > write permission set then handle_mm_fault() will be call asking for
> > > write permission.
> > > 
> > > 
> > > Note that in the above HMM will populate the pfns array with write
> > > permission for any entry that have write permission within the CPU
> > > pte ie the default_flags and pfn_flags_mask is only the minimun
> > > requirement but HMM always returns all the flag that are set in the
> > > CPU pte.
> > > 
> > > 
> > > Now let say you are an "old" driver like nouveau upstream, then it
> > > means that you are setting each individual entry within range->pfns
> > > with the exact flags you want for each address hence here what you
> > > want is:
> > >     range->default_flags = 0;
> > >     range->pfn_flags_mask = -1UL;
> > > 
> > > So that what we do is (for each entry):
> > >     (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> > > and we end up with the flags that were set by the driver for each of
> > > the individual range->pfns entries.
> > > 
> > > 
> > > Does this help ?
> > > 
> > 
> > Yes, the key point for me was that this is an entirely device driver specific
> > format. OK. But then we have HMM setting it. So a comment to the effect that
> > this is device-specific might be nice, but I'll leave that up to you whether
> > it is useful.
> 
> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> 
> /*
>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
>  * @range: range use to encode HMM pfn value
>  * @pfn: pfn value for which to create the HMM pfn
>  * Returns: valid HMM pfn for the pfn
>  */
> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
>                                         unsigned long pfn)
> 
> So should this patch contain some sort of helper like this... maybe?
> 
> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> discussed here?
> 
> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
> have shortened the discussion here.
> 

That helper is also used today by nouveau, so changing that name is not that
easy: it does require the multi-release dance. So I am not sure how much
value there is in a name change.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-29  1:00               ` Jerome Glisse
@ 2019-03-29  1:18                 ` John Hubbard
  2019-03-29  1:50                   ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-29  1:18 UTC (permalink / raw)
  To: Jerome Glisse, Ira Weiny
  Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 6:00 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 09:57:09AM -0700, Ira Weiny wrote:
>> On Thu, Mar 28, 2019 at 05:39:26PM -0700, John Hubbard wrote:
>>> On 3/28/19 2:21 PM, Jerome Glisse wrote:
>>>> On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
>>>>> On 3/28/19 12:11 PM, Jerome Glisse wrote:
>>>>>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
>>>>>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>> [...]
>>>>>>>> @@ -67,14 +78,9 @@ struct hmm {
>>>>>>>>   */
>>>>>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
>>>>>>>>  {
>>>>>>>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
>>>>>>>> +	struct hmm *hmm = mm_get_hmm(mm);
>>>>>>>
>>>>>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
>>>>>>
>>>>>> The thing is that you want only one hmm struct per process and thus
>>>>>> if there is already one and it is not being destroy then you want to
>>>>>> reuse it.
>>>>>>
>>>>>> Also this is all internal to HMM code and so it should not confuse
>>>>>> anyone.
>>>>>>
>>>>>
>>>>> Well, it has repeatedly come up, and I'd claim that it is quite 
>>>>> counter-intuitive. So if there is an easy way to make this internal 
>>>>> HMM code clearer or better named, I would really love that to happen.
>>>>>
>>>>> And we shouldn't ever dismiss feedback based on "this is just internal
>>>>> xxx subsystem code, no need for it to be as clear as other parts of the
>>>>> kernel", right?
>>>>
>>>> Yes but i have not seen any better alternative that present code. If
>>>> there is please submit patch.
>>>>
>>>
>>> Ira, do you have any patch you're working on, or a more detailed suggestion there?
>>> If not, then I might (later, as it's not urgent) propose a small cleanup patch 
>>> I had in mind for the hmm_register code. But I don't want to duplicate effort 
>>> if you're already thinking about it.
>>
>> No I don't have anything.
>>
>> I was just really digging into these this time around and I was about to
>> comment on the lack of "get's" for some "puts" when I realized that
>> "hmm_register" _was_ the get...
>>
>> :-(
>>
> 
> The get is mm_get_hmm() were you get a reference on HMM from a mm struct.
> John in previous posting complained about me naming that function hmm_get()
> and thus in this version i renamed it to mm_get_hmm() as we are getting
> a reference on hmm from a mm struct.

Well, that's not what I recommended, though. The actual conversation went like
this [1]:

---------------------------------------------------------------
>> So for this, hmm_get() really ought to be symmetric with
>> hmm_put(), by taking a struct hmm*. And the null check is
>> not helping here, so let's just go with this smaller version:
>>
>> static inline struct hmm *hmm_get(struct hmm *hmm)
>> {
>>     if (kref_get_unless_zero(&hmm->kref))
>>         return hmm;
>>
>>     return NULL;
>> }
>>
>> ...and change the few callers accordingly.
>>
>
> What about renaning hmm_get() to mm_get_hmm() instead ?
>

For a get/put pair of functions, it would be ideal to pass
the same argument type to each. It looks like we are passing
around hmm*, and hmm retains a reference count on hmm->mm,
so I think you have a choice of using either mm* or hmm* as
the argument. I'm not sure that one is better than the other
here, as the lifetimes appear to be linked pretty tightly.

Whichever one is used, I think it would be best to use it
in both the _get() and _put() calls. 
---------------------------------------------------------------

Your response was to change the name to mm_get_hmm(), but that's not
what I recommended.

> 
> The hmm_put() is just releasing the reference on the hmm struct.
> 
> Here i feel i am getting contradicting requirement from different people.
> I don't think there is a way to please everyone here.
> 

That's not a true conflict: you're comparing your actual implementation
to Ira's request, rather than comparing my request to Ira's request.

I think there's a way forward. Ira and I are actually both asking for the
same thing:

a) clear, concise get/put routines

b) avoiding odd side effects in functions that have one name, but do
additional surprising things.

[1] https://lore.kernel.org/r/1ccab0d3-7e90-8e39-074d-02ffbfc68480@nvidia.com

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-29  1:17                   ` Jerome Glisse
@ 2019-03-29  1:30                     ` John Hubbard
  2019-03-29  1:42                       ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-29  1:30 UTC (permalink / raw)
  To: Jerome Glisse, Ira Weiny
  Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 6:17 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
>> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
>>> On 3/28/19 4:21 PM, Jerome Glisse wrote:
>>>> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
>>>>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
>>>>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
>>>>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
>>>>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
>>>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>> [...]
>> Indeed I did not realize there is an hmm "pfn" until I saw this function:
>>
>> /*
>>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
>>  * @range: range use to encode HMM pfn value
>>  * @pfn: pfn value for which to create the HMM pfn
>>  * Returns: valid HMM pfn for the pfn
>>  */
>> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
>>                                         unsigned long pfn)
>>
>> So should this patch contain some sort of helper like this... maybe?
>>
>> I'm assuming the "hmm_pfn" being returned above is the device pfn being
>> discussed here?
>>
>> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
>> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
>> have shortened the discussion here.
>>
> 
> That helper is also use today by nouveau so changing that name is not that
> easy it does require the multi-release dance. So i am not sure how much
> value there is in a name change.
> 

Once the dust settles, I would expect that a name change for this could go
via Andrew's tree, right? It seems incredible to claim that we've built something
that effectively does not allow any minor changes!

I do think it's worth some *minor* trouble to improve the name, assuming that we
can do it in a simple patch, rather than some huge maintainer-level effort.

This field name is not a large thing, but the cumulative effect of having a number of
naming glitches within HMM is significant. The size and complexity of HMM has
always made it hard to attract code reviewers, so let's improve what we can, to
counteract that.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-29  1:30                     ` John Hubbard
@ 2019-03-29  1:42                       ` Jerome Glisse
  2019-03-29  1:59                         ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  1:42 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ira Weiny, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 06:30:26PM -0700, John Hubbard wrote:
> On 3/28/19 6:17 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
> >> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> >>> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> >>>> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> >>>>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> >>>>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >>>>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>>>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> >>> [...]
> >> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> >>
> >> /*
> >>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
> >>  * @range: range use to encode HMM pfn value
> >>  * @pfn: pfn value for which to create the HMM pfn
> >>  * Returns: valid HMM pfn for the pfn
> >>  */
> >> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> >>                                         unsigned long pfn)
> >>
> >> So should this patch contain some sort of helper like this... maybe?
> >>
> >> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> >> discussed here?
> >>
> >> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
> >> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
> >> have shortened the discussion here.
> >>
> > 
> > That helper is also use today by nouveau so changing that name is not that
> > easy it does require the multi-release dance. So i am not sure how much
> > value there is in a name change.
> > 
> 
> Once the dust settles, I would expect that a name change for this could go
> via Andrew's tree, right? It seems incredible to claim that we've built something
> that effectively does not allow any minor changes!
> 
> I do think it's worth some *minor* trouble to improve the name, assuming that we
> can do it in a simple patch, rather than some huge maintainer-level effort.

Changes to nouveau have to go through the nouveau tree, so changing the name means:
 -  release N: add a function with the new name, maybe make the old function
    just a wrapper around the new function (see the sketch below)
 -  release N+1: update users to use the new name
 -  release N+2: remove the old name

So it is do-able but it is painful, so I would rather do that later than now,
as I am sure people will then complain again about some little thing and it
will postpone this whole patchset on that new bit. To avoid postponing
RDMA and a bunch of other patchsets that build on top of this, I would rather
get this patchset in and then do more changes in the next cycle.

This is just a capacity thing.
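
A minimal illustration of the release N step, with a hypothetical new name
(hmm_device_entry_from_pfn is made up here; the only point is that the old
symbol survives one release as a wrapper):

    /* Release N: add the new name, implemented like today's helper. */
    uint64_t hmm_device_entry_from_pfn(const struct hmm_range *range,
                                       unsigned long pfn);

    /* Keep the old name as a thin wrapper until nouveau is converted. */
    static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
                                            unsigned long pfn)
    {
        return hmm_device_entry_from_pfn(range, pfn);
    }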

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-29  1:18                 ` John Hubbard
@ 2019-03-29  1:50                   ` Jerome Glisse
  2019-03-28 18:21                     ` Ira Weiny
  2019-03-29  2:11                     ` John Hubbard
  0 siblings, 2 replies; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  1:50 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ira Weiny, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 06:18:35PM -0700, John Hubbard wrote:
> On 3/28/19 6:00 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 09:57:09AM -0700, Ira Weiny wrote:
> >> On Thu, Mar 28, 2019 at 05:39:26PM -0700, John Hubbard wrote:
> >>> On 3/28/19 2:21 PM, Jerome Glisse wrote:
> >>>> On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
> >>>>> On 3/28/19 12:11 PM, Jerome Glisse wrote:
> >>>>>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> >>>>>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> >>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> >>> [...]
> >>>>>>>> @@ -67,14 +78,9 @@ struct hmm {
> >>>>>>>>   */
> >>>>>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
> >>>>>>>>  {
> >>>>>>>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> >>>>>>>> +	struct hmm *hmm = mm_get_hmm(mm);
> >>>>>>>
> >>>>>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
> >>>>>>
> >>>>>> The thing is that you want only one hmm struct per process and thus
> >>>>>> if there is already one and it is not being destroy then you want to
> >>>>>> reuse it.
> >>>>>>
> >>>>>> Also this is all internal to HMM code and so it should not confuse
> >>>>>> anyone.
> >>>>>>
> >>>>>
> >>>>> Well, it has repeatedly come up, and I'd claim that it is quite 
> >>>>> counter-intuitive. So if there is an easy way to make this internal 
> >>>>> HMM code clearer or better named, I would really love that to happen.
> >>>>>
> >>>>> And we shouldn't ever dismiss feedback based on "this is just internal
> >>>>> xxx subsystem code, no need for it to be as clear as other parts of the
> >>>>> kernel", right?
> >>>>
> >>>> Yes but i have not seen any better alternative that present code. If
> >>>> there is please submit patch.
> >>>>
> >>>
> >>> Ira, do you have any patch you're working on, or a more detailed suggestion there?
> >>> If not, then I might (later, as it's not urgent) propose a small cleanup patch 
> >>> I had in mind for the hmm_register code. But I don't want to duplicate effort 
> >>> if you're already thinking about it.
> >>
> >> No I don't have anything.
> >>
> >> I was just really digging into these this time around and I was about to
> >> comment on the lack of "get's" for some "puts" when I realized that
> >> "hmm_register" _was_ the get...
> >>
> >> :-(
> >>
> > 
> > The get is mm_get_hmm() were you get a reference on HMM from a mm struct.
> > John in previous posting complained about me naming that function hmm_get()
> > and thus in this version i renamed it to mm_get_hmm() as we are getting
> > a reference on hmm from a mm struct.
> 
> Well, that's not what I recommended, though. The actual conversation went like
> this [1]:
> 
> ---------------------------------------------------------------
> >> So for this, hmm_get() really ought to be symmetric with
> >> hmm_put(), by taking a struct hmm*. And the null check is
> >> not helping here, so let's just go with this smaller version:
> >>
> >> static inline struct hmm *hmm_get(struct hmm *hmm)
> >> {
> >>     if (kref_get_unless_zero(&hmm->kref))
> >>         return hmm;
> >>
> >>     return NULL;
> >> }
> >>
> >> ...and change the few callers accordingly.
> >>
> >
> > What about renaning hmm_get() to mm_get_hmm() instead ?
> >
> 
> For a get/put pair of functions, it would be ideal to pass
> the same argument type to each. It looks like we are passing
> around hmm*, and hmm retains a reference count on hmm->mm,
> so I think you have a choice of using either mm* or hmm* as
> the argument. I'm not sure that one is better than the other
> here, as the lifetimes appear to be linked pretty tightly.
> 
> Whichever one is used, I think it would be best to use it
> in both the _get() and _put() calls. 
> ---------------------------------------------------------------
> 
> Your response was to change the name to mm_get_hmm(), but that's not
> what I recommended.

Because I can not do that: hmm_put() can _only_ take the hmm struct as
input, while hmm_get() can _only_ take the mm struct as input.

hmm_put() can only take hmm because the hmm we are un-referencing
might no longer be associated with any mm struct and thus I do not
have an mm struct to use.

hmm_get() can only take mm as input as we need to be careful when
accessing the hmm field within the mm struct, and thus it is better
to have that code within a function than open coded and duplicated
all over the place.
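
A sketch of the asymmetric pair as described above (simplified, and it assumes
the kref added in patch 02 plus a release callback named hmm_free):

    static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
    {
        struct hmm *hmm = READ_ONCE(mm->hmm);

        /* The careful access to mm->hmm lives in one place. */
        if (hmm && kref_get_unless_zero(&hmm->kref))
            return hmm;
        return NULL;
    }

    static inline void hmm_put(struct hmm *hmm)
    {
        /* The hmm may already be detached from its mm at this point. */
        kref_put(&hmm->kref, hmm_free);
    }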

> 
> > 
> > The hmm_put() is just releasing the reference on the hmm struct.
> > 
> > Here i feel i am getting contradicting requirement from different people.
> > I don't think there is a way to please everyone here.
> > 
> 
> That's not a true conflict: you're comparing your actual implementation
> to Ira's request, rather than comparing my request to Ira's request.
> 
> I think there's a way forward. Ira and I are actually both asking for the
> same thing:
> 
> a) clear, concise get/put routines
> 
> b) avoiding odd side effects in functions that have one name, but do
> additional surprising things.

Please show me code, because I do not see any other way to do it than
how I did.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-29  1:42                       ` Jerome Glisse
@ 2019-03-29  1:59                         ` Jerome Glisse
  2019-03-29  2:05                           ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  1:59 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ira Weiny, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 09:42:59PM -0400, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 06:30:26PM -0700, John Hubbard wrote:
> > On 3/28/19 6:17 PM, Jerome Glisse wrote:
> > > On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
> > >> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> > >>> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > >>>> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> > >>>>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> > >>>>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> > >>>>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> > >>>>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> > >>>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> > >>>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> > >>> [...]
> > >> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> > >>
> > >> /*
> > >>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
> > >>  * @range: range use to encode HMM pfn value
> > >>  * @pfn: pfn value for which to create the HMM pfn
> > >>  * Returns: valid HMM pfn for the pfn
> > >>  */
> > >> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> > >>                                         unsigned long pfn)
> > >>
> > >> So should this patch contain some sort of helper like this... maybe?
> > >>
> > >> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> > >> discussed here?
> > >>
> > >> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
> > >> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
> > >> have shortened the discussion here.
> > >>
> > > 
> > > That helper is also use today by nouveau so changing that name is not that
> > > easy it does require the multi-release dance. So i am not sure how much
> > > value there is in a name change.
> > > 
> > 
> > Once the dust settles, I would expect that a name change for this could go
> > via Andrew's tree, right? It seems incredible to claim that we've built something
> > that effectively does not allow any minor changes!
> > 
> > I do think it's worth some *minor* trouble to improve the name, assuming that we
> > can do it in a simple patch, rather than some huge maintainer-level effort.
> 
> Change to nouveau have to go through nouveau tree so changing name means:
>  -  release N add function with new name, maybe make the old function just
>     a wrapper to the new function
>  -  release N+1 update user to use the new name
>  -  release N+2 remove the old name
> 
> So it is do-able but it is painful so i rather do that one latter that now
> as i am sure people will then complain again about some little thing and it
> will post pone this whole patchset on that new bit. To avoid post-poning
> RDMA and bunch of other patchset that build on top of that i rather get
> this patchset in and then do more changes in the next cycle.
> 
> This is just a capacity thing.

Also, for clarity: the API changes I am doing in this patchset are there to make
the ODP conversion easier and thus they bring real, hard value. Renaming
those functions is aesthetic; I am not saying it is useless, I am saying it
does not have the same value as those other changes and I would rather not
miss another merge window just for aesthetic changes.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-29  1:59                         ` Jerome Glisse
@ 2019-03-29  2:05                           ` John Hubbard
  2019-03-29  2:12                             ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-29  2:05 UTC (permalink / raw)
  To: Jerome Glisse, Ben Skeggs
  Cc: Ira Weiny, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 6:59 PM, Jerome Glisse wrote:
>>>>>> [...]
>>>>> Indeed I did not realize there is an hmm "pfn" until I saw this function:
>>>>>
>>>>> /*
>>>>>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
>>>>>  * @range: range use to encode HMM pfn value
>>>>>  * @pfn: pfn value for which to create the HMM pfn
>>>>>  * Returns: valid HMM pfn for the pfn
>>>>>  */
>>>>> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
>>>>>                                         unsigned long pfn)
>>>>>
>>>>> So should this patch contain some sort of helper like this... maybe?
>>>>>
>>>>> I'm assuming the "hmm_pfn" being returned above is the device pfn being
>>>>> discussed here?
>>>>>
>>>>> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
>>>>> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
>>>>> have shortened the discussion here.
>>>>>
>>>>
>>>> That helper is also use today by nouveau so changing that name is not that
>>>> easy it does require the multi-release dance. So i am not sure how much
>>>> value there is in a name change.
>>>>
>>>
>>> Once the dust settles, I would expect that a name change for this could go
>>> via Andrew's tree, right? It seems incredible to claim that we've built something
>>> that effectively does not allow any minor changes!
>>>
>>> I do think it's worth some *minor* trouble to improve the name, assuming that we
>>> can do it in a simple patch, rather than some huge maintainer-level effort.
>>
>> Change to nouveau have to go through nouveau tree so changing name means:

Yes, I understand the guideline, but is that always how it must be done? Ben (+cc)?

>>  -  release N add function with new name, maybe make the old function just
>>     a wrapper to the new function
>>  -  release N+1 update user to use the new name
>>  -  release N+2 remove the old name
>>
>> So it is do-able but it is painful so i rather do that one latter that now
>> as i am sure people will then complain again about some little thing and it
>> will post pone this whole patchset on that new bit. To avoid post-poning
>> RDMA and bunch of other patchset that build on top of that i rather get
>> this patchset in and then do more changes in the next cycle.
>>
>> This is just a capacity thing.
> 
> Also for clarity changes to API i am doing in this patchset is to make
> the ODP convertion easier and thus they bring a real hard value. Renaming
> those function is esthetic, i am not saying it is useless, i am saying it
> does not have the same value as those other changes and i would rather not
> miss another merge window just for esthetic changes.
> 

Agreed that this minor point should not hold up this patch.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-29  1:50                   ` Jerome Glisse
  2019-03-28 18:21                     ` Ira Weiny
@ 2019-03-29  2:11                     ` John Hubbard
  2019-03-29  2:22                       ` Jerome Glisse
  1 sibling, 1 reply; 69+ messages in thread
From: John Hubbard @ 2019-03-29  2:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Ira Weiny, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 6:50 PM, Jerome Glisse wrote:
[...]
>>>
>>> The hmm_put() is just releasing the reference on the hmm struct.
>>>
>>> Here i feel i am getting contradicting requirement from different people.
>>> I don't think there is a way to please everyone here.
>>>
>>
>> That's not a true conflict: you're comparing your actual implementation
>> to Ira's request, rather than comparing my request to Ira's request.
>>
>> I think there's a way forward. Ira and I are actually both asking for the
>> same thing:
>>
>> a) clear, concise get/put routines
>>
>> b) avoiding odd side effects in functions that have one name, but do
>> additional surprising things.
> 
> Please show me code because i do not see any other way to do it then
> how i did.
> 

Sure, I'll take a run at it. I've driven you crazy enough with the naming 
today, it's time to back it up with actual code. :)

I hope this is not one of those "we must also change Nouveau in N+M steps" 
situations, though. I'm starting to despair about reviewing code that
basically can't be changed...

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.
  2019-03-29  2:05                           ` John Hubbard
@ 2019-03-29  2:12                             ` Jerome Glisse
  0 siblings, 0 replies; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  2:12 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ben Skeggs, Ira Weiny, linux-mm, linux-kernel, Andrew Morton,
	Dan Williams

On Thu, Mar 28, 2019 at 07:05:21PM -0700, John Hubbard wrote:
> On 3/28/19 6:59 PM, Jerome Glisse wrote:
> >>>>>> [...]
> >>>>> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> >>>>>
> >>>>> /*
> >>>>>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
> >>>>>  * @range: range use to encode HMM pfn value
> >>>>>  * @pfn: pfn value for which to create the HMM pfn
> >>>>>  * Returns: valid HMM pfn for the pfn
> >>>>>  */
> >>>>> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> >>>>>                                         unsigned long pfn)
> >>>>>
> >>>>> So should this patch contain some sort of helper like this... maybe?
> >>>>>
> >>>>> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> >>>>> discussed here?
> >>>>>
> >>>>> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
> >>>>> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
> >>>>> have shortened the discussion here.
> >>>>>
> >>>>
> >>>> That helper is also use today by nouveau so changing that name is not that
> >>>> easy it does require the multi-release dance. So i am not sure how much
> >>>> value there is in a name change.
> >>>>
> >>>
> >>> Once the dust settles, I would expect that a name change for this could go
> >>> via Andrew's tree, right? It seems incredible to claim that we've built something
> >>> that effectively does not allow any minor changes!
> >>>
> >>> I do think it's worth some *minor* trouble to improve the name, assuming that we
> >>> can do it in a simple patch, rather than some huge maintainer-level effort.
> >>
> >> Change to nouveau have to go through nouveau tree so changing name means:
> 
> Yes, I understand the guideline, but is that always how it must be done? Ben (+cc)?

Yes, it is not only about nouveau, it will be about every single
upstream driver using HMM. It is the easiest solution; all other
solutions involve coordination and/or the risk that the people handling
the conflict do something that breaks things.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 09/11] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem v2
  2019-03-28 18:04   ` Ira Weiny
@ 2019-03-29  2:17     ` Jerome Glisse
  0 siblings, 0 replies; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  2:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams,
	John Hubbard, Arnd Bergmann

On Thu, Mar 28, 2019 at 11:04:26AM -0700, Ira Weiny wrote:
> On Mon, Mar 25, 2019 at 10:40:09AM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > HMM mirror is a device driver helpers to mirror range of virtual address.
> > It means that the process jobs running on the device can access the same
> > virtual address as the CPU threads of that process. This patch adds support
> > for mirroring mapping of file that are on a DAX block device (ie range of
> > virtual address that is an mmap of a file in a filesystem on a DAX block
> > device). There is no reason to not support such case when mirroring virtual
> > address on a device.
> > 
> > Note that unlike GUP code we do not take page reference hence when we
> > back-off we have nothing to undo.
> > 
> > Changes since v1:
> >     - improved commit message
> >     - squashed: Arnd Bergmann: fix unused variable warning in hmm_vma_walk_pud
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > ---
> >  mm/hmm.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 111 insertions(+), 21 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 64a33770813b..ce33151c6832 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -325,6 +325,7 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
> >  
> >  struct hmm_vma_walk {
> >  	struct hmm_range	*range;
> > +	struct dev_pagemap	*pgmap;
> >  	unsigned long		last;
> >  	bool			fault;
> >  	bool			block;
> > @@ -499,6 +500,15 @@ static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
> >  				range->flags[HMM_PFN_VALID];
> >  }
> >  
> > +static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud)
> > +{
> > +	if (!pud_present(pud))
> > +		return 0;
> > +	return pud_write(pud) ? range->flags[HMM_PFN_VALID] |
> > +				range->flags[HMM_PFN_WRITE] :
> > +				range->flags[HMM_PFN_VALID];
> > +}
> > +
> >  static int hmm_vma_handle_pmd(struct mm_walk *walk,
> >  			      unsigned long addr,
> >  			      unsigned long end,
> > @@ -520,8 +530,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
> >  		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> >  
> >  	pfn = pmd_pfn(pmd) + pte_index(addr);
> > -	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
> > +	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > +		if (pmd_devmap(pmd)) {
> > +			hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
> > +					      hmm_vma_walk->pgmap);
> > +			if (unlikely(!hmm_vma_walk->pgmap))
> > +				return -EBUSY;
> > +		}
> >  		pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
> > +	}
> > +	if (hmm_vma_walk->pgmap) {
> > +		put_dev_pagemap(hmm_vma_walk->pgmap);
> > +		hmm_vma_walk->pgmap = NULL;
> > +	}
> >  	hmm_vma_walk->last = end;
> >  	return 0;
> >  }
> > @@ -608,10 +629,24 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> >  	if (fault || write_fault)
> >  		goto fault;
> >  
> > +	if (pte_devmap(pte)) {
> > +		hmm_vma_walk->pgmap = get_dev_pagemap(pte_pfn(pte),
> > +					      hmm_vma_walk->pgmap);
> > +		if (unlikely(!hmm_vma_walk->pgmap))
> > +			return -EBUSY;
> > +	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > +		*pfn = range->values[HMM_PFN_SPECIAL];
> > +		return -EFAULT;
> > +	}
> > +
> >  	*pfn = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;
> 
> 	<tag>
> 
> >  	return 0;
> >  
> >  fault:
> > +	if (hmm_vma_walk->pgmap) {
> > +		put_dev_pagemap(hmm_vma_walk->pgmap);
> > +		hmm_vma_walk->pgmap = NULL;
> > +	}
> >  	pte_unmap(ptep);
> >  	/* Fault any virtual address we were asked to fault */
> >  	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> > @@ -699,12 +734,83 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  			return r;
> >  		}
> >  	}
> > +	if (hmm_vma_walk->pgmap) {
> > +		put_dev_pagemap(hmm_vma_walk->pgmap);
> > +		hmm_vma_walk->pgmap = NULL;
> > +	}
> 
> 
> Why is this here and not in hmm_vma_handle_pte()?  Unless I'm just getting
> tired this is the corresponding put when hmm_vma_handle_pte() returns 0 above
> at <tag> above.

This is because get_dev_pagemap() optimizes away the reference taking
if we already hold a reference on the correct dev_pagemap. So if we
were releasing the reference within hmm_vma_handle_pte() then we would
lose the get_dev_pagemap() optimization.
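
To illustrate (a simplified, hypothetical walker, *not* the actual mm/hmm.c
code): get_dev_pagemap(pfn, pgmap) re-uses the pgmap reference we already
hold when the pfn falls inside it, so caching the reference in the walk
state and dropping it once at the end avoids a get/put pair per pte:

    /* Sketch only: hypothetical helper showing the caching pattern. */
    static int sketch_walk_range(struct hmm_vma_walk *hmm_vma_walk,
                                 const unsigned long *pfns,
                                 unsigned long npfns)
    {
        unsigned long i;

        for (i = 0; i < npfns; ++i) {
            /* Cheap when pfns[i] is covered by the cached pgmap. */
            hmm_vma_walk->pgmap = get_dev_pagemap(pfns[i],
                                                  hmm_vma_walk->pgmap);
            if (!hmm_vma_walk->pgmap)
                return -EBUSY;
            /* ... snapshot or fault work for pfns[i] ... */
        }
        /* Single put once the whole range has been walked. */
        if (hmm_vma_walk->pgmap) {
            put_dev_pagemap(hmm_vma_walk->pgmap);
            hmm_vma_walk->pgmap = NULL;
        }
        return 0;
    }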

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-29  2:11                     ` John Hubbard
@ 2019-03-29  2:22                       ` Jerome Glisse
  0 siblings, 0 replies; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  2:22 UTC (permalink / raw)
  To: John Hubbard
  Cc: Ira Weiny, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 07:11:17PM -0700, John Hubbard wrote:
> On 3/28/19 6:50 PM, Jerome Glisse wrote:
> [...]
> >>>
> >>> The hmm_put() is just releasing the reference on the hmm struct.
> >>>
> >>> Here i feel i am getting contradicting requirement from different people.
> >>> I don't think there is a way to please everyone here.
> >>>
> >>
> >> That's not a true conflict: you're comparing your actual implementation
> >> to Ira's request, rather than comparing my request to Ira's request.
> >>
> >> I think there's a way forward. Ira and I are actually both asking for the
> >> same thing:
> >>
> >> a) clear, concise get/put routines
> >>
> >> b) avoiding odd side effects in functions that have one name, but do
> >> additional surprising things.
> > 
> > Please show me code because i do not see any other way to do it then
> > how i did.
> > 
> 
> Sure, I'll take a run at it. I've driven you crazy enough with the naming 
> today, it's time to back it up with actual code. :)

Note that every single line in mm_get_hmm() does matter.

> I hope this is not one of those "we must also change Nouveau in N+M steps" 
> situations, though. I'm starting to despair about reviewing code that
> basically can't be changed...

It can be changed, but I would rather not do too many changes in one go;
each change is like a tango with one partner, and dancing with multiple
partners at once is painful and much more likely to end with stepping on
each other's feet.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-28 18:21                     ` Ira Weiny
@ 2019-03-29  2:25                       ` Jerome Glisse
  2019-03-29 20:07                         ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29  2:25 UTC (permalink / raw)
  To: Ira Weiny
  Cc: John Hubbard, linux-mm, linux-kernel, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 11:21:00AM -0700, Ira Weiny wrote:
> On Thu, Mar 28, 2019 at 09:50:03PM -0400, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 06:18:35PM -0700, John Hubbard wrote:
> > > On 3/28/19 6:00 PM, Jerome Glisse wrote:
> > > > On Thu, Mar 28, 2019 at 09:57:09AM -0700, Ira Weiny wrote:
> > > >> On Thu, Mar 28, 2019 at 05:39:26PM -0700, John Hubbard wrote:
> > > >>> On 3/28/19 2:21 PM, Jerome Glisse wrote:
> > > >>>> On Thu, Mar 28, 2019 at 01:43:13PM -0700, John Hubbard wrote:
> > > >>>>> On 3/28/19 12:11 PM, Jerome Glisse wrote:
> > > >>>>>> On Thu, Mar 28, 2019 at 04:07:20AM -0700, Ira Weiny wrote:
> > > >>>>>>> On Mon, Mar 25, 2019 at 10:40:02AM -0400, Jerome Glisse wrote:
> > > >>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> > > >>> [...]
> > > >>>>>>>> @@ -67,14 +78,9 @@ struct hmm {
> > > >>>>>>>>   */
> > > >>>>>>>>  static struct hmm *hmm_register(struct mm_struct *mm)
> > > >>>>>>>>  {
> > > >>>>>>>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> > > >>>>>>>> +	struct hmm *hmm = mm_get_hmm(mm);
> > > >>>>>>>
> > > >>>>>>> FWIW: having hmm_register == "hmm get" is a bit confusing...
> > > >>>>>>
> > > >>>>>> The thing is that you want only one hmm struct per process and thus
> > > >>>>>> if there is already one and it is not being destroy then you want to
> > > >>>>>> reuse it.
> > > >>>>>>
> > > >>>>>> Also this is all internal to HMM code and so it should not confuse
> > > >>>>>> anyone.
> > > >>>>>>
> > > >>>>>
> > > >>>>> Well, it has repeatedly come up, and I'd claim that it is quite 
> > > >>>>> counter-intuitive. So if there is an easy way to make this internal 
> > > >>>>> HMM code clearer or better named, I would really love that to happen.
> > > >>>>>
> > > >>>>> And we shouldn't ever dismiss feedback based on "this is just internal
> > > >>>>> xxx subsystem code, no need for it to be as clear as other parts of the
> > > >>>>> kernel", right?
> > > >>>>
> > > >>>> Yes but i have not seen any better alternative that present code. If
> > > >>>> there is please submit patch.
> > > >>>>
> > > >>>
> > > >>> Ira, do you have any patch you're working on, or a more detailed suggestion there?
> > > >>> If not, then I might (later, as it's not urgent) propose a small cleanup patch 
> > > >>> I had in mind for the hmm_register code. But I don't want to duplicate effort 
> > > >>> if you're already thinking about it.
> > > >>
> > > >> No I don't have anything.
> > > >>
> > > >> I was just really digging into these this time around and I was about to
> > > >> comment on the lack of "get's" for some "puts" when I realized that
> > > >> "hmm_register" _was_ the get...
> > > >>
> > > >> :-(
> > > >>
> > > > 
> > > > The get is mm_get_hmm() were you get a reference on HMM from a mm struct.
> > > > John in previous posting complained about me naming that function hmm_get()
> > > > and thus in this version i renamed it to mm_get_hmm() as we are getting
> > > > a reference on hmm from a mm struct.
> > > 
> > > Well, that's not what I recommended, though. The actual conversation went like
> > > this [1]:
> > > 
> > > ---------------------------------------------------------------
> > > >> So for this, hmm_get() really ought to be symmetric with
> > > >> hmm_put(), by taking a struct hmm*. And the null check is
> > > >> not helping here, so let's just go with this smaller version:
> > > >>
> > > >> static inline struct hmm *hmm_get(struct hmm *hmm)
> > > >> {
> > > >>     if (kref_get_unless_zero(&hmm->kref))
> > > >>         return hmm;
> > > >>
> > > >>     return NULL;
> > > >> }
> > > >>
> > > >> ...and change the few callers accordingly.
> > > >>
> > > >
> > > > What about renaning hmm_get() to mm_get_hmm() instead ?
> > > >
> > > 
> > > For a get/put pair of functions, it would be ideal to pass
> > > the same argument type to each. It looks like we are passing
> > > around hmm*, and hmm retains a reference count on hmm->mm,
> > > so I think you have a choice of using either mm* or hmm* as
> > > the argument. I'm not sure that one is better than the other
> > > here, as the lifetimes appear to be linked pretty tightly.
> > > 
> > > Whichever one is used, I think it would be best to use it
> > > in both the _get() and _put() calls. 
> > > ---------------------------------------------------------------
> > > 
> > > Your response was to change the name to mm_get_hmm(), but that's not
> > > what I recommended.
> > 
> > Because i can not do that, hmm_put() can _only_ take hmm struct as
> > input while hmm_get() can _only_ get mm struct as input.
> > 
> > hmm_put() can only take hmm because the hmm we are un-referencing
> > might no longer be associated with any mm struct and thus i do not
> > have a mm struct to use.
> > 
> > hmm_get() can only get mm as input as we need to be careful when
> > accessing the hmm field within the mm struct and thus it is better
> > to have that code within a function than open coded and duplicated
> > all over the place.
> 
> The input value is not the problem.  The problem is in the naming.
> 
> obj = get_obj( various parameters );
> put_obj(obj);
> 
> 
> The problem is that the function is named hmm_register() either "gets" a
> reference to _or_ creates and gets a reference to the hmm object.
> 
> What John is probably ready to submit is something like.
> 
> struct hmm *get_create_hmm(struct mm *mm);
> void put_hmm(struct hmm *hmm);
> 
> 
> So when you are reading the code you see...
> 
> foo(...) {
> 	struct hmm *hmm = get_create_hmm(mm);
> 
> 	if (!hmm)
> 		error...
> 
> 	do stuff...
> 
> 	put_hmm(hmm);
> }
> 
> Here I can see a very clear get/put pair.  The name also shows that the hmm is
> created if need be as well as getting a reference.
> 

You only need to create the HMM struct when you either register a mirror
or register a range. So there are two patterns:

    average_foo() {
        struct hmm *hmm = mm_get_hmm(mm);
        ...
        hmm_put(hmm);
    }

    register_foo() {
        struct hmm *hmm = hmm_register(mm);
        ...
        return 0;
    error:
        ...
        hmm_put(hmm);
    }
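
In the register pattern the reference taken at registration time is only
dropped by the matching unregister path, for example (hypothetical name,
same sketch style as above):

    unregister_foo() {
        ...
        hmm_put(hmm);   /* drops the reference taken by hmm_register() */
    }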

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2
  2019-03-29  2:25                       ` Jerome Glisse
@ 2019-03-29 20:07                         ` John Hubbard
  0 siblings, 0 replies; 69+ messages in thread
From: John Hubbard @ 2019-03-29 20:07 UTC (permalink / raw)
  To: Jerome Glisse, Ira Weiny
  Cc: linux-mm, linux-kernel, Andrew Morton, Dan Williams

On 3/28/19 7:25 PM, Jerome Glisse wrote:
[...]
>> The input value is not the problem.  The problem is in the naming.
>>
>> obj = get_obj( various parameters );
>> put_obj(obj);
>>
>>
>> The problem is that the function is named hmm_register() either "gets" a
>> reference to _or_ creates and gets a reference to the hmm object.
>>
>> What John is probably ready to submit is something like.
>>
>> struct hmm *get_create_hmm(struct mm *mm);
>> void put_hmm(struct hmm *hmm);
>>
>>
>> So when you are reading the code you see...
>>
>> foo(...) {
>> 	struct hmm *hmm = get_create_hmm(mm);
>>
>> 	if (!hmm)
>> 		error...
>>
>> 	do stuff...
>>
>> 	put_hmm(hmm);
>> }
>>
>> Here I can see a very clear get/put pair.  The name also shows that the hmm is
>> created if need be as well as getting a reference.
>>
> 
> You only need to create the HMM struct when you either register a mirror
> or register a range. So there are two patterns:
> 
>     average_foo() {
>         struct hmm *hmm = mm_get_hmm(mm);
>         ...
>         hmm_put(hmm);
>     }
> 
>     register_foo() {
>         struct hmm *hmm = hmm_register(mm);
>         ...
>         return 0;
>     error:
>         ...
>         hmm_put(hmm);
>     }
> 

1. Looking at this fresh this morning, Ira's idea of just a single rename
actually clarifies things a lot more than I expected. I think the following
tiny patch would suffice here (I've updated documentation to match, and added
a missing "@Return:" line too):

diff --git a/mm/hmm.c b/mm/hmm.c
index fd143251b157..37b1c5803f1e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -50,14 +50,17 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
 }
 
 /*
- * hmm_register - register HMM against an mm (HMM internal)
+ * hmm_get_create - returns an HMM object, either by referencing the existing
+ * (per-process) object, or by creating a new one.
  *
- * @mm: mm struct to attach to
+ * @mm: the mm_struct to attach to
+ * @Return: a pointer to the HMM object, or NULL upon failure. This pointer must
+ * be released, when done, via hmm_put().
  *
- * This is not intended to be used directly by device drivers. It allocates an
- * HMM struct if mm does not have one, and initializes it.
+ * This is an internal HMM function, and is not intended to be used directly by
+ * device drivers.
  */
-static struct hmm *hmm_register(struct mm_struct *mm)
+static struct hmm *hmm_get_create(struct mm_struct *mm)
 {
        struct hmm *hmm = mm_get_hmm(mm);
        bool cleanup = false;
@@ -288,7 +291,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
        if (!mm || !mirror || !mirror->ops)
                return -EINVAL;
 
-       mirror->hmm = hmm_register(mm);
+       mirror->hmm = hmm_get_create(mm);
        if (!mirror->hmm)
                return -ENOMEM;
 
@@ -915,7 +918,7 @@ int hmm_range_register(struct hmm_range *range,
        range->start = start;
        range->end = end;
 
-       range->hmm = hmm_register(mm);
+       range->hmm = hmm_get_create(mm);
        if (!range->hmm)
                return -EFAULT;




2. A not directly related point: did you see my minor comment on patch 0001? I think it might have been missed in all the threads yesterday.



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM
  2019-03-28 20:33   ` John Hubbard
@ 2019-03-29 21:15     ` Jerome Glisse
  2019-03-29 21:42       ` John Hubbard
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-03-29 21:15 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, linux-kernel, Ralph Campbell, Andrew Morton, Dan Williams

On Thu, Mar 28, 2019 at 01:33:42PM -0700, John Hubbard wrote:
> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > To avoid random config build issue, select mmu notifier when HMM is
> > selected. In any cases when HMM get selected it will be by users that
> > will also wants the mmu notifier.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Acked-by: Balbir Singh <bsingharora@gmail.com>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  mm/Kconfig | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 25c71eb8a7db..0d2944278d80 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -694,6 +694,7 @@ config DEV_PAGEMAP_OPS
> >  
> >  config HMM
> >  	bool
> > +	select MMU_NOTIFIER
> >  	select MIGRATE_VMA_HELPER
> >  
> >  config HMM_MIRROR
> > 
> 
> Yes, this is a good move, given that MMU notifiers are completely,
> indispensably part of the HMM design and implementation.
> 
> The alternative would also work, but it's not quite as good. I'm
> listing it in order to forestall any debate: 
> 
>   config HMM
>   	bool
>  +	depends on MMU_NOTIFIER
>   	select MIGRATE_VMA_HELPER
> 
> ...and "depends on" versus "select" is always a subtle question. But in
> this case, I'd say that if someone wants HMM, there's no advantage in
> making them know that they must first ensure MMU_NOTIFIER is enabled.
> After poking around a bit I don't see any obvious downsides either.

You cannot depend on MMU_NOTIFIER; it is one of the kernel config
options that is not user-selectable. So any config that needs
MMU_NOTIFIER must select it.

> 
> However, given that you're making this change, in order to avoid odd
> redundancy, you should also do this:
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0d2944278d80..2e6d24d783f7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -700,7 +700,6 @@ config HMM
>  config HMM_MIRROR
>         bool "HMM mirror CPU page table into a device page table"
>         depends on ARCH_HAS_HMM
> -       select MMU_NOTIFIER
>         select HMM
>         help
>           Select HMM_MIRROR if you want to mirror range of the CPU page table of a

Because it is a select option, no harm can come from the redundancy;
that is why I did not remove it, but I can remove it.
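
For reference, assembling the hunks quoted above, the two entries would
end up looking like this (help text unchanged and omitted here):

    config HMM
            bool
            select MMU_NOTIFIER
            select MIGRATE_VMA_HELPER

    config HMM_MIRROR
            bool "HMM mirror CPU page table into a device page table"
            depends on ARCH_HAS_HMM
            select HMM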

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM
  2019-03-29 21:15     ` Jerome Glisse
@ 2019-03-29 21:42       ` John Hubbard
  0 siblings, 0 replies; 69+ messages in thread
From: John Hubbard @ 2019-03-29 21:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Ralph Campbell, Andrew Morton, Dan Williams

On 3/29/19 2:15 PM, Jerome Glisse wrote:
[...]
>> Yes, this is a good move, given that MMU notifiers are completely,
>> indispensably part of the HMM design and implementation.
>>
>> The alternative would also work, but it's not quite as good. I'm
>> listing it in order to forestall any debate: 
>>
>>   config HMM
>>   	bool
>>  +	depends on MMU_NOTIFIER
>>   	select MIGRATE_VMA_HELPER
>>
>> ...and "depends on" versus "select" is always a subtle question. But in
>> this case, I'd say that if someone wants HMM, there's no advantage in
>> making them know that they must first ensure MMU_NOTIFIER is enabled.
>> After poking around a bit I don't see any obvious downsides either.
> 
> You can not depend on MMU_NOTIFIER it is one of the kernel config
> option that is not selectable. So any config that need MMU_NOTIFIER
> must select it.
> 

aha, thanks for explaining that point about the non-user-selectable items,
I wasn't aware of that. (I had convinced myself that those were set by
hard-coding a choice in one of the Kconfig files.)

>>
>> However, given that you're making this change, in order to avoid odd
>> redundancy, you should also do this:
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 0d2944278d80..2e6d24d783f7 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -700,7 +700,6 @@ config HMM
>>  config HMM_MIRROR
>>         bool "HMM mirror CPU page table into a device page table"
>>         depends on ARCH_HAS_HMM
>> -       select MMU_NOTIFIER
>>         select HMM
>>         help
>>           Select HMM_MIRROR if you want to mirror range of the CPU page table of a
> 
> Because it is a select option no harm can come from that hence i do
> not remove but i can remove it.
> 

Yes, this is just a tiny housecleaning point, not anything earthshaking.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 11/11] mm/hmm: add an helper function that fault pages and map them to a device v2
  2019-03-25 14:40 ` [PATCH v2 11/11] mm/hmm: add an helper function that fault pages and map them to a device v2 jglisse
@ 2019-04-01 11:59     ` Souptick Joarder
  0 siblings, 0 replies; 69+ messages in thread
From: Souptick Joarder @ 2019-04-01 11:59 UTC (permalink / raw)
  To: jglisse
  Cc: Linux-MM, linux-kernel, Andrew Morton, Ralph Campbell,
	John Hubbard, Dan Williams

On Mon, Mar 25, 2019 at 8:11 PM <jglisse@redhat.com> wrote:
>
> From: Jérôme Glisse <jglisse@redhat.com>
>
> This is a all in one helper that fault pages in a range and map them to
> a device so that every single device driver do not have to re-implement
> this common pattern.
>
> This is taken from ODP RDMA in preparation of ODP RDMA convertion. It
> will be use by nouveau and other drivers.
>
> Changes since v1:
>     - improved commit message
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/hmm.h |   9 +++
>  mm/hmm.c            | 152 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 161 insertions(+)
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 5f9deaeb9d77..7aadf18b29cb 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -568,6 +568,15 @@ int hmm_range_register(struct hmm_range *range,
>  void hmm_range_unregister(struct hmm_range *range);
>  long hmm_range_snapshot(struct hmm_range *range);
>  long hmm_range_fault(struct hmm_range *range, bool block);
> +long hmm_range_dma_map(struct hmm_range *range,
> +                      struct device *device,
> +                      dma_addr_t *daddrs,
> +                      bool block);
> +long hmm_range_dma_unmap(struct hmm_range *range,
> +                        struct vm_area_struct *vma,
> +                        struct device *device,
> +                        dma_addr_t *daddrs,
> +                        bool dirty);
>
>  /*
>   * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> diff --git a/mm/hmm.c b/mm/hmm.c
> index ce33151c6832..fd143251b157 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -30,6 +30,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memremap.h>
>  #include <linux/jump_label.h>
> +#include <linux/dma-mapping.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/memory_hotplug.h>
>
> @@ -1163,6 +1164,157 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>         return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
>  }
>  EXPORT_SYMBOL(hmm_range_fault);
> +
> +/*

Adding an extra * might be helpful here for documentation (kernel-doc).
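
That is, a kernel-doc style opener, roughly (sketch only):

    /**
     * hmm_range_dma_map() - hmm_range_fault() and dma map page all in one.
     * ...
     */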

> + * hmm_range_dma_map() - hmm_range_fault() and dma map page all in one.
> + * @range: range being faulted
> + * @device: device against to dma map page to
> + * @daddrs: dma address of mapped pages
> + * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
> + * Returns: number of pages mapped on success, -EAGAIN if mmap_sem have been
> + *          drop and you need to try again, some other error value otherwise
> + *
> + * Note same usage pattern as hmm_range_fault().
> + */
> +long hmm_range_dma_map(struct hmm_range *range,
> +                      struct device *device,
> +                      dma_addr_t *daddrs,
> +                      bool block)
> +{
> +       unsigned long i, npages, mapped;
> +       long ret;
> +
> +       ret = hmm_range_fault(range, block);
> +       if (ret <= 0)
> +               return ret ? ret : -EBUSY;
> +
> +       npages = (range->end - range->start) >> PAGE_SHIFT;
> +       for (i = 0, mapped = 0; i < npages; ++i) {
> +               enum dma_data_direction dir = DMA_FROM_DEVICE;
> +               struct page *page;
> +
> +               /*
> +                * FIXME need to update DMA API to provide invalid DMA address
> +                * value instead of a function to test dma address value. This
> +                * would remove lot of dumb code duplicated accross many arch.
> +                *
> +                * For now setting it to 0 here is good enough as the pfns[]
> +                * value is what is use to check what is valid and what isn't.
> +                */
> +               daddrs[i] = 0;
> +
> +               page = hmm_pfn_to_page(range, range->pfns[i]);
> +               if (page == NULL)
> +                       continue;
> +
> +               /* Check if range is being invalidated */
> +               if (!range->valid) {
> +                       ret = -EBUSY;
> +                       goto unmap;
> +               }
> +
> +               /* If it is read and write than map bi-directional. */
> +               if (range->pfns[i] & range->values[HMM_PFN_WRITE])
> +                       dir = DMA_BIDIRECTIONAL;
> +
> +               daddrs[i] = dma_map_page(device, page, 0, PAGE_SIZE, dir);
> +               if (dma_mapping_error(device, daddrs[i])) {
> +                       ret = -EFAULT;
> +                       goto unmap;
> +               }
> +
> +               mapped++;
> +       }
> +
> +       return mapped;
> +
> +unmap:
> +       for (npages = i, i = 0; (i < npages) && mapped; ++i) {
> +               enum dma_data_direction dir = DMA_FROM_DEVICE;
> +               struct page *page;
> +
> +               page = hmm_pfn_to_page(range, range->pfns[i]);
> +               if (page == NULL)
> +                       continue;
> +
> +               if (dma_mapping_error(device, daddrs[i]))
> +                       continue;
> +
> +               /* If it is read and write than map bi-directional. */
> +               if (range->pfns[i] & range->values[HMM_PFN_WRITE])
> +                       dir = DMA_BIDIRECTIONAL;
> +
> +               dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
> +               mapped--;
> +       }
> +
> +       return ret;
> +}
> +EXPORT_SYMBOL(hmm_range_dma_map);
> +
> +/*

Same here.

> + * hmm_range_dma_unmap() - unmap range of that was map with hmm_range_dma_map()
> + * @range: range being unmapped
> + * @vma: the vma against which the range (optional)
> + * @device: device against which dma map was done
> + * @daddrs: dma address of mapped pages
> + * @dirty: dirty page if it had the write flag set
> + * Returns: number of page unmapped on success, -EINVAL otherwise
> + *
> + * Note that caller MUST abide by mmu notifier or use HMM mirror and abide
> + * to the sync_cpu_device_pagetables() callback so that it is safe here to
> + * call set_page_dirty(). Caller must also take appropriate locks to avoid
> + * concurrent mmu notifier or sync_cpu_device_pagetables() to make progress.
> + */
> +long hmm_range_dma_unmap(struct hmm_range *range,
> +                        struct vm_area_struct *vma,
> +                        struct device *device,
> +                        dma_addr_t *daddrs,
> +                        bool dirty)
> +{
> +       unsigned long i, npages;
> +       long cpages = 0;
> +
> +       /* Sanity check. */
> +       if (range->end <= range->start)
> +               return -EINVAL;
> +       if (!daddrs)
> +               return -EINVAL;
> +       if (!range->pfns)
> +               return -EINVAL;
> +
> +       npages = (range->end - range->start) >> PAGE_SHIFT;
> +       for (i = 0; i < npages; ++i) {
> +               enum dma_data_direction dir = DMA_FROM_DEVICE;
> +               struct page *page;
> +
> +               page = hmm_pfn_to_page(range, range->pfns[i]);
> +               if (page == NULL)
> +                       continue;
> +
> +               /* If it is read and write than map bi-directional. */
> +               if (range->pfns[i] & range->values[HMM_PFN_WRITE]) {
> +                       dir = DMA_BIDIRECTIONAL;
> +
> +                       /*
> +                        * See comments in function description on why it is
> +                        * safe here to call set_page_dirty()
> +                        */
> +                       if (dirty)
> +                               set_page_dirty(page);
> +               }
> +
> +               /* Unmap and clear pfns/dma address */
> +               dma_unmap_page(device, daddrs[i], PAGE_SIZE, dir);
> +               range->pfns[i] = range->values[HMM_PFN_NONE];
> +               /* FIXME see comments in hmm_vma_dma_map() */
> +               daddrs[i] = 0;
> +               cpages++;
> +       }
> +
> +       return cpages;
> +}
> +EXPORT_SYMBOL(hmm_range_dma_unmap);
>  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>
>
> --
> 2.17.2
>
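
A rough driver-side usage sketch of the pairing above (assuming the
hmm_range has already been initialized and registered elsewhere; everything
apart from the two helpers is a placeholder):

    long ret;

    /* Fault the pages and dma map them against the device in one call. */
    ret = hmm_range_dma_map(range, device, daddrs, true);
    if (ret < 0)
        /* -EAGAIN means mmap_sem was dropped and the caller must retry. */
        return ret;

    /* ... program the device with daddrs[0..ret-1] ... */

    /* On teardown, unmap (and dirty the pages that were mapped writable). */
    hmm_range_dma_unmap(range, NULL, device, daddrs, true);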

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2019-04-01 12:00 UTC | newest]

Thread overview: 69+ messages
2019-03-25 14:40 [PATCH v2 00/11] Improve HMM driver API v2 jglisse
2019-03-25 14:40 ` [PATCH v2 01/11] mm/hmm: select mmu notifier when selecting HMM jglisse
2019-03-28 20:33   ` John Hubbard
2019-03-29 21:15     ` Jerome Glisse
2019-03-29 21:42       ` John Hubbard
2019-03-25 14:40 ` [PATCH v2 02/11] mm/hmm: use reference counting for HMM struct v2 jglisse
2019-03-28 11:07   ` Ira Weiny
2019-03-28 19:11     ` Jerome Glisse
2019-03-28 20:43       ` John Hubbard
2019-03-28 21:21         ` Jerome Glisse
2019-03-29  0:39           ` John Hubbard
2019-03-28 16:57             ` Ira Weiny
2019-03-29  1:00               ` Jerome Glisse
2019-03-29  1:18                 ` John Hubbard
2019-03-29  1:50                   ` Jerome Glisse
2019-03-28 18:21                     ` Ira Weiny
2019-03-29  2:25                       ` Jerome Glisse
2019-03-29 20:07                         ` John Hubbard
2019-03-29  2:11                     ` John Hubbard
2019-03-29  2:22                       ` Jerome Glisse
2019-03-25 14:40 ` [PATCH v2 03/11] mm/hmm: do not erase snapshot when a range is invalidated jglisse
2019-03-25 14:40 ` [PATCH v2 04/11] mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() v2 jglisse
2019-03-28 13:30   ` Ira Weiny
2019-03-25 14:40 ` [PATCH v2 05/11] mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() v2 jglisse
2019-03-28 13:43   ` Ira Weiny
2019-03-28 22:03     ` Jerome Glisse
2019-03-25 14:40 ` [PATCH v2 06/11] mm/hmm: improve driver API to work and wait over a range v2 jglisse
2019-03-28 13:11   ` Ira Weiny
2019-03-28 21:39     ` Jerome Glisse
2019-03-28 16:12   ` Ira Weiny
2019-03-29  0:56     ` Jerome Glisse
2019-03-28 18:49       ` Ira Weiny
2019-03-25 14:40 ` [PATCH v2 07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays jglisse
2019-03-28 21:59   ` John Hubbard
2019-03-28 22:12     ` Jerome Glisse
2019-03-28 22:19       ` John Hubbard
2019-03-28 22:31         ` Jerome Glisse
2019-03-28 22:40           ` John Hubbard
2019-03-28 23:21             ` Jerome Glisse
2019-03-28 23:28               ` John Hubbard
2019-03-28 16:42                 ` Ira Weiny
2019-03-29  1:17                   ` Jerome Glisse
2019-03-29  1:30                     ` John Hubbard
2019-03-29  1:42                       ` Jerome Glisse
2019-03-29  1:59                         ` Jerome Glisse
2019-03-29  2:05                           ` John Hubbard
2019-03-29  2:12                             ` Jerome Glisse
2019-03-28 23:43                 ` Jerome Glisse
2019-03-25 14:40 ` [PATCH v2 08/11] mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping) v2 jglisse
2019-03-28 16:53   ` Ira Weiny
2019-03-25 14:40 ` [PATCH v2 09/11] mm/hmm: allow to mirror vma of a file on a DAX backed filesystem v2 jglisse
2019-03-28 18:04   ` Ira Weiny
2019-03-29  2:17     ` Jerome Glisse
2019-03-25 14:40 ` [PATCH v2 10/11] mm/hmm: add helpers for driver to safely take the mmap_sem v2 jglisse
2019-03-28 20:54   ` John Hubbard
2019-03-28 21:30     ` Jerome Glisse
2019-03-28 21:41       ` John Hubbard
2019-03-28 22:08         ` Jerome Glisse
2019-03-28 22:25           ` John Hubbard
2019-03-28 22:40             ` Jerome Glisse
2019-03-28 22:43               ` John Hubbard
2019-03-28 23:05                 ` Jerome Glisse
2019-03-28 23:20                   ` John Hubbard
2019-03-28 23:24                     ` Jerome Glisse
2019-03-28 23:34                       ` John Hubbard
2019-03-28 18:44                         ` Ira Weiny
2019-03-25 14:40 ` [PATCH v2 11/11] mm/hmm: add an helper function that fault pages and map them to a device v2 jglisse
2019-04-01 11:59   ` Souptick Joarder
2019-04-01 11:59     ` Souptick Joarder
