linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* HMM (Heterogeneous Memory Management) v8
@ 2015-05-21 19:31 j.glisse
  2015-05-21 19:31 ` [PATCH 01/36] mmu_notifier: add event information to address invalidation v7 j.glisse
                   ` (21 more replies)
  0 siblings, 22 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	linux-fsdevel, Linda Wang, Kevin E Martin, Jeff Law, Or Gerlitz,
	Sagi Grimberg


So sorry had to resend because i stupidly forgot to cc mailing list.
Ignore private send done before.


HMM (Heterogeneous Memory Management) is an helper layer for device
that want to mirror a process address space into their own mmu. Main
target is GPU but other hardware, like network device can take also
use HMM.

There is two side to HMM, first one is mirroring of process address
space on behalf of a device. HMM will manage a secondary page table
for the device and keep it synchronize with the CPU page table. HMM
also do DMA mapping on behalf of the device (which would allow new
kind of optimization further down the road (1)).

Second side is allowing to migrate process memory to device memory
where device memory is unmappable by the CPU. Any CPU access will
trigger special fault that will migrate memory back.

>From design point of view not much changed since last patchset (2).
Most of the change are in small details of the API expose to device
driver. This version also include device driver change for Mellanox
hardware to use HMM as an alternative to ODP (which provide a subset
of HMM functionality specificaly for RDMA devices). Long term plan
is to have HMM completely replace ODP.



Why doing this ?

Mirroring a process address space is mandatory with OpenCL 2.0 and
with other GPU compute API. OpenCL 2.0 allow different level of
implementation and currently only the lowest 2 are supported on
Linux. To implement the highest level, where CPU and GPU access
can happen concurently and are cache coherent, HMM is needed, or
something providing same functionality, for instance through
platform hardware.

Hardware solution such as PCIE ATS/PASID is limited to mirroring
system memory and does not provide way to migrate memory to device
memory (which offer significantly more bandwidth up to 10 times
faster than regular system memory with discret GPU, also have
lower latency than PCIE transaction).

Current CPU with GPU on same die (AMD or Intel) use the ATS/PASID
and for Intel a special level of cache (backed by a large pool of
fast memory).

For foreseeable futur, discrete GPU will remain releveant as they
can have a large quantity of faster memory than integrated GPU.

Thus we believe HMM will allow to leverage discret GPU memory in
a transparent fashion to the application, with minimum disruption
to the linux kernel mm code. Also HMM can work along hardware
solution such as PCIE ATS/PASID (leaving regular case to ATS/PASID
while HMM handles the migrated memory case).



Design :

The patch 1, 2, 3 and 4 augment the mmu notifier API with new
informations to more efficiently mirror CPU page table updates.

The first side of HMM, process address space mirroring, is
implemented in patch 5 through 12. This use a secondary page
table, in which HMM mirror memory actively use by the device.
HMM does not take a reference on any of the page, it use the
mmu notifier API to track changes to the CPU page table and to
update the mirror page table. All this while providing a simple
API to device driver.

To implement this we use a "generic" page table and not a radix
tree because we need to store more flags than radix allows and
we need to store dma address (sizeof(dma_addr_t) > sizeof(long)
on some platform). All this is

Patch 14 pass down the lane the new child mm struct of a parent
process being forked. This is necessary to properly handle fork
when parent process have migrated memory (more on that below).

Patch 15 allow to get the current memcg against which anonymous
memory of a process should be accounted. It usefull because in
HMM we do bulk transaction on address space and we wish to avoid
storing a pointer to memcg for each single page. All operation
dealing with memcg happens under the protection of the mmap
semaphore.


Second side of HMM, migration to device memory, is implemented
in patch 16 to 28. This only deal with anonymous memory. A new
special swap type is introduced. Migrated memory will have there
CPU page table entry set to this special swap entry (like the
migration entry but unlike migration this is not a short lived
state).

All the patches are then set of functions that deals with those
special entry in the various code path that might face them.

Memory migration require several steps, first the memory is un-
mapped from CPU and replace with special "locked" entry, HMM
locked entry is a short lived transitional state, this is to
avoid two threads to fight over migration entry.

Once unmapped HMM can determine what can be migrated or not by
comparing mapcount and page count. If something holds a reference
then the page is not migrated and CPU page table is restored.
Next step is to schedule the copy to device memory and update
the CPU page table to regular HMM entry.

Migration back follow the same pattern, replace with special
lock entry, then copy back, then update CPU page table.


(1) Because HMM keeps a secondary page table which keeps track of
    DMA mapping, there is room for new optimization. We want to
    add a new DMA API to allow to manage DMA page table mapping
    at directory level. This would allow to minimize memory
    consumption of mirror page table and also over head of doing
    DMA mapping page per page. This is a future feature we want
    to work on and hope the idea will proove usefull not only to
    HMM users.

(2) Previous patchset posting :
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/


Cheers,
JA(C)rA'me

To: "Andrew Morton" <akpm@linux-foundation.org>,
Cc: <linux-kernel@vger.kernel.org>,
Cc: linux-mm <linux-mm@kvack.org>,
Cc: <linux-fsdevel@vger.kernel.org>,
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
Cc: "Mel Gorman" <mgorman@suse.de>,
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Cc: "Peter Zijlstra" <peterz@infradead.org>,
Cc: "Linda Wang" <lwang@redhat.com>,
Cc: "Kevin E Martin" <kem@redhat.com>,
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
Cc: "Johannes Weiner" <jweiner@redhat.com>,
Cc: "Larry Woodman" <lwoodman@redhat.com>,
Cc: "Rik van Riel" <riel@redhat.com>,
Cc: "Dave Airlie" <airlied@redhat.com>,
Cc: "Jeff Law" <law@redhat.com>,
Cc: "Brendan Conoboy" <blc@redhat.com>,
Cc: "Joe Donohue" <jdonohue@redhat.com>,
Cc: "Duncan Poole" <dpoole@nvidia.com>,
Cc: "Sherry Cheung" <SCheung@nvidia.com>,
Cc: "Subhash Gutti" <sgutti@nvidia.com>,
Cc: "John Hubbard" <jhubbard@nvidia.com>,
Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
Cc: "Lucien Dunning" <ldunning@nvidia.com>,
Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
Cc: "Haggai Eran" <haggaie@mellanox.com>,
Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
Cc: "Sagi Grimberg" <sagig@mellanox.com>
Cc: "Shachar Raindel" <raindel@mellanox.com>,
Cc: "Liran Liss" <liranl@mellanox.com>,
Cc: "Roland Dreier" <roland@purestorage.com>,
Cc: "Sander, Ben" <ben.sander@amd.com>,
Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
Cc: "Bridgman, John" <John.Bridgman@amd.com>,
Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
Cc: "Gabbay, Oded" <Oded.Gabbay@amd.com>,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 01/36] mmu_notifier: add event information to address invalidation v7
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-30  3:43   ` John Hubbard
  2015-05-21 19:31 ` [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3 j.glisse
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

The event information will be useful for new user of mmu_notifier API.
The event argument differentiate between a vma disappearing, a page
being write protected or simply a page being unmaped. This allow new
user to take different path for different event for instance on unmap
the resource used to track a vma are still valid and should stay around.
While if the event is saying that a vma is being destroy it means that any
resources used to track this vma can be free.

Changed since v1:
  - renamed action into event (updated commit message too).
  - simplified the event names and clarified their usage
    also documenting what exceptation the listener can have in
    respect to each event.

Changed since v2:
  - Avoid crazy name.
  - Do not move code that do not need to move.

Changed since v3:
  - Separate hugue page split from mlock/munlock and softdirty.

Changed since v4:
  - Rebase (no other changes).

Changed since v5:
  - Typo fix.
  - Changed zap_page_range from MMU_MUNMAP to MMU_MIGRATE to reflect the
    fact that the address range is still valid just the page backing it
    are no longer.

Changed since v6:
  - try_to_unmap_one() only invalidate when doing migration.
  - Differentiate fork from other case.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/gpu/drm/radeon/radeon_mn.c      |   3 +-
 drivers/infiniband/core/umem_odp.c      |   9 ++-
 drivers/iommu/amd_iommu_v2.c            |   3 +-
 drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
 drivers/xen/gntdev.c                    |   9 ++-
 fs/proc/task_mmu.c                      |   6 +-
 include/linux/mmu_notifier.h            | 135 ++++++++++++++++++++++++++------
 kernel/events/uprobes.c                 |  10 ++-
 mm/huge_memory.c                        |  39 ++++++---
 mm/hugetlb.c                            |  23 +++---
 mm/ksm.c                                |  18 +++--
 mm/madvise.c                            |   4 +-
 mm/memory.c                             |  27 ++++---
 mm/migrate.c                            |   9 ++-
 mm/mmu_notifier.c                       |  28 ++++---
 mm/mprotect.c                           |   6 +-
 mm/mremap.c                             |   6 +-
 mm/rmap.c                               |   4 +-
 virt/kvm/kvm_main.c                     |  12 ++-
 20 files changed, 261 insertions(+), 102 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 4039ede..452e9b1 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -132,7 +132,8 @@ restart:
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index eef006c..3a9615b 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -121,7 +121,8 @@ static void radeon_mn_release(struct mmu_notifier *mn,
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     enum mmu_event event)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 40becdb..6ed69fa 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -165,7 +165,8 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -192,7 +193,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -217,7 +219,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 3465faf..4aa4de6 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -384,7 +384,8 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
 				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
+				unsigned long address,
+				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 8927485..46bc610 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,7 +467,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_event event)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -484,9 +486,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d..58e2390 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -934,11 +934,13 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0,
+							    -1, MMU_ISDIRTY);
 		}
 		walk_page_range(0, ~0UL, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0,
+							  -1, MMU_ISDIRTY);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 61cd67f..8b11b1b 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,70 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ *   - MMU_FORK when a process is forking and as a results various vma needs to
+ *     be write protected to allow for COW.
+ *
+ *   - MMU_HSPLIT huge page split, the memory is the same only the page table
+ *     structure is updated (level added or removed).
+ *
+ *   - MMU_ISDIRTY need to update the dirty bit of the page table so proper
+ *     dirty accounting can happen.
+ *
+ *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
+ *     access must stop after invalidate_range_start callback returns.
+ *     Furthermore, no read access should be allowed either, as a new page can
+ *     be remapped with write access before the invalidate_range_end callback
+ *     happens and thus any read access to old page might read stale data. There
+ *     are several sources for this event, including:
+ *
+ *         - A page moving to swap (various reasons, including page reclaim),
+ *         - An mremap syscall,
+ *         - migration for NUMA reasons,
+ *         - balancing the memory pool,
+ *         - write fault on COW page,
+ *         - and more that are not listed here.
+ *
+ *   - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ *     the new access protection. All memory access are still valid until the
+ *     invalidate_range_end callback.
+ *
+ *   - MMU_MUNLOCK: unlock memory. Content of page table stays the same but
+ *     page are unlocked.
+ *
+ *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
+ *     process destruction). However, access is still allowed, up until the
+ *     invalidate_range_free_pages callback. This also implies that secondary
+ *     page table can be trimmed, because the address range is no longer valid.
+ *
+ *   - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ *     must stop after invalidate_range_start callback returns. Read access are
+ *     still allowed.
+ *
+ *   - MMU_WRITE_PROTECT: memory is being write protected (ie should be mapped
+ *     read only no matter what the vma memory protection allows). All write
+ *     accesses must stop after invalidate_range_start callback returns. Read
+ *     access are still allowed.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not allow the
+ * listener a chance to optimize its events.
+ */
+enum mmu_event {
+	MMU_FORK = 0,
+	MMU_HSPLIT,
+	MMU_ISDIRTY,
+	MMU_MIGRATE,
+	MMU_MPROT,
+	MMU_MUNLOCK,
+	MMU_MUNMAP,
+	MMU_WRITE_BACK,
+	MMU_WRITE_PROTECT,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -82,7 +146,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_event event);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -93,7 +158,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_event event);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -140,10 +206,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_event event);
 
 	/*
 	 * invalidate_range() is either called between
@@ -206,13 +276,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					  unsigned long address,
+					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_event event);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 
@@ -240,31 +317,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, event);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -359,13 +443,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -393,22 +477,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f2..802828a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -176,7 +176,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -194,7 +195,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -208,7 +211,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cb8904c..41c342c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1024,7 +1024,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1058,7 +1059,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1068,7 +1070,8 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1160,7 +1163,8 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	spin_lock(ptl);
 	if (page)
@@ -1192,7 +1196,8 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
 out_unlock:
@@ -1646,7 +1651,8 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_HSPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1662,7 +1668,8 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_HSPLIT);
 
 	return ret;
 }
@@ -2526,7 +2533,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2536,7 +2544,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2933,24 +2942,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54f129d..19da310 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2670,7 +2670,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start,
+						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2724,7 +2725,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 	return ret;
 }
@@ -2750,7 +2752,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -2824,7 +2827,8 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -3003,8 +3007,8 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -3025,7 +3029,8 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3493,7 +3498,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3543,7 +3548,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index bc7be0e..76f167c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_WRITE_PROTECT);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_WRITE_PROTECT);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_WRITE_PROTECT);
 out:
 	return err;
 }
@@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out:
 	return err;
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 22e8f0c..b90ba3d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -405,9 +405,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
 	madvise_free_page_range(&tlb, vma, start, end);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, start, end);
 
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index d1fa0c1..9300fad 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1048,7 +1048,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_FORK);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1065,7 +1065,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
+						  mmun_end, MMU_FORK);
 	return ret;
 }
 
@@ -1335,10 +1336,12 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr,
+					    end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr,
+					  end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1360,10 +1363,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1386,9 +1389,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2086,7 +2089,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_new;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2119,7 +2123,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2158,7 +2162,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 236ee25..ad9a55a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1759,12 +1759,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1818,7 +1820,8 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 3b9b3d0..e51ea02 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -142,8 +142,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -151,13 +153,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -165,13 +168,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_event event)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -179,14 +185,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start,
+							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -204,7 +213,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start,
+						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e7d6f11..a57e8af 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -155,7 +155,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start,
+							    end, MMU_MPROT);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -183,7 +184,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end,
+						  MMU_MPROT);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index a7c93ec..72051cf 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -176,7 +176,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 9c04594..74c51e0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -915,7 +915,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
 		(*cleaned)++;
 	}
 out:
@@ -1338,7 +1338,7 @@ discard:
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
 out:
 	return ret;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f202c40..d0b1060 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -260,7 +260,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -302,7 +303,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -318,7 +320,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -344,7 +347,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
  2015-05-21 19:31 ` [PATCH 01/36] mmu_notifier: add event information to address invalidation v7 j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-27  5:09   ` Aneesh Kumar K.V
  2015-06-02  9:32   ` John Hubbard
  2015-05-21 19:31 ` [PATCH 03/36] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() j.glisse
                   ` (19 subsequent siblings)
  21 siblings, 2 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

The mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
can be considered as forming an "atomic" section for the cpu page table update
point of view. Between this two function the cpu page table content is unreliable
for the address range being invalidated.

Current user such as kvm need to know when they can trust the content of the cpu
page table. This becomes even more important to new users of the mmu_notifier
api (such as HMM or ODP).

This patch use a structure define at all call site to invalidate_range_start()
that is added to a list for the duration of the invalidation. It adds two new
helpers to allow querying if a range is being invalidated or to wait for a range
to become valid.

For proper synchronization, user must block new range invalidation from inside
there invalidate_range_start() callback, before calling the helper functions.
Otherwise there is no garanty that a new range invalidation will not be added
after the call to the helper function to query for existing range.

Changed since v1:
  - Fix a possible deadlock in mmu_notifier_range_wait_valid()

Changed since v2:
  - Add the range to invalid range list before calling ->range_start().
  - Del the range from invalid range list after calling ->range_end().
  - Remove useless list initialization.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |  9 ++--
 drivers/gpu/drm/radeon/radeon_mn.c      | 14 +++---
 drivers/infiniband/core/umem_odp.c      | 16 +++----
 drivers/misc/sgi-gru/grutlbpurge.c      | 15 +++----
 drivers/xen/gntdev.c                    | 15 ++++---
 fs/proc/task_mmu.c                      | 11 +++--
 include/linux/mmu_notifier.h            | 55 ++++++++++++-----------
 kernel/events/uprobes.c                 | 13 +++---
 mm/huge_memory.c                        | 78 ++++++++++++++------------------
 mm/hugetlb.c                            | 55 ++++++++++++-----------
 mm/ksm.c                                | 28 +++++-------
 mm/madvise.c                            | 20 ++++-----
 mm/memory.c                             | 72 +++++++++++++++++-------------
 mm/migrate.c                            | 36 +++++++--------
 mm/mmu_notifier.c                       | 79 ++++++++++++++++++++++++++++-----
 mm/mprotect.c                           | 18 ++++----
 mm/mremap.c                             | 14 +++---
 virt/kvm/kvm_main.c                     | 10 ++---
 18 files changed, 302 insertions(+), 256 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 452e9b1..80fe72a 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -131,16 +131,15 @@ restart:
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+						       const struct mmu_notifier_range *range)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
-	unsigned long next = start;
+	unsigned long next = range->start;
 	unsigned long serial = 0;
+	/* interval ranges are inclusive, but invalidate range is exclusive */
+	unsigned long end = range->end - 1, start = range->start;
 
-	end--; /* interval ranges are inclusive, but invalidate range is exclusive */
 	while (next < end) {
 		struct drm_i915_gem_object *obj = NULL;
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index 3a9615b..24898bf 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -112,34 +112,30 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  *
  * @mn: our notifier
  * @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: Address range information.
  *
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long start,
-					     unsigned long end,
-					     enum mmu_event event)
+					     const struct mmu_notifier_range *range)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
-
 	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
+	unsigned long end = range->end - 1;
 
 	mutex_lock(&rmn->lock);
 
-	it = interval_tree_iter_first(&rmn->objects, start, end);
+	it = interval_tree_iter_first(&rmn->objects, range->start, end);
 	while (it) {
 		struct radeon_mn_node *node;
 		struct radeon_bo *bo;
 		long r;
 
 		node = container_of(it, struct radeon_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
+		it = interval_tree_iter_next(it, range->start, end);
 
 		list_for_each_entry(bo, &node->bos, mn_list) {
 
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6ed69fa..8f7f845 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -192,9 +192,7 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+						    const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -203,8 +201,8 @@ static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	ib_ucontext_notifier_start_account(context);
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_start_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -218,9 +216,7 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+						  const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -228,8 +224,8 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		return;
 
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_end_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..44b41b7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,8 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end,
-				       enum mmu_event event)
+				       const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -230,14 +229,13 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	STAT(mmu_invalidate_range);
 	atomic_inc(&gms->ms_range_active);
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
-		start, end, atomic_read(&gms->ms_range_active));
-	gru_flush_tlb_range(gms, start, end - start);
+		range->start, range->end, atomic_read(&gms->ms_range_active));
+	gru_flush_tlb_range(gms, range->start, range->end - range->start);
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event)
+				     struct mm_struct *mm,
+				     const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -246,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 	(void)atomic_dec_and_test(&gms->ms_range_active);
 
 	wake_up_all(&gms->ms_wait_queue);
-	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
+	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms,
+		range->start, range->end);
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 46bc610..0e8aa12 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,19 +467,17 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start,
-				unsigned long end,
-				enum mmu_event event)
+				const struct mmu_notifier_range *range)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
 	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	mutex_unlock(&priv->lock);
 }
@@ -489,7 +487,12 @@ static void mn_invl_page(struct mmu_notifier *mn,
 			 unsigned long address,
 			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+	struct mmu_notifier_range range;
+
+	range.start = address;
+	range.end = address + PAGE_SIZE;
+	range.event = event;
+	mn_invl_range_start(mn, mm, &range);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 58e2390..1c7a2c3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -908,6 +908,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.mm = mm,
 			.private = &cp,
 		};
+		struct mmu_notifier_range range = {
+			.start = 0,
+			.end = -1UL,
+			.event = MMU_ISDIRTY,
+		};
 
 		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
 			/*
@@ -934,13 +939,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0,
-							    -1, MMU_ISDIRTY);
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 		walk_page_range(0, ~0UL, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0,
-							  -1, MMU_ISDIRTY);
+			mmu_notifier_invalidate_range_end(mm, &range);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 8b11b1b..ada3ed1 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -73,6 +73,13 @@ enum mmu_event {
 	MMU_WRITE_PROTECT,
 };
 
+struct mmu_notifier_range {
+	struct list_head list;
+	unsigned long start;
+	unsigned long end;
+	enum mmu_event event;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -86,6 +93,12 @@ struct mmu_notifier_mm {
 	struct hlist_head list;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
+	/* List of all active range invalidations. */
+	struct list_head ranges;
+	/* Number of active range invalidations. */
+	int nranges;
+	/* For threads waiting on range invalidations. */
+	wait_queue_head_t wait_queue;
 };
 
 struct mmu_notifier_ops {
@@ -206,14 +219,10 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start,
-				       unsigned long end,
-				       enum mmu_event event);
+				       const struct mmu_notifier_range *range);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event);
+				     const struct mmu_notifier_range *range);
 
 	/*
 	 * invalidate_range() is either called between
@@ -283,15 +292,17 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event);
+						  struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						unsigned long start,
-						unsigned long end,
-						enum mmu_event event);
+						struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
+extern bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end);
+extern void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long end);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -334,21 +345,17 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+						       struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, event);
+		__mmu_notifier_invalidate_range_start(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+						     struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, event);
+		__mmu_notifier_invalidate_range_end(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -490,16 +497,12 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+						       struct mmu_notifier_range *range)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+						     struct mmu_notifier_range *range)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 802828a..b7f7f6b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -164,9 +164,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *ptep;
 	int err;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = addr;
-	const unsigned long mmun_end   = addr + PAGE_SIZE;
+	struct mmu_notifier_range range;
 	struct mem_cgroup *memcg;
 
 	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
@@ -176,8 +174,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -211,8 +211,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41c342c..77f78a8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -983,8 +983,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	pmd_t _pmd;
 	int ret = 0, i;
 	struct page **pages;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
 			GFP_KERNEL);
@@ -1022,10 +1021,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1059,8 +1058,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1070,8 +1068,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1090,9 +1087,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL, *new_page;
 	struct mem_cgroup *memcg;
 	unsigned long haddr;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 	gfp_t huge_gfp;			/* for allocation and charge */
+	struct mmu_notifier_range range;
 
 	ptl = pmd_lockptr(mm, pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1161,10 +1157,10 @@ alloc:
 		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	spin_lock(ptl);
 	if (page)
@@ -1196,8 +1192,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return ret;
 out_unlock:
@@ -1647,12 +1642,12 @@ static int __split_huge_page_splitting(struct page *page,
 	spinlock_t *ptl;
 	pmd_t *pmd;
 	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_HSPLIT);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_HSPLIT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1668,8 +1663,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_HSPLIT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return ret;
 }
@@ -2485,8 +2479,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	int isolated;
 	unsigned long hstart, hend;
 	struct mem_cgroup *memcg;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	gfp_t gfp;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -2531,10 +2524,10 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pte = pte_offset_map(pmd, address);
 	pte_ptl = pte_lockptr(mm, pmd);
 
-	mmun_start = address;
-	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2544,8 +2537,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2934,36 +2926,32 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 19da310..2472f54 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2661,17 +2661,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	int ret = 0;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
-	mmun_start = vma->vm_start;
-	mmun_end = vma->vm_end;
+	range.start = vma->vm_start;
+	range.end = vma->vm_end;
+	range.event = MMU_MIGRATE;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start,
-						    mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_start(src, &range);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2711,8 +2710,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else {
 			if (cow) {
 				huge_ptep_set_wrprotect(src, addr, src_pte);
-				mmu_notifier_invalidate_range(src, mmun_start,
-								   mmun_end);
+				mmu_notifier_invalidate_range(src, range.start,
+								   range.end);
 			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -2725,8 +2724,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src, &range);
 
 	return ret;
 }
@@ -2744,16 +2742,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	const unsigned long mmun_start = start;	/* For mmu_notifiers */
-	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MIGRATE;
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -2827,8 +2826,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2925,8 +2923,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int ret = 0, outside_reserve = 0;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	old_page = pte_page(pte);
 
@@ -3005,10 +3002,11 @@ retry_avoidcopy:
 	__SetPageUptodate(new_page);
 	set_page_huge_active(new_page);
 
-	mmun_start = address & huge_page_mask(h);
-	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = address & huge_page_mask(h);
+	range.end = range.start + huge_page_size(h);
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
+
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -3020,7 +3018,7 @@ retry_avoidcopy:
 
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
 		page_remove_rmap(old_page);
@@ -3029,8 +3027,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3494,11 +3491,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long pages = 0;
+	struct mmu_notifier_range range;
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MPROT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3548,7 +3549,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 76f167c..bc292ea 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -855,14 +855,13 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
+	struct mmu_notifier_range range;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr;
 	pte_t *ptep;
 	spinlock_t *ptl;
 	int swapped;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -870,10 +869,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	BUG_ON(PageTransCompound(page));
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_WRITE_PROTECT);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_WRITE_PROTECT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -913,8 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_WRITE_PROTECT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
@@ -937,8 +935,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	spinlock_t *ptl;
 	unsigned long addr;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -948,10 +945,10 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd)
 		goto out;
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -976,8 +973,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index b90ba3d..b363fe2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -383,8 +383,8 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
 static int madvise_free_single_vma(struct vm_area_struct *vma,
 			unsigned long start_addr, unsigned long end_addr)
 {
-	unsigned long start, end;
 	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range = {.event = MMU_MUNMAP};
 	struct mmu_gather tlb;
 
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
@@ -394,21 +394,21 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	if (vma->vm_file)
 		return -EINVAL;
 
-	start = max(vma->vm_start, start_addr);
-	if (start >= vma->vm_end)
+	range.start = max(vma->vm_start, start_addr);
+	if (range.start >= vma->vm_end)
 		return -EINVAL;
-	end = min(vma->vm_end, end_addr);
-	if (end <= vma->vm_start)
+	range.end = min(vma->vm_end, end_addr);
+	if (range.end <= vma->vm_start)
 		return -EINVAL;
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, start, end);
+	tlb_gather_mmu(&tlb, mm, range.start, range.end);
 	update_hiwater_rss(mm);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
-	madvise_free_page_range(&tlb, vma, start, end);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
-	tlb_finish_mmu(&tlb, start, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	madvise_free_page_range(&tlb, vma, range.start, range.end);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, range.start, range.end);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 9300fad..5a1131f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1009,8 +1009,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	bool is_cow;
 	int ret;
 
@@ -1044,11 +1043,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * is_cow_mapping() returns true.
 	 */
 	is_cow = is_cow_mapping(vma->vm_flags);
-	mmun_start = addr;
-	mmun_end   = end;
+	range.start = addr;
+	range.end = end;
+	range.event = MMU_FORK;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end, MMU_FORK);
+		mmu_notifier_invalidate_range_start(src_mm, &range);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1065,8 +1064,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
-						  mmun_end, MMU_FORK);
+		mmu_notifier_invalidate_range_end(src_mm, &range);
 	return ret;
 }
 
@@ -1335,13 +1333,16 @@ void unmap_vmas(struct mmu_gather *tlb,
 		unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range = {
+		.start = start_addr,
+		.end = end_addr,
+		.event = MMU_MUNMAP,
+	};
 
-	mmu_notifier_invalidate_range_start(mm, start_addr,
-					    end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr,
-					  end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, &range);
 }
 
 /**
@@ -1358,16 +1359,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = start + size;
+	struct mmu_notifier_range range = {
+		.start = start,
+		.end = start + size,
+		.event = MMU_MIGRATE,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, start, end);
+	tlb_gather_mmu(&tlb, mm, start, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
-		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
-	tlb_finish_mmu(&tlb, start, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+		unmap_single_vma(&tlb, vma, start, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, start, range.end);
 }
 
 /**
@@ -1384,15 +1389,19 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = address + size;
+	struct mmu_notifier_range range = {
+		.start = address,
+		.end = address + size,
+		.event = MMU_MUNMAP,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, address, end);
+	tlb_gather_mmu(&tlb, mm, address, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
-	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
-	tlb_finish_mmu(&tlb, address, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	unmap_single_vma(&tlb, vma, address, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, address, range.end);
 }
 
 /**
@@ -2000,6 +2009,7 @@ static inline int wp_page_reuse(struct mm_struct *mm,
 	__releases(ptl)
 {
 	pte_t entry;
+
 	/*
 	 * Clear the pages cpupid information as the existing
 	 * information potentially belongs to a now completely
@@ -2067,9 +2077,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl = NULL;
 	pte_t entry;
 	int page_copied = 0;
-	const unsigned long mmun_start = address & PAGE_MASK;	/* For mmu_notifiers */
-	const unsigned long mmun_end = mmun_start + PAGE_SIZE;	/* For mmu_notifiers */
 	struct mem_cgroup *memcg;
+	struct mmu_notifier_range range;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -2089,8 +2098,10 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_new;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2162,8 +2173,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index ad9a55a..7edaa25 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1721,10 +1721,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int isolated = 0;
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
-	unsigned long mmun_start = address & HPAGE_PMD_MASK;
-	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 	pmd_t orig_entry;
 
+	range.start = address & HPAGE_PMD_MASK;
+	range.end = range.start + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
@@ -1746,7 +1749,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	}
 
 	if (mm_tlb_flush_pending(mm))
-		flush_tlb_range(vma, mmun_start, mmun_end);
+		flush_tlb_range(vma, range.start, range.end);
 
 	/* Prepare a page as a migration target */
 	__SetPageLocked(new_page);
@@ -1759,14 +1762,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1799,17 +1800,17 @@ fail_putback:
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
-	pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
-	set_pmd_at(mm, mmun_start, pmd, entry);
-	flush_tlb_range(vma, mmun_start, mmun_end);
+	flush_cache_range(vma, range.start, range.end);
+	page_add_anon_rmap(new_page, vma, range.start);
+	pmdp_huge_clear_flush_notify(vma, range.start, pmd);
+	set_pmd_at(mm, range.start, pmd, entry);
+	flush_tlb_range(vma, range.start, range.end);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	if (page_count(page) != 2) {
-		set_pmd_at(mm, mmun_start, pmd, orig_entry);
-		flush_tlb_range(vma, mmun_start, mmun_end);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		set_pmd_at(mm, range.start, pmd, orig_entry);
+		flush_tlb_range(vma, range.start, range.end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		update_mmu_cache_pmd(vma, address, &entry);
 		page_remove_rmap(new_page);
 		goto fail_putback;
@@ -1820,8 +1821,7 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -1846,7 +1846,7 @@ out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, mmun_start, pmd, entry);
+		set_pmd_at(mm, range.start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index e51ea02..294ebc4 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,28 +174,28 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   unsigned long start,
-					   unsigned long end,
-					   enum mmu_event event)
+					   struct mmu_notifier_range *range)
 
 {
 	struct mmu_notifier *mn;
 	int id;
 
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
+	mm->mmu_notifier_mm->nranges++;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start,
-							end, event);
+			mn->ops->invalidate_range_start(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 unsigned long start,
-					 unsigned long end,
-					 enum mmu_event event)
+					 struct mmu_notifier_range *range)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -211,12 +211,23 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		 * (besides the pointer check).
 		 */
 		if (mn->ops->invalidate_range)
-			mn->ops->invalidate_range(mn, mm, start, end);
+			mn->ops->invalidate_range(mn, mm,
+						  range->start, range->end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start,
-						      end, event);
+			mn->ops->invalidate_range_end(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_del_init(&range->list);
+	mm->mmu_notifier_mm->nranges--;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
+	/*
+	 * Wakeup after callback so they can do their job before any of the
+	 * waiters resume.
+	 */
+	wake_up(&mm->mmu_notifier_mm->wait_queue);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
@@ -235,6 +246,49 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
 
+static bool mmu_notifier_range_is_valid_locked(struct mm_struct *mm,
+					       unsigned long start,
+					       unsigned long end)
+{
+	struct mmu_notifier_range *range;
+
+	list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
+		if (!(range->end <= start || range->start >= end))
+			return false;
+	}
+	return true;
+}
+
+bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+				 unsigned long start,
+				 unsigned long end)
+{
+	bool valid;
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	valid = mmu_notifier_range_is_valid_locked(mm, start, end);
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+	return valid;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_is_valid);
+
+void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+				   unsigned long start,
+				   unsigned long end)
+{
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	while (!mmu_notifier_range_is_valid_locked(mm, start, end)) {
+		int nranges = mm->mmu_notifier_mm->nranges;
+
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+		wait_event(mm->mmu_notifier_mm->wait_queue,
+			   nranges != mm->mmu_notifier_mm->nranges);
+		spin_lock(&mm->mmu_notifier_mm->lock);
+	}
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_valid);
+
 static int do_mmu_notifier_register(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
 				    int take_mmap_sem)
@@ -264,6 +318,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
 	if (!mm_has_notifiers(mm)) {
 		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
 		spin_lock_init(&mmu_notifier_mm->lock);
+		INIT_LIST_HEAD(&mmu_notifier_mm->ranges);
+		mmu_notifier_mm->nranges = 0;
+		init_waitqueue_head(&mmu_notifier_mm->wait_queue);
 
 		mm->mmu_notifier_mm = mmu_notifier_mm;
 		mmu_notifier_mm = NULL;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a57e8af..0c394db 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -142,7 +142,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	unsigned long next;
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	unsigned long mni_start = 0;
+	struct mmu_notifier_range range = {
+		.start = 0,
+	};
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -153,10 +155,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			continue;
 
 		/* invoke the mmu notifier if the pmd is populated */
-		if (!mni_start) {
-			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start,
-							    end, MMU_MPROT);
+		if (!range.start) {
+			range.start = addr;
+			range.end = end;
+			range.event = MMU_MPROT;
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -183,9 +186,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pages += this_pages;
 	} while (pmd++, addr = next, addr != end);
 
-	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end,
-						  MMU_MPROT);
+	if (range.start)
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 72051cf..03fb4e5 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -166,18 +166,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		bool need_rmap_locks)
 {
 	unsigned long extent, next, old_end;
+	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
 	bool need_flush = false;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
-	mmun_start = old_addr;
-	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = old_addr;
+	range.end = old_end;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(vma->vm_mm, &range);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -229,8 +228,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, &range);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0b1060..6177c56 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -319,9 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+						    const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -334,7 +332,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_notifier_count++;
-	need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
 	need_tlb_flush |= kvm->tlbs_dirty;
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
@@ -346,9 +344,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+						  const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 03/36] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page()
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
  2015-05-21 19:31 ` [PATCH 01/36] mmu_notifier: add event information to address invalidation v7 j.glisse
  2015-05-21 19:31 ` [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3 j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-27  5:17   ` Aneesh Kumar K.V
  2015-06-03  4:25   ` John Hubbard
  2015-05-21 19:31 ` [PATCH 04/36] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier j.glisse
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

Listener of mm event might not have easy way to get the struct page
behind and address invalidated with mmu_notifier_invalidate_page()
function as this happens after the cpu page table have been clear/
updated. This happens for instance if the listener is storing a dma
mapping inside its secondary page table. To avoid complex reverse
dma mapping lookup just pass along a pointer to the page being
invalidated.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 drivers/infiniband/core/umem_odp.c | 1 +
 drivers/iommu/amd_iommu_v2.c       | 1 +
 drivers/misc/sgi-gru/grutlbpurge.c | 1 +
 drivers/xen/gntdev.c               | 1 +
 include/linux/mmu_notifier.h       | 6 +++++-
 mm/mmu_notifier.c                  | 3 ++-
 mm/rmap.c                          | 4 ++--
 virt/kvm/kvm_main.c                | 1 +
 8 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 8f7f845..d10dd88 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -166,6 +166,7 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long address,
+					     struct page *page,
 					     enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 4aa4de6..de3c540 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -385,6 +385,7 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
 			       unsigned long address,
+			       struct page *page,
 			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 44b41b7..c7659b76 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -250,6 +250,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
 				unsigned long address,
+				struct page *page,
 				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 0e8aa12..90693ce 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -485,6 +485,7 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
 			 unsigned long address,
+			 struct page *page,
 			 enum mmu_event event)
 {
 	struct mmu_notifier_range range;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index ada3ed1..283ad26 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -172,6 +172,7 @@ struct mmu_notifier_ops {
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
 				unsigned long address,
+				struct page *page,
 				enum mmu_event event);
 
 	/*
@@ -290,6 +291,7 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address,
+					  struct page *page,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 						  struct mmu_notifier_range *range);
@@ -338,10 +340,11 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 						unsigned long address,
+						struct page *page,
 						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address, event);
+		__mmu_notifier_invalidate_page(mm, address, page, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -492,6 +495,7 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 						unsigned long address,
+						struct page *page,
 						enum mmu_event event)
 {
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 294ebc4..2ff6d43 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -160,6 +160,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 				    unsigned long address,
+				    struct page *page,
 				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
@@ -168,7 +169,7 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address, event);
+			mn->ops->invalidate_page(mn, mm, address, page, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 74c51e0..4563edc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -915,7 +915,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
+		mmu_notifier_invalidate_page(mm, address, page, MMU_WRITE_BACK);
 		(*cleaned)++;
 	}
 out:
@@ -1338,7 +1338,7 @@ discard:
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
+		mmu_notifier_invalidate_page(mm, address, page, MMU_MIGRATE);
 out:
 	return ret;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6177c56..62978ed 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -261,6 +261,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long address,
+					     struct page *page,
 					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 04/36] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (2 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 03/36] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 05/36] HMM: introduce heterogeneous memory management v3 j.glisse
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This patch allow to invalidate a range while excluding call to a specific
mmu_notifier which allow for a subsystem to invalidate a range for everyone
but itself.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/mmu_notifier.h | 60 +++++++++++++++++++++++++++++++++++++++-----
 mm/mmu_notifier.c            | 16 +++++++++---
 2 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 283ad26..867ca06 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -294,11 +294,15 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  struct page *page,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						  struct mmu_notifier_range *range);
+						  struct mmu_notifier_range *range,
+						  const struct mmu_notifier *exclude);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						struct mmu_notifier_range *range);
+						struct mmu_notifier_range *range,
+						const struct mmu_notifier *exclude);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+					    unsigned long start,
+					    unsigned long end,
+					    const struct mmu_notifier *exclude);
 extern bool mmu_notifier_range_is_valid(struct mm_struct *mm,
 					unsigned long start,
 					unsigned long end);
@@ -351,21 +355,46 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 						       struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, range);
+		__mmu_notifier_invalidate_range_start(mm, range, NULL);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 						     struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, range);
+		__mmu_notifier_invalidate_range_end(mm, range, NULL);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range(mm, start, end);
+		__mmu_notifier_invalidate_range(mm, start, end, NULL);
+}
+
+static inline void mmu_notifier_invalidate_range_start_excluding(struct mm_struct *mm,
+						struct mmu_notifier_range *range,
+						const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_start(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(struct mm_struct *mm,
+							struct mmu_notifier_range *range,
+							const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_end(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(struct mm_struct *mm,
+						unsigned long start,
+						unsigned long end,
+						const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range(mm, start, end, exclude);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -515,6 +544,25 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 {
 }
 
+static inline void mmu_notifier_invalidate_range_start_excluding(struct mm_struct *mm,
+						struct mmu_notifier_range *range,
+						const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(struct mm_struct *mm,
+							struct mmu_notifier_range *range,
+							const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(struct mm_struct *mm,
+						unsigned long start,
+						unsigned long end,
+						const struct mmu_notifier *exclude)
+{
+}
+
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 2ff6d43..43172c6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -175,7 +175,8 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   struct mmu_notifier_range *range)
+					   struct mmu_notifier_range *range,
+					   const struct mmu_notifier *exclude)
 
 {
 	struct mmu_notifier *mn;
@@ -188,6 +189,8 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		if (mn->ops->invalidate_range_start)
 			mn->ops->invalidate_range_start(mn, mm, range);
 	}
@@ -196,13 +199,16 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 struct mmu_notifier_range *range)
+					 struct mmu_notifier_range *range,
+					 const struct mmu_notifier *exclude)
 {
 	struct mmu_notifier *mn;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		/*
 		 * Call invalidate_range here too to avoid the need for the
 		 * subsystem of having to register an invalidate_range_end
@@ -233,13 +239,17 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
 void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+				     unsigned long start,
+				     unsigned long end,
+				     const struct mmu_notifier *exclude)
 {
 	struct mmu_notifier *mn;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 	}
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (3 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 04/36] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-27  5:50   ` Aneesh Kumar K.V
  2015-06-08 19:40   ` Mark Hairgrove
  2015-05-21 19:31 ` [PATCH 06/36] HMM: add HMM page table v2 j.glisse
                   ` (16 subsequent siblings)
  21 siblings, 2 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar, linux-rdma

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This patch only introduce core HMM functions for registering a new
mirror and stopping a mirror as well as HMM device registering and
unregistering.

The lifecycle of HMM object is handled differently then the one of
mmu_notifier because unlike mmu_notifier there can be concurrent
call from both mm code to HMM code and/or from device driver code
to HMM code. Moreover lifetime of HMM can be uncorrelated from the
lifetime of the process that is being mirror (GPU might take longer
time to cleanup).

Changed since v1:
  - Updated comment of hmm_device_register().

Changed since v2:
  - Expose struct hmm for easy access to mm struct.
  - Simplify hmm_mirror_register() arguments.
  - Removed the device name.
  - Refcount the mirror struct internaly to HMM allowing to get
    rid of the srcu and making the device driver callback error
    handling simpler.
  - Safe to call several time hmm_mirror_unregister().
  - Rework the mmu_notifier unregistration and release callback.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
cc: <linux-rdma@vger.kernel.org>
---
 MAINTAINERS              |   7 +
 include/linux/hmm.h      | 164 +++++++++++++++++++++
 include/linux/mm.h       |  11 ++
 include/linux/mm_types.h |  14 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |  15 ++
 mm/Makefile              |   1 +
 mm/hmm.c                 | 370 +++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 584 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 78ea7b6..2f2a2be 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4730,6 +4730,13 @@ F:	include/uapi/linux/if_hippi.h
 F:	net/802/hippi.c
 F:	drivers/net/hippi/
 
+HMM - Heterogeneous Memory Management
+M:	JA(C)rA'me Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm.c
+F:	include/linux/hmm.h
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	hostap@shmoo.com (subscribers-only)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..175a757
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,164 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/* This is a heterogeneous memory management (hmm). In a nutshell this provide
+ * an API to mirror a process address on a device which has its own mmu using
+ * its own page table for the process. It supports everything except special
+ * vma.
+ *
+ * Mandatory hardware features :
+ *   - An mmu with pagetable.
+ *   - Read only flag per cpu page.
+ *   - Page fault ie hardware must stop and wait for kernel to service fault.
+ *
+ * Optional hardware features :
+ *   - Dirty bit per cpu page.
+ *   - Access bit per cpu page.
+ *
+ * The hmm code handle all the interfacing with the core kernel mm code and
+ * provide a simple API. It does support migrating system memory to device
+ * memory and handle migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swaped from cpu and core mm code point of
+ * view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_mirror;
+struct hmm;
+
+
+/* hmm_device - Each device must register one and only one hmm_device.
+ *
+ * The hmm_device is the link btw HMM and each device driver.
+ */
+
+/* struct hmm_device_operations - HMM device operation callback
+ */
+struct hmm_device_ops {
+	/* release() - mirror must stop using the address space.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 *
+	 * When this is call, device driver must kill all device thread using
+	 * this mirror. Also, this callback is the last thing call by HMM and
+	 * HMM will not access the mirror struct after this call (ie no more
+	 * dereference of it so it is safe for the device driver to free it).
+	 * It is call either from :
+	 *   - mm dying (all process using this mm exiting).
+	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
+	 *   - outcome of some device error reported by any of the device
+	 *     callback against that mirror.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+};
+
+
+/* struct hmm - per mm_struct HMM states.
+ *
+ * @mm: The mm struct this hmm is associated with.
+ * @mirrors: List of all mirror for this mm (one per device).
+ * @vm_end: Last valid address for this mm (exclusive).
+ * @kref: Reference counter.
+ * @rwsem: Serialize the mirror list modifications.
+ * @mmu_notifier: The mmu_notifier of this mm.
+ * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each devices the change made to
+ * the process address space.
+ *
+ * Device driver must not access this structure other than for getting the
+ * mm pointer.
+ */
+struct hmm {
+	struct mm_struct	*mm;
+	struct hlist_head	mirrors;
+	unsigned long		vm_end;
+	struct kref		kref;
+	struct rw_semaphore	rwsem;
+	struct mmu_notifier	mmu_notifier;
+	struct rcu_head		rcu;
+};
+
+
+/* struct hmm_device - per device HMM structure
+ *
+ * @dev: Linux device structure pointer.
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @mutex: Mutex protecting mirrors list.
+ *
+ * Each device that want to mirror an address space must register one of this
+ * struct (only once per linux device).
+ */
+struct hmm_device {
+	struct device			*dev;
+	const struct hmm_device_ops	*ops;
+	struct list_head		mirrors;
+	struct mutex			mutex;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirror a process has a uniq hmm_mirror struct associating
+ * the process address space with the device. Same process can be mirrored by
+ * several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm HMM structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @kref: Reference counter (private to HMM do not use).
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ *
+ * Each device that want to mirror an address space must register one of this
+ * struct for each of the address space it wants to mirror. Same device can
+ * mirror several different address space. As well same address space can be
+ * mirror by different devices.
+ */
+struct hmm_mirror {
+	struct hmm_device	*device;
+	struct hmm		*hmm;
+	struct kref		kref;
+	struct list_head	dlist;
+	struct hlist_node	mlist;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2923a51..cf642d9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2199,5 +2199,16 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+	mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7..4494f7f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,10 @@
 #include <asm/page.h>
 #include <asm/mmu.h>
 
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -451,6 +455,16 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_HMM
+	/*
+	 * hmm always register an mmu_notifier we rely on mmu notifier to keep
+	 * refcount on mm struct as well as forbiding registering hmm on a
+	 * dying mm
+	 *
+	 * This field is set with mmap_sem old in write mode.
+	 */
+	struct hmm *hmm;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 0e0ae9a..4083be7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -597,6 +598,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
+	hmm_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 52ffb86..189e48f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -653,3 +653,18 @@ config DEFERRED_STRUCT_PAGE_INIT
 	  when kswapd starts. This has a potential performance impact on
 	  processes running early in the lifetime of the systemm until kswapd
 	  finishes the initialisation.
+
+if STAGING
+config HMM
+	bool "Enable heterogeneous memory management (HMM)"
+	depends on MMU
+	select MMU_NOTIFIER
+	select GENERIC_PAGE_TABLE
+	default n
+	help
+	  Heterogeneous memory management provide infrastructure for a device
+	  to mirror a process address space into an hardware mmu or into any
+	  things supporting pagefault like event.
+
+	  If unsure, say N to disable hmm.
+endif # STAGING
diff --git a/mm/Makefile b/mm/Makefile
index 98c4eae..90ca9c4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..e684dd0
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,370 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intend
+ * to provide helper for mirroring a process address space on a device as well
+ * as allowing migration of data between system memory and device memory refer
+ * as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+
+#include "internal.h"
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
+static inline void hmm_mirror_unref(struct hmm_mirror **mirror);
+
+
+/* hmm - core HMM functions.
+ *
+ * Core HMM functions that deal with all the process mm activities.
+ */
+
+static int hmm_init(struct hmm *hmm)
+{
+	hmm->mm = current->mm;
+	hmm->vm_end = TASK_SIZE;
+	kref_init(&hmm->kref);
+	INIT_HLIST_HEAD(&hmm->mirrors);
+	init_rwsem(&hmm->rwsem);
+
+	/* register notifier */
+	hmm->mmu_notifier.ops = &hmm_notifier_ops;
+	return __mmu_notifier_register(&hmm->mmu_notifier, current->mm);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+	struct hmm_mirror *tmp;
+
+	down_write(&hmm->rwsem);
+	hlist_for_each_entry(tmp, &hmm->mirrors, mlist)
+		if (tmp->device == mirror->device) {
+			/* Same device can mirror only once. */
+			up_write(&hmm->rwsem);
+			return -EINVAL;
+		}
+	hlist_add_head(&mirror->mlist, &hmm->mirrors);
+	hmm_mirror_ref(mirror);
+	up_write(&hmm->rwsem);
+
+	return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+	if (!hmm || !kref_get_unless_zero(&hmm->kref))
+		return NULL;
+	return hmm;
+}
+
+static void hmm_destroy_delayed(struct rcu_head *rcu)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(rcu, struct hmm, rcu);
+	kfree(hmm);
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(kref, struct hmm, kref);
+	BUG_ON(!hlist_empty(&hmm->mirrors));
+
+	down_write(&hmm->mm->mmap_sem);
+	/* A new hmm might have been register before reaching that point. */
+	if (hmm->mm->hmm == hmm)
+		hmm->mm->hmm = NULL;
+	up_write(&hmm->mm->mmap_sem);
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
+
+	mmu_notifier_call_srcu(&hmm->rcu, &hmm_destroy_delayed);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+	if (hmm)
+		kref_put(&hmm->kref, hmm_destroy);
+	return NULL;
+}
+
+
+/* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
+ *
+ * HMM use use mmu notifier to track change made to process address space.
+ */
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	hmm = hmm_ref(container_of(mn, struct hmm, mmu_notifier));
+	if (!hmm)
+		return;
+
+	down_write(&hmm->rwsem);
+	while (hmm->mirrors.first) {
+		struct hmm_mirror *mirror;
+
+		/*
+		 * Here we are holding the mirror reference from the mirror
+		 * list. As list removal is synchronized through spinlock, no
+		 * other thread can assume it holds that reference.
+		 */
+		mirror = hlist_entry(hmm->mirrors.first,
+				     struct hmm_mirror,
+				     mlist);
+		hlist_del_init(&mirror->mlist);
+		up_write(&hmm->rwsem);
+
+		hmm_mirror_unref(&mirror);
+
+		down_write(&hmm->rwsem);
+	}
+	up_write(&hmm->rwsem);
+
+	hmm_unref(hmm);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+	.release		= hmm_notifier_release,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirror a process has a uniq hmm_mirror struct. A process
+ * can be mirror by several devices at the same time.
+ *
+ * Below are all the functions and their helpers use by device driver to mirror
+ * the process address space. Those functions either deals with updating the
+ * device page table (through hmm callback). Or provide helper functions use by
+ * the device driver to fault in range of memory in the device page table.
+ */
+static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+	if (!mirror || !kref_get_unless_zero(&mirror->kref))
+		return NULL;
+	return mirror;
+}
+
+static void hmm_mirror_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+	struct hmm_mirror *mirror;
+	struct hmm *hmm;
+
+	mirror = container_of(kref, struct hmm_mirror, kref);
+	device = mirror->device;
+	hmm = mirror->hmm;
+
+	mutex_lock(&device->mutex);
+	list_del_init(&mirror->dlist);
+	device->ops->release(mirror);
+	mutex_unlock(&device->mutex);
+}
+
+static inline void hmm_mirror_unref(struct hmm_mirror **mirror)
+{
+	struct hmm_mirror *tmp = mirror ? *mirror : NULL;
+
+	if (tmp) {
+		*mirror = NULL;
+		kref_put(&tmp->kref, hmm_mirror_destroy);
+	}
+}
+
+/* hmm_mirror_register() - register mirror against current process for a device.
+ *
+ * @mirror: The mirror struct being registered.
+ * Returns: 0 on success or -ENOMEM, -EINVAL on error.
+ *
+ * Call when device driver want to start mirroring a process address space. The
+ * HMM shim will register mmu_notifier and start monitoring process address
+ * space changes. Hence callback to device driver might happen even before this
+ * function return.
+ *
+ * The task device driver want to mirror must be current !
+ *
+ * Only one mirror per mm and hmm_device can be created, it will return NULL if
+ * the hmm_device already has an hmm_mirror for the the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror)
+{
+	struct mm_struct *mm = current->mm;
+	struct hmm *hmm = NULL;
+	int ret = 0;
+
+	/* Sanity checks. */
+	BUG_ON(!mirror);
+	BUG_ON(!mirror->device);
+	BUG_ON(!mm);
+
+	/*
+	 * Initialize the mirror struct fields, the mlist init and del dance is
+	 * necessary to make the error path easier for driver and for hmm.
+	 */
+	kref_init(&mirror->kref);
+	INIT_HLIST_NODE(&mirror->mlist);
+	INIT_LIST_HEAD(&mirror->dlist);
+	mutex_lock(&mirror->device->mutex);
+	list_add(&mirror->dlist, &mirror->device->mirrors);
+	mutex_unlock(&mirror->device->mutex);
+
+	down_write(&mm->mmap_sem);
+
+	hmm = mm->hmm ? hmm_ref(hmm) : NULL;
+	if (hmm == NULL) {
+		/* no hmm registered yet so register one */
+		hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+		if (hmm == NULL) {
+			up_write(&mm->mmap_sem);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		ret = hmm_init(hmm);
+		if (ret) {
+			up_write(&mm->mmap_sem);
+			kfree(hmm);
+			goto error;
+		}
+
+		mm->hmm = hmm;
+	}
+
+	mirror->hmm = hmm;
+	ret = hmm_add_mirror(hmm, mirror);
+	up_write(&mm->mmap_sem);
+	if (ret) {
+		mirror->hmm = NULL;
+		hmm_unref(hmm);
+		goto error;
+	}
+	return 0;
+
+error:
+	mutex_lock(&mirror->device->mutex);
+	list_del_init(&mirror->dlist);
+	mutex_unlock(&mirror->device->mutex);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+static void hmm_mirror_kill(struct hmm_mirror *mirror)
+{
+	down_write(&mirror->hmm->rwsem);
+	if (!hlist_unhashed(&mirror->mlist)) {
+		hlist_del_init(&mirror->mlist);
+		up_write(&mirror->hmm->rwsem);
+
+		hmm_mirror_unref(&mirror);
+	} else
+		up_write(&mirror->hmm->rwsem);
+}
+
+/* hmm_mirror_unregister() - unregister a mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * Driver can call this function when it wants to stop mirroring a process.
+ * This will trigger a call to the ->release() callback if it did not aleady
+ * happen.
+ *
+ * Note that caller must hold a reference on the mirror.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	if (mirror == NULL)
+		return;
+
+	hmm_mirror_kill(mirror);
+	hmm_mirror_unref(&mirror);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link btw HMM and each device driver.
+ */
+
+/* hmm_device_register() - register a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EINVAL otherwise.
+ *
+ *
+ * Call when device driver want to register itself with HMM. Device driver must
+ * only register once.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+	/* sanity check */
+	BUG_ON(!device);
+	BUG_ON(!device->ops);
+	BUG_ON(!device->ops->release);
+
+	mutex_init(&device->mutex);
+	INIT_LIST_HEAD(&device->mirrors);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+/* hmm_device_unregister() - unregister a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EBUSY otherwise.
+ *
+ * Call when device driver want to unregister itself with HMM. This will check
+ * that there is no any active mirror and returns -EBUSY if so.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+	mutex_lock(&device->mutex);
+	if (!list_empty(&device->mirrors)) {
+		mutex_unlock(&device->mutex);
+		return -EBUSY;
+	}
+	mutex_unlock(&device->mutex);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 06/36] HMM: add HMM page table v2.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (4 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 05/36] HMM: introduce heterogeneous memory management v3 j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-06-19  2:06   ` Mark Hairgrove
  2015-06-25 22:57   ` Mark Hairgrove
  2015-05-21 19:31 ` [PATCH 07/36] HMM: add per mirror page table v3 j.glisse
                   ` (15 subsequent siblings)
  21 siblings, 2 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar

From: JA(C)rA'me Glisse <jglisse@redhat.com>

Heterogeneous memory management main purpose is to mirror a process address.
To do so it must maintain a secondary page table that is use by the device
driver to program the device or build a device specific page table.

Radix tree can not be use to create this secondary page table because HMM
needs more flags than RADIX_TREE_MAX_TAGS (while this can be increase we
believe HMM will require so much flags that cost will becomes prohibitive
to others users of radix tree).

Moreover radix tree is built around long but for HMM we need to store dma
address and on some platform sizeof(dma_addr_t) > sizeof(long). Thus radix
tree is unsuitable to fulfill HMM requirement hence why we introduce this
code which allows to create page table that can grow and shrink dynamicly.

The design is very clause to CPU page table as it reuse some of the feature
such as spinlock embedded in struct page.

Changed since v1:
  - Use PAGE_SHIFT as shift value to reserve low bit for private device
    specific flags. This is to allow device driver to use and some of the
    lower bits for their own device specific purpose.
  - Add set of helper for atomically clear, setting and testing bit on
    dma_addr_t pointer. Atomicity being usefull only for dirty bit.
  - Differentiate btw DMA mapped entry and non mapped entry (pfn).
  - Split page directory entry and page table entry helpers.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 MAINTAINERS            |   2 +
 include/linux/hmm_pt.h | 380 +++++++++++++++++++++++++++++++++++++++++++
 mm/Makefile            |   2 +-
 mm/hmm_pt.c            | 425 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 808 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm_pt.h
 create mode 100644 mm/hmm_pt.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 2f2a2be..8cd0aa7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4736,6 +4736,8 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm.c
 F:	include/linux/hmm.h
+F:	mm/hmm_pt.c
+F:	include/linux/hmm_pt.h
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
new file mode 100644
index 0000000..330edb2
--- /dev/null
+++ b/include/linux/hmm_pt.h
@@ -0,0 +1,380 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provide a set of helpers for HMM page table. See include/linux/hmm.h
+ * for a description of what HMM is.
+ *
+ * HMM page table rely on a locking mecanism similar to CPU page table for page
+ * table update. It use the spinlock embedded inside the struct page to protect
+ * change to page table directory which should minimize lock contention for
+ * concurrent update.
+ *
+ * It does also provide a directory tree protection mechanism. Unlike CPU page
+ * table there is no mmap semaphore to protect directory tree from removal and
+ * this is done intentionaly so that concurrent removal/insertion of directory
+ * inside the tree can happen.
+ *
+ * So anyone walking down the page table must protect directory it traverses so
+ * they are not free by some other thread. This is done by using a reference
+ * counter for each directory. Before traversing a directory a reference is
+ * taken and once traversal is done the reference is drop.
+ *
+ * A directory entry dereference and refcount increment of sub-directory page
+ * must happen in a critical rcu section so that directory page removal can
+ * gracefully wait for all possible other threads that might have dereferenced
+ * the directory.
+ */
+#ifndef _HMM_PT_H
+#define _HMM_PT_H
+
+/*
+ * The HMM page table entry does not reflect any specific hardware. It is just
+ * a common entry format use by HMM internal and expose to HMM user so they can
+ * extract information out of HMM page table.
+ *
+ * Device driver should only rely on the helpers and should not traverse the
+ * page table themself.
+ */
+#define HMM_PT_MAX_LEVEL	6
+
+#define HMM_PDE_VALID_BIT	0
+#define HMM_PDE_VALID		(1 << HMM_PDE_VALID_BIT)
+#define HMM_PDE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
+}
+
+static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
+{
+	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
+}
+
+
+/*
+ * The HMM_PTE_VALID_DMA_BIT is set for valid DMA mapped entry, while for pfn
+ * entry the HMM_PTE_VALID_PFN_BIT is set. If the hmm_device is associated with
+ * a valid struct device than device driver will be supplied with DMA mapped
+ * entry otherwise it will be supplied with pfn entry.
+ *
+ * In the first case the device driver must ignore any pfn entry as they might
+ * show as transient state while HMM is mapping the page.
+ */
+#define HMM_PTE_VALID_DMA_BIT	0
+#define HMM_PTE_VALID_PFN_BIT	1
+#define HMM_PTE_WRITE_BIT	2
+#define HMM_PTE_DIRTY_BIT	3
+/*
+ * Reserve some bits for device driver private flags. Note that thus can only
+ * be manipulated using the hmm_pte_*_bit() sets of helpers.
+ *
+ * WARNING ONLY SET/CLEAR THOSE FLAG ON PTE ENTRY THAT HAVE THE VALID BIT SET
+ * AS OTHERWISE ANY BIT SET BY THE DRIVER WILL BE OVERWRITTEN BY HMM.
+ */
+#define HMM_PTE_HW_SHIFT	4
+
+#define HMM_PTE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+#define HMM_PTE_DMA_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+
+#ifdef __BIG_ENDIAN
+/*
+ * The dma_addr_t casting we do on little endian do not work on big endian. It
+ * would require some macro trickery to adjust the bit value depending on the
+ * number of bit unsigned long have in comparison to dma_addr_t. This is just
+ * low on the todo list for now.
+ */
+#error "HMM not supported on BIG_ENDIAN architecture.\n"
+#else /* __BIG_ENDIAN */
+static inline void hmm_pte_clear_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline void hmm_pte_set_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	set_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	return !!test_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_clear_bit(dma_addr_t *ptep,
+					      unsigned char bit)
+{
+	return !!test_and_clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
+					    unsigned char bit)
+{
+	return !!test_and_set_bit(bit, (unsigned long *)ptep);
+}
+#endif /* __BIG_ENDIAN */
+
+
+#define HMM_PTE_CLEAR_BIT(name, bit)\
+	static inline void hmm_pte_clear_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_SET_BIT(name, bit)\
+	static inline void hmm_pte_set_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_BIT(name, bit)\
+	static inline bool hmm_pte_test_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_clear_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_SET_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_set_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_BIT_HELPER(name, bit)\
+	HMM_PTE_CLEAR_BIT(name, bit)\
+	HMM_PTE_SET_BIT(name, bit)\
+	HMM_PTE_TEST_BIT(name, bit)\
+	HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	HMM_PTE_TEST_AND_SET_BIT(name, bit)
+
+HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
+HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
+HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
+HMM_PTE_BIT_HELPER(write, HMM_PTE_WRITE_BIT)
+
+static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
+}
+
+static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
+{
+	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
+}
+
+
+/* struct hmm_pt - HMM page table structure.
+ *
+ * @mask: Array of address mask value of each level.
+ * @directory_mask: Mask for directory index (see below).
+ * @last: Last valid address (inclusive).
+ * @pgd: page global directory (top first level of the directory tree).
+ * @lock: Share lock if spinlock_t does not fit in struct page.
+ * @shift: Array of address shift value of each level.
+ * @llevel: Last level.
+ *
+ * The index into each directory for a given address and level is :
+ *   (address >> shift[level]) & directory_mask
+ *
+ * Only hmm_pt.last field needs to be set before calling hmm_pt_init().
+ */
+struct hmm_pt {
+	unsigned long		mask[HMM_PT_MAX_LEVEL];
+	unsigned long		directory_mask;
+	unsigned long		last;
+	dma_addr_t		*pgd;
+	spinlock_t		lock;
+	unsigned char		shift[HMM_PT_MAX_LEVEL];
+	unsigned char		llevel;
+};
+
+int hmm_pt_init(struct hmm_pt *pt);
+void hmm_pt_fini(struct hmm_pt *pt);
+
+static inline unsigned hmm_pt_index(struct hmm_pt *pt,
+				    unsigned long addr,
+				    unsigned level)
+{
+	return (addr >> pt->shift[level]) & pt->directory_mask;
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	if (level)
+		spin_lock(&ptd->ptl);
+	else
+		spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	if (level)
+		spin_unlock(&ptd->ptl);
+	else
+		spin_unlock(&pt->lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	spin_unlock(&pt->lock);
+}
+#endif
+
+static inline unsigned long hmm_pt_level_start(struct hmm_pt *pt,
+					       unsigned long addr,
+					       unsigned level)
+{
+	return addr & pt->mask[level];
+}
+
+static inline unsigned long hmm_pt_level_end(struct hmm_pt *pt,
+					     unsigned long addr,
+					     unsigned level)
+{
+	return (addr | (~pt->mask[level])) + 1UL;
+}
+
+static inline unsigned long hmm_pt_level_next(struct hmm_pt *pt,
+					      unsigned long addr,
+					      unsigned long end,
+					      unsigned level)
+{
+	addr = (addr | (~pt->mask[level])) + 1UL;
+	return (addr - 1 < end - 1) ? addr : end;
+}
+
+
+/* struct hmm_pt_iter - page table iterator states.
+ *
+ * @ptd: Array of directory struct page pointer for each levels.
+ * @ptdp: Array of pointer to mapped directory levels.
+ * @dead_directories: List of directories that died while walking page table.
+ * @cur: Current address.
+ */
+struct hmm_pt_iter {
+	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
+	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
+	struct list_head	dead_directories;
+	unsigned long		cur;
+};
+
+void hmm_pt_iter_init(struct hmm_pt_iter *iter);
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt);
+unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr,
+			       unsigned long end);
+dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr);
+dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
+			      struct hmm_pt *pt,
+			      unsigned long addr);
+
+/* hmm_pt_protect_directory_unref() - reference a directory.
+ *
+ * @iter: Iterator states that currently protect the directory.
+ * @level: Level of the directory to reference.
+ *
+ * This function will reference a directory but it is illegal for refcount to
+ * be 0 as this helper should only be call when iterator is protecting the
+ * directory (ie iterator hold a reference for the directory).
+ *
+ * HMM user will call this with level = pt.llevel any other value is supicious
+ * outside of hmm_pt code.
+ */
+static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter,
+					     char level)
+{
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_inc_not_zero(&iter->ptd[level - 1]->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+}
+
+/* hmm_pt_protect_directory_unref() - unreference a directory.
+ *
+ * @iter: Iterator states that currently protect the directory.
+ * @level: Level of the directory to unreference.
+ *
+ * This function will unreference a directory but it is illegal for refcount to
+ * reach 0 here as this helper should only be call when iterator is protecting
+ * the directory (ie iterator hold a reference for the directory).
+ *
+ * HMM user will call this with level = pt.llevel any other value is supicious
+ * outside of hmm_pt code.
+ */
+static inline void hmm_pt_iter_directory_unref(struct hmm_pt_iter *iter,
+					       char level)
+{
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	/* Illegal this should not happen. */
+	BUG();
+}
+
+static inline dma_addr_t *hmm_pt_iter_ptdp(struct hmm_pt_iter *iter,
+					   struct hmm_pt *pt,
+					   unsigned long addr)
+{
+	BUG_ON(!iter->ptd[pt->llevel - 1] ||
+	       addr < hmm_pt_level_start(pt, iter->cur, pt->llevel) ||
+	       addr >= hmm_pt_level_end(pt, iter->cur, pt->llevel));
+	return &iter->ptdp[pt->llevel - 1][hmm_pt_index(pt, addr, pt->llevel)];
+}
+
+static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter,
+					      struct hmm_pt *pt)
+{
+	hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter,
+						struct hmm_pt *pt)
+{
+	hmm_pt_directory_unlock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+
+#endif /* _HMM_PT_H */
diff --git a/mm/Makefile b/mm/Makefile
index 90ca9c4..04d7d45 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -78,4 +78,4 @@ obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
-obj-$(CONFIG_HMM) += hmm.o
+obj-$(CONFIG_HMM) += hmm.o hmm_pt.o
diff --git a/mm/hmm_pt.c b/mm/hmm_pt.c
new file mode 100644
index 0000000..49b200e
--- /dev/null
+++ b/mm/hmm_pt.c
@@ -0,0 +1,425 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: JA(C)rA'me Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provide a set of helpers for HMM page table. See include/linux/hmm.h
+ * for a description of what HMM is and include/linux/hmm_pt.h.
+ */
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/hmm_pt.h>
+
+/* hmm_pt_init() - initialize HMM page table.
+ *
+ * @pt: HMM page table to initialize.
+ *
+ * This function will initialize HMM page table and allocate memory for global
+ * directory. Only the hmm_pt.last fields need to be set prior to calling this
+ * function.
+ */
+int hmm_pt_init(struct hmm_pt *pt)
+{
+	unsigned directory_shift, i = 0, npgd;
+
+	pt->last &= PAGE_MASK;
+	spin_lock_init(&pt->lock);
+	/* Directory shift is the number of bits that a single directory level
+	 * represent. For instance if PAGE_SIZE is 4096 and each entry takes 8
+	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
+	 */
+	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
+	/* Level 0 is the root level of the page table. It might use less
+	 * bits than directory_shift but all sub-directory level will use all
+	 * directory_shift bits.
+	 *
+	 * For instance if hmm_pt.last == (1 << 48), PAGE_SHIFT == 12 and
+	 * sizeof(dma_addr_t) == 8 then :
+	 *   directory_shift = 9
+	 *   shift[0] = 39
+	 *   shift[1] = 30
+	 *   shift[2] = 21
+	 *   shift[3] = 12
+	 *   llevel = 3
+	 *
+	 * Note that shift[llevel] == PAGE_SHIFT because the last level
+	 * correspond to the page table entry level (ignoring the case of huge
+	 * page).
+	 */
+	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
+			directory_shift) + PAGE_SHIFT;
+	while (pt->shift[i++] > PAGE_SHIFT)
+		pt->shift[i] = pt->shift[i - 1] - directory_shift;
+	pt->llevel = i - 1;
+	pt->directory_mask = (1 << directory_shift) - 1;
+
+	for (i = 0; i <= pt->llevel; ++i)
+		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
+
+	npgd = (pt->last >> pt->shift[0]) + 1;
+	pt->pgd = kzalloc(npgd * sizeof(dma_addr_t), GFP_KERNEL);
+	if (!pt->pgd)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_pt_init);
+
+static void hmm_pt_fini_directory(struct hmm_pt *pt,
+				  struct page *ptd,
+				  unsigned level)
+{
+	dma_addr_t *ptdp;
+	unsigned i;
+
+	if (level == pt->llevel)
+		return;
+
+	ptdp = kmap(ptd);
+	for (i = 0; i <= pt->directory_mask; ++i) {
+		struct page *lptd;
+
+		if (!(ptdp[i] & HMM_PDE_VALID))
+			continue;
+		lptd = pfn_to_page(hmm_pde_pfn(ptdp[i]));
+		ptdp[i] = 0;
+		hmm_pt_fini_directory(pt, lptd, level + 1);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+	kunmap(ptd);
+}
+
+/* hmm_pt_fini() - finalize HMM page table.
+ *
+ * @pt: HMM page table to finalize.
+ *
+ * This function will free all resources of a directory page table.
+ */
+void hmm_pt_fini(struct hmm_pt *pt)
+{
+	unsigned i;
+
+	/* Free all directory. */
+	for (i = 0; i <= (pt->last >> pt->shift[0]); ++i) {
+		struct page *ptd;
+
+		if (!(pt->pgd[i] & HMM_PDE_VALID))
+			continue;
+		ptd = pfn_to_page(hmm_pde_pfn(pt->pgd[i]));
+		pt->pgd[i] = 0;
+		hmm_pt_fini_directory(pt, ptd, 1);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+
+	kfree(pt->pgd);
+	pt->pgd = NULL;
+}
+EXPORT_SYMBOL(hmm_pt_fini);
+
+
+/* hmm_pt_init() - initialize iterator states.
+ *
+ * @iter: Iterator states.
+ *
+ * This function will initialize iterator states. It must always be pair with a
+ * call to hmm_pt_iter_fini().
+ */
+void hmm_pt_iter_init(struct hmm_pt_iter *iter)
+{
+	memset(iter->ptd, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
+	memset(iter->ptdp, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
+	INIT_LIST_HEAD(&iter->dead_directories);
+}
+EXPORT_SYMBOL(hmm_pt_iter_init);
+
+/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ * @level: Level of the directory to unref.
+ *
+ * This function will unreference a directory and add it to dead list if
+ * directory no longer have any reference. It will also clear the entry to
+ * that directory into the upper level directory as well as dropping ref
+ * on the upper directory.
+ */
+static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
+					     struct hmm_pt *pt,
+					     unsigned level)
+{
+	struct page *upper_ptd;
+	dma_addr_t *upper_ptdp;
+
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
+	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : pt->pgd;
+	upper_ptdp = &upper_ptdp[hmm_pt_index(pt, iter->cur, level - 1)];
+	hmm_pt_directory_lock(pt, upper_ptd, level - 1);
+	/*
+	 * There might be race btw decrementing reference count on a directory
+	 * and another thread trying to fault in a new directory. To avoid
+	 * erasing the new directory entry we need to check that the entry
+	 * still correspond to the directory we are removing.
+	 */
+	if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
+		*upper_ptdp = 0;
+	hmm_pt_directory_unlock(pt, upper_ptd, level - 1);
+
+	/* Add it to delayed free list. */
+	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
+
+	/*
+	 * The upper directory is not safe to unref as we have an extra ref and
+	 * thus refcount should not reach 0.
+	 */
+	hmm_pt_iter_directory_unref(iter, level - 1);
+}
+
+static void hmm_pt_iter_unprotect_directory(struct hmm_pt_iter *iter,
+					    struct hmm_pt *pt,
+					    unsigned level)
+{
+	if (!iter->ptd[level - 1])
+		return;
+	kunmap(iter->ptd[level - 1]);
+	hmm_pt_iter_directory_unref_safe(iter, pt, level);
+	iter->ptd[level - 1] = NULL;
+}
+
+/* hmm_pt_iter_protect_directory() - protect a directory.
+ *
+ * @iter: Iterator states.
+ * @ptd: directory struct page to protect.
+ * @addr: Address of the directory.
+ * @level: Level of this directory (> 0).
+ * Returns -EINVAL on error, 1 if protection succeeded, 0 otherwise.
+ *
+ * This function will proctect a directory by taking a reference. It will also
+ * map the directory to allow cpu access.
+ *
+ * Call to this function must be made from inside the rcu read critical section
+ * that convert the table entry to the directory struct page. Doing so allow to
+ * support concurrent removal of directory because this function will take the
+ * reference inside the rcu critical section and thus rcu synchronization will
+ * garanty that we can safely free directory.
+ */
+int hmm_pt_iter_protect_directory(struct hmm_pt_iter *iter,
+				  struct page *ptd,
+				  unsigned long addr,
+				  unsigned level)
+{
+	/* This must be call inside rcu read section. */
+	BUG_ON(!rcu_read_lock_held());
+
+	if (!level || iter->ptd[level - 1]) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	if (!atomic_inc_not_zero(&ptd->_mapcount)) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	rcu_read_unlock();
+
+	iter->ptd[level - 1] = ptd;
+	iter->ptdp[level - 1] = kmap(ptd);
+	iter->cur = addr;
+
+	return 1;
+}
+
+unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr,
+			       unsigned long end)
+{
+	unsigned i;
+
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr < hmm_pt_level_end(pt, iter->cur, i))
+			return hmm_pt_level_next(pt, iter->cur, end, i);
+	}
+
+	/*
+	 * No need for rcu protection worst case is we return a now dead
+	 * address.
+	 */
+	if (pt->pgd[hmm_pt_index(pt, addr, 0)] & HMM_PDE_VALID)
+		return hmm_pt_level_next(pt, addr, end, pt->llevel);
+	for (; addr < end; addr = hmm_pt_level_next(pt, addr, end, 0))
+		if (pt->pgd[hmm_pt_index(pt, addr, 0)] & HMM_PDE_VALID)
+			return addr;
+	return end;
+}
+EXPORT_SYMBOL(hmm_pt_iter_next);
+
+dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr)
+{
+	int i;
+
+	addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    addr < hmm_pt_level_end(pt, iter->cur, pt->llevel))
+		return hmm_pt_iter_ptdp(iter, pt, addr);
+
+	/* First unprotect any directory that do not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr < hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, pt, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, addr, i)]);
+		if (!(pte & HMM_PDE_VALID)) {
+			rcu_read_unlock();
+			return NULL;
+		}
+		ptd = pfn_to_page(hmm_pde_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd, addr, i + 1) != 1)
+			return NULL;
+	}
+
+	return hmm_pt_iter_ptdp(iter, pt, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_update);
+
+dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
+			      struct hmm_pt *pt,
+			      unsigned long addr)
+{
+	dma_addr_t *ptdp = hmm_pt_iter_update(iter, pt, addr);
+	struct page *new = NULL;
+	int i;
+
+	if (ptdp)
+		return ptdp;
+
+	/* Populate directory tree structures. */
+	for (i = 1; i <= pt->llevel; ++i) {
+		struct page *upper_ptd;
+		dma_addr_t *upper_ptdp;
+
+		if (iter->ptd[i - 1])
+			continue;
+
+		new = new ? new : alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+		if (!new)
+			return NULL;
+
+		upper_ptd = i > 1 ? iter->ptd[i - 2] : NULL;
+		upper_ptdp = i > 1 ? iter->ptdp[i - 2] : pt->pgd;
+		upper_ptdp = &upper_ptdp[hmm_pt_index(pt, addr, i - 1)];
+		hmm_pt_directory_lock(pt, upper_ptd, i - 1);
+		if (((*upper_ptdp) & HMM_PDE_VALID)) {
+			struct page *ptd;
+
+			ptd = pfn_to_page(hmm_pde_pfn(*upper_ptdp));
+			if (atomic_inc_not_zero(&ptd->_mapcount)) {
+				/* Already allocated by another thread. */
+				iter->ptd[i - 1] = ptd;
+				hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+				iter->ptdp[i - 1] = kmap(ptd);
+				iter->cur = hmm_pt_level_start(pt, addr, i);
+				continue;
+			}
+			/*
+			 * Means we raced with removal of dead directory it is
+			 * safe to overwritte *upper_ptdp entry with new entry.
+			 */
+		}
+		/* Initialize struct page field for the directory. */
+		atomic_set(&new->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+		spin_lock_init(&new->ptl);
+#endif
+		*upper_ptdp = hmm_pde_from_pfn(page_to_pfn(new));
+		hmm_pt_iter_directory_ref(iter, i - 1);
+		/* Unlock upper directory and map the new directory. */
+		hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+		iter->ptd[i - 1] = new;
+		iter->ptdp[i - 1] = kmap(new);
+		iter->cur = hmm_pt_level_start(pt, addr, i);
+		new = NULL;
+	}
+	if (new)
+		__free_page(new);
+	return hmm_pt_iter_ptdp(iter, pt, addr);
+}
+
+/* hmm_pt_iter_fini() - finalize iterator.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ *
+ * This function will cleanup iterator by unmapping and unreferencing any
+ * directory still mapped and referenced. It will also free any dead directory.
+ */
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt)
+{
+	struct page *ptd, *tmp;
+	unsigned i;
+
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		hmm_pt_iter_unprotect_directory(iter, pt, i);
+	}
+
+	/* Avoid useless synchronize_rcu() if there is no directory to free. */
+	if (list_empty(&iter->dead_directories))
+		return;
+
+	/*
+	 * Some iterator may have dereferenced a dead directory entry and looked
+	 * up the struct page but haven't check yet the reference count. As all
+	 * the above happen in rcu read critical section we know that we need
+	 * to wait for grace period before being able to free any of the dead
+	 * directory page.
+	 */
+	synchronize_rcu();
+	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
+		list_del(&ptd->lru);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+}
+EXPORT_SYMBOL(hmm_pt_iter_fini);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 07/36] HMM: add per mirror page table v3.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (5 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 06/36] HMM: add HMM page table v2 j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-06-25 23:05   ` Mark Hairgrove
  2015-05-21 19:31 ` [PATCH 08/36] HMM: add device page fault support v3 j.glisse
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This patch add the per mirror page table. It also propagate CPU page
table update to this per mirror page table using mmu_notifier callback.
All update are contextualized with an HMM event structure that convey
all information needed by device driver to take proper actions (update
its own mmu to reflect changes and schedule proper flushing).

Core HMM is responsible for updating the per mirror page table once
the device driver is done with its update. Most importantly HMM will
properly propagate HMM page table dirty bit to underlying page.

Changed since v1:
  - Removed unused fence code to defer it to latter patches.

Changed since v2:
  - Use new bit flag helper for mirror page table manipulation.
  - Differentiate fork event with HMM_FORK from other events.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |  83 ++++++++++++++++++++
 mm/hmm.c            | 221 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 304 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 175a757..573560b 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -46,6 +46,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/workqueue.h>
 #include <linux/mman.h>
+#include <linux/hmm_pt.h>
 
 
 struct hmm_device;
@@ -53,6 +54,39 @@ struct hmm_mirror;
 struct hmm;
 
 
+/*
+ * hmm_event - each event is described by a type associated with a struct.
+ */
+enum hmm_etype {
+	HMM_NONE = 0,
+	HMM_FORK,
+	HMM_ISDIRTY,
+	HMM_MIGRATE,
+	HMM_MUNMAP,
+	HMM_DEVICE_RFAULT,
+	HMM_DEVICE_WFAULT,
+	HMM_WRITE_PROTECT,
+};
+
+/* struct hmm_event - memory event information.
+ *
+ * @list: So HMM can keep track of all active events.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ * @pte_mask: HMM pte update mask (bit(s) that are still valid).
+ * @etype: Event type (munmap, migrate, truncate, ...).
+ * @backoff: Only meaningful for device page fault.
+ */
+struct hmm_event {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	dma_addr_t		pte_mask;
+	enum hmm_etype		etype;
+	bool			backoff;
+};
+
+
 /* hmm_device - Each device must register one and only one hmm_device.
  *
  * The hmm_device is the link btw HMM and each device driver.
@@ -76,6 +110,53 @@ struct hmm_device_ops {
 	 *     callback against that mirror.
 	 */
 	void (*release)(struct hmm_mirror *mirror);
+
+	/* update() - update device mmu following an event.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @event: The event that triggered the update.
+	 * Returns: 0 on success or error code {-EIO, -ENOMEM}.
+	 *
+	 * Called to update device page table for a range of address.
+	 * The event type provide the nature of the update :
+	 *   - Range is no longer valid (munmap).
+	 *   - Range protection changes (mprotect, COW, ...).
+	 *   - Range is unmapped (swap, reclaim, page migration, ...).
+	 *   - Device page fault.
+	 *   - ...
+	 *
+	 * Thought most device driver only need to use pte_mask as it reflects
+	 * change that will happen to the HMM page table ie :
+	 *   new_pte = old_pte & event->pte_mask;
+	 *
+	 * Device driver must not update the HMM mirror page table (except the
+	 * dirty bit see below). Core HMM will update HMM page table after the
+	 * update is done.
+	 *
+	 * Note that device must be cache coherent with system memory (snooping
+	 * in case of PCIE devices) so there should be no need for device to
+	 * flush anything.
+	 *
+	 * When write protection is turned on device driver must make sure the
+	 * hardware will no longer be able to write to the page otherwise file
+	 * system corruption may occur.
+	 *
+	 * Device must properly set the dirty bit using hmm_pte_set_bit() on
+	 * each page entry for memory that was written by the device. If device
+	 * can not properly account for write access then the dirty bit must be
+	 * set unconditionaly so that proper write back of file backed page can
+	 * happen.
+	 *
+	 * Device driver must not fail lightly, any failure result in device
+	 * process being kill.
+	 *
+	 * Return 0 on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	int (*update)(struct hmm_mirror *mirror,const struct hmm_event *event);
 };
 
 
@@ -142,6 +223,7 @@ int hmm_device_unregister(struct hmm_device *device);
  * @kref: Reference counter (private to HMM do not use).
  * @dlist: List of all hmm_mirror for same device.
  * @mlist: List of all hmm_mirror for same process.
+ * @pt: Mirror page table.
  *
  * Each device that want to mirror an address space must register one of this
  * struct for each of the address space it wants to mirror. Same device can
@@ -154,6 +236,7 @@ struct hmm_mirror {
 	struct kref		kref;
 	struct list_head	dlist;
 	struct hlist_node	mlist;
+	struct hmm_pt		pt;
 };
 
 int hmm_mirror_register(struct hmm_mirror *mirror);
diff --git a/mm/hmm.c b/mm/hmm.c
index e684dd0..04a3743 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -48,6 +48,51 @@ static struct mmu_notifier_ops hmm_notifier_ops;
 
 static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
 static inline void hmm_mirror_unref(struct hmm_mirror **mirror);
+static void hmm_mirror_kill(struct hmm_mirror *mirror);
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event);
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event);
+
+
+/* hmm_event - use to track information relating to an event.
+ *
+ * Each change to cpu page table or fault from a device is considered as an
+ * event by hmm. For each event there is a common set of things that need to
+ * be tracked. The hmm_event struct centralize those and the helper functions
+ * help dealing with all this.
+ */
+
+static inline int hmm_event_init(struct hmm_event *event,
+				 struct hmm *hmm,
+				 unsigned long start,
+				 unsigned long end,
+				 enum hmm_etype etype)
+{
+	event->start = start & PAGE_MASK;
+	event->end = min(end, hmm->vm_end);
+	if (event->start >= event->end)
+		return -EINVAL;
+	event->etype = etype;
+	event->pte_mask = (dma_addr_t)-1ULL;
+	switch (etype) {
+	case HMM_ISDIRTY:
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		break;
+	case HMM_FORK:
+	case HMM_WRITE_PROTECT:
+		event->pte_mask ^= (1 << HMM_PTE_WRITE_BIT);
+		break;
+	case HMM_MIGRATE:
+	case HMM_MUNMAP:
+		event->pte_mask = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
 
 
 /* hmm - core HMM functions.
@@ -126,6 +171,27 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_mirror *mirror;
+
+	/* Is this hmm already fully stop ? */
+	if (hmm->mm->hmm != hmm)
+		return;
+
+again:
+	down_read(&hmm->rwsem);
+	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
+		if (hmm_mirror_update(mirror, event)) {
+			mirror = hmm_mirror_ref(mirror);
+			up_read(&hmm->rwsem);
+			hmm_mirror_kill(mirror);
+			hmm_mirror_unref(&mirror);
+			goto again;
+		}
+	up_read(&hmm->rwsem);
+}
+
 
 /* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
  *
@@ -163,8 +229,91 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	hmm_unref(hmm);
 }
 
+static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
+				   unsigned long addr,
+				   enum mmu_event mmu_event,
+				   enum hmm_etype *etype)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, addr);
+	if (!vma || vma->vm_start > addr || !(vma->vm_flags & VM_READ)) {
+		*etype = HMM_MUNMAP;
+		return;
+	}
+
+	if (!(vma->vm_flags & VM_WRITE)) {
+		*etype = HMM_WRITE_PROTECT;
+		return;
+	}
+
+	*etype = HMM_NONE;
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						const struct mmu_notifier_range *range)
+{
+	struct hmm_event event;
+	unsigned long start = range->start, end = range->end;
+	struct hmm *hmm;
+
+	hmm = container_of(mn, struct hmm, mmu_notifier);
+	if (start >= hmm->vm_end)
+		return;
+
+	switch (range->event) {
+	case MMU_FORK:
+		event.etype = HMM_FORK;
+		break;
+	case MMU_MUNLOCK:
+		/* Still same physical ram backing same address. */
+		return;
+	case MMU_MPROT:
+		hmm_mmu_mprot_to_etype(mm, start, range->event, &event.etype);
+		if (event.etype == HMM_NONE)
+			return;
+		break;
+	case MMU_WRITE_BACK:
+	case MMU_WRITE_PROTECT:
+		event.etype = HMM_WRITE_PROTECT;
+		break;
+	case MMU_ISDIRTY:
+		event.etype = HMM_ISDIRTY;
+		break;
+	case MMU_HSPLIT:
+	case MMU_MUNMAP:
+		event.etype = HMM_MUNMAP;
+		break;
+	case MMU_MIGRATE:
+	default:
+		event.etype = HMM_MIGRATE;
+		break;
+	}
+
+	hmm_event_init(&event, hmm, start, end, event.etype);
+
+	hmm_update(hmm, &event);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+					 struct mm_struct *mm,
+					 unsigned long addr,
+					 struct page *page,
+					 enum mmu_event mmu_event)
+{
+	struct mmu_notifier_range range;
+
+	range.start = addr & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = mmu_event;
+	hmm_notifier_invalidate_range_start(mn, mm, &range);
+}
+
 static struct mmu_notifier_ops hmm_notifier_ops = {
 	.release		= hmm_notifier_release,
+	.invalidate_page	= hmm_notifier_invalidate_page,
+	.invalidate_range_start	= hmm_notifier_invalidate_range_start,
 };
 
 
@@ -195,6 +344,8 @@ static void hmm_mirror_destroy(struct kref *kref)
 	device = mirror->device;
 	hmm = mirror->hmm;
 
+	hmm_pt_fini(&mirror->pt);
+
 	mutex_lock(&device->mutex);
 	list_del_init(&mirror->dlist);
 	device->ops->release(mirror);
@@ -211,6 +362,64 @@ static inline void hmm_mirror_unref(struct hmm_mirror **mirror)
 	}
 }
 
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event)
+{
+	struct hmm_device *device = mirror->device;
+	int ret = 0;
+
+	ret = device->ops->update(mirror, event);
+	hmm_mirror_update_pt(mirror, event);
+	return ret;
+}
+
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event)
+{
+	unsigned long addr;
+	struct hmm_pt_iter iter;
+
+	hmm_pt_iter_init(&iter);
+	for (addr = event->start; addr != event->end;) {
+		unsigned long end, next;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		if (!hmm_pte) {
+			addr = hmm_pt_iter_next(&iter, &mirror->pt,
+						addr, event->end);
+			continue;
+		}
+		end = hmm_pt_level_next(&mirror->pt, addr, event->end,
+					 mirror->pt.llevel - 1);
+		/*
+		 * The directory lock protect against concurrent clearing of
+		 * page table bit flags. Exceptions being the dirty bit and
+		 * the device driver private flags.
+		 */
+		hmm_pt_iter_directory_lock(&iter, &mirror->pt);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, end,
+						 mirror->pt.llevel);
+			if (!hmm_pte_test_valid_pfn(hmm_pte))
+				continue;
+			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
+			    hmm_pte_test_write(hmm_pte)) {
+				struct page *page;
+
+				page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
+				set_page_dirty(page);
+			}
+			*hmm_pte &= event->pte_mask;
+			if (hmm_pte_test_valid_pfn(hmm_pte))
+				continue;
+			hmm_pt_iter_directory_unref(&iter, mirror->pt.llevel);
+		} while (addr = next, hmm_pte++, addr != end);
+		hmm_pt_iter_directory_unlock(&iter, &mirror->pt);
+	}
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+}
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
@@ -242,6 +451,11 @@ int hmm_mirror_register(struct hmm_mirror *mirror)
 	 * necessary to make the error path easier for driver and for hmm.
 	 */
 	kref_init(&mirror->kref);
+	mirror->pt.last = TASK_SIZE - 1;
+	if (hmm_pt_init(&mirror->pt)) {
+		kfree(mirror);
+		return -ENOMEM;
+	}
 	INIT_HLIST_NODE(&mirror->mlist);
 	INIT_LIST_HEAD(&mirror->dlist);
 	mutex_lock(&mirror->device->mutex);
@@ -278,6 +492,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror)
 		hmm_unref(hmm);
 		goto error;
 	}
+	BUG_ON(mirror->pt.last >= hmm->vm_end);
 	return 0;
 
 error:
@@ -290,6 +505,12 @@ EXPORT_SYMBOL(hmm_mirror_register);
 
 static void hmm_mirror_kill(struct hmm_mirror *mirror)
 {
+	struct hmm_event event;
+
+	/* Make sure everything is unmapped. */
+	hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
+	hmm_mirror_update(mirror, &event);
+
 	down_write(&mirror->hmm->rwsem);
 	if (!hlist_unhashed(&mirror->mlist)) {
 		hlist_del_init(&mirror->mlist);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 08/36] HMM: add device page fault support v3.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (6 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 07/36] HMM: add per mirror page table v3 j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 09/36] HMM: add mm page table iterator helpers j.glisse
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This patch add helper for device page fault. Device page fault helper will
fill the mirror page table using the CPU page table all this synchronized
with any update to CPU page table.

Changed since v1:
  - Add comment about directory lock.

Changed since v2:
  - Check for mirror->hmm in hmm_mirror_fault()

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |   9 ++
 mm/hmm.c            | 386 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 394 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 573560b..fdb1975 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -169,6 +169,10 @@ struct hmm_device_ops {
  * @rwsem: Serialize the mirror list modifications.
  * @mmu_notifier: The mmu_notifier of this mm.
  * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ * @device_faults: List of all active device page faults.
+ * @ndevice_faults: Number of active device page faults.
+ * @wait_queue: Wait queue for event synchronization.
+ * @lock: Serialize device_faults list modification.
  *
  * For each process address space (mm_struct) there is one and only one hmm
  * struct. hmm functions will redispatch to each devices the change made to
@@ -185,6 +189,10 @@ struct hmm {
 	struct rw_semaphore	rwsem;
 	struct mmu_notifier	mmu_notifier;
 	struct rcu_head		rcu;
+	struct list_head	device_faults;
+	unsigned		ndevice_faults;
+	wait_queue_head_t	wait_queue;
+	spinlock_t		lock;
 };
 
 
@@ -241,6 +249,7 @@ struct hmm_mirror {
 
 int hmm_mirror_register(struct hmm_mirror *mirror);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 04a3743..e1aa6ca 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -63,6 +63,11 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
  * help dealing with all this.
  */
 
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+	return !((a->end <= b->start) || (a->start >= b->end));
+}
+
 static inline int hmm_event_init(struct hmm_event *event,
 				 struct hmm *hmm,
 				 unsigned long start,
@@ -70,7 +75,7 @@ static inline int hmm_event_init(struct hmm_event *event,
 				 enum hmm_etype etype)
 {
 	event->start = start & PAGE_MASK;
-	event->end = min(end, hmm->vm_end);
+	event->end = PAGE_ALIGN(min(end, hmm->vm_end));
 	if (event->start >= event->end)
 		return -EINVAL;
 	event->etype = etype;
@@ -107,6 +112,10 @@ static int hmm_init(struct hmm *hmm)
 	kref_init(&hmm->kref);
 	INIT_HLIST_HEAD(&hmm->mirrors);
 	init_rwsem(&hmm->rwsem);
+	INIT_LIST_HEAD(&hmm->device_faults);
+	hmm->ndevice_faults = 0;
+	init_waitqueue_head(&hmm->wait_queue);
+	spin_lock_init(&hmm->lock);
 
 	/* register notifier */
 	hmm->mmu_notifier.ops = &hmm_notifier_ops;
@@ -171,6 +180,58 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *event)
+{
+	int ret = 0;
+
+	mmu_notifier_range_wait_valid(hmm->mm, event->start, event->end);
+
+	spin_lock(&hmm->lock);
+	if (mmu_notifier_range_is_valid(hmm->mm, event->start, event->end)) {
+		list_add_tail(&event->list, &hmm->device_faults);
+		hmm->ndevice_faults++;
+		event->backoff = false;
+	} else
+		ret = -EAGAIN;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+
+	return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *event)
+{
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	hmm->ndevice_faults--;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+	struct hmm_event *fevent;
+	unsigned long wait_for = 0;
+
+again:
+	spin_lock(&hmm->lock);
+	list_for_each_entry(fevent, &hmm->device_faults, list) {
+		if (!hmm_event_overlap(fevent, ievent))
+			continue;
+		fevent->backoff = true;
+		wait_for = hmm->ndevice_faults;
+	}
+	spin_unlock(&hmm->lock);
+
+	if (wait_for > 0) {
+		wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+		wait_for = 0;
+		goto again;
+	}
+}
+
 static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 {
 	struct hmm_mirror *mirror;
@@ -179,6 +240,8 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 	if (hmm->mm->hmm != hmm)
 		return;
 
+	hmm_wait_device_fault(hmm, event);
+
 again:
 	down_read(&hmm->rwsem);
 	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
@@ -190,6 +253,35 @@ again:
 			goto again;
 		}
 	up_read(&hmm->rwsem);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static int hmm_mm_fault(struct hmm *hmm,
+			struct hmm_event *event,
+			struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned flags;
+	int r;
+
+	flags = (event->etype == HMM_DEVICE_WFAULT) ? FAULT_FLAG_WRITE : 0;
+	for (addr &= PAGE_MASK; addr < event->end; addr += PAGE_SIZE) {
+
+		flags |= FAULT_FLAG_ALLOW_RETRY;
+		do {
+			r = handle_mm_fault(mm, vma, addr, flags);
+			if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) {
+				if (r & VM_FAULT_OOM)
+					return -ENOMEM;
+				/* Same error code for all other cases. */
+				return -EFAULT;
+			}
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		} while (r & VM_FAULT_RETRY);
+	}
+	return 0;
 }
 
 
@@ -226,6 +318,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	}
 	up_write(&hmm->rwsem);
 
+	wake_up(&hmm->wait_queue);
 	hmm_unref(hmm);
 }
 
@@ -420,6 +513,297 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 	hmm_pt_iter_fini(&iter, &mirror->pt);
 }
 
+static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
+{
+	if (hlist_unhashed(&mirror->mlist) || list_empty(&mirror->dlist))
+		return true;
+	return false;
+}
+
+struct hmm_mirror_fault {
+	struct hmm_mirror	*mirror;
+	struct hmm_event	*event;
+	struct vm_area_struct	*vma;
+	unsigned long		addr;
+	struct hmm_pt_iter	*iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+				 struct hmm_event *event,
+				 struct vm_area_struct *vma,
+				 struct hmm_pt_iter *iter,
+				 pmd_t *pmdp,
+				 struct hmm_mirror_fault *mirror_fault,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct page *page;
+	unsigned long addr, pfn;
+	unsigned flags = FOLL_TOUCH;
+	spinlock_t *ptl;
+	int ret;
+
+	ptl = pmd_lock(mirror->hmm->mm, pmdp);
+	if (unlikely(!pmd_trans_huge(*pmdp))) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+	if (unlikely(pmd_trans_splitting(*pmdp))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmdp);
+		return -EAGAIN;
+	}
+	flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
+	page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+	pfn = page_to_pfn(page);
+	spin_unlock(ptl);
+
+	/* Just fault in the whole PMD. */
+	start &= PMD_MASK;
+	end = start + PMD_SIZE - 1;
+
+	if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
+			return -ENOENT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, hmm_end, next;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_fault(iter, &mirror->pt, addr);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		hmm_end = hmm_pt_level_next(&mirror->pt, addr, end,
+					    mirror->pt.llevel - 1);
+		/*
+		 * The directory lock protect against concurrent clearing of
+		 * page table bit flags. Exceptions being the dirty bit and
+		 * the device driver private flags.
+		 */
+		hmm_pt_iter_directory_lock(iter, &mirror->pt);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
+						 mirror->pt.llevel);
+
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pfn);
+				hmm_pt_iter_directory_ref(iter,
+							  mirror->pt.llevel);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
+			if (pmd_write(*pmdp))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr = next, pfn++, i++, addr != hmm_end);
+		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
+		mirror_fault->addr = addr;
+	}
+
+	return 0;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_mirror_fault *mirror_fault = walk->private;
+	struct hmm_mirror *mirror = mirror_fault->mirror;
+	struct hmm_event *event = mirror_fault->event;
+	struct hmm_pt_iter *iter = mirror_fault->iter;
+	bool write = (event->etype == HMM_DEVICE_WFAULT);
+	unsigned long addr;
+	int ret = 0;
+
+	/* Make sure there was no gap. */
+	if (start != mirror_fault->addr)
+		return -ENOENT;
+
+	if (event->backoff)
+		return -EAGAIN;
+
+	if (pmd_none(*pmdp))
+		return -ENOENT;
+
+	if (pmd_trans_huge(*pmdp))
+		return hmm_mirror_fault_hpmd(mirror, event, mirror_fault->vma,
+					     iter, pmdp, mirror_fault, start,
+					     end);
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+		return -EFAULT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, hmm_end, next;
+		dma_addr_t *hmm_pte;
+		pte_t *ptep;
+
+		hmm_pte = hmm_pt_iter_fault(iter, &mirror->pt, addr);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		hmm_end = hmm_pt_level_next(&mirror->pt, addr, end,
+					    mirror->pt.llevel - 1);
+		ptep = pte_offset_map(pmdp, start);
+		hmm_pt_iter_directory_lock(iter, &mirror->pt);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
+						 mirror->pt.llevel);
+			if (!pte_present(*ptep) || (write && !pte_write(*ptep))) {
+				ret = -ENOENT;
+				ptep++;
+				break;
+			}
+
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+				hmm_pt_iter_directory_ref(iter,
+							  mirror->pt.llevel);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+			if (pte_write(*ptep))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr = next, ptep++, i++, addr != hmm_end);
+		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
+		pte_unmap(ptep - 1);
+		mirror_fault->addr = addr;
+	}
+
+	return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   struct vm_area_struct *vma,
+				   struct hmm_pt_iter *iter)
+{
+	struct hmm_mirror_fault mirror_fault;
+	unsigned long addr = event->start;
+	struct mm_walk walk = {0};
+	int ret = 0;
+
+	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
+		return -EACCES;
+
+	ret = hmm_device_fault_start(mirror->hmm, event);
+	if (ret)
+		return ret;
+
+again:
+	if (event->backoff) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	if (addr >= event->end)
+		goto out;
+
+	mirror_fault.event = event;
+	mirror_fault.mirror = mirror;
+	mirror_fault.vma = vma;
+	mirror_fault.addr = addr;
+	mirror_fault.iter = iter;
+	walk.mm = mirror->hmm->mm;
+	walk.private = &mirror_fault;
+	walk.pmd_entry = hmm_mirror_fault_pmd;
+	ret = walk_page_range(addr, event->end, &walk);
+	if (!ret) {
+		ret = mirror->device->ops->update(mirror, event);
+		if (!ret) {
+			addr = mirror_fault.addr;
+			goto again;
+		}
+	}
+
+out:
+	hmm_device_fault_end(mirror->hmm, event);
+	if (ret == -ENOENT) {
+		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
+		ret = ret ? ret : -EAGAIN;
+	}
+	return ret;
+}
+
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+	struct vm_area_struct *vma;
+	struct hmm_pt_iter iter;
+	int ret = 0;
+
+	mirror = hmm_mirror_ref(mirror);
+	if (!mirror)
+		return -ENODEV;
+	if (event->start >= mirror->hmm->vm_end) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	if (hmm_event_init(event, mirror->hmm, event->start,
+			   event->end, event->etype)) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	hmm_pt_iter_init(&iter);
+
+retry:
+	if (hmm_mirror_is_dead(mirror)) {
+		hmm_mirror_unref(&mirror);
+		return -ENODEV;
+	}
+
+	/*
+	 * So synchronization with the cpu page table is the most important
+	 * and tedious aspect of device page fault. There must be a strong
+	 * ordering btw call to device->update() for device page fault and
+	 * device->update() for cpu page table invalidation/update.
+	 *
+	 * Page that are exposed to device driver must stay valid while the
+	 * callback is in progress ie any cpu page table invalidation that
+	 * render those pages obsolete must call device->update() after the
+	 * device->update() call that faulted those pages.
+	 *
+	 * To achieve this we rely on few things. First the mmap_sem insure
+	 * us that any munmap() syscall will serialize with us. So issue are
+	 * with unmap_mapping_range() and with migrate or merge page. For this
+	 * hmm keep track of affected range of address and block device page
+	 * fault that hit overlapping range.
+	 */
+	down_read(&mirror->hmm->mm->mmap_sem);
+	vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+	if (!vma) {
+		ret = -EFAULT;
+		goto out;
+	}
+	if (vma->vm_start > event->start) {
+		event->end = vma->vm_start;
+		ret = -EFAULT;
+		goto out;
+	}
+	event->end = min(event->end, vma->vm_end) & PAGE_MASK;
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	switch (event->etype) {
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		ret = hmm_mirror_handle_fault(mirror, event, vma, &iter);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+out:
+	/* Drop the mmap_sem so anyone waiting on it have a chance. */
+	up_read(&mirror->hmm->mm->mmap_sem);
+	wake_up(&mirror->hmm->wait_queue);
+	if (ret == -EAGAIN)
+		goto retry;
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+	hmm_mirror_unref(&mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 09/36] HMM: add mm page table iterator helpers.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (7 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 08/36] HMM: add device page fault support v3 j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 10/36] HMM: use CPU page table during invalidation j.glisse
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

Because inside the mmu_notifier callback we do not have access to the
vma nor do we know which lock we are holding (the mmap semaphore or
the i_mmap_lock) we can not rely on the regular page table walk (nor
do we want as we have to be carefull to not split huge page).

So this patch introduce an helper to iterate of the cpu page table
content in an efficient way for the situation we are in. Which is we
know that none of the page table entry might vanish from below us
and thus it is safe to walk the page table.

The only added value of the iterator is that it keeps the page table
entry level map accross call which fit well with the HMM mirror page
table update code.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index e1aa6ca..93d6f5e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -410,6 +410,101 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
 };
 
 
+struct mm_pt_iter {
+	struct mm_struct	*mm;
+	pte_t			*ptep;
+	unsigned long		addr;
+};
+
+static void mm_pt_iter_init(struct mm_pt_iter *pt_iter, struct mm_struct *mm)
+{
+	pt_iter->mm = mm;
+	pt_iter->ptep = NULL;
+	pt_iter->addr = -1UL;
+}
+
+static void mm_pt_iter_fini(struct mm_pt_iter *pt_iter)
+{
+	pte_unmap(pt_iter->ptep);
+	pt_iter->ptep = NULL;
+	pt_iter->addr = -1UL;
+	pt_iter->mm = NULL;
+}
+
+static inline bool mm_pt_iter_in_range(struct mm_pt_iter *pt_iter,
+				       unsigned long addr)
+{
+	return (addr >= pt_iter->addr && addr < (pt_iter->addr + PMD_SIZE));
+}
+
+static struct page *mm_pt_iter_page(struct mm_pt_iter *pt_iter,
+				    unsigned long addr)
+{
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+
+again:
+	/*
+	 * What we are doing here is only valid if we old either the mmap
+	 * semaphore or the i_mmap_lock of vma->address_space the address
+	 * belongs to. Sadly because we can not easily get the vma struct
+	 * we can not sanity test that either of those lock is taken.
+	 *
+	 * We have to rely on people using this code knowing what they do.
+	 */
+	if (mm_pt_iter_in_range(pt_iter, addr) && likely(pt_iter->ptep)) {
+		pte_t pte = *(pt_iter->ptep + pte_index(addr));
+		unsigned long pfn;
+
+		if (pte_none(pte) || !pte_present(pte))
+			return NULL;
+		if (unlikely(pte_special(pte)))
+			return NULL;
+
+		pfn = pte_pfn(pte);
+		if (is_zero_pfn(pfn))
+			return NULL;
+		return pfn_to_page(pfn);
+	}
+
+	if (pt_iter->ptep) {
+		pte_unmap(pt_iter->ptep);
+		pt_iter->ptep = NULL;
+		pt_iter->addr = -1UL;
+	}
+
+	pgdp = pgd_offset(pt_iter->mm, addr);
+	if (pgd_none_or_clear_bad(pgdp))
+		return NULL;
+	pudp = pud_offset(pgdp, addr);
+	if (pud_none_or_clear_bad(pudp))
+		return NULL;
+	pmdp = pmd_offset(pudp, addr);
+	/*
+	 * Because we either have the mmap semaphore or the i_mmap_lock we know
+	 * that pmd can not vanish from under us, thus if pmd exist then it is
+	 * either a huge page or a valid pmd. It might also be in the splitting
+	 * transitory state.
+	 */
+	if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
+		return NULL;
+	if (pmd_trans_splitting(*pmdp))
+		/*
+		 * FIXME idealy we would wait but we have no easy mean to get a
+		 * hold of the vma. So for now busy loop until the splitting is
+		 * done.
+		 */
+		goto again;
+	if (pmd_huge(*pmdp))
+		return pmd_page(*pmdp) + pte_index(addr);
+	/* Regular pmd and it can not morph. */
+	pt_iter->ptep = pte_offset_map(pmdp, addr & PMD_MASK);
+	pt_iter->addr = addr & PMD_MASK;
+	goto again;
+}
+
+
 /* hmm_mirror - per device mirroring functions.
  *
  * Each device that mirror a process has a uniq hmm_mirror struct. A process
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 10/36] HMM: use CPU page table during invalidation.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (8 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 09/36] HMM: add mm page table iterator helpers j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 11/36] HMM: add discard range helper (to clear and free resources for a range) j.glisse
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jerome Glisse

From: Jerome Glisse <jglisse@redhat.com>

Once we store the dma mapping inside the secondary page table we can
no longer easily find back the page backing an address. Instead use
the cpu page table which still has the proper informations, except for
the invalidate_page() case which is handled by using the page passed
by the mmu_notifier layer.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 51 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 93d6f5e..8ec9ffa 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -50,9 +50,11 @@ static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
 static inline void hmm_mirror_unref(struct hmm_mirror **mirror);
 static void hmm_mirror_kill(struct hmm_mirror *mirror);
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
-				    struct hmm_event *event);
+				    struct hmm_event *event,
+				    struct page *page);
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
-				 struct hmm_event *event);
+				 struct hmm_event *event,
+				 struct page *page);
 
 
 /* hmm_event - use to track information relating to an event.
@@ -232,7 +234,9 @@ again:
 	}
 }
 
-static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+static void hmm_update(struct hmm *hmm,
+		       struct hmm_event *event,
+		       struct page *page)
 {
 	struct hmm_mirror *mirror;
 
@@ -245,7 +249,7 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 again:
 	down_read(&hmm->rwsem);
 	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
-		if (hmm_mirror_update(mirror, event)) {
+		if (hmm_mirror_update(mirror, event, page)) {
 			mirror = hmm_mirror_ref(mirror);
 			up_read(&hmm->rwsem);
 			hmm_mirror_kill(mirror);
@@ -343,9 +347,10 @@ static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
 	*etype = HMM_NONE;
 }
 
-static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
-						struct mm_struct *mm,
-						const struct mmu_notifier_range *range)
+static void hmm_notifier_invalidate(struct mmu_notifier *mn,
+				    struct mm_struct *mm,
+				    struct page *page,
+				    const struct mmu_notifier_range *range)
 {
 	struct hmm_event event;
 	unsigned long start = range->start, end = range->end;
@@ -386,7 +391,14 @@ static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	hmm_event_init(&event, hmm, start, end, event.etype);
 
-	hmm_update(hmm, &event);
+	hmm_update(hmm, &event, page);
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						const struct mmu_notifier_range *range)
+{
+	hmm_notifier_invalidate(mn, mm, NULL, range);
 }
 
 static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
@@ -400,7 +412,7 @@ static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
 	range.start = addr & PAGE_MASK;
 	range.end = range.start + PAGE_SIZE;
 	range.event = mmu_event;
-	hmm_notifier_invalidate_range_start(mn, mm, &range);
+	hmm_notifier_invalidate(mn, mm, page, &range);
 }
 
 static struct mmu_notifier_ops hmm_notifier_ops = {
@@ -551,23 +563,27 @@ static inline void hmm_mirror_unref(struct hmm_mirror **mirror)
 }
 
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
-				    struct hmm_event *event)
+				    struct hmm_event *event,
+				    struct page *page)
 {
 	struct hmm_device *device = mirror->device;
 	int ret = 0;
 
 	ret = device->ops->update(mirror, event);
-	hmm_mirror_update_pt(mirror, event);
+	hmm_mirror_update_pt(mirror, event, page);
 	return ret;
 }
 
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
-				 struct hmm_event *event)
+				 struct hmm_event *event,
+				 struct page *page)
 {
 	unsigned long addr;
 	struct hmm_pt_iter iter;
+	struct mm_pt_iter mm_iter;
 
 	hmm_pt_iter_init(&iter);
+	mm_pt_iter_init(&mm_iter, mirror->hmm->mm);
 	for (addr = event->start; addr != event->end;) {
 		unsigned long end, next;
 		dma_addr_t *hmm_pte;
@@ -593,10 +609,10 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 				continue;
 			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
 			    hmm_pte_test_write(hmm_pte)) {
-				struct page *page;
-
-				page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
-				set_page_dirty(page);
+				page = page ? : mm_pt_iter_page(&mm_iter, addr);
+				if (page)
+					set_page_dirty(page);
+				page = NULL;
 			}
 			*hmm_pte &= event->pte_mask;
 			if (hmm_pte_test_valid_pfn(hmm_pte))
@@ -606,6 +622,7 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 		hmm_pt_iter_directory_unlock(&iter, &mirror->pt);
 	}
 	hmm_pt_iter_fini(&iter, &mirror->pt);
+	mm_pt_iter_fini(&mm_iter);
 }
 
 static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
@@ -988,7 +1005,7 @@ static void hmm_mirror_kill(struct hmm_mirror *mirror)
 
 	/* Make sure everything is unmapped. */
 	hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
-	hmm_mirror_update(mirror, &event);
+	hmm_mirror_update(mirror, &event, NULL);
 
 	down_write(&mirror->hmm->rwsem);
 	if (!hlist_unhashed(&mirror->mlist)) {
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 11/36] HMM: add discard range helper (to clear and free resources for a range).
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (9 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 10/36] HMM: use CPU page table during invalidation j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 12/36] HMM: add dirty range helper (to toggle dirty bit inside mirror page table) j.glisse
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

A common use case is for device driver to stop caring for a range of address
long before said range is munmapped by userspace program. To avoid keeping
track of such range provide an helper function that will free HMM resources
for a range of address.

NOTE THAT DEVICE DRIVER MUST MAKE SURE THE HARDWARE WILL NO LONGER ACCESS THE
RANGE BECAUSE CALLING THIS HELPER !

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/hmm.h |  3 +++
 mm/hmm.c            | 24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index fdb1975..ec05df8 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -250,6 +250,9 @@ struct hmm_mirror {
 int hmm_mirror_register(struct hmm_mirror *mirror);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
+void hmm_mirror_range_discard(struct hmm_mirror *mirror,
+			      unsigned long start,
+			      unsigned long end);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 8ec9ffa..4cab3f2 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -916,6 +916,30 @@ out:
 }
 EXPORT_SYMBOL(hmm_mirror_fault);
 
+/* hmm_mirror_range_discard() - discard a range of address.
+ *
+ * @mirror: The mirror struct.
+ * @start: Start address of the range to discard (inclusive).
+ * @end: End address of the range to discard (exclusive).
+ *
+ * Call when device driver want to stop mirroring a range of address and free
+ * any HMM resources associated with that range (including dma mapping if any).
+ *
+ * THIS FUNCTION ASSUME THAT DRIVER ALREADY STOPPED USING THE RANGE OF ADDRESS
+ * AND THUS DO NOT PERFORM ANY SYNCHRONIZATION OR UPDATE WITH THE DRIVER TO
+ * INVALIDATE SAID RANGE.
+ */
+void hmm_mirror_range_discard(struct hmm_mirror *mirror,
+			      unsigned long start,
+			      unsigned long end)
+{
+	struct hmm_event event;
+
+	hmm_event_init(&event, mirror->hmm, start, end, HMM_MUNMAP);
+	hmm_mirror_update_pt(mirror, &event, NULL);
+}
+EXPORT_SYMBOL(hmm_mirror_range_discard);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 12/36] HMM: add dirty range helper (to toggle dirty bit inside mirror page table).
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (10 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 11/36] HMM: add discard range helper (to clear and free resources for a range) j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 13/36] HMM: DMA map memory on behalf of device driver j.glisse
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

Device driver must properly toggle the dirty inside the mirror page table so
dirtyness is properly accounted when core mm code needs to know. Provide a
simple helper to toggle that bit for a range of address.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/hmm.h |  3 +++
 mm/hmm.c            | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index ec05df8..186f497 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -253,6 +253,9 @@ int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 void hmm_mirror_range_discard(struct hmm_mirror *mirror,
 			      unsigned long start,
 			      unsigned long end);
+void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
+			    unsigned long start,
+			    unsigned long end);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 4cab3f2..21fda9f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -940,6 +940,53 @@ void hmm_mirror_range_discard(struct hmm_mirror *mirror,
 }
 EXPORT_SYMBOL(hmm_mirror_range_discard);
 
+/* hmm_mirror_range_dirty() - toggle dirty bit for a range of address.
+ *
+ * @mirror: The mirror struct.
+ * @start: Start address of the range to discard (inclusive).
+ * @end: End address of the range to discard (exclusive).
+ *
+ * Call when device driver want to toggle the dirty bit for a range of address.
+ * Usefull when the device driver just want to toggle the bit for whole range
+ * without walking the mirror page table itself.
+ *
+ * Note this function does not directly dirty the page behind an address, but
+ * this will happen once address is invalidated or discard by device driver or
+ * core mm code.
+ */
+void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
+			    unsigned long start,
+			    unsigned long end)
+{
+	struct hmm_pt_iter iter;
+	unsigned long addr;
+
+	hmm_pt_iter_init(&iter);
+	for (addr = start; addr != end;) {
+		unsigned long cend, next;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		if (!hmm_pte) {
+			addr = hmm_pt_iter_next(&iter, &mirror->pt,
+						addr, end);
+			continue;
+		}
+		cend = hmm_pt_level_next(&mirror->pt, addr, end,
+					 mirror->pt.llevel - 1);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, cend,
+						 mirror->pt.llevel);
+			if (!hmm_pte_test_valid_pfn(hmm_pte) ||
+			    !hmm_pte_test_write(hmm_pte))
+				continue;
+			hmm_pte_set_dirty(hmm_pte);
+		} while (addr = next, hmm_pte++, addr != cend);
+	}
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+}
+EXPORT_SYMBOL(hmm_mirror_range_dirty);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 13/36] HMM: DMA map memory on behalf of device driver.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (11 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 12/36] HMM: add dirty range helper (to toggle dirty bit inside mirror page table) j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 14/36] fork: pass the dst vma to copy_page_range() and its sub-functions j.glisse
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

Do the DMA mapping on behalf of the device as HMM is a good place
to perform this common task. Moreover in the future we hope to
add new infrastructure that would make DMA mapping more efficient
(lower overhead per page) by leveraging HMM data structure.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/hmm_pt.h |  11 +++
 mm/hmm.c               | 223 ++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 184 insertions(+), 50 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 330edb2..78a9073 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -176,6 +176,17 @@ static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
 	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
 }
 
+static inline dma_addr_t hmm_pte_from_dma_addr(dma_addr_t dma_addr)
+{
+	return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DMA_BIT);
+}
+
+static inline dma_addr_t hmm_pte_dma_addr(dma_addr_t pte)
+{
+	/* FIXME Use max dma addr instead of 0 ? */
+	return hmm_pte_test_valid_dma(&pte) ? (pte & HMM_PTE_DMA_MASK) : 0;
+}
+
 static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
 {
 	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
diff --git a/mm/hmm.c b/mm/hmm.c
index 21fda9f..1533223 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -41,6 +41,7 @@
 #include <linux/mman.h>
 #include <linux/delay.h>
 #include <linux/workqueue.h>
+#include <linux/dma-mapping.h>
 
 #include "internal.h"
 
@@ -574,6 +575,46 @@ static inline int hmm_mirror_update(struct hmm_mirror *mirror,
 	return ret;
 }
 
+static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
+				  struct hmm_event *event,
+				  struct hmm_pt_iter *iter,
+				  struct mm_pt_iter *mm_iter,
+				  struct page *page,
+				  dma_addr_t *hmm_pte,
+				  unsigned long addr)
+{
+	bool dirty = hmm_pte_test_and_clear_dirty(hmm_pte);
+
+	if (hmm_pte_test_valid_pfn(hmm_pte)) {
+		*hmm_pte &= event->pte_mask;
+		if (!hmm_pte_test_valid_pfn(hmm_pte))
+			hmm_pt_iter_directory_unref(iter, mirror->pt.llevel);
+		goto out;
+	}
+
+	if (!hmm_pte_test_valid_dma(hmm_pte))
+		return;
+
+	if (!hmm_pte_test_valid_dma(&event->pte_mask)) {
+		struct device *dev = mirror->device->dev;
+		dma_addr_t dma_addr;
+
+		dma_addr = hmm_pte_dma_addr(*hmm_pte);
+		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	}
+
+	*hmm_pte &= event->pte_mask;
+	if (!hmm_pte_test_valid_dma(hmm_pte))
+		hmm_pt_iter_directory_unref(iter, mirror->pt.llevel);
+
+out:
+	if (dirty) {
+		page = page ? : mm_pt_iter_page(mm_iter, addr);
+		if (page)
+			set_page_dirty(page);
+	}
+}
+
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 				 struct hmm_event *event,
 				 struct page *page)
@@ -605,19 +646,9 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 		do {
 			next = hmm_pt_level_next(&mirror->pt, addr, end,
 						 mirror->pt.llevel);
-			if (!hmm_pte_test_valid_pfn(hmm_pte))
-				continue;
-			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
-			    hmm_pte_test_write(hmm_pte)) {
-				page = page ? : mm_pt_iter_page(&mm_iter, addr);
-				if (page)
-					set_page_dirty(page);
-				page = NULL;
-			}
-			*hmm_pte &= event->pte_mask;
-			if (hmm_pte_test_valid_pfn(hmm_pte))
-				continue;
-			hmm_pt_iter_directory_unref(&iter, mirror->pt.llevel);
+			hmm_mirror_update_pte(mirror, event, &iter, &mm_iter,
+					      page, hmm_pte, addr);
+			page = NULL;
 		} while (addr = next, hmm_pte++, addr != end);
 		hmm_pt_iter_directory_unlock(&iter, &mirror->pt);
 	}
@@ -697,12 +728,12 @@ static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
 			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
 						 mirror->pt.llevel);
 
-			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
-				hmm_pte[i] = hmm_pte_from_pfn(pfn);
-				hmm_pt_iter_directory_ref(iter,
-							  mirror->pt.llevel);
-			}
-			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i]))
+				hmm_pt_iter_directory_ref(iter, mirror->pt.llevel);
+			hmm_pte[i] = hmm_pte_from_pfn(pfn);
 			if (pmd_write(*pmdp))
 				hmm_pte_set_write(&hmm_pte[i]);
 		} while (addr = next, pfn++, i++, addr != hmm_end);
@@ -766,12 +797,12 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 				break;
 			}
 
-			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
-				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
-				hmm_pt_iter_directory_ref(iter,
-							  mirror->pt.llevel);
-			}
-			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i]))
+				hmm_pt_iter_directory_ref(iter, mirror->pt.llevel);
+			hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
 			if (pte_write(*ptep))
 				hmm_pte_set_write(&hmm_pte[i]);
 		} while (addr = next, ptep++, i++, addr != hmm_end);
@@ -783,6 +814,86 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 	return ret;
 }
 
+
+static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
+			      struct hmm_pt_iter *iter,
+			      unsigned long start,
+			      unsigned long end)
+{
+	struct device *dev = mirror->device->dev;
+	unsigned long addr;
+	int ret;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, hmm_end, next;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_fault(iter, &mirror->pt, addr);
+		if (!hmm_pte)
+			return -ENOENT;
+
+		hmm_end = hmm_pt_level_next(&mirror->pt, addr, end,
+					    mirror->pt.llevel - 1);
+		do {
+			dma_addr_t dma_addr, pte;
+			struct page *page;
+
+			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
+						 mirror->pt.llevel);
+
+again:
+			pte = ACCESS_ONCE(hmm_pte[i]);
+			if (!hmm_pte_test_valid_pfn(&pte)) {
+				if (!hmm_pte_test_valid_dma(&pte)) {
+					ret = -ENOENT;
+					break;
+				}
+				continue;
+			}
+
+			page = pfn_to_page(hmm_pte_pfn(pte));
+			VM_BUG_ON(!page);
+			dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+						DMA_BIDIRECTIONAL);
+			if (dma_mapping_error(dev, dma_addr)) {
+				ret = -ENOMEM;
+				break;
+			}
+
+			hmm_pt_iter_directory_lock(iter, &mirror->pt);
+			/*
+			 * Make sure we transfer the dirty bit. Note that there
+			 * might still be a window for another thread to set
+			 * the dirty bit before we check for pte equality. This
+			 * will just lead to a useless retry so it is not the
+			 * end of the world here.
+			 */
+			if (hmm_pte_test_dirty(&hmm_pte[i]))
+				hmm_pte_set_dirty(&pte);
+			if (ACCESS_ONCE(hmm_pte[i]) != pte) {
+				hmm_pt_iter_directory_unlock(iter,&mirror->pt);
+				dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+					       DMA_BIDIRECTIONAL);
+				if (hmm_pte_test_valid_pfn(&pte))
+					goto again;
+				if (!hmm_pte_test_valid_dma(&pte)) {
+					ret = -ENOENT;
+					break;
+				}
+			} else {
+				hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
+				if (hmm_pte_test_write(&pte))
+					hmm_pte_set_write(&hmm_pte[i]);
+				if (hmm_pte_test_dirty(&pte))
+					hmm_pte_set_dirty(&hmm_pte[i]);
+				hmm_pt_iter_directory_unlock(iter, &mirror->pt);
+			}
+		} while (addr = next, i++, addr != hmm_end && !ret);
+	}
+
+	return ret;
+}
+
 static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 				   struct hmm_event *event,
 				   struct vm_area_struct *vma,
@@ -791,7 +902,7 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 	struct hmm_mirror_fault mirror_fault;
 	unsigned long addr = event->start;
 	struct mm_walk walk = {0};
-	int ret = 0;
+	int ret;
 
 	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
 		return -EACCES;
@@ -800,32 +911,43 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 	if (ret)
 		return ret;
 
-again:
-	if (event->backoff) {
-		ret = -EAGAIN;
-		goto out;
-	}
-	if (addr >= event->end)
-		goto out;
+	do {
+		if (event->backoff) {
+			ret = -EAGAIN;
+			break;
+		}
+		if (addr >= event->end)
+			break;
+
+		mirror_fault.event = event;
+		mirror_fault.mirror = mirror;
+		mirror_fault.vma = vma;
+		mirror_fault.addr = addr;
+		mirror_fault.iter = iter;
+		walk.mm = mirror->hmm->mm;
+		walk.private = &mirror_fault;
+		walk.pmd_entry = hmm_mirror_fault_pmd;
+		ret = walk_page_range(addr, event->end, &walk);
+		if (ret)
+			break;
+
+		if (event->backoff) {
+			ret = -EAGAIN;
+			break;
+		}
 
-	mirror_fault.event = event;
-	mirror_fault.mirror = mirror;
-	mirror_fault.vma = vma;
-	mirror_fault.addr = addr;
-	mirror_fault.iter = iter;
-	walk.mm = mirror->hmm->mm;
-	walk.private = &mirror_fault;
-	walk.pmd_entry = hmm_mirror_fault_pmd;
-	ret = walk_page_range(addr, event->end, &walk);
-	if (!ret) {
-		ret = mirror->device->ops->update(mirror, event);
-		if (!ret) {
-			addr = mirror_fault.addr;
-			goto again;
+		if (mirror->device->dev) {
+			ret = hmm_mirror_dma_map(mirror, iter, addr, event->end);
+			if (ret)
+				break;
 		}
-	}
 
-out:
+		ret = mirror->device->ops->update(mirror, event);
+		if (ret)
+			break;
+		addr = mirror_fault.addr;
+	} while (1);
+
 	hmm_device_fault_end(mirror->hmm, event);
 	if (ret == -ENOENT) {
 		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
@@ -977,7 +1099,8 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
 		do {
 			next = hmm_pt_level_next(&mirror->pt, addr, cend,
 						 mirror->pt.llevel);
-			if (!hmm_pte_test_valid_pfn(hmm_pte) ||
+			if (!hmm_pte_test_valid_dma(hmm_pte) ||
+			    !hmm_pte_test_valid_pfn(hmm_pte) ||
 			    !hmm_pte_test_write(hmm_pte))
 				continue;
 			hmm_pte_set_dirty(hmm_pte);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 14/36] fork: pass the dst vma to copy_page_range() and its sub-functions.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (12 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 13/36] HMM: DMA map memory on behalf of device driver j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 15/36] memcg: export get_mem_cgroup_from_mm() j.glisse
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

For HMM we will need to resort to the old way of allocating new page
for anonymous memory when that anonymous memory have been migrated
to device memory.

This does not impact any process that do not use HMM through some
device driver. Only process that migrate anonymous memory to device
memory with HMM will have to copy migrated page on fork.

We do not expect this to be a common or adviced thing to do so we
resort to the simpler solution of allocating new page. If this kind
of usage turns out to be important we will revisit way to achieve
COW even for remote memory.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/mm.h |  5 +++--
 kernel/fork.c      |  2 +-
 mm/memory.c        | 33 +++++++++++++++++++++------------
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cf642d9..8923532 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1083,8 +1083,9 @@ int walk_page_range(unsigned long addr, unsigned long end,
 int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk);
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
-int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		    struct vm_area_struct *dst_vma,
+		    struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
diff --git a/kernel/fork.c b/kernel/fork.c
index 4083be7..0bd5b59 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -492,7 +492,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/memory.c b/mm/memory.c
index 5a1131f..6497009 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -885,8 +885,10 @@ out_set_pte:
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		   unsigned long addr, unsigned long end)
+			  pmd_t *dst_pmd, pmd_t *src_pmd,
+			  struct vm_area_struct *dst_vma,
+			  struct vm_area_struct *vma,
+			  unsigned long addr, unsigned long end)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
@@ -947,9 +949,12 @@ again:
 	return 0;
 }
 
-static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+static inline int copy_pmd_range(struct mm_struct *dst_mm,
+				 struct mm_struct *src_mm,
+				 pud_t *dst_pud, pud_t *src_pud,
+				 struct vm_area_struct *dst_vma,
+				 struct vm_area_struct *vma,
+				 unsigned long addr, unsigned long end)
 {
 	pmd_t *src_pmd, *dst_pmd;
 	unsigned long next;
@@ -974,15 +979,18 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+				   dst_vma, vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
 
-static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+static inline int copy_pud_range(struct mm_struct *dst_mm,
+				 struct mm_struct *src_mm,
+				 pgd_t *dst_pgd, pgd_t *src_pgd,
+				 struct vm_area_struct *dst_vma,
+				 struct vm_area_struct *vma,
+				 unsigned long addr, unsigned long end)
 {
 	pud_t *src_pud, *dst_pud;
 	unsigned long next;
@@ -996,14 +1004,15 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+				   dst_vma, vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		    struct vm_area_struct *dst_vma,
+		    struct vm_area_struct *vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
@@ -1057,7 +1066,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					    dst_vma, vma, addr, next))) {
 			ret = -ENOMEM;
 			break;
 		}
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 15/36] memcg: export get_mem_cgroup_from_mm()
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (13 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 14/36] fork: pass the dst vma to copy_page_range() and its sub-functions j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 16/36] HMM: add special swap filetype for memory migrated to HMM device memory j.glisse
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

Usefull for HMM when trying to uncharge freshly allocated anonymous
pages after error inside migration memory migration path.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/memcontrol.h | 7 +++++++
 mm/memcontrol.c            | 3 ++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c89181..488748e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -93,6 +93,7 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+extern struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
 extern struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css);
@@ -275,6 +276,12 @@ static inline struct cgroup_subsys_state
 	return NULL;
 }
 
+
+static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f20..360d9e0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -966,7 +966,7 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 	return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
 }
 
-static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
+struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
 	struct mem_cgroup *memcg = NULL;
 
@@ -988,6 +988,7 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 	rcu_read_unlock();
 	return memcg;
 }
+EXPORT_SYMBOL(get_mem_cgroup_from_mm);
 
 /**
  * mem_cgroup_iter - iterate over memory cgroup hierarchy
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 16/36] HMM: add special swap filetype for memory migrated to HMM device memory.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (14 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 15/36] memcg: export get_mem_cgroup_from_mm() j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-06-24  7:49   ` Haggai Eran
  2015-05-21 19:31 ` [PATCH 17/36] HMM: add new HMM page table flag (valid device memory) j.glisse
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jerome Glisse, Jatin Kumar

From: Jerome Glisse <jglisse@redhat.com>

When migrating anonymous memory from system memory to device memory
CPU pte are replaced with special HMM swap entry so that page fault,
get user page (gup), fork, ... are properly redirected to HMM helpers.

This patch only add the new swap type entry and hooks HMM helpers
functions inside the page fault and fork code path.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h     | 34 ++++++++++++++++++++++++++++++++++
 include/linux/swap.h    | 12 +++++++++++-
 include/linux/swapops.h | 43 ++++++++++++++++++++++++++++++++++++++++++-
 mm/hmm.c                | 21 +++++++++++++++++++++
 mm/memory.c             | 22 ++++++++++++++++++++++
 5 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 186f497..f243eb5 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -257,6 +257,40 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
 			    unsigned long start,
 			    unsigned long end);
 
+int hmm_handle_cpu_fault(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pmd_t *pmdp, unsigned long addr,
+			unsigned flags, pte_t orig_pte);
+
+int hmm_mm_fork(struct mm_struct *src_mm,
+		struct mm_struct *dst_mm,
+		struct vm_area_struct *dst_vma,
+		pmd_t *dst_pmd,
+		unsigned long start,
+		unsigned long end);
+
+#else /* CONFIG_HMM */
+
+static inline int hmm_handle_mm_fault(struct mm_struct *mm,
+				      struct vm_area_struct *vma,
+				      pmd_t *pmdp, unsigned long addr,
+				      unsigned flags, pte_t orig_pte)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static inline int hmm_mm_fork(struct mm_struct *src_mm,
+			      struct mm_struct *dst_mm,
+			      struct vm_area_struct *dst_vma,
+			      pmd_t *dst_pmd,
+			      unsigned long start,
+			      unsigned long end)
+{
+	BUG();
+	return -ENOMEM;
+}
 
 #endif /* CONFIG_HMM */
+
+
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0428e4c..89b9dda 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -70,8 +70,18 @@ static inline int current_is_kswapd(void)
 #define SWP_HWPOISON_NUM 0
 #endif
 
+/*
+ * HMM (heterogeneous memory management) used when data is in remote memory.
+ */
+#ifdef CONFIG_HMM
+#define SWP_HMM_NUM 1
+#define SWP_HMM			(MAX_SWAPFILES + SWP_MIGRATION_NUM + SWP_HWPOISON_NUM)
+#else
+#define SWP_HMM_NUM 0
+#endif
+
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_HMM_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index cedf3d3..934359f 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -190,7 +190,7 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
 }
 #endif
 
-#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || defined(CONFIG_HMM)
 static inline int non_swap_entry(swp_entry_t entry)
 {
 	return swp_type(entry) >= MAX_SWAPFILES;
@@ -202,4 +202,45 @@ static inline int non_swap_entry(swp_entry_t entry)
 }
 #endif
 
+#ifdef CONFIG_HMM
+static inline swp_entry_t make_hmm_entry(void)
+{
+	/* We do not store anything inside the CPU page table entry (pte). */
+	return swp_entry(SWP_HMM, 0);
+}
+
+static inline swp_entry_t make_hmm_entry_locked(void)
+{
+	/* We do not store anything inside the CPU page table entry (pte). */
+	return swp_entry(SWP_HMM, 1);
+}
+
+static inline swp_entry_t make_hmm_entry_poisonous(void)
+{
+	/* We do not store anything inside the CPU page table entry (pte). */
+	return swp_entry(SWP_HMM, 2);
+}
+
+static inline int is_hmm_entry(swp_entry_t entry)
+{
+	return (swp_type(entry) == SWP_HMM);
+}
+
+static inline int is_hmm_entry_locked(swp_entry_t entry)
+{
+	return (swp_type(entry) == SWP_HMM) && (swp_offset(entry) == 1);
+}
+
+static inline int is_hmm_entry_poisonous(swp_entry_t entry)
+{
+	return (swp_type(entry) == SWP_HMM) && (swp_offset(entry) == 2);
+}
+#else /* CONFIG_HMM */
+static inline int is_hmm_entry(swp_entry_t swp)
+{
+	return 0;
+}
+#endif /* CONFIG_HMM */
+
+
 #endif /* _LINUX_SWAPOPS_H */
diff --git a/mm/hmm.c b/mm/hmm.c
index 1533223..2143a58 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -423,6 +423,27 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
 };
 
 
+int hmm_handle_cpu_fault(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pmd_t *pmdp, unsigned long addr,
+			unsigned flags, pte_t orig_pte)
+{
+	return VM_FAULT_SIGBUS;
+}
+EXPORT_SYMBOL(hmm_handle_cpu_fault);
+
+int hmm_mm_fork(struct mm_struct *src_mm,
+		struct mm_struct *dst_mm,
+		struct vm_area_struct *dst_vma,
+		pmd_t *dst_pmd,
+		unsigned long start,
+		unsigned long end)
+{
+	return -ENOMEM;
+}
+EXPORT_SYMBOL(hmm_mm_fork);
+
+
 struct mm_pt_iter {
 	struct mm_struct	*mm;
 	pte_t			*ptep;
diff --git a/mm/memory.c b/mm/memory.c
index 6497009..b6840fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -53,6 +53,7 @@
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
@@ -893,9 +894,11 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
+	unsigned cnt_hmm_entry = 0;
 	int progress = 0;
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
+	unsigned long start;
 
 again:
 	init_rss_vec(rss);
@@ -909,6 +912,7 @@ again:
 	orig_src_pte = src_pte;
 	orig_dst_pte = dst_pte;
 	arch_enter_lazy_mmu_mode();
+	start = addr;
 
 	do {
 		/*
@@ -925,6 +929,12 @@ again:
 			progress++;
 			continue;
 		}
+		if (unlikely(!pte_present(*src_pte))) {
+			entry = pte_to_swp_entry(*src_pte);
+
+			if (is_hmm_entry(entry))
+				cnt_hmm_entry++;
+		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
 							vma, addr, rss);
 		if (entry.val)
@@ -939,6 +949,15 @@ again:
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
 	cond_resched();
 
+	if (cnt_hmm_entry) {
+		int ret;
+
+		ret = hmm_mm_fork(src_mm, dst_mm, dst_vma,
+				  dst_pmd, start, end);
+		if (ret)
+			return ret;
+	}
+
 	if (entry.val) {
 		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
 			return -ENOMEM;
@@ -2487,6 +2506,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			migration_entry_wait(mm, pmd, address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_hmm_entry(entry)) {
+			ret = hmm_handle_cpu_fault(mm, vma, pmd, address,
+						   flags, orig_pte);
 		} else {
 			print_bad_pte(vma, address, orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 17/36] HMM: add new HMM page table flag (valid device memory).
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (15 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 16/36] HMM: add special swap filetype for memory migrated to HMM device memory j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 18/36] HMM: add new HMM page table flag (select flag) j.glisse
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar

From: JA(C)rA'me Glisse <jglisse@redhat.com>

For memory migrated to device we need a new type of memory entry.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm_pt.h | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 78a9073..26cfe5e 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -74,10 +74,11 @@ static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
  * In the first case the device driver must ignore any pfn entry as they might
  * show as transient state while HMM is mapping the page.
  */
-#define HMM_PTE_VALID_DMA_BIT	0
-#define HMM_PTE_VALID_PFN_BIT	1
-#define HMM_PTE_WRITE_BIT	2
-#define HMM_PTE_DIRTY_BIT	3
+#define HMM_PTE_VALID_DEV_BIT	0
+#define HMM_PTE_VALID_DMA_BIT	1
+#define HMM_PTE_VALID_PFN_BIT	2
+#define HMM_PTE_WRITE_BIT	3
+#define HMM_PTE_DIRTY_BIT	4
 /*
  * Reserve some bits for device driver private flags. Note that thus can only
  * be manipulated using the hmm_pte_*_bit() sets of helpers.
@@ -85,7 +86,7 @@ static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
  * WARNING ONLY SET/CLEAR THOSE FLAG ON PTE ENTRY THAT HAVE THE VALID BIT SET
  * AS OTHERWISE ANY BIT SET BY THE DRIVER WILL BE OVERWRITTEN BY HMM.
  */
-#define HMM_PTE_HW_SHIFT	4
+#define HMM_PTE_HW_SHIFT	8
 
 #define HMM_PTE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
 #define HMM_PTE_DMA_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
@@ -166,6 +167,7 @@ static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
 	HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
 	HMM_PTE_TEST_AND_SET_BIT(name, bit)
 
+HMM_PTE_BIT_HELPER(valid_dev, HMM_PTE_VALID_DEV_BIT)
 HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
 HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
 HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
@@ -176,11 +178,23 @@ static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
 	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
 }
 
+static inline dma_addr_t hmm_pte_from_dev_addr(dma_addr_t dma_addr)
+{
+	return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DEV_BIT);
+}
+
 static inline dma_addr_t hmm_pte_from_dma_addr(dma_addr_t dma_addr)
 {
 	return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DMA_BIT);
 }
 
+static inline dma_addr_t hmm_pte_dev_addr(dma_addr_t pte)
+{
+	/* FIXME Use max dma addr instead of 0 ? */
+	return hmm_pte_test_valid_dev(&pte) ? (pte & HMM_PTE_DMA_MASK) :
+					      (dma_addr_t)-1UL;
+}
+
 static inline dma_addr_t hmm_pte_dma_addr(dma_addr_t pte)
 {
 	/* FIXME Use max dma addr instead of 0 ? */
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 18/36] HMM: add new HMM page table flag (select flag).
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (16 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 17/36] HMM: add new HMM page table flag (valid device memory) j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 19:31 ` [PATCH 19/36] HMM: handle HMM device page table entry on mirror page table fault and update j.glisse
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

When migrating memory the same array for HMM page table entry might be
use with several different devices. Add a new select flag so current
device driver callback can know which entry are selected for the device.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/hmm_pt.h | 6 ++++--
 mm/hmm.c               | 5 ++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 26cfe5e..36f7e00 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -77,8 +77,9 @@ static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
 #define HMM_PTE_VALID_DEV_BIT	0
 #define HMM_PTE_VALID_DMA_BIT	1
 #define HMM_PTE_VALID_PFN_BIT	2
-#define HMM_PTE_WRITE_BIT	3
-#define HMM_PTE_DIRTY_BIT	4
+#define HMM_PTE_SELECT		3
+#define HMM_PTE_WRITE_BIT	4
+#define HMM_PTE_DIRTY_BIT	5
 /*
  * Reserve some bits for device driver private flags. Note that thus can only
  * be manipulated using the hmm_pte_*_bit() sets of helpers.
@@ -170,6 +171,7 @@ static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
 HMM_PTE_BIT_HELPER(valid_dev, HMM_PTE_VALID_DEV_BIT)
 HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
 HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
+HMM_PTE_BIT_HELPER(select, HMM_PTE_SELECT)
 HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
 HMM_PTE_BIT_HELPER(write, HMM_PTE_WRITE_BIT)
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 2143a58..761905a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -757,6 +757,7 @@ static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
 			hmm_pte[i] = hmm_pte_from_pfn(pfn);
 			if (pmd_write(*pmdp))
 				hmm_pte_set_write(&hmm_pte[i]);
+			hmm_pte_set_select(&hmm_pte[i]);
 		} while (addr = next, pfn++, i++, addr != hmm_end);
 		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
 		mirror_fault->addr = addr;
@@ -826,6 +827,7 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 			hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
 			if (pte_write(*ptep))
 				hmm_pte_set_write(&hmm_pte[i]);
+			hmm_pte_set_select(&hmm_pte[i]);
 		} while (addr = next, ptep++, i++, addr != hmm_end);
 		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
 		pte_unmap(ptep - 1);
@@ -864,7 +866,8 @@ static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
 
 again:
 			pte = ACCESS_ONCE(hmm_pte[i]);
-			if (!hmm_pte_test_valid_pfn(&pte)) {
+			if (!hmm_pte_test_valid_pfn(&pte) ||
+			    !hmm_pte_test_select(&pte)) {
 				if (!hmm_pte_test_valid_dma(&pte)) {
 					ret = -ENOENT;
 					break;
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 19/36] HMM: handle HMM device page table entry on mirror page table fault and update.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (17 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 18/36] HMM: add new HMM page table flag (select flag) j.glisse
@ 2015-05-21 19:31 ` j.glisse
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-05-21 19:31 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

When faulting or updating the device page table properly handle the case of device
memory entry.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index 761905a..e4585b7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -613,6 +613,13 @@ static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
 		goto out;
 	}
 
+	if (hmm_pte_test_valid_dev(hmm_pte)) {
+		*hmm_pte &= event->pte_mask;
+		if (!hmm_pte_test_valid_dev(hmm_pte))
+			hmm_pt_iter_directory_unref(iter, mirror->pt.llevel);
+		return;
+	}
+
 	if (!hmm_pte_test_valid_dma(hmm_pte))
 		return;
 
@@ -813,6 +820,13 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 		do {
 			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
 						 mirror->pt.llevel);
+
+			if (hmm_pte_test_valid_dev(&hmm_pte[i])) {
+				if (write)
+					hmm_pte_set_write(&hmm_pte[i]);
+				continue;
+			}
+
 			if (!pte_present(*ptep) || (write && !pte_write(*ptep))) {
 				ret = -ENOENT;
 				ptep++;
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back.
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (18 preceding siblings ...)
  2015-05-21 19:31 ` [PATCH 19/36] HMM: handle HMM device page table entry on mirror page table fault and update j.glisse
@ 2015-05-21 20:22 ` jglisse
  2015-05-21 20:22   ` [PATCH 21/36] HMM: mm add helper to update page table when migrating memory jglisse
                     ` (15 more replies)
  2015-05-30  3:01 ` HMM (Heterogeneous Memory Management) v8 John Hubbard
  2015-05-31  6:56 ` Haggai Eran
  21 siblings, 16 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:22 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

To migrate memory back we first need to lock HMM special CPU page
table entry so we know no one else might try to migrate those entry
back. Helper also allocate new page where data will be copied back
from the device. Then we can proceed with the device DMA operation.

Once DMA is done we can update again the CPU page table to point to
the new page that holds the content copied back from device memory.

Note that we do not need to invalidate the range are we are only
modifying non present CPU page table entry.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/mm.h |  12 +++
 mm/memory.c        | 236 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 248 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8923532..f512b8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2205,6 +2205,18 @@ static inline void hmm_mm_init(struct mm_struct *mm)
 {
 	mm->hmm = NULL;
 }
+
+int mm_hmm_migrate_back(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pte_t *new_pte,
+			unsigned long start,
+			unsigned long end);
+void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
+				 struct vm_area_struct *vma,
+				 pte_t *new_pte,
+				 dma_addr_t *hmm_pte,
+				 unsigned long start,
+				 unsigned long end);
 #else /* !CONFIG_HMM */
 static inline void hmm_mm_init(struct mm_struct *mm)
 {
diff --git a/mm/memory.c b/mm/memory.c
index b6840fb..4674d40 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3461,6 +3461,242 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+
+#ifdef CONFIG_HMM
+/* mm_hmm_migrate_back() - lock HMM CPU page table entry and allocate new page.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @new_pte: Array of new CPU page table entry value.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This function will lock HMM page table entry and allocate new page for entry
+ * it successfully locked.
+ */
+int mm_hmm_migrate_back(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pte_t *new_pte,
+			unsigned long start,
+			unsigned long end)
+{
+	pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
+	unsigned long addr, i;
+	int ret = 0;
+
+	VM_BUG_ON(vma->vm_ops || (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+
+	if (unlikely(anon_vma_prepare(vma)))
+		return -ENOMEM;
+
+	start &= PAGE_MASK;
+	end = PAGE_ALIGN(end);
+	memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
+
+	for (addr = start; addr < end;) {
+		unsigned long cstart, next;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_offset(pgdp, addr);
+		/*
+		 * Some other thread might already have migrated back the entry
+		 * and freed the page table. Unlikely thought.
+		 */
+		if (unlikely(!pudp)) {
+			addr = min((addr + PUD_SIZE) & PUD_MASK, end);
+			continue;
+		}
+		pmdp = pmd_offset(pudp, addr);
+		if (unlikely(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+			     pmd_trans_huge(*pmdp))) {
+			addr = min((addr + PMD_SIZE) & PMD_MASK, end);
+			continue;
+		}
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
+		     next = min((addr + PMD_SIZE) & PMD_MASK, end);
+		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
+			swp_entry_t entry;
+
+			entry = pte_to_swp_entry(*ptep);
+			if (pte_none(*ptep) || pte_present(*ptep) ||
+			    !is_hmm_entry(entry) ||
+			    is_hmm_entry_locked(entry))
+				continue;
+
+			set_pte_at(mm, addr, ptep, hmm_entry);
+			new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
+						   vma->vm_page_prot));
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+
+		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
+		     addr < next; addr += PAGE_SIZE, i++) {
+			struct mem_cgroup *memcg;
+			struct page *page;
+
+			if (!pte_present(new_pte[i]))
+				continue;
+
+			page = alloc_zeroed_user_highpage_movable(vma, addr);
+			if (!page) {
+				ret = -ENOMEM;
+				break;
+			}
+			__SetPageUptodate(page);
+			if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
+						  &memcg)) {
+				page_cache_release(page);
+				ret = -ENOMEM;
+				break;
+			}
+			/*
+			 * FIXME Need to see if that can happens and how. I
+			 * would rather not have a array of memcg.
+			 */
+			BUG_ON(memcg != get_mem_cgroup_from_mm(mm));
+			new_pte[i] = mk_pte(page, vma->vm_page_prot);
+			if (vma->vm_flags & VM_WRITE)
+				new_pte[i] = pte_mkwrite(pte_mkdirty(new_pte[i]));
+		}
+
+		if (!ret)
+			continue;
+
+		hmm_entry = swp_entry_to_pte(make_hmm_entry());
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
+		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
+			unsigned long pfn = pte_pfn(new_pte[i]);
+
+			if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
+				continue;
+
+			set_pte_at(mm, addr, ptep, hmm_entry);
+			pte_clear(mm, addr, &new_pte[i]);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+		break;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(mm_hmm_migrate_back);
+
+/* mm_hmm_migrate_back_cleanup() - set CPU page table entry to new page.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @new_pte: Array of new CPU page table entry value.
+ * @hmm_pte: Array of HMM table entry indicating if migration was successfull.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This is call after mm_hmm_migrate_back() and after effective migration. It
+ * will set CPU page table entry to new value pointing to newly allocated page
+ * where the data was effectively copied back from device memory.
+ *
+ * Any failure will trigger a bug on.
+ *
+ * TODO: For copy failure we might simply set a new value for the HMM special
+ * entry indicating poisonous entry.
+ */
+void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
+				 struct vm_area_struct *vma,
+				 pte_t *new_pte,
+				 dma_addr_t *hmm_pte,
+				 unsigned long start,
+				 unsigned long end)
+{
+	pte_t hmm_poison = swp_entry_to_pte(make_hmm_entry_poisonous());
+	struct mem_cgroup *memcg;
+	unsigned long addr, i;
+
+	memcg = get_mem_cgroup_from_mm(mm);
+	for (addr = start; addr < end;) {
+		unsigned long cstart, next, free_pages;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		/*
+		 * We know for certain that we did set special swap entry for
+		 * the range and HMM entry are mark as locked so it means that
+		 * no one beside us can modify them which apply that all level
+		 * of the CPU page table are valid.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_offset(pgdp, addr);
+		VM_BUG_ON(!pudp);
+		pmdp = pmd_offset(pudp, addr);
+		VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+			  pmd_trans_huge(*pmdp));
+
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
+		     cstart = addr, i = (addr - start) >> PAGE_SHIFT,
+		     free_pages = 0; addr < next; addr += PAGE_SIZE,
+		     ptep++, i++) {
+			swp_entry_t entry;
+			struct page *page;
+
+			if (!pte_present(new_pte[i]))
+				continue;
+
+			entry = pte_to_swp_entry(*ptep);
+
+			/*
+			 * Sanity catch all the things that could go wrong but
+			 * should not, no plan B here.
+			 */
+			VM_BUG_ON(pte_none(*ptep));
+			VM_BUG_ON(pte_present(*ptep));
+			VM_BUG_ON(!is_hmm_entry_locked(entry));
+
+			if (!hmm_pte_test_valid_dma(&hmm_pte[i]) &&
+			    !hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				set_pte_at(mm, addr, ptep, hmm_poison);
+				free_pages++;
+				continue;
+			}
+
+			page = pte_page(new_pte[i]);
+			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			page_add_new_anon_rmap(page, vma, addr);
+			mem_cgroup_commit_charge(page, memcg, false);
+			lru_cache_add_active_or_unevictable(page, vma);
+			set_pte_at(mm, addr, ptep, new_pte[i]);
+			update_mmu_cache(vma, addr, ptep);
+			pte_clear(mm, addr, &new_pte[i]);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+
+		if (!free_pages)
+			continue;
+
+		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
+		     addr < next; addr += PAGE_SIZE, i++) {
+			struct page *page;
+
+			if (!pte_present(new_pte[i]))
+				continue;
+
+			page = pte_page(new_pte[i]);
+			mem_cgroup_cancel_charge(page, memcg);
+			page_cache_release(page);
+		}
+	}
+}
+EXPORT_SYMBOL(mm_hmm_migrate_back_cleanup);
+#endif
+
+
 #ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 21/36] HMM: mm add helper to update page table when migrating memory
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
@ 2015-05-21 20:22   ` jglisse
  2015-05-21 20:22   ` [PATCH 22/36] HMM: add new callback for copying memory from and to device memory jglisse
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:22 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

For doing memory migration to remote memory we need to unmap range
of anonymous memory from CPU page table and replace page table entry
with special HMM entry.

This is a multi-stage process, first we save and replace page table
entry with special HMM entry, also flushing tlb in the process. If
we run into non allocated entry we either use the zero page or we
allocate new page. For swaped entry we try to swap them in.

Once we have set the page table entry to the special entry we check
the page backing each of the address to make sure that only page
table mappings are holding reference on the page, which means we
can safely migrate the page to device memory. Because the CPU page
table entry are special entry, no get_user_pages() can reference
the page anylonger. So we are safe from race on that front. Note
that the page can still be referenced by get_user_pages() from
other process but in that case the page is write protected and
as we do not drop the mapcount nor the page count we know that
all user of get_user_pages() are only doing read only access (on
write access they would allocate a new page).

Once we have identified all the page that are safe to migrate the
first function return and let HMM schedule the migration with the
device driver.

Finaly there is a cleanup function that will drop the mapcount and
reference count on all page that have been successfully migrated,
or restore the page table entry otherwise.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/mm.h |  14 ++
 mm/memory.c        | 470 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 484 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f512b8a..6761dfc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2206,6 +2206,20 @@ static inline void hmm_mm_init(struct mm_struct *mm)
 	mm->hmm = NULL;
 }
 
+int mm_hmm_migrate(struct mm_struct *mm,
+		   struct vm_area_struct *vma,
+		   pte_t *save_pte,
+		   bool *backoff,
+		   const void *mmu_notifier_exclude,
+		   unsigned long start,
+		   unsigned long end);
+void mm_hmm_migrate_cleanup(struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    pte_t *save_pte,
+			    dma_addr_t *hmm_pte,
+			    unsigned long start,
+			    unsigned long end);
+
 int mm_hmm_migrate_back(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			pte_t *new_pte,
diff --git a/mm/memory.c b/mm/memory.c
index 4674d40..4bcd4d2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -54,6 +54,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hmm.h>
+#include <linux/hmm_pt.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
@@ -3694,6 +3695,475 @@ void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
 	}
 }
 EXPORT_SYMBOL(mm_hmm_migrate_back_cleanup);
+
+/* mm_hmm_migrate() - unmap range and set special HMM pte for it.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @save_pte: array where to save current CPU page table entry value.
+ * @backoff: Pointer toward a boolean indicating that we need to stop.
+ * @exclude: The mmu_notifier listener to exclude from mmu_notifier callback.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ * Returns: 0 on success, -EINVAL if some argument where invalid, -ENOMEM if
+ * it failed allocating memory for performing the operation, -EFAULT if some
+ * memory backing the range is in bad state, -EAGAIN if backoff flag turned
+ * to true.
+ *
+ * The process of memory migration is bit involve, first we must set all CPU
+ * page table entry to the special HMM locked entry ensuring us exclusive
+ * control over the page table entry (ie no other process can change the page
+ * table but us).
+ *
+ * While doing that we must handle empty and swaped entry. For empty entry we
+ * either use the zero page or allocate a new page. For swap entry we call
+ * __handle_mm_fault() to try to faultin the page (swap entry can be a number
+ * of thing).
+ *
+ * Once we have unmapped we need to check that we can effectively migrate the
+ * page, by testing that no one is holding a reference on the page beside the
+ * reference taken by each page mapping.
+ *
+ * On success every valid entry inside save_pte array is an entry that can be
+ * migrated.
+ *
+ * Note that this function does not free any of the page, nor does it updates
+ * the various memcg counter (exception being for accounting new allocation).
+ * This happen inside the mm_hmm_migrate_cleanup() function.
+ *
+ */
+int mm_hmm_migrate(struct mm_struct *mm,
+		   struct vm_area_struct *vma,
+		   pte_t *save_pte,
+		   bool *backoff,
+		   const void *mmu_notifier_exclude,
+		   unsigned long start,
+		   unsigned long end)
+{
+	pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
+	struct mmu_notifier_range range = {
+		.start = start,
+		.end = end,
+		.event = MMU_MIGRATE,
+	};
+	unsigned long addr = start, i;
+	struct mmu_gather tlb;
+	int ret = 0;
+
+	/* Only allow anonymous mapping and sanity check arguments. */
+	if (vma->vm_ops || unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)))
+		return -EINVAL;
+	start &= PAGE_MASK;
+	end = PAGE_ALIGN(end);
+	if (start >= end || end > vma->vm_end)
+		return -EINVAL;
+
+	/* Only need to test on the last address of the range. */
+	if (check_stack_guard_page(vma, end) < 0)
+		return -EFAULT;
+
+	/* Try to fail early on. */
+	if (unlikely(anon_vma_prepare(vma)))
+		return -ENOMEM;
+
+retry:
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, range.start, range.end);
+	update_hiwater_rss(mm);
+	mmu_notifier_invalidate_range_start_excluding(mm, &range,
+						      mmu_notifier_exclude);
+	tlb_start_vma(&tlb, vma);
+	for (addr = range.start, i = 0; addr < end && !ret;) {
+		unsigned long cstart, next, npages = 0;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		/*
+		 * Pretty much the exact same logic as __handle_mm_fault(),
+		 * exception being the handling of huge pmd.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_alloc(mm, pgdp, addr);
+		if (!pudp) {
+			ret = -ENOMEM;
+			break;
+		}
+		pmdp = pmd_alloc(mm, pudp, addr);
+		if (!pmdp) {
+			ret = -ENOMEM;
+			break;
+		}
+		if (unlikely(pmd_trans_splitting(*pmdp))) {
+			wait_split_huge_page(vma->anon_vma, pmdp);
+			ret = -EAGAIN;
+			break;
+		}
+		if (unlikely(pmd_bad(*pmdp))) {
+			ret = -EFAULT;
+			break;
+		}
+		if (unlikely(pmd_none(*pmdp)) &&
+		    unlikely(__pte_alloc(mm, vma, pmdp, addr))) {
+			ret = -ENOMEM;
+			break;
+		}
+		/*
+		 * If an huge pmd materialized from under us split it and break
+		 * out of the loop to retry.
+		 */
+		if (unlikely(pmd_trans_huge(*pmdp))) {
+			split_huge_page_pmd(vma, addr, pmdp);
+			ret = -EAGAIN;
+			break;
+		}
+
+		/*
+		 * A regular pmd is established and it can't morph into a huge pmd
+		 * from under us anymore at this point because we hold the mmap_sem
+		 * read mode and khugepaged takes it in write mode. So now it's
+		 * safe to run pte_offset_map().
+		 */
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (i = (addr - start) >> PAGE_SHIFT, cstart = addr,
+		     next = min((addr + PMD_SIZE) & PMD_MASK, end);
+		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
+			save_pte[i] = ptep_get_and_clear(mm, addr, ptep);
+			tlb_remove_tlb_entry(&tlb, ptep, addr);
+			set_pte_at(mm, addr, ptep, hmm_entry);
+
+			if (pte_present(save_pte[i]))
+				continue;
+
+			if (!pte_none(save_pte[i])) {
+				set_pte_at(mm, addr, ptep, save_pte[i]);
+				ret = -ENOENT;
+				ptep++;
+				break;
+			}
+			/*
+			 * TODO: This mm_forbids_zeropage() really does not
+			 * apply to us. First it seems only S390 have it set,
+			 * second we are not even using the zero page entry
+			 * to populate the CPU page table, thought on error
+			 * we might use the save_pte entry to set the CPU
+			 * page table entry.
+			 *
+			 * Live with that oddity for now.
+			 */
+			if (!mm_forbids_zeropage(mm)) {
+				pte_clear(mm, addr, &save_pte[i]);
+				npages++;
+				continue;
+			}
+			save_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
+						    vma->vm_page_prot));
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+
+		/*
+		 * So we must allocate pages before checking for error, which
+		 * here indicate that one entry is a swap entry. We need to
+		 * allocate first because otherwise there is no easy way to
+		 * know on retry or in error code path wether the CPU page
+		 * table locked HMM entry is ours or from some other thread.
+		 */
+
+		if (!npages)
+			continue;
+
+		for (next = addr, addr = cstart,
+		     i = (addr - start) >> PAGE_SHIFT;
+		     addr < next; addr += PAGE_SIZE, i++) {
+			struct mem_cgroup *memcg;
+			struct page * page;
+
+			if (pte_present(save_pte[i]) || !pte_none(save_pte[i]))
+				continue;
+
+			page = alloc_zeroed_user_highpage_movable(vma, addr);
+			if (!page) {
+				ret = -ENOMEM;
+				break;
+			}
+			__SetPageUptodate(page);
+			if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
+				page_cache_release(page);
+				ret = -ENOMEM;
+				break;
+			}
+			save_pte[i] = mk_pte(page, vma->vm_page_prot);
+			if (vma->vm_flags & VM_WRITE)
+				save_pte[i] = pte_mkwrite(save_pte[i]);
+			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			/*
+			 * Because we set the page table entry to the special
+			 * HMM locked entry we know no other process might do
+			 * anything with it and thus we can safely account the
+			 * page without holding any lock at this point.
+			 */
+			page_add_new_anon_rmap(page, vma, addr);
+			mem_cgroup_commit_charge(page, memcg, false);
+			lru_cache_add_active_or_unevictable(page, vma);
+		}
+	}
+	tlb_end_vma(&tlb, vma);
+	mmu_notifier_invalidate_range_end_excluding(mm, &range,
+						    mmu_notifier_exclude);
+	tlb_finish_mmu(&tlb, range.start, range.end);
+
+	if (backoff && *backoff) {
+		/* Stick to the range we updated. */
+		ret = -EAGAIN;
+		end = addr;
+		goto out;
+	}
+
+	/* Check if something is missing or something went wrong. */
+	if (ret == -ENOENT) {
+		int flags = FAULT_FLAG_ALLOW_RETRY;
+
+		do {
+			/*
+			 * Using __handle_mm_fault() as current->mm != mm ie we
+			 * might have been call from a kernel thread on behalf
+			 * of a driver and all accounting handle_mm_fault() is
+			 * pointless in our case.
+			 */
+			ret = __handle_mm_fault(mm, vma, addr, flags);
+			flags |= FAULT_FLAG_TRIED;
+		} while ((ret & VM_FAULT_RETRY));
+		if ((ret & VM_FAULT_ERROR)) {
+			/* Stick to the range we updated. */
+			end = addr;
+			ret = -EFAULT;
+			goto out;
+		}
+		range.start = addr;
+		goto retry;
+	}
+	if (ret == -EAGAIN) {
+		range.start = addr;
+		goto retry;
+	}
+	if (ret)
+		/* Stick to the range we updated. */
+		end = addr;
+
+	/*
+	 * At this point no one else can take a reference on the page from this
+	 * process CPU page table. So we can safely check wether we can migrate
+	 * or not the page.
+	 */
+
+out:
+	for (addr = start, i = 0; addr < end;) {
+		unsigned long next;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		/*
+		 * We know for certain that we did set special swap entry for
+		 * the range and HMM entry are mark as locked so it means that
+		 * no one beside us can modify them which apply that all level
+		 * of the CPU page table are valid.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_offset(pgdp, addr);
+		VM_BUG_ON(!pudp);
+		pmdp = pmd_offset(pudp, addr);
+		VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+			  pmd_trans_huge(*pmdp));
+
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
+		     i = (addr - start) >> PAGE_SHIFT; addr < next;
+		     addr += PAGE_SIZE, ptep++, i++) {
+			struct page *page;
+			swp_entry_t entry;
+			int swapped;
+
+			entry = pte_to_swp_entry(save_pte[i]);
+			if (is_hmm_entry(entry)) {
+				/*
+				 * Logic here is pretty involve. If save_pte is
+				 * an HMM special swap entry then it means that
+				 * we failed to swap in that page so error must
+				 * be set.
+				 *
+				 * If that's not the case than it means we are
+				 * seriously screw.
+				 */
+				VM_BUG_ON(!ret);
+				continue;
+			}
+
+			/*
+			 * This can not happen, no one else can replace our
+			 * special entry and as range end is re-ajusted on
+			 * error.
+			 */
+			entry = pte_to_swp_entry(*ptep);
+			VM_BUG_ON(!is_hmm_entry_locked(entry));
+
+			/* On error or backoff restore all the saved pte. */
+			if (ret)
+				goto restore;
+
+			page = vm_normal_page(vma, addr, save_pte[i]);
+			/* The zero page is fine to migrate. */
+			if (!page)
+				continue;
+
+			/*
+			 * Check that only CPU mapping hold a reference on the
+			 * page. To make thing simpler we just refuse bail out
+			 * if page_mapcount() != page_count() (also accounting
+			 * for swap cache).
+			 *
+			 * There is a small window here where wp_page_copy()
+			 * might have decremented mapcount but have not yet
+			 * decremented the page count. This is not an issue as
+			 * we backoff in that case.
+			 */
+			swapped = PageSwapCache(page);
+			if (page_mapcount(page) + swapped == page_count(page))
+				continue;
+
+restore:
+			/* Ok we have to restore that page. */
+			set_pte_at(mm, addr, ptep, save_pte[i]);
+			/*
+			 * No need to invalidate - it was non-present
+			 * before.
+			 */
+			update_mmu_cache(vma, addr, ptep);
+			pte_clear(mm, addr, &save_pte[i]);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(mm_hmm_migrate);
+
+/* mm_hmm_migrate_cleanup() - unmap range cleanup.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @save_pte: Array where to save current CPU page table entry value.
+ * @hmm_pte: Array of HMM table entry indicating if migration was successfull.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This is call after mm_hmm_migrate() and after effective migration. It will
+ * restore CPU page table entry for page that not been migrated or in case of
+ * failure.
+ *
+ * It will free pages that have been migrated and updates appropriate counters,
+ * it will also "unlock" special HMM pte entry.
+ */
+void mm_hmm_migrate_cleanup(struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    pte_t *save_pte,
+			    dma_addr_t *hmm_pte,
+			    unsigned long start,
+			    unsigned long end)
+{
+	pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry());
+	struct page *pages[MMU_GATHER_BUNDLE];
+	unsigned long addr, c, i;
+
+	for (addr = start, i = 0; addr < end;) {
+		unsigned long next;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		/*
+		 * We know for certain that we did set special swap entry for
+		 * the range and HMM entry are mark as locked so it means that
+		 * no one beside us can modify them which apply that all level
+		 * of the CPU page table are valid.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_offset(pgdp, addr);
+		VM_BUG_ON(!pudp);
+		pmdp = pmd_offset(pudp, addr);
+		VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+			  pmd_trans_huge(*pmdp));
+
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
+		     i = (addr - start) >> PAGE_SHIFT; addr < next;
+		     addr += PAGE_SIZE, ptep++, i++) {
+			struct page *page;
+			swp_entry_t entry;
+
+			/*
+			 * This can't happen no one else can replace our
+			 * precious special entry.
+			 */
+			entry = pte_to_swp_entry(*ptep);
+			VM_BUG_ON(!is_hmm_entry_locked(entry));
+
+			if (!hmm_pte_test_valid_dev(&hmm_pte[i])) {
+				/* Ok we have to restore that page. */
+				set_pte_at(mm, addr, ptep, save_pte[i]);
+				/*
+				 * No need to invalidate - it was non-present
+				 * before.
+				 */
+				update_mmu_cache(vma, addr, ptep);
+				pte_clear(mm, addr, &save_pte[i]);
+				continue;
+			}
+
+			/* Set unlocked entry. */
+			set_pte_at(mm, addr, ptep, hmm_entry);
+			/*
+			 * No need to invalidate - it was non-present
+			 * before.
+			 */
+			update_mmu_cache(vma, addr, ptep);
+
+			page = vm_normal_page(vma, addr, save_pte[i]);
+			/* The zero page is fine to migrate. */
+			if (!page)
+				continue;
+
+			page_remove_rmap(page);
+			dec_mm_counter_fast(mm, MM_ANONPAGES);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+	}
+
+	/* Free pages. */
+	for (addr = start, i = 0, c = 0; addr < end; i++, addr += PAGE_SIZE) {
+		if (pte_none(save_pte[i]))
+			continue;
+		if (c >= MMU_GATHER_BUNDLE) {
+			/*
+			 * TODO: What we really want to do is keep the memory
+			 * accounted inside the memory group and inside rss
+			 * while still freeing the page. So that migration
+			 * back from device memory will not fail because we
+			 * go over memory group limit.
+			 */
+			free_pages_and_swap_cache(pages, c);
+			c = 0;
+		}
+		pages[c] = vm_normal_page(vma, addr, save_pte[i]);
+		c = pages[c] ? c + 1 : c;
+	}
+}
+EXPORT_SYMBOL(mm_hmm_migrate_cleanup);
 #endif
 
 
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 22/36] HMM: add new callback for copying memory from and to device memory.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
  2015-05-21 20:22   ` [PATCH 21/36] HMM: mm add helper to update page table when migrating memory jglisse
@ 2015-05-21 20:22   ` jglisse
  2015-05-21 20:22   ` [PATCH 23/36] HMM: allow to get pointer to spinlock protecting a directory jglisse
                     ` (13 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:22 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jerome Glisse, Jatin Kumar

From: Jerome Glisse <jglisse@redhat.com>

This patch only adds the new callback device driver must implement
to copy memory from and to device memory.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/hmm.c            |   2 +
 2 files changed, 105 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f243eb5..eb30418 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -66,6 +66,8 @@ enum hmm_etype {
 	HMM_DEVICE_RFAULT,
 	HMM_DEVICE_WFAULT,
 	HMM_WRITE_PROTECT,
+	HMM_COPY_FROM_DEVICE,
+	HMM_COPY_TO_DEVICE,
 };
 
 /* struct hmm_event - memory event information.
@@ -157,6 +159,107 @@ struct hmm_device_ops {
 	 * All other return value trigger warning and are transformed to -EIO.
 	 */
 	int (*update)(struct hmm_mirror *mirror,const struct hmm_event *event);
+
+	/* copy_from_device() - copy from device memory to system memory.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @event: The event that triggered the copy.
+	 * @dst: Array containing hmm_pte of destination memory.
+	 * @start: Start address of the range (sub-range of event) to copy.
+	 * @end: End address of the range (sub-range of event) to copy.
+	 * Returns: 0 on success, error code otherwise {-ENOMEM, -EIO}.
+	 *
+	 * Called when migrating memory from device memory to system memory.
+	 * The dst array contains valid DMA address for the device of the page
+	 * to copy to (or pfn of page if hmm_device.device == NULL).
+	 *
+	 * If event.etype == HMM_FORK then device driver only need to schedule
+	 * a copy to the system pages given in the dst hmm_pte array. Do not
+	 * update the device page, and do not pause/stop the device threads
+	 * that are using this address space. Just copy memory.
+	 *
+	 * If event.type == HMM_COPY_FROM_DEVICE then device driver must first
+	 * write protect the range then schedule the copy, then update its page
+	 * table to use the new system memory given the dst array. Some device
+	 * can perform all this in an atomic fashion from device point of view.
+	 * The device driver must also free the device memory once the copy is
+	 * done.
+	 *
+	 * Device driver must not fail lightly, any failure result in device
+	 * process being kill and CPU page table set to HWPOISON entry.
+	 *
+	 * Note that device driver must clear the valid bit of the dst entry it
+	 * failed to copy.
+	 *
+	 * On failure the mirror will be kill by HMM which will do a HMM_MUNMAP
+	 * invalidation of all the memory when this happen the device driver
+	 * can free the device memory.
+	 *
+	 * Note also that there can be hole in the range being copied ie some
+	 * entry of dst array will not have the valid bit set, device driver
+	 * must simply ignore non valid entry.
+	 *
+	 * Finaly device driver must set the dirty bit for each page that was
+	 * modified since it was copied inside the device memory. This must be
+	 * conservative ie if device can not determine that with certainty then
+	 * it must set the dirty bit unconditionally.
+	 *
+	 * Return 0 on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	int (*copy_from_device)(struct hmm_mirror *mirror,
+				const struct hmm_event *event,
+				dma_addr_t *dst,
+				unsigned long start,
+				unsigned long end);
+
+	/* copy_to_device() - copy to device memory from system memory.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @event: The event that triggered the copy.
+	 * @dst: Array containing hmm_pte of destination memory.
+	 * @start: Start address of the range (sub-range of event) to copy.
+	 * @end: End address of the range (sub-range of event) to copy.
+	 * Returns: 0 on success, error code otherwise {-ENOMEM, -EIO}.
+	 *
+	 * Called when migrating memory from system memory to device memory.
+	 * The dst array is empty, all of its entry are equal to zero. Device
+	 * driver must allocate the device memory and populate each entry using
+	 * hmm_pte_from_device_pfn() only the valid device bit and hardware
+	 * specific bit will be preserve (write and dirty will be taken from
+	 * the original entry inside the mirror page table). It is advice to
+	 * set the device pfn to match the physical address of device memory
+	 * being use. The event.etype will be equals to HMM_COPY_TO_DEVICE.
+	 *
+	 * Device driver that can atomically copy a page and update its page
+	 * table entry to point to the device memory can do that. Partial
+	 * failure is allowed, entry that have not been migrated must have
+	 * the HMM_PTE_VALID_DEV bit clear inside the dst array. HMM will
+	 * update the CPU page table of failed entry to point back to the
+	 * system page.
+	 *
+	 * Note that device driver is responsible for allocating and freeing
+	 * the device memory and properly updating to dst array entry with
+	 * the allocated device memory.
+	 *
+	 * Return 0 on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 * Errors means that the migration is aborted. So in case of partial
+	 * failure if device do not want to fully abort it must return 0.
+	 * Device driver can update device page table only if it knows it will
+	 * not return failure.
+	 */
+	int (*copy_to_device)(struct hmm_mirror *mirror,
+			      const struct hmm_event *event,
+			      dma_addr_t *dst,
+			      unsigned long start,
+			      unsigned long end);
 };
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index e4585b7..9dbb1e43 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -87,6 +87,8 @@ static inline int hmm_event_init(struct hmm_event *event,
 	case HMM_ISDIRTY:
 	case HMM_DEVICE_RFAULT:
 	case HMM_DEVICE_WFAULT:
+	case HMM_COPY_TO_DEVICE:
+	case HMM_COPY_FROM_DEVICE:
 		break;
 	case HMM_FORK:
 	case HMM_WRITE_PROTECT:
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 23/36] HMM: allow to get pointer to spinlock protecting a directory.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
  2015-05-21 20:22   ` [PATCH 21/36] HMM: mm add helper to update page table when migrating memory jglisse
  2015-05-21 20:22   ` [PATCH 22/36] HMM: add new callback for copying memory from and to device memory jglisse
@ 2015-05-21 20:22   ` jglisse
  2015-05-21 20:23   ` [PATCH 24/36] HMM: split DMA mapping function in two jglisse
                     ` (12 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:22 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

Several use case for getting pointer to spinlock protecting a directory.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/hmm_pt.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 36f7e00..27668a8 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -255,6 +255,16 @@ static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
 		spin_lock(&pt->lock);
 }
 
+static inline spinlock_t *hmm_pt_directory_lock_ptr(struct hmm_pt *pt,
+						    struct page *ptd,
+						    unsigned level)
+{
+	if (level)
+		return &ptd->ptl;
+	else
+		return &pt->lock;
+}
+
 static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
 					   struct page *ptd,
 					   unsigned level)
@@ -272,6 +282,13 @@ static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
 	spin_lock(&pt->lock);
 }
 
+static inline spinlock_t *hmm_pt_directory_lock_ptr(struct hmm_pt *pt,
+						    struct page *ptd,
+						    unsigned level)
+{
+	return &pt->lock;
+}
+
 static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
 					   struct page *ptd,
 					   unsigned level)
@@ -397,6 +414,13 @@ static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter,
 	hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
 }
 
+static inline spinlock_t *hmm_pt_iter_directory_lock_ptr(struct hmm_pt_iter *iter,
+							 struct hmm_pt *pt)
+{
+	return hmm_pt_directory_lock_ptr(pt, iter->ptd[pt->llevel - 1],
+					 pt->llevel);
+}
+
 static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter,
 						struct hmm_pt *pt)
 {
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 24/36] HMM: split DMA mapping function in two.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (2 preceding siblings ...)
  2015-05-21 20:22   ` [PATCH 23/36] HMM: allow to get pointer to spinlock protecting a directory jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 25/36] HMM: add helpers for migration back to system memory jglisse
                     ` (11 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

To be able to reuse the DMA mapping logic, split it in two functions.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 125 +++++++++++++++++++++++++++++++++------------------------------
 1 file changed, 66 insertions(+), 59 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 9dbb1e43..b8807b2 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -853,82 +853,89 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 	return ret;
 }
 
+static int hmm_mirror_dma_map_range(struct hmm_mirror *mirror,
+				    dma_addr_t *hmm_pte,
+				    spinlock_t *lock,
+				    unsigned long npages)
+{
+	struct device *dev = mirror->device->dev;
+	unsigned long i;
+	int ret = 0;
+
+	for (i = 0; i < npages; i++) {
+		dma_addr_t dma_addr, pte;
+		struct page *page;
+
+again:
+		pte = ACCESS_ONCE(hmm_pte[i]);
+		if (!hmm_pte_test_valid_pfn(&pte) || !hmm_pte_test_select(&pte))
+			continue;
+
+		page = pfn_to_page(hmm_pte_pfn(pte));
+		VM_BUG_ON(!page);
+		dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+					DMA_BIDIRECTIONAL);
+		if (dma_mapping_error(dev, dma_addr)) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		/*
+		 * Make sure we transfer the dirty bit. Note that there
+		 * might still be a window for another thread to set
+		 * the dirty bit before we check for pte equality. This
+		 * will just lead to a useless retry so it is not the
+		 * end of the world here.
+		 */
+		if (lock)
+			spin_lock(lock);
+		if (hmm_pte_test_dirty(&hmm_pte[i]))
+			hmm_pte_set_dirty(&pte);
+		if (ACCESS_ONCE(hmm_pte[i]) != pte) {
+				if (lock)
+					spin_unlock(lock);
+				dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+					       DMA_BIDIRECTIONAL);
+				if (hmm_pte_test_valid_pfn(&hmm_pte[i]))
+					goto again;
+				continue;
+		}
+		hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
+		if (hmm_pte_test_write(&pte))
+			hmm_pte_set_write(&hmm_pte[i]);
+		if (hmm_pte_test_dirty(&pte))
+			hmm_pte_set_dirty(&hmm_pte[i]);
+		if (lock)
+			spin_unlock(lock);
+	}
+
+	return ret;
+}
 
 static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
 			      struct hmm_pt_iter *iter,
 			      unsigned long start,
 			      unsigned long end)
 {
-	struct device *dev = mirror->device->dev;
 	unsigned long addr;
 	int ret;
 
 	for (ret = 0, addr = start; !ret && addr < end;) {
-		unsigned long i = 0, hmm_end, next;
+		unsigned long next, npages;
 		dma_addr_t *hmm_pte;
+		spinlock_t *lock;
 
 		hmm_pte = hmm_pt_iter_fault(iter, &mirror->pt, addr);
 		if (!hmm_pte)
 			return -ENOENT;
 
-		hmm_end = hmm_pt_level_next(&mirror->pt, addr, end,
-					    mirror->pt.llevel - 1);
-		do {
-			dma_addr_t dma_addr, pte;
-			struct page *page;
-
-			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
-						 mirror->pt.llevel);
-
-again:
-			pte = ACCESS_ONCE(hmm_pte[i]);
-			if (!hmm_pte_test_valid_pfn(&pte) ||
-			    !hmm_pte_test_select(&pte)) {
-				if (!hmm_pte_test_valid_dma(&pte)) {
-					ret = -ENOENT;
-					break;
-				}
-				continue;
-			}
-
-			page = pfn_to_page(hmm_pte_pfn(pte));
-			VM_BUG_ON(!page);
-			dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
-						DMA_BIDIRECTIONAL);
-			if (dma_mapping_error(dev, dma_addr)) {
-				ret = -ENOMEM;
-				break;
-			}
+		next = hmm_pt_level_next(&mirror->pt, addr, end,
+					 mirror->pt.llevel - 1);
 
-			hmm_pt_iter_directory_lock(iter, &mirror->pt);
-			/*
-			 * Make sure we transfer the dirty bit. Note that there
-			 * might still be a window for another thread to set
-			 * the dirty bit before we check for pte equality. This
-			 * will just lead to a useless retry so it is not the
-			 * end of the world here.
-			 */
-			if (hmm_pte_test_dirty(&hmm_pte[i]))
-				hmm_pte_set_dirty(&pte);
-			if (ACCESS_ONCE(hmm_pte[i]) != pte) {
-				hmm_pt_iter_directory_unlock(iter,&mirror->pt);
-				dma_unmap_page(dev, dma_addr, PAGE_SIZE,
-					       DMA_BIDIRECTIONAL);
-				if (hmm_pte_test_valid_pfn(&pte))
-					goto again;
-				if (!hmm_pte_test_valid_dma(&pte)) {
-					ret = -ENOENT;
-					break;
-				}
-			} else {
-				hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
-				if (hmm_pte_test_write(&pte))
-					hmm_pte_set_write(&hmm_pte[i]);
-				if (hmm_pte_test_dirty(&pte))
-					hmm_pte_set_dirty(&hmm_pte[i]);
-				hmm_pt_iter_directory_unlock(iter, &mirror->pt);
-			}
-		} while (addr = next, i++, addr != hmm_end && !ret);
+		npages = (next - addr) >> PAGE_SHIFT;
+		lock = hmm_pt_iter_directory_lock_ptr(iter, &mirror->pt);
+		ret = hmm_mirror_dma_map_range(mirror, hmm_pte, lock, npages);
+		addr = next;
 	}
 
 	return ret;
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 25/36] HMM: add helpers for migration back to system memory.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (3 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 24/36] HMM: split DMA mapping function in two jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 26/36] HMM: fork copy migrated memory into system memory for child process jglisse
                     ` (10 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This patch add all necessary functions and helpers for migration
from device memory back to system memory. They are 3 differents
case that would use that code :
  - CPU page fault
  - fork
  - device driver request

Note that this patch use regular memory accounting this means that
migration can fail as a result of memory cgroup resource exhaustion.
Latter patches will modify memcg to allow to keep remote memory
accounted as regular memory thus removing this point of failure.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 mm/hmm.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 157 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index b8807b2..1208f64 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -50,6 +50,12 @@ static struct mmu_notifier_ops hmm_notifier_ops;
 static inline struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
 static inline void hmm_mirror_unref(struct hmm_mirror **mirror);
 static void hmm_mirror_kill(struct hmm_mirror *mirror);
+static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   pte_t *new_pte,
+				   dma_addr_t *dst,
+				   unsigned long start,
+				   unsigned long end);
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
 				    struct hmm_event *event,
 				    struct page *page);
@@ -425,6 +431,46 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
 };
 
 
+static int hmm_migrate_back(struct hmm *hmm,
+			    struct hmm_event *event,
+			    struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    pte_t *new_pte,
+			    dma_addr_t *dst,
+			    unsigned long start,
+			    unsigned long end)
+{
+	struct hmm_mirror *mirror;
+	int r, ret;
+
+	/*
+	 * Do not return right away on error, as there might be valid page we
+	 * can migrate.
+	 */
+	ret = mm_hmm_migrate_back(mm, vma, new_pte, start, end);
+
+again:
+	down_read(&hmm->rwsem);
+	hlist_for_each_entry(mirror, &hmm->mirrors, mlist) {
+		r = hmm_mirror_migrate_back(mirror, event, new_pte,
+					    dst, start, end);
+		if (r) {
+			ret = ret ? ret : r;
+			mirror = hmm_mirror_ref(mirror);
+			BUG_ON(!mirror);
+			up_read(&hmm->rwsem);
+			hmm_mirror_kill(mirror);
+			hmm_mirror_unref(&mirror);
+			goto again;
+		}
+	}
+	up_read(&hmm->rwsem);
+
+	mm_hmm_migrate_back_cleanup(mm, vma, new_pte, dst, start, end);
+
+	return ret;
+}
+
 int hmm_handle_cpu_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			pmd_t *pmdp, unsigned long addr,
@@ -1085,6 +1131,117 @@ out:
 }
 EXPORT_SYMBOL(hmm_mirror_fault);
 
+static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   pte_t *new_pte,
+				   dma_addr_t *dst,
+				   unsigned long start,
+				   unsigned long end)
+{
+	unsigned long addr, i, npages = (end - start) >> PAGE_SHIFT;
+	struct hmm_device *device = mirror->device;
+	struct device *dev = mirror->device->dev;
+	struct hmm_pt_iter iter;
+	int r, ret = 0;
+
+	hmm_pt_iter_init(&iter);
+	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, ++i) {
+		dma_addr_t *hmm_pte;
+
+		hmm_pte_clear_select(&dst[i]);
+
+		if (!pte_present(new_pte[i]))
+			continue;
+		hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		if (!hmm_pte)
+			continue;
+
+		if (!hmm_pte_test_valid_dev(hmm_pte))
+			continue;
+
+		dst[i] = hmm_pte_from_pfn(pte_pfn(new_pte[i]));
+		hmm_pte_set_select(&dst[i]);
+		hmm_pte_set_write(&dst[i]);
+	}
+
+	if (device->dev) {
+		ret = hmm_mirror_dma_map_range(mirror, dst, NULL, npages);
+		if (ret) {
+			for (i = 0; i < npages; ++i) {
+				if (!hmm_pte_test_select(&dst[i]))
+					continue;
+				if (hmm_pte_test_valid_dma(&dst[i]))
+					continue;
+				dst[i] = 0;
+			}
+		}
+	}
+
+	r = device->ops->copy_from_device(mirror, event, dst, start, end);
+
+	/* Update mirror page table with successfully migrated entry. */
+	for (addr = start; addr < end;) {
+		unsigned long idx, next, npages;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		if (!hmm_pte) {
+			addr = hmm_pt_iter_next(&iter, &mirror->pt,
+						addr, end);
+			continue;
+		}
+
+		next = hmm_pt_level_next(&mirror->pt, addr, end,
+					 mirror->pt.llevel - 1);
+
+		idx = (addr - event->start) >> PAGE_SHIFT;
+		npages = (next - addr) >> PAGE_SHIFT;
+		hmm_pt_iter_directory_lock(&iter, &mirror->pt);
+		for (i = 0; i < npages; i++, idx++) {
+			if (!hmm_pte_test_valid_pfn(&dst[idx]) &&
+			    !hmm_pte_test_valid_dma(&dst[idx])) {
+				if (hmm_pte_test_valid_dev(&hmm_pte[i])) {
+					hmm_pte[i] = 0;
+					hmm_pt_iter_directory_unref(&iter,
+							mirror->pt.llevel);
+				}
+				continue;
+			}
+
+			VM_BUG_ON(!hmm_pte_test_select(&dst[idx]));
+			VM_BUG_ON(!hmm_pte_test_valid_dev(&hmm_pte[i]));
+			hmm_pte[i] = dst[idx];
+		}
+		hmm_pt_iter_directory_unlock(&iter, &mirror->pt);
+
+		/* DMA unmap failed migrate entry. */
+		if (dev) {
+			idx = (addr - event->start) >> PAGE_SHIFT;
+			for (i = 0; i < npages; i++, idx++) {
+				dma_addr_t dma_addr;
+
+				/*
+				 * Failed entry have the valid bit clear but
+				 * the select bit remain intact.
+				 */
+				if (!hmm_pte_test_select(&dst[idx]) &&
+				    !hmm_pte_test_valid_dma(&dst[i]))
+					continue;
+
+				hmm_pte_set_valid_dma(&dst[idx]);
+				dma_addr = hmm_pte_dma_addr(*hmm_pte);
+				dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+					       DMA_BIDIRECTIONAL);
+			}
+		}
+
+		addr = next;
+	}
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+
+	return ret ? ret : r;
+}
+
 /* hmm_mirror_range_discard() - discard a range of address.
  *
  * @mirror: The mirror struct.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 26/36] HMM: fork copy migrated memory into system memory for child process.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (4 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 25/36] HMM: add helpers for migration back to system memory jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 27/36] HMM: CPU page fault on migrated memory jglisse
                     ` (9 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

When forking if process being fork had any memory migrated to some
device memory, we need to make a system copy for the child process.
Latter patches can revisit this and use the same COW semantic for
device memory.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 1208f64..143c6ab 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -487,7 +487,37 @@ int hmm_mm_fork(struct mm_struct *src_mm,
 		unsigned long start,
 		unsigned long end)
 {
-	return -ENOMEM;
+	unsigned long npages = (end - start) >> PAGE_SHIFT;
+	struct hmm_event event;
+	dma_addr_t *dst;
+	struct hmm *hmm;
+	pte_t *new_pte;
+	int ret;
+
+	hmm = hmm_ref(src_mm->hmm);
+	if (!hmm)
+		return -EINVAL;
+
+
+	dst = kzalloc(npages * sizeof(*dst), GFP_KERNEL);
+	if (!dst) {
+		hmm_unref(hmm);
+		return -ENOMEM;
+	}
+	new_pte = kzalloc(npages * sizeof(*new_pte), GFP_KERNEL);
+	if (!new_pte) {
+		kfree(dst);
+		hmm_unref(hmm);
+		return -ENOMEM;
+	}
+
+	hmm_event_init(&event, hmm, start, end, HMM_FORK);
+	ret = hmm_migrate_back(hmm, &event, dst_mm, dst_vma, new_pte,
+			       dst, start, end);
+	hmm_unref(hmm);
+	kfree(new_pte);
+	kfree(dst);
+	return ret;
 }
 EXPORT_SYMBOL(hmm_mm_fork);
 
@@ -662,6 +692,12 @@ static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
 	}
 
 	if (hmm_pte_test_valid_dev(hmm_pte)) {
+		/*
+		 * On fork device memory is duplicated so no need to write
+		 * protect it.
+		 */
+		if (event->etype == HMM_FORK)
+			return;
 		*hmm_pte &= event->pte_mask;
 		if (!hmm_pte_test_valid_dev(hmm_pte))
 			hmm_pt_iter_directory_unref(iter, mirror->pt.llevel);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 27/36] HMM: CPU page fault on migrated memory.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (5 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 26/36] HMM: fork copy migrated memory into system memory for child process jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 28/36] HMM: add mirror fault support for system to device memory migration jglisse
                     ` (8 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

When CPU try to access memory that have been migrated to device memory
we have to copy it back to system memory. This patch implement the CPU
page fault handler for special HMM pte swap entry.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 143c6ab..1a7554d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -476,7 +476,59 @@ int hmm_handle_cpu_fault(struct mm_struct *mm,
 			pmd_t *pmdp, unsigned long addr,
 			unsigned flags, pte_t orig_pte)
 {
-	return VM_FAULT_SIGBUS;
+	unsigned long start, end;
+	struct hmm_event event;
+	swp_entry_t entry;
+	struct hmm *hmm;
+	dma_addr_t dst;
+	pte_t new_pte;
+	int ret;
+
+	/* First check for poisonous entry. */
+	entry = pte_to_swp_entry(orig_pte);
+	if (is_hmm_entry_poisonous(entry))
+		return VM_FAULT_SIGBUS;
+
+	hmm = hmm_ref(mm->hmm);
+	if (!hmm) {
+		pte_t poison = swp_entry_to_pte(make_hmm_entry_poisonous());
+		spinlock_t *ptl;
+		pte_t *ptep;
+
+		/* Check if cpu pte is already updated. */
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		if (!pte_same(*ptep, orig_pte)) {
+			pte_unmap_unlock(ptep, ptl);
+			return 0;
+		}
+		set_pte_at(mm, addr, ptep, poison);
+		pte_unmap_unlock(ptep, ptl);
+		return VM_FAULT_SIGBUS;
+	}
+
+	/*
+	 * TODO we likely want to migrate more then one page at a time, we need
+	 * to call into the device driver to get good hint on the range to copy
+	 * back to system memory.
+	 *
+	 * For now just live with the one page at a time solution.
+	 */
+	start = addr & PAGE_MASK;
+	end = start + PAGE_SIZE;
+	hmm_event_init(&event, hmm, start, end, HMM_COPY_FROM_DEVICE);
+
+	ret = hmm_migrate_back(hmm, &event, mm, vma, &new_pte,
+			       &dst, start, end);
+	hmm_unref(hmm);
+	switch (ret) {
+	case 0:
+		return VM_FAULT_MAJOR;
+	case -ENOMEM:
+		return VM_FAULT_OOM;
+	case -EINVAL:
+	default:
+		return VM_FAULT_SIGBUS;
+	}
 }
 EXPORT_SYMBOL(hmm_handle_cpu_fault);
 
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 28/36] HMM: add mirror fault support for system to device memory migration.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (6 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 27/36] HMM: CPU page fault on migrated memory jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 29/36] IB/mlx5: add a new paramter to __mlx_ib_populated_pas for ODP with HMM jglisse
                     ` (7 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar

From: JA(C)rA'me Glisse <jglisse@redhat.com>

Migration to device memory is done as a special kind of device mirror
fault. Memory migration being initiated by device driver and never by
HMM (unless it is a migration back to system memory).

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 mm/hmm.c | 181 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 181 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index 1a7554d..7c044f0 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -56,6 +56,10 @@ static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
 				   dma_addr_t *dst,
 				   unsigned long start,
 				   unsigned long end);
+static int hmm_mirror_migrate(struct hmm_mirror *mirror,
+			      struct hmm_event *event,
+			      struct vm_area_struct *vma,
+			      struct hmm_pt_iter *iter);
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
 				    struct hmm_event *event,
 				    struct page *page);
@@ -110,6 +114,12 @@ static inline int hmm_event_init(struct hmm_event *event,
 	return 0;
 }
 
+static inline unsigned long hmm_event_npages(const struct hmm_event *event)
+{
+	return (PAGE_ALIGN(event->end) - (event->start & PAGE_MASK)) >>
+	       PAGE_SHIFT;
+}
+
 
 /* hmm - core HMM functions.
  *
@@ -1198,6 +1208,9 @@ retry:
 	}
 
 	switch (event->etype) {
+	case HMM_COPY_TO_DEVICE:
+		ret = hmm_mirror_migrate(mirror, event, vma, &iter);
+		break;
 	case HMM_DEVICE_RFAULT:
 	case HMM_DEVICE_WFAULT:
 		ret = hmm_mirror_handle_fault(mirror, event, vma, &iter);
@@ -1330,6 +1343,174 @@ static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
 	return ret ? ret : r;
 }
 
+static int hmm_mirror_migrate(struct hmm_mirror *mirror,
+			      struct hmm_event *event,
+			      struct vm_area_struct *vma,
+			      struct hmm_pt_iter *iter)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm *hmm = mirror->hmm;
+	struct hmm_event invalidate;
+	unsigned long addr, npages;
+	struct hmm_mirror *tmp;
+	dma_addr_t *dst;
+	pte_t *save_pte;
+	int r = 0, ret;
+
+	/* Only allow migration of private anonymous memory. */
+	if (vma->vm_ops || unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)))
+		return -EINVAL;
+
+	/*
+	 * TODO More advance loop for splitting migration into several chunk.
+	 * For now limit the amount that can be migrated in one shot. Also we
+	 * would need to see if we need rescheduling if this is happening as
+	 * part of system call to the device driver.
+	 */
+	npages = hmm_event_npages(event);
+	if (npages * max(sizeof(*dst), sizeof(*save_pte)) > PAGE_SIZE)
+		return -EINVAL;
+	dst = kzalloc(npages * sizeof(*dst), GFP_KERNEL);
+	if (dst == NULL)
+		return -ENOMEM;
+	save_pte = kzalloc(npages * sizeof(*save_pte), GFP_KERNEL);
+	if (save_pte == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = mm_hmm_migrate(hmm->mm, vma, save_pte, &event->backoff,
+			     &hmm->mmu_notifier, event->start, event->end);
+	if (ret == -EAGAIN)
+		goto out;
+	if (ret)
+		goto out_cleanup;
+
+	/*
+	 * Now invalidate for all other device, note that they can not race
+	 * with us as the CPU page table is full of special entry.
+	 */
+	hmm_event_init(&invalidate, mirror->hmm, event->start,
+		       event->end, HMM_MIGRATE);
+again:
+	down_read(&hmm->rwsem);
+	hlist_for_each_entry(tmp, &hmm->mirrors, mlist) {
+		if (tmp == mirror)
+			continue;
+		if (hmm_mirror_update(tmp, &invalidate, NULL)) {
+			hmm_mirror_ref(tmp);
+			up_read(&hmm->rwsem);
+			hmm_mirror_kill(tmp);
+			hmm_mirror_unref(&tmp);
+			goto again;
+		}
+	}
+	up_read(&hmm->rwsem);
+
+	/*
+	 * Populate the mirror page table with saved entry and also mark entry
+	 * that can be migrated.
+	 */
+	for (addr = event->start; addr < event->end;) {
+		unsigned long i, idx, next, npages;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_fault(iter, &mirror->pt, addr);
+		if (!hmm_pte) {
+			ret = -ENOMEM;
+			goto out_cleanup;
+		}
+
+		next = hmm_pt_level_next(&mirror->pt, addr, event->end,
+					 mirror->pt.llevel - 1);
+
+		npages = (next - addr) >> PAGE_SHIFT;
+		idx = (addr - event->start) >> PAGE_SHIFT;
+		hmm_pt_iter_directory_lock(iter, &mirror->pt);
+		for (i = 0; i < npages; i++, idx++) {
+			hmm_pte_clear_select(&hmm_pte[i]);
+			if (!pte_present(save_pte[idx]))
+				continue;
+			hmm_pte_set_select(&hmm_pte[i]);
+			/* This can not be a valid device entry here. */
+			VM_BUG_ON(hmm_pte_test_valid_dev(&hmm_pte[i]));
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
+			if (hmm_pte_test_valid_pfn(&hmm_pte[i]))
+				continue;
+
+			hmm_pt_iter_directory_ref(iter, mirror->pt.llevel);
+			hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(save_pte[idx]));
+			if (pte_write(save_pte[idx]))
+				hmm_pte_set_write(&hmm_pte[i]);
+			hmm_pte_set_select(&hmm_pte[i]);
+		}
+		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
+
+		if (device->dev) {
+			spinlock_t *lock;
+
+			lock = hmm_pt_iter_directory_lock_ptr(iter,
+							      &mirror->pt);
+			ret = hmm_mirror_dma_map_range(mirror, hmm_pte,
+						       lock, npages);
+			/* Keep going only for entry that have been mapped. */
+			if (ret) {
+				for (i = 0; i < npages; ++i) {
+					if (!hmm_pte_test_select(&dst[i]))
+						continue;
+					if (!hmm_pte_test_valid_dma(&dst[i]))
+						continue;
+					hmm_pte_clear_select(&hmm_pte[i]);
+				}
+			}
+		}
+		addr = next;
+	}
+
+	/* Now Waldo we can do the copy. */
+	r = device->ops->copy_to_device(mirror, event, dst,
+					event->start, event->end);
+
+	/* Update mirror page table with successfully migrated entry. */
+	for (addr = event->start; addr < event->end;) {
+		unsigned long i, idx, next, npages;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_update(iter, &mirror->pt, addr);
+		if (!hmm_pte) {
+			addr = hmm_pt_iter_next(iter, &mirror->pt,
+						addr, event->end);
+			continue;
+		}
+
+		next = hmm_pt_level_next(&mirror->pt, addr, event->end,
+					 mirror->pt.llevel - 1);
+
+		npages = (next - addr) >> PAGE_SHIFT;
+		idx = (addr - event->start) >> PAGE_SHIFT;
+		hmm_pt_iter_directory_lock(iter, &mirror->pt);
+		for (i = 0; i < npages; i++, idx++) {
+			if (!hmm_pte_test_valid_dev(&dst[idx]))
+				continue;
+
+			VM_BUG_ON(!hmm_pte_test_select(&hmm_pte[i]));
+			hmm_pte[i] = dst[idx];
+		}
+		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
+		addr = next;
+	}
+
+out_cleanup:
+	mm_hmm_migrate_cleanup(hmm->mm, vma, save_pte, dst,
+			       event->start, event->end);
+out:
+	kfree(save_pte);
+	kfree(dst);
+	return ret ? ret : r;
+}
+
 /* hmm_mirror_range_discard() - discard a range of address.
  *
  * @mirror: The mirror struct.
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 29/36] IB/mlx5: add a new paramter to __mlx_ib_populated_pas for ODP with HMM.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (7 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 28/36] HMM: add mirror fault support for system to device memory migration jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 30/36] IB/mlx5: add a new paramter to mlx5_ib_update_mtt() " jglisse
                     ` (6 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

From: JA(C)rA'me Glisse <jglisse@redhat.com>

When using HMM for ODP it will be usefull to pass the current mirror
page table iterator for __mlx_ib_populated_pas() function benefit. Add
void parameter for this.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 drivers/infiniband/hw/mlx5/mem.c     | 8 +++++---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +-
 drivers/infiniband/hw/mlx5/mr.c      | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 40df2cc..df56b7d 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -145,11 +145,13 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
  * num_pages - total number of pages to fill
  * pas - bus addresses array to fill
  * access_flags - access flags to set on all present pages.
-		  use enum mlx5_ib_mtt_access_flags for this.
+ *                use enum mlx5_ib_mtt_access_flags for this.
+ * data - intended for odp with hmm, it should point to current mirror page
+ *        table iterator.
  */
 void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 			    int page_shift, size_t offset, size_t num_pages,
-			    __be64 *pas, int access_flags)
+			    __be64 *pas, int access_flags, void *data)
 {
 	unsigned long umem_page_shift = ilog2(umem->page_size);
 	int shift = page_shift - umem_page_shift;
@@ -201,7 +203,7 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 {
 	return __mlx5_ib_populate_pas(dev, umem, page_shift, 0,
 				      ib_umem_num_pages(umem), pas,
-				      access_flags);
+				      access_flags, NULL);
 }
 int mlx5_ib_get_buf_offset(u64 addr, int page_shift, u32 *offset)
 {
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index dff1cfc..ec532f0 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -602,7 +602,7 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
 			int *ncont, int *order);
 void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 			    int page_shift, size_t offset, size_t num_pages,
-			    __be64 *pas, int access_flags);
+			    __be64 *pas, int access_flags, void *data);
 void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 			  int page_shift, __be64 *pas, int access_flags);
 void mlx5_ib_copy_pas(u64 *old, u64 *new, int step, int num);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 71c5935..51a7775 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -912,7 +912,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
 		if (!zap) {
 			__mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT,
 					       start_page_index, npages, pas,
-					       MLX5_IB_MTT_PRESENT);
+					       MLX5_IB_MTT_PRESENT, NULL);
 			/* Clear padding after the pages brought from the
 			 * umem. */
 			memset(pas + npages, 0, size - npages * sizeof(u64));
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 30/36] IB/mlx5: add a new paramter to mlx5_ib_update_mtt() for ODP with HMM.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (8 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 29/36] IB/mlx5: add a new paramter to __mlx_ib_populated_pas for ODP with HMM jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 31/36] IB/odp: export rbt_ib_umem_for_each_in_range() jglisse
                     ` (5 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, linux-rdma

From: JA(C)rA'me Glisse <jglisse@redhat.com>

When using HMM for ODP it will be usefull to pass the current mirror
page table iterator for mlx5_ib_update_mtt() function benefit. Add
void parameter for this.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +-
 drivers/infiniband/hw/mlx5/mr.c      | 4 ++--
 drivers/infiniband/hw/mlx5/odp.c     | 8 +++++---
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index ec532f0..ec629f2 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -569,7 +569,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 				  u64 virt_addr, int access_flags,
 				  struct ib_udata *udata);
 int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index,
-		       int npages, int zap);
+		       int npages, int zap, void *data);
 int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
 int mlx5_ib_destroy_mr(struct ib_mr *ibmr);
 struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 51a7775..759ed15 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -845,7 +845,7 @@ free_mr:
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
-		       int zap)
+		       int zap, void *data)
 {
 	struct mlx5_ib_dev *dev = mr->dev;
 	struct device *ddev = dev->ib_dev.dma_device;
@@ -912,7 +912,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
 		if (!zap) {
 			__mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT,
 					       start_page_index, npages, pas,
-					       MLX5_IB_MTT_PRESENT, NULL);
+					       MLX5_IB_MTT_PRESENT, data);
 			/* Clear padding after the pages brought from the
 			 * umem. */
 			memset(pas + npages, 0, size - npages * sizeof(u64));
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5099db0..5171959 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -91,14 +91,15 @@ void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 
 			if (in_block && umr_offset == 0) {
 				mlx5_ib_update_mtt(mr, blk_start_idx,
-						   idx - blk_start_idx, 1);
+						   idx - blk_start_idx, 1,
+						   NULL);
 				in_block = 0;
 			}
 		}
 	}
 	if (in_block)
 		mlx5_ib_update_mtt(mr, blk_start_idx, idx - blk_start_idx + 1,
-				   1);
+				   1, NULL);
 
 	/*
 	 * We are now sure that the device will not access the
@@ -256,7 +257,8 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
 			 * this MR, since ib_umem_odp_map_dma_pages already
 			 * checks this.
 			 */
-			ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0);
+			ret = mlx5_ib_update_mtt(mr, start_idx,
+						 npages, 0, NULL);
 		} else {
 			ret = -EAGAIN;
 		}
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 31/36] IB/odp: export rbt_ib_umem_for_each_in_range()
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (9 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 30/36] IB/mlx5: add a new paramter to mlx5_ib_update_mtt() " jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 32/36] IB/odp/hmm: add new kernel option to use HMM for ODP jglisse
                     ` (4 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, linux-rdma

From: JA(C)rA'me Glisse <jglisse@redhat.com>

The mlx5 driver will need this function for its driver specific bit
of ODP (on demand paging) on HMM (Heterogeneous Memory Management).

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/core/umem_rbtree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/core/umem_rbtree.c b/drivers/infiniband/core/umem_rbtree.c
index 727d788..f030ec0 100644
--- a/drivers/infiniband/core/umem_rbtree.c
+++ b/drivers/infiniband/core/umem_rbtree.c
@@ -92,3 +92,4 @@ int rbt_ib_umem_for_each_in_range(struct rb_root *root,
 
 	return ret_val;
 }
+EXPORT_SYMBOL(rbt_ib_umem_for_each_in_range);
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 32/36] IB/odp/hmm: add new kernel option to use HMM for ODP.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (10 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 31/36] IB/odp: export rbt_ib_umem_for_each_in_range() jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM jglisse
                     ` (3 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, linux-rdma

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This is a preparatory patch for HMM implementation of ODP (on demand
paging). It introduce a new configure option and add proper build
time conditional code section. Enabling INFINIBAND_ON_DEMAND_PAGING_HMM
will result in build error with this patch.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/Kconfig                   |  10 ++
 drivers/infiniband/core/umem_odp.c           |   4 +
 drivers/infiniband/core/uverbs_cmd.c         |  17 +++-
 drivers/infiniband/hw/mlx5/main.c            |  10 +-
 drivers/infiniband/hw/mlx5/mem.c             |   8 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h         |   9 +-
 drivers/infiniband/hw/mlx5/mr.c              |  10 +-
 drivers/infiniband/hw/mlx5/odp.c             | 135 ++++++++++++++-------------
 drivers/infiniband/hw/mlx5/qp.c              |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/qp.c |   4 +-
 include/rdma/ib_umem_odp.h                   |  52 +++++++----
 include/rdma/ib_verbs.h                      |   4 +-
 12 files changed, 164 insertions(+), 101 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index b899531..764f524 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -49,6 +49,16 @@ config INFINIBAND_ON_DEMAND_PAGING
 	  memory regions without pinning their pages, fetching the
 	  pages on demand instead.
 
+config INFINIBAND_ON_DEMAND_PAGING_HMM
+	bool "InfiniBand on-demand paging support using HMM."
+	depends on HMM
+	depends on INFINIBAND_ON_DEMAND_PAGING
+	default n
+	---help---
+	  Use HMM (heterogeneous memory management) kernel API for
+	  on demand paging. No userspace difference, this is just
+	  an alternative implementation of the feature.
+
 config INFINIBAND_ADDR_TRANS
 	bool
 	depends on INFINIBAND
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index d10dd88..e55e124 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,6 +41,9 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_umem_odp.h>
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 static void ib_umem_notifier_start_account(struct ib_umem *item)
 {
 	mutex_lock(&item->odp_data->umem_mutex);
@@ -667,3 +670,4 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 virt,
 	mutex_unlock(&umem->odp_data->umem_mutex);
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index a9f0489..ccd6bbe 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -290,8 +290,10 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	struct ib_udata                   udata;
 	struct ib_device                 *ibdev = file->device->ib_dev;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 	struct ib_device_attr		  dev_attr;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 	struct ib_ucontext		 *ucontext;
 	struct file			 *filp;
 	int ret;
@@ -335,6 +337,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	ucontext->closing = 0;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 	ucontext->umem_tree = RB_ROOT;
 	init_rwsem(&ucontext->umem_rwsem);
 	ucontext->odp_mrs_count = 0;
@@ -345,8 +348,8 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		goto err_free;
 	if (!(dev_attr.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
 		ucontext->invalidate_range = NULL;
-
-#endif
+#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	resp.num_comp_vectors = file->device->num_comp_vectors;
 
@@ -3335,6 +3338,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 		goto end;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	resp.odp_caps.general_caps = attr.odp_caps.general_caps;
 	resp.odp_caps.per_transport_caps.rc_odp_caps =
 		attr.odp_caps.per_transport_caps.rc_odp_caps;
@@ -3343,9 +3349,10 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	resp.odp_caps.per_transport_caps.ud_odp_caps =
 		attr.odp_caps.per_transport_caps.ud_odp_caps;
 	resp.odp_caps.reserved = 0;
-#else
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 	memset(&resp.odp_caps, 0, sizeof(resp.odp_caps));
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 	resp.response_length += sizeof(resp.odp_caps);
 
 end:
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 57c9809..d553f90 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -156,10 +156,14 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 	props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	if (dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
 		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
 	props->odp_caps = dev->odp_caps;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 out:
 	kfree(in_mad);
@@ -486,8 +490,10 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev,
 	}
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 	context->ibucontext.invalidate_range = &mlx5_ib_invalidate_range;
-#endif
+#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	INIT_LIST_HEAD(&context->db_page_list);
 	mutex_init(&context->db_page_mutex);
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index df56b7d..21084c7 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -132,7 +132,7 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
 
 	return mtt_entry;
 }
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 /*
  * Populate the given array with bus addresses from the umem.
@@ -163,6 +163,9 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 	struct scatterlist *sg;
 	int entry;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	const bool odp = umem->odp_data != NULL;
 
 	if (odp) {
@@ -176,7 +179,8 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 		}
 		return;
 	}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	i = 0;
 	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index ec629f2..a6d62be 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -231,7 +231,7 @@ struct mlx5_ib_qp {
 	 */
 	spinlock_t              disable_page_faults_lock;
 	struct mlx5_ib_pfault	pagefaults[MLX5_IB_PAGEFAULT_CONTEXTS];
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 };
 
 struct mlx5_ib_cq_buf {
@@ -440,7 +440,7 @@ struct mlx5_ib_dev {
 	 * being used by a page fault handler.
 	 */
 	struct srcu_struct      mr_srcu;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 };
 
 static inline struct mlx5_ib_cq *to_mibcq(struct mlx5_core_cq *mcq)
@@ -627,8 +627,13 @@ int __init mlx5_ib_odp_init(void);
 void mlx5_ib_odp_cleanup(void);
 void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
 void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 			      unsigned long end);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 static inline int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 759ed15..23cd123 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -62,7 +62,7 @@ static int destroy_mkey(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 	/* Wait until all page fault handlers using the mr complete. */
 	synchronize_srcu(&dev->mr_srcu);
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	return err;
 }
@@ -1114,7 +1114,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		 */
 		smp_wmb();
 	}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	return &mr->ibmr;
 
@@ -1209,9 +1209,13 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
 		mr->live = 0;
 		/* Wait for all running page-fault handlers to finish. */
 		synchronize_srcu(&dev->mr_srcu);
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 		/* Destroy all page mappings */
 		mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
 					 ib_umem_end(umem));
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 		/*
 		 * We kill the umem before the MR for ODP,
 		 * so that there will not be any invalidations in
@@ -1223,7 +1227,7 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
 		/* Avoid double-freeing the umem. */
 		umem = NULL;
 	}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	clean_mr(mr);
 
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5171959..1de4d13 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -37,12 +37,30 @@
 
 #define MAX_PREFETCH_LEN (4*1024*1024U)
 
+struct workqueue_struct *mlx5_ib_page_fault_wq;
+
+static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
+						   u32 key)
+{
+	u32 base_key = mlx5_base_mkey(key);
+	struct mlx5_core_mr *mmr = __mlx5_mr_lookup(dev->mdev, base_key);
+	struct mlx5_ib_mr *mr = container_of(mmr, struct mlx5_ib_mr, mmr);
+
+	if (!mmr || mmr->key != key || !mr->live)
+		return NULL;
+
+	return container_of(mmr, struct mlx5_ib_mr, mmr);
+}
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
 /* Timeout in ms to wait for an active mmu notifier to complete when handling
  * a pagefault. */
 #define MMU_NOTIFIER_TIMEOUT 1000
 
-struct workqueue_struct *mlx5_ib_page_fault_wq;
-
 void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 			      unsigned long end)
 {
@@ -110,67 +128,6 @@ void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 	ib_umem_odp_unmap_dma_pages(umem, start, end);
 }
 
-#define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do {	\
-	if (be32_to_cpu(reg.field_name) & MLX5_ODP_SUPPORT_##bit_name)	\
-		ib_caps->field_name |= IB_ODP_SUPPORT_##bit_name;	\
-} while (0)
-
-int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
-{
-	int err;
-	struct mlx5_odp_caps hw_caps;
-	struct ib_odp_caps *caps = &dev->odp_caps;
-
-	memset(caps, 0, sizeof(*caps));
-
-	if (!(dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG))
-		return 0;
-
-	err = mlx5_query_odp_caps(dev->mdev, &hw_caps);
-	if (err)
-		goto out;
-
-	caps->general_caps = IB_ODP_SUPPORT;
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.ud_odp_caps,
-			       SEND);
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
-			       SEND);
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
-			       RECV);
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
-			       WRITE);
-	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
-			       READ);
-
-out:
-	return err;
-}
-
-static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
-						   u32 key)
-{
-	u32 base_key = mlx5_base_mkey(key);
-	struct mlx5_core_mr *mmr = __mlx5_mr_lookup(dev->mdev, base_key);
-	struct mlx5_ib_mr *mr = container_of(mmr, struct mlx5_ib_mr, mmr);
-
-	if (!mmr || mmr->key != key || !mr->live)
-		return NULL;
-
-	return container_of(mmr, struct mlx5_ib_mr, mmr);
-}
-
-static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp,
-				      struct mlx5_ib_pfault *pfault,
-				      int error) {
-	struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
-	int ret = mlx5_core_page_fault_resume(dev->mdev, qp->mqp.qpn,
-					      pfault->mpfault.flags,
-					      error);
-	if (ret)
-		pr_err("Failed to resolve the page fault on QP 0x%x\n",
-		       qp->mqp.qpn);
-}
-
 /*
  * Handle a single data segment in a page-fault WQE.
  *
@@ -298,6 +255,58 @@ srcu_unlock:
 	return ret ? ret : npages;
 }
 
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
+#define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do {	\
+	if (be32_to_cpu(reg.field_name) & MLX5_ODP_SUPPORT_##bit_name)	\
+		ib_caps->field_name |= IB_ODP_SUPPORT_##bit_name;	\
+} while (0)
+
+int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
+{
+	int err;
+	struct mlx5_odp_caps hw_caps;
+	struct ib_odp_caps *caps = &dev->odp_caps;
+
+	memset(caps, 0, sizeof(*caps));
+
+	if (!(dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG))
+		return 0;
+
+	err = mlx5_query_odp_caps(dev->mdev, &hw_caps);
+	if (err)
+		goto out;
+
+	caps->general_caps = IB_ODP_SUPPORT;
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.ud_odp_caps,
+			       SEND);
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+			       SEND);
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+			       RECV);
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+			       WRITE);
+	COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+			       READ);
+
+out:
+	return err;
+}
+
+static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp,
+				      struct mlx5_ib_pfault *pfault,
+				      int error) {
+	struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
+	int ret = mlx5_core_page_fault_resume(dev->mdev, qp->mqp.qpn,
+					      pfault->mpfault.flags,
+					      error);
+	if (ret)
+		pr_err("Failed to resolve the page fault on QP 0x%x\n",
+		       qp->mqp.qpn);
+}
+
 /**
  * Parse a series of data segments for page fault handling.
  *
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index d35f62d..e5dec1e 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3046,7 +3046,7 @@ int mlx5_ib_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr
 	 * based upon this query's result.
 	 */
 	flush_workqueue(mlx5_ib_page_fault_wq);
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	mutex_lock(&qp->mutex);
 	outb = kzalloc(sizeof(*outb), GFP_KERNEL);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index dc7dbf7..a437a14 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -175,7 +175,7 @@ void mlx5_eq_pagefault(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe)
 
 	mlx5_core_put_rsc(common);
 }
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 int mlx5_core_create_qp(struct mlx5_core_dev *dev,
 			struct mlx5_core_qp *qp,
@@ -440,4 +440,4 @@ int mlx5_core_page_fault_resume(struct mlx5_core_dev *dev, u32 qpn,
 	return err;
 }
 EXPORT_SYMBOL_GPL(mlx5_core_page_fault_resume);
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 3da0b16..765aeb3 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -43,6 +43,9 @@ struct umem_odp_node {
 };
 
 struct ib_umem_odp {
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else
 	/*
 	 * An array of the pages included in the on-demand paging umem.
 	 * Indices of pages that are currently not mapped into the device will
@@ -62,8 +65,6 @@ struct ib_umem_odp {
 	 * also protects access to the mmu notifier counters.
 	 */
 	struct mutex		umem_mutex;
-	void			*private; /* for the HW driver to use. */
-
 	/* When false, use the notifier counter in the ucontext struct. */
 	bool mn_counters_active;
 	int notifiers_seq;
@@ -72,12 +73,13 @@ struct ib_umem_odp {
 	/* A linked list of umems that don't have private mmu notifier
 	 * counters yet. */
 	struct list_head no_private_counters;
+	struct completion	notifier_completion;
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+	void			*private; /* for the HW driver to use. */
 	struct ib_umem		*umem;
 
 	/* Tree tracking */
 	struct umem_odp_node	interval_tree;
-
-	struct completion	notifier_completion;
 	int			dying;
 };
 
@@ -87,6 +89,28 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem);
 
 void ib_umem_odp_release(struct ib_umem *umem);
 
+void rbt_ib_umem_insert(struct umem_odp_node *node, struct rb_root *root);
+void rbt_ib_umem_remove(struct umem_odp_node *node, struct rb_root *root);
+typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
+			      void *cookie);
+/*
+ * Call the callback on each ib_umem in the range. Returns the logical or of
+ * the return values of the functions called.
+ */
+int rbt_ib_umem_for_each_in_range(struct rb_root *root, u64 start, u64 end,
+				  umem_call_back cb, void *cookie);
+
+struct umem_odp_node *rbt_ib_umem_iter_first(struct rb_root *root,
+					     u64 start, u64 last);
+struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
+					    u64 start, u64 last);
+
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
 /*
  * The lower 2 bits of the DMA address signal the R/W permissions for
  * the entry. To upgrade the permissions, provide the appropriate
@@ -100,28 +124,13 @@ void ib_umem_odp_release(struct ib_umem *umem);
 
 #define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
 
+
 int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 start_offset, u64 bcnt,
 			      u64 access_mask, unsigned long current_seq);
 
 void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 start_offset,
 				 u64 bound);
 
-void rbt_ib_umem_insert(struct umem_odp_node *node, struct rb_root *root);
-void rbt_ib_umem_remove(struct umem_odp_node *node, struct rb_root *root);
-typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
-			      void *cookie);
-/*
- * Call the callback on each ib_umem in the range. Returns the logical or of
- * the return values of the functions called.
- */
-int rbt_ib_umem_for_each_in_range(struct rb_root *root, u64 start, u64 end,
-				  umem_call_back cb, void *cookie);
-
-struct umem_odp_node *rbt_ib_umem_iter_first(struct rb_root *root,
-					     u64 start, u64 last);
-struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
-					    u64 start, u64 last);
-
 static inline int ib_umem_mmu_notifier_retry(struct ib_umem *item,
 					     unsigned long mmu_seq)
 {
@@ -145,8 +154,11 @@ static inline int ib_umem_mmu_notifier_retry(struct ib_umem *item,
 	return 0;
 }
 
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
+
 static inline int ib_umem_odp_get(struct ib_ucontext *context,
 				  struct ib_umem *umem)
 {
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..7b00d30 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1157,6 +1157,7 @@ struct ib_ucontext {
 
 	struct pid             *tgid;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 	struct rb_root      umem_tree;
 	/*
 	 * Protects .umem_rbroot and tree, as well as odp_mrs_count and
@@ -1171,7 +1172,8 @@ struct ib_ucontext {
 	/* A list of umems that don't have private mmu notifier counters yet. */
 	struct list_head	no_private_counters;
 	int                     odp_mrs_count;
-#endif
+#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 };
 
 struct ib_uobject {
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (11 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 32/36] IB/odp/hmm: add new kernel option to use HMM for ODP jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-06-24 13:59     ` Haggai Eran
  2015-05-21 20:23   ` [PATCH 34/36] IB/mlx5/hmm: add mlx5 HMM device initialization and callback jglisse
                     ` (2 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, linux-rdma

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This add new core infiniband structure and helper to implement ODP (on
demand paging) on top of HMM. We need to retain the tree of ib_umem as
some hardware associate unique identifiant with each umem (or mr) and
only allow hardware page table to be updated using this unique id.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/core/umem_odp.c    | 148 +++++++++++++++++++++++++++++++++-
 drivers/infiniband/core/uverbs_cmd.c  |   6 +-
 drivers/infiniband/core/uverbs_main.c |   6 ++
 include/rdma/ib_umem_odp.h            |  28 ++++++-
 include/rdma/ib_verbs.h               |  17 +++-
 5 files changed, 199 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e55e124..d5d57a8 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,9 +41,155 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_umem_odp.h>
 
+
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+
+
+static void ib_mirror_destroy(struct kref *kref)
+{
+	struct ib_mirror *ib_mirror;
+	struct ib_device *ib_device;
+
+	ib_mirror = container_of(kref, struct ib_mirror, kref);
+	hmm_mirror_unregister(&ib_mirror->base);
+
+	ib_device = ib_mirror->ib_device;
+	mutex_lock(&ib_device->hmm_mutex);
+	list_del_init(&ib_mirror->list);
+	mutex_unlock(&ib_device->hmm_mutex);
+	kfree(ib_mirror);
+}
+
+void ib_mirror_unref(struct ib_mirror *ib_mirror)
+{
+	if (ib_mirror == NULL)
+		return;
+
+	kref_put(&ib_mirror->kref, ib_mirror_destroy);
+}
+EXPORT_SYMBOL(ib_mirror_unref);
+
+static inline struct ib_mirror *ib_mirror_ref(struct ib_mirror *ib_mirror)
+{
+	if (!ib_mirror || !kref_get_unless_zero(&ib_mirror->kref))
+		return NULL;
+	return ib_mirror;
+}
+
+int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
+{
+	struct mm_struct *mm = get_task_mm(current);
+	struct ib_device *ib_device = context->device;
+	struct ib_mirror *ib_mirror;
+	struct pid *our_pid;
+	int ret;
+
+	if (!mm || !ib_device->hmm_ready)
+		return -EINVAL;
+
+	/* FIXME can this really happen ? */
+	if (unlikely(ib_umem_start(umem) == ib_umem_end(umem)))
+		return -EINVAL;
+
+	/* Prevent creating ODP MRs in child processes */
+	rcu_read_lock();
+	our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
+	rcu_read_unlock();
+	put_pid(our_pid);
+	if (context->tgid != our_pid) {
+		mmput(mm);
+		return -EINVAL;
+	}
+
+	umem->hugetlb = 0;
+	umem->odp_data = kmalloc(sizeof(*umem->odp_data), GFP_KERNEL);
+	if (umem->odp_data == NULL) {
+		mmput(mm);
+		return -ENOMEM;
+	}
+	umem->odp_data->private = NULL;
+	umem->odp_data->umem = umem;
+
+	mutex_lock(&ib_device->hmm_mutex);
+	/* Is there an existing mirror for this process mm ? */
+	ib_mirror = ib_mirror_ref(context->ib_mirror);
+	if (!ib_mirror)
+		list_for_each_entry(ib_mirror, &ib_device->ib_mirrors, list) {
+			if (ib_mirror->base.hmm->mm != mm)
+				continue;
+			ib_mirror = ib_mirror_ref(ib_mirror);
+			break;
+		}
+
+	if (ib_mirror == NULL ||
+	    ib_mirror == list_first_entry(&ib_device->ib_mirrors,
+					  struct ib_mirror, list)) {
+		/* We need to create a new mirror. */
+		ib_mirror = kmalloc(sizeof(*ib_mirror), GFP_KERNEL);
+		if (ib_mirror == NULL) {
+			mutex_unlock(&ib_device->hmm_mutex);
+			mmput(mm);
+			return -ENOMEM;
+		}
+		kref_init(&ib_mirror->kref);
+		init_rwsem(&ib_mirror->hmm_mr_rwsem);
+		ib_mirror->umem_tree = RB_ROOT;
+		ib_mirror->ib_device = ib_device;
+
+		ib_mirror->base.device = &ib_device->hmm_dev;
+		ret = hmm_mirror_register(&ib_mirror->base);
+		if (ret) {
+			mutex_unlock(&ib_device->hmm_mutex);
+			kfree(ib_mirror);
+			mmput(mm);
+			return ret;
+		}
+
+		list_add(&ib_mirror->list, &ib_device->ib_mirrors);
+		context->ib_mirror = ib_mirror_ref(ib_mirror);
+	}
+	mutex_unlock(&ib_device->hmm_mutex);
+	umem->odp_data.ib_mirror = ib_mirror;
+
+	down_write(&ib_mirror->umem_rwsem);
+	rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	up_write(&ib_mirror->umem_rwsem);
+
+	mmput(mm);
+	return 0;
+}
+
+void ib_umem_odp_release(struct ib_umem *umem)
+{
+	struct ib_mirror *ib_mirror = umem->odp_data;
+
+	/*
+	 * Ensure that no more pages are mapped in the umem.
+	 *
+	 * It is the driver's responsibility to ensure, before calling us,
+	 * that the hardware will not attempt to access the MR any more.
+	 */
+
+	/* One optimization to release resources early here would be to call :
+	 *	hmm_mirror_range_discard(&ib_mirror->base,
+	 *			 ib_umem_start(umem),
+	 *			 ib_umem_end(umem));
+	 * But we can have overlapping umem so we would need to only discard
+	 * range covered by one and only one umem while holding the umem rwsem.
+	 */
+	down_write(&ib_mirror->umem_rwsem);
+	rbt_ib_umem_remove(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	up_write(&ib_mirror->umem_rwsem);
+
+	ib_mirror_unref(ib_mirror);
+	kfree(umem->odp_data);
+	kfree(umem);
+}
+
+
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
 static void ib_umem_notifier_start_account(struct ib_umem *item)
 {
 	mutex_lock(&item->odp_data->umem_mutex);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index ccd6bbe..3225ab5 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -337,7 +337,9 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	ucontext->closing = 0;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	ucontext->ib_mirror = NULL;
+#else  /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	ucontext->umem_tree = RB_ROOT;
 	init_rwsem(&ucontext->umem_rwsem);
 	ucontext->odp_mrs_count = 0;
@@ -348,7 +350,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		goto err_free;
 	if (!(dev_attr.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
 		ucontext->invalidate_range = NULL;
-#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 	resp.num_comp_vectors = file->device->num_comp_vectors;
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 88cce9b..3f069d7 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -45,6 +45,7 @@
 #include <linux/cdev.h>
 #include <linux/anon_inodes.h>
 #include <linux/slab.h>
+#include <rdma/ib_umem_odp.h>
 
 #include <asm/uaccess.h>
 
@@ -297,6 +298,11 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		kfree(uobj);
 	}
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	ib_mirror_unref(context->ib_mirror);
+	context->ib_mirror = NULL;
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
 	put_pid(context->tgid);
 
 	return context->device->dealloc_ucontext(context);
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 765aeb3..c7c2670 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -37,6 +37,32 @@
 #include <rdma/ib_verbs.h>
 #include <linux/interval_tree.h>
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+/* struct ib_mirror - per process mirror structure for infiniband driver.
+ *
+ * @ib_device: Infiniband device this mirror is associated with.
+ * @base: The hmm base mirror struct.
+ * @kref: Refcount for the structure.
+ * @list: For the list of ib_mirror of a given ib_device.
+ * @umem_tree: Red black tree of ib_umem ordered by virtual address.
+ * @umem_rwsem: Semaphore protecting the reb black tree.
+ *
+ * Because ib_ucontext struct is tie to file descriptor there can be several of
+ * them for a same process, which violate HMM requirement. Hence we create only
+ * one ib_mirror struct per process and have each ib_umem struct reference it.
+ */
+struct ib_mirror {
+	struct ib_device	*ib_device;
+	struct hmm_mirror	base;
+	struct kref		kref;
+	struct list_head	list;
+	struct rb_root		umem_tree;
+	struct rw_semaphore	umem_rwsem;
+};
+
+void ib_mirror_unref(struct ib_mirror *ib_mirror);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
 struct umem_odp_node {
 	u64 __subtree_last;
 	struct rb_node rb;
@@ -44,7 +70,7 @@ struct umem_odp_node {
 
 struct ib_umem_odp {
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+	struct ib_mirror	*ib_mirror;
 #else
 	/*
 	 * An array of the pages included in the on-demand paging umem.
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7b00d30..83da1bd 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -49,6 +49,9 @@
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
 #include <uapi/linux/if_ether.h>
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#include <linux/hmm.h>
+#endif
 
 #include <linux/atomic.h>
 #include <linux/mmu_notifier.h>
@@ -1157,7 +1160,9 @@ struct ib_ucontext {
 
 	struct pid             *tgid;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	struct ib_mirror	*ib_mirror;
+#else  /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	struct rb_root      umem_tree;
 	/*
 	 * Protects .umem_rbroot and tree, as well as odp_mrs_count and
@@ -1172,7 +1177,7 @@ struct ib_ucontext {
 	/* A list of umems that don't have private mmu notifier counters yet. */
 	struct list_head	no_private_counters;
 	int                     odp_mrs_count;
-#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 };
 
@@ -1657,6 +1662,14 @@ struct ib_device {
 
 	struct ib_dma_mapping_ops   *dma_ops;
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+	/* For ODP using HMM. */
+	struct hmm_device	     hmm_dev;
+	struct list_head	     ib_mirrors;
+	struct mutex		     hmm_mutex;
+	bool			     hmm_ready;
+#endif
+
 	struct module               *owner;
 	struct device                dev;
 	struct kobject               *ports_parent;
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 34/36] IB/mlx5/hmm: add mlx5 HMM device initialization and callback.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (12 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 35/36] IB/mlx5/hmm: add page fault support for ODP on HMM jglisse
  2015-05-21 20:23   ` [PATCH 36/36] IB/mlx5/hmm: enable ODP using HMM jglisse
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, linux-rdma

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This add the core HMM callback for mlx5 device driver and initialize
the HMM device for the mlx5 infiniband device driver.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/core/umem_odp.c   |  12 ++-
 drivers/infiniband/hw/mlx5/main.c    |   5 +
 drivers/infiniband/hw/mlx5/mem.c     |  36 ++++++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  19 +++-
 drivers/infiniband/hw/mlx5/mr.c      |   9 +-
 drivers/infiniband/hw/mlx5/odp.c     | 177 ++++++++++++++++++++++++++++++++++-
 include/rdma/ib_umem_odp.h           |  20 +++-
 7 files changed, 268 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index d5d57a8..559542d 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -132,7 +132,7 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
 			return -ENOMEM;
 		}
 		kref_init(&ib_mirror->kref);
-		init_rwsem(&ib_mirror->hmm_mr_rwsem);
+		init_rwsem(&ib_mirror->umem_rwsem);
 		ib_mirror->umem_tree = RB_ROOT;
 		ib_mirror->ib_device = ib_device;
 
@@ -149,10 +149,11 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
 		context->ib_mirror = ib_mirror_ref(ib_mirror);
 	}
 	mutex_unlock(&ib_device->hmm_mutex);
-	umem->odp_data.ib_mirror = ib_mirror;
+	umem->odp_data->ib_mirror = ib_mirror;
 
 	down_write(&ib_mirror->umem_rwsem);
-	rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	rbt_ib_umem_insert(&umem->odp_data->interval_tree,
+			   &ib_mirror->umem_tree);
 	up_write(&ib_mirror->umem_rwsem);
 
 	mmput(mm);
@@ -161,7 +162,7 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
 
 void ib_umem_odp_release(struct ib_umem *umem)
 {
-	struct ib_mirror *ib_mirror = umem->odp_data;
+	struct ib_mirror *ib_mirror = umem->odp_data->ib_mirror;
 
 	/*
 	 * Ensure that no more pages are mapped in the umem.
@@ -178,7 +179,8 @@ void ib_umem_odp_release(struct ib_umem *umem)
 	 * range covered by one and only one umem while holding the umem rwsem.
 	 */
 	down_write(&ib_mirror->umem_rwsem);
-	rbt_ib_umem_remove(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	rbt_ib_umem_remove(&umem->odp_data->interval_tree,
+			   &ib_mirror->umem_tree);
 	up_write(&ib_mirror->umem_rwsem);
 
 	ib_mirror_unref(ib_mirror);
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index d553f90..eddabf0 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1316,6 +1316,9 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 	if (err)
 		goto err_rsrc;
 
+	/* If HMM initialization fails we just do not enable odp. */
+	mlx5_dev_init_odp_hmm(&dev->ib_dev, &mdev->pdev->dev);
+
 	err = ib_register_device(&dev->ib_dev, NULL);
 	if (err)
 		goto err_odp;
@@ -1340,6 +1343,7 @@ err_umrc:
 
 err_dev:
 	ib_unregister_device(&dev->ib_dev);
+	mlx5_dev_fini_odp_hmm(&dev->ib_dev);
 
 err_odp:
 	mlx5_ib_odp_remove_one(dev);
@@ -1359,6 +1363,7 @@ static void mlx5_ib_remove(struct mlx5_core_dev *mdev, void *context)
 
 	ib_unregister_device(&dev->ib_dev);
 	destroy_umrc_res(dev);
+	mlx5_dev_fini_odp_hmm(&dev->ib_dev);
 	mlx5_ib_odp_remove_one(dev);
 	destroy_dev_resources(&dev->devr);
 	ib_dealloc_device(&dev->ib_dev);
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 21084c7..f150825 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -164,7 +164,41 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
 	int entry;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+	if (umem->odp_data) {
+		struct ib_mirror *ib_mirror = umem->odp_data->ib_mirror;
+		struct hmm_mirror *mirror = &ib_mirror->base;
+		struct hmm_pt_iter *iter = data, local_iter;
+		unsigned long addr;
+
+		if (iter == NULL) {
+			iter = &local_iter;
+			hmm_pt_iter_init(iter);
+		}
+
+		for (i = 0, addr = ib_umem_start(umem) + (offset << PAGE_SHIFT);
+		     i < num_pages; ++i, addr += PAGE_SIZE) {
+			dma_addr_t *ptep, pte;
+
+			/* Get and lock pointer to mirror page table. */
+			ptep = hmm_pt_iter_update(iter, &mirror->pt, addr);
+			pte = ptep ? *ptep : 0;
+			/* HMM will not have any page tables set up, if this
+			 * function is called before page faults have happened
+			 * on the MR. In that case, we don't have PA's yet, so
+			 * just set each one to zero and continue on. The hw
+			 * will trigger a page fault.
+			 */
+			if (hmm_pte_test_valid_dma(&pte))
+				pas[i] = cpu_to_be64(umem_dma_to_mtt(pte));
+			else
+				pas[i] = (__be64)0;
+		}
+
+		if (iter == &local_iter)
+			hmm_pt_iter_fini(iter, &mirror->pt);
+
+		return;
+	}
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	const bool odp = umem->odp_data != NULL;
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index a6d62be..f1bafd4 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -615,6 +615,7 @@ int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
 			    struct ib_mr_status *mr_status);
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+
 extern struct workqueue_struct *mlx5_ib_page_fault_wq;
 
 int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev);
@@ -629,13 +630,18 @@ void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
 void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+void mlx5_dev_init_odp_hmm(struct ib_device *ib_dev, struct device *dev);
+void mlx5_dev_fini_odp_hmm(struct ib_device *ib_dev);
+int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
+			    u64 end, void *cookie);
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
 			      unsigned long end);
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
+
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
 static inline int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
 {
 	return 0;
@@ -671,4 +677,15 @@ static inline u8 convert_access(int acc)
 #define MLX5_MAX_UMR_SHIFT 16
 #define MLX5_MAX_UMR_PAGES (1 << MLX5_MAX_UMR_SHIFT)
 
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+static inline void mlx5_dev_init_odp_hmm(struct ib_device *ib_dev,
+					 struct device *dev)
+{
+}
+
+static inline void mlx5_dev_fini_odp_hmm(struct ib_device *ib_dev)
+{
+}
+#endif /* ! CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
 #endif /* MLX5_IB_H */
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 23cd123..7b2d84a 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1210,7 +1210,14 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
 		/* Wait for all running page-fault handlers to finish. */
 		synchronize_srcu(&dev->mr_srcu);
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+		if (mlx5_ib_umem_invalidate(umem, ib_umem_start(umem),
+					    ib_umem_end(umem), NULL))
+			/*
+			 * FIXME do something to kill all mr and umem
+			 * in use by this process.
+			 */
+			pr_err("killing all mr with odp due to "
+			       "mtt update failure\n");
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 		/* Destroy all page mappings */
 		mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 1de4d13..bd29155 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -52,8 +52,183 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
 	return container_of(mmr, struct mlx5_ib_mr, mmr);
 }
 
+
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+
+
+int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
+			    u64 end, void *cookie)
+{
+	const u64 umr_block_mask = (MLX5_UMR_MTT_ALIGNMENT / sizeof(u64)) - 1;
+	u64 idx = 0, blk_start_idx = 0;
+	struct hmm_pt_iter iter;
+	struct mlx5_ib_mr *mlx5_ib_mr;
+	struct hmm_mirror *mirror;
+	int in_block = 0;
+	u64 addr;
+	int ret = 0;
+
+	if (!umem || !umem->odp_data) {
+		pr_err("invalidation called on NULL umem or non-ODP umem\n");
+		return -EINVAL;
+	}
+
+	/* Is this ib_mr active and registered yet ? */
+	if (umem->odp_data->private == NULL)
+		return 0;
+
+	mlx5_ib_mr = umem->odp_data->private;
+	if (!mlx5_ib_mr->ibmr.pd)
+		return 0;
+
+	start = max_t(u64, ib_umem_start(umem), start);
+	end = min_t(u64, ib_umem_end(umem), end);
+	hmm_pt_iter_init(&iter);
+	mirror = &umem->odp_data->ib_mirror->base;
+
+	/*
+	 * Iteration one - zap the HW's MTTs. HMM ensures that while we are
+	 * doing the invalidation, no page fault will attempt to overwrite the
+	 * same MTTs.  Concurent invalidations might race us, but they will
+	 * write 0s as well, so no difference in the end result.
+	 */
+	for (addr = start; addr < end; addr += (u64)umem->page_size) {
+		dma_addr_t *ptep;
+
+		/* Need to happen before ptep as ptep might break the loop and
+		 * idx might be use outside the loop.
+		 */
+		idx = (addr - ib_umem_start(umem)) / PAGE_SIZE;
+
+		/* Get and lock pointer to mirror page table. */
+		ptep = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		if (!ptep) {
+			addr = hmm_pt_iter_next(&iter, &mirror->pt, addr, end);
+			continue;
+		}
+
+		/*
+		 * Strive to write the MTTs in chunks, but avoid overwriting
+		 * non-existing MTTs. The huristic here can be improved to
+		 * estimate the cost of another UMR vs. the cost of bigger
+		 * UMR.
+		 */
+		if ((*ptep) & (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+			if ((*ptep) & ODP_WRITE_ALLOWED_BIT)
+				hmm_pte_set_dirty(ptep);
+			/*
+			 * Because there can not be concurrent overlapping
+			 * munmap, page migrate, page write protect then it
+			 * is safe here to clear those bits.
+			 */
+			hmm_pte_clear_bit(ptep, ODP_READ_ALLOWED_SHIFT);
+			hmm_pte_clear_bit(ptep, ODP_WRITE_ALLOWED_SHIFT);
+			if (!in_block) {
+				blk_start_idx = idx;
+				in_block = 1;
+			}
+		} else {
+			u64 umr_offset = idx & umr_block_mask;
+
+			if (in_block && umr_offset == 0) {
+				ret = mlx5_ib_update_mtt(mlx5_ib_mr,
+							 blk_start_idx,
+							 idx - blk_start_idx, 1,
+							 &iter) || ret;
+				in_block = 0;
+			}
+		}
+	}
+	if (in_block)
+		ret = mlx5_ib_update_mtt(mlx5_ib_mr, blk_start_idx,
+					 idx - blk_start_idx + 1, 1,
+					 &iter) || ret;
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+	return ret;
+}
+
+static int mlx5_hmm_invalidate_range(struct hmm_mirror *mirror,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct ib_mirror *ib_mirror;
+	int ret;
+
+	ib_mirror = container_of(mirror, struct ib_mirror, base);
+
+	/* Go over all memory region and invalidate them. */
+	down_read(&ib_mirror->umem_rwsem);
+	ret = rbt_ib_umem_for_each_in_range(&ib_mirror->umem_tree, start, end,
+					    mlx5_ib_umem_invalidate, NULL);
+	up_read(&ib_mirror->umem_rwsem);
+	return ret;
+}
+
+static void mlx5_hmm_release(struct hmm_mirror *mirror)
+{
+	struct ib_mirror *ib_mirror;
+
+	ib_mirror = container_of(mirror, struct ib_mirror, base);
+
+	/* Go over all memory region and invalidate them. */
+	mlx5_hmm_invalidate_range(mirror, 0, ULLONG_MAX);
+}
+
+static int mlx5_hmm_update(struct hmm_mirror *mirror,
+			   const struct hmm_event *event)
+{
+	struct device *device = mirror->device->dev;
+	int ret = 0;
+
+	switch (event->etype) {
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		/* FIXME implement. */
+		break;
+	case HMM_ISDIRTY:
+		hmm_mirror_range_dirty(mirror, event->start, event->end);
+		break;
+	case HMM_NONE:
+	default:
+		dev_warn(device, "Warning: unhandled HMM event (%d)"
+			 "defaulting to invalidation\n", event->etype);
+		/* Fallthrough. */
+	/* For write protect and fork we could only invalidate writeable mr. */
+	case HMM_WRITE_PROTECT:
+	case HMM_MIGRATE:
+	case HMM_MUNMAP:
+	case HMM_FORK:
+		ret = mlx5_hmm_invalidate_range(mirror,
+						event->start,
+						event->end);
+		break;
+	}
+
+	return ret;
+}
+
+static const struct hmm_device_ops mlx5_hmm_ops = {
+	.release		= &mlx5_hmm_release,
+	.update			= &mlx5_hmm_update,
+};
+
+void mlx5_dev_init_odp_hmm(struct ib_device *ib_device, struct device *dev)
+{
+	INIT_LIST_HEAD(&ib_device->ib_mirrors);
+	ib_device->hmm_dev.dev = dev;
+	ib_device->hmm_dev.ops = &mlx5_hmm_ops;
+	ib_device->hmm_ready = !hmm_device_register(&ib_device->hmm_dev);
+	mutex_init(&ib_device->hmm_mutex);
+}
+
+void mlx5_dev_fini_odp_hmm(struct ib_device *ib_device)
+{
+	if (!ib_device->hmm_ready)
+		return;
+	hmm_device_unregister(&ib_device->hmm_dev);
+}
+
+
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
 
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index c7c2670..e982fd3 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -133,7 +133,25 @@ struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
 
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+
+
+/*
+ * HMM have few bits reserved for hardware specific bits inside the mirror page
+ * table. For IB we record the mapping protection per page there.
+ */
+#define ODP_READ_ALLOWED_SHIFT	(HMM_PTE_HW_SHIFT + 0)
+#define ODP_WRITE_ALLOWED_SHIFT	(HMM_PTE_HW_SHIFT + 1)
+#define ODP_READ_ALLOWED_BIT	(1 << ODP_READ_ALLOWED_SHIFT)
+#define ODP_WRITE_ALLOWED_BIT	(1 << ODP_WRITE_ALLOWED_SHIFT)
+
+/* Make sure we are not overwritting valid address bit on target arch. */
+#if (HMM_PTE_HW_SHIFT + 2) > PAGE_SHIFT
+#error (HMM_PTE_HW_SHIFT + 2) > PAGE_SHIFT
+#endif
+
+#define ODP_DMA_ADDR_MASK HMM_PTE_DMA_MASK
+
+
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
 
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 35/36] IB/mlx5/hmm: add page fault support for ODP on HMM.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (13 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 34/36] IB/mlx5/hmm: add mlx5 HMM device initialization and callback jglisse
@ 2015-05-21 20:23   ` jglisse
  2015-05-21 20:23   ` [PATCH 36/36] IB/mlx5/hmm: enable ODP using HMM jglisse
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, linux-rdma

From: JA(C)rA'me Glisse <jglisse@redhat.com>

This patch add HMM specific support for hardware page faulting of
user memory region.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/hw/mlx5/odp.c | 147 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 146 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index bd29155..093f5b8 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -56,6 +56,55 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
 
 
+struct mlx5_hmm_pfault {
+	struct mlx5_ib_mr	*mlx5_ib_mr;
+	u64			start_idx;
+	dma_addr_t		access_mask;
+	unsigned		npages;
+	struct hmm_event	event;
+};
+
+static int mlx5_hmm_pfault(struct mlx5_ib_dev *mlx5_ib_dev,
+			   struct hmm_mirror *mirror,
+			   const struct hmm_event *event)
+{
+	struct mlx5_hmm_pfault *pfault;
+	struct hmm_pt_iter iter;
+	unsigned long addr, cnt;
+	int ret;
+
+	pfault = container_of(event, struct mlx5_hmm_pfault, event);
+	hmm_pt_iter_init(&iter);
+
+	for (addr = event->start, cnt = 0; addr < event->end;
+	     addr += PAGE_SIZE, ++cnt) {
+		dma_addr_t *ptep;
+
+		/* Get and lock pointer to mirror page table. */
+		ptep = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		/* This could be BUG_ON() as it can not happen. */
+		if (!ptep || !hmm_pte_test_valid_dma(ptep)) {
+			pr_warn("got empty mirror page table on pagefault.\n");
+			return -EINVAL;
+		}
+		if ((pfault->access_mask & ODP_WRITE_ALLOWED_BIT)) {
+			if (!hmm_pte_test_write(ptep)) {
+				pr_warn("got wrong protection permission on "
+					"pagefault.\n");
+				return -EINVAL;
+			}
+			hmm_pte_set_bit(ptep, ODP_WRITE_ALLOWED_SHIFT);
+		}
+		hmm_pte_set_bit(ptep, ODP_READ_ALLOWED_SHIFT);
+		pfault->npages++;
+	}
+	ret = mlx5_ib_update_mtt(pfault->mlx5_ib_mr,
+				 pfault->start_idx,
+				 cnt, 0, &iter);
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+	return ret;
+}
+
 int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
 			    u64 end, void *cookie)
 {
@@ -178,12 +227,19 @@ static int mlx5_hmm_update(struct hmm_mirror *mirror,
 			   const struct hmm_event *event)
 {
 	struct device *device = mirror->device->dev;
+	struct mlx5_ib_dev *mlx5_ib_dev;
+	struct ib_device *ib_device;
 	int ret = 0;
 
+	ib_device = container_of(mirror->device, struct ib_device, hmm_dev);
+	mlx5_ib_dev = to_mdev(ib_device);
+
 	switch (event->etype) {
 	case HMM_DEVICE_RFAULT:
 	case HMM_DEVICE_WFAULT:
-		/* FIXME implement. */
+		ret = mlx5_hmm_pfault(mlx5_ib_dev, mirror, event);
+		if (ret)
+			return ret;
 		break;
 	case HMM_ISDIRTY:
 		hmm_mirror_range_dirty(mirror, event->start, event->end);
@@ -228,6 +284,95 @@ void mlx5_dev_fini_odp_hmm(struct ib_device *ib_device)
 	hmm_device_unregister(&ib_device->hmm_dev);
 }
 
+/*
+ * Handle a single data segment in a page-fault WQE.
+ *
+ * Returns number of pages retrieved on success. The caller will continue to
+ * the next data segment.
+ * Can return the following error codes:
+ * -EAGAIN to designate a temporary error. The caller will abort handling the
+ *  page fault and resolve it.
+ * -EFAULT when there's an error mapping the requested pages. The caller will
+ *  abort the page fault handling and possibly move the QP to an error state.
+ * On other errors the QP should also be closed with an error.
+ */
+static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
+					 struct mlx5_ib_pfault *pfault,
+					 u32 key, u64 io_virt, size_t bcnt,
+					 u32 *bytes_mapped)
+{
+	struct mlx5_ib_dev *mlx5_ib_dev = to_mdev(qp->ibqp.pd->device);
+	struct ib_mirror *ib_mirror;
+	struct mlx5_hmm_pfault hmm_pfault;
+	int srcu_key;
+	int ret = 0;
+
+	srcu_key = srcu_read_lock(&mlx5_ib_dev->mr_srcu);
+	hmm_pfault.mlx5_ib_mr = mlx5_ib_odp_find_mr_lkey(mlx5_ib_dev, key);
+	/*
+	 * If we didn't find the MR, it means the MR was closed while we were
+	 * handling the ODP event. In this case we return -EFAULT so that the
+	 * QP will be closed.
+	 */
+	if (!hmm_pfault.mlx5_ib_mr || !hmm_pfault.mlx5_ib_mr->ibmr.pd) {
+		pr_err("Failed to find relevant mr for lkey=0x%06x, probably "
+		       "the MR was destroyed\n", key);
+		ret = -EFAULT;
+		goto srcu_unlock;
+	}
+	if (!hmm_pfault.mlx5_ib_mr->umem->odp_data) {
+		pr_debug("skipping non ODP MR (lkey=0x%06x) in page fault "
+		         "handler.\n", key);
+		if (bytes_mapped)
+			*bytes_mapped +=
+				(bcnt - pfault->mpfault.bytes_committed);
+		goto srcu_unlock;
+	}
+	if (hmm_pfault.mlx5_ib_mr->ibmr.pd != qp->ibqp.pd) {
+		pr_err("Page-fault with different PDs for QP and MR.\n");
+		ret = -EFAULT;
+		goto srcu_unlock;
+	}
+
+	ib_mirror = hmm_pfault.mlx5_ib_mr->umem->odp_data->ib_mirror;
+	if (ib_mirror->base.hmm == NULL) {
+		/* Somehow the mirror was kill from under us. */
+		ret = -EFAULT;
+		goto srcu_unlock;
+	}
+
+	/*
+	 * Avoid branches - this code will perform correctly
+	 * in all iterations (in iteration 2 and above,
+	 * bytes_committed == 0).
+	 */
+	io_virt += pfault->mpfault.bytes_committed;
+	bcnt -= pfault->mpfault.bytes_committed;
+
+	hmm_pfault.npages = 0;
+	hmm_pfault.start_idx = (io_virt - (hmm_pfault.mlx5_ib_mr->mmr.iova &
+					   PAGE_MASK)) >> PAGE_SHIFT;
+	hmm_pfault.access_mask = ODP_READ_ALLOWED_BIT;
+	hmm_pfault.access_mask |= hmm_pfault.mlx5_ib_mr->umem->writable ?
+				  ODP_WRITE_ALLOWED_BIT : 0;
+	hmm_pfault.event.start = io_virt & PAGE_MASK;
+	hmm_pfault.event.end = PAGE_ALIGN(io_virt + bcnt);
+	hmm_pfault.event.etype = hmm_pfault.mlx5_ib_mr->umem->writable ?
+				 HMM_DEVICE_WFAULT : HMM_DEVICE_RFAULT;
+	ret = hmm_mirror_fault(&ib_mirror->base, &hmm_pfault.event);
+
+	if (!ret && hmm_pfault.npages && bytes_mapped) {
+		u32 new_mappings = hmm_pfault.npages * PAGE_SIZE -
+				   (io_virt - round_down(io_virt, PAGE_SIZE));
+		*bytes_mapped += min_t(u32, new_mappings, bcnt);
+	}
+
+srcu_unlock:
+	srcu_read_unlock(&mlx5_ib_dev->mr_srcu, srcu_key);
+	pfault->mpfault.bytes_committed = 0;
+	return ret ? ret : hmm_pfault.npages;
+}
+
 
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 36/36] IB/mlx5/hmm: enable ODP using HMM.
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
                     ` (14 preceding siblings ...)
  2015-05-21 20:23   ` [PATCH 35/36] IB/mlx5/hmm: add page fault support for ODP on HMM jglisse
@ 2015-05-21 20:23   ` jglisse
  15 siblings, 0 replies; 80+ messages in thread
From: jglisse @ 2015-05-21 20:23 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, linux-rdma

From: JA(C)rA'me Glisse <jglisse@redhat.com>

All pieces are in place for ODP (on demand paging) to work using HMM.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
cc: <linux-rdma@vger.kernel.org>
---
 drivers/infiniband/core/uverbs_cmd.c | 4 ----
 drivers/infiniband/hw/mlx5/main.c    | 6 +++++-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 3225ab5..16520bd 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3340,9 +3340,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 		goto end;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
-#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	resp.odp_caps.general_caps = attr.odp_caps.general_caps;
 	resp.odp_caps.per_transport_caps.rc_odp_caps =
 		attr.odp_caps.per_transport_caps.rc_odp_caps;
@@ -3351,7 +3348,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	resp.odp_caps.per_transport_caps.ud_odp_caps =
 		attr.odp_caps.per_transport_caps.ud_odp_caps;
 	resp.odp_caps.reserved = 0;
-#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 	memset(&resp.odp_caps, 0, sizeof(resp.odp_caps));
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index eddabf0..cff70a2 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -157,7 +157,11 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
-#error "CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM not supported at this stage !"
+	if ((dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG) &&
+	     ibdev->hmm_ready) {
+		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
+		props->odp_caps = dev->odp_caps;
+	}
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	if (dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
 		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
-- 
1.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3
  2015-05-21 19:31 ` [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3 j.glisse
@ 2015-05-27  5:09   ` Aneesh Kumar K.V
  2015-05-27 14:32     ` Jerome Glisse
  2015-06-02  9:32   ` John Hubbard
  1 sibling, 1 reply; 80+ messages in thread
From: Aneesh Kumar K.V @ 2015-05-27  5:09 UTC (permalink / raw)
  To: j.glisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

j.glisse@gmail.com writes:

> From: Jérôme Glisse <jglisse@redhat.com>
>
> The mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
> can be considered as forming an "atomic" section for the cpu page table update
> point of view. Between this two function the cpu page table content is unreliable
> for the address range being invalidated.
>
> Current user such as kvm need to know when they can trust the content of the cpu
> page table. This becomes even more important to new users of the mmu_notifier
> api (such as HMM or ODP).

I don't see kvm using the new APIs in this patch. Also what is that HMM use this
for, to protect walking of mirror page table ?. I am sure you are
covering that in the later patches. May be you may want to mention
the details here too. 

>
> This patch use a structure define at all call site to invalidate_range_start()
> that is added to a list for the duration of the invalidation. It adds two new
> helpers to allow querying if a range is being invalidated or to wait for a range
> to become valid.
>
> For proper synchronization, user must block new range invalidation from inside
> there invalidate_range_start() callback, before calling the helper functions.
> Otherwise there is no garanty that a new range invalidation will not be added
> after the call to the helper function to query for existing range.
>
> Changed since v1:
>   - Fix a possible deadlock in mmu_notifier_range_wait_valid()
>
> Changed since v2:
>   - Add the range to invalid range list before calling ->range_start().
>   - Del the range from invalid range list after calling ->range_end().
>   - Remove useless list initialization.
>

-aneesh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 03/36] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page()
  2015-05-21 19:31 ` [PATCH 03/36] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() j.glisse
@ 2015-05-27  5:17   ` Aneesh Kumar K.V
  2015-05-27 14:33     ` Jerome Glisse
  2015-06-03  4:25   ` John Hubbard
  1 sibling, 1 reply; 80+ messages in thread
From: Aneesh Kumar K.V @ 2015-05-27  5:17 UTC (permalink / raw)
  To: j.glisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

j.glisse@gmail.com writes:

> From: Jérôme Glisse <jglisse@redhat.com>
>
> Listener of mm event might not have easy way to get the struct page
> behind and address invalidated with mmu_notifier_invalidate_page()
> function as this happens after the cpu page table have been clear/
> updated. This happens for instance if the listener is storing a dma
> mapping inside its secondary page table. To avoid complex reverse
> dma mapping lookup just pass along a pointer to the page being
> invalidated.

.....

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index ada3ed1..283ad26 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -172,6 +172,7 @@ struct mmu_notifier_ops {
>  	void (*invalidate_page)(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
>  				unsigned long address,
> +				struct page *page,
>  				enum mmu_event event);
>  

How do we handle this w.r.t invalidate_range ? 

-aneesh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-05-21 19:31 ` [PATCH 05/36] HMM: introduce heterogeneous memory management v3 j.glisse
@ 2015-05-27  5:50   ` Aneesh Kumar K.V
  2015-05-27 14:38     ` Jerome Glisse
  2015-06-08 19:40   ` Mark Hairgrove
  1 sibling, 1 reply; 80+ messages in thread
From: Aneesh Kumar K.V @ 2015-05-27  5:50 UTC (permalink / raw)
  To: j.glisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar, linux-rdma

j.glisse@gmail.com writes:

> From: Jérôme Glisse <jglisse@redhat.com>
>
> This patch only introduce core HMM functions for registering a new
> mirror and stopping a mirror as well as HMM device registering and
> unregistering.
>
> The lifecycle of HMM object is handled differently then the one of
> mmu_notifier because unlike mmu_notifier there can be concurrent
> call from both mm code to HMM code and/or from device driver code
> to HMM code. Moreover lifetime of HMM can be uncorrelated from the
> lifetime of the process that is being mirror (GPU might take longer
> time to cleanup).
>

......

> +struct hmm_device_ops {
> +	/* release() - mirror must stop using the address space.
> +	 *
> +	 * @mirror: The mirror that link process address space with the device.
> +	 *
> +	 * When this is call, device driver must kill all device thread using

s/call/called, ?

> +	 * this mirror. Also, this callback is the last thing call by HMM and
> +	 * HMM will not access the mirror struct after this call (ie no more
> +	 * dereference of it so it is safe for the device driver to free it).
> +	 * It is call either from :
> +	 *   - mm dying (all process using this mm exiting).
> +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> +	 *   - outcome of some device error reported by any of the device
> +	 *     callback against that mirror.
> +	 */
> +	void (*release)(struct hmm_mirror *mirror);
> +};
> +
> +
> +/* struct hmm - per mm_struct HMM states.
> + *
> + * @mm: The mm struct this hmm is associated with.
> + * @mirrors: List of all mirror for this mm (one per device).
> + * @vm_end: Last valid address for this mm (exclusive).
> + * @kref: Reference counter.
> + * @rwsem: Serialize the mirror list modifications.
> + * @mmu_notifier: The mmu_notifier of this mm.
> + * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
> + *
> + * For each process address space (mm_struct) there is one and only one hmm
> + * struct. hmm functions will redispatch to each devices the change made to
> + * the process address space.
> + *
> + * Device driver must not access this structure other than for getting the
> + * mm pointer.
> + */

.....

>  #ifndef AT_VECTOR_SIZE_ARCH
>  #define AT_VECTOR_SIZE_ARCH 0
>  #endif
> @@ -451,6 +455,16 @@ struct mm_struct {
>  #ifdef CONFIG_MMU_NOTIFIER
>  	struct mmu_notifier_mm *mmu_notifier_mm;
>  #endif
> +#ifdef CONFIG_HMM
> +	/*
> +	 * hmm always register an mmu_notifier we rely on mmu notifier to keep
> +	 * refcount on mm struct as well as forbiding registering hmm on a
> +	 * dying mm
> +	 *
> +	 * This field is set with mmap_sem old in write mode.

s/old/held/ ?


> +	 */
> +	struct hmm *hmm;
> +#endif
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
>  	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
>  #endif
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0e0ae9a..4083be7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -27,6 +27,7 @@
>  #include <linux/binfmts.h>
>  #include <linux/mman.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/hmm.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/vmacache.h>
> @@ -597,6 +598,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>  	mm_init_aio(mm);
>  	mm_init_owner(mm, p);
>  	mmu_notifier_mm_init(mm);
> +	hmm_mm_init(mm);
>  	clear_tlb_flush_pending(mm);
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
>  	mm->pmd_huge_pte = NULL;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 52ffb86..189e48f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -653,3 +653,18 @@ config DEFERRED_STRUCT_PAGE_INIT
>  	  when kswapd starts. This has a potential performance impact on
>  	  processes running early in the lifetime of the systemm until kswapd
>  	  finishes the initialisation.
> +
> +if STAGING
> +config HMM
> +	bool "Enable heterogeneous memory management (HMM)"
> +	depends on MMU
> +	select MMU_NOTIFIER
> +	select GENERIC_PAGE_TABLE

What is GENERIC_PAGE_TABLE ?

> +	default n
> +	help
> +	  Heterogeneous memory management provide infrastructure for a device
> +	  to mirror a process address space into an hardware mmu or into any
> +	  things supporting pagefault like event.
> +
> +	  If unsure, say N to disable hmm.

-aneesh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3
  2015-05-27  5:09   ` Aneesh Kumar K.V
@ 2015-05-27 14:32     ` Jerome Glisse
  0 siblings, 0 replies; 80+ messages in thread
From: Jerome Glisse @ 2015-05-27 14:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

On Wed, May 27, 2015 at 10:39:23AM +0530, Aneesh Kumar K.V wrote:
> j.glisse@gmail.com writes:
> 
> > From: Jerome Glisse <jglisse@redhat.com>
> >
> > The mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
> > can be considered as forming an "atomic" section for the cpu page table update
> > point of view. Between this two function the cpu page table content is unreliable
> > for the address range being invalidated.
> >
> > Current user such as kvm need to know when they can trust the content of the cpu
> > page table. This becomes even more important to new users of the mmu_notifier
> > api (such as HMM or ODP).
> 
> I don't see kvm using the new APIs in this patch. Also what is that HMM use this
> for, to protect walking of mirror page table ?. I am sure you are
> covering that in the later patches. May be you may want to mention
> the details here too. 

KVM side is not done, i looked at KVM code long time ago and thought oh it
could take advantage of this but now i do not remember exactly. I would need
to check back.

For HMM this is simple, no device fault can populate or walk the mirror page
table on a range that is being invalidated. But concurrent fault/walk can
happen outside the invalidated range. All handled in hmm_device_fault_start().

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 03/36] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page()
  2015-05-27  5:17   ` Aneesh Kumar K.V
@ 2015-05-27 14:33     ` Jerome Glisse
  0 siblings, 0 replies; 80+ messages in thread
From: Jerome Glisse @ 2015-05-27 14:33 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse

On Wed, May 27, 2015 at 10:47:44AM +0530, Aneesh Kumar K.V wrote:
> j.glisse@gmail.com writes:
> 
> > From: Jerome Glisse <jglisse@redhat.com>
> >
> > Listener of mm event might not have easy way to get the struct page
> > behind and address invalidated with mmu_notifier_invalidate_page()
> > function as this happens after the cpu page table have been clear/
> > updated. This happens for instance if the listener is storing a dma
> > mapping inside its secondary page table. To avoid complex reverse
> > dma mapping lookup just pass along a pointer to the page being
> > invalidated.
> 
> .....
> 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index ada3ed1..283ad26 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -172,6 +172,7 @@ struct mmu_notifier_ops {
> >  	void (*invalidate_page)(struct mmu_notifier *mn,
> >  				struct mm_struct *mm,
> >  				unsigned long address,
> > +				struct page *page,
> >  				enum mmu_event event);
> >  
> 
> How do we handle this w.r.t invalidate_range ? 

With range invalidation the CPU page table is still reliable when
invalidate_range_start() callback happen. So we can lookup the CPU
page table to get the page backing the address.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-05-27  5:50   ` Aneesh Kumar K.V
@ 2015-05-27 14:38     ` Jerome Glisse
  0 siblings, 0 replies; 80+ messages in thread
From: Jerome Glisse @ 2015-05-27 14:38 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar, linux-rdma

On Wed, May 27, 2015 at 11:20:05AM +0530, Aneesh Kumar K.V wrote:
> j.glisse@gmail.com writes:

Noted your grammar fixes.

> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 52ffb86..189e48f 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -653,3 +653,18 @@ config DEFERRED_STRUCT_PAGE_INIT
> >  	  when kswapd starts. This has a potential performance impact on
> >  	  processes running early in the lifetime of the systemm until kswapd
> >  	  finishes the initialisation.
> > +
> > +if STAGING
> > +config HMM
> > +	bool "Enable heterogeneous memory management (HMM)"
> > +	depends on MMU
> > +	select MMU_NOTIFIER
> > +	select GENERIC_PAGE_TABLE
> 
> What is GENERIC_PAGE_TABLE ?

Let over of when patch 0006 what a seperate feature that was introduced
before this patch. I failed to remove that chunk. Just ignore it.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: HMM (Heterogeneous Memory Management) v8
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (19 preceding siblings ...)
  2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
@ 2015-05-30  3:01 ` John Hubbard
  2015-05-31  6:56 ` Haggai Eran
  21 siblings, 0 replies; 80+ messages in thread
From: John Hubbard @ 2015-05-30  3:01 UTC (permalink / raw)
  To: j.glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, linux-fsdevel, Linda Wang,
	Kevin E Martin, Jeff Law, Or Gerlitz, Sagi Grimberg

[-- Attachment #1: Type: text/plain, Size: 9570 bytes --]

On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> 
> So sorry had to resend because i stupidly forgot to cc mailing list.
> Ignore private send done before.
> 
> 
> HMM (Heterogeneous Memory Management) is an helper layer for device
> that want to mirror a process address space into their own mmu. Main
> target is GPU but other hardware, like network device can take also
> use HMM.
> 
> There is two side to HMM, first one is mirroring of process address
> space on behalf of a device. HMM will manage a secondary page table
> for the device and keep it synchronize with the CPU page table. HMM
> also do DMA mapping on behalf of the device (which would allow new
> kind of optimization further down the road (1)).
> 
> Second side is allowing to migrate process memory to device memory
> where device memory is unmappable by the CPU. Any CPU access will
> trigger special fault that will migrate memory back.
> 
> From design point of view not much changed since last patchset (2).
> Most of the change are in small details of the API expose to device
> driver. This version also include device driver change for Mellanox
> hardware to use HMM as an alternative to ODP (which provide a subset
> of HMM functionality specificaly for RDMA devices). Long term plan
> is to have HMM completely replace ODP.
> 

Hi Jerome!

OK, seeing as how there is so much material to review here, I'll start 
with the easiest part first: documentation.

There is a lot of information spread throughout this patchset that needs 
to be preserved and made readily accessible, but some of it is only found 
in the comments in patch headers. It would be better if the information 
were right there in the source tree, not just in git history. Also, the 
comment blocks that are in the code itself are useful, but maybe not 
quite sufficient to provide the big picture.

With that in mind, I think that a Documentation/vm/hmm.txt file should be 
provided. It could capture all of this. We can refer to it from within the 
code, thus providing a higher level of quality (because we only have to 
update one place, for big-picture documentation comments).

If it helps, I'll volunteer to piece something together from the material 
that you have created, plus maybe a few notes about what a typical calling 
sequence looks like (since I have actual backtraces here from the 
ConnectIB cards).

Also, there are a lot of typographical errors that we can fix up as part 
of that effort. We want to ensure that such tiny issues don't distract 
people from the valuable content, so those need to be fixed. I'll let 
others decide as to whether that sort of fit-and-finish needs to happen 
now, or as a follow-up patch or two.

And finally, a critical part of good documentation is the naming of 
things. We're sort of still in the "wow, it works" phase of this project, 
and so now is a good time to start fussing about names. Therefore, you'll 
see a bunch of small and large naming recommendations coming from me, for 
the various patches here.

thanks,
John Hubbard

> 
> 
> Why doing this ?
> 
> Mirroring a process address space is mandatory with OpenCL 2.0 and
> with other GPU compute API. OpenCL 2.0 allow different level of
> implementation and currently only the lowest 2 are supported on
> Linux. To implement the highest level, where CPU and GPU access
> can happen concurently and are cache coherent, HMM is needed, or
> something providing same functionality, for instance through
> platform hardware.
> 
> Hardware solution such as PCIE ATS/PASID is limited to mirroring
> system memory and does not provide way to migrate memory to device
> memory (which offer significantly more bandwidth up to 10 times
> faster than regular system memory with discret GPU, also have
> lower latency than PCIE transaction).
> 
> Current CPU with GPU on same die (AMD or Intel) use the ATS/PASID
> and for Intel a special level of cache (backed by a large pool of
> fast memory).
> 
> For foreseeable futur, discrete GPU will remain releveant as they
> can have a large quantity of faster memory than integrated GPU.
> 
> Thus we believe HMM will allow to leverage discret GPU memory in
> a transparent fashion to the application, with minimum disruption
> to the linux kernel mm code. Also HMM can work along hardware
> solution such as PCIE ATS/PASID (leaving regular case to ATS/PASID
> while HMM handles the migrated memory case).
> 
> 
> 
> Design :
> 
> The patch 1, 2, 3 and 4 augment the mmu notifier API with new
> informations to more efficiently mirror CPU page table updates.
> 
> The first side of HMM, process address space mirroring, is
> implemented in patch 5 through 12. This use a secondary page
> table, in which HMM mirror memory actively use by the device.
> HMM does not take a reference on any of the page, it use the
> mmu notifier API to track changes to the CPU page table and to
> update the mirror page table. All this while providing a simple
> API to device driver.
> 
> To implement this we use a "generic" page table and not a radix
> tree because we need to store more flags than radix allows and
> we need to store dma address (sizeof(dma_addr_t) > sizeof(long)
> on some platform). All this is
> 
> Patch 14 pass down the lane the new child mm struct of a parent
> process being forked. This is necessary to properly handle fork
> when parent process have migrated memory (more on that below).
> 
> Patch 15 allow to get the current memcg against which anonymous
> memory of a process should be accounted. It usefull because in
> HMM we do bulk transaction on address space and we wish to avoid
> storing a pointer to memcg for each single page. All operation
> dealing with memcg happens under the protection of the mmap
> semaphore.
> 
> 
> Second side of HMM, migration to device memory, is implemented
> in patch 16 to 28. This only deal with anonymous memory. A new
> special swap type is introduced. Migrated memory will have there
> CPU page table entry set to this special swap entry (like the
> migration entry but unlike migration this is not a short lived
> state).
> 
> All the patches are then set of functions that deals with those
> special entry in the various code path that might face them.
> 
> Memory migration require several steps, first the memory is un-
> mapped from CPU and replace with special "locked" entry, HMM
> locked entry is a short lived transitional state, this is to
> avoid two threads to fight over migration entry.
> 
> Once unmapped HMM can determine what can be migrated or not by
> comparing mapcount and page count. If something holds a reference
> then the page is not migrated and CPU page table is restored.
> Next step is to schedule the copy to device memory and update
> the CPU page table to regular HMM entry.
> 
> Migration back follow the same pattern, replace with special
> lock entry, then copy back, then update CPU page table.
> 
> 
> (1) Because HMM keeps a secondary page table which keeps track of
>     DMA mapping, there is room for new optimization. We want to
>     add a new DMA API to allow to manage DMA page table mapping
>     at directory level. This would allow to minimize memory
>     consumption of mirror page table and also over head of doing
>     DMA mapping page per page. This is a future feature we want
>     to work on and hope the idea will proove usefull not only to
>     HMM users.
> 
> (2) Previous patchset posting :
>     v1 http://lwn.net/Articles/597289/
>     v2 https://lkml.org/lkml/2014/6/12/559
>     v3 https://lkml.org/lkml/2014/6/13/633
>     v4 https://lkml.org/lkml/2014/8/29/423
>     v5 https://lkml.org/lkml/2014/11/3/759
>     v6 http://lwn.net/Articles/619737/
>     v7 http://lwn.net/Articles/627316/
> 
> 
> Cheers,
> JA(C)rA'me
> 
> To: "Andrew Morton" <akpm@linux-foundation.org>,
> Cc: <linux-kernel@vger.kernel.org>,
> Cc: linux-mm <linux-mm@kvack.org>,
> Cc: <linux-fsdevel@vger.kernel.org>,
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
> Cc: "Mel Gorman" <mgorman@suse.de>,
> Cc: "H. Peter Anvin" <hpa@zytor.com>,
> Cc: "Peter Zijlstra" <peterz@infradead.org>,
> Cc: "Linda Wang" <lwang@redhat.com>,
> Cc: "Kevin E Martin" <kem@redhat.com>,
> Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
> Cc: "Johannes Weiner" <jweiner@redhat.com>,
> Cc: "Larry Woodman" <lwoodman@redhat.com>,
> Cc: "Rik van Riel" <riel@redhat.com>,
> Cc: "Dave Airlie" <airlied@redhat.com>,
> Cc: "Jeff Law" <law@redhat.com>,
> Cc: "Brendan Conoboy" <blc@redhat.com>,
> Cc: "Joe Donohue" <jdonohue@redhat.com>,
> Cc: "Duncan Poole" <dpoole@nvidia.com>,
> Cc: "Sherry Cheung" <SCheung@nvidia.com>,
> Cc: "Subhash Gutti" <sgutti@nvidia.com>,
> Cc: "John Hubbard" <jhubbard@nvidia.com>,
> Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
> Cc: "Lucien Dunning" <ldunning@nvidia.com>,
> Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
> Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
> Cc: "Haggai Eran" <haggaie@mellanox.com>,
> Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
> Cc: "Sagi Grimberg" <sagig@mellanox.com>
> Cc: "Shachar Raindel" <raindel@mellanox.com>,
> Cc: "Liran Liss" <liranl@mellanox.com>,
> Cc: "Roland Dreier" <roland@purestorage.com>,
> Cc: "Sander, Ben" <ben.sander@amd.com>,
> Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
> Cc: "Bridgman, John" <John.Bridgman@amd.com>,
> Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
> Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
> Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
> Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
> Cc: "Gabbay, Oded" <Oded.Gabbay@amd.com>,
> 
> 

thanks,
John H.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 01/36] mmu_notifier: add event information to address invalidation v7
  2015-05-21 19:31 ` [PATCH 01/36] mmu_notifier: add event information to address invalidation v7 j.glisse
@ 2015-05-30  3:43   ` John Hubbard
  2015-06-01 19:03     ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: John Hubbard @ 2015-05-30  3:43 UTC (permalink / raw)
  To: j.glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 44778 bytes --]

On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> From: JA(C)rA'me Glisse <jglisse@redhat.com>
> 
> The event information will be useful for new user of mmu_notifier API.
> The event argument differentiate between a vma disappearing, a page
> being write protected or simply a page being unmaped. This allow new
> user to take different path for different event for instance on unmap
> the resource used to track a vma are still valid and should stay around.
> While if the event is saying that a vma is being destroy it means that any
> resources used to track this vma can be free.
> 
> Changed since v1:
>   - renamed action into event (updated commit message too).
>   - simplified the event names and clarified their usage
>     also documenting what exceptation the listener can have in
>     respect to each event.
> 
> Changed since v2:
>   - Avoid crazy name.
>   - Do not move code that do not need to move.
> 
> Changed since v3:
>   - Separate hugue page split from mlock/munlock and softdirty.

Do we care about fixing up patch comments? If so:

s/hugue/huge/

> 
> Changed since v4:
>   - Rebase (no other changes).
> 
> Changed since v5:
>   - Typo fix.
>   - Changed zap_page_range from MMU_MUNMAP to MMU_MIGRATE to reflect the
>     fact that the address range is still valid just the page backing it
>     are no longer.
> 
> Changed since v6:
>   - try_to_unmap_one() only invalidate when doing migration.
>   - Differentiate fork from other case.
> 
> Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
>  drivers/gpu/drm/radeon/radeon_mn.c      |   3 +-
>  drivers/infiniband/core/umem_odp.c      |   9 ++-
>  drivers/iommu/amd_iommu_v2.c            |   3 +-
>  drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
>  drivers/xen/gntdev.c                    |   9 ++-
>  fs/proc/task_mmu.c                      |   6 +-
>  include/linux/mmu_notifier.h            | 135 ++++++++++++++++++++++++++------
>  kernel/events/uprobes.c                 |  10 ++-
>  mm/huge_memory.c                        |  39 ++++++---
>  mm/hugetlb.c                            |  23 +++---
>  mm/ksm.c                                |  18 +++--
>  mm/madvise.c                            |   4 +-
>  mm/memory.c                             |  27 ++++---
>  mm/migrate.c                            |   9 ++-
>  mm/mmu_notifier.c                       |  28 ++++---
>  mm/mprotect.c                           |   6 +-
>  mm/mremap.c                             |   6 +-
>  mm/rmap.c                               |   4 +-
>  virt/kvm/kvm_main.c                     |  12 ++-
>  20 files changed, 261 insertions(+), 102 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 4039ede..452e9b1 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -132,7 +132,8 @@ restart:
>  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  						       struct mm_struct *mm,
>  						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       enum mmu_event event)
>  {
>  	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
>  	struct interval_tree_node *it = NULL;
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index eef006c..3a9615b 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -121,7 +121,8 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>  static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
>  					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     enum mmu_event event)
>  {
>  	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>  	struct interval_tree_node *it;
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 40becdb..6ed69fa 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -165,7 +165,8 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
>  
>  static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
> -					     unsigned long address)
> +					     unsigned long address,
> +					     enum mmu_event event)
>  {
>  	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
>  
> @@ -192,7 +193,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>  static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
>  						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    enum mmu_event event)
>  {
>  	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
>  
> @@ -217,7 +219,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
>  static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  						  struct mm_struct *mm,
>  						  unsigned long start,
> -						  unsigned long end)
> +						  unsigned long end,
> +						  enum mmu_event event)
>  {
>  	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
>  
> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 3465faf..4aa4de6 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -384,7 +384,8 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
>  
>  static void mn_invalidate_page(struct mmu_notifier *mn,
>  			       struct mm_struct *mm,
> -			       unsigned long address)
> +			       unsigned long address,
> +			       enum mmu_event event)
>  {
>  	__mn_flush_page(mn, address);
>  }
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index 2129274..e67fed1 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   */
>  static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> @@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void gru_invalidate_range_end(struct mmu_notifier *mn,
>  				     struct mm_struct *mm, unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> @@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
>  }
>  
>  static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
> -				unsigned long address)
> +				unsigned long address,
> +				enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index 8927485..46bc610 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -467,7 +467,9 @@ static void unmap_if_in_range(struct grant_map *map,
>  
>  static void mn_invl_range_start(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start,
> +				unsigned long end,
> +				enum mmu_event event)
>  {
>  	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>  	struct grant_map *map;
> @@ -484,9 +486,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>  
>  static void mn_invl_page(struct mmu_notifier *mn,
>  			 struct mm_struct *mm,
> -			 unsigned long address)
> +			 unsigned long address,
> +			 enum mmu_event event)
>  {
> -	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
> +	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
>  }
>  
>  static void mn_release(struct mmu_notifier *mn,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6dee68d..58e2390 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -934,11 +934,13 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  				downgrade_write(&mm->mmap_sem);
>  				break;
>  			}
> -			mmu_notifier_invalidate_range_start(mm, 0, -1);
> +			mmu_notifier_invalidate_range_start(mm, 0,
> +							    -1, MMU_ISDIRTY);
>  		}
>  		walk_page_range(0, ~0UL, &clear_refs_walk);
>  		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_end(mm, 0, -1);
> +			mmu_notifier_invalidate_range_end(mm, 0,
> +							  -1, MMU_ISDIRTY);
>  		flush_tlb_mm(mm);
>  		up_read(&mm->mmap_sem);
>  out_mm:
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 61cd67f..8b11b1b 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -9,6 +9,70 @@
>  struct mmu_notifier;
>  struct mmu_notifier_ops;
>  
> +/* MMU Events report fine-grained information to the callback routine, allowing
> + * the event listener to make a more informed decision as to what action to
> + * take. The event types are:
> + *
> + *   - MMU_FORK when a process is forking and as a results various vma needs to
> + *     be write protected to allow for COW.

Make that:

- MMU_FORK: a process is forking. This will lead to vmas getting 
write-protected, in order to set up COW.

> + *
> + *   - MMU_HSPLIT huge page split, the memory is the same only the page table
> + *     structure is updated (level added or removed).

Make that (depending on the name change ideas mentioned later):

- MMU_PAGE_SPLIT: the pages don't move, nor does their content change, but 
the page table structure is updated (levels added or removed).

> + *
> + *   - MMU_ISDIRTY need to update the dirty bit of the page table so proper
> + *     dirty accounting can happen.
> + *
> + *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
> + *     access must stop after invalidate_range_start callback returns.
> + *     Furthermore, no read access should be allowed either, as a new page can
> + *     be remapped with write access before the invalidate_range_end callback
> + *     happens and thus any read access to old page might read stale data. There
> + *     are several sources for this event, including:
> + *
> + *         - A page moving to swap (various reasons, including page reclaim),
> + *         - An mremap syscall,
> + *         - migration for NUMA reasons,
> + *         - balancing the memory pool,
> + *         - write fault on COW page,
> + *         - and more that are not listed here.
> + *
> + *   - MMU_MPROT: memory access protection is changing. Refer to the vma to get
> + *     the new access protection. All memory access are still valid until the
> + *     invalidate_range_end callback.
> + *
> + *   - MMU_MUNLOCK: unlock memory. Content of page table stays the same but
> + *     page are unlocked.
> + *
> + *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
> + *     process destruction). However, access is still allowed, up until the
> + *     invalidate_range_free_pages callback. This also implies that secondary
> + *     page table can be trimmed, because the address range is no longer valid.
> + *
> + *   - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
> + *     must stop after invalidate_range_start callback returns. Read access are
> + *     still allowed.
> + *
> + *   - MMU_WRITE_PROTECT: memory is being write protected (ie should be mapped
> + *     read only no matter what the vma memory protection allows). All write
> + *     accesses must stop after invalidate_range_start callback returns. Read
> + *     access are still allowed.
> + *
> + * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
> + * because it will always lead to reasonable behavior, but will not allow the
> + * listener a chance to optimize its events.
> + */
> +enum mmu_event {
> +	MMU_FORK = 0,
> +	MMU_HSPLIT,

Let's rename MMU_HSPLIT to one of the following, take your pick:

MMU_HUGE_PAGE_SPLIT (too long, but you can't possibly misunderstand it)
MMU_PAGE_SPLIT (my favorite: only huge pages are ever split, so it works)
MMU_HUGE_SPLIT (ugly, but still hard to misunderstand)


> +	MMU_ISDIRTY,

This MMU_ISDIRTY seems like a problem to me. First of all, it looks 
backwards: the only place that invokes it is the clear_refs_write() 
routine, for the soft-dirty tracking feature. And in that case, the pages 
are *not* being made dirty! Rather, the kernel is actually making the 
pages non-writable, in order to be able to trap the subsequent page fault 
and figure out if the page is in active use.

So, given that there is only one call site, and that call site should 
actually be setting MMU_WRITE_PROTECT instead (I think), let's just delete 
MMU_ISDIRTY.

Come to think about it, there is no callback possible for "a page became 
dirty", anyway. Because the dirty and accessed bits are actually set by 
the hardware, and software is generally unable to know the current state.
So MMU_ISDIRTY just seems inappropriate to me, across the board.

I'll take a look at the corresponding HMM_ISDIRTY, too.

> +	MMU_MIGRATE,
> +	MMU_MPROT,

The MMU_PROT also looks questionable. Short answer: probably better to 
read the protection, and pass either MMU_WRITE_PROTECT, MMU_READ_WRITE 
(that's a new item, of course), or MMU_UNMAP.

Here's why: the call site knows the protection, but by the time it filters 
down to HMM (in later patches), that information is lost, and HMM ends up 
doing (ouch!) another find_vma() call in order to retrieve it--and then 
translates it into only three possible things:

// hmm_mmu_mprot_to_etype() sets one of these:

   HMM_MUNMAP
   HMM_WRITE_PROTECT
   HMM_NONE


> +	MMU_MUNLOCK,

I think MMU_UNLOCK would be clearer. We already know the scope, so the 
extra "M" isn't adding anything.

> +	MMU_MUNMAP,

Same thing here: MMU_UNMAP seems better.

> +	MMU_WRITE_BACK,
> +	MMU_WRITE_PROTECT,

We may have to add MMU_READ_WRITE (and maybe another one, I haven't 
bottomed out on that), if you agree with the above approach of 
always sending a precise event, instead of "protection changed".

That's all I saw. This is not a complicated patch, even though it's 
touching a lot of files, and I think everything else is correct.

thanks,
John Hubbard

> +};
> +
>  #ifdef CONFIG_MMU_NOTIFIER
>  
>  /*
> @@ -82,7 +146,8 @@ struct mmu_notifier_ops {
>  	void (*change_pte)(struct mmu_notifier *mn,
>  			   struct mm_struct *mm,
>  			   unsigned long address,
> -			   pte_t pte);
> +			   pte_t pte,
> +			   enum mmu_event event);
>  
>  	/*
>  	 * Before this is invoked any secondary MMU is still ok to
> @@ -93,7 +158,8 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_page)(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> -				unsigned long address);
> +				unsigned long address,
> +				enum mmu_event event);
>  
>  	/*
>  	 * invalidate_range_start() and invalidate_range_end() must be
> @@ -140,10 +206,14 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_range_start)(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end);
> +				       unsigned long start,
> +				       unsigned long end,
> +				       enum mmu_event event);
>  	void (*invalidate_range_end)(struct mmu_notifier *mn,
>  				     struct mm_struct *mm,
> -				     unsigned long start, unsigned long end);
> +				     unsigned long start,
> +				     unsigned long end,
> +				     enum mmu_event event);
>  
>  	/*
>  	 * invalidate_range() is either called between
> @@ -206,13 +276,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
>  extern int __mmu_notifier_test_young(struct mm_struct *mm,
>  				     unsigned long address);
>  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
> -				      unsigned long address, pte_t pte);
> +				      unsigned long address,
> +				      pte_t pte,
> +				      enum mmu_event event);
>  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address);
> +					  unsigned long address,
> +					  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +						  unsigned long start,
> +						  unsigned long end,
> +						  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +						unsigned long start,
> +						unsigned long end,
> +						enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
>  				  unsigned long start, unsigned long end);
>  
> @@ -240,31 +317,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> -					   unsigned long address, pte_t pte)
> +					   unsigned long address,
> +					   pte_t pte,
> +					   enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_change_pte(mm, address, pte);
> +		__mmu_notifier_change_pte(mm, address, pte, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address)
> +						unsigned long address,
> +						enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_page(mm, address);
> +		__mmu_notifier_invalidate_page(mm, address, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						       unsigned long start,
> +						       unsigned long end,
> +						       enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						     unsigned long start,
> +						     unsigned long end,
> +						     enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_end(mm, start, end);
> +		__mmu_notifier_invalidate_range_end(mm, start, end, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
> @@ -359,13 +443,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>   * old page would remain mapped readonly in the secondary MMUs after the new
>   * page is already writable by some CPU through the primary MMU.
>   */
> -#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
> +#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
>  ({									\
>  	struct mm_struct *___mm = __mm;					\
>  	unsigned long ___address = __address;				\
>  	pte_t ___pte = __pte;						\
>  									\
> -	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
> +	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
>  	set_pte_at(___mm, ___address, __ptep, ___pte);			\
>  })
>  
> @@ -393,22 +477,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> -					   unsigned long address, pte_t pte)
> +					   unsigned long address,
> +					   pte_t pte,
> +					   enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address)
> +						unsigned long address,
> +						enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						       unsigned long start,
> +						       unsigned long end,
> +						       enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						     unsigned long start,
> +						     unsigned long end,
> +						     enum mmu_event event)
>  {
>  }
>  
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index cb346f2..802828a 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -176,7 +176,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	/* For try_to_free_swap() and munlock_vma_page() below */
>  	lock_page(page);
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	err = -EAGAIN;
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
>  	if (!ptep)
> @@ -194,7 +195,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  
>  	flush_cache_page(vma, addr, pte_pfn(*ptep));
>  	ptep_clear_flush_notify(vma, addr, ptep);
> -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> +	set_pte_at_notify(mm, addr, ptep,
> +			  mk_pte(kpage, vma->vm_page_prot),
> +			  MMU_MIGRATE);
>  
>  	page_remove_rmap(page);
>  	if (!page_mapped(page))
> @@ -208,7 +211,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	err = 0;
>   unlock:
>  	mem_cgroup_cancel_charge(kpage, memcg);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	unlock_page(page);
>  	return err;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cb8904c..41c342c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1024,7 +1024,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
> +					    MMU_MIGRATE);
>  
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> @@ -1058,7 +1059,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  	page_remove_rmap(page);
>  	spin_unlock(ptl);
>  
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	ret |= VM_FAULT_WRITE;
>  	put_page(page);
> @@ -1068,7 +1070,8 @@ out:
>  
>  out_free_pages:
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		memcg = (void *)page_private(pages[i]);
>  		set_page_private(pages[i], 0);
> @@ -1160,7 +1163,8 @@ alloc:
>  
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
> +					    MMU_MIGRATE);
>  
>  	spin_lock(ptl);
>  	if (page)
> @@ -1192,7 +1196,8 @@ alloc:
>  	}
>  	spin_unlock(ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  out:
>  	return ret;
>  out_unlock:
> @@ -1646,7 +1651,8 @@ static int __split_huge_page_splitting(struct page *page,
>  	const unsigned long mmun_start = address;
>  	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_HSPLIT);
>  	pmd = page_check_address_pmd(page, mm, address,
>  			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
>  	if (pmd) {
> @@ -1662,7 +1668,8 @@ static int __split_huge_page_splitting(struct page *page,
>  		ret = 1;
>  		spin_unlock(ptl);
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_HSPLIT);
>  
>  	return ret;
>  }
> @@ -2526,7 +2533,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  
>  	mmun_start = address;
>  	mmun_end   = address + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>  	/*
>  	 * After this gup_fast can't run anymore. This also removes
> @@ -2536,7 +2544,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 */
>  	_pmd = pmdp_collapse_flush(vma, address, pmd);
>  	spin_unlock(pmd_ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	spin_lock(pte_ptl);
>  	isolated = __collapse_huge_page_isolate(vma, address, pte);
> @@ -2933,24 +2942,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
>  again:
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
>  	if (is_huge_zero_pmd(*pmd)) {
>  		__split_huge_zero_page_pmd(vma, haddr, pmd);
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
>  	page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  	get_page(page);
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	split_huge_page(page);
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 54f129d..19da310 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2670,7 +2670,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	mmun_start = vma->vm_start;
>  	mmun_end = vma->vm_end;
>  	if (cow)
> -		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_start(src, mmun_start,
> +						    mmun_end, MMU_MIGRATE);
>  
>  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
>  		spinlock_t *src_ptl, *dst_ptl;
> @@ -2724,7 +2725,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	}
>  
>  	if (cow)
> -		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(src, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  
>  	return ret;
>  }
> @@ -2750,7 +2752,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	BUG_ON(end & ~huge_page_mask(h));
>  
>  	tlb_start_vma(tlb, vma);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	address = start;
>  again:
>  	for (; address < end; address += sz) {
> @@ -2824,7 +2827,8 @@ unlock:
>  		if (address < end && !ref_page)
>  			goto again;
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	tlb_end_vma(tlb, vma);
>  }
>  
> @@ -3003,8 +3007,8 @@ retry_avoidcopy:
>  
>  	mmun_start = address & huge_page_mask(h);
>  	mmun_end = mmun_start + huge_page_size(h);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> -
> +	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
> +					    MMU_MIGRATE);
>  	/*
>  	 * Retake the page table lock to check for racing updates
>  	 * before the page tables are altered
> @@ -3025,7 +3029,8 @@ retry_avoidcopy:
>  		new_page = old_page;
>  	}
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
> +					  MMU_MIGRATE);
>  out_release_all:
>  	page_cache_release(new_page);
>  out_release_old:
> @@ -3493,7 +3498,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	BUG_ON(address >= end);
>  	flush_cache_range(vma, address, end);
>  
> -	mmu_notifier_invalidate_range_start(mm, start, end);
> +	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
>  	for (; address < end; address += huge_page_size(h)) {
>  		spinlock_t *ptl;
> @@ -3543,7 +3548,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	flush_tlb_range(vma, start, end);
>  	mmu_notifier_invalidate_range(mm, start, end);
>  	i_mmap_unlock_write(vma->vm_file->f_mapping);
> -	mmu_notifier_invalidate_range_end(mm, start, end);
> +	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
>  
>  	return pages << h->order;
>  }
> diff --git a/mm/ksm.c b/mm/ksm.c
> index bc7be0e..76f167c 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
> +					    MMU_WRITE_PROTECT);
>  
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
>  	if (!ptep)
> @@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  		if (pte_dirty(entry))
>  			set_page_dirty(page);
>  		entry = pte_mkclean(pte_wrprotect(entry));
> -		set_pte_at_notify(mm, addr, ptep, entry);
> +		set_pte_at_notify(mm, addr, ptep, entry, MMU_WRITE_PROTECT);
>  	}
>  	*orig_pte = *ptep;
>  	err = 0;
> @@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  out_unlock:
>  	pte_unmap_unlock(ptep, ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
> +					  MMU_WRITE_PROTECT);
>  out:
>  	return err;
>  }
> @@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
> +					    MMU_MIGRATE);
>  
>  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	if (!pte_same(*ptep, orig_pte)) {
> @@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  
>  	flush_cache_page(vma, addr, pte_pfn(*ptep));
>  	ptep_clear_flush_notify(vma, addr, ptep);
> -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> +	set_pte_at_notify(mm, addr, ptep,
> +			  mk_pte(kpage, vma->vm_page_prot),
> +			  MMU_MIGRATE);
>  
>  	page_remove_rmap(page);
>  	if (!page_mapped(page))
> @@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	pte_unmap_unlock(ptep, ptl);
>  	err = 0;
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
> +					  MMU_MIGRATE);
>  out:
>  	return err;
>  }
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 22e8f0c..b90ba3d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -405,9 +405,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  	tlb_gather_mmu(&tlb, mm, start, end);
>  	update_hiwater_rss(mm);
>  
> -	mmu_notifier_invalidate_range_start(mm, start, end);
> +	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
>  	madvise_free_page_range(&tlb, vma, start, end);
> -	mmu_notifier_invalidate_range_end(mm, start, end);
> +	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
>  	tlb_finish_mmu(&tlb, start, end);
>  
>  	return 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index d1fa0c1..9300fad 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1048,7 +1048,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	mmun_end   = end;
>  	if (is_cow)
>  		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
> -						    mmun_end);
> +						    mmun_end, MMU_FORK);
>  
>  	ret = 0;
>  	dst_pgd = pgd_offset(dst_mm, addr);
> @@ -1065,7 +1065,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
>  
>  	if (is_cow)
> -		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
> +						  mmun_end, MMU_FORK);
>  	return ret;
>  }
>  
> @@ -1335,10 +1336,12 @@ void unmap_vmas(struct mmu_gather *tlb,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  
> -	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> +	mmu_notifier_invalidate_range_start(mm, start_addr,
> +					    end_addr, MMU_MUNMAP);
>  	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> -	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> +	mmu_notifier_invalidate_range_end(mm, start_addr,
> +					  end_addr, MMU_MUNMAP);
>  }
>  
>  /**
> @@ -1360,10 +1363,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, start, end);
> +	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
>  	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
>  		unmap_single_vma(&tlb, vma, start, end, details);
> -	mmu_notifier_invalidate_range_end(mm, start, end);
> +	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
>  	tlb_finish_mmu(&tlb, start, end);
>  }
>  
> @@ -1386,9 +1389,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, address, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, address, end);
> +	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
>  	unmap_single_vma(&tlb, vma, address, end, details);
> -	mmu_notifier_invalidate_range_end(mm, address, end);
> +	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
>  	tlb_finish_mmu(&tlb, address, end);
>  }
>  
> @@ -2086,7 +2089,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
>  		goto oom_free_new;
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  
>  	/*
>  	 * Re-check the pte - we dropped the lock
> @@ -2119,7 +2123,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * mmu page tables (such as kvm shadow page tables), we want the
>  		 * new page to be mapped directly into the secondary page table.
>  		 */
> -		set_pte_at_notify(mm, address, page_table, entry);
> +		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
>  		update_mmu_cache(vma, address, page_table);
>  		if (old_page) {
>  			/*
> @@ -2158,7 +2162,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
>  		page_cache_release(new_page);
>  
>  	pte_unmap_unlock(page_table, ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	if (old_page) {
>  		/*
>  		 * Don't let another task, with possibly unlocked vma,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 236ee25..ad9a55a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1759,12 +1759,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	WARN_ON(PageLRU(new_page));
>  
>  	/* Recheck the target PMD */
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
>  fail_putback:
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  
>  		/* Reverse changes made by migrate_page_copy() */
>  		if (TestClearPageActive(new_page))
> @@ -1818,7 +1820,8 @@ fail_putback:
>  	page_remove_rmap(page);
>  
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	/* Take an "isolate" reference and put new page on the LRU. */
>  	get_page(new_page);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 3b9b3d0..e51ea02 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -142,8 +142,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
>  	return young;
>  }
>  
> -void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
> -			       pte_t pte)
> +void __mmu_notifier_change_pte(struct mm_struct *mm,
> +			       unsigned long address,
> +			       pte_t pte,
> +			       enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -151,13 +153,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->change_pte)
> -			mn->ops->change_pte(mn, mm, address, pte);
> +			mn->ops->change_pte(mn, mm, address, pte, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  
>  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address)
> +				    unsigned long address,
> +				    enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -165,13 +168,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_page)
> -			mn->ops->invalidate_page(mn, mm, address);
> +			mn->ops->invalidate_page(mn, mm, address, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  
>  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +					   unsigned long start,
> +					   unsigned long end,
> +					   enum mmu_event event)
> +
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -179,14 +185,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +			mn->ops->invalidate_range_start(mn, mm, start,
> +							end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>  
>  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +					 unsigned long start,
> +					 unsigned long end,
> +					 enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -204,7 +213,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  		if (mn->ops->invalidate_range)
>  			mn->ops->invalidate_range(mn, mm, start, end);
>  		if (mn->ops->invalidate_range_end)
> -			mn->ops->invalidate_range_end(mn, mm, start, end);
> +			mn->ops->invalidate_range_end(mn, mm, start,
> +						      end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index e7d6f11..a57e8af 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -155,7 +155,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		/* invoke the mmu notifier if the pmd is populated */
>  		if (!mni_start) {
>  			mni_start = addr;
> -			mmu_notifier_invalidate_range_start(mm, mni_start, end);
> +			mmu_notifier_invalidate_range_start(mm, mni_start,
> +							    end, MMU_MPROT);
>  		}
>  
>  		if (pmd_trans_huge(*pmd)) {
> @@ -183,7 +184,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  	} while (pmd++, addr = next, addr != end);
>  
>  	if (mni_start)
> -		mmu_notifier_invalidate_range_end(mm, mni_start, end);
> +		mmu_notifier_invalidate_range_end(mm, mni_start, end,
> +						  MMU_MPROT);
>  
>  	if (nr_huge_updates)
>  		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
> diff --git a/mm/mremap.c b/mm/mremap.c
> index a7c93ec..72051cf 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -176,7 +176,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  
>  	mmun_start = old_addr;
>  	mmun_end   = old_end;
> -	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  
>  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
>  		cond_resched();
> @@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  	if (likely(need_flush))
>  		flush_tlb_range(vma, old_end-len, old_addr);
>  
> -	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	return len + old_addr - old_end;	/* how much done */
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9c04594..74c51e0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -915,7 +915,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  	pte_unmap_unlock(pte, ptl);
>  
>  	if (ret) {
> -		mmu_notifier_invalidate_page(mm, address);
> +		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
>  		(*cleaned)++;
>  	}
>  out:
> @@ -1338,7 +1338,7 @@ discard:
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> -		mmu_notifier_invalidate_page(mm, address);
> +		mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
>  out:
>  	return ret;
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f202c40..d0b1060 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -260,7 +260,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  
>  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
> -					     unsigned long address)
> +					     unsigned long address,
> +					     enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int need_tlb_flush, idx;
> @@ -302,7 +303,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  					struct mm_struct *mm,
>  					unsigned long address,
> -					pte_t pte)
> +					pte_t pte,
> +					enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int idx;
> @@ -318,7 +320,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
>  						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int need_tlb_flush = 0, idx;
> @@ -344,7 +347,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  						  struct mm_struct *mm,
>  						  unsigned long start,
> -						  unsigned long end)
> +						  unsigned long end,
> +						  enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  
> -- 
> 1.9.3
> 
> 

thanks,
John H.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: HMM (Heterogeneous Memory Management) v8
  2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
                   ` (20 preceding siblings ...)
  2015-05-30  3:01 ` HMM (Heterogeneous Memory Management) v8 John Hubbard
@ 2015-05-31  6:56 ` Haggai Eran
  21 siblings, 0 replies; 80+ messages in thread
From: Haggai Eran @ 2015-05-31  6:56 UTC (permalink / raw)
  To: j.glisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, linux-fsdevel, Linda Wang,
	Kevin E Martin, Jeff Law, Or Gerlitz, Sagi Grimberg

On 21/05/2015 22:31, j.glisse@gmail.com wrote:
> From design point of view not much changed since last patchset (2).
> Most of the change are in small details of the API expose to device
> driver. This version also include device driver change for Mellanox
> hardware to use HMM as an alternative to ODP (which provide a subset
> of HMM functionality specificaly for RDMA devices). Long term plan
> is to have HMM completely replace ODP.

Hi,

I think HMM would be a good long term solution indeed. For now I would
want to keep ODP and HMM side by side (as the patchset seem to do)
mainly since HMM is introduced as a STAGING feature and ODP is part of
the mainline kernel.

It would be nice if you could provide a git repository to access the
patches. I couldn't apply them to the current linux-next tree.

A minor thing: I noticed some style issues in the patches. You should
run checkpatch.pl on the patches and get them to match the coding style.

Regards,
Haggai

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 01/36] mmu_notifier: add event information to address invalidation v7
  2015-05-30  3:43   ` John Hubbard
@ 2015-06-01 19:03     ` Jerome Glisse
  2015-06-01 23:10       ` John Hubbard
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-01 19:03 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

On Fri, May 29, 2015 at 08:43:59PM -0700, John Hubbard wrote:
> On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> 
> > From: Jerome Glisse <jglisse@redhat.com>
> > 
> > The event information will be useful for new user of mmu_notifier API.
> > The event argument differentiate between a vma disappearing, a page
> > being write protected or simply a page being unmaped. This allow new
> > user to take different path for different event for instance on unmap
> > the resource used to track a vma are still valid and should stay around.
> > While if the event is saying that a vma is being destroy it means that any
> > resources used to track this vma can be free.
> > 
> > Changed since v1:
> >   - renamed action into event (updated commit message too).
> >   - simplified the event names and clarified their usage
> >     also documenting what exceptation the listener can have in
> >     respect to each event.
> > 
> > Changed since v2:
> >   - Avoid crazy name.
> >   - Do not move code that do not need to move.
> > 
> > Changed since v3:
> >   - Separate hugue page split from mlock/munlock and softdirty.
> 
> Do we care about fixing up patch comments? If so:
> 
> s/hugue/huge/

I am noting them down and will go over them.


[...]
> > +	MMU_HSPLIT,
> 
> Let's rename MMU_HSPLIT to one of the following, take your pick:
> 
> MMU_HUGE_PAGE_SPLIT (too long, but you can't possibly misunderstand it)
> MMU_PAGE_SPLIT (my favorite: only huge pages are ever split, so it works)
> MMU_HUGE_SPLIT (ugly, but still hard to misunderstand)

I will go with MMU_HUGE_PAGE_SPLIT 


[...]
> 
> > +	MMU_ISDIRTY,
> 
> This MMU_ISDIRTY seems like a problem to me. First of all, it looks 
> backwards: the only place that invokes it is the clear_refs_write() 
> routine, for the soft-dirty tracking feature. And in that case, the pages 
> are *not* being made dirty! Rather, the kernel is actually making the 
> pages non-writable, in order to be able to trap the subsequent page fault 
> and figure out if the page is in active use.
> 
> So, given that there is only one call site, and that call site should 
> actually be setting MMU_WRITE_PROTECT instead (I think), let's just delete 
> MMU_ISDIRTY.
> 
> Come to think about it, there is no callback possible for "a page became 
> dirty", anyway. Because the dirty and accessed bits are actually set by 
> the hardware, and software is generally unable to know the current state.
> So MMU_ISDIRTY just seems inappropriate to me, across the board.
> 
> I'll take a look at the corresponding HMM_ISDIRTY, too.

Ok i need to rename that one to CLEAR_SOFT_DIRTY, the idea is that
for HMM i would rather not write protect the memory for the device
and just rely on the regular and conservative dirtying of page. The
soft dirty is really for migrating a process where you first clear
the soft dirty bit, then copy memory while process still running,
then freeze process an only copy memory that was dirtied since
first copy. Point being that adding soft dirty to HMM is something
that can be done down the road. We should have enough bit inside
the device page table for that.


> 
> > +	MMU_MIGRATE,
> > +	MMU_MPROT,
> 
> The MMU_PROT also looks questionable. Short answer: probably better to 
> read the protection, and pass either MMU_WRITE_PROTECT, MMU_READ_WRITE 
> (that's a new item, of course), or MMU_UNMAP.
> 
> Here's why: the call site knows the protection, but by the time it filters 
> down to HMM (in later patches), that information is lost, and HMM ends up 
> doing (ouch!) another find_vma() call in order to retrieve it--and then 
> translates it into only three possible things:
> 
> // hmm_mmu_mprot_to_etype() sets one of these:
> 
>    HMM_MUNMAP
>    HMM_WRITE_PROTECT
>    HMM_NONE

Linus complained of my previous version where i differenciated the
kind of protection change that was happening, hence why i only pass
down mprot.


> 
> 
> > +	MMU_MUNLOCK,
> 
> I think MMU_UNLOCK would be clearer. We already know the scope, so the 
> extra "M" isn't adding anything.

I named it that way so it matches syscall name munlock(). I think
it is clearer to use MUNLOCK, or maybe SYSCALL_MUNLOCK

> 
> > +	MMU_MUNMAP,
> 
> Same thing here: MMU_UNMAP seems better.

Well same idea here.


> 
> > +	MMU_WRITE_BACK,
> > +	MMU_WRITE_PROTECT,
> 
> We may have to add MMU_READ_WRITE (and maybe another one, I haven't 
> bottomed out on that), if you agree with the above approach of 
> always sending a precise event, instead of "protection changed".

I think Linus point made sense last time, but i would need to read
again the thread. The idea of that patch is really to provide context
information on what kind of CPU page table changes is happening and
why.

In that respect i should probably change MMU_WRITE_PROTECT to 
MMU_KSM_WRITE_PROTECT.


Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 01/36] mmu_notifier: add event information to address invalidation v7
  2015-06-01 19:03     ` Jerome Glisse
@ 2015-06-01 23:10       ` John Hubbard
  2015-06-03 16:07         ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: John Hubbard @ 2015-06-01 23:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 6796 bytes --]

On Mon, 1 Jun 2015, Jerome Glisse wrote:

> On Fri, May 29, 2015 at 08:43:59PM -0700, John Hubbard wrote:
> > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > 
> > > From: Jerome Glisse <jglisse@redhat.com>
> > > 
> > > The event information will be useful for new user of mmu_notifier API.
> > > The event argument differentiate between a vma disappearing, a page
> > > being write protected or simply a page being unmaped. This allow new
> > > user to take different path for different event for instance on unmap
> > > the resource used to track a vma are still valid and should stay around.
> > > While if the event is saying that a vma is being destroy it means that any
> > > resources used to track this vma can be free.
> > > 
> > > Changed since v1:
> > >   - renamed action into event (updated commit message too).
> > >   - simplified the event names and clarified their usage
> > >     also documenting what exceptation the listener can have in
> > >     respect to each event.
> > > 
> > > Changed since v2:
> > >   - Avoid crazy name.
> > >   - Do not move code that do not need to move.
> > > 
> > > Changed since v3:
> > >   - Separate hugue page split from mlock/munlock and softdirty.
> > 
> > Do we care about fixing up patch comments? If so:
> > 
> > s/hugue/huge/
> 
> I am noting them down and will go over them.
> 
> 
> [...]
> > > +	MMU_HSPLIT,
> > 
> > Let's rename MMU_HSPLIT to one of the following, take your pick:
> > 
> > MMU_HUGE_PAGE_SPLIT (too long, but you can't possibly misunderstand it)
> > MMU_PAGE_SPLIT (my favorite: only huge pages are ever split, so it works)
> > MMU_HUGE_SPLIT (ugly, but still hard to misunderstand)
> 
> I will go with MMU_HUGE_PAGE_SPLIT 
> 
> 
> [...]
> > 
> > > +	MMU_ISDIRTY,
> > 
> > This MMU_ISDIRTY seems like a problem to me. First of all, it looks 
> > backwards: the only place that invokes it is the clear_refs_write() 
> > routine, for the soft-dirty tracking feature. And in that case, the pages 
> > are *not* being made dirty! Rather, the kernel is actually making the 
> > pages non-writable, in order to be able to trap the subsequent page fault 
> > and figure out if the page is in active use.
> > 
> > So, given that there is only one call site, and that call site should 
> > actually be setting MMU_WRITE_PROTECT instead (I think), let's just delete 
> > MMU_ISDIRTY.
> > 
> > Come to think about it, there is no callback possible for "a page became 
> > dirty", anyway. Because the dirty and accessed bits are actually set by 
> > the hardware, and software is generally unable to know the current state.
> > So MMU_ISDIRTY just seems inappropriate to me, across the board.
> > 
> > I'll take a look at the corresponding HMM_ISDIRTY, too.
> 
> Ok i need to rename that one to CLEAR_SOFT_DIRTY, the idea is that
> for HMM i would rather not write protect the memory for the device
> and just rely on the regular and conservative dirtying of page. The
> soft dirty is really for migrating a process where you first clear
> the soft dirty bit, then copy memory while process still running,
> then freeze process an only copy memory that was dirtied since
> first copy. Point being that adding soft dirty to HMM is something
> that can be done down the road. We should have enough bit inside
> the device page table for that.
> 

Yes, I think renaming it to CLEAR_SOFT_DIRTY will definitely allow more 
accurate behavior in response to these events.

Looking ahead, a couple things:

1. This mechanism is also used for general memory utilization tracking (I 
see that Vladimir DavyDov has an "idle memory tracking" proposal that 
assumes this works, for example: https://lwn.net/Articles/642202/ and 
https://lkml.org/lkml/2015/5/12/449).

2. It seems hard to avoid the need to eventually just write protect the 
page, whether it is on the CPU or the remote device, if things like device 
drivers or user space need to track write accesses to a virtual address. 
Either you write protect the page, and trap the page faults, or you wait 
until later and read the dirty bit (indirectly, via something like 
unmap_mapping_range). Or did you have something else in mind?

Anyway, none of that needs to hold up this part of the patchset, because 
the renaming fixes things up for the future code to do the right thing.

> 
> > 
> > > +	MMU_MIGRATE,
> > > +	MMU_MPROT,
> > 
> > The MMU_PROT also looks questionable. Short answer: probably better to 
> > read the protection, and pass either MMU_WRITE_PROTECT, MMU_READ_WRITE 
> > (that's a new item, of course), or MMU_UNMAP.
> > 
> > Here's why: the call site knows the protection, but by the time it filters 
> > down to HMM (in later patches), that information is lost, and HMM ends up 
> > doing (ouch!) another find_vma() call in order to retrieve it--and then 
> > translates it into only three possible things:
> > 
> > // hmm_mmu_mprot_to_etype() sets one of these:
> > 
> >    HMM_MUNMAP
> >    HMM_WRITE_PROTECT
> >    HMM_NONE
> 
> Linus complained of my previous version where i differenciated the
> kind of protection change that was happening, hence why i only pass
> down mprot.
> 
> 
> > 
> > 
> > > +	MMU_MUNLOCK,
> > 
> > I think MMU_UNLOCK would be clearer. We already know the scope, so the 
> > extra "M" isn't adding anything.
> 
> I named it that way so it matches syscall name munlock(). I think
> it is clearer to use MUNLOCK, or maybe SYSCALL_MUNLOCK
> 
> > 
> > > +	MMU_MUNMAP,
> > 
> > Same thing here: MMU_UNMAP seems better.
> 
> Well same idea here.

OK, sure.

> 
> 
> > 
> > > +	MMU_WRITE_BACK,
> > > +	MMU_WRITE_PROTECT,
> > 
> > We may have to add MMU_READ_WRITE (and maybe another one, I haven't 
> > bottomed out on that), if you agree with the above approach of 
> > always sending a precise event, instead of "protection changed".
> 
> I think Linus point made sense last time, but i would need to read
> again the thread. The idea of that patch is really to provide context
> information on what kind of CPU page table changes is happening and
> why.
>

Shoot, I tried to find that conversation, but my search foo is too weak. 
If you have a link to that thread, I'd appreciate it, so I can refresh my 
memory.

I was hoping to re-read it and see if anything has changed. It's not 
really a huge problem to call find_vma() again, but I do want to be sure 
that there's a good reason for doing so.
 
Otherwise, I'll just rely on your memory that Linus preferred your current 
approach, and call it good, then.

> In that respect i should probably change MMU_WRITE_PROTECT to 
> MMU_KSM_WRITE_PROTECT.
> 

Yes, that might help clarify to the reader, because otherwise it's not 
always obvious why we have "MPROT" and "WRITE_PROTECT" (which seems at 
first like merely a subset of MPROT).

thanks,
john h

> 
> Cheers,
> Jerome
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3
  2015-05-21 19:31 ` [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3 j.glisse
  2015-05-27  5:09   ` Aneesh Kumar K.V
@ 2015-06-02  9:32   ` John Hubbard
  2015-06-03 17:15     ` Jerome Glisse
  1 sibling, 1 reply; 80+ messages in thread
From: John Hubbard @ 2015-06-02  9:32 UTC (permalink / raw)
  To: j.glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 9046 bytes --]

On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> From: JA(C)rA'me Glisse <jglisse@redhat.com>
> 
> The mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
> can be considered as forming an "atomic" section for the cpu page table update
> point of view. Between this two function the cpu page table content is unreliable
> for the address range being invalidated.
> 
> Current user such as kvm need to know when they can trust the content of the cpu
> page table. This becomes even more important to new users of the mmu_notifier
> api (such as HMM or ODP).
> 
> This patch use a structure define at all call site to invalidate_range_start()
> that is added to a list for the duration of the invalidation. It adds two new
> helpers to allow querying if a range is being invalidated or to wait for a range
> to become valid.
> 
> For proper synchronization, user must block new range invalidation from inside
> there invalidate_range_start() callback, before calling the helper functions.
> Otherwise there is no garanty that a new range invalidation will not be added
> after the call to the helper function to query for existing range.

Hi Jerome,

Most of this information will make nice block comments for the new helper 
routines. I can help tighten up the writing slightly, but first:

Question: in hmm.c's hmm_notifier_invalidate function (looking at the 
entire patchset, for a moment), I don't see any blocking of new range 
invalidations, even though you point out, above, that this is required. Am 
I missing it, and if so, where should I be looking instead?

> 
> Changed since v1:
>   - Fix a possible deadlock in mmu_notifier_range_wait_valid()
> 
> Changed since v2:
>   - Add the range to invalid range list before calling ->range_start().
>   - Del the range from invalid range list after calling ->range_end().
>   - Remove useless list initialization.
> 
> Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Haggai Eran <haggaie@mellanox.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  9 ++--
>  drivers/gpu/drm/radeon/radeon_mn.c      | 14 +++---
>  drivers/infiniband/core/umem_odp.c      | 16 +++----
>  drivers/misc/sgi-gru/grutlbpurge.c      | 15 +++----
>  drivers/xen/gntdev.c                    | 15 ++++---
>  fs/proc/task_mmu.c                      | 11 +++--
>  include/linux/mmu_notifier.h            | 55 ++++++++++++-----------
>  kernel/events/uprobes.c                 | 13 +++---
>  mm/huge_memory.c                        | 78 ++++++++++++++------------------
>  mm/hugetlb.c                            | 55 ++++++++++++-----------
>  mm/ksm.c                                | 28 +++++-------
>  mm/madvise.c                            | 20 ++++-----
>  mm/memory.c                             | 72 +++++++++++++++++-------------
>  mm/migrate.c                            | 36 +++++++--------
>  mm/mmu_notifier.c                       | 79 ++++++++++++++++++++++++++++-----
>  mm/mprotect.c                           | 18 ++++----
>  mm/mremap.c                             | 14 +++---
>  virt/kvm/kvm_main.c                     | 10 ++---
>  18 files changed, 302 insertions(+), 256 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 452e9b1..80fe72a 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -131,16 +131,15 @@ restart:
>  
>  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  						       struct mm_struct *mm,
> -						       unsigned long start,
> -						       unsigned long end,
> -						       enum mmu_event event)
> +						       const struct mmu_notifier_range *range)
>  {
>  	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
>  	struct interval_tree_node *it = NULL;
> -	unsigned long next = start;
> +	unsigned long next = range->start;
>  	unsigned long serial = 0;
> +	/* interval ranges are inclusive, but invalidate range is exclusive */
> +	unsigned long end = range->end - 1, start = range->start;


A *very* minor point, but doing it that way messes up the scope of the 
comment. Something more like this might be cleaner:

unsigned long start = range->start;
unsigned long next = start;
unsigned long serial = 0;
/* interval ranges are inclusive, but invalidate range is exclusive */
unsigned long end = range->end - 1;


[...]

> -					   enum mmu_event event)
> +					   struct mmu_notifier_range *range)
>  
>  {
>  	struct mmu_notifier *mn;
>  	int id;
>  
> +	spin_lock(&mm->mmu_notifier_mm->lock);
> +	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
> +	mm->mmu_notifier_mm->nranges++;


Is this missing a call to wake_up(&mm->mmu_notifier_mm->wait_queue)? If 
not, then it would be helpful to explain why that's only required for 
nranges--, and not for the nranges++ case. The helper routine is merely 
waiting for nranges to *change*, not looking for greater than or less 
than.


> +	spin_unlock(&mm->mmu_notifier_mm->lock);
> +
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start,
> -							end, event);
> +			mn->ops->invalidate_range_start(mn, mm, range);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>  
>  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -					 unsigned long start,
> -					 unsigned long end,
> -					 enum mmu_event event)
> +					 struct mmu_notifier_range *range)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -211,12 +211,23 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  		 * (besides the pointer check).
>  		 */
>  		if (mn->ops->invalidate_range)
> -			mn->ops->invalidate_range(mn, mm, start, end);
> +			mn->ops->invalidate_range(mn, mm,
> +						  range->start, range->end);
>  		if (mn->ops->invalidate_range_end)
> -			mn->ops->invalidate_range_end(mn, mm, start,
> -						      end, event);
> +			mn->ops->invalidate_range_end(mn, mm, range);
>  	}
>  	srcu_read_unlock(&srcu, id);
> +
> +	spin_lock(&mm->mmu_notifier_mm->lock);
> +	list_del_init(&range->list);
> +	mm->mmu_notifier_mm->nranges--;
> +	spin_unlock(&mm->mmu_notifier_mm->lock);
> +
> +	/*
> +	 * Wakeup after callback so they can do their job before any of the
> +	 * waiters resume.
> +	 */
> +	wake_up(&mm->mmu_notifier_mm->wait_queue);
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
>  
> @@ -235,6 +246,49 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
>  


We definitely want to put a little documentation here.


> +static bool mmu_notifier_range_is_valid_locked(struct mm_struct *mm,
> +					       unsigned long start,
> +					       unsigned long end)


This routine is named "_range_is_valid_", but it takes in an implicit 
range (start, end), and also a list of ranges (buried in mm), and so it's 
a little confusing. I'd like to consider *maybe* changing either the name, 
or the args (range* instead of start, end?), or something.

Could you please say a few words about the intent of this routine, to get 
us started there?


> +{
> +	struct mmu_notifier_range *range;
> +
> +	list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
> +		if (!(range->end <= start || range->start >= end))
> +			return false;


This has a lot of negatives in it, if you count the innermost "not in 
range" expression. It can be simplified to this:

if(range->end > start && range->start < end)
	return false;


> +	}
> +	return true;
> +}
> +
> +bool mmu_notifier_range_is_valid(struct mm_struct *mm,
> +				 unsigned long start,
> +				 unsigned long end)
> +{
> +	bool valid;
> +
> +	spin_lock(&mm->mmu_notifier_mm->lock);
> +	valid = mmu_notifier_range_is_valid_locked(mm, start, end);
> +	spin_unlock(&mm->mmu_notifier_mm->lock);
> +	return valid;
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_range_is_valid);
> +
> +void mmu_notifier_range_wait_valid(struct mm_struct *mm,
> +				   unsigned long start,
> +				   unsigned long end)
> +{
> +	spin_lock(&mm->mmu_notifier_mm->lock);
> +	while (!mmu_notifier_range_is_valid_locked(mm, start, end)) {
> +		int nranges = mm->mmu_notifier_mm->nranges;
> +
> +		spin_unlock(&mm->mmu_notifier_mm->lock);
> +		wait_event(mm->mmu_notifier_mm->wait_queue,
> +			   nranges != mm->mmu_notifier_mm->nranges);
> +		spin_lock(&mm->mmu_notifier_mm->lock);
> +	}
> +	spin_unlock(&mm->mmu_notifier_mm->lock);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_valid);
> +
>  static int do_mmu_notifier_register(struct mmu_notifier *mn,
>  				    struct mm_struct *mm,
>  				    int take_mmap_sem)
> @@ -264,6 +318,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
>  	if (!mm_has_notifiers(mm)) {

[...]

That's all I could see to mention for this one, thanks,

john h

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 03/36] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page()
  2015-05-21 19:31 ` [PATCH 03/36] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() j.glisse
  2015-05-27  5:17   ` Aneesh Kumar K.V
@ 2015-06-03  4:25   ` John Hubbard
  1 sibling, 0 replies; 80+ messages in thread
From: John Hubbard @ 2015-06-03  4:25 UTC (permalink / raw)
  To: j.glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 6806 bytes --]

On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> From: JA(C)rA'me Glisse <jglisse@redhat.com>
> 
> Listener of mm event might not have easy way to get the struct page
> behind and address invalidated with mmu_notifier_invalidate_page()

s/behind and address/behind an address/

> function as this happens after the cpu page table have been clear/
> updated. This happens for instance if the listener is storing a dma
> mapping inside its secondary page table. To avoid complex reverse
> dma mapping lookup just pass along a pointer to the page being
> invalidated.
> 
> Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
> ---
>  drivers/infiniband/core/umem_odp.c | 1 +
>  drivers/iommu/amd_iommu_v2.c       | 1 +
>  drivers/misc/sgi-gru/grutlbpurge.c | 1 +
>  drivers/xen/gntdev.c               | 1 +
>  include/linux/mmu_notifier.h       | 6 +++++-
>  mm/mmu_notifier.c                  | 3 ++-
>  mm/rmap.c                          | 4 ++--
>  virt/kvm/kvm_main.c                | 1 +
>  8 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 8f7f845..d10dd88 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -166,6 +166,7 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
>  static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
>  					     unsigned long address,
> +					     struct page *page,
>  					     enum mmu_event event)
>  {
>  	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 4aa4de6..de3c540 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -385,6 +385,7 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
>  static void mn_invalidate_page(struct mmu_notifier *mn,
>  			       struct mm_struct *mm,
>  			       unsigned long address,
> +			       struct page *page,
>  			       enum mmu_event event)
>  {
>  	__mn_flush_page(mn, address);
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index 44b41b7..c7659b76 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -250,6 +250,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
>  
>  static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
>  				unsigned long address,
> +				struct page *page,
>  				enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index 0e8aa12..90693ce 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -485,6 +485,7 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>  static void mn_invl_page(struct mmu_notifier *mn,
>  			 struct mm_struct *mm,
>  			 unsigned long address,
> +			 struct page *page,
>  			 enum mmu_event event)
>  {
>  	struct mmu_notifier_range range;
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index ada3ed1..283ad26 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -172,6 +172,7 @@ struct mmu_notifier_ops {
>  	void (*invalidate_page)(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
>  				unsigned long address,
> +				struct page *page,
>  				enum mmu_event event);
>  
>  	/*
> @@ -290,6 +291,7 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>  				      enum mmu_event event);
>  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
>  					  unsigned long address,
> +					  struct page *page,
>  					  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  						  struct mmu_notifier_range *range);
> @@ -338,10 +340,11 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
>  						unsigned long address,
> +						struct page *page,
>  						enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_page(mm, address, event);
> +		__mmu_notifier_invalidate_page(mm, address, page, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> @@ -492,6 +495,7 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
>  						unsigned long address,
> +						struct page *page,
>  						enum mmu_event event)
>  {
>  }
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 294ebc4..2ff6d43 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -160,6 +160,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
>  
>  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
>  				    unsigned long address,
> +				    struct page *page,
>  				    enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
> @@ -168,7 +169,7 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_page)
> -			mn->ops->invalidate_page(mn, mm, address, event);
> +			mn->ops->invalidate_page(mn, mm, address, page, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 74c51e0..4563edc 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -915,7 +915,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  	pte_unmap_unlock(pte, ptl);
>  
>  	if (ret) {
> -		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
> +		mmu_notifier_invalidate_page(mm, address, page, MMU_WRITE_BACK);
>  		(*cleaned)++;
>  	}
>  out:
> @@ -1338,7 +1338,7 @@ discard:
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> -		mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
> +		mmu_notifier_invalidate_page(mm, address, page, MMU_MIGRATE);
>  out:
>  	return ret;
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6177c56..62978ed 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -261,6 +261,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
>  					     unsigned long address,
> +					     struct page *page,
>  					     enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> -- 
> 1.9.3
> 
> 

This seems like a reasonable thing to do, and I didn't spot any problems 
in the patch.

thanks,
John H.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 01/36] mmu_notifier: add event information to address invalidation v7
  2015-06-01 23:10       ` John Hubbard
@ 2015-06-03 16:07         ` Jerome Glisse
  2015-06-03 23:02           ` John Hubbard
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-03 16:07 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

On Mon, Jun 01, 2015 at 04:10:46PM -0700, John Hubbard wrote:
> On Mon, 1 Jun 2015, Jerome Glisse wrote:
> > On Fri, May 29, 2015 at 08:43:59PM -0700, John Hubbard wrote:
> > > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > > > From: Jerome Glisse <jglisse@redhat.com>

[...]
> > > > +	MMU_ISDIRTY,
> > > 
> > > This MMU_ISDIRTY seems like a problem to me. First of all, it looks 
> > > backwards: the only place that invokes it is the clear_refs_write() 
> > > routine, for the soft-dirty tracking feature. And in that case, the pages 
> > > are *not* being made dirty! Rather, the kernel is actually making the 
> > > pages non-writable, in order to be able to trap the subsequent page fault 
> > > and figure out if the page is in active use.
> > > 
> > > So, given that there is only one call site, and that call site should 
> > > actually be setting MMU_WRITE_PROTECT instead (I think), let's just delete 
> > > MMU_ISDIRTY.
> > > 
> > > Come to think about it, there is no callback possible for "a page became 
> > > dirty", anyway. Because the dirty and accessed bits are actually set by 
> > > the hardware, and software is generally unable to know the current state.
> > > So MMU_ISDIRTY just seems inappropriate to me, across the board.
> > > 
> > > I'll take a look at the corresponding HMM_ISDIRTY, too.
> > 
> > Ok i need to rename that one to CLEAR_SOFT_DIRTY, the idea is that
> > for HMM i would rather not write protect the memory for the device
> > and just rely on the regular and conservative dirtying of page. The
> > soft dirty is really for migrating a process where you first clear
> > the soft dirty bit, then copy memory while process still running,
> > then freeze process an only copy memory that was dirtied since
> > first copy. Point being that adding soft dirty to HMM is something
> > that can be done down the road. We should have enough bit inside
> > the device page table for that.
> > 
> 
> Yes, I think renaming it to CLEAR_SOFT_DIRTY will definitely allow more 
> accurate behavior in response to these events.
> 
> Looking ahead, a couple things:
> 
> 1. This mechanism is also used for general memory utilization tracking (I 
> see that Vladimir DavyDov has an "idle memory tracking" proposal that 
> assumes this works, for example: https://lwn.net/Articles/642202/ and 
> https://lkml.org/lkml/2015/5/12/449).
> 
> 2. It seems hard to avoid the need to eventually just write protect the 
> page, whether it is on the CPU or the remote device, if things like device 
> drivers or user space need to track write accesses to a virtual address. 
> Either you write protect the page, and trap the page faults, or you wait 
> until later and read the dirty bit (indirectly, via something like 
> unmap_mapping_range). Or did you have something else in mind?
> 
> Anyway, none of that needs to hold up this part of the patchset, because 
> the renaming fixes things up for the future code to do the right thing.

I will go over Vladimir patchset it was on my radar but haven't had yet a
chance to go over it. We will likely need to do the write protect for device
too. But as you said this is not an issue that this patch need a fix for,
only HMM would need to change. I will do that.


[...]
> > > We may have to add MMU_READ_WRITE (and maybe another one, I haven't 
> > > bottomed out on that), if you agree with the above approach of 
> > > always sending a precise event, instead of "protection changed".
> > 
> > I think Linus point made sense last time, but i would need to read
> > again the thread. The idea of that patch is really to provide context
> > information on what kind of CPU page table changes is happening and
> > why.
> >
> 
> Shoot, I tried to find that conversation, but my search foo is too weak. 
> If you have a link to that thread, I'd appreciate it, so I can refresh my 
> memory.
> 
> I was hoping to re-read it and see if anything has changed. It's not 
> really a huge problem to call find_vma() again, but I do want to be sure 
> that there's a good reason for doing so.
>  
> Otherwise, I'll just rely on your memory that Linus preferred your current 
> approach, and call it good, then.

http://lkml.iu.edu/hypermail/linux/kernel/1406.3/04880.html

I am working on doing some of the changes discussed so far, i will push my
tree to git://people.freedesktop.org/~glisse/linux hmm branch once i am done.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3
  2015-06-02  9:32   ` John Hubbard
@ 2015-06-03 17:15     ` Jerome Glisse
  2015-06-05  3:29       ` John Hubbard
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-03 17:15 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

On Tue, Jun 02, 2015 at 02:32:01AM -0700, John Hubbard wrote:
> On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> 
> > From: Jerome Glisse <jglisse@redhat.com>
> > 
> > The mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
> > can be considered as forming an "atomic" section for the cpu page table update
> > point of view. Between this two function the cpu page table content is unreliable
> > for the address range being invalidated.
> > 
> > Current user such as kvm need to know when they can trust the content of the cpu
> > page table. This becomes even more important to new users of the mmu_notifier
> > api (such as HMM or ODP).
> > 
> > This patch use a structure define at all call site to invalidate_range_start()
> > that is added to a list for the duration of the invalidation. It adds two new
> > helpers to allow querying if a range is being invalidated or to wait for a range
> > to become valid.
> > 
> > For proper synchronization, user must block new range invalidation from inside
> > there invalidate_range_start() callback, before calling the helper functions.
> > Otherwise there is no garanty that a new range invalidation will not be added
> > after the call to the helper function to query for existing range.
> 
> Hi Jerome,
> 
> Most of this information will make nice block comments for the new helper 
> routines. I can help tighten up the writing slightly, but first:
> 
> Question: in hmm.c's hmm_notifier_invalidate function (looking at the 
> entire patchset, for a moment), I don't see any blocking of new range 
> invalidations, even though you point out, above, that this is required. Am 
> I missing it, and if so, where should I be looking instead?

This is a 2 sided synchronization:

- hmm_device_fault_start() will wait for active invalidation that conflict
  to be done
- hmm_wait_device_fault() will block new invalidation until
  active fault that conflict back off.


> [...]
> 
> > -					   enum mmu_event event)
> > +					   struct mmu_notifier_range *range)
> >  
> >  {
> >  	struct mmu_notifier *mn;
> >  	int id;
> >  
> > +	spin_lock(&mm->mmu_notifier_mm->lock);
> > +	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
> > +	mm->mmu_notifier_mm->nranges++;
> 
> 
> Is this missing a call to wake_up(&mm->mmu_notifier_mm->wait_queue)? If 
> not, then it would be helpful to explain why that's only required for 
> nranges--, and not for the nranges++ case. The helper routine is merely 
> waiting for nranges to *change*, not looking for greater than or less 
> than.

This is on purpose, as the waiting side only wait for active invalidation
to be done ie for mm->mmu_notifier_mm->nranges-- so there is no reasons to
wake up when a new invalidation is starting. Also the test need to be a not
equal because other non conflicting range might be added/removed meaning
that wait might finish even if mm->mmu_notifier_mm->nranges > saved_nranges.


[...]
> > +static bool mmu_notifier_range_is_valid_locked(struct mm_struct *mm,
> > +					       unsigned long start,
> > +					       unsigned long end)
> 
> 
> This routine is named "_range_is_valid_", but it takes in an implicit 
> range (start, end), and also a list of ranges (buried in mm), and so it's 
> a little confusing. I'd like to consider *maybe* changing either the name, 
> or the args (range* instead of start, end?), or something.
> 
> Could you please say a few words about the intent of this routine, to get 
> us started there?

It is just the same as mmu_notifier_range_is_valid() but it expects locks
to be taken. This is for the benefit of mmu_notifier_range_wait_valid()
which need to test if a range is valid (ie no conflicting invalidation)
or not. I added a comment to explain this 3 function and to explain how
the 2 publics helper needs to be use.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 01/36] mmu_notifier: add event information to address invalidation v7
  2015-06-03 16:07         ` Jerome Glisse
@ 2015-06-03 23:02           ` John Hubbard
  0 siblings, 0 replies; 80+ messages in thread
From: John Hubbard @ 2015-06-03 23:02 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 2072 bytes --]

On Wed, 3 Jun 2015, Jerome Glisse wrote:
> On Mon, Jun 01, 2015 at 04:10:46PM -0700, John Hubbard wrote:
> > On Mon, 1 Jun 2015, Jerome Glisse wrote:
> > > On Fri, May 29, 2015 at 08:43:59PM -0700, John Hubbard wrote:
> > > > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > > > > From: Jerome Glisse <jglisse@redhat.com>
> 
> [...]
> > > > We may have to add MMU_READ_WRITE (and maybe another one, I haven't 
> > > > bottomed out on that), if you agree with the above approach of 
> > > > always sending a precise event, instead of "protection changed".
> > > 
> > > I think Linus point made sense last time, but i would need to read
> > > again the thread. The idea of that patch is really to provide context
> > > information on what kind of CPU page table changes is happening and
> > > why.
> > >
> > 
> > Shoot, I tried to find that conversation, but my search foo is too weak. 
> > If you have a link to that thread, I'd appreciate it, so I can refresh my 
> > memory.
> > 
> > I was hoping to re-read it and see if anything has changed. It's not 
> > really a huge problem to call find_vma() again, but I do want to be sure 
> > that there's a good reason for doing so.
> >  
> > Otherwise, I'll just rely on your memory that Linus preferred your current 
> > approach, and call it good, then.
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1406.3/04880.html
> 
> I am working on doing some of the changes discussed so far, i will push my
> tree to git://people.freedesktop.org/~glisse/linux hmm branch once i am done.


Aha, OK, that was back when you were passing around the vma. But now, 
you're not doing that anymore. It's just: mm*, range* (start, end, 
event_type), and sometimes page* and exclude*). So I think it's still 
reasonable to either pass down pure vma flags, or else add in new event 
types, in order to avoid having to lookup the vma later.

We could still get NAK'd for adding ugly new event types, but if you're 
going to add the event types at all, let's make them complete, so that we 
really *earn* the NAK. :)

> 
> Cheers,
> Jerome
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3
  2015-06-03 17:15     ` Jerome Glisse
@ 2015-06-05  3:29       ` John Hubbard
  0 siblings, 0 replies; 80+ messages in thread
From: John Hubbard @ 2015-06-05  3:29 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 4482 bytes --]

On Wed, 3 Jun 2015, Jerome Glisse wrote:

> On Tue, Jun 02, 2015 at 02:32:01AM -0700, John Hubbard wrote:
> > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > 
> > > From: Jerome Glisse <jglisse@redhat.com>
> > > 
> > > The mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
> > > can be considered as forming an "atomic" section for the cpu page table update
> > > point of view. Between this two function the cpu page table content is unreliable
> > > for the address range being invalidated.
> > > 
> > > Current user such as kvm need to know when they can trust the content of the cpu
> > > page table. This becomes even more important to new users of the mmu_notifier
> > > api (such as HMM or ODP).
> > > 
> > > This patch use a structure define at all call site to invalidate_range_start()
> > > that is added to a list for the duration of the invalidation. It adds two new
> > > helpers to allow querying if a range is being invalidated or to wait for a range
> > > to become valid.
> > > 
> > > For proper synchronization, user must block new range invalidation from inside
> > > there invalidate_range_start() callback, before calling the helper functions.
> > > Otherwise there is no garanty that a new range invalidation will not be added
> > > after the call to the helper function to query for existing range.
> > 
> > Hi Jerome,
> > 
> > Most of this information will make nice block comments for the new helper 
> > routines. I can help tighten up the writing slightly, but first:
> > 
> > Question: in hmm.c's hmm_notifier_invalidate function (looking at the 
> > entire patchset, for a moment), I don't see any blocking of new range 
> > invalidations, even though you point out, above, that this is required. Am 
> > I missing it, and if so, where should I be looking instead?
> 
> This is a 2 sided synchronization:
> 
> - hmm_device_fault_start() will wait for active invalidation that conflict
>   to be done
> - hmm_wait_device_fault() will block new invalidation until
>   active fault that conflict back off.
>


OK. I'll wait until those patches to talk about those, then.
 
> 
> > [...]
> > 
> > > -					   enum mmu_event event)
> > > +					   struct mmu_notifier_range *range)
> > >  
> > >  {
> > >  	struct mmu_notifier *mn;
> > >  	int id;
> > >  
> > > +	spin_lock(&mm->mmu_notifier_mm->lock);
> > > +	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
> > > +	mm->mmu_notifier_mm->nranges++;
> > 
> > 
> > Is this missing a call to wake_up(&mm->mmu_notifier_mm->wait_queue)? If 
> > not, then it would be helpful to explain why that's only required for 
> > nranges--, and not for the nranges++ case. The helper routine is merely 
> > waiting for nranges to *change*, not looking for greater than or less 
> > than.
> 
> This is on purpose, as the waiting side only wait for active invalidation
> to be done ie for mm->mmu_notifier_mm->nranges-- so there is no reasons to
> wake up when a new invalidation is starting. Also the test need to be a not
> equal because other non conflicting range might be added/removed meaning
> that wait might finish even if mm->mmu_notifier_mm->nranges > saved_nranges.
> 


OK, I convinced myself that this works as intended. So I don't see 
anything wrong with this approach.

thanks,
john h

> 
> [...]
> > > +static bool mmu_notifier_range_is_valid_locked(struct mm_struct *mm,
> > > +					       unsigned long start,
> > > +					       unsigned long end)
> > 
> > 
> > This routine is named "_range_is_valid_", but it takes in an implicit 
> > range (start, end), and also a list of ranges (buried in mm), and so it's 
> > a little confusing. I'd like to consider *maybe* changing either the name, 
> > or the args (range* instead of start, end?), or something.
> > 
> > Could you please say a few words about the intent of this routine, to get 
> > us started there?
> 
> It is just the same as mmu_notifier_range_is_valid() but it expects locks
> to be taken. This is for the benefit of mmu_notifier_range_wait_valid()
> which need to test if a range is valid (ie no conflicting invalidation)
> or not. I added a comment to explain this 3 function and to explain how
> the 2 publics helper needs to be use.
> 
> Cheers,
> Jerome
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-05-21 19:31 ` [PATCH 05/36] HMM: introduce heterogeneous memory management v3 j.glisse
  2015-05-27  5:50   ` Aneesh Kumar K.V
@ 2015-06-08 19:40   ` Mark Hairgrove
  2015-06-08 21:17     ` Jerome Glisse
  1 sibling, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-08 19:40 UTC (permalink / raw)
  To: j.glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar, linux-rdma

[-- Attachment #1: Type: text/plain, Size: 6878 bytes --]



On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> From: JA(C)rA'me Glisse <jglisse@redhat.com>
>
> This patch only introduce core HMM functions for registering a new
> mirror and stopping a mirror as well as HMM device registering and
> unregistering.
>
> [...]
>
> +/* struct hmm_device_operations - HMM device operation callback
> + */
> +struct hmm_device_ops {
> +	/* release() - mirror must stop using the address space.
> +	 *
> +	 * @mirror: The mirror that link process address space with the device.
> +	 *
> +	 * When this is call, device driver must kill all device thread using
> +	 * this mirror. Also, this callback is the last thing call by HMM and
> +	 * HMM will not access the mirror struct after this call (ie no more
> +	 * dereference of it so it is safe for the device driver to free it).
> +	 * It is call either from :
> +	 *   - mm dying (all process using this mm exiting).
> +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> +	 *   - outcome of some device error reported by any of the device
> +	 *     callback against that mirror.
> +	 */
> +	void (*release)(struct hmm_mirror *mirror);
> +};

The comment that ->release is called when the mm dies doesn't match the
implementation. ->release is only called when the mirror is destroyed, and
that can only happen after the mirror has been unregistered. This may not
happen until after the mm dies.

Is the intent for the driver to get the callback when the mm goes down?
That seems beneficial so the driver can kill whatever's happening on the
device. Otherwise the device may continue operating in a dead address
space until the driver's file gets closed and it unregisters the mirror.


> +static void hmm_mirror_destroy(struct kref *kref)
> +{
> +	struct hmm_device *device;
> +	struct hmm_mirror *mirror;
> +	struct hmm *hmm;
> +
> +	mirror = container_of(kref, struct hmm_mirror, kref);
> +	device = mirror->device;
> +	hmm = mirror->hmm;
> +
> +	mutex_lock(&device->mutex);
> +	list_del_init(&mirror->dlist);
> +	device->ops->release(mirror);
> +	mutex_unlock(&device->mutex);
> +}

The hmm variable is unused. It also probably isn't safe to access at this
point.


> +static void hmm_mirror_kill(struct hmm_mirror *mirror)
> +{
> +	down_write(&mirror->hmm->rwsem);
> +	if (!hlist_unhashed(&mirror->mlist)) {
> +		hlist_del_init(&mirror->mlist);
> +		up_write(&mirror->hmm->rwsem);
> +
> +		hmm_mirror_unref(&mirror);
> +	} else
> +		up_write(&mirror->hmm->rwsem);
> +}

Shouldn't this call hmm_unref? hmm_mirror_register calls hmm_ref but
there's no corresponding hmm_unref when the mirror goes away. As a result
the hmm struct gets leaked and thus so does the entire mm since
mmu_notifier_unregister is never called.

It might also be a good idea to set mirror->hmm = NULL here to prevent
accidental use in say hmm_mirror_destroy.


> +/* hmm_device_unregister() - unregister a device with HMM.
> + *
> + * @device: The hmm_device struct.
> + * Returns: 0 on success or -EBUSY otherwise.
> + *
> + * Call when device driver want to unregister itself with HMM. This will check
> + * that there is no any active mirror and returns -EBUSY if so.
> + */
> +int hmm_device_unregister(struct hmm_device *device)
> +{
> +	mutex_lock(&device->mutex);
> +	if (!list_empty(&device->mirrors)) {
> +		mutex_unlock(&device->mutex);
> +		return -EBUSY;
> +	}
> +	mutex_unlock(&device->mutex);
> +	return 0;
> +}

I assume that the intention is for the caller to spin on
hmm_device_unregister until -EBUSY is no longer returned?

If so, I think there's a race here in the case of mm teardown happening
concurrently with hmm_mirror_unregister. This can happen if the parent
process was forked and exits while the child closes the file, or if the
file is passed to another process and closed last there while the original
process exits.

The upshot is that the hmm_device may still be referenced by another
thread even after hmm_device_unregister returns 0.

The below sequence shows how this might happen. Coming into this, the
mirror's ref count is 2:

Thread A (file close)               Thread B (process exit)
----------------------              ----------------------
                                    hmm_notifier_release
                                      down_write(&hmm->rwsem);
hmm_mirror_unregister
  hmm_mirror_kill
    down_write(&hmm->rwsem);
    // Blocked on thread B
                                      hlist_del_init(&mirror->mlist);
                                      up_write(&hmm->rwsem);

                                      // Thread A unblocked
                                      // Thread B is preempted
    // hlist_unhashed returns 1
    up_write(&hmm->rwsem);

  // Mirror ref goes 2 -> 1
  hmm_mirror_unref(&mirror);

  // hmm_mirror_unregister returns

At this point hmm_mirror_unregister has returned to the caller but the
mirror still is in use by thread B. Since all mirrors have been
unregistered, the driver in thread A is now free to call
hmm_device_unregister.

                                      // Thread B is scheduled

                                      // Mirror ref goes 1 -> 0
                                      hmm_mirror_unref(&mirror);
                                        hmm_mirror_destroy(&mirror)
                                          mutex_lock(&device->mutex);
                                          list_del_init(&mirror->dlist);
                                          device->ops->release(mirror);
                                          mutex_unlock(&device->mutex);

hmm_device_unregister
  mutex_lock(&device->mutex);
  // Device list empty
  mutex_unlock(&device->mutex);
  return 0;
// Caller frees device

Do you agree that this sequence can happen, or am I missing something
which prevents it?

If this can happen, the problem is that the only thing preventing thread A
from freeing the device is that thread B has device->mutex locked. That's
bad, because a lock within a structure cannot be used to control freeing
that structure. The mutex_unlock in thread B may internally still access
the mutex memory even after the atomic operation which unlocks the mutex
and unblocks thread A.

This can't be solved by having the driver wait for the ->release mirror
callback before it calls hmm_device_unregister, because the race happens
after that point.

A kref on the device itself might solve this, but the core issue IMO is
that hmm_mirror_unregister doesn't wait for hmm_notifier_release to
complete before returning. It feels like hmm_mirror_unregister wants to do
a synchronize_srcu on the mmu_notifier srcu. Is that possible?

Whatever the resolution, it would be useful for the block comments of
hmm_mirror_unregister and hmm_device_unregister to describe the
expectations on the caller and what the caller is guaranteed as far as
mirror and device lifetimes go.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-08 19:40   ` Mark Hairgrove
@ 2015-06-08 21:17     ` Jerome Glisse
  2015-06-09  1:54       ` Mark Hairgrove
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-08 21:17 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar, linux-rdma

On Mon, Jun 08, 2015 at 12:40:18PM -0700, Mark Hairgrove wrote:
> 
> 
> On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> 
> > From: Jerome Glisse <jglisse@redhat.com>
> >
> > This patch only introduce core HMM functions for registering a new
> > mirror and stopping a mirror as well as HMM device registering and
> > unregistering.
> >
> > [...]
> >
> > +/* struct hmm_device_operations - HMM device operation callback
> > + */
> > +struct hmm_device_ops {
> > +	/* release() - mirror must stop using the address space.
> > +	 *
> > +	 * @mirror: The mirror that link process address space with the device.
> > +	 *
> > +	 * When this is call, device driver must kill all device thread using
> > +	 * this mirror. Also, this callback is the last thing call by HMM and
> > +	 * HMM will not access the mirror struct after this call (ie no more
> > +	 * dereference of it so it is safe for the device driver to free it).
> > +	 * It is call either from :
> > +	 *   - mm dying (all process using this mm exiting).
> > +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> > +	 *   - outcome of some device error reported by any of the device
> > +	 *     callback against that mirror.
> > +	 */
> > +	void (*release)(struct hmm_mirror *mirror);
> > +};
> 
> The comment that ->release is called when the mm dies doesn't match the
> implementation. ->release is only called when the mirror is destroyed, and
> that can only happen after the mirror has been unregistered. This may not
> happen until after the mm dies.
> 
> Is the intent for the driver to get the callback when the mm goes down?
> That seems beneficial so the driver can kill whatever's happening on the
> device. Otherwise the device may continue operating in a dead address
> space until the driver's file gets closed and it unregisters the mirror.

This was the intent before merging free & release. I guess i need to
reinstate the free versus release callback. Sadly the lifetime for HMM
is more complex than mmu_notifier as we intend the mirror struct to
be embedded into a driver private struct.

> 
> > +static void hmm_mirror_destroy(struct kref *kref)
> > +{
> > +	struct hmm_device *device;
> > +	struct hmm_mirror *mirror;
> > +	struct hmm *hmm;
> > +
> > +	mirror = container_of(kref, struct hmm_mirror, kref);
> > +	device = mirror->device;
> > +	hmm = mirror->hmm;
> > +
> > +	mutex_lock(&device->mutex);
> > +	list_del_init(&mirror->dlist);
> > +	device->ops->release(mirror);
> > +	mutex_unlock(&device->mutex);
> > +}
> 
> The hmm variable is unused. It also probably isn't safe to access at this
> point.

hmm_unref(hmm); was lost somewhere probably in a rebase and it is safe to
access hmm here.

> 
> 
> > +static void hmm_mirror_kill(struct hmm_mirror *mirror)
> > +{
> > +	down_write(&mirror->hmm->rwsem);
> > +	if (!hlist_unhashed(&mirror->mlist)) {
> > +		hlist_del_init(&mirror->mlist);
> > +		up_write(&mirror->hmm->rwsem);
> > +
> > +		hmm_mirror_unref(&mirror);
> > +	} else
> > +		up_write(&mirror->hmm->rwsem);
> > +}
> 
> Shouldn't this call hmm_unref? hmm_mirror_register calls hmm_ref but
> there's no corresponding hmm_unref when the mirror goes away. As a result
> the hmm struct gets leaked and thus so does the entire mm since
> mmu_notifier_unregister is never called.
> 
> It might also be a good idea to set mirror->hmm = NULL here to prevent
> accidental use in say hmm_mirror_destroy.

No, hmm_mirror_destroy must be the one doing the hmm_unref(hmm)

> 
> 
> > +/* hmm_device_unregister() - unregister a device with HMM.
> > + *
> > + * @device: The hmm_device struct.
> > + * Returns: 0 on success or -EBUSY otherwise.
> > + *
> > + * Call when device driver want to unregister itself with HMM. This will check
> > + * that there is no any active mirror and returns -EBUSY if so.
> > + */
> > +int hmm_device_unregister(struct hmm_device *device)
> > +{
> > +	mutex_lock(&device->mutex);
> > +	if (!list_empty(&device->mirrors)) {
> > +		mutex_unlock(&device->mutex);
> > +		return -EBUSY;
> > +	}
> > +	mutex_unlock(&device->mutex);
> > +	return 0;
> > +}
> 
> I assume that the intention is for the caller to spin on
> hmm_device_unregister until -EBUSY is no longer returned?
> 
> If so, I think there's a race here in the case of mm teardown happening
> concurrently with hmm_mirror_unregister. This can happen if the parent
> process was forked and exits while the child closes the file, or if the
> file is passed to another process and closed last there while the original
> process exits.
> 
> The upshot is that the hmm_device may still be referenced by another
> thread even after hmm_device_unregister returns 0.
> 
> The below sequence shows how this might happen. Coming into this, the
> mirror's ref count is 2:
> 
> Thread A (file close)               Thread B (process exit)
> ----------------------              ----------------------
>                                     hmm_notifier_release
>                                       down_write(&hmm->rwsem);
> hmm_mirror_unregister
>   hmm_mirror_kill
>     down_write(&hmm->rwsem);
>     // Blocked on thread B
>                                       hlist_del_init(&mirror->mlist);
>                                       up_write(&hmm->rwsem);
> 
>                                       // Thread A unblocked
>                                       // Thread B is preempted
>     // hlist_unhashed returns 1
>     up_write(&hmm->rwsem);
> 
>   // Mirror ref goes 2 -> 1
>   hmm_mirror_unref(&mirror);
> 
>   // hmm_mirror_unregister returns
> 
> At this point hmm_mirror_unregister has returned to the caller but the
> mirror still is in use by thread B. Since all mirrors have been
> unregistered, the driver in thread A is now free to call
> hmm_device_unregister.
> 
>                                       // Thread B is scheduled
> 
>                                       // Mirror ref goes 1 -> 0
>                                       hmm_mirror_unref(&mirror);
>                                         hmm_mirror_destroy(&mirror)
>                                           mutex_lock(&device->mutex);
>                                           list_del_init(&mirror->dlist);
>                                           device->ops->release(mirror);
>                                           mutex_unlock(&device->mutex);
> 
> hmm_device_unregister
>   mutex_lock(&device->mutex);
>   // Device list empty
>   mutex_unlock(&device->mutex);
>   return 0;
> // Caller frees device
> 
> Do you agree that this sequence can happen, or am I missing something
> which prevents it?

Can't happen because child have mm->hmm = NULL ie only one hmm per mm
and hmm is tie to only one mm. It is the responsability of the device
driver to make sure same apply to private reference to the hmm mirror
struct ie hmm_mirror should never be tie to a private file struct.

> 
> If this can happen, the problem is that the only thing preventing thread A
> from freeing the device is that thread B has device->mutex locked. That's
> bad, because a lock within a structure cannot be used to control freeing
> that structure. The mutex_unlock in thread B may internally still access
> the mutex memory even after the atomic operation which unlocks the mutex
> and unblocks thread A.
> 
> This can't be solved by having the driver wait for the ->release mirror
> callback before it calls hmm_device_unregister, because the race happens
> after that point.
> 
> A kref on the device itself might solve this, but the core issue IMO is
> that hmm_mirror_unregister doesn't wait for hmm_notifier_release to
> complete before returning. It feels like hmm_mirror_unregister wants to do
> a synchronize_srcu on the mmu_notifier srcu. Is that possible?

I guess i need to revisit once again the whole lifetime issue.

> 
> Whatever the resolution, it would be useful for the block comments of
> hmm_mirror_unregister and hmm_device_unregister to describe the
> expectations on the caller and what the caller is guaranteed as far as
> mirror and device lifetimes go.

Yes i need to fix comment and add again spend time on lifetime. I
obviously completely screw that up in that version of the patchset.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-08 21:17     ` Jerome Glisse
@ 2015-06-09  1:54       ` Mark Hairgrove
  2015-06-09 15:56         ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-09  1:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Mark Hairgrove,
	Cameron Buschardt, Arvind Gopalakrishnan, Haggai Eran,
	Shachar Raindel, Liran Liss, Roland Dreier, Ben Sander,
	Greg Stoner, John Bridgman, Michael Mantor, Paul Blinzer,
	Laurent Morichetti, Alexander Deucher, Oded Gabbay,
	Jérôme Glisse, Jatin Kumar, linux-rdma

[-- Attachment #1: Type: text/plain, Size: 3982 bytes --]



On Mon, 8 Jun 2015, Jerome Glisse wrote:

> On Mon, Jun 08, 2015 at 12:40:18PM -0700, Mark Hairgrove wrote:
> >
> >
> > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> >
> > > From: Jerome Glisse <jglisse@redhat.com>
> > >
> > > This patch only introduce core HMM functions for registering a new
> > > mirror and stopping a mirror as well as HMM device registering and
> > > unregistering.
> > >
> > > [...]
> > >
> > > +/* struct hmm_device_operations - HMM device operation callback
> > > + */
> > > +struct hmm_device_ops {
> > > +	/* release() - mirror must stop using the address space.
> > > +	 *
> > > +	 * @mirror: The mirror that link process address space with the device.
> > > +	 *
> > > +	 * When this is call, device driver must kill all device thread using
> > > +	 * this mirror. Also, this callback is the last thing call by HMM and
> > > +	 * HMM will not access the mirror struct after this call (ie no more
> > > +	 * dereference of it so it is safe for the device driver to free it).
> > > +	 * It is call either from :
> > > +	 *   - mm dying (all process using this mm exiting).
> > > +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> > > +	 *   - outcome of some device error reported by any of the device
> > > +	 *     callback against that mirror.
> > > +	 */
> > > +	void (*release)(struct hmm_mirror *mirror);
> > > +};
> >
> > The comment that ->release is called when the mm dies doesn't match the
> > implementation. ->release is only called when the mirror is destroyed, and
> > that can only happen after the mirror has been unregistered. This may not
> > happen until after the mm dies.
> >
> > Is the intent for the driver to get the callback when the mm goes down?
> > That seems beneficial so the driver can kill whatever's happening on the
> > device. Otherwise the device may continue operating in a dead address
> > space until the driver's file gets closed and it unregisters the mirror.
>
> This was the intent before merging free & release. I guess i need to
> reinstate the free versus release callback. Sadly the lifetime for HMM
> is more complex than mmu_notifier as we intend the mirror struct to
> be embedded into a driver private struct.

Can you clarify how that's different from mmu_notifiers? Those are also
embedded into a driver-owned struct.

Is the goal to allow calling hmm_mirror_unregister from within the "mm is
dying" HMM callback? I don't know whether that's really necessary as long
as there's some association between the driver files and the mirrors.


> > If so, I think there's a race here in the case of mm teardown happening
> > concurrently with hmm_mirror_unregister.
> >
> > [...]
> >
> > Do you agree that this sequence can happen, or am I missing something
> > which prevents it?
>
> Can't happen because child have mm->hmm = NULL ie only one hmm per mm
> and hmm is tie to only one mm. It is the responsability of the device
> driver to make sure same apply to private reference to the hmm mirror
> struct ie hmm_mirror should never be tie to a private file struct.

It's useful for the driver to have some association between files and
mirrors. If the file is closed prior to process exit we would like to
unregister the mirror, otherwise it will persist until process teardown.
The association doesn't have to be 1:1 but having the files ref count the
mirror or something would be useful.

But even if we assume no association at all between files and mirrors, are
you sure that prevents the race? The driver may choose to unregister the
hmm_device at any point once its files are closed. In the case of module
unload the device unregister can't be prevented. If mm teardown hasn't
happened yet mirrors may still be active and registered on that
hmm_device. The driver thus has to first call hmm_mirror_unregister on all
active mirrors, then call hmm_device_unregister. mm teardown of those
mirrors may trigger at any point in this sequence, so we're right back to
that race.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-09  1:54       ` Mark Hairgrove
@ 2015-06-09 15:56         ` Jerome Glisse
  2015-06-10  3:33           ` Mark Hairgrove
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-09 15:56 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar, linux-rdma

On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote:
> On Mon, 8 Jun 2015, Jerome Glisse wrote:
> > On Mon, Jun 08, 2015 at 12:40:18PM -0700, Mark Hairgrove wrote:
> > > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > > > From: Jerome Glisse <jglisse@redhat.com>
> > > >
> > > > This patch only introduce core HMM functions for registering a new
> > > > mirror and stopping a mirror as well as HMM device registering and
> > > > unregistering.
> > > >
> > > > [...]
> > > >
> > > > +/* struct hmm_device_operations - HMM device operation callback
> > > > + */
> > > > +struct hmm_device_ops {
> > > > +	/* release() - mirror must stop using the address space.
> > > > +	 *
> > > > +	 * @mirror: The mirror that link process address space with the device.
> > > > +	 *
> > > > +	 * When this is call, device driver must kill all device thread using
> > > > +	 * this mirror. Also, this callback is the last thing call by HMM and
> > > > +	 * HMM will not access the mirror struct after this call (ie no more
> > > > +	 * dereference of it so it is safe for the device driver to free it).
> > > > +	 * It is call either from :
> > > > +	 *   - mm dying (all process using this mm exiting).
> > > > +	 *   - hmm_mirror_unregister() (if no other thread holds a reference)
> > > > +	 *   - outcome of some device error reported by any of the device
> > > > +	 *     callback against that mirror.
> > > > +	 */
> > > > +	void (*release)(struct hmm_mirror *mirror);
> > > > +};
> > >
> > > The comment that ->release is called when the mm dies doesn't match the
> > > implementation. ->release is only called when the mirror is destroyed, and
> > > that can only happen after the mirror has been unregistered. This may not
> > > happen until after the mm dies.
> > >
> > > Is the intent for the driver to get the callback when the mm goes down?
> > > That seems beneficial so the driver can kill whatever's happening on the
> > > device. Otherwise the device may continue operating in a dead address
> > > space until the driver's file gets closed and it unregisters the mirror.
> >
> > This was the intent before merging free & release. I guess i need to
> > reinstate the free versus release callback. Sadly the lifetime for HMM
> > is more complex than mmu_notifier as we intend the mirror struct to
> > be embedded into a driver private struct.
> 
> Can you clarify how that's different from mmu_notifiers? Those are also
> embedded into a driver-owned struct.

For HMM you want to be able to kill a mirror from HMM, you might have kernel
thread call inside HMM with a mirror (outside any device file lifetime) ...
The mirror is not only use at register & unregister, there is a lot more thing
you can call using the HMM mirror struct.

So the HMM mirror lifetime as a result is more complex, it can not simply be
free from the mmu_notifier_release callback or randomly. It needs to be
refcounted. The mmu_notifier code assume that the mmu_notifier struct is
embedded inside a struct that has a lifetime properly synchronize with the
mm. For HMM mirror this is not something that sounds like a good idea as there
is too many way to get it wrong.

So idea of HMM mirror is that it can out last the mm lifetime but the HMM
struct can not. So you have hmm_mirror <~> hmm <-> mm and the mirror can be
"unlink" and have different lifetime from the hmm that itself has same life
time as mm.

> Is the goal to allow calling hmm_mirror_unregister from within the "mm is
> dying" HMM callback? I don't know whether that's really necessary as long
> as there's some association between the driver files and the mirrors.

No this is not a goal and i actualy forbid that.

> 
> > > If so, I think there's a race here in the case of mm teardown happening
> > > concurrently with hmm_mirror_unregister.
> > >
> > > [...]
> > >
> > > Do you agree that this sequence can happen, or am I missing something
> > > which prevents it?
> >
> > Can't happen because child have mm->hmm = NULL ie only one hmm per mm
> > and hmm is tie to only one mm. It is the responsability of the device
> > driver to make sure same apply to private reference to the hmm mirror
> > struct ie hmm_mirror should never be tie to a private file struct.
> 
> It's useful for the driver to have some association between files and
> mirrors. If the file is closed prior to process exit we would like to
> unregister the mirror, otherwise it will persist until process teardown.
> The association doesn't have to be 1:1 but having the files ref count the
> mirror or something would be useful.

This is allowed, i might have put strong word here, but you can associate
with a file struct. What you can not do is use the mirror from a different
process ie one with a different mm struct as mirror is linked to a single
mm. So on fork there is no callback to update the private file struct, when
the device file is duplicated (well just refcount inc) against a different
process. This is something you need to be carefull in your driver. Inside
the dummy driver i abuse that to actually test proper behavior of HMM but
it should not be use as an example.

> 
> But even if we assume no association at all between files and mirrors, are
> you sure that prevents the race? The driver may choose to unregister the
> hmm_device at any point once its files are closed. In the case of module
> unload the device unregister can't be prevented. If mm teardown hasn't
> happened yet mirrors may still be active and registered on that
> hmm_device. The driver thus has to first call hmm_mirror_unregister on all
> active mirrors, then call hmm_device_unregister. mm teardown of those
> mirrors may trigger at any point in this sequence, so we're right back to
> that race.

So when device driver unload the first thing it needs to do is kill all of
its context ie all of its HMM mirror (unregister them) by doing so it will
make sure that there can be no more call to any of its functions.

The race with mm teardown does not exist as what matter for mm teardown is
the fact that the mirror is on the struct hmm mirrors list or not. Either
the device driver is first to remove the mirror from the list or it is the
mm teardown but this is lock protected so only one thread can do it.

The issue you pointed is really about decoupling the lifetime of the mirror
context (ie hardware thread that use the mirror) and the lifetime of the
structure that embedded the hmm_mirror struct. The device driver will care
about the second while everything else will only really care about the
first. The second tells you when you know for sure that there will be no
more callback to your device driver code. The first only tells you that
there should be no more activity associated with that mirror but some thread
might still hold a reference on the underlying struct.


Hope this clarify design and motivation behind the hmm_mirror vs hmm struct
lifetime.


Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-09 15:56         ` Jerome Glisse
@ 2015-06-10  3:33           ` Mark Hairgrove
  2015-06-10 15:42             ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-10  3:33 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar, linux-rdma

[-- Attachment #1: Type: text/plain, Size: 7344 bytes --]



On Tue, 9 Jun 2015, Jerome Glisse wrote:

> On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote:
> > Can you clarify how that's different from mmu_notifiers? Those are also
> > embedded into a driver-owned struct.
> 
> For HMM you want to be able to kill a mirror from HMM, you might have kernel
> thread call inside HMM with a mirror (outside any device file lifetime) ...
> The mirror is not only use at register & unregister, there is a lot more thing
> you can call using the HMM mirror struct.
> 
> So the HMM mirror lifetime as a result is more complex, it can not simply be
> free from the mmu_notifier_release callback or randomly. It needs to be
> refcounted.

Sure, there are driver -> HMM calls like hmm_mirror_fault that 
mmu_notifiers don't have, but I don't understand why that fundamentally 
makes HMM mirror lifetimes more complex. Decoupling hmm_mirror lifetime 
from mm lifetime adds complexity too.

> The mmu_notifier code assume that the mmu_notifier struct is
> embedded inside a struct that has a lifetime properly synchronize with the
> mm. For HMM mirror this is not something that sounds like a good idea as there
> is too many way to get it wrong.

What kind of synchronization with the mm are you referring to here? 
Clients of mmu_notifiers don't have to do anything as far as I know. 
They're guaranteed that the mm won't go away because each registered 
notifier bumps mm_count.

> So idea of HMM mirror is that it can out last the mm lifetime but the HMM
> struct can not. So you have hmm_mirror <~> hmm <-> mm and the mirror can be
> "unlink" and have different lifetime from the hmm that itself has same life
> time as mm.

Per the earlier discussion hmm_mirror_destroy is missing a call to 
hmm_unref. If that's added back I don't understand how the mirror can 
persist past the hmm struct. The mirror can be unlinked from hmm's list, 
yes, but that doesn't mean that hmm/mm can be torn down. The hmm/mm 
structs will stick around until hmm_destroy since that does the 
mmu_notifier_unregister. hmm_destroy can't be called until the last 
hmm_mirror_destroy.

Doesn't that mean that hmm/mm are guaranteed to be allocated until the 
last hmm_mirror_unregister? That sounds like a good guarantee to make.


> 
> > Is the goal to allow calling hmm_mirror_unregister from within the "mm is
> > dying" HMM callback? I don't know whether that's really necessary as long
> > as there's some association between the driver files and the mirrors.
> 
> No this is not a goal and i actualy forbid that.
> 
> > 
> > > > If so, I think there's a race here in the case of mm teardown happening
> > > > concurrently with hmm_mirror_unregister.
> > > >
> > > > [...]
> > > >
> > > > Do you agree that this sequence can happen, or am I missing something
> > > > which prevents it?
> > >
> > > Can't happen because child have mm->hmm = NULL ie only one hmm per mm
> > > and hmm is tie to only one mm. It is the responsability of the device
> > > driver to make sure same apply to private reference to the hmm mirror
> > > struct ie hmm_mirror should never be tie to a private file struct.
> > 
> > It's useful for the driver to have some association between files and
> > mirrors. If the file is closed prior to process exit we would like to
> > unregister the mirror, otherwise it will persist until process teardown.
> > The association doesn't have to be 1:1 but having the files ref count the
> > mirror or something would be useful.
> 
> This is allowed, i might have put strong word here, but you can associate
> with a file struct. What you can not do is use the mirror from a different
> process ie one with a different mm struct as mirror is linked to a single
> mm. So on fork there is no callback to update the private file struct, when
> the device file is duplicated (well just refcount inc) against a different
> process. This is something you need to be carefull in your driver. Inside
> the dummy driver i abuse that to actually test proper behavior of HMM but
> it should not be use as an example.

So to confirm, on all file operations from user space the driver is 
expected to check that current->mm matches the mm associated with the 
struct file's hmm_mirror?

On file->release the driver still ought to call hmm_mirror_unregister 
regardless of whether the mms match, otherwise we'll never tear down the 
mirror. That means we're not saved from the race condition because 
hmm_mirror_unregister can happen in one thread while hmm_notifier_release 
might be happening in another thread.


> > 
> > But even if we assume no association at all between files and mirrors, are
> > you sure that prevents the race? The driver may choose to unregister the
> > hmm_device at any point once its files are closed. In the case of module
> > unload the device unregister can't be prevented. If mm teardown hasn't
> > happened yet mirrors may still be active and registered on that
> > hmm_device. The driver thus has to first call hmm_mirror_unregister on all
> > active mirrors, then call hmm_device_unregister. mm teardown of those
> > mirrors may trigger at any point in this sequence, so we're right back to
> > that race.
> 
> So when device driver unload the first thing it needs to do is kill all of
> its context ie all of its HMM mirror (unregister them) by doing so it will
> make sure that there can be no more call to any of its functions.

When is the driver expected to call hmm_mirror_unregister? Is it file 
close, module unload, or some other time?

If it's file close, there's no need to unregister anything on module 
unload because the files were all closed already.

If it's module unload, then the mirrors and mms all get leaked until that 
point.

We're exposed to the race in both cases.

> 
> The race with mm teardown does not exist as what matter for mm teardown is
> the fact that the mirror is on the struct hmm mirrors list or not. Either
> the device driver is first to remove the mirror from the list or it is the
> mm teardown but this is lock protected so only one thread can do it.
> 

Agreed, removing the mirror from the list is not a "race" in the classical 
sense. The true race is between hmm_notifier_release's device mutex_unlock 
(process exit) and post-hmm_device_unregister device mutex free (driver 
close/unload). What I meant is that in order to expose that race you first 
need one thread to call hmm_mirror_unregister while another thread is in 
hmm_notifier_release.

Regardless of where hmm_mirror_unregister is called (file close, module 
unload, etc) it can happen concurrently with hmm_notifier_release so we're 
exposed to this race.


> The issue you pointed is really about decoupling the lifetime of the mirror
> context (ie hardware thread that use the mirror) and the lifetime of the
> structure that embedded the hmm_mirror struct. The device driver will care
> about the second while everything else will only really care about the
> first. The second tells you when you know for sure that there will be no
> more callback to your device driver code. The first only tells you that
> there should be no more activity associated with that mirror but some thread
> might still hold a reference on the underlying struct.
> 
> 
> Hope this clarify design and motivation behind the hmm_mirror vs hmm struct
> lifetime.
> 
> 
> Cheers,
> Jerome
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-10  3:33           ` Mark Hairgrove
@ 2015-06-10 15:42             ` Jerome Glisse
  2015-06-11  1:15               ` Mark Hairgrove
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-10 15:42 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar, linux-rdma

On Tue, Jun 09, 2015 at 08:33:12PM -0700, Mark Hairgrove wrote:
> 
> 
> On Tue, 9 Jun 2015, Jerome Glisse wrote:
> 
> > On Mon, Jun 08, 2015 at 06:54:29PM -0700, Mark Hairgrove wrote:
> > > Can you clarify how that's different from mmu_notifiers? Those are also
> > > embedded into a driver-owned struct.
> > 
> > For HMM you want to be able to kill a mirror from HMM, you might have kernel
> > thread call inside HMM with a mirror (outside any device file lifetime) ...
> > The mirror is not only use at register & unregister, there is a lot more thing
> > you can call using the HMM mirror struct.
> > 
> > So the HMM mirror lifetime as a result is more complex, it can not simply be
> > free from the mmu_notifier_release callback or randomly. It needs to be
> > refcounted.
> 
> Sure, there are driver -> HMM calls like hmm_mirror_fault that 
> mmu_notifiers don't have, but I don't understand why that fundamentally 
> makes HMM mirror lifetimes more complex. Decoupling hmm_mirror lifetime 
> from mm lifetime adds complexity too.

Driver->HMM calls can happen from random kernel thread thus you need to
garanty that hmm_mirror can not go away. More over you can have CPU MM
call into HMM outside of mmu_notifier. Basicly you can get to HMM code
by many different code path, unlike any of the current mmu_notifier.

So refcounting is necessary as otherwise the device driver might decide
to unregister and free the mirror while some other kernel thread is
about to dereference the exact same mirror. Synchronization with mmu
notifier srcu will not be enough in the case of page fault on remote
memory for instance. Other case too.

> 
> > The mmu_notifier code assume that the mmu_notifier struct is
> > embedded inside a struct that has a lifetime properly synchronize with the
> > mm. For HMM mirror this is not something that sounds like a good idea as there
> > is too many way to get it wrong.
> 
> What kind of synchronization with the mm are you referring to here? 
> Clients of mmu_notifiers don't have to do anything as far as I know. 
> They're guaranteed that the mm won't go away because each registered 
> notifier bumps mm_count.

So for all current user afaict (kvm, xen, intel, radeon) tie to a file,
(sgi gru) tie to vma, (iommu) tie to mm. Which means this is all properly
synchronize with lifetime of mm (ignoring the fork case).

The struct that is tie to mmu_notifier for all of them can be accessed
only through one code path (ioctl for most of them).

> 
> > So idea of HMM mirror is that it can out last the mm lifetime but the HMM
> > struct can not. So you have hmm_mirror <~> hmm <-> mm and the mirror can be
> > "unlink" and have different lifetime from the hmm that itself has same life
> > time as mm.
> 
> Per the earlier discussion hmm_mirror_destroy is missing a call to 
> hmm_unref. If that's added back I don't understand how the mirror can 
> persist past the hmm struct. The mirror can be unlinked from hmm's list, 
> yes, but that doesn't mean that hmm/mm can be torn down. The hmm/mm 
> structs will stick around until hmm_destroy since that does the 
> mmu_notifier_unregister. hmm_destroy can't be called until the last 
> hmm_mirror_destroy.
> 
> Doesn't that mean that hmm/mm are guaranteed to be allocated until the 
> last hmm_mirror_unregister? That sounds like a good guarantee to make.

Like said, just ignore current code it is utterly broken in so many way
when it comes to lifetime. I screw that part badly when reworking the
patchset, i was focusing on other part.

I fixed that in my tree, i am waiting for more review on other part as
anyway the lifetime thing is easy to rework/fix.

http://cgit.freedesktop.org/~glisse/linux/log/?h=hmm

[...]
> > > > > If so, I think there's a race here in the case of mm teardown happening
> > > > > concurrently with hmm_mirror_unregister.
> > > > >
> > > > > [...]
> > > > >
> > > > > Do you agree that this sequence can happen, or am I missing something
> > > > > which prevents it?
> > > >
> > > > Can't happen because child have mm->hmm = NULL ie only one hmm per mm
> > > > and hmm is tie to only one mm. It is the responsability of the device
> > > > driver to make sure same apply to private reference to the hmm mirror
> > > > struct ie hmm_mirror should never be tie to a private file struct.
> > > 
> > > It's useful for the driver to have some association between files and
> > > mirrors. If the file is closed prior to process exit we would like to
> > > unregister the mirror, otherwise it will persist until process teardown.
> > > The association doesn't have to be 1:1 but having the files ref count the
> > > mirror or something would be useful.
> > 
> > This is allowed, i might have put strong word here, but you can associate
> > with a file struct. What you can not do is use the mirror from a different
> > process ie one with a different mm struct as mirror is linked to a single
> > mm. So on fork there is no callback to update the private file struct, when
> > the device file is duplicated (well just refcount inc) against a different
> > process. This is something you need to be carefull in your driver. Inside
> > the dummy driver i abuse that to actually test proper behavior of HMM but
> > it should not be use as an example.
> 
> So to confirm, on all file operations from user space the driver is 
> expected to check that current->mm matches the mm associated with the 
> struct file's hmm_mirror?

Well you might have a valid usecase for that, just be aware that
anything your driver do with the hmm_mirror will actually impact
the mm of the parent. Which i assume is not what you want.

I would actualy thought that what you want is having a way to find
hmm_mirror using both device file & mm as a key. Otherwise you can
not really use HMM with process that like to fork themself. Which
is a valid usecase to me. For instance process start using HMM
through your driver, decide to fork itself and to also use HMM
through your driver inside its child.

> 
> On file->release the driver still ought to call hmm_mirror_unregister 
> regardless of whether the mms match, otherwise we'll never tear down the 
> mirror. That means we're not saved from the race condition because 
> hmm_mirror_unregister can happen in one thread while hmm_notifier_release 
> might be happening in another thread.

Again there is no race the mirror list is the synchronization point and
it is protected by a lock. So either hmm_mirror_unregister() wins or the
other thread hmm_notifier_release()

> > > But even if we assume no association at all between files and mirrors, are
> > > you sure that prevents the race? The driver may choose to unregister the
> > > hmm_device at any point once its files are closed. In the case of module
> > > unload the device unregister can't be prevented. If mm teardown hasn't
> > > happened yet mirrors may still be active and registered on that
> > > hmm_device. The driver thus has to first call hmm_mirror_unregister on all
> > > active mirrors, then call hmm_device_unregister. mm teardown of those
> > > mirrors may trigger at any point in this sequence, so we're right back to
> > > that race.
> > 
> > So when device driver unload the first thing it needs to do is kill all of
> > its context ie all of its HMM mirror (unregister them) by doing so it will
> > make sure that there can be no more call to any of its functions.
> 
> When is the driver expected to call hmm_mirror_unregister? Is it file 
> close, module unload, or some other time?
> 
> If it's file close, there's no need to unregister anything on module 
> unload because the files were all closed already.
> 
> If it's module unload, then the mirrors and mms all get leaked until that 
> point.
> 
> We're exposed to the race in both cases.

You unregister as soon as you want, it is up to your driver to do it,
i do not enforce anything. The only thing i enforce is that you can
not unregister the hmm device driver before all mirror are unregistered
and free.

So yes for device driver you want to unregister when device file is
close (which happens when the process exit).

In both cases there is no race as explain above.

> 
> > 
> > The race with mm teardown does not exist as what matter for mm teardown is
> > the fact that the mirror is on the struct hmm mirrors list or not. Either
> > the device driver is first to remove the mirror from the list or it is the
> > mm teardown but this is lock protected so only one thread can do it.
> > 
> 
> Agreed, removing the mirror from the list is not a "race" in the classical 
> sense. The true race is between hmm_notifier_release's device mutex_unlock 
> (process exit) and post-hmm_device_unregister device mutex free (driver 
> close/unload). What I meant is that in order to expose that race you first 
> need one thread to call hmm_mirror_unregister while another thread is in 
> hmm_notifier_release.
> 
> Regardless of where hmm_mirror_unregister is called (file close, module 
> unload, etc) it can happen concurrently with hmm_notifier_release so we're 
> exposed to this race.

There is no race here, the mirror struct will only be freed once as again
the list is a synchronization point. Whoever remove the mirror from the
list is responsible to drop the list reference.

In the fixed code the only thing that will happen twice is the ->release()
callback. Even that can be work around to garanty it is call only once.

Anyway i do not see anyrace here.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-10 15:42             ` Jerome Glisse
@ 2015-06-11  1:15               ` Mark Hairgrove
  2015-06-11 14:23                 ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-11  1:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar, linux-rdma



On Wed, 10 Jun 2015, Jerome Glisse wrote:

> [...]
> 
> Like said, just ignore current code it is utterly broken in so many way
> when it comes to lifetime. I screw that part badly when reworking the
> patchset, i was focusing on other part.
> 
> I fixed that in my tree, i am waiting for more review on other part as
> anyway the lifetime thing is easy to rework/fix.
> 
> http://cgit.freedesktop.org/~glisse/linux/log/?h=hmm
> 

Ok, I'm working through the other patches so I'll check the updates out 
once I've made it through. My primary interest in this discussion is 
making sure we know the plan for mirror and device lifetimes.


> > So to confirm, on all file operations from user space the driver is 
> > expected to check that current->mm matches the mm associated with the 
> > struct file's hmm_mirror?
> 
> Well you might have a valid usecase for that, just be aware that
> anything your driver do with the hmm_mirror will actually impact
> the mm of the parent. Which i assume is not what you want.
> 
> I would actualy thought that what you want is having a way to find
> hmm_mirror using both device file & mm as a key. Otherwise you can
> not really use HMM with process that like to fork themself. Which
> is a valid usecase to me. For instance process start using HMM
> through your driver, decide to fork itself and to also use HMM
> through your driver inside its child.

Agreed, that sounds reasonable, and the use case is valid. I was digging 
into this to make sure we don't prevent that.


> > 
> > On file->release the driver still ought to call hmm_mirror_unregister 
> > regardless of whether the mms match, otherwise we'll never tear down the 
> > mirror. That means we're not saved from the race condition because 
> > hmm_mirror_unregister can happen in one thread while hmm_notifier_release 
> > might be happening in another thread.
> 
> Again there is no race the mirror list is the synchronization point and
> it is protected by a lock. So either hmm_mirror_unregister() wins or the
> other thread hmm_notifier_release()

Yes, I agree. That's not the race I'm worried about. I'm worried about a 
race on the device lifetime, but in order to hit that one first 
hmm_notifier_release must take the lock and remove the mirror from the 
list before hmm_mirror_unregister does it. That's why I brought it up.


> 
> You unregister as soon as you want, it is up to your driver to do it,
> i do not enforce anything. The only thing i enforce is that you can
> not unregister the hmm device driver before all mirror are unregistered
> and free.
> 
> So yes for device driver you want to unregister when device file is
> close (which happens when the process exit).

Sounds good.


> 
> There is no race here, the mirror struct will only be freed once as again
> the list is a synchronization point. Whoever remove the mirror from the
> list is responsible to drop the list reference.
> 
> In the fixed code the only thing that will happen twice is the ->release()
> callback. Even that can be work around to garanty it is call only once.
> 
> Anyway i do not see anyrace here.
> 

The mirror lifetime is fine. The problem I see is with the device lifetime 
on a multi-core system. Imagine this sequence:

- On CPU1 the mm associated with the mirror is going down
- On CPU2 the driver unregisters the mirror then the device

When this happens, the last device mutex_unlock on CPU1 is the only thing 
preventing the free of the device in CPU2. That doesn't work, as described 
in this thread: https://lkml.org/lkml/2013/12/2/997

Here's the full sequence again with mutex_unlock split apart. Hopefully 
this shows the device_unregister problem more clearly:

CPU1 (mm release)                   CPU2 (driver)
----------------------              ----------------------
hmm_notifier_release
  down_write(&hmm->rwsem);
  hlist_del_init(&mirror->mlist);
  up_write(&hmm->rwsem);

  // CPU1 thread is preempted or 
  // something
                                    hmm_mirror_unregister
                                      hmm_mirror_kill
                                        down_write(&hmm->rwsem);
                                        // mirror removed by CPU1 already
                                        // so hlist_unhashed returns 1
                                        up_write(&hmm->rwsem);

                                      hmm_mirror_unref(&mirror);
                                      // Mirror ref now 1

                                      // CPU2 thread is preempted or
                                      // something
// CPU1 thread is scheduled

hmm_mirror_unref(&mirror);
  // Mirror ref now 0, cleanup
  hmm_mirror_destroy(&mirror)
    mutex_lock(&device->mutex);
    list_del_init(&mirror->dlist);
    device->ops->release(mirror);
      kfree(mirror);
                                      // CPU2 thread is scheduled, now
                                      // both CPU1 and CPU2 are running

                                    hmm_device_unregister
                                      mutex_lock(&device->mutex);
                                        mutex_optimistic_spin()
    mutex_unlock(&device->mutex);
      [...]
      __mutex_unlock_common_slowpath
        // CPU2 releases lock
        atomic_set(&lock->count, 1);
                                          // Spinning CPU2 acquires now-
                                          // free lock
                                      // mutex_lock returns
                                      // Device list empty
                                      mutex_unlock(&device->mutex);
                                      return 0;
                                    kfree(hmm_device);
        // CPU1 still accessing 
        // hmm_device->mutex in 
        //__mutex_unlock_common_slowpath

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-11  1:15               ` Mark Hairgrove
@ 2015-06-11 14:23                 ` Jerome Glisse
  2015-06-11 22:26                   ` Mark Hairgrove
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-11 14:23 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar, linux-rdma

On Wed, Jun 10, 2015 at 06:15:08PM -0700, Mark Hairgrove wrote:

[...]
> > There is no race here, the mirror struct will only be freed once as again
> > the list is a synchronization point. Whoever remove the mirror from the
> > list is responsible to drop the list reference.
> > 
> > In the fixed code the only thing that will happen twice is the ->release()
> > callback. Even that can be work around to garanty it is call only once.
> > 
> > Anyway i do not see anyrace here.
> > 
> 
> The mirror lifetime is fine. The problem I see is with the device lifetime 
> on a multi-core system. Imagine this sequence:
> 
> - On CPU1 the mm associated with the mirror is going down
> - On CPU2 the driver unregisters the mirror then the device
> 
> When this happens, the last device mutex_unlock on CPU1 is the only thing 
> preventing the free of the device in CPU2. That doesn't work, as described 
> in this thread: https://lkml.org/lkml/2013/12/2/997
> 
> Here's the full sequence again with mutex_unlock split apart. Hopefully 
> this shows the device_unregister problem more clearly:
> 
> CPU1 (mm release)                   CPU2 (driver)
> ----------------------              ----------------------
> hmm_notifier_release
>   down_write(&hmm->rwsem);
>   hlist_del_init(&mirror->mlist);
>   up_write(&hmm->rwsem);
> 
>   // CPU1 thread is preempted or 
>   // something
>                                     hmm_mirror_unregister
>                                       hmm_mirror_kill
>                                         down_write(&hmm->rwsem);
>                                         // mirror removed by CPU1 already
>                                         // so hlist_unhashed returns 1
>                                         up_write(&hmm->rwsem);
> 
>                                       hmm_mirror_unref(&mirror);
>                                       // Mirror ref now 1
> 
>                                       // CPU2 thread is preempted or
>                                       // something
> // CPU1 thread is scheduled
> 
> hmm_mirror_unref(&mirror);
>   // Mirror ref now 0, cleanup
>   hmm_mirror_destroy(&mirror)
>     mutex_lock(&device->mutex);
>     list_del_init(&mirror->dlist);
>     device->ops->release(mirror);
>       kfree(mirror);
>                                       // CPU2 thread is scheduled, now
>                                       // both CPU1 and CPU2 are running
> 
>                                     hmm_device_unregister
>                                       mutex_lock(&device->mutex);
>                                         mutex_optimistic_spin()
>     mutex_unlock(&device->mutex);
>       [...]
>       __mutex_unlock_common_slowpath
>         // CPU2 releases lock
>         atomic_set(&lock->count, 1);
>                                           // Spinning CPU2 acquires now-
>                                           // free lock
>                                       // mutex_lock returns
>                                       // Device list empty
>                                       mutex_unlock(&device->mutex);
>                                       return 0;
>                                     kfree(hmm_device);
>         // CPU1 still accessing 
>         // hmm_device->mutex in 
>         //__mutex_unlock_common_slowpath

Ok i see the race you are afraid of and really it is an unlikely one
__mutex_unlock_common_slowpath() take a spinlock right after allowing
other to take the mutex, when we are in your scenario there is no
contention on that spinlock so it is taken right away and as there
is no one in the mutex wait list then it goes directly to unlock the
spinlock and return. You can ignore the debug function as if debugging
is enabled than the mutex_lock() would need to also take the spinlock
and thus you would have proper synchronization btw 2 thread thanks to
the mutex.wait_lock.

So basicly while CPU1 is going :
spin_lock(mutex.wait_lock)
if (!list_empty(mutex.wait_list)) {
  // wait_list is empty so branch not taken
}
spin_unlock(mutex.wait_lock)

CPU2 would have to test the mirror list and mutex_unlock and return
before the spin_unlock() of CPU1. This is a tight race, i can add a
synchronize_rcu() to device_unregister after the mutex_unlock() so
that we also add a grace period before the device is potentialy freed
which should make that race completely unlikely.

Moreover for something really bad to happen it would need that the
freed memory to be reallocated right away by some other thread. Which
really sound unlikely unless CPU1 is the slowest of all :)

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-11 14:23                 ` Jerome Glisse
@ 2015-06-11 22:26                   ` Mark Hairgrove
  2015-06-15 14:32                     ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-11 22:26 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar, linux-rdma

[-- Attachment #1: Type: text/plain, Size: 6383 bytes --]



On Thu, 11 Jun 2015, Jerome Glisse wrote:

> On Wed, Jun 10, 2015 at 06:15:08PM -0700, Mark Hairgrove wrote:
> 
> [...]
> > > There is no race here, the mirror struct will only be freed once as again
> > > the list is a synchronization point. Whoever remove the mirror from the
> > > list is responsible to drop the list reference.
> > > 
> > > In the fixed code the only thing that will happen twice is the ->release()
> > > callback. Even that can be work around to garanty it is call only once.
> > > 
> > > Anyway i do not see anyrace here.
> > > 
> > 
> > The mirror lifetime is fine. The problem I see is with the device lifetime 
> > on a multi-core system. Imagine this sequence:
> > 
> > - On CPU1 the mm associated with the mirror is going down
> > - On CPU2 the driver unregisters the mirror then the device
> > 
> > When this happens, the last device mutex_unlock on CPU1 is the only thing 
> > preventing the free of the device in CPU2. That doesn't work, as described 
> > in this thread: https://lkml.org/lkml/2013/12/2/997
> > 
> > Here's the full sequence again with mutex_unlock split apart. Hopefully 
> > this shows the device_unregister problem more clearly:
> > 
> > CPU1 (mm release)                   CPU2 (driver)
> > ----------------------              ----------------------
> > hmm_notifier_release
> >   down_write(&hmm->rwsem);
> >   hlist_del_init(&mirror->mlist);
> >   up_write(&hmm->rwsem);
> > 
> >   // CPU1 thread is preempted or 
> >   // something
> >                                     hmm_mirror_unregister
> >                                       hmm_mirror_kill
> >                                         down_write(&hmm->rwsem);
> >                                         // mirror removed by CPU1 already
> >                                         // so hlist_unhashed returns 1
> >                                         up_write(&hmm->rwsem);
> > 
> >                                       hmm_mirror_unref(&mirror);
> >                                       // Mirror ref now 1
> > 
> >                                       // CPU2 thread is preempted or
> >                                       // something
> > // CPU1 thread is scheduled
> > 
> > hmm_mirror_unref(&mirror);
> >   // Mirror ref now 0, cleanup
> >   hmm_mirror_destroy(&mirror)
> >     mutex_lock(&device->mutex);
> >     list_del_init(&mirror->dlist);
> >     device->ops->release(mirror);
> >       kfree(mirror);
> >                                       // CPU2 thread is scheduled, now
> >                                       // both CPU1 and CPU2 are running
> > 
> >                                     hmm_device_unregister
> >                                       mutex_lock(&device->mutex);
> >                                         mutex_optimistic_spin()
> >     mutex_unlock(&device->mutex);
> >       [...]
> >       __mutex_unlock_common_slowpath
> >         // CPU2 releases lock
> >         atomic_set(&lock->count, 1);
> >                                           // Spinning CPU2 acquires now-
> >                                           // free lock
> >                                       // mutex_lock returns
> >                                       // Device list empty
> >                                       mutex_unlock(&device->mutex);
> >                                       return 0;
> >                                     kfree(hmm_device);
> >         // CPU1 still accessing 
> >         // hmm_device->mutex in 
> >         //__mutex_unlock_common_slowpath
> 
> Ok i see the race you are afraid of and really it is an unlikely one
> __mutex_unlock_common_slowpath() take a spinlock right after allowing
> other to take the mutex, when we are in your scenario there is no
> contention on that spinlock so it is taken right away and as there
> is no one in the mutex wait list then it goes directly to unlock the
> spinlock and return. You can ignore the debug function as if debugging
> is enabled than the mutex_lock() would need to also take the spinlock
> and thus you would have proper synchronization btw 2 thread thanks to
> the mutex.wait_lock.
> 
> So basicly while CPU1 is going :
> spin_lock(mutex.wait_lock)
> if (!list_empty(mutex.wait_list)) {
>   // wait_list is empty so branch not taken
> }
> spin_unlock(mutex.wait_lock)
> 
> CPU2 would have to test the mirror list and mutex_unlock and return
> before the spin_unlock() of CPU1. This is a tight race, i can add a
> synchronize_rcu() to device_unregister after the mutex_unlock() so
> that we also add a grace period before the device is potentialy freed
> which should make that race completely unlikely.
> 
> Moreover for something really bad to happen it would need that the
> freed memory to be reallocated right away by some other thread. Which
> really sound unlikely unless CPU1 is the slowest of all :)
> 
> Cheers,
> Jerome
> 

But CPU1 could get preempted between the atomic_set and the 
spin_lock_mutex, and then it doesn't matter whether or not a grace period 
has elapsed before CPU2 proceeds.

Making race conditions less likely just makes them harder to pinpoint when 
they inevitably appear in the wild. I don't think it makes sense to spend 
any effort in making a race condition less likely, and that thread I 
referenced (https://lkml.org/lkml/2013/12/2/997) is fairly strong evidence 
that fixing this race actually matters. So, I think this race condition 
really needs to be fixed.

One fix is for hmm_mirror_unregister to wait for hmm_notifier_release 
completion between hmm_mirror_kill and hmm_mirror_unref. It can do this by 
calling synchronize_srcu() on the mmu_notifier's srcu. This has the 
benefit that the driver is guaranteed not to get the "mm is dead" callback 
after hmm_mirror_unregister returns.

In fact, are there any callbacks on the mirror that can arrive after 
hmm_mirror_unregister? If so, how will hmm_device_unregister solve them?

>From a general standpoint, hmm_device_unregister must perform some kind of 
synchronization to be sure that all mirrors are completely released and 
done and no new callbacks will trigger. Since that has to be true, can't 
that synchronization be moved into hmm_mirror_unregister instead?

If that happens there's no need for a "mirror can be freed" ->release 
callback at all because the driver is guaranteed that a mirror is done 
after hmm_mirror_unregister.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
  2015-06-11 22:26                   ` Mark Hairgrove
@ 2015-06-15 14:32                     ` Jerome Glisse
  0 siblings, 0 replies; 80+ messages in thread
From: Jerome Glisse @ 2015-06-15 14:32 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar, linux-rdma

On Thu, Jun 11, 2015 at 03:26:46PM -0700, Mark Hairgrove wrote:
> On Thu, 11 Jun 2015, Jerome Glisse wrote:
> > On Wed, Jun 10, 2015 at 06:15:08PM -0700, Mark Hairgrove wrote:

[...]
> > Ok i see the race you are afraid of and really it is an unlikely one
> > __mutex_unlock_common_slowpath() take a spinlock right after allowing
> > other to take the mutex, when we are in your scenario there is no
> > contention on that spinlock so it is taken right away and as there
> > is no one in the mutex wait list then it goes directly to unlock the
> > spinlock and return. You can ignore the debug function as if debugging
> > is enabled than the mutex_lock() would need to also take the spinlock
> > and thus you would have proper synchronization btw 2 thread thanks to
> > the mutex.wait_lock.
> > 
> > So basicly while CPU1 is going :
> > spin_lock(mutex.wait_lock)
> > if (!list_empty(mutex.wait_list)) {
> >   // wait_list is empty so branch not taken
> > }
> > spin_unlock(mutex.wait_lock)
> > 
> > CPU2 would have to test the mirror list and mutex_unlock and return
> > before the spin_unlock() of CPU1. This is a tight race, i can add a
> > synchronize_rcu() to device_unregister after the mutex_unlock() so
> > that we also add a grace period before the device is potentialy freed
> > which should make that race completely unlikely.
> > 
> > Moreover for something really bad to happen it would need that the
> > freed memory to be reallocated right away by some other thread. Which
> > really sound unlikely unless CPU1 is the slowest of all :)
> > 
> > Cheers,
> > Jerome
> > 
> 
> But CPU1 could get preempted between the atomic_set and the 
> spin_lock_mutex, and then it doesn't matter whether or not a grace period 
> has elapsed before CPU2 proceeds.
> 
> Making race conditions less likely just makes them harder to pinpoint when 
> they inevitably appear in the wild. I don't think it makes sense to spend 
> any effort in making a race condition less likely, and that thread I 
> referenced (https://lkml.org/lkml/2013/12/2/997) is fairly strong evidence 
> that fixing this race actually matters. So, I think this race condition 
> really needs to be fixed.
> 
> One fix is for hmm_mirror_unregister to wait for hmm_notifier_release 
> completion between hmm_mirror_kill and hmm_mirror_unref. It can do this by 
> calling synchronize_srcu() on the mmu_notifier's srcu. This has the 
> benefit that the driver is guaranteed not to get the "mm is dead" callback 
> after hmm_mirror_unregister returns.
> 
> In fact, are there any callbacks on the mirror that can arrive after 
> hmm_mirror_unregister? If so, how will hmm_device_unregister solve them?
> 
> From a general standpoint, hmm_device_unregister must perform some kind of 
> synchronization to be sure that all mirrors are completely released and 
> done and no new callbacks will trigger. Since that has to be true, can't 
> that synchronization be moved into hmm_mirror_unregister instead?
> 
> If that happens there's no need for a "mirror can be freed" ->release 
> callback at all because the driver is guaranteed that a mirror is done 
> after hmm_mirror_unregister.

Well there is no need or 2 callback (relase|stop , free) just one, the
release|stop that is needed. I kind of went halfway last week on this.
I will probably rework that a little to keep just one call and rely on
driver to call hmm_mirror_unregister()

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/36] HMM: add HMM page table v2.
  2015-05-21 19:31 ` [PATCH 06/36] HMM: add HMM page table v2 j.glisse
@ 2015-06-19  2:06   ` Mark Hairgrove
  2015-06-19 18:07     ` Jerome Glisse
  2015-06-25 22:57   ` Mark Hairgrove
  1 sibling, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-19  2:06 UTC (permalink / raw)
  To: j.glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

[-- Attachment #1: Type: text/plain, Size: 11403 bytes --]



On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> From: JA(C)rA'me Glisse <jglisse@redhat.com>
> 
> Heterogeneous memory management main purpose is to mirror a process address.
> To do so it must maintain a secondary page table that is use by the device
> driver to program the device or build a device specific page table.
> 
> Radix tree can not be use to create this secondary page table because HMM
> needs more flags than RADIX_TREE_MAX_TAGS (while this can be increase we
> believe HMM will require so much flags that cost will becomes prohibitive
> to others users of radix tree).
> 
> Moreover radix tree is built around long but for HMM we need to store dma
> address and on some platform sizeof(dma_addr_t) > sizeof(long). Thus radix
> tree is unsuitable to fulfill HMM requirement hence why we introduce this
> code which allows to create page table that can grow and shrink dynamicly.
> 
> The design is very clause to CPU page table as it reuse some of the feature

s/clause/close

> such as spinlock embedded in struct page.
> 
> Changed since v1:
>   - Use PAGE_SHIFT as shift value to reserve low bit for private device
>     specific flags. This is to allow device driver to use and some of the
>     lower bits for their own device specific purpose.
>   - Add set of helper for atomically clear, setting and testing bit on
>     dma_addr_t pointer. Atomicity being usefull only for dirty bit.
>   - Differentiate btw DMA mapped entry and non mapped entry (pfn).
>   - Split page directory entry and page table entry helpers.
> 
> Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
> ---
>  MAINTAINERS            |   2 +
>  include/linux/hmm_pt.h | 380 +++++++++++++++++++++++++++++++++++++++++++
>  mm/Makefile            |   2 +-
>  mm/hmm_pt.c            | 425 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 808 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/hmm_pt.h
>  create mode 100644 mm/hmm_pt.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2f2a2be..8cd0aa7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4736,6 +4736,8 @@ L:	linux-mm@kvack.org
>  S:	Maintained
>  F:	mm/hmm.c
>  F:	include/linux/hmm.h
> +F:	mm/hmm_pt.c
> +F:	include/linux/hmm_pt.h
>  
>  HOST AP DRIVER
>  M:	Jouni Malinen <j@w1.fi>
> diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
> new file mode 100644
> index 0000000..330edb2
> --- /dev/null
> +++ b/include/linux/hmm_pt.h
> @@ -0,0 +1,380 @@
> +/*
> [...]
> +
> +static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
> +{
> +	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
> +}
> +
> +static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
> +{
> +	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
> +}
> +

Does hmm_pde_pfn return a dma_addr_t pfn or a system memory pfn?

The types between these two functions don't match. According to 
hmm_pde_from_pfn, both the pde and the pfn are supposed to be dma_addr_t. 
But hmm_pde_pfn returns an unsigned long as a pfn instead of a dma_addr_t. 
If hmm_pde_pfn sometimes used to get a dma_addr_t pfn then shouldn't it 
also return a dma_addr_t, since as you pointed out in the commit message, 
dma_addr_t might be bigger than an unsigned long?

> [...]
> +
> +static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
> +{
> +	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
> +}
> +
> +static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
> +{
> +	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
> +}

Same question as hmm_pde_pfn above.


> [...]
> +/* struct hmm_pt_iter - page table iterator states.
> + *
> + * @ptd: Array of directory struct page pointer for each levels.
> + * @ptdp: Array of pointer to mapped directory levels.
> + * @dead_directories: List of directories that died while walking page table.
> + * @cur: Current address.
> + */
> +struct hmm_pt_iter {
> +	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
> +	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];

These are sized to be HMM_PT_MAX_LEVEL - 1 rather than HMM_PT_MAX_LEVEL 
because the iterator doesn't store the top level, correct? This results in 
a lot of "level - 1" and "level - 2" logic when dealing with the iterator. 
Have you considered keeping the levels consistent to get rid of all the 
extra offset-by-1 logic?

> +	struct list_head	dead_directories;
> +	unsigned long		cur;
> +};
> [...]
> +
> +/* hmm_pt_protect_directory_unref() - reference a directory.

s/unref/ref

> + *
> + * @iter: Iterator states that currently protect the directory.
> + * @level: Level of the directory to reference.
> + *
> + * This function will reference a directory but it is illegal for refcount to
> + * be 0 as this helper should only be call when iterator is protecting the
> + * directory (ie iterator hold a reference for the directory).
> + *
> + * HMM user will call this with level = pt.llevel any other value is supicious
> + * outside of hmm_pt code.
> + */
> +static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter,
> +					     char level)
> +{
> +	/* Nothing to do for root level. */
> +	if (!level)
> +		return;
> +
> +	if (!atomic_inc_not_zero(&iter->ptd[level - 1]->_mapcount))
> +		/* Illegal this should not happen. */
> +		BUG();
> +}
> +
> [...]
> +
> +int hmm_pt_init(struct hmm_pt *pt)
> +{
> +	unsigned directory_shift, i = 0, npgd;
> +
> +	pt->last &= PAGE_MASK;
> +	spin_lock_init(&pt->lock);
> +	/* Directory shift is the number of bits that a single directory level
> +	 * represent. For instance if PAGE_SIZE is 4096 and each entry takes 8
> +	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
> +	 */
> +	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
> +	/* Level 0 is the root level of the page table. It might use less
> +	 * bits than directory_shift but all sub-directory level will use all
> +	 * directory_shift bits.
> +	 *
> +	 * For instance if hmm_pt.last == (1 << 48), PAGE_SHIFT == 12 and

This example should say that hmm_pt.last == (1 << 48) - 1, since last is 
inclusive. Otherwise llevel will be 4.

> +	 * sizeof(dma_addr_t) == 8 then :
> +	 *   directory_shift = 9
> +	 *   shift[0] = 39
> +	 *   shift[1] = 30
> +	 *   shift[2] = 21
> +	 *   shift[3] = 12
> +	 *   llevel = 3
> +	 *
> +	 * Note that shift[llevel] == PAGE_SHIFT because the last level
> +	 * correspond to the page table entry level (ignoring the case of huge
> +	 * page).
> +	 */
> +	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
> +			directory_shift) + PAGE_SHIFT;
> +	while (pt->shift[i++] > PAGE_SHIFT)
> +		pt->shift[i] = pt->shift[i - 1] - directory_shift;
> +	pt->llevel = i - 1;
> +	pt->directory_mask = (1 << directory_shift) - 1;
> +
> +	for (i = 0; i <= pt->llevel; ++i)
> +		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
> +
> +	npgd = (pt->last >> pt->shift[0]) + 1;
> +	pt->pgd = kzalloc(npgd * sizeof(dma_addr_t), GFP_KERNEL);
> +	if (!pt->pgd)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(hmm_pt_init);

Does this need to be EXPORT_SYMBOL? It seems like a driver would never 
need to call this, only core hmm. Same question for hmm_pt_fini.


> [...]
> +
> +/* hmm_pt_init() - initialize iterator states.

This should say hmm_pt_iter_init.

> + *
> + * @iter: Iterator states.
> + *
> + * This function will initialize iterator states. It must always be pair with a
> + * call to hmm_pt_iter_fini().
> + */
> +void hmm_pt_iter_init(struct hmm_pt_iter *iter)
> +{
> +	memset(iter->ptd, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
> +	memset(iter->ptdp, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));

The memset sizes can simply be sizeof(iter->ptd) and sizeof(iter->ptdp).

> +	INIT_LIST_HEAD(&iter->dead_directories);
> +}
> +EXPORT_SYMBOL(hmm_pt_iter_init);
> +
> +/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
> + *
> + * @iter: Iterator states.
> + * @pt: HMM page table.
> + * @level: Level of the directory to unref.
> + *
> + * This function will unreference a directory and add it to dead list if
> + * directory no longer have any reference. It will also clear the entry to
> + * that directory into the upper level directory as well as dropping ref
> + * on the upper directory.
> + */
> +static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
> +					     struct hmm_pt *pt,
> +					     unsigned level)
> +{
> +	struct page *upper_ptd;
> +	dma_addr_t *upper_ptdp;
> +
> +	/* Nothing to do for root level. */
> +	if (!level)
> +		return;
> +
> +	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
> +		return;
> +
> +	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
> +	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : pt->pgd;
> +	upper_ptdp = &upper_ptdp[hmm_pt_index(pt, iter->cur, level - 1)];
> +	hmm_pt_directory_lock(pt, upper_ptd, level - 1);
> +	/*
> +	 * There might be race btw decrementing reference count on a directory
> +	 * and another thread trying to fault in a new directory. To avoid
> +	 * erasing the new directory entry we need to check that the entry
> +	 * still correspond to the directory we are removing.
> +	 */
> +	if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
> +		*upper_ptdp = 0;
> +	hmm_pt_directory_unlock(pt, upper_ptd, level - 1);
> +
> +	/* Add it to delayed free list. */
> +	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
> +
> +	/*
> +	 * The upper directory is not safe to unref as we have an extra ref and

This should be "IS safe to unref", correct?

> +	 * thus refcount should not reach 0.
> +	 */
> +	hmm_pt_iter_directory_unref(iter, level - 1);
> +}
> [...]
> +/* hmm_pt_iter_fini() - finalize iterator.
> + *
> + * @iter: Iterator states.
> + * @pt: HMM page table.
> + *
> + * This function will cleanup iterator by unmapping and unreferencing any
> + * directory still mapped and referenced. It will also free any dead directory.
> + */
> +void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt)
> +{
> +	struct page *ptd, *tmp;
> +	unsigned i;
> +
> +	for (i = pt->llevel; i >= 1; --i) {
> +		if (!iter->ptd[i - 1])
> +			continue;
> +		hmm_pt_iter_unprotect_directory(iter, pt, i);
> +	}
> +
> +	/* Avoid useless synchronize_rcu() if there is no directory to free. */
> +	if (list_empty(&iter->dead_directories))
> +		return;
> +
> +	/*
> +	 * Some iterator may have dereferenced a dead directory entry and looked
> +	 * up the struct page but haven't check yet the reference count. As all
> +	 * the above happen in rcu read critical section we know that we need
> +	 * to wait for grace period before being able to free any of the dead
> +	 * directory page.
> +	 */
> +	synchronize_rcu();
> +	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
> +		list_del(&ptd->lru);
> +		atomic_set(&ptd->_mapcount, -1);
> +		__free_page(ptd);
> +	}
> +}

If I'm following this correctly, a migrate to the device will allocate HMM 
page tables and the subsequent migrate from the device will free them. 
Assuming that's the case, might thrashing of page allocations be a 
problem? What about keeping the HMM page tables around until the actual 
munmap() of the corresponding VA range?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/36] HMM: add HMM page table v2.
  2015-06-19  2:06   ` Mark Hairgrove
@ 2015-06-19 18:07     ` Jerome Glisse
  2015-06-20  2:34       ` Mark Hairgrove
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-19 18:07 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

On Thu, Jun 18, 2015 at 07:06:08PM -0700, Mark Hairgrove wrote:
> On Thu, 21 May 2015, j.glisse@gmail.com wrote:

[...]
> > +
> > +static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
> > +{
> > +	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
> > +}
> > +
> > +static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
> > +{
> > +	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
> > +}
> > +
> 
> Does hmm_pde_pfn return a dma_addr_t pfn or a system memory pfn?
> 
> The types between these two functions don't match. According to 
> hmm_pde_from_pfn, both the pde and the pfn are supposed to be dma_addr_t. 
> But hmm_pde_pfn returns an unsigned long as a pfn instead of a dma_addr_t. 
> If hmm_pde_pfn sometimes used to get a dma_addr_t pfn then shouldn't it 
> also return a dma_addr_t, since as you pointed out in the commit message, 
> dma_addr_t might be bigger than an unsigned long?
> 

Yes internal it use dma_addr_t but for device driver that want to use
physical system page address aka pfn i want them to use the specialize
helper hmm_pte_from_pfn() and hmm_pte_pfn() so type casting happen in
hmm and it make it easier to review device driver as device driver will
be consistent ie either it wants to use pfn or it want to use dma_addr_t
but not mix the 2.

A latter patch add the hmm_pte_from_dma() and hmm_pte_dma_addr() helper
for the dma case. So this patch only introduce the pfn version.

> > [...]
> > +
> > +static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
> > +{
> > +	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
> > +}
> > +
> > +static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
> > +{
> > +	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
> > +}
> 
> Same question as hmm_pde_pfn above.

See above.

> 
> > [...]
> > +/* struct hmm_pt_iter - page table iterator states.
> > + *
> > + * @ptd: Array of directory struct page pointer for each levels.
> > + * @ptdp: Array of pointer to mapped directory levels.
> > + * @dead_directories: List of directories that died while walking page table.
> > + * @cur: Current address.
> > + */
> > +struct hmm_pt_iter {
> > +	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
> > +	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
> 
> These are sized to be HMM_PT_MAX_LEVEL - 1 rather than HMM_PT_MAX_LEVEL 
> because the iterator doesn't store the top level, correct? This results in 
> a lot of "level - 1" and "level - 2" logic when dealing with the iterator. 
> Have you considered keeping the levels consistent to get rid of all the 
> extra offset-by-1 logic?

All this should be optimized away by the compiler thought i have not
check the assembly.

[...]
> > + *
> > + * @iter: Iterator states that currently protect the directory.
> > + * @level: Level of the directory to reference.
> > + *
> > + * This function will reference a directory but it is illegal for refcount to
> > + * be 0 as this helper should only be call when iterator is protecting the
> > + * directory (ie iterator hold a reference for the directory).
> > + *
> > + * HMM user will call this with level = pt.llevel any other value is supicious
> > + * outside of hmm_pt code.
> > + */
> > +static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter,
> > +					     char level)
> > +{
> > +	/* Nothing to do for root level. */
> > +	if (!level)
> > +		return;
> > +
> > +	if (!atomic_inc_not_zero(&iter->ptd[level - 1]->_mapcount))
> > +		/* Illegal this should not happen. */
> > +		BUG();
> > +}
> > +
> > [...]
> > +
> > +int hmm_pt_init(struct hmm_pt *pt)
> > +{
> > +	unsigned directory_shift, i = 0, npgd;
> > +
> > +	pt->last &= PAGE_MASK;
> > +	spin_lock_init(&pt->lock);
> > +	/* Directory shift is the number of bits that a single directory level
> > +	 * represent. For instance if PAGE_SIZE is 4096 and each entry takes 8
> > +	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
> > +	 */
> > +	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
> > +	/* Level 0 is the root level of the page table. It might use less
> > +	 * bits than directory_shift but all sub-directory level will use all
> > +	 * directory_shift bits.
> > +	 *
> > +	 * For instance if hmm_pt.last == (1 << 48), PAGE_SHIFT == 12 and
> 
> This example should say that hmm_pt.last == (1 << 48) - 1, since last is 
> inclusive. Otherwise llevel will be 4.

Yes correct i switched from exclusive to inclusive and forgot to update
comment.

> 
> > +	 * sizeof(dma_addr_t) == 8 then :
> > +	 *   directory_shift = 9
> > +	 *   shift[0] = 39
> > +	 *   shift[1] = 30
> > +	 *   shift[2] = 21
> > +	 *   shift[3] = 12
> > +	 *   llevel = 3
> > +	 *
> > +	 * Note that shift[llevel] == PAGE_SHIFT because the last level
> > +	 * correspond to the page table entry level (ignoring the case of huge
> > +	 * page).
> > +	 */
> > +	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
> > +			directory_shift) + PAGE_SHIFT;
> > +	while (pt->shift[i++] > PAGE_SHIFT)
> > +		pt->shift[i] = pt->shift[i - 1] - directory_shift;
> > +	pt->llevel = i - 1;
> > +	pt->directory_mask = (1 << directory_shift) - 1;
> > +
> > +	for (i = 0; i <= pt->llevel; ++i)
> > +		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
> > +
> > +	npgd = (pt->last >> pt->shift[0]) + 1;
> > +	pt->pgd = kzalloc(npgd * sizeof(dma_addr_t), GFP_KERNEL);
> > +	if (!pt->pgd)
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(hmm_pt_init);
> 
> Does this need to be EXPORT_SYMBOL? It seems like a driver would never 
> need to call this, only core hmm. Same question for hmm_pt_fini.
> 

Well i wanted to use that in the hmm_dummy driver, but i have mix feeling
as using it in dummy driver allow to test it in more use case but at the
same time a bug might be hidden because same code is use in dummy driver.
I will probably juste use it in dummy driver.

[...]
> > + *
> > + * @iter: Iterator states.
> > + *
> > + * This function will initialize iterator states. It must always be pair with a
> > + * call to hmm_pt_iter_fini().
> > + */
> > +void hmm_pt_iter_init(struct hmm_pt_iter *iter)
> > +{
> > +	memset(iter->ptd, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
> > +	memset(iter->ptdp, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
> 
> The memset sizes can simply be sizeof(iter->ptd) and sizeof(iter->ptdp).

Yes i should have.

> > +	INIT_LIST_HEAD(&iter->dead_directories);
> > +}
> > +EXPORT_SYMBOL(hmm_pt_iter_init);
> > +
> > +/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
> > + *
> > + * @iter: Iterator states.
> > + * @pt: HMM page table.
> > + * @level: Level of the directory to unref.
> > + *
> > + * This function will unreference a directory and add it to dead list if
> > + * directory no longer have any reference. It will also clear the entry to
> > + * that directory into the upper level directory as well as dropping ref
> > + * on the upper directory.
> > + */
> > +static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
> > +					     struct hmm_pt *pt,
> > +					     unsigned level)
> > +{
> > +	struct page *upper_ptd;
> > +	dma_addr_t *upper_ptdp;
> > +
> > +	/* Nothing to do for root level. */
> > +	if (!level)
> > +		return;
> > +
> > +	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
> > +		return;
> > +
> > +	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
> > +	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : pt->pgd;
> > +	upper_ptdp = &upper_ptdp[hmm_pt_index(pt, iter->cur, level - 1)];
> > +	hmm_pt_directory_lock(pt, upper_ptd, level - 1);
> > +	/*
> > +	 * There might be race btw decrementing reference count on a directory
> > +	 * and another thread trying to fault in a new directory. To avoid
> > +	 * erasing the new directory entry we need to check that the entry
> > +	 * still correspond to the directory we are removing.
> > +	 */
> > +	if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
> > +		*upper_ptdp = 0;
> > +	hmm_pt_directory_unlock(pt, upper_ptd, level - 1);
> > +
> > +	/* Add it to delayed free list. */
> > +	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
> > +
> > +	/*
> > +	 * The upper directory is not safe to unref as we have an extra ref and
> 
> This should be "IS safe to unref", correct?

Yes (s/not/now) -> "is NOW safe to unref" /me blame my fingers.

[...]
> > +	/*
> > +	 * Some iterator may have dereferenced a dead directory entry and looked
> > +	 * up the struct page but haven't check yet the reference count. As all
> > +	 * the above happen in rcu read critical section we know that we need
> > +	 * to wait for grace period before being able to free any of the dead
> > +	 * directory page.
> > +	 */
> > +	synchronize_rcu();
> > +	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
> > +		list_del(&ptd->lru);
> > +		atomic_set(&ptd->_mapcount, -1);
> > +		__free_page(ptd);
> > +	}
> > +}
> 
> If I'm following this correctly, a migrate to the device will allocate HMM 
> page tables and the subsequent migrate from the device will free them. 
> Assuming that's the case, might thrashing of page allocations be a 
> problem? What about keeping the HMM page tables around until the actual 
> munmap() of the corresponding VA range?

HMM page table is allocate anytime a device mirror a range ie migration to
device is not a special case. When migrating to and from device, the HMM
page table is allocated prior to the migration and outlive the migration
back.

That said the rational here is that i want to free HMM resources as early as
possible mostly to support the use GPU on dataset onetime (ie dataset is use
once and only once by the GPU). I think it will be a common and important use
case and making sure we free resource early does not prevent other use case
where dataset are use for longer time to work properly and efficiently.

In a latter patch i add an helper so that device driver can discard a range
ie tell HMM that they no longer using a range of address allowing HMM to
free associated resources.

However you are correct that currently some MM event will lead to HMM page
table being free and then reallocated right after once again by the device.
Which is obviously bad. But because i did not want to make this patch or
this serie any more complex than it already is i did not include any mecanism
to delay HMM page table directory reclaim. Such delayed reclaim mecanism is
on my road map and i think i shared that roadmap with you. I think it is
something we can optimize latter on. The important part here is that device
driver knows that HMM page table need to be carefully accessed so that when
agressive pruning of HMM page table happens it does not disrupt the device
driver.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/36] HMM: add HMM page table v2.
  2015-06-19 18:07     ` Jerome Glisse
@ 2015-06-20  2:34       ` Mark Hairgrove
  0 siblings, 0 replies; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-20  2:34 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar



On Fri, 19 Jun 2015, Jerome Glisse wrote:

> On Thu, Jun 18, 2015 at 07:06:08PM -0700, Mark Hairgrove wrote:
> > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> 
> [...]
> > > +
> > > +static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
> > > +{
> > > +	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
> > > +}
> > > +
> > > +static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
> > > +{
> > > +	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
> > > +}
> > > +
> > 
> > Does hmm_pde_pfn return a dma_addr_t pfn or a system memory pfn?
> > 
> > The types between these two functions don't match. According to 
> > hmm_pde_from_pfn, both the pde and the pfn are supposed to be dma_addr_t. 
> > But hmm_pde_pfn returns an unsigned long as a pfn instead of a dma_addr_t. 
> > If hmm_pde_pfn sometimes used to get a dma_addr_t pfn then shouldn't it 
> > also return a dma_addr_t, since as you pointed out in the commit message, 
> > dma_addr_t might be bigger than an unsigned long?
> > 
> 
> Yes internal it use dma_addr_t but for device driver that want to use
> physical system page address aka pfn i want them to use the specialize
> helper hmm_pte_from_pfn() and hmm_pte_pfn() so type casting happen in
> hmm and it make it easier to review device driver as device driver will
> be consistent ie either it wants to use pfn or it want to use dma_addr_t
> but not mix the 2.
> 
> A latter patch add the hmm_pte_from_dma() and hmm_pte_dma_addr() helper
> for the dma case. So this patch only introduce the pfn version.

So the only reason for hmm_pde_from_pfn to take in a dma_addr_t is to 
avoid an (unsigned long) cast at the call sites?


> > > [...]
> > > +/* struct hmm_pt_iter - page table iterator states.
> > > + *
> > > + * @ptd: Array of directory struct page pointer for each levels.
> > > + * @ptdp: Array of pointer to mapped directory levels.
> > > + * @dead_directories: List of directories that died while walking page table.
> > > + * @cur: Current address.
> > > + */
> > > +struct hmm_pt_iter {
> > > +	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
> > > +	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
> > 
> > These are sized to be HMM_PT_MAX_LEVEL - 1 rather than HMM_PT_MAX_LEVEL 
> > because the iterator doesn't store the top level, correct? This results in 
> > a lot of "level - 1" and "level - 2" logic when dealing with the iterator. 
> > Have you considered keeping the levels consistent to get rid of all the 
> > extra offset-by-1 logic?
> 
> All this should be optimized away by the compiler thought i have not
> check the assembly.

I was talking about code readability and maintainability rather than 
performance. It's conceptually simpler to have consistent definitions of 
"level" across both the iterator and the hmm_pt helpers even though the 
iterator doesn't need to access the top level. This would turn "level-1" 
and "level-2" into "level" and "level-1", which I think are simpler to 
follow.


> [...]
> > > +	/*
> > > +	 * Some iterator may have dereferenced a dead directory entry and looked
> > > +	 * up the struct page but haven't check yet the reference count. As all
> > > +	 * the above happen in rcu read critical section we know that we need
> > > +	 * to wait for grace period before being able to free any of the dead
> > > +	 * directory page.
> > > +	 */
> > > +	synchronize_rcu();
> > > +	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
> > > +		list_del(&ptd->lru);
> > > +		atomic_set(&ptd->_mapcount, -1);
> > > +		__free_page(ptd);
> > > +	}
> > > +}
> > 
> > If I'm following this correctly, a migrate to the device will allocate HMM 
> > page tables and the subsequent migrate from the device will free them. 
> > Assuming that's the case, might thrashing of page allocations be a 
> > problem? What about keeping the HMM page tables around until the actual 
> > munmap() of the corresponding VA range?
> 
> HMM page table is allocate anytime a device mirror a range ie migration to
> device is not a special case. When migrating to and from device, the HMM
> page table is allocated prior to the migration and outlive the migration
> back.
> 
> That said the rational here is that i want to free HMM resources as early as
> possible mostly to support the use GPU on dataset onetime (ie dataset is use
> once and only once by the GPU). I think it will be a common and important use
> case and making sure we free resource early does not prevent other use case
> where dataset are use for longer time to work properly and efficiently.
> 
> In a latter patch i add an helper so that device driver can discard a range
> ie tell HMM that they no longer using a range of address allowing HMM to
> free associated resources.
> 
> However you are correct that currently some MM event will lead to HMM page
> table being free and then reallocated right after once again by the device.
> Which is obviously bad. But because i did not want to make this patch or
> this serie any more complex than it already is i did not include any mecanism
> to delay HMM page table directory reclaim. Such delayed reclaim mecanism is
> on my road map and i think i shared that roadmap with you. I think it is
> something we can optimize latter on. The important part here is that device
> driver knows that HMM page table need to be carefully accessed so that when
> agressive pruning of HMM page table happens it does not disrupt the device
> driver.

Ok, works for me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/36] HMM: add special swap filetype for memory migrated to HMM device memory.
  2015-05-21 19:31 ` [PATCH 16/36] HMM: add special swap filetype for memory migrated to HMM device memory j.glisse
@ 2015-06-24  7:49   ` Haggai Eran
  0 siblings, 0 replies; 80+ messages in thread
From: Haggai Eran @ 2015-06-24  7:49 UTC (permalink / raw)
  To: j.glisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jerome Glisse, Jatin Kumar

On 21/05/2015 22:31, j.glisse@gmail.com wrote:
> From: Jerome Glisse <jglisse@redhat.com>
> 
> When migrating anonymous memory from system memory to device memory
> CPU pte are replaced with special HMM swap entry so that page fault,
> get user page (gup), fork, ... are properly redirected to HMM helpers.
> 
> This patch only add the new swap type entry and hooks HMM helpers
> functions inside the page fault and fork code path.
> 
> Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
> ---
>  include/linux/hmm.h     | 34 ++++++++++++++++++++++++++++++++++
>  include/linux/swap.h    | 12 +++++++++++-
>  include/linux/swapops.h | 43 ++++++++++++++++++++++++++++++++++++++++++-
>  mm/hmm.c                | 21 +++++++++++++++++++++
>  mm/memory.c             | 22 ++++++++++++++++++++++
>  5 files changed, 130 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 186f497..f243eb5 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -257,6 +257,40 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
>  			    unsigned long start,
>  			    unsigned long end);
>  
> +int hmm_handle_cpu_fault(struct mm_struct *mm,
> +			struct vm_area_struct *vma,
> +			pmd_t *pmdp, unsigned long addr,
> +			unsigned flags, pte_t orig_pte);
> +
> +int hmm_mm_fork(struct mm_struct *src_mm,
> +		struct mm_struct *dst_mm,
> +		struct vm_area_struct *dst_vma,
> +		pmd_t *dst_pmd,
> +		unsigned long start,
> +		unsigned long end);
> +
> +#else /* CONFIG_HMM */
> +
> +static inline int hmm_handle_mm_fault(struct mm_struct *mm,
I think this should be hmm_handle_cpu_fault, to match the function
declared above in the CONFIG_HMM case.

> +				      struct vm_area_struct *vma,
> +				      pmd_t *pmdp, unsigned long addr,
> +				      unsigned flags, pte_t orig_pte)
> +{
> +	return VM_FAULT_SIGBUS;
> +}
> +
> +static inline int hmm_mm_fork(struct mm_struct *src_mm,
> +			      struct mm_struct *dst_mm,
> +			      struct vm_area_struct *dst_vma,
> +			      pmd_t *dst_pmd,
> +			      unsigned long start,
> +			      unsigned long end)
> +{
> +	BUG();
> +	return -ENOMEM;
> +}
>  
>  #endif /* CONFIG_HMM */
> +
> +
>  #endif

Regards,
Haggai

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM.
  2015-05-21 20:23   ` [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM jglisse
@ 2015-06-24 13:59     ` Haggai Eran
  0 siblings, 0 replies; 80+ messages in thread
From: Haggai Eran @ 2015-06-24 13:59 UTC (permalink / raw)
  To: jglisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, linux-rdma

On 21/05/2015 23:23, jglisse@redhat.com wrote:
> +int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
> +{
> +	struct mm_struct *mm = get_task_mm(current);
> +	struct ib_device *ib_device = context->device;
> +	struct ib_mirror *ib_mirror;
> +	struct pid *our_pid;
> +	int ret;
> +
> +	if (!mm || !ib_device->hmm_ready)
> +		return -EINVAL;
> +
> +	/* FIXME can this really happen ? */
No, following Yann Droneaud's patch 8abaae62f3fd ("IB/core: disallow
registering 0-sized memory region") ib_umem_get() checks against zero
sized umems.

> +	if (unlikely(ib_umem_start(umem) == ib_umem_end(umem)))
> +		return -EINVAL;
> +
> +	/* Prevent creating ODP MRs in child processes */
> +	rcu_read_lock();
> +	our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
> +	rcu_read_unlock();
> +	put_pid(our_pid);
> +	if (context->tgid != our_pid) {
> +		mmput(mm);
> +		return -EINVAL;
> +	}
> +
> +	umem->hugetlb = 0;
> +	umem->odp_data = kmalloc(sizeof(*umem->odp_data), GFP_KERNEL);
> +	if (umem->odp_data == NULL) {
> +		mmput(mm);
> +		return -ENOMEM;
> +	}
> +	umem->odp_data->private = NULL;
> +	umem->odp_data->umem = umem;
> +
> +	mutex_lock(&ib_device->hmm_mutex);
> +	/* Is there an existing mirror for this process mm ? */
> +	ib_mirror = ib_mirror_ref(context->ib_mirror);
> +	if (!ib_mirror)
> +		list_for_each_entry(ib_mirror, &ib_device->ib_mirrors, list) {
> +			if (ib_mirror->base.hmm->mm != mm)
> +				continue;
> +			ib_mirror = ib_mirror_ref(ib_mirror);
> +			break;
> +		}
> +
> +	if (ib_mirror == NULL ||
> +	    ib_mirror == list_first_entry(&ib_device->ib_mirrors,
> +					  struct ib_mirror, list)) {
Is the second check an attempt to check if the list_for_each_entry above
passed through all the entries and didn't break? Maybe I missing
something, but I think that would cause the ib_mirror to hold a pointer
such that ib_mirror->list == ib_mirrors (point to the list head), and
not to the first entry.

In any case, I think it would be more clear if you add another ib_mirror
variable for iterating the ib_mirrors list.

> +		/* We need to create a new mirror. */
> +		ib_mirror = kmalloc(sizeof(*ib_mirror), GFP_KERNEL);
> +		if (ib_mirror == NULL) {
> +			mutex_unlock(&ib_device->hmm_mutex);
> +			mmput(mm);
> +			return -ENOMEM;
> +		}
> +		kref_init(&ib_mirror->kref);
> +		init_rwsem(&ib_mirror->hmm_mr_rwsem);
> +		ib_mirror->umem_tree = RB_ROOT;
> +		ib_mirror->ib_device = ib_device;
> +
> +		ib_mirror->base.device = &ib_device->hmm_dev;
> +		ret = hmm_mirror_register(&ib_mirror->base);
> +		if (ret) {
> +			mutex_unlock(&ib_device->hmm_mutex);
> +			kfree(ib_mirror);
> +			mmput(mm);
> +			return ret;
> +		}
> +
> +		list_add(&ib_mirror->list, &ib_device->ib_mirrors);
> +		context->ib_mirror = ib_mirror_ref(ib_mirror);
> +	}
> +	mutex_unlock(&ib_device->hmm_mutex);
> +	umem->odp_data.ib_mirror = ib_mirror;
> +
> +	down_write(&ib_mirror->umem_rwsem);
> +	rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
> +	up_write(&ib_mirror->umem_rwsem);
> +
> +	mmput(mm);
> +	return 0;
> +}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/36] HMM: add HMM page table v2.
  2015-05-21 19:31 ` [PATCH 06/36] HMM: add HMM page table v2 j.glisse
  2015-06-19  2:06   ` Mark Hairgrove
@ 2015-06-25 22:57   ` Mark Hairgrove
  2015-06-26 16:30     ` Jerome Glisse
  1 sibling, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-25 22:57 UTC (permalink / raw)
  To: j.glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

[-- Attachment #1: Type: text/plain, Size: 3228 bytes --]



On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> From: JA(C)rA'me Glisse <jglisse@redhat.com>
> 
> [...]
> +
> +void hmm_pt_iter_init(struct hmm_pt_iter *iter);
> +void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt);
> +unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
> +			       struct hmm_pt *pt,
> +			       unsigned long addr,
> +			       unsigned long end);
> +dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
> +			       struct hmm_pt *pt,
> +			       unsigned long addr);
> +dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
> +			      struct hmm_pt *pt,
> +			      unsigned long addr);

I've got a few more thoughts on hmm_pt_iter after looking at some of the 
later patches. I think I've convinced myself that this patch functionally 
works as-is, but I've got some suggestions and questions about the design.

Right now there are these three major functions:

1) hmm_pt_iter_update(addr)
   - Returns the hmm_pte * for addr, or NULL if none exists.

2) hmm_pt_iter_fault(addr)
   - Returns the hmm_pte * for addr, allocating a new one if none exists.

3) hmm_pt_iter_next(addr, end)
   - Returns the next possibly-valid address. The caller must use
     hmm_pt_iter_update to check if there really is an hmm_pte there.

In my view, there are two sources of confusion here:
- Naming. "update" shares a name with the HMM mirror callback, and it also
  implies that the page tables are "updated" as a result of the call. 
  "fault" likewise implies that the function handles a fault in some way.
  Neither of these implications are true.

- hmm_pt_iter_next and hmm_pt_iter_update have some overlapping
  functionality when compared to traditional iterators, requiring the 
  callers to all do this sort of thing:

        hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
        if (!hmm_pte) {
            addr = hmm_pt_iter_next(&iter, &mirror->pt,
                        addr, event->end);
            continue;
        }

Wouldn't it be more efficient and simpler to have _next do all the 
iteration internally so it always returns the next valid entry? Then you 
could combine _update and _next into a single function, something along 
these lines (which also addresses the naming concern):

void hmm_pt_iter_init(iter, pt, start, end);
unsigned long hmm_pt_iter_next(iter, hmm_pte *);
unsigned long hmm_pt_iter_next_alloc(iter, hmm_pte *);

hmm_pt_iter_next would return the address and ptep of the next valid 
entry, taking the place of the existing _update and _next functions. 
hmm_pt_iter_next_alloc takes the place of _fault.

Also, since the _next functions don't take in an address, the iterator 
doesn't have to handle the input addr being different from iter->cur.

The logical extent of this is a callback approach like mm_walk. That would 
be nice because the caller wouldn't have to worry about making the _init 
and _fini calls. I assume you didn't go with this approach because 
sometimes you need to iterate over hmm_pt while doing an mm_walk itself, 
and you didn't want the overhead of nesting those?

Finally, another minor thing I just noticed: shouldn't hmm_pt.h include 
<linux/bitops.h> since it uses all of the clear/set/test bit APIs?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 07/36] HMM: add per mirror page table v3.
  2015-05-21 19:31 ` [PATCH 07/36] HMM: add per mirror page table v3 j.glisse
@ 2015-06-25 23:05   ` Mark Hairgrove
  2015-06-26 16:43     ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-25 23:05 UTC (permalink / raw)
  To: j.glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

[-- Attachment #1: Type: text/plain, Size: 5596 bytes --]



On Thu, 21 May 2015, j.glisse@gmail.com wrote:

> From: JA(C)rA'me Glisse <jglisse@redhat.com>
> 
> [...]
>  
> +	/* update() - update device mmu following an event.
> +	 *
> +	 * @mirror: The mirror that link process address space with the device.
> +	 * @event: The event that triggered the update.
> +	 * Returns: 0 on success or error code {-EIO, -ENOMEM}.
> +	 *
> +	 * Called to update device page table for a range of address.
> +	 * The event type provide the nature of the update :
> +	 *   - Range is no longer valid (munmap).
> +	 *   - Range protection changes (mprotect, COW, ...).
> +	 *   - Range is unmapped (swap, reclaim, page migration, ...).
> +	 *   - Device page fault.
> +	 *   - ...
> +	 *
> +	 * Thought most device driver only need to use pte_mask as it reflects
> +	 * change that will happen to the HMM page table ie :
> +	 *   new_pte = old_pte & event->pte_mask;

Documentation request: It would be useful to break down exactly what is 
required from the driver for each event type here, and what extra 
information is provided by the type that isn't provided by the pte_mask.

> +	 *
> +	 * Device driver must not update the HMM mirror page table (except the
> +	 * dirty bit see below). Core HMM will update HMM page table after the
> +	 * update is done.
> +	 *
> +	 * Note that device must be cache coherent with system memory (snooping
> +	 * in case of PCIE devices) so there should be no need for device to
> +	 * flush anything.
> +	 *
> +	 * When write protection is turned on device driver must make sure the
> +	 * hardware will no longer be able to write to the page otherwise file
> +	 * system corruption may occur.
> +	 *
> +	 * Device must properly set the dirty bit using hmm_pte_set_bit() on
> +	 * each page entry for memory that was written by the device. If device
> +	 * can not properly account for write access then the dirty bit must be
> +	 * set unconditionaly so that proper write back of file backed page can
> +	 * happen.
> +	 *
> +	 * Device driver must not fail lightly, any failure result in device
> +	 * process being kill.
> +	 *
> +	 * Return 0 on success, error value otherwise :
> +	 * -ENOMEM Not enough memory for performing the operation.
> +	 * -EIO    Some input/output error with the device.
> +	 *
> +	 * All other return value trigger warning and are transformed to -EIO.
> +	 */
> +	int (*update)(struct hmm_mirror *mirror,const struct hmm_event *event);
>  };
>  
>  
> @@ -142,6 +223,7 @@ int hmm_device_unregister(struct hmm_device *device);
>   * @kref: Reference counter (private to HMM do not use).
>   * @dlist: List of all hmm_mirror for same device.
>   * @mlist: List of all hmm_mirror for same process.
> + * @pt: Mirror page table.
>   *
>   * Each device that want to mirror an address space must register one of this
>   * struct for each of the address space it wants to mirror. Same device can
> @@ -154,6 +236,7 @@ struct hmm_mirror {
>  	struct kref		kref;
>  	struct list_head	dlist;
>  	struct hlist_node	mlist;
> +	struct hmm_pt		pt;

Documentation request: Why does each mirror have its own separate set of 
page tables rather than the hmm keeping one set for all devices? This is 
so different devices can have different permissions for the same address 
range, correct?

>  };
>  
> [...]
> +
> +static inline int hmm_event_init(struct hmm_event *event,
> +				 struct hmm *hmm,
> +				 unsigned long start,
> +				 unsigned long end,
> +				 enum hmm_etype etype)
> +{
> +	event->start = start & PAGE_MASK;
> +	event->end = min(end, hmm->vm_end);

start is rounded down to a page boundary. Should end be rounded also?


> [...]
> +
> +static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
> +				 struct hmm_event *event)
> +{
> +	unsigned long addr;
> +	struct hmm_pt_iter iter;
> +
> +	hmm_pt_iter_init(&iter);
> +	for (addr = event->start; addr != event->end;) {
> +		unsigned long end, next;
> +		dma_addr_t *hmm_pte;
> +
> +		hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
> +		if (!hmm_pte) {
> +			addr = hmm_pt_iter_next(&iter, &mirror->pt,
> +						addr, event->end);
> +			continue;
> +		}
> +		end = hmm_pt_level_next(&mirror->pt, addr, event->end,
> +					 mirror->pt.llevel - 1);
> +		/*
> +		 * The directory lock protect against concurrent clearing of
> +		 * page table bit flags. Exceptions being the dirty bit and
> +		 * the device driver private flags.
> +		 */
> +		hmm_pt_iter_directory_lock(&iter, &mirror->pt);
> +		do {
> +			next = hmm_pt_level_next(&mirror->pt, addr, end,
> +						 mirror->pt.llevel);
> +			if (!hmm_pte_test_valid_pfn(hmm_pte))
> +				continue;
> +			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
> +			    hmm_pte_test_write(hmm_pte)) {

If the pte is dirty, why bother checking that it's writable?

Could there be a legitimate case in which the page was dirtied in the 
past, but was made read-only later for some reason? In that case the page 
would still need to be be dirtied correctly even though the hmm_pte isn't 
currently writable.

Or is this check trying to protect against a driver setting the dirty bit 
without the write bit being set? If that happens, that's a driver bug, 
right?

> +				struct page *page;
> +
> +				page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
> +				set_page_dirty(page);
> +			}
> +			*hmm_pte &= event->pte_mask;
> +			if (hmm_pte_test_valid_pfn(hmm_pte))
> +				continue;
> +			hmm_pt_iter_directory_unref(&iter, mirror->pt.llevel);
> +		} while (addr = next, hmm_pte++, addr != end);
> +		hmm_pt_iter_directory_unlock(&iter, &mirror->pt);
> +	}
> +	hmm_pt_iter_fini(&iter, &mirror->pt);
> +}

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/36] HMM: add HMM page table v2.
  2015-06-25 22:57   ` Mark Hairgrove
@ 2015-06-26 16:30     ` Jerome Glisse
  2015-06-27  1:34       ` Mark Hairgrove
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-26 16:30 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

On Thu, Jun 25, 2015 at 03:57:29PM -0700, Mark Hairgrove wrote:
> On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > From: Jerome Glisse <jglisse@redhat.com>
> > [...]
> > +
> > +void hmm_pt_iter_init(struct hmm_pt_iter *iter);
> > +void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt);
> > +unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
> > +			       struct hmm_pt *pt,
> > +			       unsigned long addr,
> > +			       unsigned long end);
> > +dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
> > +			       struct hmm_pt *pt,
> > +			       unsigned long addr);
> > +dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
> > +			      struct hmm_pt *pt,
> > +			      unsigned long addr);
> 
> I've got a few more thoughts on hmm_pt_iter after looking at some of the 
> later patches. I think I've convinced myself that this patch functionally 
> works as-is, but I've got some suggestions and questions about the design.
> 
> Right now there are these three major functions:
> 
> 1) hmm_pt_iter_update(addr)
>    - Returns the hmm_pte * for addr, or NULL if none exists.
> 
> 2) hmm_pt_iter_fault(addr)
>    - Returns the hmm_pte * for addr, allocating a new one if none exists.
> 
> 3) hmm_pt_iter_next(addr, end)
>    - Returns the next possibly-valid address. The caller must use
>      hmm_pt_iter_update to check if there really is an hmm_pte there.
> 
> In my view, there are two sources of confusion here:
> - Naming. "update" shares a name with the HMM mirror callback, and it also
>   implies that the page tables are "updated" as a result of the call. 
>   "fault" likewise implies that the function handles a fault in some way.
>   Neither of these implications are true.

Maybe hmm_pt_iter_walk & hmm_pt_iter_populate are better name ?


> - hmm_pt_iter_next and hmm_pt_iter_update have some overlapping
>   functionality when compared to traditional iterators, requiring the 
>   callers to all do this sort of thing:
> 
>         hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
>         if (!hmm_pte) {
>             addr = hmm_pt_iter_next(&iter, &mirror->pt,
>                         addr, event->end);
>             continue;
>         }
> 
> Wouldn't it be more efficient and simpler to have _next do all the 
> iteration internally so it always returns the next valid entry? Then you 
> could combine _update and _next into a single function, something along 
> these lines (which also addresses the naming concern):
> 
> void hmm_pt_iter_init(iter, pt, start, end);
> unsigned long hmm_pt_iter_next(iter, hmm_pte *);
> unsigned long hmm_pt_iter_next_alloc(iter, hmm_pte *);
> 
> hmm_pt_iter_next would return the address and ptep of the next valid 
> entry, taking the place of the existing _update and _next functions. 
> hmm_pt_iter_next_alloc takes the place of _fault.
> 
> Also, since the _next functions don't take in an address, the iterator 
> doesn't have to handle the input addr being different from iter->cur.

It would still need to do the same kind of test, this test is really to
know when you switch from one directory to the next and to drop and take
reference accordingly.


> The logical extent of this is a callback approach like mm_walk. That would 
> be nice because the caller wouldn't have to worry about making the _init 
> and _fini calls. I assume you didn't go with this approach because 
> sometimes you need to iterate over hmm_pt while doing an mm_walk itself, 
> and you didn't want the overhead of nesting those?

Correct i do not want to do a hmm_pt_walk inside a mm_walk, that sounded and
looked bad in my mind. That being said i could add a hmm_pt_walk like mm_walk
for device driver and simply have it using the hmm_pt_iter internally.


> Finally, another minor thing I just noticed: shouldn't hmm_pt.h include 
> <linux/bitops.h> since it uses all of the clear/set/test bit APIs?

Good catch, i forgot that.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 07/36] HMM: add per mirror page table v3.
  2015-06-25 23:05   ` Mark Hairgrove
@ 2015-06-26 16:43     ` Jerome Glisse
  2015-06-27  3:02       ` Mark Hairgrove
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-26 16:43 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

On Thu, Jun 25, 2015 at 04:05:48PM -0700, Mark Hairgrove wrote:
> On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > From: Jerome Glisse <jglisse@redhat.com>
> > [...]
> >  
> > +	/* update() - update device mmu following an event.
> > +	 *
> > +	 * @mirror: The mirror that link process address space with the device.
> > +	 * @event: The event that triggered the update.
> > +	 * Returns: 0 on success or error code {-EIO, -ENOMEM}.
> > +	 *
> > +	 * Called to update device page table for a range of address.
> > +	 * The event type provide the nature of the update :
> > +	 *   - Range is no longer valid (munmap).
> > +	 *   - Range protection changes (mprotect, COW, ...).
> > +	 *   - Range is unmapped (swap, reclaim, page migration, ...).
> > +	 *   - Device page fault.
> > +	 *   - ...
> > +	 *
> > +	 * Thought most device driver only need to use pte_mask as it reflects
> > +	 * change that will happen to the HMM page table ie :
> > +	 *   new_pte = old_pte & event->pte_mask;
> 
> Documentation request: It would be useful to break down exactly what is 
> required from the driver for each event type here, and what extra 
> information is provided by the type that isn't provided by the pte_mask.

Mostly event tell you if you need to free or not the device page table for
the range, which is not something you can infer from the pte_mask reliably.
Difference btw migration and munmap for instance, same pte_mask but range
is still valid in the migration case it will just be backed by a new set of
pages.


[...]
> > @@ -142,6 +223,7 @@ int hmm_device_unregister(struct hmm_device *device);
> >   * @kref: Reference counter (private to HMM do not use).
> >   * @dlist: List of all hmm_mirror for same device.
> >   * @mlist: List of all hmm_mirror for same process.
> > + * @pt: Mirror page table.
> >   *
> >   * Each device that want to mirror an address space must register one of this
> >   * struct for each of the address space it wants to mirror. Same device can
> > @@ -154,6 +236,7 @@ struct hmm_mirror {
> >  	struct kref		kref;
> >  	struct list_head	dlist;
> >  	struct hlist_node	mlist;
> > +	struct hmm_pt		pt;
> 
> Documentation request: Why does each mirror have its own separate set of 
> page tables rather than the hmm keeping one set for all devices? This is 
> so different devices can have different permissions for the same address 
> range, correct?

Several reasons, first and mostly dma mapping, while i have plan to allow
to share dma mapping directory btw devices this require work in the dma
layer first. Second reasons is, like you point out, different permissions,
like one device requesting atomic access ie the device will be the only
one with write permission and HMM need somewhere to store that information
per device per address. It also helps to avoid calling device driver on a
range that one device does not mirror.


> > [...]
> > +
> > +static inline int hmm_event_init(struct hmm_event *event,
> > +				 struct hmm *hmm,
> > +				 unsigned long start,
> > +				 unsigned long end,
> > +				 enum hmm_etype etype)
> > +{
> > +	event->start = start & PAGE_MASK;
> > +	event->end = min(end, hmm->vm_end);
> 
> start is rounded down to a page boundary. Should end be rounded also?

Something went wrong while i was re-organizing the patches, final code is:
	event->start = start & PAGE_MASK;
	event->end = PAGE_ALIGN(min(end, hmm->vm_end));

I will make sure this happen in this patch instead of a latter patch.


> > [...]
> > +
> > +static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
> > +				 struct hmm_event *event)
> > +{
> > +	unsigned long addr;
> > +	struct hmm_pt_iter iter;
> > +
> > +	hmm_pt_iter_init(&iter);
> > +	for (addr = event->start; addr != event->end;) {
> > +		unsigned long end, next;
> > +		dma_addr_t *hmm_pte;
> > +
> > +		hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
> > +		if (!hmm_pte) {
> > +			addr = hmm_pt_iter_next(&iter, &mirror->pt,
> > +						addr, event->end);
> > +			continue;
> > +		}
> > +		end = hmm_pt_level_next(&mirror->pt, addr, event->end,
> > +					 mirror->pt.llevel - 1);
> > +		/*
> > +		 * The directory lock protect against concurrent clearing of
> > +		 * page table bit flags. Exceptions being the dirty bit and
> > +		 * the device driver private flags.
> > +		 */
> > +		hmm_pt_iter_directory_lock(&iter, &mirror->pt);
> > +		do {
> > +			next = hmm_pt_level_next(&mirror->pt, addr, end,
> > +						 mirror->pt.llevel);
> > +			if (!hmm_pte_test_valid_pfn(hmm_pte))
> > +				continue;
> > +			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
> > +			    hmm_pte_test_write(hmm_pte)) {
> 
> If the pte is dirty, why bother checking that it's writable?
> 
> Could there be a legitimate case in which the page was dirtied in the 
> past, but was made read-only later for some reason? In that case the page 
> would still need to be be dirtied correctly even though the hmm_pte isn't 
> currently writable.
> 
> Or is this check trying to protect against a driver setting the dirty bit 
> without the write bit being set? If that happens, that's a driver bug, 
> right?

This is to catch driver bug, i should have add a comment and a debug msg
for that. The dirty bit can not be set if the write bit isn't. So if device
driver do that it is a bug, a bad one. Will add proper warning message.


Thanks for the review,
Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/36] HMM: add HMM page table v2.
  2015-06-26 16:30     ` Jerome Glisse
@ 2015-06-27  1:34       ` Mark Hairgrove
  2015-06-29 14:43         ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-27  1:34 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

[-- Attachment #1: Type: text/plain, Size: 4955 bytes --]



On Fri, 26 Jun 2015, Jerome Glisse wrote:

> On Thu, Jun 25, 2015 at 03:57:29PM -0700, Mark Hairgrove wrote:
> > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > > From: Jerome Glisse <jglisse@redhat.com>
> > > [...]
> > > +
> > > +void hmm_pt_iter_init(struct hmm_pt_iter *iter);
> > > +void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt);
> > > +unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
> > > +			       struct hmm_pt *pt,
> > > +			       unsigned long addr,
> > > +			       unsigned long end);
> > > +dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
> > > +			       struct hmm_pt *pt,
> > > +			       unsigned long addr);
> > > +dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
> > > +			      struct hmm_pt *pt,
> > > +			      unsigned long addr);
> > 
> > I've got a few more thoughts on hmm_pt_iter after looking at some of the 
> > later patches. I think I've convinced myself that this patch functionally 
> > works as-is, but I've got some suggestions and questions about the design.
> > 
> > Right now there are these three major functions:
> > 
> > 1) hmm_pt_iter_update(addr)
> >    - Returns the hmm_pte * for addr, or NULL if none exists.
> > 
> > 2) hmm_pt_iter_fault(addr)
> >    - Returns the hmm_pte * for addr, allocating a new one if none exists.
> > 
> > 3) hmm_pt_iter_next(addr, end)
> >    - Returns the next possibly-valid address. The caller must use
> >      hmm_pt_iter_update to check if there really is an hmm_pte there.
> > 
> > In my view, there are two sources of confusion here:
> > - Naming. "update" shares a name with the HMM mirror callback, and it also
> >   implies that the page tables are "updated" as a result of the call. 
> >   "fault" likewise implies that the function handles a fault in some way.
> >   Neither of these implications are true.
> 
> Maybe hmm_pt_iter_walk & hmm_pt_iter_populate are better name ?

hmm_pt_iter_populate sounds good. See below for _walk.


> 
> 
> > - hmm_pt_iter_next and hmm_pt_iter_update have some overlapping
> >   functionality when compared to traditional iterators, requiring the 
> >   callers to all do this sort of thing:
> > 
> >         hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
> >         if (!hmm_pte) {
> >             addr = hmm_pt_iter_next(&iter, &mirror->pt,
> >                         addr, event->end);
> >             continue;
> >         }
> > 
> > Wouldn't it be more efficient and simpler to have _next do all the 
> > iteration internally so it always returns the next valid entry? Then you 
> > could combine _update and _next into a single function, something along 
> > these lines (which also addresses the naming concern):
> > 
> > void hmm_pt_iter_init(iter, pt, start, end);
> > unsigned long hmm_pt_iter_next(iter, hmm_pte *);
> > unsigned long hmm_pt_iter_next_alloc(iter, hmm_pte *);
> > 
> > hmm_pt_iter_next would return the address and ptep of the next valid 
> > entry, taking the place of the existing _update and _next functions. 
> > hmm_pt_iter_next_alloc takes the place of _fault.
> > 
> > Also, since the _next functions don't take in an address, the iterator 
> > doesn't have to handle the input addr being different from iter->cur.
> 
> It would still need to do the same kind of test, this test is really to
> know when you switch from one directory to the next and to drop and take
> reference accordingly.

But all of the directory references are already hidden entirely in the 
iterator _update function. The caller only has to worry about taking 
references on the bottom level, so I don't understand why the iterator 
needs to return to the caller when it hits the end of a directory. Or for 
that matter, why it returns every possible index within a directory to the 
caller whether that index is valid or not.

If _next only returned to the caller when it hit a valid hmm_pte (or end), 
then only one function would be needed (_next) instead of two 
(_update/_walk and _next).


> 
> 
> > The logical extent of this is a callback approach like mm_walk. That would 
> > be nice because the caller wouldn't have to worry about making the _init 
> > and _fini calls. I assume you didn't go with this approach because 
> > sometimes you need to iterate over hmm_pt while doing an mm_walk itself, 
> > and you didn't want the overhead of nesting those?
> 
> Correct i do not want to do a hmm_pt_walk inside a mm_walk, that sounded and
> looked bad in my mind. That being said i could add a hmm_pt_walk like mm_walk
> for device driver and simply have it using the hmm_pt_iter internally.

I agree that nesting walks feels bad. If we can get the hmm_pt_iter API 
simple enough, I don't think an hmm_pt_walk callback approach is 
necessary.


> 
> 
> > Finally, another minor thing I just noticed: shouldn't hmm_pt.h include 
> > <linux/bitops.h> since it uses all of the clear/set/test bit APIs?
> 
> Good catch, i forgot that.
> 
> Cheers,
> Jerome
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 07/36] HMM: add per mirror page table v3.
  2015-06-26 16:43     ` Jerome Glisse
@ 2015-06-27  3:02       ` Mark Hairgrove
  2015-06-29 14:50         ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-06-27  3:02 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

[-- Attachment #1: Type: text/plain, Size: 4218 bytes --]



On Fri, 26 Jun 2015, Jerome Glisse wrote:

> On Thu, Jun 25, 2015 at 04:05:48PM -0700, Mark Hairgrove wrote:
> > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > > From: Jerome Glisse <jglisse@redhat.com>
> > > [...]
> > >  
> > > +	/* update() - update device mmu following an event.
> > > +	 *
> > > +	 * @mirror: The mirror that link process address space with the device.
> > > +	 * @event: The event that triggered the update.
> > > +	 * Returns: 0 on success or error code {-EIO, -ENOMEM}.
> > > +	 *
> > > +	 * Called to update device page table for a range of address.
> > > +	 * The event type provide the nature of the update :
> > > +	 *   - Range is no longer valid (munmap).
> > > +	 *   - Range protection changes (mprotect, COW, ...).
> > > +	 *   - Range is unmapped (swap, reclaim, page migration, ...).
> > > +	 *   - Device page fault.
> > > +	 *   - ...
> > > +	 *
> > > +	 * Thought most device driver only need to use pte_mask as it reflects
> > > +	 * change that will happen to the HMM page table ie :
> > > +	 *   new_pte = old_pte & event->pte_mask;
> > 
> > Documentation request: It would be useful to break down exactly what is 
> > required from the driver for each event type here, and what extra 
> > information is provided by the type that isn't provided by the pte_mask.
> 
> Mostly event tell you if you need to free or not the device page table for
> the range, which is not something you can infer from the pte_mask reliably.
> Difference btw migration and munmap for instance, same pte_mask but range
> is still valid in the migration case it will just be backed by a new set of
> pages.

Given that event->pte_mask and event->type provide redundant information, 
are they both necessary?

With or without pte_mask, the below table would be helpful to have in the 
comments for the ->update callback:

Event type          Driver action
HMM_NONE            N/A (driver will never get this)

HMM_FORK            Same as HMM_WRITE_PROTECT

HMM_ISDIRTY         Same as HMM_WRITE_PROTECT

HMM_MIGRATE         Make device PTEs invalid and use hmm_pte_set_dirty or
                    hmm_mirror_range_dirty if applicable

HMM_MUNMAP          Same as HMM_MIGRATE, but the driver may take this as a
                    hint to free device page tables and other resources
                    associated with this range

HMM_DEVICE_RFAULT   Read hmm_ptes using hmm_pt_iter and write them on the
                    device

HMM_DEVICE_WFAULT   Same as HMM_DEVICE_RFAULT

HMM_WRITE_PROTECT   Remove write permission from device PTEs and use
                    hmm_pte_set_dirty or hmm_mirror_range_dirty if
                    applicable


> 
> 
> [...]
> > > @@ -142,6 +223,7 @@ int hmm_device_unregister(struct hmm_device *device);
> > >   * @kref: Reference counter (private to HMM do not use).
> > >   * @dlist: List of all hmm_mirror for same device.
> > >   * @mlist: List of all hmm_mirror for same process.
> > > + * @pt: Mirror page table.
> > >   *
> > >   * Each device that want to mirror an address space must register one of this
> > >   * struct for each of the address space it wants to mirror. Same device can
> > > @@ -154,6 +236,7 @@ struct hmm_mirror {
> > >  	struct kref		kref;
> > >  	struct list_head	dlist;
> > >  	struct hlist_node	mlist;
> > > +	struct hmm_pt		pt;
> > 
> > Documentation request: Why does each mirror have its own separate set of 
> > page tables rather than the hmm keeping one set for all devices? This is 
> > so different devices can have different permissions for the same address 
> > range, correct?
> 
> Several reasons, first and mostly dma mapping, while i have plan to allow
> to share dma mapping directory btw devices this require work in the dma
> layer first. Second reasons is, like you point out, different permissions,
> like one device requesting atomic access ie the device will be the only
> one with write permission and HMM need somewhere to store that information
> per device per address. It also helps to avoid calling device driver on a
> range that one device does not mirror.

Sure, that makes sense. Can you put this in the documentation somewhere, 
perhaps in the header comments for struct hmm_mirror?

Thanks!

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/36] HMM: add HMM page table v2.
  2015-06-27  1:34       ` Mark Hairgrove
@ 2015-06-29 14:43         ` Jerome Glisse
  2015-07-01  2:51           ` Mark Hairgrove
  0 siblings, 1 reply; 80+ messages in thread
From: Jerome Glisse @ 2015-06-29 14:43 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

On Fri, Jun 26, 2015 at 06:34:16PM -0700, Mark Hairgrove wrote:
> 
> 
> On Fri, 26 Jun 2015, Jerome Glisse wrote:
> 
> > On Thu, Jun 25, 2015 at 03:57:29PM -0700, Mark Hairgrove wrote:
> > > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > > > From: Jerome Glisse <jglisse@redhat.com>
> > > > [...]
> > > > +
> > > > +void hmm_pt_iter_init(struct hmm_pt_iter *iter);
> > > > +void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt);
> > > > +unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
> > > > +			       struct hmm_pt *pt,
> > > > +			       unsigned long addr,
> > > > +			       unsigned long end);
> > > > +dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
> > > > +			       struct hmm_pt *pt,
> > > > +			       unsigned long addr);
> > > > +dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
> > > > +			      struct hmm_pt *pt,
> > > > +			      unsigned long addr);
> > > 
> > > I've got a few more thoughts on hmm_pt_iter after looking at some of the 
> > > later patches. I think I've convinced myself that this patch functionally 
> > > works as-is, but I've got some suggestions and questions about the design.
> > > 
> > > Right now there are these three major functions:
> > > 
> > > 1) hmm_pt_iter_update(addr)
> > >    - Returns the hmm_pte * for addr, or NULL if none exists.
> > > 
> > > 2) hmm_pt_iter_fault(addr)
> > >    - Returns the hmm_pte * for addr, allocating a new one if none exists.
> > > 
> > > 3) hmm_pt_iter_next(addr, end)
> > >    - Returns the next possibly-valid address. The caller must use
> > >      hmm_pt_iter_update to check if there really is an hmm_pte there.
> > > 
> > > In my view, there are two sources of confusion here:
> > > - Naming. "update" shares a name with the HMM mirror callback, and it also
> > >   implies that the page tables are "updated" as a result of the call. 
> > >   "fault" likewise implies that the function handles a fault in some way.
> > >   Neither of these implications are true.
> > 
> > Maybe hmm_pt_iter_walk & hmm_pt_iter_populate are better name ?
> 
> hmm_pt_iter_populate sounds good. See below for _walk.
> 
> 
> > 
> > 
> > > - hmm_pt_iter_next and hmm_pt_iter_update have some overlapping
> > >   functionality when compared to traditional iterators, requiring the 
> > >   callers to all do this sort of thing:
> > > 
> > >         hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
> > >         if (!hmm_pte) {
> > >             addr = hmm_pt_iter_next(&iter, &mirror->pt,
> > >                         addr, event->end);
> > >             continue;
> > >         }
> > > 
> > > Wouldn't it be more efficient and simpler to have _next do all the 
> > > iteration internally so it always returns the next valid entry? Then you 
> > > could combine _update and _next into a single function, something along 
> > > these lines (which also addresses the naming concern):
> > > 
> > > void hmm_pt_iter_init(iter, pt, start, end);
> > > unsigned long hmm_pt_iter_next(iter, hmm_pte *);
> > > unsigned long hmm_pt_iter_next_alloc(iter, hmm_pte *);
> > > 
> > > hmm_pt_iter_next would return the address and ptep of the next valid 
> > > entry, taking the place of the existing _update and _next functions. 
> > > hmm_pt_iter_next_alloc takes the place of _fault.
> > > 
> > > Also, since the _next functions don't take in an address, the iterator 
> > > doesn't have to handle the input addr being different from iter->cur.
> > 
> > It would still need to do the same kind of test, this test is really to
> > know when you switch from one directory to the next and to drop and take
> > reference accordingly.
> 
> But all of the directory references are already hidden entirely in the 
> iterator _update function. The caller only has to worry about taking 
> references on the bottom level, so I don't understand why the iterator 
> needs to return to the caller when it hits the end of a directory. Or for 
> that matter, why it returns every possible index within a directory to the 
> caller whether that index is valid or not.

Iterator is what protect against concurrent freeing of the directory so it
has to return to caller on directory boundary (for 64bits arch with 64bits
pte it has return every 512 entries). Otherwise pt_iter_fini() would have
to walk over the whole directory range again just to drop reference and this
doesn't sound like a good idea.

So really with what you are asking it whould be:

hmm_pt_iter_init(&iter, start, end);
for(next=pt_iter_next(&iter,&ptep); next<end; next=pt_iter_next(&iter,&ptep))
{
   // Here ptep is valid until next address. Above you have to call
   // pt_iter_next() to switch to next directory.
   addr = max(start, next - (~HMM_PMD_MASK + 1));
   for (; addr < next; addr += PAGE_SIZE, ptep++) {
      // access ptep
   }
}

My point is that internally pt_iter_next() will do the exact same test it is
doing now btw cur and addr. Just that the addr is no longer explicit but iter
infer it.

> If _next only returned to the caller when it hit a valid hmm_pte (or end), 
> then only one function would be needed (_next) instead of two 
> (_update/_walk and _next).

On the valid entry side, this is because when you are walking the page table
you have no garanty that the entry will not be clear below you (in case of
concurrent invalidation). The only garanty you have is that if you are able
to read a valid entry from the update() callback then this entry is valid
until you get a new update() callback telling you otherwise.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 07/36] HMM: add per mirror page table v3.
  2015-06-27  3:02       ` Mark Hairgrove
@ 2015-06-29 14:50         ` Jerome Glisse
  0 siblings, 0 replies; 80+ messages in thread
From: Jerome Glisse @ 2015-06-29 14:50 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

On Fri, Jun 26, 2015 at 08:02:03PM -0700, Mark Hairgrove wrote:
> On Fri, 26 Jun 2015, Jerome Glisse wrote:
> > On Thu, Jun 25, 2015 at 04:05:48PM -0700, Mark Hairgrove wrote:
> > > On Thu, 21 May 2015, j.glisse@gmail.com wrote:
> > > > From: Jerome Glisse <jglisse@redhat.com>
> > > > [...]
> > > >  
> > > > +	/* update() - update device mmu following an event.
> > > > +	 *
> > > > +	 * @mirror: The mirror that link process address space with the device.
> > > > +	 * @event: The event that triggered the update.
> > > > +	 * Returns: 0 on success or error code {-EIO, -ENOMEM}.
> > > > +	 *
> > > > +	 * Called to update device page table for a range of address.
> > > > +	 * The event type provide the nature of the update :
> > > > +	 *   - Range is no longer valid (munmap).
> > > > +	 *   - Range protection changes (mprotect, COW, ...).
> > > > +	 *   - Range is unmapped (swap, reclaim, page migration, ...).
> > > > +	 *   - Device page fault.
> > > > +	 *   - ...
> > > > +	 *
> > > > +	 * Thought most device driver only need to use pte_mask as it reflects
> > > > +	 * change that will happen to the HMM page table ie :
> > > > +	 *   new_pte = old_pte & event->pte_mask;
> > > 
> > > Documentation request: It would be useful to break down exactly what is 
> > > required from the driver for each event type here, and what extra 
> > > information is provided by the type that isn't provided by the pte_mask.
> > 
> > Mostly event tell you if you need to free or not the device page table for
> > the range, which is not something you can infer from the pte_mask reliably.
> > Difference btw migration and munmap for instance, same pte_mask but range
> > is still valid in the migration case it will just be backed by a new set of
> > pages.
> 
> Given that event->pte_mask and event->type provide redundant information, 
> are they both necessary?

Like said, you can not infer event->type from pte_mask but you can infer
pte_mask from event->type. The idea is behind providing pte_mask is that
simple driver can just use that with the iter walk and simply mask the HMM
page table entry they read ((*ptep) & pte_mask) to repopulate the device
page table.

So yes pte_mask is redundant but i think it will be useful for a range of
device driver.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/36] HMM: add HMM page table v2.
  2015-06-29 14:43         ` Jerome Glisse
@ 2015-07-01  2:51           ` Mark Hairgrove
  2015-07-01 15:07             ` Jerome Glisse
  0 siblings, 1 reply; 80+ messages in thread
From: Mark Hairgrove @ 2015-07-01  2:51 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

[-- Attachment #1: Type: text/plain, Size: 2869 bytes --]



On Mon, 29 Jun 2015, Jerome Glisse wrote:

> [...]
> 
> Iterator is what protect against concurrent freeing of the directory so it
> has to return to caller on directory boundary (for 64bits arch with 64bits
> pte it has return every 512 entries). Otherwise pt_iter_fini() would have
> to walk over the whole directory range again just to drop reference and this
> doesn't sound like a good idea.

I don't understand why it would have to return to the caller to unprotect 
the directory. The iterator would simply drop the reference to the 
previous directory, take a reference on the next one, and keep searching 
for a valid entry.

Why would pt_iter_fini have to walk over the entire range? The iterator 
would keep at most one directory per level referenced. _fini would walk 
the per-level ptd array and unprotect each level, the same way it does 
now.


> 
> So really with what you are asking it whould be:
> 
> hmm_pt_iter_init(&iter, start, end);
> for(next=pt_iter_next(&iter,&ptep); next<end; next=pt_iter_next(&iter,&ptep))
> {
>    // Here ptep is valid until next address. Above you have to call
>    // pt_iter_next() to switch to next directory.
>    addr = max(start, next - (~HMM_PMD_MASK + 1));
>    for (; addr < next; addr += PAGE_SIZE, ptep++) {
>       // access ptep
>    }
> }
> 
> My point is that internally pt_iter_next() will do the exact same test it is
> doing now btw cur and addr. Just that the addr is no longer explicit but iter
> infer it.

But this way, the iteration across directories is more efficient because 
the iterator can simply walk the directory array. Take a directory that 
has one valid entry at the very end. The existing iteration will do this:

hmm_pt_iter_next(dir_addr[0], end)
    Walk up the ptd array
    Compute level start and end and compare them to dir_addr[0]
    Compute dir_addr[1] using addr and pt->mask
    Return dir_addr[1]
hmm_pt_iter_update(dir_addr[1])
    Walk up the ptd array, compute level start and end
    Compute level index of dir_addr[1]
    Read entry for dir_addr[1]
    Return NULL
hmm_pt_iter_next(dir_addr[1], end)
    ...
And so on 511 times until the last entry is read.

This is really more suited to a for loop iteration, which it could be if 
this were fully contained within the _next call.

> 
> > If _next only returned to the caller when it hit a valid hmm_pte (or end), 
> > then only one function would be needed (_next) instead of two 
> > (_update/_walk and _next).
> 
> On the valid entry side, this is because when you are walking the page table
> you have no garanty that the entry will not be clear below you (in case of
> concurrent invalidation). The only garanty you have is that if you are able
> to read a valid entry from the update() callback then this entry is valid
> until you get a new update() callback telling you otherwise.
> 
> Cheers,
> Jerome
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/36] HMM: add HMM page table v2.
  2015-07-01  2:51           ` Mark Hairgrove
@ 2015-07-01 15:07             ` Jerome Glisse
  0 siblings, 0 replies; 80+ messages in thread
From: Jerome Glisse @ 2015-07-01 15:07 UTC (permalink / raw)
  To: Mark Hairgrove
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

On Tue, Jun 30, 2015 at 07:51:12PM -0700, Mark Hairgrove wrote:
> On Mon, 29 Jun 2015, Jerome Glisse wrote:
> > [...]
> > 
> > Iterator is what protect against concurrent freeing of the directory so it
> > has to return to caller on directory boundary (for 64bits arch with 64bits
> > pte it has return every 512 entries). Otherwise pt_iter_fini() would have
> > to walk over the whole directory range again just to drop reference and this
> > doesn't sound like a good idea.
> 
> I don't understand why it would have to return to the caller to unprotect 
> the directory. The iterator would simply drop the reference to the 
> previous directory, take a reference on the next one, and keep searching 
> for a valid entry.
> 
> Why would pt_iter_fini have to walk over the entire range? The iterator 
> would keep at most one directory per level referenced. _fini would walk 
> the per-level ptd array and unprotect each level, the same way it does 
> now.

I think here we are just misunderstanding each other. I am saying that
iterator have to return on directory boundary (ie when switching from one
directory to the next). The return part is not only for protection it is
also by design because iterator function should not test the page table
entry as different code path have different synchronization requirement.


> > So really with what you are asking it whould be:
> > 
> > hmm_pt_iter_init(&iter, start, end);
> > for(next=pt_iter_next(&iter,&ptep); next<end; next=pt_iter_next(&iter,&ptep))
> > {
> >    // Here ptep is valid until next address. Above you have to call
> >    // pt_iter_next() to switch to next directory.
> >    addr = max(start, next - (~HMM_PMD_MASK + 1));
> >    for (; addr < next; addr += PAGE_SIZE, ptep++) {
> >       // access ptep
> >    }
> > }
> > 
> > My point is that internally pt_iter_next() will do the exact same test it is
> > doing now btw cur and addr. Just that the addr is no longer explicit but iter
> > infer it.
> 
> But this way, the iteration across directories is more efficient because 
> the iterator can simply walk the directory array. Take a directory that 
> has one valid entry at the very end. The existing iteration will do this:
> 
> hmm_pt_iter_next(dir_addr[0], end)
>     Walk up the ptd array
>     Compute level start and end and compare them to dir_addr[0]
>     Compute dir_addr[1] using addr and pt->mask
>     Return dir_addr[1]
> hmm_pt_iter_update(dir_addr[1])
>     Walk up the ptd array, compute level start and end
>     Compute level index of dir_addr[1]
>     Read entry for dir_addr[1]
>     Return NULL
> hmm_pt_iter_next(dir_addr[1], end)
>     ...
> And so on 511 times until the last entry is read.
> 
> This is really more suited to a for loop iteration, which it could be if 
> this were fully contained within the _next call.

No, existing code does not necessarily do that. Current use pattern is :

for (addr = start; addr < end;) {
   ptep = hmm_pt_iter_update(iter, addr);
   if (!ptep) {
     addr = hmm_pt_iter_next(iter, addr, end);
     continue;
   }
   next = hmm_pt_level_next(pt, addr, end);
   for (; addr < next; addr += PAGE_SIZE, ptep++) {
     // Process addr using ptep.
   }
}

The inner loop is on directory boundary ie 512 entries on 64bits arch.
It is that way because on some case you do not want the iterator to
control the address, the outer loop might be accessing several different
mirror page table each might have different gap. So you really want to
have explicit address provided to the iterator function.

Also iterator can not really test for valid entry as locking requirement
and synchronization with other thread is different depending on which
code path is walking the page table. So testing inside the iterator
function is kind of pointless has the performed test might be no longer
revealent by the time it returns pointer and address to the caller.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* HMM (Heterogeneous Memory Management) v8
@ 2015-01-05 22:44 j.glisse
  0 siblings, 0 replies; 80+ messages in thread
From: j.glisse @ 2015-01-05 22:44 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, linux-fsdevel, Linda Wang,
	Kevin E Martin, Jerome Glisse, Jeff Law, Haggai Eran, Or Gerlitz,
	Sagi Grimberg

So a resend with corrections base on Haggai comments. This patchset is just
the ground foundation on to which we want to build our features set. Main
feature being migrating memory to device memory. The very first version of
this patchset already show cased proof of concept of much of the features.

Below is previous patchset cover letter pretty much unchanged as background
and motivation for it did not.


What it is ?

In a nutshell HMM is a subsystem that provide an easy to use api to mirror a
process address on a device with minimal hardware requirement (mainly device
page fault and read only page mapping). This does not rely on ATS and PASID
PCIE extensions. It intends to supersede those extensions by allowing to move
system memory to device memory in a transparent fashion for core kernel mm
code (ie cpu page fault on page residing in device memory will trigger
migration back to system memory).


Why doing this ?

We want to be able to mirror a process address space so that compute api such
as OpenCL or other similar api can start using the exact same address space on
the GPU as on the CPU. This will greatly simplify usages of those api. Moreover
we believe that we will see more and more specialize unit functions that will
want to mirror process address using their own mmu.

The migration side is simply because GPU memory bandwidth is far beyond than
system memory bandwith and there is no sign that this gap is closing (quite the
opposite).


Current status and future features :

None of this core code change in any major way core kernel mm code. This
is simple ground work with no impact on existing code path. Features that
will be implemented on top of this are :
  1 - Tansparently handle page mapping on behalf of device driver (DMA).
  2 - Improve DMA api to better match new usage pattern of HMM.
  3 - Migration of anonymous memory to device memory.
  4 - Locking memory to remote memory (CPU access trigger SIGBUS).
  5 - Access exclusion btw CPU and device for atomic operations.
  6 - Migration of file backed memory to device memory.


How future features will be implemented :
1 - Simply use existing DMA api to map page on behalf of a device.
2 - Introduce new DMA api to match new semantic of HMM. It is no longer page
    we map but address range and managing which page is effectively backing
    an address should be easy to update. I gave a presentation about that
    during this LPC.
3 - Requires change to cpu page fault code path to handle migration back to
    system memory on cpu access. An implementation of this was already sent
    as part of v1. This will be low impact and only add a new special swap
    type handling to existing fault code.
4 - Require a new syscall as i can not see which current syscall would be
    appropriate for this. My first feeling was to use mbind as it has the
    right semantic (binding a range of address to a device) but mbind is
    too numa centric.

    Second one was madvise, but semantic does not match, madvise does allow
    kernel to ignore them while we do want to block cpu access for as long
    as the range is bind to a device.

    So i do not think any of existing syscall can be extended with new flags
    but maybe i am wrong.
5 - Allowing to map a page as read only on the CPU while a device perform
    some atomic operation on it (this is mainly to work around system bus
    that do not support atomic memory access and sadly there is a large
    base of hardware without that feature).

    Easiest implementation would be using some page flags but there is none
    left. So it must be some flags in vma to know if there is a need to query
    HMM for write protection.

6 - This is the trickiest one to implement and while i showed a proof of
    concept with v1, i am still have a lot of conflictual feeling about how
    to achieve this.


As usual comments are more then welcome. Thanks in advance to anyone that
take a look at this code.

Previous patchset posting :
  v1 http://lwn.net/Articles/597289/
  v2 https://lkml.org/lkml/2014/6/12/559 (cover letter did not make it to ml)
  v3 https://lkml.org/lkml/2014/6/13/633
  v4 https://lkml.org/lkml/2014/8/29/423
  v5 https://lkml.org/lkml/2014/11/3/759
  v6 http://lwn.net/Articles/619737/

Cheers,
JA(C)rA'me

To: "Andrew Morton" <akpm@linux-foundation.org>,
Cc: <linux-kernel@vger.kernel.org>,
Cc: linux-mm <linux-mm@kvack.org>,
Cc: <linux-fsdevel@vger.kernel.org>,
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
Cc: "Mel Gorman" <mgorman@suse.de>,
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Cc: "Peter Zijlstra" <peterz@infradead.org>,
Cc: "Linda Wang" <lwang@redhat.com>,
Cc: "Kevin E Martin" <kem@redhat.com>,
Cc: "Jerome Glisse" <jglisse@redhat.com>,
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
Cc: "Johannes Weiner" <jweiner@redhat.com>,
Cc: "Larry Woodman" <lwoodman@redhat.com>,
Cc: "Rik van Riel" <riel@redhat.com>,
Cc: "Dave Airlie" <airlied@redhat.com>,
Cc: "Jeff Law" <law@redhat.com>,
Cc: "Brendan Conoboy" <blc@redhat.com>,
Cc: "Joe Donohue" <jdonohue@redhat.com>,
Cc: "Duncan Poole" <dpoole@nvidia.com>,
Cc: "Sherry Cheung" <SCheung@nvidia.com>,
Cc: "Subhash Gutti" <sgutti@nvidia.com>,
Cc: "John Hubbard" <jhubbard@nvidia.com>,
Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
Cc: "Lucien Dunning" <ldunning@nvidia.com>,
Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
Cc: "Haggai Eran" <haggaie@mellanox.com>,
Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
Cc: "Sagi Grimberg" <sagig@mellanox.com>
Cc: "Shachar Raindel" <raindel@mellanox.com>,
Cc: "Liran Liss" <liranl@mellanox.com>,
Cc: "Roland Dreier" <roland@purestorage.com>,
Cc: "Sander, Ben" <ben.sander@amd.com>,
Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
Cc: "Bridgman, John" <John.Bridgman@amd.com>,
Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
Cc: "Gabbay, Oded" <Oded.Gabbay@amd.com>,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2015-07-01 15:07 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-21 19:31 HMM (Heterogeneous Memory Management) v8 j.glisse
2015-05-21 19:31 ` [PATCH 01/36] mmu_notifier: add event information to address invalidation v7 j.glisse
2015-05-30  3:43   ` John Hubbard
2015-06-01 19:03     ` Jerome Glisse
2015-06-01 23:10       ` John Hubbard
2015-06-03 16:07         ` Jerome Glisse
2015-06-03 23:02           ` John Hubbard
2015-05-21 19:31 ` [PATCH 02/36] mmu_notifier: keep track of active invalidation ranges v3 j.glisse
2015-05-27  5:09   ` Aneesh Kumar K.V
2015-05-27 14:32     ` Jerome Glisse
2015-06-02  9:32   ` John Hubbard
2015-06-03 17:15     ` Jerome Glisse
2015-06-05  3:29       ` John Hubbard
2015-05-21 19:31 ` [PATCH 03/36] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() j.glisse
2015-05-27  5:17   ` Aneesh Kumar K.V
2015-05-27 14:33     ` Jerome Glisse
2015-06-03  4:25   ` John Hubbard
2015-05-21 19:31 ` [PATCH 04/36] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier j.glisse
2015-05-21 19:31 ` [PATCH 05/36] HMM: introduce heterogeneous memory management v3 j.glisse
2015-05-27  5:50   ` Aneesh Kumar K.V
2015-05-27 14:38     ` Jerome Glisse
2015-06-08 19:40   ` Mark Hairgrove
2015-06-08 21:17     ` Jerome Glisse
2015-06-09  1:54       ` Mark Hairgrove
2015-06-09 15:56         ` Jerome Glisse
2015-06-10  3:33           ` Mark Hairgrove
2015-06-10 15:42             ` Jerome Glisse
2015-06-11  1:15               ` Mark Hairgrove
2015-06-11 14:23                 ` Jerome Glisse
2015-06-11 22:26                   ` Mark Hairgrove
2015-06-15 14:32                     ` Jerome Glisse
2015-05-21 19:31 ` [PATCH 06/36] HMM: add HMM page table v2 j.glisse
2015-06-19  2:06   ` Mark Hairgrove
2015-06-19 18:07     ` Jerome Glisse
2015-06-20  2:34       ` Mark Hairgrove
2015-06-25 22:57   ` Mark Hairgrove
2015-06-26 16:30     ` Jerome Glisse
2015-06-27  1:34       ` Mark Hairgrove
2015-06-29 14:43         ` Jerome Glisse
2015-07-01  2:51           ` Mark Hairgrove
2015-07-01 15:07             ` Jerome Glisse
2015-05-21 19:31 ` [PATCH 07/36] HMM: add per mirror page table v3 j.glisse
2015-06-25 23:05   ` Mark Hairgrove
2015-06-26 16:43     ` Jerome Glisse
2015-06-27  3:02       ` Mark Hairgrove
2015-06-29 14:50         ` Jerome Glisse
2015-05-21 19:31 ` [PATCH 08/36] HMM: add device page fault support v3 j.glisse
2015-05-21 19:31 ` [PATCH 09/36] HMM: add mm page table iterator helpers j.glisse
2015-05-21 19:31 ` [PATCH 10/36] HMM: use CPU page table during invalidation j.glisse
2015-05-21 19:31 ` [PATCH 11/36] HMM: add discard range helper (to clear and free resources for a range) j.glisse
2015-05-21 19:31 ` [PATCH 12/36] HMM: add dirty range helper (to toggle dirty bit inside mirror page table) j.glisse
2015-05-21 19:31 ` [PATCH 13/36] HMM: DMA map memory on behalf of device driver j.glisse
2015-05-21 19:31 ` [PATCH 14/36] fork: pass the dst vma to copy_page_range() and its sub-functions j.glisse
2015-05-21 19:31 ` [PATCH 15/36] memcg: export get_mem_cgroup_from_mm() j.glisse
2015-05-21 19:31 ` [PATCH 16/36] HMM: add special swap filetype for memory migrated to HMM device memory j.glisse
2015-06-24  7:49   ` Haggai Eran
2015-05-21 19:31 ` [PATCH 17/36] HMM: add new HMM page table flag (valid device memory) j.glisse
2015-05-21 19:31 ` [PATCH 18/36] HMM: add new HMM page table flag (select flag) j.glisse
2015-05-21 19:31 ` [PATCH 19/36] HMM: handle HMM device page table entry on mirror page table fault and update j.glisse
2015-05-21 20:22 ` [PATCH 20/36] HMM: mm add helper to update page table when migrating memory back jglisse
2015-05-21 20:22   ` [PATCH 21/36] HMM: mm add helper to update page table when migrating memory jglisse
2015-05-21 20:22   ` [PATCH 22/36] HMM: add new callback for copying memory from and to device memory jglisse
2015-05-21 20:22   ` [PATCH 23/36] HMM: allow to get pointer to spinlock protecting a directory jglisse
2015-05-21 20:23   ` [PATCH 24/36] HMM: split DMA mapping function in two jglisse
2015-05-21 20:23   ` [PATCH 25/36] HMM: add helpers for migration back to system memory jglisse
2015-05-21 20:23   ` [PATCH 26/36] HMM: fork copy migrated memory into system memory for child process jglisse
2015-05-21 20:23   ` [PATCH 27/36] HMM: CPU page fault on migrated memory jglisse
2015-05-21 20:23   ` [PATCH 28/36] HMM: add mirror fault support for system to device memory migration jglisse
2015-05-21 20:23   ` [PATCH 29/36] IB/mlx5: add a new paramter to __mlx_ib_populated_pas for ODP with HMM jglisse
2015-05-21 20:23   ` [PATCH 30/36] IB/mlx5: add a new paramter to mlx5_ib_update_mtt() " jglisse
2015-05-21 20:23   ` [PATCH 31/36] IB/odp: export rbt_ib_umem_for_each_in_range() jglisse
2015-05-21 20:23   ` [PATCH 32/36] IB/odp/hmm: add new kernel option to use HMM for ODP jglisse
2015-05-21 20:23   ` [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM jglisse
2015-06-24 13:59     ` Haggai Eran
2015-05-21 20:23   ` [PATCH 34/36] IB/mlx5/hmm: add mlx5 HMM device initialization and callback jglisse
2015-05-21 20:23   ` [PATCH 35/36] IB/mlx5/hmm: add page fault support for ODP on HMM jglisse
2015-05-21 20:23   ` [PATCH 36/36] IB/mlx5/hmm: enable ODP using HMM jglisse
2015-05-30  3:01 ` HMM (Heterogeneous Memory Management) v8 John Hubbard
2015-05-31  6:56 ` Haggai Eran
  -- strict thread matches above, loose matches on Subject: below --
2015-01-05 22:44 j.glisse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).