* [PATCH v11 00/14] HMM anonymous memory migration to device memory
@ 2015-10-21 21:10 ` Jérôme Glisse
  0 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher

Minor fixes since the last post; applies on top of 4.3-rc6. Tree with
the patchset:

git://people.freedesktop.org/~glisse/linux hmm-v11 branch

This patchset implements anonymous memory migration for HMM.
See the HMM patchset for a full description of what HMM is and
why it is needed:

https://lkml.org/lkml/2015/10/21/739

Migration from system memory to device memory is seamless: on CPU
access we migrate the memory back to system memory so the CPU can
access it again.

The design is simple: a new special swap type is added and the CPU ptes
of migrated memory are set to this special swap type. On a CPU page
fault, HMM uses its mirror page table to find the proper page in device
memory and migrates it back to system memory.
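
For reference, this is roughly what the CPU fault dispatch boils down
to, condensed from the do_swap_page() hunk added in patch 02 of this
series (error paths elided):

    /* do_swap_page(): a non-present pte whose swap type is SWP_HMM is
     * handed to HMM, which migrates the page back from device memory.
     */
    entry = pte_to_swp_entry(orig_pte);
    if (is_hmm_entry(entry))
            ret = hmm_handle_cpu_fault(mm, vma, pmd, address,
                                       flags, orig_pte);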

Migration to device memory involves several steps:
  - First the CPU page table is updated to the special pte and the
    current pte is saved to a temporary array.
  - We check that all ptes are for normal/real pages.
  - We check that no one holds an extra reference on the page.
  - At this point we know we are the only one who knows about that
    memory and we can safely copy it to device memory.
  - Once everything is copied and fine on the device side we free
    the system RAM pages.

Migration from device memory back to system memory is simpler:
  - We get exclusive access to each pte we want to migrate back
    (special swap pte value).
  - We allocate system memory (memcg and anon_vma are handled here).
  - We copy the device memory content back into system memory and
    update the device page table to point to system memory.
  - We update the CPU page table to point to the new system memory.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v11 01/14] fork: pass the dst vma to copy_page_range() and its sub-functions.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

For HMM we will need to resort to the old way of allocating a new page
for anonymous memory when that anonymous memory has been migrated
to device memory.

This does not impact any process that does not use HMM through some
device driver. Only processes that migrate anonymous memory to device
memory with HMM will have to copy migrated pages on fork.

We do not expect this to be a common or advised thing to do, so we
resort to the simpler solution of allocating a new page. If this kind
of usage turns out to be important we will revisit ways to achieve
COW even for remote memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/mm.h |  5 +++--
 kernel/fork.c      |  2 +-
 mm/memory.c        | 33 +++++++++++++++++++++------------
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f967a1..18f27afd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,8 +1162,9 @@ int walk_page_range(unsigned long addr, unsigned long end,
 int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk);
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
-int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		    struct vm_area_struct *dst_vma,
+		    struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
diff --git a/kernel/fork.c b/kernel/fork.c
index 631c398..74ad33c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -498,7 +498,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/memory.c b/mm/memory.c
index 77bbbf3..bbab5e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -886,8 +886,10 @@ out_set_pte:
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		   unsigned long addr, unsigned long end)
+			  pmd_t *dst_pmd, pmd_t *src_pmd,
+			  struct vm_area_struct *dst_vma,
+			  struct vm_area_struct *vma,
+			  unsigned long addr, unsigned long end)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
@@ -948,9 +950,12 @@ again:
 	return 0;
 }
 
-static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+static inline int copy_pmd_range(struct mm_struct *dst_mm,
+				 struct mm_struct *src_mm,
+				 pud_t *dst_pud, pud_t *src_pud,
+				 struct vm_area_struct *dst_vma,
+				 struct vm_area_struct *vma,
+				 unsigned long addr, unsigned long end)
 {
 	pmd_t *src_pmd, *dst_pmd;
 	unsigned long next;
@@ -975,15 +980,18 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+				   dst_vma, vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
 
-static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+static inline int copy_pud_range(struct mm_struct *dst_mm,
+				 struct mm_struct *src_mm,
+				 pgd_t *dst_pgd, pgd_t *src_pgd,
+				 struct vm_area_struct *dst_vma,
+				 struct vm_area_struct *vma,
+				 unsigned long addr, unsigned long end)
 {
 	pud_t *src_pud, *dst_pud;
 	unsigned long next;
@@ -997,14 +1005,15 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+				   dst_vma, vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		    struct vm_area_struct *dst_vma,
+		    struct vm_area_struct *vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
@@ -1058,7 +1067,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					    dst_vma, vma, addr, next))) {
 			ret = -ENOMEM;
 			break;
 		}
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v11 02/14] HMM: add special swap filetype for memory migrated to device v2.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jerome Glisse, Jatin Kumar

From: Jerome Glisse <jglisse@redhat.com>

When migrating anonymous memory from system memory to device memory,
CPU ptes are replaced with a special HMM swap entry so that page
faults, get_user_pages() (gup), fork, ... are properly redirected to
the HMM helpers.

This patch only adds the new swap type entry and hooks the HMM helper
functions into the page fault and fork code paths.
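
As a rough sketch of how these entries are meant to be used (the
actual migration call sites only land in later patches of the series):

    /* At migration time, under the pte lock, the CPU pte is replaced
     * by the special HMM swap entry while the page lives in device
     * memory.
     */
    set_pte_at(mm, addr, ptep, swp_entry_to_pte(make_hmm_entry()));

    /* Fault/fork paths recognize the entry by its swap type; the
     * offset encodes its state (0 normal, 1 locked, 2 poisonous).
     */
    swp_entry_t entry = pte_to_swp_entry(*ptep);
    if (is_hmm_entry(entry) && !is_hmm_entry_locked(entry))
            /* page can be migrated back to system memory */;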

Changed since v1:
  - Fix name of the HMM CPU page fault function.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h     | 34 ++++++++++++++++++++++++++++++++++
 include/linux/swap.h    | 13 ++++++++++++-
 include/linux/swapops.h | 43 ++++++++++++++++++++++++++++++++++++++++++-
 mm/hmm.c                | 21 +++++++++++++++++++++
 mm/memory.c             | 22 ++++++++++++++++++++++
 5 files changed, 131 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4bc132a..7c66513 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -272,6 +272,40 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
 			    unsigned long start,
 			    unsigned long end);
 
+int hmm_handle_cpu_fault(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pmd_t *pmdp, unsigned long addr,
+			unsigned flags, pte_t orig_pte);
+
+int hmm_mm_fork(struct mm_struct *src_mm,
+		struct mm_struct *dst_mm,
+		struct vm_area_struct *dst_vma,
+		pmd_t *dst_pmd,
+		unsigned long start,
+		unsigned long end);
+
+#else /* CONFIG_HMM */
+
+static inline int hmm_handle_cpu_fault(struct mm_struct *mm,
+				       struct vm_area_struct *vma,
+				       pmd_t *pmdp, unsigned long addr,
+				       unsigned flags, pte_t orig_pte)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static inline int hmm_mm_fork(struct mm_struct *src_mm,
+			      struct mm_struct *dst_mm,
+			      struct vm_area_struct *dst_vma,
+			      pmd_t *dst_pmd,
+			      unsigned long start,
+			      unsigned long end)
+{
+	BUG();
+	return -ENOMEM;
+}
 
 #endif /* CONFIG_HMM */
+
+
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dcc..5c8b871 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
 #define SWP_HWPOISON_NUM 0
 #endif
 
+/*
+ * HMM (heterogeneous memory management) used when data is in remote memory.
+ */
+#ifdef CONFIG_HMM
+#define SWP_HMM_NUM 1
+#define SWP_HMM		(MAX_SWAPFILES + SWP_MIGRATION_NUM + SWP_HWPOISON_NUM)
+#else
+#define SWP_HMM_NUM 0
+#endif
+
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - \
+	 SWP_HWPOISON_NUM - SWP_HMM_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3..8c6ba9f 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -227,7 +227,7 @@ static inline void num_poisoned_pages_inc(void)
 }
 #endif
 
-#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || defined(CONFIG_HMM)
 static inline int non_swap_entry(swp_entry_t entry)
 {
 	return swp_type(entry) >= MAX_SWAPFILES;
@@ -239,4 +239,45 @@ static inline int non_swap_entry(swp_entry_t entry)
 }
 #endif
 
+#ifdef CONFIG_HMM
+static inline swp_entry_t make_hmm_entry(void)
+{
+	/* We do not store anything inside the CPU page table entry (pte). */
+	return swp_entry(SWP_HMM, 0);
+}
+
+static inline swp_entry_t make_hmm_entry_locked(void)
+{
+	/* We do not store anything inside the CPU page table entry (pte). */
+	return swp_entry(SWP_HMM, 1);
+}
+
+static inline swp_entry_t make_hmm_entry_poisonous(void)
+{
+	/* We do not store anything inside the CPU page table entry (pte). */
+	return swp_entry(SWP_HMM, 2);
+}
+
+static inline int is_hmm_entry(swp_entry_t entry)
+{
+	return (swp_type(entry) == SWP_HMM);
+}
+
+static inline int is_hmm_entry_locked(swp_entry_t entry)
+{
+	return (swp_type(entry) == SWP_HMM) && (swp_offset(entry) == 1);
+}
+
+static inline int is_hmm_entry_poisonous(swp_entry_t entry)
+{
+	return (swp_type(entry) == SWP_HMM) && (swp_offset(entry) == 2);
+}
+#else /* CONFIG_HMM */
+static inline int is_hmm_entry(swp_entry_t swp)
+{
+	return 0;
+}
+#endif /* CONFIG_HMM */
+
+
 #endif /* _LINUX_SWAPOPS_H */
diff --git a/mm/hmm.c b/mm/hmm.c
index 9e5017a..7fb493f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -416,6 +416,27 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
 };
 
 
+int hmm_handle_cpu_fault(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pmd_t *pmdp, unsigned long addr,
+			unsigned flags, pte_t orig_pte)
+{
+	return VM_FAULT_SIGBUS;
+}
+EXPORT_SYMBOL(hmm_handle_cpu_fault);
+
+int hmm_mm_fork(struct mm_struct *src_mm,
+		struct mm_struct *dst_mm,
+		struct vm_area_struct *dst_vma,
+		pmd_t *dst_pmd,
+		unsigned long start,
+		unsigned long end)
+{
+	return -ENOMEM;
+}
+EXPORT_SYMBOL(hmm_mm_fork);
+
+
 struct mm_pt_iter {
 	struct mm_struct	*mm;
 	pte_t			*ptep;
diff --git a/mm/memory.c b/mm/memory.c
index bbab5e9..08bc37e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -53,6 +53,7 @@
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
@@ -894,9 +895,11 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
+	unsigned cnt_hmm_entry = 0;
 	int progress = 0;
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
+	unsigned long start;
 
 again:
 	init_rss_vec(rss);
@@ -910,6 +913,7 @@ again:
 	orig_src_pte = src_pte;
 	orig_dst_pte = dst_pte;
 	arch_enter_lazy_mmu_mode();
+	start = addr;
 
 	do {
 		/*
@@ -926,6 +930,12 @@ again:
 			progress++;
 			continue;
 		}
+		if (unlikely(!pte_present(*src_pte))) {
+			entry = pte_to_swp_entry(*src_pte);
+
+			if (is_hmm_entry(entry))
+				cnt_hmm_entry++;
+		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
 							vma, addr, rss);
 		if (entry.val)
@@ -940,6 +950,15 @@ again:
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
 	cond_resched();
 
+	if (cnt_hmm_entry) {
+		int ret;
+
+		ret = hmm_mm_fork(src_mm, dst_mm, dst_vma,
+				  dst_pmd, start, end);
+		if (ret)
+			return ret;
+	}
+
 	if (entry.val) {
 		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
 			return -ENOMEM;
@@ -2489,6 +2508,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			migration_entry_wait(mm, pmd, address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_hmm_entry(entry)) {
+			ret = hmm_handle_cpu_fault(mm, vma, pmd, address,
+						   flags, orig_pte);
 		} else {
 			print_bad_pte(vma, address, orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v11 03/14] HMM: add new HMM page table flag (valid device memory).
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

For memory migrated to a device we need a new type of memory entry.
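
A small usage sketch of the new helpers (the device address below is
made up):

    /* Encode a mirror page table entry for a page that now lives in
     * device memory, keyed by its address in that memory.
     */
    dma_addr_t dev_addr = 0x10000000;	/* hypothetical device address */
    dma_addr_t pte = hmm_pte_from_dev_addr(dev_addr);

    /* Device entries are distinguished from pfn/dma entries by the new
     * valid bit, and the device address can be recovered from them.
     */
    if (hmm_pte_test_valid_dev(&pte))
            dev_addr = hmm_pte_dev_addr(pte);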

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm_pt.h | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 8a59a75..b017aa7 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -74,10 +74,11 @@ static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
  * In the first case the device driver must ignore any pfn entry as they might
  * show as transient state while HMM is mapping the page.
  */
-#define HMM_PTE_VALID_DMA_BIT	0
-#define HMM_PTE_VALID_PFN_BIT	1
-#define HMM_PTE_WRITE_BIT	2
-#define HMM_PTE_DIRTY_BIT	3
+#define HMM_PTE_VALID_DEV_BIT	0
+#define HMM_PTE_VALID_DMA_BIT	1
+#define HMM_PTE_VALID_PFN_BIT	2
+#define HMM_PTE_WRITE_BIT	3
+#define HMM_PTE_DIRTY_BIT	4
 /*
  * Reserve some bits for device driver private flags. Note that thus can only
  * be manipulated using the hmm_pte_*_bit() sets of helpers.
@@ -85,7 +86,7 @@ static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
  * WARNING ONLY SET/CLEAR THOSE FLAG ON PTE ENTRY THAT HAVE THE VALID BIT SET
  * AS OTHERWISE ANY BIT SET BY THE DRIVER WILL BE OVERWRITTEN BY HMM.
  */
-#define HMM_PTE_HW_SHIFT	4
+#define HMM_PTE_HW_SHIFT	8
 
 #define HMM_PTE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
 #define HMM_PTE_DMA_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
@@ -166,6 +167,7 @@ static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
 	HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
 	HMM_PTE_TEST_AND_SET_BIT(name, bit)
 
+HMM_PTE_BIT_HELPER(valid_dev, HMM_PTE_VALID_DEV_BIT)
 HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
 HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
 HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
@@ -176,11 +178,23 @@ static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
 	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
 }
 
+static inline dma_addr_t hmm_pte_from_dev_addr(dma_addr_t dma_addr)
+{
+	return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DEV_BIT);
+}
+
 static inline dma_addr_t hmm_pte_from_dma_addr(dma_addr_t dma_addr)
 {
 	return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DMA_BIT);
 }
 
+static inline dma_addr_t hmm_pte_dev_addr(dma_addr_t pte)
+{
+	/* FIXME Use max dma addr instead of 0 ? */
+	return hmm_pte_test_valid_dev(&pte) ? (pte & HMM_PTE_DMA_MASK) :
+					      (dma_addr_t)-1UL;
+}
+
 static inline dma_addr_t hmm_pte_dma_addr(dma_addr_t pte)
 {
 	/* FIXME Use max dma addr instead of 0 ? */
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v11 04/14] HMM: add new HMM page table flag (select flag).
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

When migrating memory, the same array of HMM page table entries might
be used with several different devices. Add a new select flag so the
current device driver callback can know which entries are selected for
its device.
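
A device driver copy callback walking the shared hmm_pte array would
then only act on the entries tagged for it, along these lines (the
loop bounds are hypothetical):

    /* Skip entries that are not valid or that HMM selected for another
     * device; the same array may carry entries for several devices.
     */
    for (i = 0; i < npages; i++) {
            if (!hmm_pte_test_valid_pfn(&hmm_pte[i]) ||
                !hmm_pte_test_select(&hmm_pte[i]))
                    continue;
            /* ... queue the copy to device memory for this page ... */
    }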

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/hmm_pt.h | 6 ++++--
 mm/hmm.c               | 5 ++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index b017aa7..f745d6c 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -77,8 +77,9 @@ static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
 #define HMM_PTE_VALID_DEV_BIT	0
 #define HMM_PTE_VALID_DMA_BIT	1
 #define HMM_PTE_VALID_PFN_BIT	2
-#define HMM_PTE_WRITE_BIT	3
-#define HMM_PTE_DIRTY_BIT	4
+#define HMM_PTE_SELECT		3
+#define HMM_PTE_WRITE_BIT	4
+#define HMM_PTE_DIRTY_BIT	5
 /*
  * Reserve some bits for device driver private flags. Note that thus can only
  * be manipulated using the hmm_pte_*_bit() sets of helpers.
@@ -170,6 +171,7 @@ static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
 HMM_PTE_BIT_HELPER(valid_dev, HMM_PTE_VALID_DEV_BIT)
 HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
 HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
+HMM_PTE_BIT_HELPER(select, HMM_PTE_SELECT)
 HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
 HMM_PTE_BIT_HELPER(write, HMM_PTE_WRITE_BIT)
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 7fb493f..1c81c68 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -745,6 +745,7 @@ static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
 			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
 			if (pmd_write(*pmdp))
 				hmm_pte_set_write(&hmm_pte[i]);
+			hmm_pte_set_select(&hmm_pte[i]);
 		} while (addr += PAGE_SIZE, pfn++, i++, addr != next);
 		hmm_pt_iter_directory_unlock(iter);
 		mirror_fault->addr = addr;
@@ -821,6 +822,7 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
 			if (pte_write(*ptep))
 				hmm_pte_set_write(&hmm_pte[i]);
+			hmm_pte_set_select(&hmm_pte[i]);
 		} while (addr += PAGE_SIZE, ptep++, i++, addr != next);
 		hmm_pt_iter_directory_unlock(iter);
 		pte_unmap(ptep - 1);
@@ -912,7 +914,8 @@ static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
 
 again:
 			pte = ACCESS_ONCE(hmm_pte[i]);
-			if (!hmm_pte_test_valid_pfn(&pte)) {
+			if (!hmm_pte_test_valid_pfn(&pte) ||
+			    !hmm_pte_test_select(&pte)) {
 				if (!hmm_pte_test_valid_dma(&pte)) {
 					ret = -ENOENT;
 					break;
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v11 05/14] HMM: handle HMM device page table entry on mirror page table fault and update.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

When faulting or updating the device page table, properly handle the
case of a device memory entry.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index 1c81c68..6224131 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -607,6 +607,13 @@ static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
 		goto out;
 	}
 
+	if (hmm_pte_test_valid_dev(hmm_pte)) {
+		*hmm_pte &= event->pte_mask;
+		if (!hmm_pte_test_valid_dev(hmm_pte))
+			hmm_pt_iter_directory_unref(iter);
+		return;
+	}
+
 	if (!hmm_pte_test_valid_dma(hmm_pte))
 		return;
 
@@ -804,6 +811,12 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 		ptep = pte_offset_map(pmdp, start);
 		hmm_pt_iter_directory_lock(iter);
 		do {
+			if (hmm_pte_test_valid_dev(&hmm_pte[i])) {
+				if (write)
+					hmm_pte_set_write(&hmm_pte[i]);
+				continue;
+			}
+
 			if (!pte_present(*ptep) ||
 			    (write && !pte_write(*ptep)) ||
 			    pte_protnone(*ptep)) {
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v11 06/14] HMM: mm add helper to update page table when migrating memory back v2.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

To migrate memory back we first need to lock the HMM special CPU page
table entries so we know no one else might try to migrate those
entries back. The helper also allocates the new pages into which data
will be copied back from the device. Then we can proceed with the
device DMA operation.

Once the DMA is done we can update the CPU page table again to point
to the new pages that hold the content copied back from device memory.

Note that we do not need to invalidate the range as we are only
modifying non-present CPU page table entries.
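
The intended calling sequence on the HMM side is roughly the following
(the device DMA copy in the middle is a placeholder for the driver
callback):

    /* 1) Lock the HMM swap entries in [start, end) and allocate the
     *    system pages that will receive the data; new_pte is filled
     *    with the ptes for those pages.
     */
    ret = mm_hmm_migrate_back(mm, vma, new_pte, start, end);
    if (ret)
            return ret;

    /* 2) Device driver DMAs the content back from device memory into
     *    the newly allocated system pages.
     */

    /* 3) Make the new pages visible: CPU ptes are switched from the
     *    locked HMM entries to new_pte, using hmm_pte to know which
     *    entries were successfully copied.
     */
    mm_hmm_migrate_back_cleanup(mm, vma, new_pte, hmm_pte, start, end);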

Changed since v1:
  - Save the memcg against which each page is precharged as it might
    change along the way.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/mm.h |  12 +++
 mm/memory.c        | 257 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 269 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 18f27afd..3cb884f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2344,6 +2344,18 @@ static inline void hmm_mm_init(struct mm_struct *mm)
 {
 	mm->hmm = NULL;
 }
+
+int mm_hmm_migrate_back(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pte_t *new_pte,
+			unsigned long start,
+			unsigned long end);
+void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
+				 struct vm_area_struct *vma,
+				 pte_t *new_pte,
+				 dma_addr_t *hmm_pte,
+				 unsigned long start,
+				 unsigned long end);
 #else /* !CONFIG_HMM */
 static inline void hmm_mm_init(struct mm_struct *mm)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 08bc37e..4b90e8b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3503,6 +3503,263 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+
+#ifdef CONFIG_HMM
+/* mm_hmm_migrate_back() - lock HMM CPU page table entry and allocate new page.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @new_pte: Array of new CPU page table entry value.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This function will lock HMM page table entry and allocate new page for entry
+ * it successfully locked.
+ */
+int mm_hmm_migrate_back(struct mm_struct *mm,
+			struct vm_area_struct *vma,
+			pte_t *new_pte,
+			unsigned long start,
+			unsigned long end)
+{
+	pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
+	unsigned long addr, i;
+	int ret = 0;
+
+	VM_BUG_ON(vma->vm_ops || (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+
+	if (unlikely(anon_vma_prepare(vma)))
+		return -ENOMEM;
+
+	start &= PAGE_MASK;
+	end = PAGE_ALIGN(end);
+	memset(new_pte, 0, sizeof(pte_t) * ((end - start) >> PAGE_SHIFT));
+
+	for (addr = start; addr < end;) {
+		unsigned long cstart, next;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_offset(pgdp, addr);
+		/*
+		 * Some other thread might already have migrated back the entry
+		 * and freed the page table. Unlikely thought.
+		 */
+		if (unlikely(!pudp)) {
+			addr = min((addr + PUD_SIZE) & PUD_MASK, end);
+			continue;
+		}
+		pmdp = pmd_offset(pudp, addr);
+		if (unlikely(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+			     pmd_trans_huge(*pmdp))) {
+			addr = min((addr + PMD_SIZE) & PMD_MASK, end);
+			continue;
+		}
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (cstart = addr, i = (addr - start) >> PAGE_SHIFT,
+		     next = min((addr + PMD_SIZE) & PMD_MASK, end);
+		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
+			swp_entry_t entry;
+
+			entry = pte_to_swp_entry(*ptep);
+			if (pte_none(*ptep) || pte_present(*ptep) ||
+			    !is_hmm_entry(entry) ||
+			    is_hmm_entry_locked(entry))
+				continue;
+
+			set_pte_at(mm, addr, ptep, hmm_entry);
+			new_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
+						   vma->vm_page_prot));
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+
+		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
+		     addr < next; addr += PAGE_SIZE, i++) {
+			struct mem_cgroup *memcg;
+			struct page *page;
+
+			if (!pte_present(new_pte[i]))
+				continue;
+
+			page = alloc_zeroed_user_highpage_movable(vma, addr);
+			if (!page) {
+				ret = -ENOMEM;
+				break;
+			}
+			__SetPageUptodate(page);
+			if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
+						  &memcg)) {
+				page_cache_release(page);
+				ret = -ENOMEM;
+				break;
+			}
+			/*
+			 * We can safely reuse the s_mem/mapping field of page
+			 * struct to store the memcg as the page is only seen
+			 * by HMM at this point and we can clear it before it
+			 * is public see mm_hmm_migrate_back_cleanup().
+			 */
+			page->s_mem = memcg;
+			new_pte[i] = mk_pte(page, vma->vm_page_prot);
+			if (vma->vm_flags & VM_WRITE) {
+				new_pte[i] = pte_mkdirty(new_pte[i]);
+				new_pte[i] = pte_mkwrite(new_pte[i]);
+			}
+		}
+
+		if (!ret)
+			continue;
+
+		hmm_entry = swp_entry_to_pte(make_hmm_entry());
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
+		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
+			unsigned long pfn = pte_pfn(new_pte[i]);
+
+			if (!pte_present(new_pte[i]) || !is_zero_pfn(pfn))
+				continue;
+
+			set_pte_at(mm, addr, ptep, hmm_entry);
+			pte_clear(mm, addr, &new_pte[i]);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+		break;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(mm_hmm_migrate_back);
+
+/* mm_hmm_migrate_back_cleanup() - set CPU page table entry to new page.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @new_pte: Array of new CPU page table entry value.
+ * @hmm_pte: Array of HMM table entry indicating if migration was successful.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This is called after mm_hmm_migrate_back() and after the effective migration.
+ * It will set the CPU page table entries to new values pointing to the newly
+ * allocated pages where the data was copied back from device memory.
+ *
+ * Any failure will trigger a VM_BUG_ON().
+ *
+ * TODO: For copy failure we might simply set a new value for the HMM special
+ * entry indicating poisonous entry.
+ */
+void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
+				 struct vm_area_struct *vma,
+				 pte_t *new_pte,
+				 dma_addr_t *hmm_pte,
+				 unsigned long start,
+				 unsigned long end)
+{
+	pte_t hmm_poison = swp_entry_to_pte(make_hmm_entry_poisonous());
+	unsigned long addr, i;
+
+	for (addr = start; addr < end;) {
+		unsigned long cstart, next, free_pages;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		/*
+		 * We know for certain that we did set special swap entries for
+		 * the range and that the HMM entries are marked as locked, so
+		 * no one besides us can modify them, which implies that all
+		 * levels of the CPU page table are valid.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_offset(pgdp, addr);
+		VM_BUG_ON(!pudp);
+		pmdp = pmd_offset(pudp, addr);
+		VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+			  pmd_trans_huge(*pmdp));
+
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
+		     cstart = addr, i = (addr - start) >> PAGE_SHIFT,
+		     free_pages = 0; addr < next; addr += PAGE_SIZE,
+		     ptep++, i++) {
+			struct mem_cgroup *memcg;
+			swp_entry_t entry;
+			struct page *page;
+
+			if (!pte_present(new_pte[i]))
+				continue;
+
+			entry = pte_to_swp_entry(*ptep);
+
+			/*
+			 * Sanity catch all the things that could go wrong but
+			 * should not, no plan B here.
+			 */
+			VM_BUG_ON(pte_none(*ptep));
+			VM_BUG_ON(pte_present(*ptep));
+			VM_BUG_ON(!is_hmm_entry_locked(entry));
+
+			if (!hmm_pte_test_valid_dma(&hmm_pte[i]) &&
+			    !hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				set_pte_at(mm, addr, ptep, hmm_poison);
+				free_pages++;
+				continue;
+			}
+
+			page = pte_page(new_pte[i]);
+
+			/*
+			 * Up to now the s_mem/mapping field stored the memcg
+			 * against which the page was pre-charged. Save it and
+			 * clear the field so PageAnon() returns false.
+			 */
+			memcg = page->s_mem;
+			page->s_mem = NULL;
+
+			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			page_add_new_anon_rmap(page, vma, addr);
+			mem_cgroup_commit_charge(page, memcg, false);
+			lru_cache_add_active_or_unevictable(page, vma);
+			set_pte_at(mm, addr, ptep, new_pte[i]);
+			update_mmu_cache(vma, addr, ptep);
+			pte_clear(mm, addr, &new_pte[i]);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+
+		if (!free_pages)
+			continue;
+
+		for (addr = cstart, i = (addr - start) >> PAGE_SHIFT;
+		     addr < next; addr += PAGE_SIZE, i++) {
+			struct mem_cgroup *memcg;
+			struct page *page;
+
+			if (!pte_present(new_pte[i]))
+				continue;
+
+			page = pte_page(new_pte[i]);
+
+			/*
+			 * Up to now the s_mem/mapping field stored the memcg
+			 * against which the page was pre-charged.
+			 */
+			memcg = page->s_mem;
+			page->s_mem = NULL;
+
+			mem_cgroup_cancel_charge(page, memcg);
+			page_cache_release(page);
+		}
+	}
+}
+EXPORT_SYMBOL(mm_hmm_migrate_back_cleanup);
+#endif
+
+
 #ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v11 07/14] HMM: mm add helper to update page table when migrating memory v2.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

To migrate memory to remote (device) memory we need to unmap a range
of anonymous memory from the CPU page table and replace the page table
entries with special HMM entries.

This is a multi-stage process: first we save and replace the page table
entries with special HMM entries, flushing the TLB in the process. If
we run into a non-allocated entry we either use the zero page or we
allocate a new page. For swapped entries we try to swap them in.

Once we have set the page table entries to the special entry we check
the page backing each address to make sure that only page table
mappings are holding a reference on the page, which means we can
safely migrate the page to device memory. Because the CPU page table
entries are special entries, no get_user_pages() can reference the
page any longer, so we are safe from races on that front. Note that
the page can still be referenced by get_user_pages() from another
process, but in that case the page is write protected and, as we drop
neither the mapcount nor the page count, we know that all users of
get_user_pages() are only doing read-only access (on write access
they would allocate a new page).
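
Concretely, the check performed later in this patch, inside the loop
that walks the saved entries, is simply:

  swapped = PageSwapCache(page);
  if (page_mapcount(page) + swapped == page_count(page))
      /* Only CPU page table mappings (plus, possibly, the swap
       * cache) reference this page, so it is safe to migrate. */
      continue;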

Once we have identified all the pages that are safe to migrate, the
first function returns and lets HMM schedule the migration with the
device driver.

Finally there is a cleanup function that drops the mapcount and
reference count on all pages that have been successfully migrated,
or restores the page table entries otherwise.
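
As an illustration only, a hypothetical driver-side sequence would look
roughly like the sketch below; my_driver_copy_to_device() and the way the
hmm_pte array is filled are assumptions, only the two mm_hmm_migrate*()
calls come from this patch:

  /* Hypothetical sketch of the expected call sequence. */
  ret = mm_hmm_migrate(mm, vma, save_pte, &backoff,
                       mirror_notifier, start, end);
  if (ret)
      return ret;

  /*
   * Copy every page flagged in save_pte into device memory, filling
   * hmm_pte for each page that was actually migrated.
   */
  my_driver_copy_to_device(save_pte, hmm_pte, start, end);

  /* Free migrated pages, restore the others and unlock the HMM entries. */
  mm_hmm_migrate_cleanup(mm, vma, save_pte, hmm_pte, start, end);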

Changed since v1:
  - Fix pmd/pte allocation when migrating.
  - Fix reverse logic on mm_forbids_zeropage()
  - Add a comment on why we add new pages to the lru list.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/mm.h |  14 ++
 mm/memory.c        | 471 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 485 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3cb884f..f478076 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2345,6 +2345,20 @@ static inline void hmm_mm_init(struct mm_struct *mm)
 	mm->hmm = NULL;
 }
 
+int mm_hmm_migrate(struct mm_struct *mm,
+		   struct vm_area_struct *vma,
+		   pte_t *save_pte,
+		   bool *backoff,
+		   const void *mmu_notifier_exclude,
+		   unsigned long start,
+		   unsigned long end);
+void mm_hmm_migrate_cleanup(struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    pte_t *save_pte,
+			    dma_addr_t *hmm_pte,
+			    unsigned long start,
+			    unsigned long end);
+
 int mm_hmm_migrate_back(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			pte_t *new_pte,
diff --git a/mm/memory.c b/mm/memory.c
index 4b90e8b..268569e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -54,6 +54,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hmm.h>
+#include <linux/hmm_pt.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
@@ -3757,6 +3758,476 @@ void mm_hmm_migrate_back_cleanup(struct mm_struct *mm,
 	}
 }
 EXPORT_SYMBOL(mm_hmm_migrate_back_cleanup);
+
+/* mm_hmm_migrate() - unmap range and set special HMM pte for it.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @save_pte: array where to save current CPU page table entry value.
+ * @backoff: Pointer toward a boolean indicating that we need to stop.
+ * @exclude: The mmu_notifier listener to exclude from mmu_notifier callback.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ * Returns: 0 on success, -EINVAL if some arguments were invalid, -ENOMEM if
+ * it failed allocating memory for performing the operation, -EFAULT if some
+ * memory backing the range is in a bad state, -EAGAIN if the backoff flag
+ * turned true.
+ *
+ * The process of memory migration is a bit involved: first we must set all
+ * CPU page table entries to the special HMM locked entry, ensuring exclusive
+ * control over the page table entries (ie no other process can change the
+ * page table but us).
+ *
+ * While doing that we must handle empty and swapped entries. For empty
+ * entries we either use the zero page or allocate a new page. For swap
+ * entries we call __handle_mm_fault() to try to fault in the page (a swap
+ * entry can be a number of things).
+ *
+ * Once we have unmapped we need to check that we can effectively migrate the
+ * page, by testing that no one is holding a reference on the page beside the
+ * reference taken by each page mapping.
+ *
+ * On success every valid entry inside save_pte array is an entry that can be
+ * migrated.
+ *
+ * Note that this function does not free any of the pages, nor does it update
+ * the various memcg counters (the exception being accounting of new
+ * allocations). That happens inside the mm_hmm_migrate_cleanup() function.
+ *
+ */
+int mm_hmm_migrate(struct mm_struct *mm,
+		   struct vm_area_struct *vma,
+		   pte_t *save_pte,
+		   bool *backoff,
+		   const void *mmu_notifier_exclude,
+		   unsigned long start,
+		   unsigned long end)
+{
+	pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry_locked());
+	struct mmu_notifier_range range = {
+		.start = start,
+		.end = end,
+		.event = MMU_MIGRATE,
+	};
+	unsigned long addr = start, i;
+	struct mmu_gather tlb;
+	int ret = 0;
+
+	/* Only allow anonymous mapping and sanity check arguments. */
+	if (vma->vm_ops || unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)))
+		return -EINVAL;
+	start &= PAGE_MASK;
+	end = PAGE_ALIGN(end);
+	if (start >= end || end > vma->vm_end)
+		return -EINVAL;
+
+	/* Only need to test on the last address of the range. */
+	if (check_stack_guard_page(vma, end) < 0)
+		return -EFAULT;
+
+	/* Try to fail early on. */
+	if (unlikely(anon_vma_prepare(vma)))
+		return -ENOMEM;
+
+retry:
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, range.start, range.end);
+	update_hiwater_rss(mm);
+	mmu_notifier_invalidate_range_start_excluding(mm, &range,
+						      mmu_notifier_exclude);
+	tlb_start_vma(&tlb, vma);
+	for (addr = range.start, i = 0; addr < end && !ret;) {
+		unsigned long cstart, next, npages = 0;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		/*
+		 * Pretty much the exact same logic as __handle_mm_fault(),
+		 * exception being the handling of huge pmd.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_alloc(mm, pgdp, addr);
+		if (!pudp) {
+			ret = -ENOMEM;
+			break;
+		}
+		pmdp = pmd_alloc(mm, pudp, addr);
+		if (!pmdp) {
+			ret = -ENOMEM;
+			break;
+		}
+		if (unlikely(pmd_trans_splitting(*pmdp))) {
+			wait_split_huge_page(vma->anon_vma, pmdp);
+			ret = -EAGAIN;
+			break;
+		}
+		if (unlikely(pmd_none(*pmdp)) &&
+		    unlikely(__pte_alloc(mm, vma, pmdp, addr))) {
+			ret = -ENOMEM;
+			break;
+		}
+		/*
+		 * If an huge pmd materialized from under us split it and break
+		 * out of the loop to retry.
+		 */
+		if (unlikely(pmd_trans_huge(*pmdp))) {
+			split_huge_page_pmd(vma, addr, pmdp);
+			ret = -EAGAIN;
+			break;
+		}
+
+		/*
+		 * A regular pmd is established and it can't morph into a huge
+		 * pmd from under us anymore at this point because we hold the
+		 * mmap_sem read mode and khugepaged takes it in write mode. So
+		 * now it's safe to run pte_offset_map().
+		 */
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (i = (addr - start) >> PAGE_SHIFT, cstart = addr,
+		     next = min((addr + PMD_SIZE) & PMD_MASK, end);
+		     addr < next; addr += PAGE_SIZE, ptep++, i++) {
+			save_pte[i] = ptep_get_and_clear(mm, addr, ptep);
+			tlb_remove_tlb_entry(&tlb, ptep, addr);
+			set_pte_at(mm, addr, ptep, hmm_entry);
+
+			if (pte_present(save_pte[i]))
+				continue;
+
+			if (!pte_none(save_pte[i])) {
+				set_pte_at(mm, addr, ptep, save_pte[i]);
+				ret = -ENOENT;
+				ptep++;
+				break;
+			}
+			/*
+			 * TODO: This mm_forbids_zeropage() really does not
+			 * apply to us. First it seems only S390 have it set,
+			 * second we are not even using the zero page entry
+			 * to populate the CPU page table, though on error
+			 * we might use the save_pte entry to set the CPU
+			 * page table entry.
+			 *
+			 * Live with that oddity for now.
+			 */
+			if (mm_forbids_zeropage(mm)) {
+				pte_clear(mm, addr, &save_pte[i]);
+				npages++;
+				continue;
+			}
+			save_pte[i] = pte_mkspecial(pfn_pte(my_zero_pfn(addr),
+						    vma->vm_page_prot));
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+
+		/*
+		 * So we must allocate pages before checking for error, which
+		 * here indicates that one entry is a swap entry. We need to
+		 * allocate first because otherwise there is no easy way to
+		 * know, on retry or in the error code path, whether the CPU
+		 * page table locked HMM entry is ours or from some other thread.
+		 */
+
+		if (!npages)
+			continue;
+
+		for (next = addr, addr = cstart,
+		     i = (addr - start) >> PAGE_SHIFT;
+		     addr < next; addr += PAGE_SIZE, i++) {
+			struct mem_cgroup *memcg;
+			struct page *page;
+
+			if (pte_present(save_pte[i]) || !pte_none(save_pte[i]))
+				continue;
+
+			page = alloc_zeroed_user_highpage_movable(vma, addr);
+			if (!page) {
+				ret = -ENOMEM;
+				break;
+			}
+			__SetPageUptodate(page);
+			if (mem_cgroup_try_charge(page, mm, GFP_KERNEL,
+						  &memcg)) {
+				page_cache_release(page);
+				ret = -ENOMEM;
+				break;
+			}
+			save_pte[i] = mk_pte(page, vma->vm_page_prot);
+			if (vma->vm_flags & VM_WRITE)
+				save_pte[i] = pte_mkwrite(save_pte[i]);
+			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			/*
+			 * Because we set the page table entry to the special
+			 * HMM locked entry we know no other process might do
+			 * anything with it and thus we can safely account the
+			 * page without holding any lock at this point.
+			 */
+			page_add_new_anon_rmap(page, vma, addr);
+			mem_cgroup_commit_charge(page, memcg, false);
+			/*
+			 * Add to active list so we know vmscan will not waste
+			 * its time with that page while we are still using it.
+			 */
+			lru_cache_add_active_or_unevictable(page, vma);
+		}
+	}
+	tlb_end_vma(&tlb, vma);
+	mmu_notifier_invalidate_range_end_excluding(mm, &range,
+						    mmu_notifier_exclude);
+	tlb_finish_mmu(&tlb, range.start, range.end);
+
+	if (backoff && *backoff) {
+		/* Stick to the range we updated. */
+		ret = -EAGAIN;
+		end = addr;
+		goto out;
+	}
+
+	/* Check if something is missing or something went wrong. */
+	if (ret == -ENOENT) {
+		int flags = FAULT_FLAG_ALLOW_RETRY;
+
+		do {
+			/*
+			 * Using __handle_mm_fault() as current->mm != mm ie we
+			 * might have been called from a kernel thread on behalf
+			 * of a driver, and all the accounting handle_mm_fault()
+			 * does is pointless in our case.
+			 */
+			ret = __handle_mm_fault(mm, vma, addr, flags);
+			flags |= FAULT_FLAG_TRIED;
+		} while ((ret & VM_FAULT_RETRY));
+		if ((ret & VM_FAULT_ERROR)) {
+			/* Stick to the range we updated. */
+			end = addr;
+			ret = -EFAULT;
+			goto out;
+		}
+		range.start = addr;
+		goto retry;
+	}
+	if (ret == -EAGAIN) {
+		range.start = addr;
+		goto retry;
+	}
+	if (ret)
+		/* Stick to the range we updated. */
+		end = addr;
+
+	/*
+	 * At this point no one else can take a reference on the page from this
+	 * process CPU page table. So we can safely check whether we can migrate
+	 * the page or not.
+	 */
+
+out:
+	for (addr = start, i = 0; addr < end;) {
+		unsigned long next;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		/*
+		 * We know for certain that we did set special swap entries for
+		 * the range and that the HMM entries are marked as locked, so
+		 * no one besides us can modify them, which implies that all
+		 * levels of the CPU page table are valid.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_offset(pgdp, addr);
+		VM_BUG_ON(!pudp);
+		pmdp = pmd_offset(pudp, addr);
+		VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+			  pmd_trans_huge(*pmdp));
+
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
+		     i = (addr - start) >> PAGE_SHIFT; addr < next;
+		     addr += PAGE_SIZE, ptep++, i++) {
+			struct page *page;
+			swp_entry_t entry;
+			int swapped;
+
+			entry = pte_to_swp_entry(save_pte[i]);
+			if (is_hmm_entry(entry)) {
+				/*
+				 * The logic here is pretty involved. If save_pte
+				 * is an HMM special swap entry then it means that
+				 * we failed to swap in that page, so error must
+				 * be set.
+				 *
+				 * If that's not the case then we are in serious
+				 * trouble.
+				 */
+				VM_BUG_ON(!ret);
+				continue;
+			}
+
+			/*
+			 * This cannot happen: no one else can replace our
+			 * special entry, and the range end is re-adjusted on
+			 * error.
+			 */
+			entry = pte_to_swp_entry(*ptep);
+			VM_BUG_ON(!is_hmm_entry_locked(entry));
+
+			/* On error or backoff restore all the saved pte. */
+			if (ret)
+				goto restore;
+
+			page = vm_normal_page(vma, addr, save_pte[i]);
+			/* The zero page is fine to migrate. */
+			if (!page)
+				continue;
+
+			/*
+			 * Check that only CPU mappings hold a reference on the
+			 * page. To make things simpler we just bail out
+			 * if page_mapcount() != page_count() (also accounting
+			 * for the swap cache).
+			 *
+			 * There is a small window here where wp_page_copy()
+			 * might have decremented mapcount but have not yet
+			 * decremented the page count. This is not an issue as
+			 * we backoff in that case.
+			 */
+			swapped = PageSwapCache(page);
+			if (page_mapcount(page) + swapped == page_count(page))
+				continue;
+
+restore:
+			/* Ok we have to restore that page. */
+			set_pte_at(mm, addr, ptep, save_pte[i]);
+			/*
+			 * No need to invalidate - it was non-present
+			 * before.
+			 */
+			update_mmu_cache(vma, addr, ptep);
+			pte_clear(mm, addr, &save_pte[i]);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(mm_hmm_migrate);
+
+/* mm_hmm_migrate_cleanup() - unmap range cleanup.
+ *
+ * @mm: The mm struct.
+ * @vma: The vm area struct the range is in.
+ * @save_pte: Array where to save current CPU page table entry value.
+ * @hmm_pte: Array of HMM table entry indicating if migration was successful.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This is called after mm_hmm_migrate() and after the effective migration. It
+ * will restore the CPU page table entries for pages that have not been migrated
+ * or in case of failure.
+ *
+ * It will free pages that have been migrated and update the appropriate
+ * counters; it will also "unlock" the special HMM pte entries.
+ */
+void mm_hmm_migrate_cleanup(struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    pte_t *save_pte,
+			    dma_addr_t *hmm_pte,
+			    unsigned long start,
+			    unsigned long end)
+{
+	pte_t hmm_entry = swp_entry_to_pte(make_hmm_entry());
+	struct page *pages[MMU_GATHER_BUNDLE];
+	unsigned long addr, c, i;
+
+	for (addr = start, i = 0; addr < end;) {
+		unsigned long next;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		/*
+		 * We know for certain that we did set special swap entries for
+		 * the range and that the HMM entries are marked as locked, so
+		 * no one besides us can modify them, which implies that all
+		 * levels of the CPU page table are valid.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_offset(pgdp, addr);
+		VM_BUG_ON(!pudp);
+		pmdp = pmd_offset(pudp, addr);
+		VM_BUG_ON(!pmdp || pmd_bad(*pmdp) || pmd_none(*pmdp) ||
+			  pmd_trans_huge(*pmdp));
+
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (next = min((addr + PMD_SIZE) & PMD_MASK, end),
+		     i = (addr - start) >> PAGE_SHIFT; addr < next;
+		     addr += PAGE_SIZE, ptep++, i++) {
+			struct page *page;
+			swp_entry_t entry;
+
+			/*
+			 * This can't happen no one else can replace our
+			 * precious special entry.
+			 */
+			entry = pte_to_swp_entry(*ptep);
+			VM_BUG_ON(!is_hmm_entry_locked(entry));
+
+			if (!hmm_pte_test_valid_dev(&hmm_pte[i])) {
+				/* Ok we have to restore that page. */
+				set_pte_at(mm, addr, ptep, save_pte[i]);
+				/*
+				 * No need to invalidate - it was non-present
+				 * before.
+				 */
+				update_mmu_cache(vma, addr, ptep);
+				pte_clear(mm, addr, &save_pte[i]);
+				continue;
+			}
+
+			/* Set unlocked entry. */
+			set_pte_at(mm, addr, ptep, hmm_entry);
+			/*
+			 * No need to invalidate - it was non-present
+			 * before.
+			 */
+			update_mmu_cache(vma, addr, ptep);
+
+			page = vm_normal_page(vma, addr, save_pte[i]);
+			/* The zero page is fine to migrate. */
+			if (!page)
+				continue;
+
+			page_remove_rmap(page);
+			dec_mm_counter_fast(mm, MM_ANONPAGES);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+	}
+
+	/* Free pages. */
+	for (addr = start, i = 0, c = 0; addr < end; i++, addr += PAGE_SIZE) {
+		if (pte_none(save_pte[i]))
+			continue;
+		if (c >= MMU_GATHER_BUNDLE) {
+			/*
+			 * TODO: What we really want to do is keep the memory
+			 * accounted inside the memory group and inside rss
+			 * while still freeing the page. So that migration
+			 * back from device memory will not fail because we
+			 * go over memory group limit.
+			 */
+			free_pages_and_swap_cache(pages, c);
+			c = 0;
+		}
+		pages[c] = vm_normal_page(vma, addr, save_pte[i]);
+		c = pages[c] ? c + 1 : c;
+	}
+}
+EXPORT_SYMBOL(mm_hmm_migrate_cleanup);
 #endif
 
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v11 08/14] HMM: new callback for copying memory from and to device memory v2.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jerome Glisse, Jatin Kumar

From: Jerome Glisse <jglisse@redhat.com>

This patch only adds the new callbacks a device driver must implement
to copy memory from and to device memory.
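
As an illustration only, a hypothetical driver might wire up the new
callbacks roughly as below; the foo_*() names and bodies are assumptions
(foo_update() is assumed to exist already, since the update() callback
predates this patch), only the callback signatures come from this patch:

  static int foo_copy_from_device(struct hmm_mirror *mirror,
                                  const struct hmm_event *event,
                                  dma_addr_t *dst,
                                  unsigned long start,
                                  unsigned long end)
  {
      /* Schedule DMA from device memory into the pages given in dst,
       * clearing the valid bit of any entry that failed to copy. */
      return 0;
  }

  static int foo_copy_to_device(struct hmm_mirror *mirror,
                                const struct hmm_event *event,
                                struct vm_area_struct *vma,
                                dma_addr_t *dst,
                                unsigned long start,
                                unsigned long end)
  {
      /* Allocate device memory, populate dst (for instance with
       * hmm_pte_from_device_pfn()) and schedule the copy. */
      return 0;
  }

  static const struct hmm_device_ops foo_hmm_ops = {
      .update           = foo_update,
      .copy_from_device = foo_copy_from_device,
      .copy_to_device   = foo_copy_to_device,
  };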

Changed since v1:
  - Pass down the vma to the copy function.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/hmm.c            |   2 +
 2 files changed, 107 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 7c66513..9fbfc07 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -65,6 +65,8 @@ enum hmm_etype {
 	HMM_DEVICE_RFAULT,
 	HMM_DEVICE_WFAULT,
 	HMM_WRITE_PROTECT,
+	HMM_COPY_FROM_DEVICE,
+	HMM_COPY_TO_DEVICE,
 };
 
 /* struct hmm_event - memory event information.
@@ -170,6 +172,109 @@ struct hmm_device_ops {
 	 */
 	int (*update)(struct hmm_mirror *mirror,
 		      struct hmm_event *event);
+
+	/* copy_from_device() - copy from device memory to system memory.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @event: The event that triggered the copy.
+	 * @dst: Array containing hmm_pte of destination memory.
+	 * @start: Start address of the range (sub-range of event) to copy.
+	 * @end: End address of the range (sub-range of event) to copy.
+	 * Returns: 0 on success, error code otherwise {-ENOMEM, -EIO}.
+	 *
+	 * Called when migrating memory from device memory to system memory.
+	 * The dst array contains valid DMA address for the device of the page
+	 * to copy to (or pfn of page if hmm_device.device == NULL).
+	 *
+	 * If event.etype == HMM_FORK then device driver only need to schedule
+	 * a copy to the system pages given in the dst hmm_pte array. Do not
+	 * update the device page, and do not pause/stop the device threads
+	 * that are using this address space. Just copy memory.
+	 *
+	 * If event.type == HMM_COPY_FROM_DEVICE then device driver must first
+	 * write protect the range then schedule the copy, then update its page
+	 * table to use the new system memory given the dst array. Some device
+	 * can perform all this in an atomic fashion from device point of view.
+	 * The device driver must also free the device memory once the copy is
+	 * done.
+	 *
+	 * The device driver must not fail lightly: any failure results in the
+	 * device process being killed and CPU page table entries set to HWPOISON.
+	 *
+	 * Note that device driver must clear the valid bit of the dst entry it
+	 * failed to copy.
+	 *
+	 * On failure the mirror will be killed by HMM, which will do an
+	 * HMM_MUNMAP invalidation of all the memory; when this happens the
+	 * device driver can free the device memory.
+	 *
+	 * Note also that there can be holes in the range being copied, ie some
+	 * entries of the dst array will not have the valid bit set; the device
+	 * driver must simply ignore non-valid entries.
+	 *
+	 * Finally the device driver must set the dirty bit for each page that
+	 * was modified since it was copied into device memory. This must be
+	 * conservative, ie if the device cannot determine that with certainty
+	 * then it must set the dirty bit unconditionally.
+	 *
+	 * Return 0 on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*copy_from_device)(struct hmm_mirror *mirror,
+				const struct hmm_event *event,
+				dma_addr_t *dst,
+				unsigned long start,
+				unsigned long end);
+
+	/* copy_to_device() - copy to device memory from system memory.
+	 *
+	 * @mirror: The mirror linking the process address space to the device.
+	 * @event: The event that triggered the copy.
+	 * @vma: The vma corresponding to the range.
+	 * @dst: Array containing the hmm_pte of the destination memory.
+	 * @start: Start address of the range (sub-range of event) to copy.
+	 * @end: End address of the range (sub-range of event) to copy.
+	 * Returns: 0 on success, error code otherwise {-ENOMEM, -EIO}.
+	 *
+	 * Called when migrating memory from system memory to device memory.
+	 * The dst array starts out empty (all of its entries equal to zero).
+	 * The device driver must allocate the device memory and populate each
+	 * entry using hmm_pte_from_device_pfn(); only the valid device bit and
+	 * hardware specific bits will be preserved (write and dirty are taken
+	 * from the original entry inside the mirror page table). It is advised
+	 * to set the device pfn to match the physical address of the device
+	 * memory being used. The event.etype will be HMM_COPY_TO_DEVICE.
+	 *
+	 * A device driver that can atomically copy a page and update its page
+	 * table entry to point to the device memory may do so. Partial failure
+	 * is allowed: entries that have not been migrated must have the
+	 * HMM_PTE_VALID_DEV bit clear inside the dst array. HMM will update
+	 * the CPU page table of failed entries to point back to the system
+	 * pages.
+	 *
+	 * Note that the device driver is responsible for allocating and
+	 * freeing the device memory and for properly updating the dst array
+	 * entries with the allocated device memory.
+	 *
+	 * Return 0 on success, error value otherwise:
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * Any other return value triggers a warning and is transformed to
+	 * -EIO. An error means that the migration is aborted, so in case of
+	 * partial failure the device driver must return 0 if it does not want
+	 * to fully abort. The device driver may update its page table only if
+	 * it knows it will not return failure.
+	 */
+	int (*copy_to_device)(struct hmm_mirror *mirror,
+			      const struct hmm_event *event,
+			      struct vm_area_struct *vma,
+			      dma_addr_t *dst,
+			      unsigned long start,
+			      unsigned long end);
 };
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 6224131..ebde5a8 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -78,6 +78,8 @@ static inline int hmm_event_init(struct hmm_event *event,
 	switch (etype) {
 	case HMM_DEVICE_RFAULT:
 	case HMM_DEVICE_WFAULT:
+	case HMM_COPY_TO_DEVICE:
+	case HMM_COPY_FROM_DEVICE:
 		break;
 	case HMM_FORK:
 	case HMM_WRITE_PROTECT:
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread
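
To make the callback contract above concrete, here is a minimal sketch
of a driver-side copy_from_device() implementation. It is only an
illustration: struct my_device, my_device_from_mirror() and
my_device_copy_to_ram() are hypothetical driver helpers that are not
part of this series, error handling is reduced to returning -EIO, and
the sketch assumes hmm_device.dev != NULL so every valid dst entry
carries the DMA address of the destination system page.

#include <linux/hmm.h>
#include <linux/hmm_pt.h>

static int my_copy_from_device(struct hmm_mirror *mirror,
			       const struct hmm_event *event,
			       dma_addr_t *dst,
			       unsigned long start,
			       unsigned long end)
{
	/* my_device_from_mirror() is a hypothetical container_of() helper. */
	struct my_device *mdev = my_device_from_mirror(mirror);
	unsigned long addr, i;

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
		/* Holes in the range show up as non valid entries. */
		if (!hmm_pte_test_valid_dma(&dst[i]))
			continue;

		/*
		 * Hypothetical helper: program the device DMA engine to copy
		 * the device page backing addr into the system page at the
		 * given DMA address and wait for completion.
		 */
		if (my_device_copy_to_ram(mdev, addr, hmm_pte_dma_addr(dst[i])))
			return -EIO;

		/* Be conservative about dirtiness when in doubt. */
		hmm_pte_set_dirty(&dst[i]);
	}

	/*
	 * For HMM_COPY_FROM_DEVICE the driver would also write protect the
	 * range in its own page table, switch it to the new system pages and
	 * free the device memory; for HMM_FORK it must leave its page table
	 * and device memory untouched.
	 */
	return 0;
}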

* [PATCH v11 09/14] HMM: allow to get pointer to spinlock protecting a directory.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

There are several use cases for getting a pointer to the spinlock
protecting a directory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/hmm_pt.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index f745d6c..22100a6 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -255,6 +255,16 @@ static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
 		spin_lock(&pt->lock);
 }
 
+static inline spinlock_t *hmm_pt_directory_lock_ptr(struct hmm_pt *pt,
+						    struct page *ptd,
+						    unsigned level)
+{
+	if (level)
+		return &ptd->ptl;
+	else
+		return &pt->lock;
+}
+
 static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
 					   struct page *ptd,
 					   unsigned level)
@@ -272,6 +282,13 @@ static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
 	spin_lock(&pt->lock);
 }
 
+static inline spinlock_t *hmm_pt_directory_lock_ptr(struct hmm_pt *pt,
+						    struct page *ptd,
+						    unsigned level)
+{
+	return &pt->lock;
+}
+
 static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
 					   struct page *ptd,
 					   unsigned level)
@@ -358,6 +375,14 @@ static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter)
 	hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
 }
 
+static inline spinlock_t *hmm_pt_iter_directory_lock_ptr(struct hmm_pt_iter *i)
+{
+	struct hmm_pt *pt = i->pt;
+
+	return hmm_pt_directory_lock_ptr(pt, i->ptd[pt->llevel - 1],
+					 pt->llevel);
+}
+
 static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter)
 {
 	struct hmm_pt *pt = iter->pt;
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread
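
The point of exposing the lock pointer, and not just the lock/unlock
helpers, is that a caller can hand the lock to a function that decides
at run time whether any locking is needed. A small hypothetical
illustration (my_mark_range_dirty() is an invented helper, not part of
this series):

static void my_mark_range_dirty(dma_addr_t *hmm_pte, spinlock_t *lock,
				unsigned long npages)
{
	unsigned long i;

	/* lock may be NULL when the array is private to the caller. */
	if (lock)
		spin_lock(lock);
	for (i = 0; i < npages; i++) {
		if (hmm_pte_test_valid_dma(&hmm_pte[i]))
			hmm_pte_set_dirty(&hmm_pte[i]);
	}
	if (lock)
		spin_unlock(lock);
}

A caller iterating the live mirror page table would pass
hmm_pt_iter_directory_lock_ptr(&iter), while a caller working on an
array nobody else can see passes NULL. This is exactly the pattern the
DMA mapping split in the next patch relies on.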

* [PATCH v11 10/14] HMM: split DMA mapping function in two.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

To be able to reuse the DMA mapping logic, split it into two functions.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 120 ++++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 65 insertions(+), 55 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index ebde5a8..01eda36 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -906,76 +906,86 @@ static int hmm_mirror_fault_hugetlb_entry(pte_t *ptep,
 	return 0;
 }
 
+static int hmm_mirror_dma_map_range(struct hmm_mirror *mirror,
+				    dma_addr_t *hmm_pte,
+				    spinlock_t *lock,
+				    unsigned long npages)
+{
+	struct device *dev = mirror->device->dev;
+	unsigned long i;
+	int ret = 0;
+
+	for (i = 0; i < npages; i++) {
+		dma_addr_t dma_addr, pte;
+		struct page *page;
+
+again:
+		pte = ACCESS_ONCE(hmm_pte[i]);
+		if (!hmm_pte_test_valid_pfn(&pte) || !hmm_pte_test_select(&pte))
+			continue;
+
+		page = pfn_to_page(hmm_pte_pfn(pte));
+		VM_BUG_ON(!page);
+		dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+					DMA_BIDIRECTIONAL);
+		if (dma_mapping_error(dev, dma_addr)) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		/*
+		 * Make sure we transfer the dirty bit. Note that there
+		 * might still be a window for another thread to set
+		 * the dirty bit before we check for pte equality. This
+		 * will just lead to a useless retry so it is not the
+		 * end of the world here.
+		 */
+		if (lock)
+			spin_lock(lock);
+		if (hmm_pte_test_dirty(&hmm_pte[i]))
+			hmm_pte_set_dirty(&pte);
+		if (ACCESS_ONCE(hmm_pte[i]) != pte) {
+			if (lock)
+				spin_unlock(lock);
+			dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+				       DMA_BIDIRECTIONAL);
+			if (hmm_pte_test_valid_pfn(&hmm_pte[i]))
+				goto again;
+			continue;
+		}
+		hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
+		if (hmm_pte_test_write(&pte))
+			hmm_pte_set_write(&hmm_pte[i]);
+		if (hmm_pte_test_dirty(&pte))
+			hmm_pte_set_dirty(&hmm_pte[i]);
+		if (lock)
+			spin_unlock(lock);
+	}
+
+	return ret;
+}
+
 static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
 			      struct hmm_pt_iter *iter,
 			      unsigned long start,
 			      unsigned long end)
 {
-	struct device *dev = mirror->device->dev;
 	unsigned long addr;
 	int ret;
 
 	for (ret = 0, addr = start; !ret && addr < end;) {
-		unsigned long i = 0, next = end;
+		unsigned long next = end, npages;
 		dma_addr_t *hmm_pte;
+		spinlock_t *lock;
 
 		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
 		if (!hmm_pte)
 			return -ENOENT;
 
-		do {
-			dma_addr_t dma_addr, pte;
-			struct page *page;
-
-again:
-			pte = ACCESS_ONCE(hmm_pte[i]);
-			if (!hmm_pte_test_valid_pfn(&pte) ||
-			    !hmm_pte_test_select(&pte)) {
-				if (!hmm_pte_test_valid_dma(&pte)) {
-					ret = -ENOENT;
-					break;
-				}
-				continue;
-			}
-
-			page = pfn_to_page(hmm_pte_pfn(pte));
-			VM_BUG_ON(!page);
-			dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
-						DMA_BIDIRECTIONAL);
-			if (dma_mapping_error(dev, dma_addr)) {
-				ret = -ENOMEM;
-				break;
-			}
-
-			hmm_pt_iter_directory_lock(iter);
-			/*
-			 * Make sure we transfer the dirty bit. Note that there
-			 * might still be a window for another thread to set
-			 * the dirty bit before we check for pte equality. This
-			 * will just lead to a useless retry so it is not the
-			 * end of the world here.
-			 */
-			if (hmm_pte_test_dirty(&hmm_pte[i]))
-				hmm_pte_set_dirty(&pte);
-			if (ACCESS_ONCE(hmm_pte[i]) != pte) {
-				hmm_pt_iter_directory_unlock(iter);
-				dma_unmap_page(dev, dma_addr, PAGE_SIZE,
-					       DMA_BIDIRECTIONAL);
-				if (hmm_pte_test_valid_pfn(&pte))
-					goto again;
-				if (!hmm_pte_test_valid_dma(&pte)) {
-					ret = -ENOENT;
-					break;
-				}
-			} else {
-				hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
-				if (hmm_pte_test_write(&pte))
-					hmm_pte_set_write(&hmm_pte[i]);
-				if (hmm_pte_test_dirty(&pte))
-					hmm_pte_set_dirty(&hmm_pte[i]);
-				hmm_pt_iter_directory_unlock(iter);
-			}
-		} while (addr += PAGE_SIZE, i++, addr != next && !ret);
+		npages = (next - addr) >> PAGE_SHIFT;
+		lock = hmm_pt_iter_directory_lock_ptr(iter);
+		ret = hmm_mirror_dma_map_range(mirror, hmm_pte, lock, npages);
+		addr = next;
 	}
 
 	return ret;
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread
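
The value of the split is that locking is now a parameter: the same
helper serves the fault path, where the mirror page table is shared and
the directory lock must be taken, and the migration path added later in
the series, where the destination array is private to the caller. A
condensed view of the two call patterns (variable names follow the code
above; this is an illustration, not a literal quote of the call sites):

	/* Fault path: hmm_pte[] is the live mirror page table. */
	lock = hmm_pt_iter_directory_lock_ptr(iter);
	ret = hmm_mirror_dma_map_range(mirror, hmm_pte, lock, npages);

	/* Migration path: dst[] is private, so no lock is needed. */
	ret = hmm_mirror_dma_map_range(mirror, dst, NULL, npages);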

* [PATCH v11 11/14] HMM: add helpers for migration back to system memory v3.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

This patch adds all the necessary functions and helpers for migration
from device memory back to system memory. There are 3 different cases
that use this code:
  - CPU page fault
  - fork
  - device driver request

Note that this patch uses regular memory accounting, which means that
migration can fail as a result of memory cgroup resource exhaustion.
Later patches will modify memcg so that remote memory can stay
accounted as regular memory, thus removing this point of failure.

Changed since v1:
  - Fixed logic in dma unmap code path on migration error.

Changed since v2:
  - Adapt to HMM page table changes.
  - Fix bug in migration failure code path.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 mm/hmm.c | 151 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 151 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index 01eda36..abe2fba 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -47,6 +47,12 @@
 
 static struct mmu_notifier_ops hmm_notifier_ops;
 static void hmm_mirror_kill(struct hmm_mirror *mirror);
+static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   pte_t *new_pte,
+				   dma_addr_t *dst,
+				   unsigned long start,
+				   unsigned long end);
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
 				    struct hmm_event *event,
 				    struct page *page);
@@ -418,6 +424,46 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
 };
 
 
+static int hmm_migrate_back(struct hmm *hmm,
+			    struct hmm_event *event,
+			    struct mm_struct *mm,
+			    struct vm_area_struct *vma,
+			    pte_t *new_pte,
+			    dma_addr_t *dst,
+			    unsigned long start,
+			    unsigned long end)
+{
+	struct hmm_mirror *mirror;
+	int r, ret;
+
+	/*
+	 * Do not return right away on error, as there might be valid pages we
+	 * can still migrate.
+	 */
+	ret = mm_hmm_migrate_back(mm, vma, new_pte, start, end);
+
+again:
+	down_read(&hmm->rwsem);
+	hlist_for_each_entry(mirror, &hmm->mirrors, mlist) {
+		r = hmm_mirror_migrate_back(mirror, event, new_pte,
+					    dst, start, end);
+		if (r) {
+			ret = ret ? ret : r;
+			mirror = hmm_mirror_ref(mirror);
+			BUG_ON(!mirror);
+			up_read(&hmm->rwsem);
+			hmm_mirror_kill(mirror);
+			hmm_mirror_unref(&mirror);
+			goto again;
+		}
+	}
+	up_read(&hmm->rwsem);
+
+	mm_hmm_migrate_back_cleanup(mm, vma, new_pte, dst, start, end);
+
+	return ret;
+}
+
 int hmm_handle_cpu_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			pmd_t *pmdp, unsigned long addr,
@@ -1149,6 +1195,111 @@ out:
 }
 EXPORT_SYMBOL(hmm_mirror_fault);
 
+static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   pte_t *new_pte,
+				   dma_addr_t *dst,
+				   unsigned long start,
+				   unsigned long end)
+{
+	unsigned long addr, i, npages = (end - start) >> PAGE_SHIFT;
+	struct hmm_device *device = mirror->device;
+	struct device *dev = mirror->device->dev;
+	struct hmm_pt_iter iter;
+	int r, ret = 0;
+
+	hmm_pt_iter_init(&iter, &mirror->pt);
+	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, ++i) {
+		unsigned long next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte_clear_select(&dst[i]);
+
+		if (!pte_present(new_pte[i]))
+			continue;
+		hmm_pte = hmm_pt_iter_lookup(&iter, addr, &next);
+		if (!hmm_pte)
+			continue;
+
+		if (!hmm_pte_test_valid_dev(hmm_pte))
+			continue;
+
+		dst[i] = hmm_pte_from_pfn(pte_pfn(new_pte[i]));
+		hmm_pte_set_select(&dst[i]);
+		hmm_pte_set_write(&dst[i]);
+	}
+
+	if (dev) {
+		ret = hmm_mirror_dma_map_range(mirror, dst, NULL, npages);
+		if (ret) {
+			for (i = 0; i < npages; ++i) {
+				if (!hmm_pte_test_select(&dst[i]))
+					continue;
+				if (hmm_pte_test_valid_dma(&dst[i]))
+					continue;
+				dst[i] = 0;
+			}
+		}
+	}
+
+	r = device->ops->copy_from_device(mirror, event, dst, start, end);
+
+	/* Update the mirror page table with successfully migrated entries. */
+	for (addr = start; addr < end;) {
+		unsigned long idx, next = end, npages;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+		if (!hmm_pte)
+			continue;
+		idx = (addr - event->start) >> PAGE_SHIFT;
+		npages = (next - addr) >> PAGE_SHIFT;
+		hmm_pt_iter_directory_lock(&iter);
+		for (i = 0; i < npages; i++, idx++) {
+			if (!hmm_pte_test_valid_pfn(&dst[idx]) &&
+			    !hmm_pte_test_valid_dma(&dst[idx])) {
+				if (hmm_pte_test_valid_dev(&hmm_pte[i])) {
+					hmm_pte[i] = 0;
+					hmm_pt_iter_directory_unref(&iter);
+				}
+				continue;
+			}
+
+			VM_BUG_ON(!hmm_pte_test_select(&dst[idx]));
+			VM_BUG_ON(!hmm_pte_test_valid_dev(&hmm_pte[i]));
+			hmm_pte[i] = dst[idx];
+		}
+		hmm_pt_iter_directory_unlock(&iter);
+
+		/* DMA unmap the entries that failed to migrate. */
+		if (dev) {
+			idx = (addr - event->start) >> PAGE_SHIFT;
+			for (i = 0; i < npages; i++, idx++) {
+				dma_addr_t dma_addr;
+
+				/*
+				 * Failed entries have the valid bit clear
+				 * but the select bit still set.
+				 */
+				if (!hmm_pte_test_select(&dst[idx]) ||
+				    hmm_pte_test_valid_dma(&dst[idx]))
+					continue;
+
+				hmm_pte_set_valid_dma(&dst[idx]);
+				dma_addr = hmm_pte_dma_addr(dst[idx]);
+				dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+					       DMA_BIDIRECTIONAL);
+				dst[idx] = 0;
+			}
+		}
+
+		addr = next;
+	}
+	hmm_pt_iter_fini(&iter);
+
+	return ret ? ret : r;
+}
+
 /* hmm_mirror_range_discard() - discard a range of address.
  *
  * @mirror: The mirror struct.
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread
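
The bookkeeping in hmm_mirror_migrate_back() is driven by two bits in
each dst[] entry. Spelled out as a condensed, hypothetical check (the
commit_to_mirror() and keep_device_entry() names are placeholders for
the inline code above, not real functions), the per page outcome after
->copy_from_device() returns is:

	/*
	 * Per page status encoded in dst[i]:
	 *   select clear            -> never a migration candidate (hole)
	 *   select set, valid set   -> copied back, commit into mirror table
	 *   select set, valid clear -> copy failed, keep the device entry
	 */
	for (i = 0; i < npages; i++) {
		if (!hmm_pte_test_select(&dst[i]))
			continue;
		if (hmm_pte_test_valid_dma(&dst[i]) ||
		    hmm_pte_test_valid_pfn(&dst[i]))
			commit_to_mirror(i);	/* placeholder */
		else
			keep_device_entry(i);	/* placeholder */
	}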

* [PATCH v11 12/14] HMM: fork copy migrated memory into system memory for child process.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

When forking, if the process being forked had any memory migrated to
device memory, we need to make a system copy of it for the child
process. Later patches can revisit this and use the same COW semantics
for device memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index abe2fba..b473fe9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -480,7 +480,37 @@ int hmm_mm_fork(struct mm_struct *src_mm,
 		unsigned long start,
 		unsigned long end)
 {
-	return -ENOMEM;
+	unsigned long npages = (end - start) >> PAGE_SHIFT;
+	struct hmm_event event;
+	dma_addr_t *dst;
+	struct hmm *hmm;
+	pte_t *new_pte;
+	int ret;
+
+	hmm = hmm_ref(src_mm->hmm);
+	if (!hmm)
+		return -EINVAL;
+
+
+	dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
+	if (!dst) {
+		hmm_unref(hmm);
+		return -ENOMEM;
+	}
+	new_pte = kcalloc(npages, sizeof(*new_pte), GFP_KERNEL);
+	if (!new_pte) {
+		kfree(dst);
+		hmm_unref(hmm);
+		return -ENOMEM;
+	}
+
+	hmm_event_init(&event, hmm, start, end, HMM_FORK);
+	ret = hmm_migrate_back(hmm, &event, dst_mm, dst_vma, new_pte,
+			       dst, start, end);
+	hmm_unref(hmm);
+	kfree(new_pte);
+	kfree(dst);
+	return ret;
 }
 EXPORT_SYMBOL(hmm_mm_fork);
 
@@ -656,6 +686,12 @@ static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
 	}
 
 	if (hmm_pte_test_valid_dev(hmm_pte)) {
+		/*
+		 * On fork the device memory is duplicated, so there is no
+		 * need to write protect it.
+		 */
+		if (event->etype == HMM_FORK)
+			return;
 		*hmm_pte &= event->pte_mask;
 		if (!hmm_pte_test_valid_dev(hmm_pte))
 			hmm_pt_iter_directory_unref(iter);
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread
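
From the driver's point of view the fork case is the light variant of
copy_from_device(): only a copy is requested, and the device keeps both
its pages and its page table entries. A minimal hypothetical sketch of
how a driver callback might branch on the event type (all my_* helpers
are invented, not part of this series):

	/* Inside a driver's copy_from_device() callback. */
	switch (event->etype) {
	case HMM_FORK:
		/* Child gets a system copy; device mapping stays as is. */
		ret = my_device_copy_range_to_ram(mdev, dst, start, end);
		break;
	case HMM_COPY_FROM_DEVICE:
		/* Write protect, copy, switch device ptes, free device pages. */
		my_device_wprotect_range(mdev, start, end);
		ret = my_device_copy_range_to_ram(mdev, dst, start, end);
		if (!ret)
			my_device_release_range(mdev, start, end);
		break;
	default:
		ret = -EIO;
		break;
	}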

* [PATCH v11 13/14] HMM: CPU page fault on migrated memory.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

When the CPU tries to access memory that has been migrated to device
memory, we have to copy it back to system memory. This patch implements
the CPU page fault handler for the special HMM pte swap entry.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index b473fe9..efffb8d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -469,7 +469,59 @@ int hmm_handle_cpu_fault(struct mm_struct *mm,
 			pmd_t *pmdp, unsigned long addr,
 			unsigned flags, pte_t orig_pte)
 {
-	return VM_FAULT_SIGBUS;
+	unsigned long start, end;
+	struct hmm_event event;
+	swp_entry_t entry;
+	struct hmm *hmm;
+	dma_addr_t dst;
+	pte_t new_pte;
+	int ret;
+
+	/* First check for poisonous entry. */
+	entry = pte_to_swp_entry(orig_pte);
+	if (is_hmm_entry_poisonous(entry))
+		return VM_FAULT_SIGBUS;
+
+	hmm = hmm_ref(mm->hmm);
+	if (!hmm) {
+		pte_t poison = swp_entry_to_pte(make_hmm_entry_poisonous());
+		spinlock_t *ptl;
+		pte_t *ptep;
+
+		/* Check if cpu pte is already updated. */
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		if (!pte_same(*ptep, orig_pte)) {
+			pte_unmap_unlock(ptep, ptl);
+			return 0;
+		}
+		set_pte_at(mm, addr, ptep, poison);
+		pte_unmap_unlock(ptep, ptl);
+		return VM_FAULT_SIGBUS;
+	}
+
+	/*
+	 * TODO we likely want to migrate more than one page at a time; we need
+	 * to call into the device driver to get a good hint on the range to
+	 * copy back to system memory.
+	 *
+	 * For now just live with the one page at a time solution.
+	 */
+	start = addr & PAGE_MASK;
+	end = start + PAGE_SIZE;
+	hmm_event_init(&event, hmm, start, end, HMM_COPY_FROM_DEVICE);
+
+	ret = hmm_migrate_back(hmm, &event, mm, vma, &new_pte,
+			       &dst, start, end);
+	hmm_unref(hmm);
+	switch (ret) {
+	case 0:
+		return VM_FAULT_MAJOR;
+	case -ENOMEM:
+		return VM_FAULT_OOM;
+	case -EINVAL:
+	default:
+		return VM_FAULT_SIGBUS;
+	}
 }
 EXPORT_SYMBOL(hmm_handle_cpu_fault);
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread
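
For context, the handler above is only reached once the core mm decodes
the special HMM swap entry that migration installed in the CPU page
table; that hook is added by an earlier patch of this series and is not
shown here. A rough, hypothetical sketch of the dispatch (the
is_hmm_entry() predicate name is assumed here, the real hook may
differ):

	/* Somewhere in the pte fault path. */
	swp_entry_t entry = pte_to_swp_entry(orig_pte);

	if (is_hmm_entry(entry))
		return hmm_handle_cpu_fault(mm, vma, pmdp, address,
					    flags, orig_pte);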

* [PATCH v11 14/14] HMM: add mirror fault support for system to device memory migration v3.
  2015-10-21 21:10 ` Jérôme Glisse
@ 2015-10-21 21:10   ` Jérôme Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:10 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

Migration to device memory is done as a special kind of device mirror
fault. Memory migration is initiated by the device driver, never by
HMM (unless it is a migration back to system memory).

Changed since v1:
  - Adapt to HMM page table changes.

Changed since v2:
  - Fix error code path for migration, calling mm_hmm_migrate_cleanup()
    is wrong.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 mm/hmm.c | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 170 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index efffb8d..e3fb586 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -53,6 +53,10 @@ static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
 				   dma_addr_t *dst,
 				   unsigned long start,
 				   unsigned long end);
+static int hmm_mirror_migrate(struct hmm_mirror *mirror,
+			      struct hmm_event *event,
+			      struct vm_area_struct *vma,
+			      struct hmm_pt_iter *iter);
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
 				    struct hmm_event *event,
 				    struct page *page);
@@ -101,6 +105,12 @@ static inline int hmm_event_init(struct hmm_event *event,
 	return 0;
 }
 
+static inline unsigned long hmm_event_npages(const struct hmm_event *event)
+{
+	return (PAGE_ALIGN(event->end) - (event->start & PAGE_MASK)) >>
+	       PAGE_SHIFT;
+}
+
 
 /* hmm - core HMM functions.
  *
@@ -1251,6 +1261,9 @@ retry:
 	}
 
 	switch (event->etype) {
+	case HMM_COPY_TO_DEVICE:
+		ret = hmm_mirror_migrate(mirror, event, vma, &iter);
+		break;
 	case HMM_DEVICE_WFAULT:
 		if (!(vma->vm_flags & VM_WRITE)) {
 			ret = -EFAULT;
@@ -1388,6 +1401,163 @@ static int hmm_mirror_migrate_back(struct hmm_mirror *mirror,
 	return ret ? ret : r;
 }
 
+static int hmm_mirror_migrate(struct hmm_mirror *mirror,
+			      struct hmm_event *event,
+			      struct vm_area_struct *vma,
+			      struct hmm_pt_iter *iter)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm *hmm = mirror->hmm;
+	struct hmm_event invalidate;
+	unsigned long addr, npages;
+	struct hmm_mirror *tmp;
+	dma_addr_t *dst;
+	pte_t *save_pte;
+	int r = 0, ret;
+
+	/* Only allow migration of private anonymous memory. */
+	if (vma->vm_ops || unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)))
+		return -EINVAL;
+
+	/*
+	 * TODO A more advanced loop to split the migration into several
+	 * chunks. For now limit the amount that can be migrated in one shot.
+	 * We also need to see whether rescheduling is needed if this happens
+	 * as part of a system call into the device driver.
+	 */
+	npages = hmm_event_npages(event);
+	if (npages * max(sizeof(*dst), sizeof(*save_pte)) > PAGE_SIZE)
+		return -EINVAL;
+	dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
+	if (dst == NULL)
+		return -ENOMEM;
+	save_pte = kcalloc(npages, sizeof(*save_pte), GFP_KERNEL);
+	if (save_pte == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = mm_hmm_migrate(hmm->mm, vma, save_pte, &event->backoff,
+			     &hmm->mmu_notifier, event->start, event->end);
+	if (ret == -EAGAIN)
+		goto out;
+	if (ret)
+		goto out;
+
+	/*
+	 * Now invalidate all the other devices; note that they can not race
+	 * with us as the CPU page table is full of special entries.
+	 */
+	hmm_event_init(&invalidate, mirror->hmm, event->start,
+		       event->end, HMM_MIGRATE);
+again:
+	down_read(&hmm->rwsem);
+	hlist_for_each_entry(tmp, &hmm->mirrors, mlist) {
+		if (tmp == mirror)
+			continue;
+		if (hmm_mirror_update(tmp, &invalidate, NULL)) {
+			hmm_mirror_ref(tmp);
+			up_read(&hmm->rwsem);
+			hmm_mirror_kill(tmp);
+			hmm_mirror_unref(&tmp);
+			goto again;
+		}
+	}
+	up_read(&hmm->rwsem);
+
+	/*
+	 * Populate the mirror page table with the saved entries and also mark
+	 * the entries that can be migrated.
+	 */
+	for (addr = event->start; addr < event->end;) {
+		unsigned long i, idx, next = event->end, npages;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte) {
+			ret = -ENOMEM;
+			goto out_cleanup;
+		}
+
+		npages = (next - addr) >> PAGE_SHIFT;
+		idx = (addr - event->start) >> PAGE_SHIFT;
+		hmm_pt_iter_directory_lock(iter);
+		for (i = 0; i < npages; i++, idx++) {
+			hmm_pte_clear_select(&hmm_pte[i]);
+			if (!pte_present(save_pte[idx]))
+				continue;
+			hmm_pte_set_select(&hmm_pte[i]);
+			/* This can not be a valid device entry here. */
+			VM_BUG_ON(hmm_pte_test_valid_dev(&hmm_pte[i]));
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
+			if (hmm_pte_test_valid_pfn(&hmm_pte[i]))
+				continue;
+
+			hmm_pt_iter_directory_ref(iter);
+			hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(save_pte[idx]));
+			if (pte_write(save_pte[idx]))
+				hmm_pte_set_write(&hmm_pte[i]);
+			hmm_pte_set_select(&hmm_pte[i]);
+		}
+		hmm_pt_iter_directory_unlock(iter);
+
+		if (device->dev) {
+			spinlock_t *lock;
+
+			lock = hmm_pt_iter_directory_lock_ptr(iter);
+			ret = hmm_mirror_dma_map_range(mirror, hmm_pte,
+						       lock, npages);
+			/* Keep going only for entry that have been mapped. */
+			if (ret) {
+				for (i = 0; i < npages; ++i) {
+					if (!hmm_pte_test_select(&hmm_pte[i]))
+						continue;
+					if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+						continue;
+					hmm_pte_clear_select(&hmm_pte[i]);
+				}
+			}
+		}
+		addr = next;
+	}
+
+	/* Now we can do the actual copy to device memory. */
+	r = device->ops->copy_to_device(mirror, event, vma, dst,
+					event->start, event->end);
+
+	/* Update the mirror page table with successfully migrated entries. */
+	for (addr = event->start; addr < event->end;) {
+		unsigned long i, idx, next = event->end, npages;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_walk(iter, &addr, &next);
+		if (!hmm_pte)
+			continue;
+		npages = (next - addr) >> PAGE_SHIFT;
+		idx = (addr - event->start) >> PAGE_SHIFT;
+		hmm_pt_iter_directory_lock(iter);
+		for (i = 0; i < npages; i++, idx++) {
+			if (!hmm_pte_test_valid_dev(&dst[idx]))
+				continue;
+
+			VM_BUG_ON(!hmm_pte_test_select(&hmm_pte[i]));
+			hmm_pte[i] = dst[idx];
+		}
+		hmm_pt_iter_directory_unlock(iter);
+		addr = next;
+	}
+
+out_cleanup:
+	mm_hmm_migrate_cleanup(hmm->mm, vma, save_pte, dst,
+			       event->start, event->end);
+out:
+	kfree(save_pte);
+	kfree(dst);
+	return ret ? ret : r;
+}
+
 /* hmm_mirror_range_discard() - discard a range of address.
  *
  * @mirror: The mirror struct.
-- 
2.4.3
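
Two practical notes on hmm_mirror_migrate() as posted above.

First, the single-shot size cap: both the dst[] and save_pte[] arrays must
fit in one page, so the largest range that can be migrated per call is
PAGE_SIZE / max(sizeof(dma_addr_t), sizeof(pte_t)) pages. Assuming a typical
x86-64 configuration (4KB pages, 8-byte dma_addr_t and pte_t), that is
4096 / 8 = 512 pages, i.e. 2MB per call; larger requests fail with -EINVAL
until the chunked migration mentioned in the TODO is implemented.

Second, a worked example of the two-pass mirror page table update. Take a
4-page range of ordinary anonymous pages, so the first loop sets the select
bit on all four mirror entries, and suppose the driver's copy_to_device()
callback fills dst[0] and dst[2] with valid device entries while leaving
dst[1] and dst[3] at zero. The second loop then writes the device entries
into mirror slots 0 and 2 only; slots 1 and 3 keep their system memory
entries, and mm_hmm_migrate_cleanup(), which is handed both save_pte[] and
dst[], finishes the corresponding CPU page table update for the migrated
and the non-migrated pages.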


^ permalink raw reply related	[flat|nested] 30+ messages in thread
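
For driver writers following along: the function above is reached through the
device mirror fault path when the event type is HMM_COPY_TO_DEVICE, so the
driver, not HMM, decides what gets migrated. Below is a minimal sketch of
what initiating such a migration might look like. Treat it as an illustration
only: calling hmm_mirror_fault() with a driver-built struct hmm_event is an
assumption based on how the rest of this series is structured, and the foo_
name is a placeholder.

static int foo_migrate_range_to_device(struct hmm_mirror *mirror,
				       unsigned long start,
				       unsigned long end)
{
	struct hmm_event event;

	/* Page align the range; hmm_mirror_migrate() works per page. */
	memset(&event, 0, sizeof(event));
	event.etype = HMM_COPY_TO_DEVICE;
	event.start = start & PAGE_MASK;
	event.end = PAGE_ALIGN(end);

	/*
	 * Assumed entry point: the same mirror fault path that handles
	 * HMM_DEVICE_WFAULT dispatches HMM_COPY_TO_DEVICE to
	 * hmm_mirror_migrate() in the hunk above.
	 */
	return hmm_mirror_fault(mirror, &event);
}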


end of thread, other threads:[~2015-10-21 20:21 UTC | newest]

Thread overview: 30+ messages
2015-10-21 21:10 [PATCH v11 00/14] HMM anonymous memory migration to device memory Jérôme Glisse
2015-10-21 21:10 ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 01/14] fork: pass the dst vma to copy_page_range() and its sub-functions Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 02/14] HMM: add special swap filetype for memory migrated to device v2 Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 03/14] HMM: add new HMM page table flag (valid device memory) Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 04/14] HMM: add new HMM page table flag (select flag) Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 05/14] HMM: handle HMM device page table entry on mirror page table fault and update Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 06/14] HMM: mm add helper to update page table when migrating memory back v2 Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 07/14] HMM: mm add helper to update page table when migrating memory v2 Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 08/14] HMM: new callback for copying memory from and to device " Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 09/14] HMM: allow to get pointer to spinlock protecting a directory Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 10/14] HMM: split DMA mapping function in two Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 11/14] HMM: add helpers for migration back to system memory v3 Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 12/14] HMM: fork copy migrated memory into system memory for child process Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 13/14] HMM: CPU page fault on migrated memory Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
2015-10-21 21:10 ` [PATCH v11 14/14] HMM: add mirror fault support for system to device memory migration v3 Jérôme Glisse
2015-10-21 21:10   ` Jérôme Glisse
