linux-kernel.vger.kernel.org archive mirror
* [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region
@ 2019-01-14  9:54 Aneesh Kumar K.V
  2019-01-14  9:54 ` [PATCH V7 1/4] mm/cma: Add PF flag to force non cma alloc Aneesh Kumar K.V
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-14  9:54 UTC (permalink / raw)
  To: akpm, Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe
  Cc: linux-mm, linux-kernel, linuxppc-dev, Aneesh Kumar K.V

ppc64 uses the CMA area for the allocation of the guest page table (hash page table).
We cannot start a guest if we fail to allocate the hash page table. We have observed
hash page table allocation failures because we failed to migrate pages out of the CMA
region; the migration failed because the pages were pinned. This happens when we use
VFIO, since VFIO on ppc64 pins the entire guest RAM. If the guest RAM pages get
allocated out of the CMA region, we won't be able to migrate those pages. The pages
also stay pinned for the lifetime of the guest.

Currently we only support migration of non-compound pages. With THP, and with the
addition of hugetlb migration, we can end up allocating compound pages from the CMA
region. This patch series adds support for migrating compound pages.

Changes from V6:
* use get_user_pages_longterm instead of get_user_pages_cma_migrate()

Changes from V5:
* Add PF_MEMALLOC_NOCMA
* remove __GFP_THISNODE when allocating the target page for migration

Changes from V4:
* use __GFP_NOWARN when allocating pages to avoid page allocation failure warnings.

Changes from V3:
* Move the hugetlb check before transhuge check
* Use compound head page when isolating hugetlb page


Aneesh Kumar K.V (4):
  mm/cma: Add PF flag to force non cma alloc
  mm: Update get_user_pages_longterm to migrate pages allocated from CMA
    region
  powerpc/mm/iommu: Allow migration of cma allocated pages during
    mm_iommu_do_alloc
  powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing

 arch/powerpc/mm/mmu_context_iommu.c | 146 ++++++--------------
 include/linux/hugetlb.h             |   2 +
 include/linux/mm.h                  |   3 +-
 include/linux/sched.h               |   1 +
 include/linux/sched/mm.h            |  48 +++++--
 mm/gup.c                            | 200 ++++++++++++++++++++++++----
 mm/hugetlb.c                        |   4 +-
 7 files changed, 267 insertions(+), 137 deletions(-)

-- 
2.20.1



* [PATCH V7 1/4] mm/cma: Add PF flag to force non cma alloc
  2019-01-14  9:54 [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region Aneesh Kumar K.V
@ 2019-01-14  9:54 ` Aneesh Kumar K.V
  2019-01-29 22:52   ` Andrew Morton
  2019-01-14  9:54 ` [PATCH V7 2/4] mm: Update get_user_pages_longterm to migrate pages allocated from CMA region Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-14  9:54 UTC (permalink / raw)
  To: akpm, Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe
  Cc: linux-mm, linux-kernel, linuxppc-dev, Aneesh Kumar K.V

This patch adds PF_MEMALLOC_NOCMA, which makes sure that any allocation in that
context is marked non-movable and hence cannot be satisfied from the CMA region.

This is useful with get_user_pages_longterm, where we want to take a long-term page
pin and therefore first migrate any pages out of the CMA region. Marking the section
with PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration later.
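
A minimal usage sketch of the intended scoping pattern (illustrative only; rc, start,
nr_pages etc. stand for the usual gup arguments):

	unsigned int flags;

	flags = memalloc_nocma_save();
	/*
	 * Pages faulted in here are allocated with __GFP_MOVABLE cleared,
	 * so they cannot be placed in the CMA region.
	 */
	rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
	memalloc_nocma_restore(flags);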

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/sched.h    |  1 +
 include/linux/sched/mm.h | 48 +++++++++++++++++++++++++++++++++-------
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 224666226e87..687b0edf97b2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1406,6 +1406,7 @@ extern struct pid *cad_pid;
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
 #define PF_MEMSTALL		0x01000000	/* Stalled due to lack of memory */
+#define PF_MEMALLOC_NOCMA	0x02000000	/* All allocation request will have _GFP_MOVABLE cleared */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 3bfa6a0cbba4..0cd9f10423fb 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -148,17 +148,25 @@ static inline bool in_vfork(struct task_struct *tsk)
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	/*
-	 * NOIO implies both NOIO and NOFS and it is a weaker context
-	 * so always make sure it makes precedence
-	 */
-	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
-		flags &= ~(__GFP_IO | __GFP_FS);
-	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
-		flags &= ~__GFP_FS;
+	if (unlikely(current->flags &
+		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
+		/*
+		 * NOIO implies both NOIO and NOFS and it is a weaker context
+		 * so always make sure it makes precedence
+		 */
+		if (current->flags & PF_MEMALLOC_NOIO)
+			flags &= ~(__GFP_IO | __GFP_FS);
+		else if (current->flags & PF_MEMALLOC_NOFS)
+			flags &= ~__GFP_FS;
+#ifdef CONFIG_CMA
+		if (current->flags & PF_MEMALLOC_NOCMA)
+			flags &= ~__GFP_MOVABLE;
+#endif
+	}
 	return flags;
 }
 
@@ -248,6 +256,30 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC) | flags;
 }
 
+#ifdef CONFIG_CMA
+static inline unsigned int memalloc_nocma_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+
+	current->flags |= PF_MEMALLOC_NOCMA;
+	return flags;
+}
+
+static inline void memalloc_nocma_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+}
+#else
+static inline unsigned int memalloc_nocma_save(void)
+{
+	return 0;
+}
+
+static inline void memalloc_nocma_restore(unsigned int flags)
+{
+}
+#endif
+
 #ifdef CONFIG_MEMCG
 /**
  * memalloc_use_memcg - Starts the remote memcg charging scope.
-- 
2.20.1



* [PATCH V7 2/4] mm: Update get_user_pages_longterm to migrate pages allocated from CMA region
  2019-01-14  9:54 [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region Aneesh Kumar K.V
  2019-01-14  9:54 ` [PATCH V7 1/4] mm/cma: Add PF flag to force non cma alloc Aneesh Kumar K.V
@ 2019-01-14  9:54 ` Aneesh Kumar K.V
  2019-01-14  9:54 ` Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-14  9:54 UTC (permalink / raw)
  To: akpm, Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe
  Cc: linux-mm, linux-kernel, linuxppc-dev, Aneesh Kumar K.V

This patch updates get_user_pages_longterm to migrate pages that were allocated out
of the CMA region. This makes sure that we don't keep non-movable pages (made
non-movable by the elevated page reference count) in the CMA area.

This will be used by ppc64 in a later patch to avoid pinning pages in the CMA
region. ppc64 uses the CMA region for allocation of the hardware page table (hash
page table), and not being able to migrate pages out of the CMA region results in
page table allocation failures.

One case where we hit this easily is when a guest uses a VFIO passthrough device.
VFIO locks all of the guest's memory, and if the guest memory is backed by the CMA
region it becomes unmovable, fragmenting the CMA area and possibly preventing
other guests from allocating a large enough hash page table.

NOTE: we allocate the new (migration target) page without using __GFP_THISNODE.
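
A simplified sketch of the resulting flow (illustrative only; the wrapper name below
is made up, and the real implementation with full error handling is in the diff):

	static long longterm_pin_sketch(unsigned long start, unsigned long nr_pages,
					unsigned int gup_flags, struct page **pages,
					struct vm_area_struct **vmas)
	{
		unsigned int flags;
		long rc;

		/* fault pages in with __GFP_MOVABLE cleared so that newly
		 * allocated pages do not land in the CMA region */
		flags = memalloc_nocma_save();
		rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
		memalloc_nocma_restore(flags);
		if (rc <= 0)
			return rc;

		/* pages that were already in a CMA pageblock get isolated,
		 * migrated to non-CMA memory and then pinned again */
		return check_and_migrate_cma_pages(start, rc, gup_flags, pages, vmas);
	}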

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/hugetlb.h |   2 +
 include/linux/mm.h      |   3 +-
 mm/gup.c                | 200 +++++++++++++++++++++++++++++++++++-----
 mm/hugetlb.c            |   4 +-
 4 files changed, 182 insertions(+), 27 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 087fd5f48c91..1eed0cdaec0e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -371,6 +371,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+				     int nid, nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..20ec56f8e2bb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1536,7 +1536,8 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
-#ifdef CONFIG_FS_DAX
+
+#if defined(CONFIG_FS_DAX) || defined(CONFIG_CMA)
 long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 05acd7e2eb22..6e8152594e83 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -13,6 +13,9 @@
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
 #include <linux/hugetlb.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
+#include <linux/sched/mm.h>
 
 #include <asm/mmu_context.h>
 #include <asm/pgtable.h>
@@ -1126,7 +1129,167 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
+
 #ifdef CONFIG_FS_DAX
+static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
+{
+	long i;
+	struct vm_area_struct *vma_prev = NULL;
+
+	for (i = 0; i < nr_pages; i++) {
+		struct vm_area_struct *vma = vmas[i];
+
+		if (vma == vma_prev)
+			continue;
+
+		vma_prev = vma;
+
+		if (vma_is_fsdax(vma))
+			return true;
+	}
+	return false;
+}
+#else
+static inline bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
+{
+	return false;
+}
+#endif
+
+#ifdef CONFIG_CMA
+static struct page *new_non_cma_page(struct page *page, unsigned long private)
+{
+	/*
+	 * We want to make sure we allocate the new page from the same node
+	 * as the source page.
+	 */
+	int nid = page_to_nid(page);
+	/*
+	 * Trying to allocate a page for migration. Ignore allocation
+	 * failure warnings. We don't force __GFP_THISNODE here because
+	 * this node here is the node where we have CMA reservation and
+	 * in some case these nodes will have really less non movable
+	 * allocation memory.
+	 */
+	gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
+
+	if (PageHighMem(page))
+		gfp_mask |= __GFP_HIGHMEM;
+
+#ifdef CONFIG_HUGETLB_PAGE
+	if (PageHuge(page)) {
+		struct hstate *h = page_hstate(page);
+		/*
+		 * We don't want to dequeue from the pool because pool pages will
+		 * mostly be from the CMA region.
+		 */
+		return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+	}
+#endif
+	if (PageTransHuge(page)) {
+		struct page *thp;
+		/*
+		 * ignore allocation failure warnings
+		 */
+		gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN;
+
+		/*
+		 * Remove the movable mask so that we don't allocate from
+		 * CMA area again.
+		 */
+		thp_gfpmask &= ~__GFP_MOVABLE;
+		thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	}
+
+	return __alloc_pages_node(nid, gfp_mask, 0);
+}
+
+static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
+					unsigned int gup_flags,
+					struct page **pages,
+					struct vm_area_struct **vmas)
+{
+	long i;
+	bool drain_allow = true;
+	bool migrate_allow = true;
+	LIST_HEAD(cma_page_list);
+
+check_again:
+	for (i = 0; i < nr_pages; i++) {
+		/*
+		 * If we get a page from the CMA zone, since we are going to
+		 * be pinning these entries, we might as well move them out
+		 * of the CMA zone if possible.
+		 */
+		if (is_migrate_cma_page(pages[i])) {
+
+			struct page *head = compound_head(pages[i]);
+
+			if (PageHuge(head)) {
+				isolate_huge_page(head, &cma_page_list);
+			} else {
+				if (!PageLRU(head) && drain_allow) {
+					lru_add_drain_all();
+					drain_allow = false;
+				}
+
+				if (!isolate_lru_page(head)) {
+					list_add_tail(&head->lru, &cma_page_list);
+					mod_node_page_state(page_pgdat(head),
+							    NR_ISOLATED_ANON +
+							    page_is_file_cache(head),
+							    hpage_nr_pages(head));
+				}
+			}
+		}
+	}
+
+	if (!list_empty(&cma_page_list)) {
+		/*
+		 * drop the above get_user_pages reference.
+		 */
+		for (i = 0; i < nr_pages; i++)
+			put_page(pages[i]);
+
+		if (migrate_pages(&cma_page_list, new_non_cma_page,
+				  NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
+			/*
+			 * some of the pages failed migration. Do get_user_pages
+			 * without migration.
+			 */
+			migrate_allow = false;
+
+			if (!list_empty(&cma_page_list))
+				putback_movable_pages(&cma_page_list);
+		}
+		/*
+		 * We did migrate all the pages, Try to get the page references again
+		 * migrating any new CMA pages which we failed to isolate earlier.
+		 */
+		nr_pages = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+		if ((nr_pages > 0) && migrate_allow) {
+			drain_allow = true;
+			goto check_again;
+		}
+	}
+
+	return nr_pages;
+}
+#else
+static inline long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
+					       unsigned int gup_flags,
+					       struct page **pages,
+					       struct vm_area_struct **vmas)
+{
+	return nr_pages;
+}
+#endif
+
 /*
  * This is the same as get_user_pages() in that it assumes we are
  * operating on the current task's mm, but it goes further to validate
@@ -1140,11 +1303,11 @@ EXPORT_SYMBOL(get_user_pages);
  * Contrast this to iov_iter_get_pages() usages which are transient.
  */
 long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
-		unsigned int gup_flags, struct page **pages,
-		struct vm_area_struct **vmas_arg)
+			     unsigned int gup_flags, struct page **pages,
+			     struct vm_area_struct **vmas_arg)
 {
 	struct vm_area_struct **vmas = vmas_arg;
-	struct vm_area_struct *vma_prev = NULL;
+	unsigned long flags;
 	long rc, i;
 
 	if (!pages)
@@ -1157,31 +1320,20 @@ long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
 			return -ENOMEM;
 	}
 
+	flags = memalloc_nocma_save();
 	rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+	memalloc_nocma_restore(flags);
+	if (rc < 0)
+		goto out;
 
-	for (i = 0; i < rc; i++) {
-		struct vm_area_struct *vma = vmas[i];
-
-		if (vma == vma_prev)
-			continue;
-
-		vma_prev = vma;
-
-		if (vma_is_fsdax(vma))
-			break;
-	}
-
-	/*
-	 * Either get_user_pages() failed, or the vma validation
-	 * succeeded, in either case we don't need to put_page() before
-	 * returning.
-	 */
-	if (i >= rc)
+	if (check_dax_vmas(vmas, rc)) {
+		for (i = 0; i < rc; i++)
+			put_page(pages[i]);
+		rc = -EOPNOTSUPP;
 		goto out;
+	}
 
-	for (i = 0; i < rc; i++)
-		put_page(pages[i]);
-	rc = -EOPNOTSUPP;
+	rc = check_and_migrate_cma_pages(start, rc, gup_flags, pages, vmas);
 out:
 	if (vmas != vmas_arg)
 		kfree(vmas);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df2e7dd5ff17..913862771808 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1586,8 +1586,8 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	return page;
 }
 
-static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
-		int nid, nodemask_t *nmask)
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+				     int nid, nodemask_t *nmask)
 {
 	struct page *page;
 
-- 
2.20.1




* [PATCH V7 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_do_alloc
  2019-01-14  9:54 [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2019-01-14  9:54 ` Aneesh Kumar K.V
@ 2019-01-14  9:54 ` Aneesh Kumar K.V
  2019-01-30 11:34   ` Michael Ellerman
  2019-01-14  9:54 ` [PATCH V7 4/4] powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-14  9:54 UTC (permalink / raw)
  To: akpm, Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe
  Cc: linux-mm, linux-kernel, linuxppc-dev, Aneesh Kumar K.V

The current code doesn't do page migration if the page allocated is a compound page.
With hugetlb migration support, we can end up allocating hugetlb pages from the
CMA region. THP pages can also be allocated from the CMA region. This patch updates
the code to handle compound pages correctly. The patch also switches to a single
get_user_pages call with the right count, instead of doing one get_user_pages call
per page. That avoids reading the page tables multiple times.

Since these page references are long-term pins, switch to
get_user_pages_longterm. That makes sure we fail correctly if the guest RAM
is backed by DAX pages.

The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union.
We use the same storage location to store pointers to struct page. We cannot
update all the code paths to use struct page *, because we access hpas in real mode
and we can't do the struct page * to pfn conversion in real mode.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/mmu_context_iommu.c | 126 +++++++++-------------------
 1 file changed, 39 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index a712a650a8b6..f11a2f15071f 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -21,6 +21,7 @@
 #include <linux/sizes.h>
 #include <asm/mmu_context.h>
 #include <asm/pte-walk.h>
+#include <linux/mm_inline.h>
 
 static DEFINE_MUTEX(mem_list_mutex);
 
@@ -34,8 +35,18 @@ struct mm_iommu_table_group_mem_t {
 	atomic64_t mapped;
 	unsigned int pageshift;
 	u64 ua;			/* userspace address */
-	u64 entries;		/* number of entries in hpas[] */
-	u64 *hpas;		/* vmalloc'ed */
+	u64 entries;		/* number of entries in hpas/hpages[] */
+	/*
+	 * in mm_iommu_get we temporarily use this to store
+	 * struct page address.
+	 *
+	 * We need to convert ua to hpa in real mode. Make it
+	 * simpler by storing physical address.
+	 */
+	union {
+		struct page **hpages;	/* vmalloc'ed */
+		phys_addr_t *hpas;
+	};
 #define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
 	u64 dev_hpa;		/* Device memory base address */
 };
@@ -80,64 +91,15 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-/*
- * Taken from alloc_migrate_target with changes to remove CMA allocations
- */
-struct page *new_iommu_non_cma_page(struct page *page, unsigned long private)
-{
-	gfp_t gfp_mask = GFP_USER;
-	struct page *new_page;
-
-	if (PageCompound(page))
-		return NULL;
-
-	if (PageHighMem(page))
-		gfp_mask |= __GFP_HIGHMEM;
-
-	/*
-	 * We don't want the allocation to force an OOM if possibe
-	 */
-	new_page = alloc_page(gfp_mask | __GFP_NORETRY | __GFP_NOWARN);
-	return new_page;
-}
-
-static int mm_iommu_move_page_from_cma(struct page *page)
-{
-	int ret = 0;
-	LIST_HEAD(cma_migrate_pages);
-
-	/* Ignore huge pages for now */
-	if (PageCompound(page))
-		return -EBUSY;
-
-	lru_add_drain();
-	ret = isolate_lru_page(page);
-	if (ret)
-		return ret;
-
-	list_add(&page->lru, &cma_migrate_pages);
-	put_page(page); /* Drop the gup reference */
-
-	ret = migrate_pages(&cma_migrate_pages, new_iommu_non_cma_page,
-				NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE);
-	if (ret) {
-		if (!list_empty(&cma_migrate_pages))
-			putback_movable_pages(&cma_migrate_pages);
-	}
-
-	return 0;
-}
-
 static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
-		unsigned long entries, unsigned long dev_hpa,
-		struct mm_iommu_table_group_mem_t **pmem)
+			      unsigned long entries, unsigned long dev_hpa,
+			      struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
-	long i, j, ret = 0, locked_entries = 0;
+	long i, ret = 0, locked_entries = 0;
 	unsigned int pageshift;
 	unsigned long flags;
 	unsigned long cur_ua;
-	struct page *page = NULL;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -187,41 +149,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		goto unlock_exit;
 	}
 
+	down_read(&mm->mmap_sem);
+	ret = get_user_pages_longterm(ua, entries, FOLL_WRITE, mem->hpages, NULL);
+	up_read(&mm->mmap_sem);
+	if (ret != entries) {
+		/* free the reference taken */
+		for (i = 0; i < ret; i++)
+			put_page(mem->hpages[i]);
+
+		vfree(mem->hpas);
+		kfree(mem);
+		ret = -EFAULT;
+		goto unlock_exit;
+	} else {
+		ret = 0;
+	}
+
+	pageshift = PAGE_SHIFT;
 	for (i = 0; i < entries; ++i) {
+		struct page *page = mem->hpages[i];
+
 		cur_ua = ua + (i << PAGE_SHIFT);
-		if (1 != get_user_pages_fast(cur_ua,
-					1/* pages */, 1/* iswrite */, &page)) {
-			ret = -EFAULT;
-			for (j = 0; j < i; ++j)
-				put_page(pfn_to_page(mem->hpas[j] >>
-						PAGE_SHIFT));
-			vfree(mem->hpas);
-			kfree(mem);
-			goto unlock_exit;
-		}
-		/*
-		 * If we get a page from the CMA zone, since we are going to
-		 * be pinning these entries, we might as well move them out
-		 * of the CMA zone if possible. NOTE: faulting in + migration
-		 * can be expensive. Batching can be considered later
-		 */
-		if (is_migrate_cma_page(page)) {
-			if (mm_iommu_move_page_from_cma(page))
-				goto populate;
-			if (1 != get_user_pages_fast(cur_ua,
-						1/* pages */, 1/* iswrite */,
-						&page)) {
-				ret = -EFAULT;
-				for (j = 0; j < i; ++j)
-					put_page(pfn_to_page(mem->hpas[j] >>
-								PAGE_SHIFT));
-				vfree(mem->hpas);
-				kfree(mem);
-				goto unlock_exit;
-			}
-		}
-populate:
-		pageshift = PAGE_SHIFT;
 		if (mem->pageshift > PAGE_SHIFT && PageCompound(page)) {
 			pte_t *pte;
 			struct page *head = compound_head(page);
@@ -239,6 +187,10 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 			local_irq_restore(flags);
 		}
 		mem->pageshift = min(mem->pageshift, pageshift);
+		/*
+		 * We don't need struct page reference any more, switch
+		 * to physical address.
+		 */
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
 	}
 
-- 
2.20.1



* [PATCH V7 4/4] powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing
  2019-01-14  9:54 [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2019-01-14  9:54 ` [PATCH V7 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_do_alloc Aneesh Kumar K.V
@ 2019-01-14  9:54 ` Aneesh Kumar K.V
  2019-01-14  9:54 ` [PATCH V7 5/5] testing Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-14  9:54 UTC (permalink / raw)
  To: akpm, Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe
  Cc: linux-mm, linux-kernel, linuxppc-dev, Aneesh Kumar K.V

THP pages can get split along various code paths. An elevated reference count does
ensure that the compound page itself will not be split, but the pmd entry mapping it
can still be converted to level 4 pte entries. Keep the code simpler by allowing a
large IOMMU page size only if the guest RAM is backed by hugetlb pages.
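
The net effect is roughly the following check per pinned page (a simplified sketch;
see the diff below):

	pageshift = PAGE_SHIFT;
	if (mem->pageshift > PAGE_SHIFT && PageHuge(page)) {
		/* hugetlb pages don't get split, so the compound order is stable */
		pageshift = compound_order(compound_head(page)) + PAGE_SHIFT;
	}
	mem->pageshift = min(mem->pageshift, pageshift);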

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/mmu_context_iommu.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index f11a2f15071f..74d43f522dc2 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -98,8 +98,6 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, ret = 0, locked_entries = 0;
 	unsigned int pageshift;
-	unsigned long flags;
-	unsigned long cur_ua;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -169,22 +167,14 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	for (i = 0; i < entries; ++i) {
 		struct page *page = mem->hpages[i];
 
-		cur_ua = ua + (i << PAGE_SHIFT);
-		if (mem->pageshift > PAGE_SHIFT && PageCompound(page)) {
-			pte_t *pte;
+		/*
+		 * Allow to use larger than 64k IOMMU pages. Only do that
+		 * if we are backed by hugetlb.
+		 */
+		if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page)) {
 			struct page *head = compound_head(page);
-			unsigned int compshift = compound_order(head);
-			unsigned int pteshift;
-
-			local_irq_save(flags); /* disables as well */
-			pte = find_linux_pte(mm->pgd, cur_ua, NULL, &pteshift);
-
-			/* Double check it is still the same pinned page */
-			if (pte && pte_page(*pte) == head &&
-			    pteshift == compshift + PAGE_SHIFT)
-				pageshift = max_t(unsigned int, pteshift,
-						PAGE_SHIFT);
-			local_irq_restore(flags);
+
+			pageshift = compound_order(head) + PAGE_SHIFT;
 		}
 		mem->pageshift = min(mem->pageshift, pageshift);
 		/*
-- 
2.20.1



* [PATCH V7 5/5] testing
  2019-01-14  9:54 [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2019-01-14  9:54 ` [PATCH V7 4/4] powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing Aneesh Kumar K.V
@ 2019-01-14  9:54 ` Aneesh Kumar K.V
  2019-01-15 11:25   ` Aneesh Kumar K.V
  2019-01-29 22:56 ` [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region Andrew Morton
  2019-02-26 23:53 ` Andrew Morton
  7 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-14  9:54 UTC (permalink / raw)
  To: akpm, Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe
  Cc: linux-mm, linux-kernel, linuxppc-dev, Aneesh Kumar K.V

---
 mm/gup.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6e8152594e83..91849c39931a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1226,7 +1226,7 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
 		 * be pinning these entries, we might as well move them out
 		 * of the CMA zone if possible.
 		 */
-		if (is_migrate_cma_page(pages[i])) {
+		if (true || is_migrate_cma_page(pages[i])) {
 
 			struct page *head = compound_head(pages[i]);
 
@@ -1256,6 +1256,7 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
 		for (i = 0; i < nr_pages; i++)
 			put_page(pages[i]);
 
+		pr_emerg("migrating nr_pages");
 		if (migrate_pages(&cma_page_list, new_non_cma_page,
 				  NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
 			/*
@@ -1274,10 +1275,11 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
 		nr_pages = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
 		if ((nr_pages > 0) && migrate_allow) {
 			drain_allow = true;
-			goto check_again;
+			//goto check_again;
 		}
 	}
 
+	pr_emerg("Returning with %ld\n", nr_pages);
 	return nr_pages;
 }
 #else
-- 
2.20.1



* Re: [PATCH V7 5/5] testing
  2019-01-14  9:54 ` [PATCH V7 5/5] testing Aneesh Kumar K.V
@ 2019-01-15 11:25   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-15 11:25 UTC (permalink / raw)
  To: akpm, Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe
  Cc: linux-mm, linux-kernel, linuxppc-dev

You can ignore this one.

On 1/14/19 3:24 PM, Aneesh Kumar K.V wrote:
> ---
>   mm/gup.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 6e8152594e83..91849c39931a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1226,7 +1226,7 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
>   		 * be pinning these entries, we might as well move them out
>   		 * of the CMA zone if possible.
>   		 */
> -		if (is_migrate_cma_page(pages[i])) {
> +		if (true || is_migrate_cma_page(pages[i])) {
>   
>   			struct page *head = compound_head(pages[i]);
>   
> @@ -1256,6 +1256,7 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
>   		for (i = 0; i < nr_pages; i++)
>   			put_page(pages[i]);
>   
> +		pr_emerg("migrating nr_pages");
>   		if (migrate_pages(&cma_page_list, new_non_cma_page,
>   				  NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
>   			/*
> @@ -1274,10 +1275,11 @@ static long check_and_migrate_cma_pages(unsigned long start, long nr_pages,
>   		nr_pages = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
>   		if ((nr_pages > 0) && migrate_allow) {
>   			drain_allow = true;
> -			goto check_again;
> +			//goto check_again;
>   		}
>   	}
>   
> +	pr_emerg("Returning with %ld\n", nr_pages);
>   	return nr_pages;
>   }
>   #else
> 



* Re: [PATCH V7 1/4] mm/cma: Add PF flag to force non cma alloc
  2019-01-14  9:54 ` [PATCH V7 1/4] mm/cma: Add PF flag to force non cma alloc Aneesh Kumar K.V
@ 2019-01-29 22:52   ` Andrew Morton
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2019-01-29 22:52 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe, linux-mm, linux-kernel, linuxppc-dev

On Mon, 14 Jan 2019 15:24:33 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> This patch adds PF_MEMALLOC_NOCMA which make sure any allocation in that context
> is marked non-movable and hence cannot be satisfied by CMA region.
> 
> This is useful with get_user_pages_longterm where we want to take a page pin by
> migrating pages from CMA region. Marking the section PF_MEMALLOC_NOCMA ensures
> that we avoid unnecessary page migration later.
> 
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1406,6 +1406,7 @@ extern struct pid *cad_pid;
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
>  #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
>  #define PF_MEMSTALL		0x01000000	/* Stalled due to lack of memory */
> +#define PF_MEMALLOC_NOCMA	0x02000000	/* All allocation request will have _GFP_MOVABLE cleared */
>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
>  #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
>  #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */

This flag value has been taken by PF_UMH, so I moved it to 0x10000000.

And we have now run out of PF_ flags.
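
i.e. presumably something like:

	#define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation requests will have __GFP_MOVABLE cleared */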


* Re: [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region
  2019-01-14  9:54 [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2019-01-14  9:54 ` [PATCH V7 5/5] testing Aneesh Kumar K.V
@ 2019-01-29 22:56 ` Andrew Morton
  2019-02-26 23:53 ` Andrew Morton
  7 siblings, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2019-01-29 22:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe, linux-mm, linux-kernel, linuxppc-dev

On Mon, 14 Jan 2019 15:24:32 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> ppc64 use CMA area for the allocation of guest page table (hash page table). We won't
> be able to start guest if we fail to allocate hash page table. We have observed
> hash table allocation failure because we failed to migrate pages out of CMA region
> because they were pinned. This happen when we are using VFIO. VFIO on ppc64 pins
> the entire guest RAM. If the guest RAM pages get allocated out of CMA region, we
> won't be able to migrate those pages. The pages are also pinned for the lifetime of the
> guest.
> 
> Currently we support migration of non-compound pages. With THP and with the addition of
>  hugetlb migration we can end up allocating compound pages from CMA region. This
> patch series add support for migrating compound pages. 

Very little review activity is in evidence.  Please identify some
appropriate reviewers and ask them to take a look?



* Re: [PATCH V7 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_do_alloc
  2019-01-14  9:54 ` [PATCH V7 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_do_alloc Aneesh Kumar K.V
@ 2019-01-30 11:34   ` Michael Ellerman
  2019-01-31  4:42     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 15+ messages in thread
From: Michael Ellerman @ 2019-01-30 11:34 UTC (permalink / raw)
  To: Aneesh Kumar K.V, akpm, Michal Hocko, Alexey Kardashevskiy,
	David Gibson, Andrea Arcangeli
  Cc: linux-mm, linux-kernel, linuxppc-dev, Aneesh Kumar K.V

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> The current code doesn't do page migration if the page allocated is a compound page.
> With HugeTLB migration support, we can end up allocating hugetlb pages from
> CMA region. Also, THP pages can be allocated from CMA region. This patch updates
> the code to handle compound pages correctly. The patch also switches to a single
> get_user_pages with the right count, instead of doing one get_user_pages per page.
> That avoids reading page table multiple times.

It's not very obvious from the above description that the migration
logic is now being done by get_user_pages_longterm(); it just looks like
it's all being deleted in this patch. It would be good to mention that.

> Since these page reference updates are long term pin, switch to
> get_user_pages_longterm. That makes sure we fail correctly if the guest RAM
> is backed by DAX pages.

Can you explain that in more detail?

> The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union.
> We use the same storage location to store pointers to struct page. We cannot
> update all the code path use struct page *, because we access hpas in real mode
> and we can't do that struct page * to pfn conversion in real mode.

That's a pain, it's asking for bugs mixing two different values in the
same array. But I guess it's the least worst option.

It sounds like that's a separate change you could do in a separate
patch. But it's not, because it's tied to the fact that we're doing a
single GUP call.


> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index a712a650a8b6..f11a2f15071f 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -21,6 +21,7 @@
>  #include <linux/sizes.h>
>  #include <asm/mmu_context.h>
>  #include <asm/pte-walk.h>
> +#include <linux/mm_inline.h>
>  
>  static DEFINE_MUTEX(mem_list_mutex);
>  
> @@ -34,8 +35,18 @@ struct mm_iommu_table_group_mem_t {
>  	atomic64_t mapped;
>  	unsigned int pageshift;
>  	u64 ua;			/* userspace address */
> -	u64 entries;		/* number of entries in hpas[] */
> -	u64 *hpas;		/* vmalloc'ed */
> +	u64 entries;		/* number of entries in hpas/hpages[] */
> +	/*
> +	 * in mm_iommu_get we temporarily use this to store
> +	 * struct page address.
> +	 *
> +	 * We need to convert ua to hpa in real mode. Make it
> +	 * simpler by storing physical address.
> +	 */
> +	union {
> +		struct page **hpages;	/* vmalloc'ed */
> +		phys_addr_t *hpas;
> +	};
>  #define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
>  	u64 dev_hpa;		/* Device memory base address */
>  };
> @@ -80,64 +91,15 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>  
> -/*
> - * Taken from alloc_migrate_target with changes to remove CMA allocations
> - */
> -struct page *new_iommu_non_cma_page(struct page *page, unsigned long private)
> -{
> -	gfp_t gfp_mask = GFP_USER;
> -	struct page *new_page;
> -
> -	if (PageCompound(page))
> -		return NULL;
> -
> -	if (PageHighMem(page))
> -		gfp_mask |= __GFP_HIGHMEM;
> -
> -	/*
> -	 * We don't want the allocation to force an OOM if possibe
> -	 */
> -	new_page = alloc_page(gfp_mask | __GFP_NORETRY | __GFP_NOWARN);
> -	return new_page;
> -}
> -
> -static int mm_iommu_move_page_from_cma(struct page *page)
> -{
> -	int ret = 0;
> -	LIST_HEAD(cma_migrate_pages);
> -
> -	/* Ignore huge pages for now */
> -	if (PageCompound(page))
> -		return -EBUSY;
> -
> -	lru_add_drain();
> -	ret = isolate_lru_page(page);
> -	if (ret)
> -		return ret;
> -
> -	list_add(&page->lru, &cma_migrate_pages);
> -	put_page(page); /* Drop the gup reference */
> -
> -	ret = migrate_pages(&cma_migrate_pages, new_iommu_non_cma_page,
> -				NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE);
> -	if (ret) {
> -		if (!list_empty(&cma_migrate_pages))
> -			putback_movable_pages(&cma_migrate_pages);
> -	}
> -
> -	return 0;
> -}
> -
>  static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
> -		unsigned long entries, unsigned long dev_hpa,
> -		struct mm_iommu_table_group_mem_t **pmem)
> +			      unsigned long entries, unsigned long dev_hpa,
> +			      struct mm_iommu_table_group_mem_t **pmem)
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
> -	long i, j, ret = 0, locked_entries = 0;
> +	long i, ret = 0, locked_entries = 0;

I'd prefer we didn't initialise ret here.

>  	unsigned int pageshift;
>  	unsigned long flags;
>  	unsigned long cur_ua;
> -	struct page *page = NULL;
>  
>  	mutex_lock(&mem_list_mutex);
>  
> @@ -187,41 +149,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
>  		goto unlock_exit;
>  	}
>  
> +	down_read(&mm->mmap_sem);
> +	ret = get_user_pages_longterm(ua, entries, FOLL_WRITE, mem->hpages, NULL);
> +	up_read(&mm->mmap_sem);
> +	if (ret != entries) {
> +		/* free the reference taken */
> +		for (i = 0; i < ret; i++)
> +			put_page(mem->hpages[i]);
> +
> +		vfree(mem->hpas);
> +		kfree(mem);
> +		ret = -EFAULT;
> +		goto unlock_exit;
> +	} else {
> +		ret = 0;

Or here.

Instead it should be set to 0 at good_exit.

> +	}
> +
> +	pageshift = PAGE_SHIFT;
>  	for (i = 0; i < entries; ++i) {
> +		struct page *page = mem->hpages[i];
> +
>  		cur_ua = ua + (i << PAGE_SHIFT);
> -		if (1 != get_user_pages_fast(cur_ua,
> -					1/* pages */, 1/* iswrite */, &page)) {
> -			ret = -EFAULT;
> -			for (j = 0; j < i; ++j)
> -				put_page(pfn_to_page(mem->hpas[j] >>
> -						PAGE_SHIFT));
> -			vfree(mem->hpas);
> -			kfree(mem);
> -			goto unlock_exit;
> -		}
> -		/*
> -		 * If we get a page from the CMA zone, since we are going to
> -		 * be pinning these entries, we might as well move them out
> -		 * of the CMA zone if possible. NOTE: faulting in + migration
> -		 * can be expensive. Batching can be considered later
> -		 */
> -		if (is_migrate_cma_page(page)) {
> -			if (mm_iommu_move_page_from_cma(page))
> -				goto populate;
> -			if (1 != get_user_pages_fast(cur_ua,
> -						1/* pages */, 1/* iswrite */,
> -						&page)) {
> -				ret = -EFAULT;
> -				for (j = 0; j < i; ++j)
> -					put_page(pfn_to_page(mem->hpas[j] >>
> -								PAGE_SHIFT));
> -				vfree(mem->hpas);
> -				kfree(mem);
> -				goto unlock_exit;
> -			}
> -		}
> -populate:
> -		pageshift = PAGE_SHIFT;
>  		if (mem->pageshift > PAGE_SHIFT && PageCompound(page)) {
>  			pte_t *pte;
>  			struct page *head = compound_head(page);
> @@ -239,6 +187,10 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
>  			local_irq_restore(flags);
>  		}
>  		mem->pageshift = min(mem->pageshift, pageshift);
> +		/*
> +		 * We don't need struct page reference any more, switch
> +		 * to physical address.
> +		 */
>  		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>  	}

I'm not any sort of expert on this code, but I don't see anything wrong.

Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>

cheers


* Re: [PATCH V7 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_do_alloc
  2019-01-30 11:34   ` Michael Ellerman
@ 2019-01-31  4:42     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-01-31  4:42 UTC (permalink / raw)
  To: Michael Ellerman, akpm, Michal Hocko, Alexey Kardashevskiy,
	David Gibson, Andrea Arcangeli
  Cc: linux-mm, linux-kernel, linuxppc-dev

Michael Ellerman <mpe@ellerman.id.au> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> The current code doesn't do page migration if the page allocated is a compound page.
>> With HugeTLB migration support, we can end up allocating hugetlb pages from
>> CMA region. Also, THP pages can be allocated from CMA region. This patch updates
>> the code to handle compound pages correctly. The patch also switches to a single
>> get_user_pages with the right count, instead of doing one get_user_pages per page.
>> That avoids reading page table multiple times.
>
> It's not very obvious from the above description that the migration
> logic is now being done by get_user_pages_longterm(), it just looks like
> it's all being deleted in this patch. Would be good to mention that.
>
>> Since these page reference updates are long term pin, switch to
>> get_user_pages_longterm. That makes sure we fail correctly if the guest RAM
>> is backed by DAX pages.
>
> Can you explain that in more detail?

DAX page lifetime is dictated by filesystem rules and, as such, we need
to make sure that these pages are freed on operations like truncate and
hole punch. If we take a long-term pin on these pages, which are mostly
returned to userspace with an elevated page count, the entity holding the
long-term pin may not be aware that the file got truncated and that the
filesystem blocks possibly got reused. That can result in corruption.

Work is going on to solve this issue, either by making operations like
truncate wait, or by not releasing these elevated-reference-count pages /
filesystem blocks back to the filesystem free list.

Until then we prevent long-term pins on DAX pages.

Now that we have an API for long-term pins, we should ideally be using
it in the VFIO code.


>
>> The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union.
>> We use the same storage location to store pointers to struct page. We cannot
>> update all the code path use struct page *, because we access hpas in real mode
>> and we can't do that struct page * to pfn conversion in real mode.
>
> That's a pain, it's asking for bugs mixing two different values in the
> same array. But I guess it's the least worst option.
>
> It sounds like that's a separate change you could do in a separate
> patch. But it's not, because it's tied to the fact that we're doing a
> single GUP call.

-aneesh



* Re: [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region
  2019-01-14  9:54 [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2019-01-29 22:56 ` [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region Andrew Morton
@ 2019-02-26 23:53 ` Andrew Morton
  2019-02-27  8:52   ` Aneesh Kumar K.V
  7 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2019-02-26 23:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe, linux-mm, linux-kernel, linuxppc-dev

[patch 1/4]: OK.  I guess.  Was this worth consuming our last PF_ flag?
[patch 2/4]: unreviewed
[patch 3/4]: unreviewed, mpe still unhappy, I expect?
[patch 4/4]: unreviewed


* Re: [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region
  2019-02-26 23:53 ` Andrew Morton
@ 2019-02-27  8:52   ` Aneesh Kumar K.V
  2019-02-27 11:29     ` Michael Ellerman
  0 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2019-02-27  8:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, mpe, linux-mm, linux-kernel, linuxppc-dev

Andrew Morton <akpm@linux-foundation.org> writes:

> [patch 1/4]: OK.  I guess.  Was this worth consuming our last PF_ flag?

That was done based on a request from Andrea, and it also helps by avoiding
allocating pages from the CMA region when we know we are going to migrate
them out anyway. So yes, this helps.

> [patch 2/4]: unreviewed
> [patch 3/4]: unreviewed, mpe still unhappy, I expect?

I did reply to that email. I guess mpe is ok with that?

> [patch 4/4]: unreviewed

-aneesh



* Re: [PATCH V7 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region
  2019-02-27  8:52   ` Aneesh Kumar K.V
@ 2019-02-27 11:29     ` Michael Ellerman
  0 siblings, 0 replies; 15+ messages in thread
From: Michael Ellerman @ 2019-02-27 11:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Andrew Morton
  Cc: Michal Hocko, Alexey Kardashevskiy, David Gibson,
	Andrea Arcangeli, linux-mm, linux-kernel, linuxppc-dev

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> Andrew Morton <akpm@linux-foundation.org> writes:
>
>> [patch 1/4]: OK.  I guess.  Was this worth consuming our last PF_ flag?
>
> That was done based on request from Andrea and it also helps in avoiding
> allocating pages from CMA region where we know we are anyway going to
> migrate them out. So yes, this helps. 
>
>> [patch 2/4]: unreviewed
>> [patch 3/4]: unreviewed, mpe still unhappy, I expect?
>
> I did reply to that email. I guess mpe is ok with that?

It would be nice to fold your explanation about DAX into the change log,
so it's there for people to see.

And I think my comment about initialising ret still stands.

cheers

