* [PATCH 0/3] fixes for large mm_populate() and munlock() operations
@ 2013-01-31  0:26 Michel Lespinasse
  2013-01-31  0:26 ` [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages() Michel Lespinasse
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Michel Lespinasse @ 2013-01-31  0:26 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm
  Cc: linux-kernel

These 3 changes are to improve the handling of large mm_populate and
munlock operations. They apply on top of mmotm (in particular, they
depend on both my prior mm_populate work and Kirill's "thp: avoid
dumping huge zero page" change).

- Patch 1 fixes an integer overflow issue when populating 2^32 pages.
  The nr_pages argument to get_user_pages would overflow, resulting in 0
  pages being processed per iteration. I am proposing to simply convert
  the nr_pages argument to a long.

- Patch 2 accelerates populating regions with THP pages. get_user_pages()
  can increment the address by a huge page size in this case instead of
  a small page size, and avoid repeated mm->page_table_lock acquisitions.
  This fixes an issue reported by Roman Dubtsov where populating regions
  via mmap MAP_POPULATE was significantly slower than doing so by
  touching pages from userspace.

- Patch 3 is a similar acceleration for the munlock case. I would actually
  like to get Andrea's attention on this one, as I can't explain how
  munlock_vma_page() is safe against racing with split_huge_page().

Note that patches 1-2 are logically independent of patch 3, so if the
discussion of patch 3 takes too long I would ask Andrew to consider
merging patches 1-2 first.

Michel Lespinasse (3):
  mm: use long type for page counts in mm_populate() and get_user_pages()
  mm: accelerate mm_populate() treatment of THP pages
  mm: accelerate munlock() treatment of THP pages

 arch/ia64/xen/xencomm.c    |  3 ++-
 arch/powerpc/kernel/vdso.c |  9 +++++----
 arch/s390/mm/pgtable.c     |  3 ++-
 include/linux/hugetlb.h    |  6 +++---
 include/linux/mm.h         | 16 ++++++++--------
 mm/hugetlb.c               | 10 +++++-----
 mm/internal.h              |  2 +-
 mm/ksm.c                   | 10 +++++++---
 mm/memory.c                | 39 ++++++++++++++++++++++++++-------------
 mm/migrate.c               |  7 +++++--
 mm/mlock.c                 | 37 ++++++++++++++++++++++++-------------
 11 files changed, 88 insertions(+), 54 deletions(-)

-- 
1.8.1



* [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages()
  2013-01-31  0:26 [PATCH 0/3] fixes for large mm_populate() and munlock() operations Michel Lespinasse
@ 2013-01-31  0:26 ` Michel Lespinasse
  2013-01-31  0:42   ` Andrew Morton
  2013-02-07  0:39   ` Sasha Levin
  2013-01-31  0:26 ` [PATCH 2/3] mm: accelerate mm_populate() treatment of THP pages Michel Lespinasse
  2013-01-31  0:26 ` [RFC PATCH 3/3] mm: accelerate munlock() " Michel Lespinasse
  2 siblings, 2 replies; 12+ messages in thread
From: Michel Lespinasse @ 2013-01-31  0:26 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm
  Cc: linux-kernel

Use long type for page counts in mm_populate() so as to avoid integer
overflow when running the following test code:

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
  void *p = mmap(NULL, 0x100000000000, PROT_READ,
                 MAP_PRIVATE | MAP_ANON, -1, 0);
  printf("p: %p\n", p);
  mlockall(MCL_CURRENT);
  printf("done\n");
  return 0;
}
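
For reference, the arithmetic that goes wrong here (a sketch, assuming a
4 KB PAGE_SIZE):

  unsigned long len    = 0x100000000000UL;   /* the 16 TB mapping above  */
  long          npages = len / 4096;          /* 0x100000000 == 2^32      */
  int           as_int = npages;              /* truncates to 0           */

so the int page count passed down to get_user_pages() ends up as 0 and
no pages get populated on any iteration.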

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 include/linux/hugetlb.h |  6 +++---
 include/linux/mm.h      | 14 +++++++-------
 mm/hugetlb.c            | 10 +++++-----
 mm/memory.c             | 14 +++++++-------
 mm/mlock.c              |  5 +++--
 5 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0c80d3f57a5b..fc6ed17cfd17 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -43,9 +43,9 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
 #endif
 
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
-int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
-			struct page **, struct vm_area_struct **,
-			unsigned long *, int *, int, unsigned int flags);
+long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
+			 struct page **, struct vm_area_struct **,
+			 unsigned long *, long *, long, unsigned int flags);
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a224430578f0..d5716094f191 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1040,13 +1040,13 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *
 extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
 		void *buf, int len, int write);
 
-int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		     unsigned long start, int len, unsigned int foll_flags,
-		     struct page **pages, struct vm_area_struct **vmas,
-		     int *nonblocking);
-int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-			unsigned long start, int nr_pages, int write, int force,
-			struct page **pages, struct vm_area_struct **vmas);
+long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		      unsigned long start, long len, unsigned int foll_flags,
+		      struct page **pages, struct vm_area_struct **vmas,
+		      int *nonblocking);
+long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		    unsigned long start, long nr_pages, int write, int force,
+		    struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct kvec;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4f3ea0b1e57c..4ad07221ce60 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2924,14 +2924,14 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 	return NULL;
 }
 
-int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			struct page **pages, struct vm_area_struct **vmas,
-			unsigned long *position, int *length, int i,
-			unsigned int flags)
+long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			 struct page **pages, struct vm_area_struct **vmas,
+			 unsigned long *position, long *length, long i,
+			 unsigned int flags)
 {
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
-	int remainder = *length;
+	long remainder = *length;
 	struct hstate *h = hstate_vma(vma);
 
 	spin_lock(&mm->page_table_lock);
diff --git a/mm/memory.c b/mm/memory.c
index f56683208e7f..381b78c20d84 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1673,12 +1673,12 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
  * instead of __get_user_pages. __get_user_pages should be used only if
  * you need some special @gup_flags.
  */
-int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		     unsigned long start, int nr_pages, unsigned int gup_flags,
-		     struct page **pages, struct vm_area_struct **vmas,
-		     int *nonblocking)
+long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, long nr_pages, unsigned int gup_flags,
+		struct page **pages, struct vm_area_struct **vmas,
+		int *nonblocking)
 {
-	int i;
+	long i;
 	unsigned long vm_flags;
 
 	if (nr_pages <= 0)
@@ -1977,8 +1977,8 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
  *
  * See also get_user_pages_fast, for performance critical applications.
  */
-int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, int nr_pages, int write, int force,
+long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, long nr_pages, int write, int force,
 		struct page **pages, struct vm_area_struct **vmas)
 {
 	int flags = FOLL_TOUCH;
diff --git a/mm/mlock.c b/mm/mlock.c
index b1647fbd6bce..e1fa9e4b0a66 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -160,7 +160,7 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr = start;
-	int nr_pages = (end - start) / PAGE_SIZE;
+	long nr_pages = (end - start) / PAGE_SIZE;
 	int gup_flags;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
@@ -378,7 +378,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 	unsigned long end, nstart, nend;
 	struct vm_area_struct *vma = NULL;
 	int locked = 0;
-	int ret = 0;
+	long ret = 0;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
 	VM_BUG_ON(len != PAGE_ALIGN(len));
@@ -421,6 +421,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 			ret = __mlock_posix_error_return(ret);
 			break;
 		}
+		VM_BUG_ON(!ret);
 		nend = nstart + ret * PAGE_SIZE;
 		ret = 0;
 	}
-- 
1.8.1


* [PATCH 2/3] mm: accelerate mm_populate() treatment of THP pages
  2013-01-31  0:26 [PATCH 0/3] fixes for large mm_populate() and munlock() operations Michel Lespinasse
  2013-01-31  0:26 ` [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages() Michel Lespinasse
@ 2013-01-31  0:26 ` Michel Lespinasse
  2013-01-31  3:05   ` Hugh Dickins
  2013-02-02  0:08   ` Andrew Morton
  2013-01-31  0:26 ` [RFC PATCH 3/3] mm: accelerate munlock() " Michel Lespinasse
  2 siblings, 2 replies; 12+ messages in thread
From: Michel Lespinasse @ 2013-01-31  0:26 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm
  Cc: linux-kernel

This change adds a page_mask argument to follow_page.

follow_page sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a THP page,
and to 0 in other cases.

__get_user_pages() makes use of this in order to accelerate populating
THP ranges - that is, when both the pages and vmas arrays are NULL,
we don't need to iterate HPAGE_PMD_NR times to cover a single THP page
(and we also avoid taking mm->page_table_lock that many times).

Other follow_page() call sites can safely ignore the value returned in
*page_mask.
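
As a rough illustration of the resulting skip (assuming 4 KB pages and
HPAGE_PMD_NR == 512): if start points at small-page index 5 within a THP,

  page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
              /* = 1 + (~5 & 511) = 507, i.e. the rest of the THP */

while for ordinary pages page_mask == 0 and page_increm stays 1.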

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 arch/ia64/xen/xencomm.c    |  3 ++-
 arch/powerpc/kernel/vdso.c |  9 +++++----
 arch/s390/mm/pgtable.c     |  3 ++-
 include/linux/mm.h         |  2 +-
 mm/ksm.c                   | 10 +++++++---
 mm/memory.c                | 25 +++++++++++++++++++------
 mm/migrate.c               |  7 +++++--
 mm/mlock.c                 |  4 +++-
 8 files changed, 44 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/xen/xencomm.c b/arch/ia64/xen/xencomm.c
index 73d903ca2d64..c5dcf3a574e9 100644
--- a/arch/ia64/xen/xencomm.c
+++ b/arch/ia64/xen/xencomm.c
@@ -44,6 +44,7 @@ xencomm_vtop(unsigned long vaddr)
 {
 	struct page *page;
 	struct vm_area_struct *vma;
+	long page_mask;
 
 	if (vaddr == 0)
 		return 0UL;
@@ -98,7 +99,7 @@ xencomm_vtop(unsigned long vaddr)
 		return ~0UL;
 
 	/* We assume the page is modified.  */
-	page = follow_page(vma, vaddr, FOLL_WRITE | FOLL_TOUCH);
+	page = follow_page(vma, vaddr, FOLL_WRITE | FOLL_TOUCH, &page_mask);
 	if (IS_ERR_OR_NULL(page))
 		return ~0UL;
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 1b2076f049ce..a529502d60f9 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -156,6 +156,7 @@ static void dump_one_vdso_page(struct page *pg, struct page *upg)
 static void dump_vdso_pages(struct vm_area_struct * vma)
 {
 	int i;
+	long page_mask;
 
 	if (!vma || is_32bit_task()) {
 		printk("vDSO32 @ %016lx:\n", (unsigned long)vdso32_kbase);
@@ -163,8 +164,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
 			struct page *pg = virt_to_page(vdso32_kbase +
 						       i*PAGE_SIZE);
 			struct page *upg = (vma && vma->vm_mm) ?
-				follow_page(vma, vma->vm_start + i*PAGE_SIZE, 0)
-				: NULL;
+				follow_page(vma, vma->vm_start + i*PAGE_SIZE,
+					    0, &page_mask) : NULL;
 			dump_one_vdso_page(pg, upg);
 		}
 	}
@@ -174,8 +175,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
 			struct page *pg = virt_to_page(vdso64_kbase +
 						       i*PAGE_SIZE);
 			struct page *upg = (vma && vma->vm_mm) ?
-				follow_page(vma, vma->vm_start + i*PAGE_SIZE, 0)
-				: NULL;
+				follow_page(vma, vma->vm_start + i*PAGE_SIZE,
+					    0, &page_mask) : NULL;
 			dump_one_vdso_page(pg, upg);
 		}
 	}
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index ae44d2a34313..63e897a6cf45 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -792,9 +792,10 @@ void thp_split_vma(struct vm_area_struct *vma)
 {
 	unsigned long addr;
 	struct page *page;
+	long page_mask;
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
-		page = follow_page(vma, addr, FOLL_SPLIT);
+		page = follow_page(vma, addr, FOLL_SPLIT, &page_mask);
 	}
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5716094f191..6dc0ce370df5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1636,7 +1636,7 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
-			unsigned int foll_flags);
+			 unsigned int foll_flags, long *page_mask);
 #define FOLL_WRITE	0x01	/* check pte is writable */
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
diff --git a/mm/ksm.c b/mm/ksm.c
index 51573858938d..76dfb7133aa4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -330,10 +330,11 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *page;
 	int ret = 0;
+	long page_mask;
 
 	do {
 		cond_resched();
-		page = follow_page(vma, addr, FOLL_GET);
+		page = follow_page(vma, addr, FOLL_GET, &page_mask);
 		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
@@ -427,13 +428,14 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 	unsigned long addr = rmap_item->address;
 	struct vm_area_struct *vma;
 	struct page *page;
+	long page_mask;
 
 	down_read(&mm->mmap_sem);
 	vma = find_mergeable_vma(mm, addr);
 	if (!vma)
 		goto out;
 
-	page = follow_page(vma, addr, FOLL_GET);
+	page = follow_page(vma, addr, FOLL_GET, &page_mask);
 	if (IS_ERR_OR_NULL(page))
 		goto out;
 	if (PageAnon(page) || page_trans_compound_anon(page)) {
@@ -1289,6 +1291,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 	struct mm_slot *slot;
 	struct vm_area_struct *vma;
 	struct rmap_item *rmap_item;
+	long page_mask;
 
 	if (list_empty(&ksm_mm_head.mm_list))
 		return NULL;
@@ -1342,7 +1345,8 @@ next_mm:
 		while (ksm_scan.address < vma->vm_end) {
 			if (ksm_test_exit(mm))
 				break;
-			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
+			*page = follow_page(vma, ksm_scan.address, FOLL_GET,
+					    &page_mask);
 			if (IS_ERR_OR_NULL(*page)) {
 				ksm_scan.address += PAGE_SIZE;
 				cond_resched();
diff --git a/mm/memory.c b/mm/memory.c
index 381b78c20d84..1becba27c28a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1470,7 +1470,7 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes);
  * by a page descriptor (see also vm_normal_page()).
  */
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
-			unsigned int flags)
+			 unsigned int flags, long *page_mask)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -1480,6 +1480,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 
+	*page_mask = 0;
+
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
 		BUG_ON(flags & FOLL_GET);
@@ -1526,6 +1528,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 				page = follow_trans_huge_pmd(vma, address,
 							     pmd, flags);
 				spin_unlock(&mm->page_table_lock);
+				*page_mask = HPAGE_PMD_NR - 1;
 				goto out;
 			}
 		} else
@@ -1680,6 +1683,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 {
 	long i;
 	unsigned long vm_flags;
+	long page_mask;
 
 	if (nr_pages <= 0)
 		return 0;
@@ -1757,6 +1761,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				get_page(page);
 			}
 			pte_unmap(pte);
+			page_mask = 0;
 			goto next_page;
 		}
 
@@ -1774,6 +1779,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
+			long page_increm;
 
 			/*
 			 * If we have a pending SIGKILL, don't keep faulting
@@ -1783,7 +1789,8 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				return i ? i : -ERESTARTSYS;
 
 			cond_resched();
-			while (!(page = follow_page(vma, start, foll_flags))) {
+			while (!(page = follow_page(vma, start, foll_flags,
+						    &page_mask))) {
 				int ret;
 				unsigned int fault_flags = 0;
 
@@ -1857,13 +1864,19 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 				flush_anon_page(vma, page, start);
 				flush_dcache_page(page);
+				page_mask = 0;
 			}
 next_page:
-			if (vmas)
+			if (vmas) {
 				vmas[i] = vma;
-			i++;
-			start += PAGE_SIZE;
-			nr_pages--;
+				page_mask = 0;
+			}
+			page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
+			if (page_increm > nr_pages)
+				page_increm = nr_pages;
+			i += page_increm;
+			start += page_increm * PAGE_SIZE;
+			nr_pages -= page_increm;
 		} while (nr_pages && start < vma->vm_end);
 	} while (nr_pages);
 	return i;
diff --git a/mm/migrate.c b/mm/migrate.c
index c38778610aa8..daa5726c9c46 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1124,6 +1124,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 	int err;
 	struct page_to_node *pp;
 	LIST_HEAD(pagelist);
+	long page_mask;
 
 	down_read(&mm->mmap_sem);
 
@@ -1139,7 +1140,8 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
 			goto set_status;
 
-		page = follow_page(vma, pp->addr, FOLL_GET|FOLL_SPLIT);
+		page = follow_page(vma, pp->addr, FOLL_GET | FOLL_SPLIT,
+				   &page_mask);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1291,6 +1293,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 				const void __user **pages, int *status)
 {
 	unsigned long i;
+	long page_mask;
 
 	down_read(&mm->mmap_sem);
 
@@ -1304,7 +1307,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		if (!vma || addr < vma->vm_start)
 			goto set_status;
 
-		page = follow_page(vma, addr, 0);
+		page = follow_page(vma, addr, 0, &page_mask);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
diff --git a/mm/mlock.c b/mm/mlock.c
index e1fa9e4b0a66..2694f17cca2d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -223,6 +223,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
 	unsigned long addr;
+	long page_mask;
 
 	lru_add_drain();
 	vma->vm_flags &= ~VM_LOCKED;
@@ -236,7 +237,8 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		 * suits munlock very well (and if somehow an abnormal page
 		 * has sneaked into the range, we won't oops here: great).
 		 */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP,
+				   &page_mask);
 		if (page && !IS_ERR(page)) {
 			lock_page(page);
 			munlock_vma_page(page);
-- 
1.8.1


* [RFC PATCH 3/3] mm: accelerate munlock() treatment of THP pages
  2013-01-31  0:26 [PATCH 0/3] fixes for large mm_populate() and munlock() operations Michel Lespinasse
  2013-01-31  0:26 ` [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages() Michel Lespinasse
  2013-01-31  0:26 ` [PATCH 2/3] mm: accelerate mm_populate() treatment of THP pages Michel Lespinasse
@ 2013-01-31  0:26 ` Michel Lespinasse
  2 siblings, 0 replies; 12+ messages in thread
From: Michel Lespinasse @ 2013-01-31  0:26 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm
  Cc: linux-kernel

munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
at a time. When munlocking THP pages (or the huge zero page), this resulted
in taking the mm->page_table_lock 512 times in a row.

We can do better by making use of the page_mask returned by follow_page
(for the huge zero page case), or the size of the page munlock_vma_page()
operated on (for the true THP page case).

Note - I am sending this as RFC only for now, as I can't currently put
my finger on what, if anything, prevents split_huge_page() from operating
concurrently on the same page as munlock_vma_page(), which would mess
up our NR_MLOCK statistics. Is this a latent bug, or is there a subtle
point I missed here?

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 mm/internal.h |  2 +-
 mm/mlock.c    | 32 ++++++++++++++++++++------------
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 1c0c4cc0fcf7..ee85efad03c5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -195,7 +195,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma,
  * must be called with vma's mmap_sem held for read or write, and page locked.
  */
 extern void mlock_vma_page(struct page *page);
-extern void munlock_vma_page(struct page *page);
+extern long munlock_vma_page(struct page *page);
 
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
diff --git a/mm/mlock.c b/mm/mlock.c
index 2694f17cca2d..5cd7e5a258da 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -102,13 +102,14 @@ void mlock_vma_page(struct page *page)
  * can't isolate the page, we leave it for putback_lru_page() and vmscan
  * [page_referenced()/try_to_unmap()] to deal with.
  */
-void munlock_vma_page(struct page *page)
+long munlock_vma_page(struct page *page)
 {
+	long nr_pages = hpage_nr_pages(page);
+
 	BUG_ON(!PageLocked(page));
 
 	if (TestClearPageMlocked(page)) {
-		mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    -hpage_nr_pages(page));
+		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 		if (!isolate_lru_page(page)) {
 			int ret = SWAP_AGAIN;
 
@@ -141,6 +142,8 @@ void munlock_vma_page(struct page *page)
 				count_vm_event(UNEVICTABLE_PGMUNLOCKED);
 		}
 	}
+
+	return nr_pages;
 }
 
 /**
@@ -159,7 +162,6 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, int *nonblocking)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr = start;
 	long nr_pages = (end - start) / PAGE_SIZE;
 	int gup_flags;
 
@@ -185,7 +187,7 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
 		gup_flags |= FOLL_FORCE;
 
-	return __get_user_pages(current, mm, addr, nr_pages, gup_flags,
+	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
 				NULL, NULL, nonblocking);
 }
 
@@ -222,14 +224,12 @@ static int __mlock_posix_error_return(long retval)
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
-	unsigned long addr;
-	long page_mask;
-
-	lru_add_drain();
 	vma->vm_flags &= ~VM_LOCKED;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE) {
+	while (start < end) {
 		struct page *page;
+		long page_mask, page_increm;
+
 		/*
 		 * Although FOLL_DUMP is intended for get_dump_page(),
 		 * it just so happens that its special treatment of the
@@ -237,14 +237,22 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		 * suits munlock very well (and if somehow an abnormal page
 		 * has sneaked into the range, we won't oops here: great).
 		 */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP,
+		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP,
 				   &page_mask);
 		if (page && !IS_ERR(page)) {
 			lock_page(page);
-			munlock_vma_page(page);
+			lru_add_drain();
+			/*
+			 * Any THP page found by follow_page() may have gotten
+			 * split before reaching munlock_vma_page(), so we
+			 * need to recompute the page_mask here.
+			 */
+			page_mask = munlock_vma_page(page);
 			unlock_page(page);
 			put_page(page);
 		}
+		page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
+		start += page_increm * PAGE_SIZE;
 		cond_resched();
 	}
 }
-- 
1.8.1


* Re: [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages()
  2013-01-31  0:26 ` [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages() Michel Lespinasse
@ 2013-01-31  0:42   ` Andrew Morton
  2013-02-07  0:39   ` Sasha Levin
  1 sibling, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2013-01-31  0:42 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, 30 Jan 2013 16:26:18 -0800
Michel Lespinasse <walken@google.com> wrote:

> Use long type for page counts in mm_populate() so as to avoid integer
> overflow

Would prefer to use unsigned long if we're churning this code.  A "page
count" can never be negative and we avoid the various possible
overflow/signedness issues.

However get_user_pages() and follow_hugetlb_page() return "page count
or -ve errno", so we're somewhat screwed there.  And that's the bulk of
the patch :(

btw, what twit merged a follow_hugetlb_page() which takes an
undocumented argument called "i".  Sigh.



* Re: [PATCH 2/3] mm: accelerate mm_populate() treatment of THP pages
  2013-01-31  0:26 ` [PATCH 2/3] mm: accelerate mm_populate() treatment of THP pages Michel Lespinasse
@ 2013-01-31  3:05   ` Hugh Dickins
  2013-01-31  4:27     ` Michel Lespinasse
  2013-02-02  0:08   ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2013-01-31  3:05 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Rik van Riel, Mel Gorman, Andrew Morton,
	linux-mm, linux-kernel

On Wed, 30 Jan 2013, Michel Lespinasse wrote:

> This change adds a page_mask argument to follow_page.
> 
> follow_page sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a THP page,
> and to 0 in other cases.
> 
> __get_user_pages() makes use of this in order to accelerate populating
> THP ranges - that is, when both the pages and vmas arrays are NULL,
> we don't need to iterate HPAGE_PMD_NR times to cover a single THP page
> (and we also avoid taking mm->page_table_lock that many times).
> 
> Other follow_page() call sites can safely ignore the value returned in
> *page_mask.
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

I certainly like the skipping.

But (a) if the additional arg has to exist, then I'd much prefer it
to be page_size than page_mask - I realize there's a tiny advantage to
subtracting 1 from an immediate rather than from a variable, but I don't
think it justifies the peculiar interface.  mask makes people think of
masking.

And (b) why can't we just omit the additional arg, and get it from the
page found?  You've explained the unreliability of the !FOLL_GET case
to me privately, but that needs to be spelt out in the commit comment
(and I'd love it if we found a counter-argument; the extra arg of interest
to almost no-one does irritate me).

Hugh



* Re: [PATCH 2/3] mm: accelerate mm_populate() treatment of THP pages
  2013-01-31  3:05   ` Hugh Dickins
@ 2013-01-31  4:27     ` Michel Lespinasse
  0 siblings, 0 replies; 12+ messages in thread
From: Michel Lespinasse @ 2013-01-31  4:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Rik van Riel, Mel Gorman, Andrew Morton,
	linux-mm, linux-kernel

On Wed, Jan 30, 2013 at 7:05 PM, Hugh Dickins <hughd@google.com> wrote:
> On Wed, 30 Jan 2013, Michel Lespinasse wrote:
>
>> This change adds a page_mask argument to follow_page.
>>
>> follow_page sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a THP page,
>> and to 0 in other cases.
>>
>> __get_user_pages() makes use of this in order to accelerate populating
>> THP ranges - that is, when both the pages and vmas arrays are NULL,
>> we don't need to iterate HPAGE_PMD_NR times to cover a single THP page
>> (and we also avoid taking mm->page_table_lock that many times).
>>
>> Other follow_page() call sites can safely ignore the value returned in
>> *page_mask.
>>
>> Signed-off-by: Michel Lespinasse <walken@google.com>
>
> I certainly like the skipping.
>
> And (b) why can't we just omit the additional arg, and get it from the
> page found?  You've explained the unreliability of the !FOLL_GET case
> to me privately, but that needs to be spelt out in the commit comment
> (and I'd love if we found a counter-argument, the extra arg of interest
> to almost no-one does irritate me).

Right. My understanding is that after calling follow_page() without
the FOLL_GET flag, you really can't do much with the returned page
pointer other than checking it for IS_ERR(). We don't get a reference
to the page, so it could get migrated away as soon as follow_page()
releases the page table lock. In the most extreme case, the memory
corresponding to that page could get offlined, and dereferencing the page
pointer could then fail.
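
In other words, about the only safe pattern without FOLL_GET is something
like (a minimal sketch):

	page = follow_page(vma, addr, 0, &page_mask);
	if (IS_ERR_OR_NULL(page))
		return;
	/* No reference was taken: don't dereference page past this point. */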

I actually think the follow_page API is very error prone in this way,
as the returned page pointer is very tempting to use, but can't be
safely used. I almost wish we could return something like
ERR_PTR(-ESTALE) or whatever, just to remove any temptation to
dereference that page pointer.

Now I agree the extra argument isn't pretty, but I don't have any
better ideas for communicating the size of the page that got touched.

> But (a) if the additional arg has to exist, then I'd much prefer it
> to be page_size than page_mask - I realize there's a tiny advantage to
> subtracting 1 from an immediate than from a variable, but I don't think
> it justifies the peculiar interface.  mask makes people think of masking.

Yes, I started with a page_size in bytes and then I moved to the
page_mask. I agree the performance advantage is tiny, and I don't mind
switching back to bytes if people are happier with it.

I think one benefit of the page_mask implementation might be that it's
easier for people to see that page_increm will end up in the
[1..HPAGE_PMD_NR] range. Would a page size in 4k page units work out?
(I'm just not sure what to call such a quantity, though.)
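
For comparison, with a size in small-page units (call it page_nr - a
hypothetical name, it's not in the patch) the increment would become
something like:

	page_increm = page_nr - ((start >> PAGE_SHIFT) & (page_nr - 1));

i.e. the "subtracting 1 from a variable" case, still landing in the
[1..HPAGE_PMD_NR] range.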

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [PATCH 2/3] mm: accelerate mm_populate() treatment of THP pages
  2013-01-31  0:26 ` [PATCH 2/3] mm: accelerate mm_populate() treatment of THP pages Michel Lespinasse
  2013-01-31  3:05   ` Hugh Dickins
@ 2013-02-02  0:08   ` Andrew Morton
  1 sibling, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2013-02-02  0:08 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, 30 Jan 2013 16:26:19 -0800
Michel Lespinasse <walken@google.com> wrote:

> This change adds a page_mask argument to follow_page.
> 
> follow_page sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a THP page,
> and to 0 in other cases.
> 
> __get_user_pages() makes use of this in order to accelerate populating
> THP ranges - that is, when both the pages and vmas arrays are NULL,
> we don't need to iterate HPAGE_PMD_NR times to cover a single THP page
> (and we also avoid taking mm->page_table_lock that many times).
> 
> Other follow_page() call sites can safely ignore the value returned in
> *page_mask.

Seems sensible, but the implementation is rather ungainly.

If we're going to add this unused page_mask local to many functions
then let's at least minimize the scope of that variable and explain why
it's there.  For example,

--- a/mm/ksm.c~mm-accelerate-mm_populate-treatment-of-thp-pages-fix
+++ a/mm/ksm.c
@@ -356,9 +356,10 @@ static int break_ksm(struct vm_area_stru
 {
 	struct page *page;
 	int ret = 0;
-	long page_mask;
 
 	do {
+		long page_mask;		/* unused */
+
 		cond_resched();
 		page = follow_page(vma, addr, FOLL_GET, &page_mask);
 		if (IS_ERR_OR_NULL(page))
@@ -454,7 +455,7 @@ static struct page *get_mergeable_page(s
 	unsigned long addr = rmap_item->address;
 	struct vm_area_struct *vma;
 	struct page *page;
-	long page_mask;
+	long page_mask;		/* unused */
 
 	down_read(&mm->mmap_sem);
 	vma = find_mergeable_vma(mm, addr);
@@ -1525,7 +1526,6 @@ static struct rmap_item *scan_get_next_r
 	struct mm_slot *slot;
 	struct vm_area_struct *vma;
 	struct rmap_item *rmap_item;
-	long page_mask;
 	int nid;
 
 	if (list_empty(&ksm_mm_head.mm_list))
@@ -1600,6 +1600,8 @@ next_mm:
 			ksm_scan.address = vma->vm_end;
 
 		while (ksm_scan.address < vma->vm_end) {
+			long page_mask;		/* unused */
+
 			if (ksm_test_exit(mm))
 				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET,



Alternatively we could permit the passing of page_mask==NULL, which is
a common convention in the get_user_pages() area.  Possibly that's a
tad more expensive, but it will save a stack slot.


However I think the best approach here is to just leave the
follow_page() interface alone.  Rename the present implementation to
follow_page_mask() or whatever and add

static inline struct page *follow_page(struct vm_area_struct *vma,
		unsigned long address, unsigned int foll_flags)
{
	long page_mask;		/* unused */

	return follow_page_mask(vma, address, foll_flags, &page_mask);
}

Also, I see no sense in permitting negative page_mask values and I
doubt if we'll be needing 64-bit values here.  Make it `unsigned int'?


Finally, this version of the patch surely breaks the nommu build?
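
For reference, mm/nommu.c carries its own follow_page() definition, which
would presumably need the same signature change under this version of the
patch - roughly, and assuming the nommu stub keeps its current behaviour:

	struct page *follow_page(struct vm_area_struct *vma,
			unsigned long address, unsigned int foll_flags,
			long *page_mask)
	{
		*page_mask = 0;
		return NULL;
	}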


* Re: [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages()
  2013-01-31  0:26 ` [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages() Michel Lespinasse
  2013-01-31  0:42   ` Andrew Morton
@ 2013-02-07  0:39   ` Sasha Levin
  2013-02-07  1:10     ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Sasha Levin @ 2013-02-07  0:39 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm, linux-kernel

Hi Michel,

We're now hitting the VM_BUG_ON() which was added in the last hunk of the
patch:

[  143.217822] ------------[ cut here ]------------
[  143.219938] kernel BUG at mm/mlock.c:424!
[  143.220031] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC
[  143.220031] Modules linked in:
[  143.226691] CPU 3
[  143.226691] Pid: 15435, comm: trinity Tainted: G      D W    3.8.0-rc6-next-20130206-sasha-00029-g95a59cb2 #275
[  143.233222] RIP: 0010:[<ffffffff81242e42>]  [<ffffffff81242e42>] __mm_populate+0x112/0x180
[  143.233222] RSP: 0018:ffff880010061ef8  EFLAGS: 00010246
[  143.233222] RAX: 0000000000000000 RBX: ffff880016329228 RCX: 0000000000000000
[  143.233222] RDX: ffffea0002fe2f80 RSI: 0000000000000000 RDI: ffffea0002fe2f80
[  143.233222] RBP: ffff880010061f48 R08: 0000000000000001 R09: 0000000000000000
[  143.233222] R10: 0000000000000000 R11: ffffea0002fe2f80 R12: 00007ffffffff000
[  143.233222] R13: 00007ff388eb3000 R14: 00007ff388e92000 R15: ffff880009a00000
[  143.233222] FS:  00007ff389085700(0000) GS:ffff88000fc00000(0000) knlGS:0000000000000000
[  143.233222] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  143.233222] CR2: 00007fa6b6aff180 CR3: 000000001002f000 CR4: 00000000000406e0
[  143.303103] can: request_module (can-proto-6) failed.
[  143.233222] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  143.233222] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  143.321178] Process trinity (pid: 15435, threadinfo ffff880010060000, task ffff8800100a8000)
[  143.321178] Stack:
[  143.321178]  0000000100000001 ffff880009a00098 0000000000000001 00000000100a8000
[  143.321178]  00007ff3890856a8 0000000000000000 0000000000000001 ffff8800100a8000
[  143.321178]  00007ff3890856a8 0000000000000097 ffff880010061f78 ffffffff812431d0
[  143.321178] Call Trace:
[  143.321178]  [<ffffffff812431d0>] sys_mlockall+0x160/0x1a0
[  143.321178]  [<ffffffff83d73d58>] tracesys+0xe1/0xe6
[  143.321178] Code: e8 54 fa ff ff 48 83 f8 00 7d 1e 8b 55 b4 85 d2 75 2f 48 83 f8 f2 bb f4 ff ff ff 74 3e b3 f5 48 83 f8 f4 0f
45 d8 eb 33 90 75 06 <0f> 0b 0f 1f 40 00 48 c1 e0 0c 49 01 c6 eb 0a 0f 1f 80 00 00 00
[  143.392984] RIP  [<ffffffff81242e42>] __mm_populate+0x112/0x180
[  143.392984]  RSP <ffff880010061ef8>
[  143.411359] ---[ end trace a7919e7f17c0a72d ]---


Thanks,
Sasha




* Re: [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages()
  2013-02-07  0:39   ` Sasha Levin
@ 2013-02-07  1:10     ` Andrew Morton
  2013-02-07  1:16       ` Sasha Levin
  2013-02-07 21:35       ` Michel Lespinasse
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2013-02-07  1:10 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Michel Lespinasse, Andrea Arcangeli, Rik van Riel, Mel Gorman,
	Hugh Dickins, linux-mm, linux-kernel

On Wed, 06 Feb 2013 19:39:11 -0500
Sasha Levin <sasha.levin@oracle.com> wrote:

> We're now hitting the VM_BUG_ON() which was added in the last hunk of the
> patch:

hm, why was that added.

Michel, I seem to have confused myself over this series.  I saw a
report this morning which led me to drop
mm-accelerate-munlock-treatment-of-thp-pages.patch but now I can't find
that report and I'm wondering if I should have dropped
mm-accelerate-mm_populate-treatment-of-thp-pages.patch instead.

Given that and Sasha's new report I think I'll drop

mm-use-long-type-for-page-counts-in-mm_populate-and-get_user_pages.patch
mm-use-long-type-for-page-counts-in-mm_populate-and-get_user_pages-fix.patch
mm-use-long-type-for-page-counts-in-mm_populate-and-get_user_pages-fix-fix.patch
mm-accelerate-mm_populate-treatment-of-thp-pages.patch
mm-accelerate-munlock-treatment-of-thp-pages.patch

and let's start again?


* Re: [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages()
  2013-02-07  1:10     ` Andrew Morton
@ 2013-02-07  1:16       ` Sasha Levin
  2013-02-07 21:35       ` Michel Lespinasse
  1 sibling, 0 replies; 12+ messages in thread
From: Sasha Levin @ 2013-02-07  1:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michel Lespinasse, Andrea Arcangeli, Rik van Riel, Mel Gorman,
	Hugh Dickins, linux-mm, linux-kernel

On 02/06/2013 08:10 PM, Andrew Morton wrote:
> On Wed, 06 Feb 2013 19:39:11 -0500
> Sasha Levin <sasha.levin@oracle.com> wrote:
> 
>> We're now hitting the VM_BUG_ON() which was added in the last hunk of the
>> patch:
> 
> hm, why was that added.
> 
> Michel, I seem to have confused myself over this series.  I saw a
> report this morning which led me to drop
> mm-accelerate-munlock-treatment-of-thp-pages.patch but now I can't find
> that report and I'm wondering if I should have dropped
> mm-accelerate-mm_populate-treatment-of-thp-pages.patch instead.
> 
> Given that and Sasha's new report I think I'll drop
> 
> mm-use-long-type-for-page-counts-in-mm_populate-and-get_user_pages.patch
> mm-use-long-type-for-page-counts-in-mm_populate-and-get_user_pages-fix.patch
> mm-use-long-type-for-page-counts-in-mm_populate-and-get_user_pages-fix-fix.patch
> mm-accelerate-mm_populate-treatment-of-thp-pages.patch
> mm-accelerate-munlock-treatment-of-thp-pages.patch
> 
> and let's start again?

I've sent a report regarding mm-accelerate-munlock-treatment-of-thp-pages.patch
a couple of hours ago (https://lkml.org/lkml/2013/2/6/699). Reverting that patch
fixed the issue reported, so I think you've dropped the right one.


Thanks,
Sasha


* Re: [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages()
  2013-02-07  1:10     ` Andrew Morton
  2013-02-07  1:16       ` Sasha Levin
@ 2013-02-07 21:35       ` Michel Lespinasse
  1 sibling, 0 replies; 12+ messages in thread
From: Michel Lespinasse @ 2013-02-07 21:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sasha Levin, Andrea Arcangeli, Rik van Riel, Mel Gorman,
	Hugh Dickins, linux-mm, linux-kernel

On Wed, Feb 6, 2013 at 5:10 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Wed, 06 Feb 2013 19:39:11 -0500
> Sasha Levin <sasha.levin@oracle.com> wrote:
>
>> We're now hitting the VM_BUG_ON() which was added in the last hunk of the
>> patch:
>
> hm, why was that added.
>
> Michel, I seem to have confused myself over this series.  I saw a
> report this morning which led me to drop
> mm-accelerate-munlock-treatment-of-thp-pages.patch but now I can't find
> that report and I'm wondering if I should have dropped
> mm-accelerate-mm_populate-treatment-of-thp-pages.patch instead.
>
> Given that and Sasha's new report I think I'll drop
>
> mm-use-long-type-for-page-counts-in-mm_populate-and-get_user_pages.patch
> mm-use-long-type-for-page-counts-in-mm_populate-and-get_user_pages-fix.patch
> mm-use-long-type-for-page-counts-in-mm_populate-and-get_user_pages-fix-fix.patch
> mm-accelerate-mm_populate-treatment-of-thp-pages.patch
> mm-accelerate-munlock-treatment-of-thp-pages.patch
>
> and let's start again?

All right. My bad, there were issues in the patch series. I think
there were two:

- The VM_BUG_ON(!ret) in "mm: use long type for page counts in
mm_populate() and get_user_pages()". The intention there was to catch
the original overflow case, where gup would return 0 because the passed
nr_pages argument (when passed as an int) would be <= 0. As it turns
out, this didn't account for the other case where gup returns 0, which
is when the first page is file-backed, not found in the page cache, and
the mmap_sem gets dropped due to a non-NULL "nonblocking" argument (as
is the case with mm_populate()).

- The issue in munlock, where the page mask wasn't correctly computed
due to an off-by-one error (see the note below).
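
In the RFC patch above, munlock_vma_page() returns hpage_nr_pages(page)
(512 for a THP page), while munlock_vma_pages_range() then uses that
return value as a mask in the page_increm computation. One way to
reconcile the two - a sketch, not necessarily the final fix - is to
return a mask instead:

	long munlock_vma_page(struct page *page)
	{
		long nr_pages = hpage_nr_pages(page);
		/* ... existing body unchanged ... */
		return nr_pages - 1;	/* a mask, like follow_page()'s */
	}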

I agree a clean resend seems the most appropriate course of action at
this point. Sorry for the trouble :/

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Thread overview: 12+ messages
2013-01-31  0:26 [PATCH 0/3] fixes for large mm_populate() and munlock() operations Michel Lespinasse
2013-01-31  0:26 ` [PATCH 1/3] mm: use long type for page counts in mm_populate() and get_user_pages() Michel Lespinasse
2013-01-31  0:42   ` Andrew Morton
2013-02-07  0:39   ` Sasha Levin
2013-02-07  1:10     ` Andrew Morton
2013-02-07  1:16       ` Sasha Levin
2013-02-07 21:35       ` Michel Lespinasse
2013-01-31  0:26 ` [PATCH 2/3] mm: accelerate mm_populate() treatment of THP pages Michel Lespinasse
2013-01-31  3:05   ` Hugh Dickins
2013-01-31  4:27     ` Michel Lespinasse
2013-02-02  0:08   ` Andrew Morton
2013-01-31  0:26 ` [RFC PATCH 3/3] mm: accelerate munlock() " Michel Lespinasse
