* [PATCH v2 0/3] fixes for large mm_populate() and munlock() operations
From: Michel Lespinasse @ 2013-02-04  7:17 UTC
  To: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm
  Cc: linux-kernel

These three changes improve the handling of large mm_populate() and
munlock() operations. They apply on top of mmotm (in particular, they
depend on both my prior mm_populate() work and Kirill's "thp: avoid
dumping huge zero page" change).

- Patch 1 fixes an integer overflow when populating 2^32 pages: the
  nr_pages argument to get_user_pages() would overflow, so each iteration
  ended up processing 0 pages. I am proposing to simply convert the
  nr_pages argument to a long (a short illustration follows this list).

- Patch 2 accelerates populating regions with THP pages. get_user_pages()
  can increment the address by a huge page size in this case instead of
  a small page size, and avoid repeated mm->page_table_lock acquisitions.
  This fixes an issue reported by Roman Dubtsov where populating regions
  via mmap MAP_POPULATE was significantly slower than doing so by
  touching pages from userspace.

- Patch 3 is a similar acceleration for the munlock case. I would actually
  like to get Andrea's attention on this one, as I can't explain how
  munlock_vma_page() is safe against racing with split_huge_page().
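
For patch 1, here is a minimal sketch of the failure mode (illustration
only: the variable names are made up, and the pre-series 32-bit int count
is assumed from the overflow at 2^32 described above):

        unsigned long want = 1UL << 32; /* 2^32 pages = 16TB with 4K pages */
        int nr_pages = want;            /* narrows to 0 - no forward progress */
        long nr_pages_fixed = want;     /* after patch 1: full count is kept */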

Note that patches 1-2 are logically independent of patch 3, so if the
discussion of patch 3 takes too long I would ask Andrew to consider
merging patches 1-2 first.

Changes since v1:

- Andrew accepted patch 1 into his -mm tree but suggested the nr_pages
  argument type should actually be unsigned long; I am sending this as
  a "fix" change to be collapsed into the previous patch 1.

- In patch 2, I am adding a separate follow_page_mask() function so that
  callers of the original follow_page() don't have to be modified to
  ignore the returned page_mask (following another suggestion from Andrew).
  Also the page_mask argument type was changed to unsigned int.

- In patch 3, I similarly changed the page_mask values to unsigned int.

Michel Lespinasse (3):
  fix mm: use long type for page counts in mm_populate() and get_user_pages()
  mm: accelerate mm_populate() treatment of THP pages
  mm: accelerate munlock() treatment of THP pages

 include/linux/hugetlb.h |  2 +-
 include/linux/mm.h      | 24 +++++++++++++++++-------
 mm/hugetlb.c            |  8 ++++----
 mm/internal.h           |  2 +-
 mm/memory.c             | 43 +++++++++++++++++++++++++++++--------------
 mm/mlock.c              | 34 ++++++++++++++++++++++------------
 mm/nommu.c              |  6 ++++--
 7 files changed, 78 insertions(+), 41 deletions(-)

-- 
1.8.1

* [PATCH v2 1/3] fix mm: use long type for page counts in mm_populate() and get_user_pages()
From: Michel Lespinasse @ 2013-02-04  7:17 UTC
  To: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm
  Cc: linux-kernel

Andrew suggested I make the nr_pages argument an unsigned long rather
than just a long. I was initially worried that nr_pages would be compared
with signed longs, but this isn't the case, so his suggestion is perfectly
valid.

Sending as a 'fix' change, to be collapsed with the original in -mm.

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 include/linux/hugetlb.h |  2 +-
 include/linux/mm.h      | 11 ++++++-----
 mm/hugetlb.c            |  8 ++++----
 mm/memory.c             | 12 ++++++------
 mm/mlock.c              |  2 +-
 5 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fc6ed17cfd17..eedc334fb6f5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -45,7 +45,7 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			 struct page **, struct vm_area_struct **,
-			 unsigned long *, long *, long, unsigned int flags);
+			 unsigned long *, unsigned long *, long, unsigned int);
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5716094f191..3d9fbcf9fa94 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1041,12 +1041,13 @@ extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
 		void *buf, int len, int write);
 
 long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		      unsigned long start, long len, unsigned int foll_flags,
-		      struct page **pages, struct vm_area_struct **vmas,
-		      int *nonblocking);
+		      unsigned long start, unsigned long nr_pages,
+		      unsigned int foll_flags, struct page **pages,
+		      struct vm_area_struct **vmas, int *nonblocking);
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		    unsigned long start, long nr_pages, int write, int force,
-		    struct page **pages, struct vm_area_struct **vmas);
+		    unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages,
+		    struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct kvec;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4ad07221ce60..951873c8f57e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2926,12 +2926,12 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 struct page **pages, struct vm_area_struct **vmas,
-			 unsigned long *position, long *length, long i,
-			 unsigned int flags)
+			 unsigned long *position, unsigned long *nr_pages,
+			 long i, unsigned int flags)
 {
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
-	long remainder = *length;
+	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
 
 	spin_lock(&mm->page_table_lock);
@@ -3001,7 +3001,7 @@ same_page:
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	*length = remainder;
+	*nr_pages = remainder;
 	*position = vaddr;
 
 	return i ? i : -EFAULT;
diff --git a/mm/memory.c b/mm/memory.c
index 381b78c20d84..f0b6b2b798c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1674,14 +1674,14 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
  * you need some special @gup_flags.
  */
 long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, long nr_pages, unsigned int gup_flags,
-		struct page **pages, struct vm_area_struct **vmas,
-		int *nonblocking)
+		unsigned long start, unsigned long nr_pages,
+		unsigned int gup_flags, struct page **pages,
+		struct vm_area_struct **vmas, int *nonblocking)
 {
 	long i;
 	unsigned long vm_flags;
 
-	if (nr_pages <= 0)
+	if (!nr_pages)
 		return 0;
 
 	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
@@ -1978,8 +1978,8 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
  * See also get_user_pages_fast, for performance critical applications.
  */
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, long nr_pages, int write, int force,
-		struct page **pages, struct vm_area_struct **vmas)
+		unsigned long start, unsigned long nr_pages, int write,
+		int force, struct page **pages, struct vm_area_struct **vmas)
 {
 	int flags = FOLL_TOUCH;
 
diff --git a/mm/mlock.c b/mm/mlock.c
index e1fa9e4b0a66..6baaf4b0e591 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -160,7 +160,7 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr = start;
-	long nr_pages = (end - start) / PAGE_SIZE;
+	unsigned long nr_pages = (end - start) / PAGE_SIZE;
 	int gup_flags;
 
 	VM_BUG_ON(start & ~PAGE_MASK);
-- 
1.8.1

* [PATCH v2 2/3] mm: accelerate mm_populate() treatment of THP pages
From: Michel Lespinasse @ 2013-02-04  7:17 UTC
  To: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm
  Cc: linux-kernel

This change adds a follow_page_mask function which is equivalent to
follow_page, but with an extra page_mask argument.

follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a
THP page, and to 0 in other cases.

__get_user_pages() makes use of this in order to accelerate populating
THP ranges - that is, when both the pages and vmas arrays are NULL,
we don't need to iterate HPAGE_PMD_NR times to cover a single THP page
(and we also avoid taking mm->page_table_lock that many times).
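
Condensed, the fast path amounts to the following (a sketch assembled from
the hunks below, not new code; error handling omitted):

        page = follow_page_mask(vma, start, foll_flags, &page_mask);
        /* page_mask is HPAGE_PMD_NR - 1 for a THP mapping, 0 otherwise */
        if (pages || vmas)
                page_mask = 0;  /* callers want every small page recorded */
        page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
        if (page_increm > nr_pages)
                page_increm = nr_pages;
        i += page_increm;
        start += page_increm * PAGE_SIZE;
        nr_pages -= page_increm;

~(start >> PAGE_SHIFT) & page_mask counts the small pages between the
current address and the next huge page boundary, and the leading "1 +"
accounts for the current page itself, so a single loop iteration covers
the remainder of the THP.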

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 include/linux/mm.h | 13 +++++++++++--
 mm/memory.c        | 31 +++++++++++++++++++++++--------
 mm/nommu.c         |  6 ++++--
 3 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3d9fbcf9fa94..31e4d42002ee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1636,8 +1636,17 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 
-struct page *follow_page(struct vm_area_struct *, unsigned long address,
-			unsigned int foll_flags);
+struct page *follow_page_mask(struct vm_area_struct *vma,
+			      unsigned long address, unsigned int foll_flags,
+			      unsigned int *page_mask);
+
+static inline struct page *follow_page(struct vm_area_struct *vma,
+		unsigned long address, unsigned int foll_flags)
+{
+	unsigned int unused_page_mask;
+	return follow_page_mask(vma, address, foll_flags, &unused_page_mask);
+}
+
 #define FOLL_WRITE	0x01	/* check pte is writable */
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
diff --git a/mm/memory.c b/mm/memory.c
index f0b6b2b798c4..52c8599e7fe4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1458,10 +1458,11 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
 /**
- * follow_page - look up a page descriptor from a user-virtual address
+ * follow_page_mask - look up a page descriptor from a user-virtual address
  * @vma: vm_area_struct mapping @address
  * @address: virtual address to look up
  * @flags: flags modifying lookup behaviour
+ * @page_mask: on output, *page_mask is set according to the size of the page
  *
  * @flags can have FOLL_ flags set, defined in <linux/mm.h>
  *
@@ -1469,8 +1470,9 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes);
  * an error pointer if there is a mapping to something not represented
  * by a page descriptor (see also vm_normal_page()).
  */
-struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
-			unsigned int flags)
+struct page *follow_page_mask(struct vm_area_struct *vma,
+			      unsigned long address, unsigned int flags,
+			      unsigned int *page_mask)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -1480,6 +1482,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 
+	*page_mask = 0;
+
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
 		BUG_ON(flags & FOLL_GET);
@@ -1526,6 +1530,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 				page = follow_trans_huge_pmd(vma, address,
 							     pmd, flags);
 				spin_unlock(&mm->page_table_lock);
+				*page_mask = HPAGE_PMD_NR - 1;
 				goto out;
 			}
 		} else
@@ -1680,6 +1685,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 {
 	long i;
 	unsigned long vm_flags;
+	unsigned int page_mask;
 
 	if (!nr_pages)
 		return 0;
@@ -1757,6 +1763,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				get_page(page);
 			}
 			pte_unmap(pte);
+			page_mask = 0;
 			goto next_page;
 		}
 
@@ -1774,6 +1781,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
+			unsigned int page_increm;
 
 			/*
 			 * If we have a pending SIGKILL, don't keep faulting
@@ -1783,7 +1791,8 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				return i ? i : -ERESTARTSYS;
 
 			cond_resched();
-			while (!(page = follow_page(vma, start, foll_flags))) {
+			while (!(page = follow_page_mask(vma, start,
+						foll_flags, &page_mask))) {
 				int ret;
 				unsigned int fault_flags = 0;
 
@@ -1857,13 +1866,19 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 				flush_anon_page(vma, page, start);
 				flush_dcache_page(page);
+				page_mask = 0;
 			}
 next_page:
-			if (vmas)
+			if (vmas) {
 				vmas[i] = vma;
-			i++;
-			start += PAGE_SIZE;
-			nr_pages--;
+				page_mask = 0;
+			}
+			page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
+			if (page_increm > nr_pages)
+				page_increm = nr_pages;
+			i += page_increm;
+			start += page_increm * PAGE_SIZE;
+			nr_pages -= page_increm;
 		} while (nr_pages && start < vma->vm_end);
 	} while (nr_pages);
 	return i;
diff --git a/mm/nommu.c b/mm/nommu.c
index 429a3d5217fa..9a6a25181267 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1817,9 +1817,11 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	return ret;
 }
 
-struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
-			unsigned int foll_flags)
+struct page *follow_page_mask(struct vm_area_struct *vma,
+			      unsigned long address, unsigned int flags,
+			      unsigned int *page_mask)
 {
+	*page_mask = 0;
 	return NULL;
 }
 
-- 
1.8.1

* [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages
From: Michel Lespinasse @ 2013-02-04  7:17 UTC
  To: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm
  Cc: linux-kernel

munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
at a time. When munlocking THP pages (or the huge zero page), this resulted
in taking the mm->page_table_lock 512 times in a row.

We can do better by making use of the page_mask returned by follow_page_mask
(for the huge zero page case), or the size of the page munlock_vma_page()
operated on (for the true THP page case).
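
As a worked example of the increment used below (assuming 4K base pages
and HPAGE_PMD_NR = 512, i.e. 2M THP pages), suppose munlock starts 200
small pages into a huge page:

        /* (start >> PAGE_SHIFT) & 511 == 200, page_mask == 511 */
        /* ~(start >> PAGE_SHIFT) & 511 == 311                  */
        page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask); /* == 312 */
        start += page_increm * PAGE_SIZE; /* 200 + 312 == 512: next 2M boundary */

so the 512 page_table_lock acquisitions collapse into one per huge page.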

Note - I am sending this as RFC only for now, as I can't currently put
my finger on what, if anything, prevents split_huge_page() from operating
concurrently on the same page as munlock_vma_page(), which would mess
up our NR_MLOCK statistics. Is this a latent bug, or is there a subtle
point I missed here?

Signed-off-by: Michel Lespinasse <walken@google.com>

---
 mm/internal.h |  2 +-
 mm/mlock.c    | 32 +++++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 1c0c4cc0fcf7..8562de0a5197 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -195,7 +195,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma,
  * must be called with vma's mmap_sem held for read or write, and page locked.
  */
 extern void mlock_vma_page(struct page *page);
-extern void munlock_vma_page(struct page *page);
+extern unsigned int munlock_vma_page(struct page *page);
 
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
diff --git a/mm/mlock.c b/mm/mlock.c
index 6baaf4b0e591..486702edee35 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -102,13 +102,14 @@ void mlock_vma_page(struct page *page)
  * can't isolate the page, we leave it for putback_lru_page() and vmscan
  * [page_referenced()/try_to_unmap()] to deal with.
  */
-void munlock_vma_page(struct page *page)
+unsigned int munlock_vma_page(struct page *page)
 {
+	unsigned int nr_pages = hpage_nr_pages(page);
+
 	BUG_ON(!PageLocked(page));
 
 	if (TestClearPageMlocked(page)) {
-		mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    -hpage_nr_pages(page));
+		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 		if (!isolate_lru_page(page)) {
 			int ret = SWAP_AGAIN;
 
@@ -141,6 +142,8 @@ void munlock_vma_page(struct page *page)
 				count_vm_event(UNEVICTABLE_PGMUNLOCKED);
 		}
 	}
+
+	return nr_pages;
 }
 
 /**
@@ -159,7 +162,6 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, int *nonblocking)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr = start;
 	unsigned long nr_pages = (end - start) / PAGE_SIZE;
 	int gup_flags;
 
@@ -185,7 +187,7 @@ long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
 		gup_flags |= FOLL_FORCE;
 
-	return __get_user_pages(current, mm, addr, nr_pages, gup_flags,
+	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
 				NULL, NULL, nonblocking);
 }
 
@@ -222,13 +224,12 @@ static int __mlock_posix_error_return(long retval)
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
-	unsigned long addr;
-
-	lru_add_drain();
 	vma->vm_flags &= ~VM_LOCKED;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE) {
+	while (start < end) {
 		struct page *page;
+		unsigned int page_mask, page_increm;
+
 		/*
 		 * Although FOLL_DUMP is intended for get_dump_page(),
 		 * it just so happens that its special treatment of the
@@ -236,13 +237,22 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		 * suits munlock very well (and if somehow an abnormal page
 		 * has sneaked into the range, we won't oops here: great).
 		 */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
+		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
+					&page_mask);
 		if (page && !IS_ERR(page)) {
 			lock_page(page);
-			munlock_vma_page(page);
+			lru_add_drain();
+			/*
+			 * Any THP page found by follow_page_mask() may have
+			 * gotten split before reaching munlock_vma_page(),
+			 * so we need to recompute the page_mask here.
+			 */
+			page_mask = munlock_vma_page(page);
 			unlock_page(page);
 			put_page(page);
 		}
+		page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
+		start += page_increm * PAGE_SIZE;
 		cond_resched();
 	}
 }
-- 
1.8.1

* Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages
From: Sasha Levin @ 2013-02-06 23:44 UTC
  To: Michel Lespinasse
  Cc: Andrea Arcangeli, Rik van Riel, Mel Gorman, Hugh Dickins,
	Andrew Morton, linux-mm, linux-kernel

On 02/04/2013 02:17 AM, Michel Lespinasse wrote:
> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
> at a time. When munlocking THP pages (or the huge zero page), this resulted
> in taking the mm->page_table_lock 512 times in a row.
> 
> We can do better by making use of the page_mask returned by follow_page_mask
> (for the huge zero page case), or the size of the page munlock_vma_page()
> operated on (for the true THP page case).
> 
> Note - I am sending this as RFC only for now as I can't currently put
> my finger on what if anything prevents split_huge_page() from operating
> concurrently on the same page as munlock_vma_page(), which would mess
> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle
> point I missed here ?
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>

Hi Michel,

Fuzzing with trinity inside a KVM tools guest produces a steady stream of:


[   51.823275] ------------[ cut here ]------------
[   51.823302] kernel BUG at include/linux/page-flags.h:421!
[   51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   51.823307] Dumping ftrace buffer:
[   51.823314]    (ftrace buffer empty)
[   51.823314] Modules linked in:
[   51.823314] CPU 2
[   51.823314] Pid: 7116, comm: trinity Tainted: G        W    3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273
[   51.823316] RIP: 0010:[<ffffffff81242792>]  [<ffffffff81242792>] munlock_vma_page+0x12/0xf0
[   51.823317] RSP: 0018:ffff880009641bb8  EFLAGS: 00010282
[   51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000
[   51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040
[   51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000
[   51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958
[   51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff
[   51.823326] FS:  00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000
[   51.823327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0
[   51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000)
[   51.823341] Stack:
[   51.823344]  0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd
[   51.823373]  ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958
[   51.823373]  ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81
[   51.823373] Call Trace:
[   51.823373]  [<ffffffff812429bd>] munlock_vma_pages_range+0x8d/0xf0
[   51.823373]  [<ffffffff81246d81>] exit_mmap+0x51/0x170
[   51.823373]  [<ffffffff81278b4a>] ? __khugepaged_exit+0x8a/0xf0
[   51.823373]  [<ffffffff8126a09f>] ? kmem_cache_free+0x22f/0x3b0
[   51.823373]  [<ffffffff81278b4a>] ? __khugepaged_exit+0x8a/0xf0
[   51.823373]  [<ffffffff8110af97>] mmput+0x77/0xe0
[   51.823377]  [<ffffffff81114403>] exit_mm+0x113/0x120
[   51.823381]  [<ffffffff83d727f1>] ? _raw_spin_unlock_irq+0x51/0x80
[   51.823384]  [<ffffffff8111465a>] do_exit+0x24a/0x590
[   51.823387]  [<ffffffff81114a6a>] do_group_exit+0x8a/0xc0
[   51.823390]  [<ffffffff81128591>] get_signal_to_deliver+0x501/0x5b0
[   51.823394]  [<ffffffff8106dd42>] do_signal+0x42/0x110
[   51.823399]  [<ffffffff811d8ea4>] ? rcu_eqs_exit_common+0x64/0x340
[   51.823404]  [<ffffffff81184a0d>] ? trace_hardirqs_on+0xd/0x10
[   51.823407]  [<ffffffff811849c8>] ? trace_hardirqs_on_caller+0x128/0x160
[   51.823409]  [<ffffffff81184a0d>] ? trace_hardirqs_on+0xd/0x10
[   51.823412]  [<ffffffff8106de58>] do_notify_resume+0x48/0xa0
[   51.823415]  [<ffffffff83d732fb>] retint_signal+0x4d/0x92
[   51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48
8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b
[   51.823449] RIP  [<ffffffff81242792>] munlock_vma_page+0x12/0xf0
[   51.823450]  RSP <ffff880009641bb8>
[   51.826846] ---[ end trace a7919e7f17c0a72a ]---


Thanks,
Sasha

* Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages
From: Li Zhong @ 2013-02-07  2:50 UTC
  To: Sasha Levin
  Cc: Michel Lespinasse, Andrea Arcangeli, Rik van Riel, Mel Gorman,
	Hugh Dickins, Andrew Morton, linux-mm, linux-kernel

On Wed, 2013-02-06 at 18:44 -0500, Sasha Levin wrote:
> On 02/04/2013 02:17 AM, Michel Lespinasse wrote:
> > munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
> > at a time. When munlocking THP pages (or the huge zero page), this resulted
> > in taking the mm->page_table_lock 512 times in a row.
> > 
> > We can do better by making use of the page_mask returned by follow_page_mask
> > (for the huge zero page case), or the size of the page munlock_vma_page()
> > operated on (for the true THP page case).
> > 
> > Note - I am sending this as RFC only for now as I can't currently put
> > my finger on what if anything prevents split_huge_page() from operating
> > concurrently on the same page as munlock_vma_page(), which would mess
> > up our NR_MLOCK statistics. Is this a latent bug or is there a subtle
> > point I missed here ?
> > 
> > Signed-off-by: Michel Lespinasse <walken@google.com>
> 
> Hi Michel,
> 
> Fuzzing with trinity inside a KVM tools guest produces a steady stream of:
> 
> [...]

A similar warning prevents my system from booting. It seems to me that in
munlock_vma_pages_range(), page_mask needs to be the page count returned
from munlock_vma_page() minus 1. The following fix solved the problem for
me - would you please give it a try?

Thanks, Zhong


================
diff --git a/mm/mlock.c b/mm/mlock.c
index af1d115..1e3d794 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -255,7 +255,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 			unlock_page(page);
 			put_page(page);
 		}
-		page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
+		page_increm = 1 + (~(start >> PAGE_SHIFT) & (page_mask-1));
 		start += page_increm * PAGE_SIZE;
 		cond_resched();
 	}
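
To make the mask-versus-count confusion concrete (arithmetic illustration
only, assuming HPAGE_PMD_NR = 512; the incr_* names are made up):
follow_page_mask() describes a THP with a mask of 511, while
munlock_vma_page() returns a count of 512, and the increment expression
expects the former.

        /* start on a 2M boundary: (start >> PAGE_SHIFT) & 511 == 0 */
        incr_mask  = 1 + (~(start >> PAGE_SHIFT) & 511); /* always 512 */
        incr_count = 1 + (~(start >> PAGE_SHIFT) & 512); /* 513 or 1, keyed
                                                            off bit 9 of the
                                                            page offset */

Using the raw count either overshoots past the huge page or degenerates to
single-page steps; subtracting 1 restores the mask convention.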



* Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages
  2013-02-07  2:50       ` Li Zhong
@ 2013-02-07  5:42         ` Sasha Levin
  -1 siblings, 0 replies; 20+ messages in thread
From: Sasha Levin @ 2013-02-07  5:42 UTC (permalink / raw)
  To: Li Zhong
  Cc: Michel Lespinasse, Andrea Arcangeli, Rik van Riel, Mel Gorman,
	Hugh Dickins, Andrew Morton, linux-mm, linux-kernel

On 02/06/2013 09:50 PM, Li Zhong wrote:
> On Wed, 2013-02-06 at 18:44 -0500, Sasha Levin wrote:
>> On 02/04/2013 02:17 AM, Michel Lespinasse wrote:
>>> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
>>> at a time. When munlocking THP pages (or the huge zero page), this resulted
>>> in taking the mm->page_table_lock 512 times in a row.
>>>
>>> We can do better by making use of the page_mask returned by follow_page_mask
>>> (for the huge zero page case), or the size of the page munlock_vma_page()
>>> operated on (for the true THP page case).
>>>
>>> Note - I am sending this as RFC only for now as I can't currently put
>>> my finger on what if anything prevents split_huge_page() from operating
>>> concurrently on the same page as munlock_vma_page(), which would mess
>>> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle
>>> point I missed here ?
>>>
>>> Signed-off-by: Michel Lespinasse <walken@google.com>
>>
>> Hi Michel,
>>
>> Fuzzing with trinity inside a KVM tools guest produces a steady stream of:
>>
>>
>> [   51.823275] ------------[ cut here ]------------
>> [   51.823302] kernel BUG at include/linux/page-flags.h:421!
>> [   51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [   51.823307] Dumping ftrace buffer:
>> [   51.823314]    (ftrace buffer empty)
>> [   51.823314] Modules linked in:
>> [   51.823314] CPU 2
>> [   51.823314] Pid: 7116, comm: trinity Tainted: G        W    3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273
>> [   51.823316] RIP: 0010:[<ffffffff81242792>]  [<ffffffff81242792>] munlock_vma_page+0x12/0xf0
>> [   51.823317] RSP: 0018:ffff880009641bb8  EFLAGS: 00010282
>> [   51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000
>> [   51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040
>> [   51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000
>> [   51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958
>> [   51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff
>> [   51.823326] FS:  00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000
>> [   51.823327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0
>> [   51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [   51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [   51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000)
>> [   51.823341] Stack:
>> [   51.823344]  0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd
>> [   51.823373]  ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958
>> [   51.823373]  ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81
>> [   51.823373] Call Trace:
>> [   51.823373]  [<ffffffff812429bd>] munlock_vma_pages_range+0x8d/0xf0
>> [   51.823373]  [<ffffffff81246d81>] exit_mmap+0x51/0x170
>> [   51.823373]  [<ffffffff81278b4a>] ? __khugepaged_exit+0x8a/0xf0
>> [   51.823373]  [<ffffffff8126a09f>] ? kmem_cache_free+0x22f/0x3b0
>> [   51.823373]  [<ffffffff81278b4a>] ? __khugepaged_exit+0x8a/0xf0
>> [   51.823373]  [<ffffffff8110af97>] mmput+0x77/0xe0
>> [   51.823377]  [<ffffffff81114403>] exit_mm+0x113/0x120
>> [   51.823381]  [<ffffffff83d727f1>] ? _raw_spin_unlock_irq+0x51/0x80
>> [   51.823384]  [<ffffffff8111465a>] do_exit+0x24a/0x590
>> [   51.823387]  [<ffffffff81114a6a>] do_group_exit+0x8a/0xc0
>> [   51.823390]  [<ffffffff81128591>] get_signal_to_deliver+0x501/0x5b0
>> [   51.823394]  [<ffffffff8106dd42>] do_signal+0x42/0x110
>> [   51.823399]  [<ffffffff811d8ea4>] ? rcu_eqs_exit_common+0x64/0x340
>> [   51.823404]  [<ffffffff81184a0d>] ? trace_hardirqs_on+0xd/0x10
>> [   51.823407]  [<ffffffff811849c8>] ? trace_hardirqs_on_caller+0x128/0x160
>> [   51.823409]  [<ffffffff81184a0d>] ? trace_hardirqs_on+0xd/0x10
>> [   51.823412]  [<ffffffff8106de58>] do_notify_resume+0x48/0xa0
>> [   51.823415]  [<ffffffff83d732fb>] retint_signal+0x4d/0x92
>> [   51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48
>> 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b
>> [   51.823449] RIP  [<ffffffff81242792>] munlock_vma_page+0x12/0xf0
>> [   51.823450]  RSP <ffff880009641bb8>
>> [   51.826846] ---[ end trace a7919e7f17c0a72a ]---
>>
> 
> A similar warning prevents my system from booting. It seems to me
> that in munlock_vma_pages_range(), page_mask needs to be the page
> count returned from munlock_vma_page() minus 1. The following fix
> solved my problem. Would you please give it a try?

Solved it here as well, awesome!


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages
  2013-02-06 23:44     ` Sasha Levin
@ 2013-02-07 11:49       ` Hillf Danton
  -1 siblings, 0 replies; 20+ messages in thread
From: Hillf Danton @ 2013-02-07 11:49 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Michel Lespinasse, Andrea Arcangeli, Rik van Riel, Mel Gorman,
	Hugh Dickins, Andrew Morton, linux-mm, linux-kernel

On Thu, Feb 7, 2013 at 7:44 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
> On 02/04/2013 02:17 AM, Michel Lespinasse wrote:
>> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
>> at a time. When munlocking THP pages (or the huge zero page), this resulted
>> in taking the mm->page_table_lock 512 times in a row.
>>
>> We can do better by making use of the page_mask returned by follow_page_mask
>> (for the huge zero page case), or the size of the page munlock_vma_page()
>> operated on (for the true THP page case).
>>
>> Note - I am sending this as RFC only for now as I can't currently put
>> my finger on what if anything prevents split_huge_page() from operating
>> concurrently on the same page as munlock_vma_page(), which would mess
>> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle
>> point I missed here ?
>>
>> Signed-off-by: Michel Lespinasse <walken@google.com>
>
> Hi Michel,
>
> Fuzzing with trinity inside a KVM tools guest produces a steady stream of:
>
>
> [   51.823275] ------------[ cut here ]------------
> [   51.823302] kernel BUG at include/linux/page-flags.h:421!
> [   51.823307] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [   51.823307] Dumping ftrace buffer:
> [   51.823314]    (ftrace buffer empty)
> [   51.823314] Modules linked in:
> [   51.823314] CPU 2
> [   51.823314] Pid: 7116, comm: trinity Tainted: G        W    3.8.0-rc6-next-20130206-sasha-00027-g3b5963c-dirty #273
> [   51.823316] RIP: 0010:[<ffffffff81242792>]  [<ffffffff81242792>] munlock_vma_page+0x12/0xf0
> [   51.823317] RSP: 0018:ffff880009641bb8  EFLAGS: 00010282
> [   51.823319] RAX: 011ffc0000008001 RBX: ffffea0000410040 RCX: 0000000000000000
> [   51.823320] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea0000410040
> [   51.823321] RBP: ffff880009641bc8 R08: 0000000000000000 R09: 0000000000000000
> [   51.823322] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880009633958
> [   51.823324] R13: 0000000001252000 R14: ffffea0000410040 R15: 00000000000000ff
> [   51.823326] FS:  00007fe7a9046700(0000) GS:ffff88000ba00000(0000) knlGS:0000000000000000
> [   51.823327] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   51.823328] CR2: 00007fc583b90fcb CR3: 0000000009bc8000 CR4: 00000000000406e0
> [   51.823334] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   51.823338] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [   51.823340] Process trinity (pid: 7116, threadinfo ffff880009640000, task ffff880009638000)
> [   51.823341] Stack:
> [   51.823344]  0000000000a01000 ffff880009633958 ffff880009641c08 ffffffff812429bd
> [   51.823373]  ffff880009638000 000001ff09638000 ffff880009ade000 ffff880009633958
> [   51.823373]  ffff880009638810 ffff880009ade098 ffff880009641cb8 ffffffff81246d81
> [   51.823373] Call Trace:
> [   51.823373]  [<ffffffff812429bd>] munlock_vma_pages_range+0x8d/0xf0
> [   51.823373]  [<ffffffff81246d81>] exit_mmap+0x51/0x170
> [   51.823373]  [<ffffffff81278b4a>] ? __khugepaged_exit+0x8a/0xf0
> [   51.823373]  [<ffffffff8126a09f>] ? kmem_cache_free+0x22f/0x3b0
> [   51.823373]  [<ffffffff81278b4a>] ? __khugepaged_exit+0x8a/0xf0
> [   51.823373]  [<ffffffff8110af97>] mmput+0x77/0xe0
> [   51.823377]  [<ffffffff81114403>] exit_mm+0x113/0x120
> [   51.823381]  [<ffffffff83d727f1>] ? _raw_spin_unlock_irq+0x51/0x80
> [   51.823384]  [<ffffffff8111465a>] do_exit+0x24a/0x590
> [   51.823387]  [<ffffffff81114a6a>] do_group_exit+0x8a/0xc0
> [   51.823390]  [<ffffffff81128591>] get_signal_to_deliver+0x501/0x5b0
> [   51.823394]  [<ffffffff8106dd42>] do_signal+0x42/0x110
> [   51.823399]  [<ffffffff811d8ea4>] ? rcu_eqs_exit_common+0x64/0x340
> [   51.823404]  [<ffffffff81184a0d>] ? trace_hardirqs_on+0xd/0x10
> [   51.823407]  [<ffffffff811849c8>] ? trace_hardirqs_on_caller+0x128/0x160
> [   51.823409]  [<ffffffff81184a0d>] ? trace_hardirqs_on+0xd/0x10
> [   51.823412]  [<ffffffff8106de58>] do_notify_resume+0x48/0xa0
> [   51.823415]  [<ffffffff83d732fb>] retint_signal+0x4d/0x92
> [   51.823449] Code: 85 c0 75 0d 48 89 df e8 0d 30 fe ff 0f 1f 44 00 00 48 83 c4 08 5b 5d c3 90 55 48 89 e5 41 54 53 48 89 fb 48
> 8b 07 f6 c4 80 74 06 <0f> 0b 0f 1f 40 00 48 8b 07 48 c1 e8 0e 83 e0 01 83 f8 01 48 8b
> [   51.823449] RIP  [<ffffffff81242792>] munlock_vma_page+0x12/0xf0
> [   51.823450]  RSP <ffff880009641bb8>
> [   51.826846] ---[ end trace a7919e7f17c0a72a ]---
>
>
Only the head page is mlocked, so we have to avoid checking THP state
against a tail page.
Would you please try the following, Sasha?
Hillf
---
--- a/mm/mlock.c	Thu Feb  7 19:43:20 2013
+++ b/mm/mlock.c	Thu Feb  7 19:45:54 2013
@@ -104,12 +104,14 @@ void mlock_vma_page(struct page *page)
  */
 unsigned int munlock_vma_page(struct page *page)
 {
-	unsigned int nr_pages = hpage_nr_pages(page);
+	unsigned int nr_pages = 0;

 	BUG_ON(!PageLocked(page));

 	if (TestClearPageMlocked(page)) {
+		nr_pages = hpage_nr_pages(page);
 		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+		nr_pages--;
 		if (!isolate_lru_page(page)) {
 			int ret = SWAP_AGAIN;

--
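
For context on why the assertion fires: the BUG at include/linux/page-flags.h
in the trace is consistent with the unconditional hpage_nr_pages() at the top
of munlock_vma_page() being handed a tail page, since hpage_nr_pages() goes
through PageTransHuge(), which asserts against tail pages. Roughly, from
memory of the 3.8-era headers (check page-flags.h and huge_mm.h for the
exact definitions):

	static inline int PageTransHuge(struct page *page)
	{
		VM_BUG_ON(PageTail(page));	/* the BUG in the trace above */
		return PageHead(page);
	}

	static inline int hpage_nr_pages(struct page *page)
	{
		if (unlikely(PageTransHuge(page)))
			return HPAGE_PMD_NR;
		return 1;
	}

Moving the hpage_nr_pages() call inside the TestClearPageMlocked() branch, as
in the diff above, avoids the assertion as long as tail pages never have
PG_mlocked set -- which is the point that only the head page is mlocked.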

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages
  2013-02-04  7:17   ` Michel Lespinasse
@ 2013-02-08 20:25     ` Andrea Arcangeli
  -1 siblings, 0 replies; 20+ messages in thread
From: Andrea Arcangeli @ 2013-02-08 20:25 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, Mel Gorman, Hugh Dickins, Andrew Morton, linux-mm,
	linux-kernel

Hi Michel,

On Sun, Feb 03, 2013 at 11:17:12PM -0800, Michel Lespinasse wrote:
> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
> at a time. When munlocking THP pages (or the huge zero page), this resulted
> in taking the mm->page_table_lock 512 times in a row.
> 
> We can do better by making use of the page_mask returned by follow_page_mask
> (for the huge zero page case), or the size of the page munlock_vma_page()
> operated on (for the true THP page case).
> 
> Note - I am sending this as RFC only for now as I can't currently put
> my finger on what if anything prevents split_huge_page() from operating
> concurrently on the same page as munlock_vma_page(), which would mess
> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle
> point I missed here ?

I agree something looks fishy: neither mmap_sem held for writing nor the
page lock can stop split_huge_page_refcount.

Now the mlock side was intended to be safe because mlock_vma_page is
called within follow_page while holding the PT lock or the
page_table_lock (so split_huge_page_refcount will have to wait for it
to be released before it can run). See follow_trans_huge_pmd
assert_spin_locked and the pte_unmap_unlock after mlock_vma_page
returns.

The problem is, the mlock side depended on the TestSetPageMlocked below
always being repeated on the head page (follow_trans_huge_pmd will always
pass the head page to mlock_vma_page).

void mlock_vma_page(struct page *page)
{
	BUG_ON(!PageLocked(page));

	if (!TestSetPageMlocked(page)) {

But what if the head page was split in between two different
follow_page calls? The second call wouldn't take the pmd_trans_huge
path anymore and the stats would be increased too much.

The problem on the munlock side is even more apparent as you pointed
out above but now I think the mlock side was problematic too.

The good thing is, your acceleration code for the mlock side should
have fixed the mlock race already: never risking calling mlock_vma_page
twice on the head page is not just an "acceleration", it is also a
natural fix for the race.

To fix the munlock side, where the race is still present, I think one way
would be to do the mlock and munlock within get_user_pages, so they run in
the same place, protected by the PT lock or page_table_lock.

There are a few things that stop split_huge_page_refcount: the
page_table_lock, lru_lock, compound_lock and anon_vma lock. So if we keep
calling munlock_vma_page outside of get_user_pages (and thus outside of
the page_table_lock), the other option would be to use the compound_lock.
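
A rough sketch of that second option, only to show where the lock would sit
(not a tested patch -- whether compound_lock nests safely against the
lru_lock/irq-disabled paths that also stop split_huge_page_refcount is
exactly the open question here):

	unsigned int nr_pages = 0;

	/*
	 * compound_lock() is one of the locks that block
	 * split_huge_page_refcount(), so between here and
	 * compound_unlock() the page cannot change size under us.
	 */
	compound_lock(page);
	if (TestClearPageMlocked(page)) {
		nr_pages = hpage_nr_pages(page);
		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
	}
	compound_unlock(page);
	/* the existing isolate_lru_page()/putback handling would stay as is */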

NOTE: this is a purely cosmetic issue in /proc/meminfo; there's
nothing functional (at least in the kernel) connected to it, so no
panic :).

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/3] mm: accelerate munlock() treatment of THP pages
  2013-02-08 20:25     ` Andrea Arcangeli
@ 2013-02-08 23:17       ` Michel Lespinasse
  -1 siblings, 0 replies; 20+ messages in thread
From: Michel Lespinasse @ 2013-02-08 23:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Mel Gorman, Hugh Dickins, Andrew Morton, linux-mm,
	linux-kernel

On Fri, Feb 8, 2013 at 12:25 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Michel,
>
> On Sun, Feb 03, 2013 at 11:17:12PM -0800, Michel Lespinasse wrote:
>> munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
>> at a time. When munlocking THP pages (or the huge zero page), this resulted
>> in taking the mm->page_table_lock 512 times in a row.
>>
>> We can do better by making use of the page_mask returned by follow_page_mask
>> (for the huge zero page case), or the size of the page munlock_vma_page()
>> operated on (for the true THP page case).
>>
>> Note - I am sending this as RFC only for now as I can't currently put
>> my finger on what if anything prevents split_huge_page() from operating
>> concurrently on the same page as munlock_vma_page(), which would mess
>> up our NR_MLOCK statistics. Is this a latent bug or is there a subtle
>> point I missed here ?
>
> I agree something looks fishy: neither mmap_sem held for writing nor the
> page lock can stop split_huge_page_refcount.
>
> Now the mlock side was intended to be safe because mlock_vma_page is
> called within follow_page while holding the PT lock or the
> page_table_lock (so split_huge_page_refcount will have to wait for it
> to be released before it can run). See follow_trans_huge_pmd
> assert_spin_locked and the pte_unmap_unlock after mlock_vma_page
> returns.
>
> The problem is, the mlock side depended on the TestSetPageMlocked below
> always being repeated on the head page (follow_trans_huge_pmd will always
> pass the head page to mlock_vma_page).
>
> void mlock_vma_page(struct page *page)
> {
>         BUG_ON(!PageLocked(page));
>
>         if (!TestSetPageMlocked(page)) {
>
> But what if the head page was split in between two different
> follow_page calls? The second call wouldn't take the pmd_trans_huge
> path anymore and the stats would be increased too much.

Ah, good point. I am getting increasingly scared about the locking
here the more I look at it :/

> The problem on the munlock side is even more apparent as you pointed
> out above but now I think the mlock side was problematic too.
>
> The good thing is, your acceleration code for the mlock side should
> have fixed the mlock race already: never risking calling mlock_vma_page
> twice on the head page is not just an "acceleration", it is also a
> natural fix for the race.
>
> To fix the munlock side, where the race is still present, I think one way
> would be to do the mlock and munlock within get_user_pages, so they run in
> the same place, protected by the PT lock or page_table_lock.

I actually had a quick try at this before submitting this patch
series; the difficulty with it was that we want to have the page lock
before calling munlock_vma_page() and we can't get it while holding
the page table lock. So we'd need to do some gymnastics involving
trylock_page() and if that fails, release the page table lock / lock
the page / reacquire the page table lock / see if we locked the right
page. This looked nasty, and then I noticed the existing
hpage_nr_pages() test in munlock_vma_page() and decided to just ask
you about it :)
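
(For the record, the gymnastics would look roughly like the sketch below, for
the pte-mapped case inside the follow_page/GUP path with the pte lock held.
The retry label, the ptep/orig_pte names and the exact unwind steps are mine
and purely illustrative; this is not code from the series:)

	if (!trylock_page(page)) {
		get_page(page);			/* keep the page around */
		pte_unmap_unlock(ptep, ptl);	/* can't sleep under the lock */
		lock_page(page);
		ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
		if (!pte_same(*ptep, orig_pte)) {
			/* the mapping changed while we slept: unwind, retry */
			unlock_page(page);
			put_page(page);
			goto retry;
		}
		put_page(page);
	}
	munlock_vma_page(page);
	unlock_page(page);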

> There are a few things that stop split_huge_page_refcount: the
> page_table_lock, lru_lock, compound_lock and anon_vma lock. So if we keep
> calling munlock_vma_page outside of get_user_pages (and thus outside of
> the page_table_lock), the other option would be to use the compound_lock.
>
> NOTE: this is a purely cosmetic issue in /proc/meminfo; there's
> nothing functional (at least in the kernel) connected to it, so no
> panic :).

Yes.

I think we should try and get the locking right, because wrong code is
just confusing, and it leads people to make incorrect assumptions and
propagate them to places where it may have larger consequences. But
other than that, I agree that the statistics issue we're talking about
here doesn't sound too severe.

I am in a bit of a bind here, as I will be taking a vacation abroad
soon. I would like to get these 3 patches out before then, but I don't
want to do anything too major as I'm not sure what kind of
connectivity I'll have to fix things if needed. I think I'll go with
the simplest option of adding the THP optimization first and fixing
the locking issues at a later stage...

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-02-08 23:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-04  7:17 [PATCH v2 0/3] fixes for large mm_populate() and munlock() operations Michel Lespinasse
2013-02-04  7:17 ` Michel Lespinasse
2013-02-04  7:17 ` [PATCH v2 1/3] fix mm: use long type for page counts in mm_populate() and get_user_pages() Michel Lespinasse
2013-02-04  7:17   ` Michel Lespinasse
2013-02-04  7:17 ` [PATCH v2 2/3] mm: accelerate mm_populate() treatment of THP pages Michel Lespinasse
2013-02-04  7:17   ` Michel Lespinasse
2013-02-04  7:17 ` [PATCH v2 3/3] mm: accelerate munlock() " Michel Lespinasse
2013-02-04  7:17   ` Michel Lespinasse
2013-02-06 23:44   ` Sasha Levin
2013-02-06 23:44     ` Sasha Levin
2013-02-07  2:50     ` Li Zhong
2013-02-07  2:50       ` Li Zhong
2013-02-07  5:42       ` Sasha Levin
2013-02-07  5:42         ` Sasha Levin
2013-02-07 11:49     ` Hillf Danton
2013-02-07 11:49       ` Hillf Danton
2013-02-08 20:25   ` Andrea Arcangeli
2013-02-08 20:25     ` Andrea Arcangeli
2013-02-08 23:17     ` Michel Lespinasse
2013-02-08 23:17       ` Michel Lespinasse
