linux-mm.kvack.org archive mirror
* [PATCH 00/11 v5] update page table walker
@ 2014-02-10 21:44 Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 01/11] pagewalk: update page table walker core Naoya Horiguchi
                   ` (11 more replies)
  0 siblings, 12 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

Hi,

This is version 5 of the page table walker patchset.
I rebased it onto v3.14-rc2.

- v1: http://article.gmane.org/gmane.linux.kernel.mm/108362
- v2: http://article.gmane.org/gmane.linux.kernel.mm/108827
- v3: http://article.gmane.org/gmane.linux.kernel.mm/110561
- v4: http://article.gmane.org/gmane.linux.kernel.mm/111832

Thanks,
Naoya Horiguchi
---
Test code:
  git://github.com/Naoya-Horiguchi/test_rewrite_page_table_walker.git
---
Summary:

Naoya Horiguchi (11):
      pagewalk: update page table walker core
      pagewalk: add walk_page_vma()
      smaps: redefine callback functions for page table walker
      clear_refs: redefine callback functions for page table walker
      pagemap: redefine callback functions for page table walker
      numa_maps: redefine callback functions for page table walker
      memcg: redefine callback functions for page table walker
      madvise: redefine callback functions for page table walker
      arch/powerpc/mm/subpage-prot.c: use walk_page_vma() instead of walk_page_range()
      pagewalk: remove argument hmask from hugetlb_entry()
      mempolicy: apply page table walker on queue_pages_range()

 arch/powerpc/mm/subpage-prot.c |   6 +-
 fs/proc/task_mmu.c             | 267 ++++++++++++-----------------
 include/linux/mm.h             |  24 ++-
 mm/madvise.c                   |  43 ++---
 mm/memcontrol.c                |  71 +++-----
 mm/mempolicy.c                 | 255 +++++++++++-----------------
 mm/pagewalk.c                  | 372 ++++++++++++++++++++++++++---------------
 7 files changed, 506 insertions(+), 532 deletions(-)


* [PATCH 01/11] pagewalk: update page table walker core
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-12  5:39   ` Joonsoo Kim
                     ` (2 more replies)
  2014-02-10 21:44 ` [PATCH 02/11] pagewalk: add walk_page_vma() Naoya Horiguchi
                   ` (10 subsequent siblings)
  11 siblings, 3 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

This patch updates mm/pagewalk.c to make the code less complex and more
maintainable. The basic idea is unchanged and there is no userspace-visible
effect.

Most of the existing callback functions need access to the vma to handle
each entry, so let's add a new member vma to struct mm_walk instead of
passing the vma via mm_walk->private. This makes the code simpler.

One problem with the current page table walker is that it checks the vma
inside the pgd loop. Historically this was introduced to support hugetlbfs
in a rather awkward way. It is better and cleaner to do the vma check
outside the pgd loop.

Another problem is that many users of the page table walker now use only
pmd_entry(), even though it has to do both the pmd walk and the pte walk.
This duplicates code and makes the callers diverge from each other, which
hurts maintainability.

One difficulty in sharing the code is that each caller wants to decide in
its own way whether to walk over a specific vma. To solve this, this patch
introduces the test_walk() callback.

When we use multiple callbacks at different levels, skip control is also
important. For example, with thp enabled in a normal configuration, we may
be interested in doing some work on a thp; sometimes we want to split it
and handle it as normal pages, and at other times we want to handle entries
both at the pmd level and at the pte level. What we need is to decide,
once pmd_entry() has run, whether to go down to pte-level handling based
on pmd_entry()'s result. So this patch introduces a skip control flag in
mm_walk. We cannot use the return value for this purpose, because the
whole range of return values is already defined: >0 terminates the page
table walk in a caller-specific manner, 0 continues the walk, and <0
aborts the walk in the usual manner.
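
To make the intended usage concrete, here is a minimal caller-side sketch
(illustration only, not part of this patch; the foo_* names are made up):

  static int foo_test_walk(unsigned long start, unsigned long end,
                           struct mm_walk *walk)
  {
          /* e.g. only walk anonymous vmas, skip file-backed ones */
          if (walk->vma->vm_file)
                  walk->skip = 1;
          return 0;
  }

  static int foo_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
                     struct mm_walk *walk)
  {
          spinlock_t *ptl;

          if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) {
                  /* the caller handles the whole thp here, under ptl */
                  spin_unlock(ptl);
                  /* ... and tells the walker not to descend to the ptes */
                  walk->skip = 1;
          }
          return 0;
  }

  static int foo_pte(pte_t *pte, unsigned long addr, unsigned long end,
                     struct mm_walk *walk)
  {
          /* called for each non-empty pte unless foo_pmd set walk->skip */
          return 0;
  }

Then the caller, holding mm->mmap_sem at least for read, does:

  struct mm_walk walk = {
          .test_walk      = foo_test_walk,
          .pmd_entry      = foo_pmd,
          .pte_entry      = foo_pte,
          .mm             = mm,
  };

  walk_page_range(start, end, &walk);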

ChangeLog v5:
- fix build error ("mm/pagewalk.c:201: error: 'hmask' undeclared")

ChangeLog v4:
- add more comment
- remove verbose variable in walk_page_test()
- rename skip_check to skip_lower_level_walking
- rebased onto mmotm-2014-01-09-16-23

ChangeLog v3:
- rebased onto v3.13-rc3-mmots-2013-12-10-16-38

ChangeLog v2:
- rebase onto mmots
- add pte_none() check in walk_pte_range()
- add cond_resched() in walk_hugetlb_range()
- add skip_check()
- do VM_PFNMAP check only when ->test_walk() is not defined (because some
  caller could handle VM_PFNMAP vma. copy_page_range() is an example.)
- use do-while condition (addr < end) instead of (addr != end)

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/mm.h |  18 ++-
 mm/pagewalk.c      | 352 +++++++++++++++++++++++++++++++++--------------------
 2 files changed, 235 insertions(+), 135 deletions(-)

diff --git v3.14-rc2.orig/include/linux/mm.h v3.14-rc2/include/linux/mm.h
index f28f46eade6a..4d0bc01de43c 100644
--- v3.14-rc2.orig/include/linux/mm.h
+++ v3.14-rc2/include/linux/mm.h
@@ -1067,10 +1067,18 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
  * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
  * @pte_hole: if set, called for each hole at all levels
  * @hugetlb_entry: if set, called for each hugetlb entry
- *		   *Caution*: The caller must hold mmap_sem() if @hugetlb_entry
- * 			      is used.
+ * @test_walk: caller specific callback function to determine whether
+ *             we walk over the current vma or not. A positive returned
+ *             value means "do page table walk over the current vma,"
+ *             and a negative one means "abort current page table walk
+ *             right now." 0 means "skip the current vma."
+ * @mm:        mm_struct representing the target process of page table walk
+ * @vma:       vma currently walked
+ * @skip:      internal control flag which is set when we skip the lower
+ *             level entries.
+ * @private:   private data for callbacks' use
  *
- * (see walk_page_range for more details)
+ * (see the comment on walk_page_range() for more details)
  */
 struct mm_walk {
 	int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
@@ -1086,7 +1094,11 @@ struct mm_walk {
 	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
 			     unsigned long addr, unsigned long next,
 			     struct mm_walk *walk);
+	int (*test_walk)(unsigned long addr, unsigned long next,
+			struct mm_walk *walk);
 	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int skip;
 	void *private;
 };
 
diff --git v3.14-rc2.orig/mm/pagewalk.c v3.14-rc2/mm/pagewalk.c
index 2beeabf502c5..4770558feea8 100644
--- v3.14-rc2.orig/mm/pagewalk.c
+++ v3.14-rc2/mm/pagewalk.c
@@ -3,29 +3,58 @@
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
 
-static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			  struct mm_walk *walk)
+/*
+ * Check the current skip status of page table walker.
+ *
+ * Here what I mean by skip is to skip lower level walking, and that was
+ * determined for each entry independently. For example, when walk_pmd_range
+ * handles a pmd_trans_huge we don't have to walk over ptes under that pmd,
+ * and the skipping does not affect the walking over ptes under other pmds.
+ * That's why we reset @walk->skip after tested.
+ */
+static bool skip_lower_level_walking(struct mm_walk *walk)
+{
+	if (walk->skip) {
+		walk->skip = 0;
+		return true;
+	}
+	return false;
+}
+
+static int walk_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
 {
+	struct mm_struct *mm = walk->mm;
 	pte_t *pte;
+	pte_t *orig_pte;
+	spinlock_t *ptl;
 	int err = 0;
 
-	pte = pte_offset_map(pmd, addr);
-	for (;;) {
+	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	do {
+		if (pte_none(*pte)) {
+			if (walk->pte_hole)
+				err = walk->pte_hole(addr, addr + PAGE_SIZE,
+							walk);
+			if (err)
+				break;
+			continue;
+		}
+		/*
+		 * Callers should have their own way to handle swap entries
+		 * in walk->pte_entry().
+		 */
 		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
 		if (err)
 		       break;
-		addr += PAGE_SIZE;
-		if (addr == end)
-			break;
-		pte++;
-	}
-
-	pte_unmap(pte);
-	return err;
+	} while (pte++, addr += PAGE_SIZE, addr < end);
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+	return addr == end ? 0 : err;
 }
 
-static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-			  struct mm_walk *walk)
+static int walk_pmd_range(pud_t *pud, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -35,6 +64,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	do {
 again:
 		next = pmd_addr_end(addr, end);
+
 		if (pmd_none(*pmd)) {
 			if (walk->pte_hole)
 				err = walk->pte_hole(addr, next, walk);
@@ -42,35 +72,32 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 				break;
 			continue;
 		}
-		/*
-		 * This implies that each ->pmd_entry() handler
-		 * needs to know about pmd_trans_huge() pmds
-		 */
-		if (walk->pmd_entry)
-			err = walk->pmd_entry(pmd, addr, next, walk);
-		if (err)
-			break;
 
-		/*
-		 * Check this here so we only break down trans_huge
-		 * pages when we _need_ to
-		 */
-		if (!walk->pte_entry)
-			continue;
+		if (walk->pmd_entry) {
+			err = walk->pmd_entry(pmd, addr, next, walk);
+			if (skip_lower_level_walking(walk))
+				continue;
+			if (err)
+				break;
+		}
 
-		split_huge_page_pmd_mm(walk->mm, addr, pmd);
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-			goto again;
-		err = walk_pte_range(pmd, addr, next, walk);
-		if (err)
-			break;
-	} while (pmd++, addr = next, addr != end);
+		if (walk->pte_entry) {
+			if (walk->vma) {
+				split_huge_page_pmd(walk->vma, addr, pmd);
+				if (pmd_trans_unstable(pmd))
+					goto again;
+			}
+			err = walk_pte_range(pmd, addr, next, walk);
+			if (err)
+				break;
+		}
+	} while (pmd++, addr = next, addr < end);
 
 	return err;
 }
 
-static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
-			  struct mm_walk *walk)
+static int walk_pud_range(pgd_t *pgd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -79,6 +106,7 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
+
 		if (pud_none_or_clear_bad(pud)) {
 			if (walk->pte_hole)
 				err = walk->pte_hole(addr, next, walk);
@@ -86,13 +114,58 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 				break;
 			continue;
 		}
-		if (walk->pud_entry)
+
+		if (walk->pud_entry) {
 			err = walk->pud_entry(pud, addr, next, walk);
-		if (!err && (walk->pmd_entry || walk->pte_entry))
+			if (skip_lower_level_walking(walk))
+				continue;
+			if (err)
+				break;
+		}
+
+		if (walk->pmd_entry || walk->pte_entry) {
 			err = walk_pmd_range(pud, addr, next, walk);
-		if (err)
-			break;
-	} while (pud++, addr = next, addr != end);
+			if (err)
+				break;
+		}
+	} while (pud++, addr = next, addr < end);
+
+	return err;
+}
+
+static int walk_pgd_range(unsigned long addr, unsigned long end,
+			struct mm_walk *walk)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	int err = 0;
+
+	pgd = pgd_offset(walk->mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+
+		if (pgd_none_or_clear_bad(pgd)) {
+			if (walk->pte_hole)
+				err = walk->pte_hole(addr, next, walk);
+			if (err)
+				break;
+			continue;
+		}
+
+		if (walk->pgd_entry) {
+			err = walk->pgd_entry(pgd, addr, next, walk);
+			if (skip_lower_level_walking(walk))
+				continue;
+			if (err)
+				break;
+		}
+
+		if (walk->pud_entry || walk->pmd_entry || walk->pte_entry) {
+			err = walk_pud_range(pgd, addr, next, walk);
+			if (err)
+				break;
+		}
+	} while (pgd++, addr = next, addr < end);
 
 	return err;
 }
@@ -105,144 +178,159 @@ static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr,
 	return boundary < end ? boundary : end;
 }
 
-static int walk_hugetlb_range(struct vm_area_struct *vma,
-			      unsigned long addr, unsigned long end,
-			      struct mm_walk *walk)
+static int walk_hugetlb_range(unsigned long addr, unsigned long end,
+				struct mm_walk *walk)
 {
+	struct mm_struct *mm = walk->mm;
+	struct vm_area_struct *vma = walk->vma;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long next;
 	unsigned long hmask = huge_page_mask(h);
 	pte_t *pte;
 	int err = 0;
+	spinlock_t *ptl;
 
 	do {
 		next = hugetlb_entry_end(h, addr, end);
 		pte = huge_pte_offset(walk->mm, addr & hmask);
+		ptl = huge_pte_lock(h, mm, pte);
+		/*
+		 * Callers should have their own way to handle swap entries
+		 * in walk->hugetlb_entry().
+		 */
 		if (pte && walk->hugetlb_entry)
 			err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+		spin_unlock(ptl);
 		if (err)
-			return err;
+			break;
 	} while (addr = next, addr != end);
-
-	return 0;
+	cond_resched();
+	return err;
 }
 
 #else /* CONFIG_HUGETLB_PAGE */
-static int walk_hugetlb_range(struct vm_area_struct *vma,
-			      unsigned long addr, unsigned long end,
-			      struct mm_walk *walk)
+static inline int walk_hugetlb_range(unsigned long addr, unsigned long end,
+				struct mm_walk *walk)
 {
 	return 0;
 }
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
+/*
+ * Decide whether we really walk over the current vma on [@start, @end)
+ * or skip it. When we skip it, we set @walk->skip to 1.
+ * The return value is used to control the page table walking to
+ * continue (for zero) or not (for non-zero).
+ *
+ * Default check (only VM_PFNMAP check for now) is used when the caller
+ * doesn't define test_walk() callback.
+ */
+static int walk_page_test(unsigned long start, unsigned long end,
+			struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
 
+	if (walk->test_walk)
+		return walk->test_walk(start, end, walk);
+
+	/*
+	 * Do not walk over vma(VM_PFNMAP), because we have no valid struct
+	 * page backing a VM_PFNMAP range. See also commit a9ff785e4437.
+	 */
+	if (vma->vm_flags & VM_PFNMAP)
+		walk->skip = 1;
+	return 0;
+}
+
+static int __walk_page_range(unsigned long start, unsigned long end,
+			struct mm_walk *walk)
+{
+	int err = 0;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (vma && is_vm_hugetlb_page(vma)) {
+		if (walk->hugetlb_entry)
+			err = walk_hugetlb_range(start, end, walk);
+	} else
+		err = walk_pgd_range(start, end, walk);
+
+	return err;
+}
 
 /**
- * walk_page_range - walk a memory map's page tables with a callback
- * @addr: starting address
- * @end: ending address
- * @walk: set of callbacks to invoke for each level of the tree
+ * walk_page_range - walk page table with caller specific callbacks
+ *
+ * Recursively walk the page table tree of the process represented by
+ * @walk->mm within the virtual address range [@start, @end). In walking,
+ * we can call caller-specific callback functions against each entry.
  *
- * Recursively walk the page table for the memory area in a VMA,
- * calling supplied callbacks. Callbacks are called in-order (first
- * PGD, first PUD, first PMD, first PTE, second PTE... second PMD,
- * etc.). If lower-level callbacks are omitted, walking depth is reduced.
+ * Before starting to walk page table, some callers want to check whether
+ * they really want to walk over the vma (for example by checking vm_flags.)
+ * walk_page_test() and @walk->test_walk() do that check.
  *
- * Each callback receives an entry pointer and the start and end of the
- * associated range, and a copy of the original mm_walk for access to
- * the ->private or ->mm fields.
+ * If any callback returns a non-zero value, the page table walk is aborted
+ * immediately and the return value is propagated back to the caller.
+ * Note that the meaning of the positive returned value can be defined
+ * by the caller for its own purpose.
  *
- * Usually no locks are taken, but splitting transparent huge page may
- * take page table lock. And the bottom level iterator will map PTE
- * directories from highmem if necessary.
+ * If the caller defines multiple callbacks in different levels, the
+ * callbacks are called in depth-first manner. It could happen that
+ * multiple callbacks are called on a address. For example if some caller
+ * defines test_walk(), pmd_entry(), and pte_entry(), then callbacks are
+ * called in the order of test_walk(), pmd_entry(), and pte_entry().
+ * If you don't want to go down to lower level at some point and move to
+ * the next entry in the same level, you set @walk->skip to 1.
+ * For example if you succeed to handle some pmd entry as trans_huge entry,
+ * you need not call walk_pte_range() any more, so set it to avoid that.
+ * We can't determine whether to go down to lower level with the return
+ * value of the callback, because the whole range of return values (0, >0,
+ * and <0) are used up for other meanings.
  *
- * If any callback returns a non-zero value, the walk is aborted and
- * the return value is propagated back to the caller. Otherwise 0 is returned.
+ * Each callback can access to the vma over which it is doing page table
+ * walk right now via @walk->vma. @walk->vma is set to NULL in walking
+ * outside a vma. If you want to access to some caller-specific data from
+ * callbacks, @walk->private should be helpful.
  *
- * walk->mm->mmap_sem must be held for at least read if walk->hugetlb_entry
- * is !NULL.
+ * The callers should hold @walk->mm->mmap_sem. Note that the lower level
+ * iterators can take page table lock in lowest level iteration and/or
+ * in split_huge_page_pmd().
  */
-int walk_page_range(unsigned long addr, unsigned long end,
+int walk_page_range(unsigned long start, unsigned long end,
 		    struct mm_walk *walk)
 {
-	pgd_t *pgd;
-	unsigned long next;
 	int err = 0;
+	struct vm_area_struct *vma;
+	unsigned long next;
 
-	if (addr >= end)
-		return err;
+	if (start >= end)
+		return -EINVAL;
 
 	if (!walk->mm)
 		return -EINVAL;
 
 	VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem));
 
-	pgd = pgd_offset(walk->mm, addr);
 	do {
-		struct vm_area_struct *vma = NULL;
-
-		next = pgd_addr_end(addr, end);
-
-		/*
-		 * This function was not intended to be vma based.
-		 * But there are vma special cases to be handled:
-		 * - hugetlb vma's
-		 * - VM_PFNMAP vma's
-		 */
-		vma = find_vma(walk->mm, addr);
-		if (vma) {
-			/*
-			 * There are no page structures backing a VM_PFNMAP
-			 * range, so do not allow split_huge_page_pmd().
-			 */
-			if ((vma->vm_start <= addr) &&
-			    (vma->vm_flags & VM_PFNMAP)) {
-				next = vma->vm_end;
-				pgd = pgd_offset(walk->mm, next);
-				continue;
-			}
-			/*
-			 * Handle hugetlb vma individually because pagetable
-			 * walk for the hugetlb page is dependent on the
-			 * architecture and we can't handled it in the same
-			 * manner as non-huge pages.
-			 */
-			if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
-			    is_vm_hugetlb_page(vma)) {
-				if (vma->vm_end < next)
-					next = vma->vm_end;
-				/*
-				 * Hugepage is very tightly coupled with vma,
-				 * so walk through hugetlb entries within a
-				 * given vma.
-				 */
-				err = walk_hugetlb_range(vma, addr, next, walk);
-				if (err)
-					break;
-				pgd = pgd_offset(walk->mm, next);
+		vma = find_vma(walk->mm, start);
+		if (!vma) { /* after the last vma */
+			walk->vma = NULL;
+			next = end;
+		} else if (start < vma->vm_start) { /* outside the found vma */
+			walk->vma = NULL;
+			next = vma->vm_start;
+		} else { /* inside the found vma */
+			walk->vma = vma;
+			next = vma->vm_end;
+			err = walk_page_test(start, end, walk);
+			if (skip_lower_level_walking(walk))
 				continue;
-			}
-		}
-
-		if (pgd_none_or_clear_bad(pgd)) {
-			if (walk->pte_hole)
-				err = walk->pte_hole(addr, next, walk);
 			if (err)
 				break;
-			pgd++;
-			continue;
 		}
-		if (walk->pgd_entry)
-			err = walk->pgd_entry(pgd, addr, next, walk);
-		if (!err &&
-		    (walk->pud_entry || walk->pmd_entry || walk->pte_entry))
-			err = walk_pud_range(pgd, addr, next, walk);
+		err = __walk_page_range(start, next, walk);
 		if (err)
 			break;
-		pgd++;
-	} while (addr = next, addr < end);
-
+	} while (start = next, start < end);
 	return err;
 }
-- 
1.8.5.3


* [PATCH 02/11] pagewalk: add walk_page_vma()
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 01/11] pagewalk: update page table walker core Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 03/11] smaps: redefine callback functions for page table walker Naoya Horiguchi
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

Introduce walk_page_vma(), which is useful for callers that want to walk
over a single given vma. It is used by later patches.
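
For reference, the calling pattern used by the later patches in this series
looks like the following sketch (foo_pte and foo_data are placeholder
names, and mm is the target mm_struct):

  struct vm_area_struct *vma;
  struct mm_walk walk = {
          .pte_entry      = foo_pte,
          .mm             = mm,
          .private        = &foo_data,
  };

  down_read(&mm->mmap_sem);
  for (vma = mm->mmap; vma; vma = vma->vm_next)
          walk_page_vma(vma, &walk);
  up_read(&mm->mmap_sem);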

ChangeLog v4:
- rename skip_check to skip_lower_level_walking

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/mm.h |  1 +
 mm/pagewalk.c      | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git v3.14-rc2.orig/include/linux/mm.h v3.14-rc2/include/linux/mm.h
index 4d0bc01de43c..144b08617957 100644
--- v3.14-rc2.orig/include/linux/mm.h
+++ v3.14-rc2/include/linux/mm.h
@@ -1104,6 +1104,7 @@ struct mm_walk {
 
 int walk_page_range(unsigned long addr, unsigned long end,
 		struct mm_walk *walk);
+int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk);
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
diff --git v3.14-rc2.orig/mm/pagewalk.c v3.14-rc2/mm/pagewalk.c
index 4770558feea8..2a88dfa58af6 100644
--- v3.14-rc2.orig/mm/pagewalk.c
+++ v3.14-rc2/mm/pagewalk.c
@@ -334,3 +334,21 @@ int walk_page_range(unsigned long start, unsigned long end,
 	} while (start = next, start < end);
 	return err;
 }
+
+int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk)
+{
+	int err;
+
+	if (!walk->mm)
+		return -EINVAL;
+
+	VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem));
+	VM_BUG_ON(!vma);
+	walk->vma = vma;
+	err = walk_page_test(vma->vm_start, vma->vm_end, walk);
+	if (skip_lower_level_walking(walk))
+		return 0;
+	if (err)
+		return err;
+	return __walk_page_range(vma->vm_start, vma->vm_end, walk);
+}
-- 
1.8.5.3


* [PATCH 03/11] smaps: redefine callback functions for page table walker
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 01/11] pagewalk: update page table walker core Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 02/11] pagewalk: add walk_page_vma() Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 04/11] clear_refs: " Naoya Horiguchi
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

smaps_pte_range(), hooked to pmd_entry(), does both the pmd loop and the
pte loop. So this patch moves the pte part into smaps_pte() on pte_entry(),
as the name suggests.

ChangeLog v2:
- rebase onto mmots

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/task_mmu.c | 47 +++++++++++++++++------------------------------
 1 file changed, 17 insertions(+), 30 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index fb52b548080d..62eedbe50733 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -423,7 +423,6 @@ const struct file_operations proc_tid_maps_operations = {
 
 #ifdef CONFIG_PROC_PAGE_MONITOR
 struct mem_size_stats {
-	struct vm_area_struct *vma;
 	unsigned long resident;
 	unsigned long shared_clean;
 	unsigned long shared_dirty;
@@ -437,15 +436,16 @@ struct mem_size_stats {
 	u64 pss;
 };
 
-
-static void smaps_pte_entry(pte_t ptent, unsigned long addr,
-		unsigned long ptent_size, struct mm_walk *walk)
+static int smaps_pte(pte_t *pte, unsigned long addr, unsigned long end,
+			struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
-	struct vm_area_struct *vma = mss->vma;
+	struct vm_area_struct *vma = walk->vma;
 	pgoff_t pgoff = linear_page_index(vma, addr);
 	struct page *page = NULL;
 	int mapcount;
+	pte_t ptent = *pte;
+	unsigned long ptent_size = end - addr;
 
 	if (pte_present(ptent)) {
 		page = vm_normal_page(vma, addr, ptent);
@@ -462,7 +462,7 @@ static void smaps_pte_entry(pte_t ptent, unsigned long addr,
 	}
 
 	if (!page)
-		return;
+		return 0;
 
 	if (PageAnon(page))
 		mss->anonymous += ptent_size;
@@ -488,35 +488,22 @@ static void smaps_pte_entry(pte_t ptent, unsigned long addr,
 			mss->private_clean += ptent_size;
 		mss->pss += (ptent_size << PSS_SHIFT);
 	}
+	return 0;
 }
 
-static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			   struct mm_walk *walk)
+static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+			struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
-	struct vm_area_struct *vma = mss->vma;
-	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
-		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
+	if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) {
+		smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk);
 		spin_unlock(ptl);
 		mss->anonymous_thp += HPAGE_PMD_SIZE;
-		return 0;
+		/* don't call smaps_pte() */
+		walk->skip = 1;
 	}
-
-	if (pmd_trans_unstable(pmd))
-		return 0;
-	/*
-	 * The mmap_sem held all the way back in m_start() is what
-	 * keeps khugepaged out of here and from collapsing things
-	 * in here.
-	 */
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	for (; addr != end; pte++, addr += PAGE_SIZE)
-		smaps_pte_entry(*pte, addr, PAGE_SIZE, walk);
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
 	return 0;
 }
 
@@ -581,16 +568,16 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 	struct vm_area_struct *vma = v;
 	struct mem_size_stats mss;
 	struct mm_walk smaps_walk = {
-		.pmd_entry = smaps_pte_range,
+		.pmd_entry = smaps_pmd,
+		.pte_entry = smaps_pte,
 		.mm = vma->vm_mm,
+		.vma = vma,
 		.private = &mss,
 	};
 
 	memset(&mss, 0, sizeof mss);
-	mss.vma = vma;
 	/* mmap_sem is held in m_start */
-	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
-		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
+	walk_page_vma(vma, &smaps_walk);
 
 	show_map_vma(m, vma, is_pid);
 
-- 
1.8.5.3


* [PATCH 04/11] clear_refs: redefine callback functions for page table walker
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2014-02-10 21:44 ` [PATCH 03/11] smaps: redefine callback functions for page table walker Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 05/11] pagemap: " Naoya Horiguchi
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

Currently clear_refs_pte_range() is connected to pmd_entry() and splits a
thp if it finds one. But this work can now be done by the core page table
walker code, so there is no reason to keep this callback on pmd_entry().
This patch moves the pte handling code to a pte_entry() callback.

clear_refs_write() has some prechecks that decide whether we really walk
over a given vma. It is fine to have the test_walk() callback do them, so
let's define one.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/task_mmu.c | 82 ++++++++++++++++++++++--------------------------------
 1 file changed, 33 insertions(+), 49 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index 62eedbe50733..8ecae2f55a97 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -698,7 +698,6 @@ enum clear_refs_types {
 };
 
 struct clear_refs_private {
-	struct vm_area_struct *vma;
 	enum clear_refs_types type;
 };
 
@@ -730,41 +729,43 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 #endif
 }
 
-static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
+static int clear_refs_pte(pte_t *pte, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 {
 	struct clear_refs_private *cp = walk->private;
-	struct vm_area_struct *vma = cp->vma;
-	pte_t *pte, ptent;
-	spinlock_t *ptl;
+	struct vm_area_struct *vma = walk->vma;
 	struct page *page;
 
-	split_huge_page_pmd(vma, addr, pmd);
-	if (pmd_trans_unstable(pmd))
+	if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
+		clear_soft_dirty(vma, addr, pte);
 		return 0;
+	}
+	if (!pte_present(*pte))
+		return 0;
+	page = vm_normal_page(vma, addr, *pte);
+	if (!page)
+		return 0;
+	/* Clear accessed and referenced bits. */
+	ptep_test_and_clear_young(vma, addr, pte);
+	ClearPageReferenced(page);
+	return 0;
+}
 
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	for (; addr != end; pte++, addr += PAGE_SIZE) {
-		ptent = *pte;
-
-		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
-			clear_soft_dirty(vma, addr, pte);
-			continue;
-		}
-
-		if (!pte_present(ptent))
-			continue;
-
-		page = vm_normal_page(vma, addr, ptent);
-		if (!page)
-			continue;
+static int clear_refs_test_walk(unsigned long start, unsigned long end,
+				struct mm_walk *walk)
+{
+	struct clear_refs_private *cp = walk->private;
+	struct vm_area_struct *vma = walk->vma;
 
-		/* Clear accessed and referenced bits. */
-		ptep_test_and_clear_young(vma, addr, pte);
-		ClearPageReferenced(page);
-	}
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
+	/*
+	 * Writing 1 to /proc/pid/clear_refs affects all pages.
+	 * Writing 2 to /proc/pid/clear_refs only affects anonymous pages.
+	 * Writing 3 to /proc/pid/clear_refs only affects file mapped pages.
+	 */
+	if (cp->type == CLEAR_REFS_ANON && vma->vm_file)
+		walk->skip = 1;
+	if (cp->type == CLEAR_REFS_MAPPED && !vma->vm_file)
+		walk->skip = 1;
 	return 0;
 }
 
@@ -806,33 +807,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.type = type,
 		};
 		struct mm_walk clear_refs_walk = {
-			.pmd_entry = clear_refs_pte_range,
+			.pte_entry = clear_refs_pte,
+			.test_walk = clear_refs_test_walk,
 			.mm = mm,
 			.private = &cp,
 		};
 		down_read(&mm->mmap_sem);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
 			mmu_notifier_invalidate_range_start(mm, 0, -1);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			cp.vma = vma;
-			if (is_vm_hugetlb_page(vma))
-				continue;
-			/*
-			 * Writing 1 to /proc/pid/clear_refs affects all pages.
-			 *
-			 * Writing 2 to /proc/pid/clear_refs only affects
-			 * Anonymous pages.
-			 *
-			 * Writing 3 to /proc/pid/clear_refs only affects file
-			 * mapped pages.
-			 */
-			if (type == CLEAR_REFS_ANON && vma->vm_file)
-				continue;
-			if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
-				continue;
-			walk_page_range(vma->vm_start, vma->vm_end,
-					&clear_refs_walk);
-		}
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			walk_page_vma(vma, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
 			mmu_notifier_invalidate_range_end(mm, 0, -1);
 		flush_tlb_mm(mm);
-- 
1.8.5.3


* [PATCH 05/11] pagemap: redefine callback functions for page table walker
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2014-02-10 21:44 ` [PATCH 04/11] clear_refs: " Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 06/11] numa_maps: " Naoya Horiguchi
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

pagemap_pte_range(), hooked to pmd_entry(), does both the pmd loop and
the pte loop. So this patch moves the pte part into pagemap_pte() on
pte_entry().

We remove the VM_SOFTDIRTY check in pagemap_pte_range(), because in the
new page table walker we call __walk_page_range() for each vma separately,
so we never see multiple vmas in a single pgd/pud/pmd/pte loop.

ChangeLog v2:
- remove cond_resched() (moved it to walk_hugetlb_range())
- rebase onto mmots

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/task_mmu.c | 76 ++++++++++++++++++++----------------------------------
 1 file changed, 28 insertions(+), 48 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index 8ecae2f55a97..7ed7c88f0687 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -957,19 +957,33 @@ static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemap
 }
 #endif
 
-static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+static int pagemap_pte(pte_t *pte, unsigned long addr, unsigned long end,
 			     struct mm_walk *walk)
 {
-	struct vm_area_struct *vma;
+	struct vm_area_struct *vma = walk->vma;
 	struct pagemapread *pm = walk->private;
-	spinlock_t *ptl;
-	pte_t *pte;
+	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+
+	if (vma && vma->vm_start <= addr && end <= vma->vm_end) {
+		pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
+		/* unmap before userspace copy */
+		pte_unmap(pte);
+	}
+	return add_to_pagemap(addr, &pme, pm);
+}
+
+static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
 	int err = 0;
+	struct vm_area_struct *vma = walk->vma;
+	struct pagemapread *pm = walk->private;
 	pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+	spinlock_t *ptl;
 
-	/* find the first VMA at or above 'addr' */
-	vma = find_vma(walk->mm, addr);
-	if (vma && pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (!vma)
+		return err;
+	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		int pmd_flags2;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -988,41 +1002,9 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 				break;
 		}
 		spin_unlock(ptl);
-		return err;
-	}
-
-	if (pmd_trans_unstable(pmd))
-		return 0;
-	for (; addr != end; addr += PAGE_SIZE) {
-		int flags2;
-
-		/* check to see if we've left 'vma' behind
-		 * and need a new, higher one */
-		if (vma && (addr >= vma->vm_end)) {
-			vma = find_vma(walk->mm, addr);
-			if (vma && (vma->vm_flags & VM_SOFTDIRTY))
-				flags2 = __PM_SOFT_DIRTY;
-			else
-				flags2 = 0;
-			pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2));
-		}
-
-		/* check that 'vma' actually covers this address,
-		 * and that it isn't a huge page vma */
-		if (vma && (vma->vm_start <= addr) &&
-		    !is_vm_hugetlb_page(vma)) {
-			pte = pte_offset_map(pmd, addr);
-			pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
-			/* unmap before userspace copy */
-			pte_unmap(pte);
-		}
-		err = add_to_pagemap(addr, &pme, pm);
-		if (err)
-			return err;
+		/* don't call pagemap_pte() */
+		walk->skip = 1;
 	}
-
-	cond_resched();
-
 	return err;
 }
 
@@ -1045,12 +1027,11 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 				 struct mm_walk *walk)
 {
 	struct pagemapread *pm = walk->private;
-	struct vm_area_struct *vma;
+	struct vm_area_struct *vma = walk->vma;
 	int err = 0;
 	int flags2;
 	pagemap_entry_t pme;
 
-	vma = find_vma(walk->mm, addr);
 	WARN_ON_ONCE(!vma);
 
 	if (vma && (vma->vm_flags & VM_SOFTDIRTY))
@@ -1058,6 +1039,7 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 	else
 		flags2 = 0;
 
+	hmask = huge_page_mask(hstate_vma(vma));
 	for (; addr != end; addr += PAGE_SIZE) {
 		int offset = (addr & ~hmask) >> PAGE_SHIFT;
 		huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2);
@@ -1065,9 +1047,6 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 		if (err)
 			return err;
 	}
-
-	cond_resched();
-
 	return err;
 }
 #endif /* HUGETLB_PAGE */
@@ -1134,10 +1113,11 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 	if (!mm || IS_ERR(mm))
 		goto out_free;
 
-	pagemap_walk.pmd_entry = pagemap_pte_range;
+	pagemap_walk.pte_entry = pagemap_pte;
+	pagemap_walk.pmd_entry = pagemap_pmd;
 	pagemap_walk.pte_hole = pagemap_pte_hole;
 #ifdef CONFIG_HUGETLB_PAGE
-	pagemap_walk.hugetlb_entry = pagemap_hugetlb_range;
+	pagemap_walk.hugetlb_entry = pagemap_hugetlb;
 #endif
 	pagemap_walk.mm = mm;
 	pagemap_walk.private = &pm;
-- 
1.8.5.3


* [PATCH 06/11] numa_maps: redefine callback functions for page table walker
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2014-02-10 21:44 ` [PATCH 05/11] pagemap: " Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 07/11] memcg: " Naoya Horiguchi
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

gather_pte_stats(), hooked to pmd_entry(), does both the pmd loop and the
pte loop. So this patch moves the pte part into a pte_entry() callback.

ChangeLog v2:
- rebase onto mmots

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/task_mmu.c | 54 ++++++++++++++++++++++++++----------------------------
 1 file changed, 26 insertions(+), 28 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index 7ed7c88f0687..8b23bbcc5e04 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -1193,7 +1193,6 @@ const struct file_operations proc_pagemap_operations = {
 #ifdef CONFIG_NUMA
 
 struct numa_maps {
-	struct vm_area_struct *vma;
 	unsigned long pages;
 	unsigned long anon;
 	unsigned long active;
@@ -1259,43 +1258,41 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
 	return page;
 }
 
-static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
+static int gather_pte_stats(pte_t *pte, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
-	struct numa_maps *md;
-	spinlock_t *ptl;
-	pte_t *orig_pte;
-	pte_t *pte;
+	struct numa_maps *md = walk->private;
 
-	md = walk->private;
+	struct page *page = can_gather_numa_stats(*pte, walk->vma, addr);
+	if (!page)
+		return 0;
+	gather_stats(page, md, pte_dirty(*pte), 1);
+	return 0;
+}
+
+static int gather_pmd_stats(pmd_t *pmd, unsigned long addr,
+		unsigned long end, struct mm_walk *walk)
+{
+	struct numa_maps *md = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, md->vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		pte_t huge_pte = *(pte_t *)pmd;
 		struct page *page;
 
-		page = can_gather_numa_stats(huge_pte, md->vma, addr);
+		page = can_gather_numa_stats(huge_pte, vma, addr);
 		if (page)
 			gather_stats(page, md, pte_dirty(huge_pte),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(ptl);
-		return 0;
+		/* don't call gather_pte_stats() */
+		walk->skip = 1;
 	}
-
-	if (pmd_trans_unstable(pmd))
-		return 0;
-	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-	do {
-		struct page *page = can_gather_numa_stats(*pte, md->vma, addr);
-		if (!page)
-			continue;
-		gather_stats(page, md, pte_dirty(*pte), 1);
-
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(orig_pte, ptl);
 	return 0;
 }
 #ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
 		unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
 	struct numa_maps *md;
@@ -1314,7 +1311,7 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
 }
 
 #else
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
 		unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
 	return 0;
@@ -1344,12 +1341,12 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	/* Ensure we start with an empty set of numa_maps statistics. */
 	memset(md, 0, sizeof(*md));
 
-	md->vma = vma;
-
-	walk.hugetlb_entry = gather_hugetbl_stats;
-	walk.pmd_entry = gather_pte_stats;
+	walk.hugetlb_entry = gather_hugetlb_stats;
+	walk.pmd_entry = gather_pmd_stats;
+	walk.pte_entry = gather_pte_stats;
 	walk.private = md;
 	walk.mm = mm;
+	walk.vma = vma;
 
 	pol = get_vma_policy(task, vma, vma->vm_start);
 	mpol_to_str(buffer, sizeof(buffer), pol);
@@ -1380,6 +1377,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	if (is_vm_hugetlb_page(vma))
 		seq_printf(m, " huge");
 
+	/* mmap_sem is held by m_start */
 	walk_page_range(vma->vm_start, vma->vm_end, &walk);
 
 	if (!md->pages)
-- 
1.8.5.3


* [PATCH 07/11] memcg: redefine callback functions for page table walker
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
                   ` (5 preceding siblings ...)
  2014-02-10 21:44 ` [PATCH 06/11] numa_maps: " Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 08/11] madvise: " Naoya Horiguchi
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

Move the code around the pte loop in mem_cgroup_count_precharge_pte_range()
into mem_cgroup_count_precharge_pte(), hooked to pte_entry().

We don't change the callback mem_cgroup_move_charge_pte_range() for now,
because we can't do the same replacement easily due to its 'goto retry'.

ChangeLog v2:
- rebase onto mmots

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memcontrol.c | 71 ++++++++++++++++++++++-----------------------------------
 1 file changed, 27 insertions(+), 44 deletions(-)

diff --git v3.14-rc2.orig/mm/memcontrol.c v3.14-rc2/mm/memcontrol.c
index 53385cd4e6f0..a2083c24af63 100644
--- v3.14-rc2.orig/mm/memcontrol.c
+++ v3.14-rc2/mm/memcontrol.c
@@ -6900,30 +6900,29 @@ static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 }
 #endif
 
-static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
+static int mem_cgroup_count_precharge_pte(pte_t *pte,
 					unsigned long addr, unsigned long end,
 					struct mm_walk *walk)
 {
-	struct vm_area_struct *vma = walk->private;
-	pte_t *pte;
+	if (get_mctgt_type(walk->vma, addr, *pte, NULL))
+		mc.precharge++;	/* increment precharge temporarily */
+	return 0;
+}
+
+static int mem_cgroup_count_precharge_pmd(pmd_t *pmd,
+					unsigned long addr, unsigned long end,
+					struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
 	spinlock_t *ptl;
 
 	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
-		return 0;
+		/* don't call mem_cgroup_count_precharge_pte() */
+		walk->skip = 1;
 	}
-
-	if (pmd_trans_unstable(pmd))
-		return 0;
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	for (; addr != end; pte++, addr += PAGE_SIZE)
-		if (get_mctgt_type(vma, addr, *pte, NULL))
-			mc.precharge++;	/* increment precharge temporarily */
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
-
 	return 0;
 }
 
@@ -6932,18 +6931,14 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 	unsigned long precharge;
 	struct vm_area_struct *vma;
 
+	struct mm_walk mem_cgroup_count_precharge_walk = {
+		.pmd_entry = mem_cgroup_count_precharge_pmd,
+		.pte_entry = mem_cgroup_count_precharge_pte,
+		.mm = mm,
+	};
 	down_read(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
-		struct mm_walk mem_cgroup_count_precharge_walk = {
-			.pmd_entry = mem_cgroup_count_precharge_pte_range,
-			.mm = mm,
-			.private = vma,
-		};
-		if (is_vm_hugetlb_page(vma))
-			continue;
-		walk_page_range(vma->vm_start, vma->vm_end,
-					&mem_cgroup_count_precharge_walk);
-	}
+	for (vma = mm->mmap; vma; vma = vma->vm_next)
+		walk_page_vma(vma, &mem_cgroup_count_precharge_walk);
 	up_read(&mm->mmap_sem);
 
 	precharge = mc.precharge;
@@ -7082,7 +7077,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				struct mm_walk *walk)
 {
 	int ret = 0;
-	struct vm_area_struct *vma = walk->private;
+	struct vm_area_struct *vma = walk->vma;
 	pte_t *pte;
 	spinlock_t *ptl;
 	enum mc_target_type target_type;
@@ -7183,6 +7178,10 @@ put:			/* get_mctgt_type() gets the page */
 static void mem_cgroup_move_charge(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	struct mm_walk mem_cgroup_move_charge_walk = {
+		.pmd_entry = mem_cgroup_move_charge_pte_range,
+		.mm = mm,
+	};
 
 	lru_add_drain_all();
 retry:
@@ -7198,24 +7197,8 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 		cond_resched();
 		goto retry;
 	}
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
-		int ret;
-		struct mm_walk mem_cgroup_move_charge_walk = {
-			.pmd_entry = mem_cgroup_move_charge_pte_range,
-			.mm = mm,
-			.private = vma,
-		};
-		if (is_vm_hugetlb_page(vma))
-			continue;
-		ret = walk_page_range(vma->vm_start, vma->vm_end,
-						&mem_cgroup_move_charge_walk);
-		if (ret)
-			/*
-			 * means we have consumed all precharges and failed in
-			 * doing additional charge. Just abandon here.
-			 */
-			break;
-	}
+	for (vma = mm->mmap; vma; vma = vma->vm_next)
+		walk_page_vma(vma, &mem_cgroup_move_charge_walk);
 	up_read(&mm->mmap_sem);
 }
 
-- 
1.8.5.3


* [PATCH 08/11] madvise: redefine callback functions for page table walker
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
                   ` (6 preceding siblings ...)
  2014-02-10 21:44 ` [PATCH 07/11] memcg: " Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-03-21  1:47   ` Sasha Levin
  2014-02-10 21:44 ` [PATCH 09/11] arch/powerpc/mm/subpage-prot.c: use walk_page_vma() instead of walk_page_range() Naoya Horiguchi
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

swapin_walk_pmd_entry() is defined as a pmd_entry() callback, but it has
no pmd handling code (except pmd_none_or_trans_huge_or_clear_bad(), and
the same check is now done in the core page table walk code).
So let's move this function to pte_entry() as swapin_walk_pte_entry().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/madvise.c | 43 +++++++++++++------------------------------
 1 file changed, 13 insertions(+), 30 deletions(-)

diff --git v3.14-rc2.orig/mm/madvise.c v3.14-rc2/mm/madvise.c
index 539eeb96b323..5e957b984c14 100644
--- v3.14-rc2.orig/mm/madvise.c
+++ v3.14-rc2/mm/madvise.c
@@ -135,38 +135,22 @@ static long madvise_behavior(struct vm_area_struct *vma,
 }
 
 #ifdef CONFIG_SWAP
-static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
+static int swapin_walk_pte_entry(pte_t *pte, unsigned long start,
 	unsigned long end, struct mm_walk *walk)
 {
-	pte_t *orig_pte;
-	struct vm_area_struct *vma = walk->private;
-	unsigned long index;
+	swp_entry_t entry;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
 
-	if (pmd_none_or_trans_huge_or_clear_bad(pmd))
+	if (pte_present(*pte) || pte_none(*pte) || pte_file(*pte))
 		return 0;
-
-	for (index = start; index != end; index += PAGE_SIZE) {
-		pte_t pte;
-		swp_entry_t entry;
-		struct page *page;
-		spinlock_t *ptl;
-
-		orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
-		pte = *(orig_pte + ((index - start) / PAGE_SIZE));
-		pte_unmap_unlock(orig_pte, ptl);
-
-		if (pte_present(pte) || pte_none(pte) || pte_file(pte))
-			continue;
-		entry = pte_to_swp_entry(pte);
-		if (unlikely(non_swap_entry(entry)))
-			continue;
-
-		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
-								vma, index);
-		if (page)
-			page_cache_release(page);
-	}
-
+	entry = pte_to_swp_entry(*pte);
+	if (unlikely(non_swap_entry(entry)))
+		return 0;
+	page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
+				     vma, start);
+	if (page)
+		page_cache_release(page);
 	return 0;
 }
 
@@ -175,8 +159,7 @@ static void force_swapin_readahead(struct vm_area_struct *vma,
 {
 	struct mm_walk walk = {
 		.mm = vma->vm_mm,
-		.pmd_entry = swapin_walk_pmd_entry,
-		.private = vma,
+		.pte_entry = swapin_walk_pte_entry,
 	};
 
 	walk_page_range(start, end, &walk);
-- 
1.8.5.3


* [PATCH 09/11] arch/powerpc/mm/subpage-prot.c: use walk_page_vma() instead of walk_page_range()
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
                   ` (7 preceding siblings ...)
  2014-02-10 21:44 ` [PATCH 08/11] madvise: " Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 10/11] pagewalk: remove argument hmask from hugetlb_entry() Naoya Horiguchi
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

We no longer have to use mm_walk->private to pass the vma to the callback
function, because mm_walk->vma now provides it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 arch/powerpc/mm/subpage-prot.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git v3.14-rc2.orig/arch/powerpc/mm/subpage-prot.c v3.14-rc2/arch/powerpc/mm/subpage-prot.c
index a770df2dae70..cec0af0a935f 100644
--- v3.14-rc2.orig/arch/powerpc/mm/subpage-prot.c
+++ v3.14-rc2/arch/powerpc/mm/subpage-prot.c
@@ -134,7 +134,7 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len)
 static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
 				  unsigned long end, struct mm_walk *walk)
 {
-	struct vm_area_struct *vma = walk->private;
+	struct vm_area_struct *vma = walk->vma;
 	split_huge_page_pmd(vma, addr, pmd);
 	return 0;
 }
@@ -163,9 +163,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 		if (vma->vm_start >= (addr + len))
 			break;
 		vma->vm_flags |= VM_NOHUGEPAGE;
-		subpage_proto_walk.private = vma;
-		walk_page_range(vma->vm_start, vma->vm_end,
-				&subpage_proto_walk);
+		walk_page_vma(vma, &subpage_proto_walk);
 		vma = vma->vm_next;
 	}
 }
-- 
1.8.5.3


* [PATCH 10/11] pagewalk: remove argument hmask from hugetlb_entry()
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
                   ` (8 preceding siblings ...)
  2014-02-10 21:44 ` [PATCH 09/11] arch/powerpc/mm/subpage-prot.c: use walk_page_vma() instead of walk_page_range() Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-10 21:44 ` [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range() Naoya Horiguchi
  2014-02-10 22:42 ` [PATCH 00/11 v5] update page table walker Andrew Morton
  11 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

hugetlb_entry() doesn't use the argument hmask any more,
so let's remove it now.
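
A callback that still needs the mask can derive it from the vma, as the
converted pagemap code does. A minimal sketch of the new callback shape
(foo_hugetlb() and foo_handle() are placeholder names):

  static int foo_hugetlb(pte_t *pte, unsigned long addr,
                         unsigned long end, struct mm_walk *walk)
  {
          unsigned long hmask = huge_page_mask(hstate_vma(walk->vma));
          /* offset of addr within the hugepage, as pagemap does */
          unsigned long offset = (addr & ~hmask) >> PAGE_SHIFT;

          return foo_handle(pte, offset);
  }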

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/task_mmu.c | 12 ++++++------
 include/linux/mm.h |  5 ++---
 mm/pagewalk.c      |  2 +-
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index 8b23bbcc5e04..f819d0d4a0e8 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -1022,8 +1022,7 @@ static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *
 }
 
 /* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
+static int pagemap_hugetlb(pte_t *pte, unsigned long addr, unsigned long end,
 				 struct mm_walk *walk)
 {
 	struct pagemapread *pm = walk->private;
@@ -1031,6 +1030,7 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 	int err = 0;
 	int flags2;
 	pagemap_entry_t pme;
+	unsigned long hmask;
 
 	WARN_ON_ONCE(!vma);
 
@@ -1292,8 +1292,8 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 #ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(pte_t *pte, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
 {
 	struct numa_maps *md;
 	struct page *page;
@@ -1311,8 +1311,8 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
 }
 
 #else
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(pte_t *pte, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
 {
 	return 0;
 }
diff --git v3.14-rc2.orig/include/linux/mm.h v3.14-rc2/include/linux/mm.h
index 144b08617957..7b6b596a5bf1 100644
--- v3.14-rc2.orig/include/linux/mm.h
+++ v3.14-rc2/include/linux/mm.h
@@ -1091,9 +1091,8 @@ struct mm_walk {
 			 unsigned long next, struct mm_walk *walk);
 	int (*pte_hole)(unsigned long addr, unsigned long next,
 			struct mm_walk *walk);
-	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
-			     unsigned long addr, unsigned long next,
-			     struct mm_walk *walk);
+	int (*hugetlb_entry)(pte_t *pte, unsigned long addr,
+			unsigned long next, struct mm_walk *walk);
 	int (*test_walk)(unsigned long addr, unsigned long next,
 			struct mm_walk *walk);
 	struct mm_struct *mm;
diff --git v3.14-rc2.orig/mm/pagewalk.c v3.14-rc2/mm/pagewalk.c
index 2a88dfa58af6..416e981243b1 100644
--- v3.14-rc2.orig/mm/pagewalk.c
+++ v3.14-rc2/mm/pagewalk.c
@@ -199,7 +199,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 		 * in walk->hugetlb_entry().
 		 */
 		if (pte && walk->hugetlb_entry)
-			err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+			err = walk->hugetlb_entry(pte, addr, next, walk);
 		spin_unlock(ptl);
 		if (err)
 			break;
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
                   ` (9 preceding siblings ...)
  2014-02-10 21:44 ` [PATCH 10/11] pagewalk: remove argument hmask from hugetlb_entry() Naoya Horiguchi
@ 2014-02-10 21:44 ` Naoya Horiguchi
  2014-02-21  6:30   ` Sasha Levin
  2014-02-10 22:42 ` [PATCH 00/11 v5] update page table walker Andrew Morton
  11 siblings, 1 reply; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-10 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

queue_pages_range() currently does its page table walking in its own way,
so this patch rewrites it with walk_page_range().
One difficulty was that queue_pages_range() needed to check vmas
to determine whether to queue pages from a given vma or to skip it.
Now we have the test_walk() callback in mm_walk for that purpose,
so we can do the replacement cleanly. queue_pages_test_walk()
depends not only on the current vma but also on the previous one,
so we keep the latter in queue_pages->prev.

ChangeLog v2:
- rebase onto mmots
- add VM_PFNMAP check on queue_pages_test_walk()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 255 ++++++++++++++++++++++-----------------------------------
 1 file changed, 99 insertions(+), 156 deletions(-)

diff --git v3.14-rc2.orig/mm/mempolicy.c v3.14-rc2/mm/mempolicy.c
index ae3c8f3595d4..b2155b8adbae 100644
--- v3.14-rc2.orig/mm/mempolicy.c
+++ v3.14-rc2/mm/mempolicy.c
@@ -476,140 +476,66 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags);
 
+struct queue_pages {
+	struct list_head *pagelist;
+	unsigned long flags;
+	nodemask_t *nmask;
+	struct vm_area_struct *prev;
+};
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
  */
-static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
+static int queue_pages_pte(pte_t *pte, unsigned long addr,
+			unsigned long next, struct mm_walk *walk)
 {
-	pte_t *orig_pte;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-		int nid;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page;
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
+	int nid;
 
-		if (!pte_present(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		/*
-		 * vm_normal_page() filters out zero pages, but there might
-		 * still be PageReserved pages to skip, perhaps in a VDSO.
-		 */
-		if (PageReserved(page))
-			continue;
-		nid = page_to_nid(page);
-		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-			continue;
+	if (!pte_present(*pte))
+		return 0;
+	page = vm_normal_page(vma, addr, *pte);
+	if (!page)
+		return 0;
+	/*
+	 * vm_normal_page() filters out zero pages, but there might
+	 * still be PageReserved pages to skip, perhaps in a VDSO.
+	 */
+	if (PageReserved(page))
+		return 0;
+	nid = page_to_nid(page);
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 
-		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
-			migrate_page_add(page, private, flags);
-		else
-			break;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(orig_pte, ptl);
-	return addr != end;
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_page_add(page, qp->pagelist, flags);
+	return 0;
 }
 
-static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
-		pmd_t *pmd, const nodemask_t *nodes, unsigned long flags,
-				    void *private)
+static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
+				unsigned long next, struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
 	int nid;
 	struct page *page;
-	spinlock_t *ptl;
 
-	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
-	page = pte_page(huge_ptep_get((pte_t *)pmd));
+	page = pte_page(huge_ptep_get(pte));
 	nid = page_to_nid(page);
-	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-		goto unlock;
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
-		isolate_huge_page(page, private);
-unlock:
-	spin_unlock(ptl);
+		isolate_huge_page(page, qp->pagelist);
 #else
 	BUG();
 #endif
-}
-
-static inline int queue_pages_pmd_range(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pmd_t *pmd;
-	unsigned long next;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		if (!pmd_present(*pmd))
-			continue;
-		if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
-			queue_pages_hugetlb_pmd_range(vma, pmd, nodes,
-						flags, private);
-			continue;
-		}
-		split_huge_page_pmd(vma, addr, pmd);
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-			continue;
-		if (queue_pages_pte_range(vma, pmd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pmd++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pud_t *pud;
-	unsigned long next;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_huge(*pud) && is_vm_hugetlb_page(vma))
-			continue;
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		if (queue_pages_pmd_range(vma, pud, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pud++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pgd_range(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pgd_t *pgd;
-	unsigned long next;
-
-	pgd = pgd_offset(vma->vm_mm, addr);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		if (queue_pages_pud_range(vma, pgd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pgd++, addr = next, addr != end);
 	return 0;
 }
 
@@ -642,6 +568,45 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+static int queue_pages_test_walk(unsigned long start, unsigned long end,
+				struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct queue_pages *qp = walk->private;
+	unsigned long endvma = vma->vm_end;
+	unsigned long flags = qp->flags;
+
+	if (endvma > end)
+		endvma = end;
+	if (vma->vm_start > start)
+		start = vma->vm_start;
+
+	if (!(flags & MPOL_MF_DISCONTIG_OK)) {
+		if (!vma->vm_next && vma->vm_end < end)
+			return -EFAULT;
+		if (qp->prev && qp->prev->vm_end < vma->vm_start)
+			return -EFAULT;
+	}
+
+	qp->prev = vma;
+	walk->skip = 1;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 0;
+
+	if (flags & MPOL_MF_LAZY) {
+		change_prot_numa(vma, start, endvma);
+		return 0;
+	}
+
+	if ((flags & MPOL_MF_STRICT) ||
+	    ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
+	     vma_migratable(vma)))
+		/* queue pages from current vma */
+		walk->skip = 0;
+	return 0;
+}
+
 /*
  * Walk through page tables and collect pages to be migrated.
  *
@@ -651,51 +616,29 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
  */
 static struct vm_area_struct *
 queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags, void *private)
+		nodemask_t *nodes, unsigned long flags,
+		struct list_head *pagelist)
 {
 	int err;
-	struct vm_area_struct *first, *vma, *prev;
-
-
-	first = find_vma(mm, start);
-	if (!first)
-		return ERR_PTR(-EFAULT);
-	prev = NULL;
-	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
-		unsigned long endvma = vma->vm_end;
-
-		if (endvma > end)
-			endvma = end;
-		if (vma->vm_start > start)
-			start = vma->vm_start;
-
-		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
-			if (!vma->vm_next && vma->vm_end < end)
-				return ERR_PTR(-EFAULT);
-			if (prev && prev->vm_end < vma->vm_start)
-				return ERR_PTR(-EFAULT);
-		}
-
-		if (flags & MPOL_MF_LAZY) {
-			change_prot_numa(vma, start, endvma);
-			goto next;
-		}
-
-		if ((flags & MPOL_MF_STRICT) ||
-		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-		      vma_migratable(vma))) {
-
-			err = queue_pages_pgd_range(vma, start, endvma, nodes,
-						flags, private);
-			if (err) {
-				first = ERR_PTR(err);
-				break;
-			}
-		}
-next:
-		prev = vma;
-	}
-	return first;
+	struct queue_pages qp = {
+		.pagelist = pagelist,
+		.flags = flags,
+		.nmask = nodes,
+		.prev = NULL,
+	};
+	struct mm_walk queue_pages_walk = {
+		.hugetlb_entry = queue_pages_hugetlb,
+		.pte_entry = queue_pages_pte,
+		.test_walk = queue_pages_test_walk,
+		.mm = mm,
+		.private = &qp,
+	};
+
+	err = walk_page_range(start, end, &queue_pages_walk);
+	if (err < 0)
+		return ERR_PTR(err);
+	else
+		return find_vma(mm, start);
 }
 
 /*
-- 
1.8.5.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/11 v5] update page table walker
  2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
                   ` (10 preceding siblings ...)
  2014-02-10 21:44 ` [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range() Naoya Horiguchi
@ 2014-02-10 22:42 ` Andrew Morton
  11 siblings, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2014-02-10 22:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

On Mon, 10 Feb 2014 16:44:25 -0500 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> This is ver.5 of page table walker patchset.

   text    data     bss     dec     hex filename
 882373  264146  757256 1903775  1d0c9f mm/built-in.o (before)
 881205  264146  757128 1902479  1d078f mm/built-in.o (after)

That worked.  But it adds 15 lines to mm/*.[ch] ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
  2014-02-10 21:44 ` [PATCH 01/11] pagewalk: update page table walker core Naoya Horiguchi
@ 2014-02-12  5:39   ` Joonsoo Kim
  2014-02-12 15:40     ` Naoya Horiguchi
  2014-02-20 23:47   ` Sasha Levin
  2014-06-02 23:49   ` Dave Hansen
  2 siblings, 1 reply; 37+ messages in thread
From: Joonsoo Kim @ 2014-02-12  5:39 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Matt Mackall, Cliff Wickman,
	KOSAKI Motohiro, Johannes Weiner, KAMEZAWA Hiroyuki,
	Michal Hocko, Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel,
	kirill.shutemov, linux-kernel

On Mon, Feb 10, 2014 at 04:44:26PM -0500, Naoya Horiguchi wrote:
> This patch updates mm/pagewalk.c to make code less complex and more maintenable.
> The basic idea is unchanged and there's no userspace visible effect.
> 
> Most of existing callback functions need access to vma to handle each entry.
> So we had better add a new member vma in struct mm_walk instead of using
> mm_walk->private, which makes code simpler.
> 
> One problem in current page table walker is that we check vma in pgd loop.
> Historically this was introduced to support hugetlbfs in the strange manner.
> It's better and cleaner to do the vma check outside pgd loop.
> 
> Another problem is that many users of page table walker now use only
> pmd_entry(), although it does both pmd-walk and pte-walk. This makes code
> duplication and fluctuation among callers, which worsens the maintenability.
> 
> One difficulty of code sharing is that the callers want to determine
> whether they try to walk over a specific vma or not in their own way.
> To solve this, this patch introduces test_walk() callback.
> 
> When we try to use multiple callbacks in different levels, skip control is
> also important. For example we have thp enabled in normal configuration, and
> we are interested in doing some work for a thp. But sometimes we want to
> split it and handle as normal pages, and in another time user would handle
> both at pmd level and pte level.
> What we need is that when we've done pmd_entry() we want to decide whether
> to go down to pte level handling based on the pmd_entry()'s result. So this
> patch introduces a skip control flag in mm_walk.
> We can't use the returned value for this purpose, because we already
> defined the meaning of whole range of returned values (>0 is to terminate
> page table walk in caller's specific manner, =0 is to continue to walk,
> and <0 is to abort the walk in the general manner.)
> 
> ChangeLog v5:
> - fix build error ("mm/pagewalk.c:201: error: 'hmask' undeclared")
> 
> ChangeLog v4:
> - add more comment
> - remove verbose variable in walk_page_test()
> - rename skip_check to skip_lower_level_walking
> - rebased onto mmotm-2014-01-09-16-23
> 
> ChangeLog v3:
> - rebased onto v3.13-rc3-mmots-2013-12-10-16-38
> 
> ChangeLog v2:
> - rebase onto mmots
> - add pte_none() check in walk_pte_range()
> - add cond_sched() in walk_hugetlb_range()
> - add skip_check()
> - do VM_PFNMAP check only when ->test_walk() is not defined (because some
>   caller could handle VM_PFNMAP vma. copy_page_range() is an example.)
> - use do-while condition (addr < end) instead of (addr != end)
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  include/linux/mm.h |  18 ++-
>  mm/pagewalk.c      | 352 +++++++++++++++++++++++++++++++++--------------------
>  2 files changed, 235 insertions(+), 135 deletions(-)
> 
> diff --git v3.14-rc2.orig/include/linux/mm.h v3.14-rc2/include/linux/mm.h
> index f28f46eade6a..4d0bc01de43c 100644
> --- v3.14-rc2.orig/include/linux/mm.h
> +++ v3.14-rc2/include/linux/mm.h
> @@ -1067,10 +1067,18 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>   * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
>   * @pte_hole: if set, called for each hole at all levels
>   * @hugetlb_entry: if set, called for each hugetlb entry
> - *		   *Caution*: The caller must hold mmap_sem() if @hugetlb_entry
> - * 			      is used.
> + * @test_walk: caller specific callback function to determine whether
> + *             we walk over the current vma or not. A positive returned
> + *             value means "do page table walk over the current vma,"
> + *             and a negative one means "abort current page table walk
> + *             right now." 0 means "skip the current vma."
> + * @mm:        mm_struct representing the target process of page table walk
> + * @vma:       vma currently walked
> + * @skip:      internal control flag which is set when we skip the lower
> + *             level entries.
> + * @private:   private data for callbacks' use
>   *
> - * (see walk_page_range for more details)
> + * (see the comment on walk_page_range() for more details)
>   */
>  struct mm_walk {
>  	int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
> @@ -1086,7 +1094,11 @@ struct mm_walk {
>  	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
>  			     unsigned long addr, unsigned long next,
>  			     struct mm_walk *walk);
> +	int (*test_walk)(unsigned long addr, unsigned long next,
> +			struct mm_walk *walk);
>  	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	int skip;
>  	void *private;
>  };
>  
> diff --git v3.14-rc2.orig/mm/pagewalk.c v3.14-rc2/mm/pagewalk.c
> index 2beeabf502c5..4770558feea8 100644
> --- v3.14-rc2.orig/mm/pagewalk.c
> +++ v3.14-rc2/mm/pagewalk.c
> @@ -3,29 +3,58 @@
>  #include <linux/sched.h>
>  #include <linux/hugetlb.h>
>  
> -static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> -			  struct mm_walk *walk)
> +/*
> + * Check the current skip status of page table walker.
> + *
> + * Here what I mean by skip is to skip lower level walking, and that was
> + * determined for each entry independently. For example, when walk_pmd_range
> + * handles a pmd_trans_huge we don't have to walk over ptes under that pmd,
> + * and the skipping does not affect the walking over ptes under other pmds.
> + * That's why we reset @walk->skip after tested.
> + */
> +static bool skip_lower_level_walking(struct mm_walk *walk)
> +{
> +	if (walk->skip) {
> +		walk->skip = 0;
> +		return true;
> +	}
> +	return false;
> +}
> +
> +static int walk_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
>  {
> +	struct mm_struct *mm = walk->mm;
>  	pte_t *pte;
> +	pte_t *orig_pte;
> +	spinlock_t *ptl;
>  	int err = 0;
>  
> -	pte = pte_offset_map(pmd, addr);
> -	for (;;) {
> +	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +	do {
> +		if (pte_none(*pte)) {
> +			if (walk->pte_hole)
> +				err = walk->pte_hole(addr, addr + PAGE_SIZE,
> +							walk);
> +			if (err)
> +				break;
> +			continue;

Hello, Naoya.

I know that this is too late for review, but I have some opinions about this.

How about removing the walk->pte_hole() function pointer and the related code
from the generic walker? walk->pte_hole() is only used by task_mmu.c, and
maintaining the pte_hole code only for task_mmu.c just gives us maintenance
overhead and hurts the readability of the generic code. By removing it, we
can get a simpler generic walker.

We can implement it without pte_hole() in the generic walker, roughly as below.

  walk->dont_skip_hole = 1
  if (pte_none(*pte) && !walk->dont_skip_hole)
  	  continue;

  then call the proper entry callback function, which can handle the pte_hole cases itself.
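
Something like this minimal, untested sketch ("dont_skip_hole" is a
hypothetical mm_walk member used only to illustrate the idea):

  static int walk_pte_range(pmd_t *pmd, unsigned long addr,
  				unsigned long end, struct mm_walk *walk)
  {
  	pte_t *pte, *orig_pte;
  	spinlock_t *ptl;
  	int err = 0;

  	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
  	do {
  		/* holes are silently skipped unless the caller opted in */
  		if (pte_none(*pte) && !walk->dont_skip_hole)
  			continue;
  		/*
  		 * pte_entry() is guaranteed non-NULL by the pmd-level caller;
  		 * in this scheme it must handle hole entries itself.
  		 */
  		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
  		if (err)
  			break;
  	} while (pte++, addr += PAGE_SIZE, addr < end);
  	pte_unmap_unlock(orig_pte, ptl);
  	return err;
  }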

> +		}
> +		/*
> +		 * Callers should have their own way to handle swap entries
> +		 * in walk->pte_entry().
> +		 */
>  		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
>  		if (err)
>  		       break;
> -		addr += PAGE_SIZE;
> -		if (addr == end)
> -			break;
> -		pte++;
> -	}
> -
> -	pte_unmap(pte);
> -	return err;
> +	} while (pte++, addr += PAGE_SIZE, addr < end);
> +	pte_unmap_unlock(orig_pte, ptl);
> +	cond_resched();
> +	return addr == end ? 0 : err;
>  }
>  
> -static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> -			  struct mm_walk *walk)
> +static int walk_pmd_range(pud_t *pud, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> @@ -35,6 +64,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  	do {
>  again:
>  		next = pmd_addr_end(addr, end);
> +
>  		if (pmd_none(*pmd)) {
>  			if (walk->pte_hole)
>  				err = walk->pte_hole(addr, next, walk);
> @@ -42,35 +72,32 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  				break;
>  			continue;
>  		}
> -		/*
> -		 * This implies that each ->pmd_entry() handler
> -		 * needs to know about pmd_trans_huge() pmds
> -		 */
> -		if (walk->pmd_entry)
> -			err = walk->pmd_entry(pmd, addr, next, walk);
> -		if (err)
> -			break;
>  
> -		/*
> -		 * Check this here so we only break down trans_huge
> -		 * pages when we _need_ to
> -		 */
> -		if (!walk->pte_entry)
> -			continue;
> +		if (walk->pmd_entry) {
> +			err = walk->pmd_entry(pmd, addr, next, walk);
> +			if (skip_lower_level_walking(walk))
> +				continue;
> +			if (err)
> +				break;
> +		}
>  
> -		split_huge_page_pmd_mm(walk->mm, addr, pmd);
> -		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> -			goto again;
> -		err = walk_pte_range(pmd, addr, next, walk);
> -		if (err)
> -			break;
> -	} while (pmd++, addr = next, addr != end);
> +		if (walk->pte_entry) {
> +			if (walk->vma) {
> +				split_huge_page_pmd(walk->vma, addr, pmd);
> +				if (pmd_trans_unstable(pmd))
> +					goto again;
> +			}
> +			err = walk_pte_range(pmd, addr, next, walk);
> +			if (err)
> +				break;
> +		}
> +	} while (pmd++, addr = next, addr < end);
>  
>  	return err;
>  }
>  
> -static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> -			  struct mm_walk *walk)
> +static int walk_pud_range(pgd_t *pgd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -79,6 +106,7 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  	pud = pud_offset(pgd, addr);
>  	do {
>  		next = pud_addr_end(addr, end);
> +
>  		if (pud_none_or_clear_bad(pud)) {
>  			if (walk->pte_hole)
>  				err = walk->pte_hole(addr, next, walk);
> @@ -86,13 +114,58 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
>  				break;
>  			continue;
>  		}
> -		if (walk->pud_entry)
> +
> +		if (walk->pud_entry) {
>  			err = walk->pud_entry(pud, addr, next, walk);
> -		if (!err && (walk->pmd_entry || walk->pte_entry))
> +			if (skip_lower_level_walking(walk))
> +				continue;
> +			if (err)
> +				break;

Why do you check skip_lower_level_walking() prior to the err check?
I looked through all the patches roughly and found that this doesn't cause any
problem, since err is 0 whenever walk->skip = 1. But checking err first would
be better.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
  2014-02-12  5:39   ` Joonsoo Kim
@ 2014-02-12 15:40     ` Naoya Horiguchi
  0 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-12 15:40 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Matt Mackall, Cliff Wickman,
	KOSAKI Motohiro, Johannes Weiner, KAMEZAWA Hiroyuki,
	Michal Hocko, Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel,
	kirill.shutemov, linux-kernel

Hi Joonsoo,

On Wed, Feb 12, 2014 at 02:39:56PM +0900, Joonsoo Kim wrote:
...
> > diff --git v3.14-rc2.orig/mm/pagewalk.c v3.14-rc2/mm/pagewalk.c
> > index 2beeabf502c5..4770558feea8 100644
> > --- v3.14-rc2.orig/mm/pagewalk.c
> > +++ v3.14-rc2/mm/pagewalk.c
> > @@ -3,29 +3,58 @@
> >  #include <linux/sched.h>
> >  #include <linux/hugetlb.h>
> >  
> > -static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > -			  struct mm_walk *walk)
> > +/*
> > + * Check the current skip status of page table walker.
> > + *
> > + * Here what I mean by skip is to skip lower level walking, and that was
> > + * determined for each entry independently. For example, when walk_pmd_range
> > + * handles a pmd_trans_huge we don't have to walk over ptes under that pmd,
> > + * and the skipping does not affect the walking over ptes under other pmds.
> > + * That's why we reset @walk->skip after tested.
> > + */
> > +static bool skip_lower_level_walking(struct mm_walk *walk)
> > +{
> > +	if (walk->skip) {
> > +		walk->skip = 0;
> > +		return true;
> > +	}
> > +	return false;
> > +}
> > +
> > +static int walk_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> >  {
> > +	struct mm_struct *mm = walk->mm;
> >  	pte_t *pte;
> > +	pte_t *orig_pte;
> > +	spinlock_t *ptl;
> >  	int err = 0;
> >  
> > -	pte = pte_offset_map(pmd, addr);
> > -	for (;;) {
> > +	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +	do {
> > +		if (pte_none(*pte)) {
> > +			if (walk->pte_hole)
> > +				err = walk->pte_hole(addr, addr + PAGE_SIZE,
> > +							walk);
> > +			if (err)
> > +				break;
> > +			continue;
> 
> Hello, Naoya.
> 
> I know that this is too late for review, but I have some opinion about this.
> 
> How about removing walk->pte_hole() function pointer and related code on generic
> walker? walk->pte_hole() is only used by task_mmu.c and maintaining pte_hole code
> only for task_mmu.c just give us maintanance overhead and bad readability on
> generic code. With removing it, we can get more simpler generic walker.

Yes, I think this should be possible.

> We can implement it without pte_hole() on generic walker like as below.
> 
>   walk->dont_skip_hole = 1
>   if (pte_none(*pte) && !walk->dont_skip_hole)
>   	  continue;

Currently walk->pte_hole can also be called by walk_p(g|u|m)d_range(),
so this ->dont_skip_hole switch had better be controlled by the caller
(i.e. pagemap_read()).
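
For example, the caller side could opt in roughly like this (a hypothetical
fragment: the pagemap_pte() callback and the dont_skip_hole member are only
placeholders for illustration, not existing code):

  	struct mm_walk pagemap_walk = {
  		.pte_entry = pagemap_pte,	/* must handle pte_none() itself */
  		.dont_skip_hole = 1,		/* hypothetical opt-in flag */
  		.mm = mm,
  		.private = &pm,
  	};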

>   call proper entry callback function which can handle pte_hole cases.

Yes, we can do the hole handling in each level's callbacks.

I'm now preparing the next series of cleanup patches following this patchset,
so I'll add a patch implementing this idea to it.

...
> > @@ -86,13 +114,58 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> >  				break;
> >  			continue;
> >  		}
> > -		if (walk->pud_entry)
> > +
> > +		if (walk->pud_entry) {
> >  			err = walk->pud_entry(pud, addr, next, walk);
> > -		if (!err && (walk->pmd_entry || walk->pte_entry))
> > +			if (skip_lower_level_walking(walk))
> > +				continue;
> > +			if (err)
> > +				break;
> 
> Why do you check skip_lower_level_walking() prior to err check?

No specific reason. I assumed that the callback (walk->pud_entry() in this
example) shouldn't both set walk->skip and return a non-zero value at the
same time. I'll add a comment about that.

> I look through all patches roughly and find that this doesn't cause any problem,
> since err is 0 whenver walk->skip = 1. But, checking err first would be better.

I agree, it looks safer (we can avoid misbehavior like a NULL pointer access).
I'll add it in the next patchset. Thank you very much.
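
A minimal sketch of the reordering I have in mind (untested):

  		if (walk->pud_entry) {
  			err = walk->pud_entry(pud, addr, next, walk);
  			if (err)
  				break;
  			if (skip_lower_level_walking(walk))
  				continue;
  		}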

Naoya

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
  2014-02-10 21:44 ` [PATCH 01/11] pagewalk: update page table walker core Naoya Horiguchi
  2014-02-12  5:39   ` Joonsoo Kim
@ 2014-02-20 23:47   ` Sasha Levin
  2014-02-21  3:20     ` Naoya Horiguchi
                       ` (2 more replies)
  2014-06-02 23:49   ` Dave Hansen
  2 siblings, 3 replies; 37+ messages in thread
From: Sasha Levin @ 2014-02-20 23:47 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

Hi Naoya,

This patch seems to trigger a NULL ptr deref here. I didn't have a chance to look into it yet,
but here's the spew:

[  281.650503] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[  281.651577] IP: [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
[  281.652453] PGD 40b88d067 PUD 40b88c067 PMD 0
[  281.653143] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  281.653869] Dumping ftrace buffer:
[  281.654430]    (ftrace buffer empty)
[  281.654975] Modules linked in:
[  281.655441] CPU: 4 PID: 12314 Comm: trinity-c361 Tainted: G        W 
3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
[  281.657622] task: ffff8804242ab000 ti: ffff880424348000 task.ti: ffff880424348000
[  281.658503] RIP: 0010:[<ffffffff811a31fc>]  [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
[  281.660025] RSP: 0018:ffff880424349ab8  EFLAGS: 00010002
[  281.660761] RAX: 0000000000000086 RBX: 0000000000000018 RCX: 0000000000000000
[  281.660761] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[  281.660761] RBP: ffff880424349b28 R08: 0000000000000001 R09: 0000000000000000
[  281.660761] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804242ab000
[  281.660761] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[  281.660761] FS:  00007f36534b0700(0000) GS:ffff88052bc00000(0000) knlGS:0000000000000000
[  281.660761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  281.660761] CR2: 0000000000000018 CR3: 000000040b88e000 CR4: 00000000000006e0
[  281.660761] Stack:
[  281.660761]  ffff880424349ae8 ffffffff81180695 ffff8804242ab038 0000000000000004
[  281.660761]  00000000001d8500 ffff88052bdd8500 ffff880424349b18 ffffffff81180915
[  281.660761]  ffffffff876a68b0 ffff8804242ab000 0000000000000000 0000000000000001
[  281.660761] Call Trace:
[  281.660761]  [<ffffffff81180695>] ? sched_clock_local+0x25/0x90
[  281.660761]  [<ffffffff81180915>] ? sched_clock_cpu+0xc5/0x110
[  281.660761]  [<ffffffff811a3842>] lock_acquire+0x182/0x1d0
[  281.660761]  [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
[  281.660761]  [<ffffffff811a3daa>] ? __lock_release+0x1da/0x1f0
[  281.660761]  [<ffffffff8438ae5b>] _raw_spin_lock+0x3b/0x70
[  281.660761]  [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
[  281.660761]  [<ffffffff812990d8>] walk_pte_range+0xb8/0x170
[  281.660761]  [<ffffffff812993a1>] walk_pmd_range+0x211/0x240
[  281.660761]  [<ffffffff812994fb>] walk_pud_range+0x12b/0x160
[  281.660761]  [<ffffffff81299639>] walk_pgd_range+0x109/0x140
[  281.660761]  [<ffffffff812996a5>] __walk_page_range+0x35/0x40
[  281.660761]  [<ffffffff81299862>] walk_page_range+0xf2/0x130
[  281.660761]  [<ffffffff812a8ccc>] queue_pages_range+0x6c/0x90
[  281.660761]  [<ffffffff812a8d80>] ? queue_pages_hugetlb+0x90/0x90
[  281.660761]  [<ffffffff812a8cf0>] ? queue_pages_range+0x90/0x90
[  281.660761]  [<ffffffff812a8f50>] ? change_prot_numa+0x30/0x30
[  281.660761]  [<ffffffff812ac9f1>] do_mbind+0x311/0x330
[  281.660761]  [<ffffffff811815c1>] ? vtime_account_user+0x91/0xa0
[  281.660761]  [<ffffffff8124f1a8>] ? context_tracking_user_exit+0xa8/0x1c0
[  281.660761]  [<ffffffff812aca99>] SYSC_mbind+0x89/0xb0
[  281.660761]  [<ffffffff812acac9>] SyS_mbind+0x9/0x10
[  281.660761]  [<ffffffff84395360>] tracesys+0xdd/0xe2
[  281.660761] Code: c2 04 47 49 85 be fa 0b 00 00 48 c7 c7 bb 85 49 85 e8 d9 7b f9 ff 31 c0 e9 9c 
04 00 00 66 90 44 8b 1d a9 b8 ac 04 45 85 db 74 0c <48> 81 3b 40 61 3f 87 75 06 0f 1f 00 45 31 c0 83 
fe 01 77 0c 89
[  281.660761] RIP  [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
[  281.660761]  RSP <ffff880424349ab8>
[  281.660761] CR2: 0000000000000018
[  281.660761] ---[ end trace b6e188d329664196 ]---

Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
  2014-02-20 23:47   ` Sasha Levin
@ 2014-02-21  3:20     ` Naoya Horiguchi
  2014-02-21  4:30     ` Sasha Levin
       [not found]     ` <5306c629.012ce50a.6c48.ffff9844SMTPIN_ADDED_BROKEN@mx.google.com>
  2 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-21  3:20 UTC (permalink / raw)
  To: levinsasha928
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

Hi Sasha,

On Thu, Feb 20, 2014 at 06:47:56PM -0500, Sasha Levin wrote:
> Hi Naoya,
> 
> This patch seems to trigger a NULL ptr deref here. I didn't have a change to look into it yet
> but here's the spew:

Thanks for reporting.
I can't tell from the kernel message what caused this bug, but my guess is
that the NULL pointer is dereferenced deep inside the lockdep routine
__lock_acquire(), so if we find out which pointer was NULL, it might help us
bisect where the problem is (page table walker, lockdep, or both).

BTW, just out of curiosity: in my build environment many kernel functions
are inlined, so they should not show up in the kernel message. But in your
report we can see symbols like walk_pte_range() and __lock_acquire(), which
never appear in my kernel. How did you do it? I turned off
CONFIG_OPTIMIZE_INLINING, but that didn't do it.

Thanks,
Naoya Horiguchi

> 
> [  281.650503] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> [  281.651577] IP: [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
> [  281.652453] PGD 40b88d067 PUD 40b88c067 PMD 0
> [  281.653143] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  281.653869] Dumping ftrace buffer:
> [  281.654430]    (ftrace buffer empty)
> [  281.654975] Modules linked in:
> [  281.655441] CPU: 4 PID: 12314 Comm: trinity-c361 Tainted: G
> W 3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
> [  281.657622] task: ffff8804242ab000 ti: ffff880424348000 task.ti: ffff880424348000
> [  281.658503] RIP: 0010:[<ffffffff811a31fc>]  [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
> [  281.660025] RSP: 0018:ffff880424349ab8  EFLAGS: 00010002
> [  281.660761] RAX: 0000000000000086 RBX: 0000000000000018 RCX: 0000000000000000
> [  281.660761] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
> [  281.660761] RBP: ffff880424349b28 R08: 0000000000000001 R09: 0000000000000000
> [  281.660761] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804242ab000
> [  281.660761] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> [  281.660761] FS:  00007f36534b0700(0000) GS:ffff88052bc00000(0000) knlGS:0000000000000000
> [  281.660761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  281.660761] CR2: 0000000000000018 CR3: 000000040b88e000 CR4: 00000000000006e0
> [  281.660761] Stack:
> [  281.660761]  ffff880424349ae8 ffffffff81180695 ffff8804242ab038 0000000000000004
> [  281.660761]  00000000001d8500 ffff88052bdd8500 ffff880424349b18 ffffffff81180915
> [  281.660761]  ffffffff876a68b0 ffff8804242ab000 0000000000000000 0000000000000001
> [  281.660761] Call Trace:
> [  281.660761]  [<ffffffff81180695>] ? sched_clock_local+0x25/0x90
> [  281.660761]  [<ffffffff81180915>] ? sched_clock_cpu+0xc5/0x110
> [  281.660761]  [<ffffffff811a3842>] lock_acquire+0x182/0x1d0
> [  281.660761]  [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
> [  281.660761]  [<ffffffff811a3daa>] ? __lock_release+0x1da/0x1f0
> [  281.660761]  [<ffffffff8438ae5b>] _raw_spin_lock+0x3b/0x70
> [  281.660761]  [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
> [  281.660761]  [<ffffffff812990d8>] walk_pte_range+0xb8/0x170
> [  281.660761]  [<ffffffff812993a1>] walk_pmd_range+0x211/0x240
> [  281.660761]  [<ffffffff812994fb>] walk_pud_range+0x12b/0x160
> [  281.660761]  [<ffffffff81299639>] walk_pgd_range+0x109/0x140
> [  281.660761]  [<ffffffff812996a5>] __walk_page_range+0x35/0x40
> [  281.660761]  [<ffffffff81299862>] walk_page_range+0xf2/0x130
> [  281.660761]  [<ffffffff812a8ccc>] queue_pages_range+0x6c/0x90
> [  281.660761]  [<ffffffff812a8d80>] ? queue_pages_hugetlb+0x90/0x90
> [  281.660761]  [<ffffffff812a8cf0>] ? queue_pages_range+0x90/0x90
> [  281.660761]  [<ffffffff812a8f50>] ? change_prot_numa+0x30/0x30
> [  281.660761]  [<ffffffff812ac9f1>] do_mbind+0x311/0x330
> [  281.660761]  [<ffffffff811815c1>] ? vtime_account_user+0x91/0xa0
> [  281.660761]  [<ffffffff8124f1a8>] ? context_tracking_user_exit+0xa8/0x1c0
> [  281.660761]  [<ffffffff812aca99>] SYSC_mbind+0x89/0xb0
> [  281.660761]  [<ffffffff812acac9>] SyS_mbind+0x9/0x10
> [  281.660761]  [<ffffffff84395360>] tracesys+0xdd/0xe2
> [  281.660761] Code: c2 04 47 49 85 be fa 0b 00 00 48 c7 c7 bb 85 49
> 85 e8 d9 7b f9 ff 31 c0 e9 9c 04 00 00 66 90 44 8b 1d a9 b8 ac 04 45
> 85 db 74 0c <48> 81 3b 40 61 3f 87 75 06 0f 1f 00 45 31 c0 83 fe 01
> 77 0c 89
> [  281.660761] RIP  [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
> [  281.660761]  RSP <ffff880424349ab8>
> [  281.660761] CR2: 0000000000000018
> [  281.660761] ---[ end trace b6e188d329664196 ]---
> 
> Thanks,
> Sasha
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
  2014-02-20 23:47   ` Sasha Levin
  2014-02-21  3:20     ` Naoya Horiguchi
@ 2014-02-21  4:30     ` Sasha Levin
       [not found]     ` <5306c629.012ce50a.6c48.ffff9844SMTPIN_ADDED_BROKEN@mx.google.com>
  2 siblings, 0 replies; 37+ messages in thread
From: Sasha Levin @ 2014-02-21  4:30 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

On 02/20/2014 06:47 PM, Sasha Levin wrote:
> Hi Naoya,
>
> This patch seems to trigger a NULL ptr deref here. I didn't have a change to look into it yet
> but here's the spew:
>
> [  281.650503] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> [  281.651577] IP: [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
> [  281.652453] PGD 40b88d067 PUD 40b88c067 PMD 0
> [  281.653143] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  281.653869] Dumping ftrace buffer:
> [  281.654430]    (ftrace buffer empty)
> [  281.654975] Modules linked in:
> [  281.655441] CPU: 4 PID: 12314 Comm: trinity-c361 Tainted: G        W
> 3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
> [  281.657622] task: ffff8804242ab000 ti: ffff880424348000 task.ti: ffff880424348000
> [  281.658503] RIP: 0010:[<ffffffff811a31fc>]  [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
> [  281.660025] RSP: 0018:ffff880424349ab8  EFLAGS: 00010002
> [  281.660761] RAX: 0000000000000086 RBX: 0000000000000018 RCX: 0000000000000000
> [  281.660761] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
> [  281.660761] RBP: ffff880424349b28 R08: 0000000000000001 R09: 0000000000000000
> [  281.660761] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804242ab000
> [  281.660761] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> [  281.660761] FS:  00007f36534b0700(0000) GS:ffff88052bc00000(0000) knlGS:0000000000000000
> [  281.660761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  281.660761] CR2: 0000000000000018 CR3: 000000040b88e000 CR4: 00000000000006e0
> [  281.660761] Stack:
> [  281.660761]  ffff880424349ae8 ffffffff81180695 ffff8804242ab038 0000000000000004
> [  281.660761]  00000000001d8500 ffff88052bdd8500 ffff880424349b18 ffffffff81180915
> [  281.660761]  ffffffff876a68b0 ffff8804242ab000 0000000000000000 0000000000000001
> [  281.660761] Call Trace:
> [  281.660761]  [<ffffffff81180695>] ? sched_clock_local+0x25/0x90
> [  281.660761]  [<ffffffff81180915>] ? sched_clock_cpu+0xc5/0x110
> [  281.660761]  [<ffffffff811a3842>] lock_acquire+0x182/0x1d0
> [  281.660761]  [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
> [  281.660761]  [<ffffffff811a3daa>] ? __lock_release+0x1da/0x1f0
> [  281.660761]  [<ffffffff8438ae5b>] _raw_spin_lock+0x3b/0x70
> [  281.660761]  [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
> [  281.660761]  [<ffffffff812990d8>] walk_pte_range+0xb8/0x170
> [  281.660761]  [<ffffffff812993a1>] walk_pmd_range+0x211/0x240
> [  281.660761]  [<ffffffff812994fb>] walk_pud_range+0x12b/0x160
> [  281.660761]  [<ffffffff81299639>] walk_pgd_range+0x109/0x140
> [  281.660761]  [<ffffffff812996a5>] __walk_page_range+0x35/0x40
> [  281.660761]  [<ffffffff81299862>] walk_page_range+0xf2/0x130
> [  281.660761]  [<ffffffff812a8ccc>] queue_pages_range+0x6c/0x90
> [  281.660761]  [<ffffffff812a8d80>] ? queue_pages_hugetlb+0x90/0x90
> [  281.660761]  [<ffffffff812a8cf0>] ? queue_pages_range+0x90/0x90
> [  281.660761]  [<ffffffff812a8f50>] ? change_prot_numa+0x30/0x30
> [  281.660761]  [<ffffffff812ac9f1>] do_mbind+0x311/0x330
> [  281.660761]  [<ffffffff811815c1>] ? vtime_account_user+0x91/0xa0
> [  281.660761]  [<ffffffff8124f1a8>] ? context_tracking_user_exit+0xa8/0x1c0
> [  281.660761]  [<ffffffff812aca99>] SYSC_mbind+0x89/0xb0
> [  281.660761]  [<ffffffff812acac9>] SyS_mbind+0x9/0x10
> [  281.660761]  [<ffffffff84395360>] tracesys+0xdd/0xe2
> [  281.660761] Code: c2 04 47 49 85 be fa 0b 00 00 48 c7 c7 bb 85 49 85 e8 d9 7b f9 ff 31 c0 e9 9c
> 04 00 00 66 90 44 8b 1d a9 b8 ac 04 45 85 db 74 0c <48> 81 3b 40 61 3f 87 75 06 0f 1f 00 45 31 c0 83
> fe 01 77 0c 89
> [  281.660761] RIP  [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
> [  281.660761]  RSP <ffff880424349ab8>
> [  281.660761] CR2: 0000000000000018
> [  281.660761] ---[ end trace b6e188d329664196 ]---

Out of curiosity, I'm testing out a new piece of code to make decoding this dump a bit easier. Let 
me know if it helped at all. Lines are based on -next from today:

[  281.650503] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[  281.651577] IP: [<kernel/locking/lockdep.c:3069>] __lock_acquire+0xbc/0x580
[  281.652453] PGD 40b88d067 PUD 40b88c067 PMD 0
[  281.653143] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  281.653869] Dumping ftrace buffer:
[  281.654430]    (ftrace buffer empty)
[  281.654975] Modules linked in:
[  281.655441] CPU: 4 PID: 12314 Comm: trinity-c361 Tainted: G        W 
3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
[  281.657622] task: ffff8804242ab000 ti: ffff880424348000 task.ti: ffff880424348000
[  281.658503] RIP: 0010:[<kernel/locking/lockdep.c:3069>]  [<kernel/locking/lockdep.c:3069>] 
__lock_acquire+0xbc/0x580
[  281.660025] RSP: 0018:ffff880424349ab8  EFLAGS: 00010002
[  281.660761] RAX: 0000000000000086 RBX: 0000000000000018 RCX: 0000000000000000
[  281.660761] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[  281.660761] RBP: ffff880424349b28 R08: 0000000000000001 R09: 0000000000000000
[  281.660761] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804242ab000
[  281.660761] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[  281.660761] FS:  00007f36534b0700(0000) GS:ffff88052bc00000(0000) knlGS:0000000000000000
[  281.660761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  281.660761] CR2: 0000000000000018 CR3: 000000040b88e000 CR4: 00000000000006e0
[  281.660761] Stack:
[  281.660761]  ffff880424349ae8 ffffffff81180695 ffff8804242ab038 0000000000000004
[  281.660761]  00000000001d8500 ffff88052bdd8500 ffff880424349b18 ffffffff81180915
[  281.660761]  ffffffff876a68b0 ffff8804242ab000 0000000000000000 0000000000000001
[  281.660761] Call Trace:
[  281.660761]  [<kernel/sched/clock.c:206>] ? sched_clock_local+0x25/0x90
[  281.660761]  [<arch/x86/include/asm/preempt.h:98 kernel/sched/clock.c:312>] ? 
sched_clock_cpu+0xc5/0x110
[  281.660761]  [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] 
lock_acquire+0x182/0x1d0
[  281.660761]  [<include/linux/spinlock.h:303 mm/pagewalk.c:33>] ? walk_pte_range+0xb8/0x170
[  281.660761]  [<kernel/locking/lockdep.c:3506>] ? __lock_release+0x1da/0x1f0
[  281.660761]  [<include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151>] 
_raw_spin_lock+0x3b/0x70
[  281.660761]  [<include/linux/spinlock.h:303 mm/pagewalk.c:33>] ? walk_pte_range+0xb8/0x170
[  281.660761]  [<include/linux/spinlock.h:303 mm/pagewalk.c:33>] walk_pte_range+0xb8/0x170
[  281.660761]  [<mm/pagewalk.c:90>] walk_pmd_range+0x211/0x240
[  281.660761]  [<mm/pagewalk.c:128>] walk_pud_range+0x12b/0x160
[  281.660761]  [<mm/pagewalk.c:165>] walk_pgd_range+0x109/0x140
[  281.660761]  [<mm/pagewalk.c:259>] __walk_page_range+0x35/0x40
[  281.660761]  [<mm/pagewalk.c:332>] walk_page_range+0xf2/0x130
[  281.660761]  [<mm/mempolicy.c:637>] queue_pages_range+0x6c/0x90
[  281.660761]  [<mm/mempolicy.c:492>] ? queue_pages_hugetlb+0x90/0x90
[  281.660761]  [<mm/mempolicy.c:521>] ? queue_pages_range+0x90/0x90
[  281.660761]  [<mm/mempolicy.c:573>] ? change_prot_numa+0x30/0x30
[  281.660761]  [<mm/mempolicy.c:1241>] do_mbind+0x311/0x330
[  281.660761]  [<kernel/sched/cputime.c:681>] ? vtime_account_user+0x91/0xa0
[  281.660761]  [<arch/x86/include/asm/atomic.h:26 include/linux/jump_label.h:148 
include/trace/events/context_tracking.h:47 kernel/context_tracking.c:178>] ? 
context_tracking_user_exit+0xa8/0x1c0
[  281.660761]  [<mm/mempolicy.c:1356>] SYSC_mbind+0x89/0xb0
[  281.660761]  [<mm/mempolicy.c:1340>] SyS_mbind+0x9/0x10
[  281.660761]  [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2
[  281.660761] Code: c2 04 47 49 85 be fa 0b 00 00 48 c7 c7 bb 85 49 85 e8 d9 7b f9 ff 31 c0 e9 9c 
04 00 00 66 90 44 8b 1d a9 b8 ac 04 45 85 db 74 0c <48> 81 3b 40 61 3f 87 75 06 0f 1f 00 45 31 c0 83 
fe 01 77 0c 89
[  281.660761] RIP  [<kernel/locking/lockdep.c:3069>] __lock_acquire+0xbc/0x580
[  281.660761]  RSP <ffff880424349ab8>
[  281.660761] CR2: 0000000000000018
[  281.660761] ---[ end trace b6e188d329664196 ]---


Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
  2014-02-10 21:44 ` [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range() Naoya Horiguchi
@ 2014-02-21  6:30   ` Sasha Levin
  2014-02-21 16:58     ` Naoya Horiguchi
       [not found]     ` <530785b2.d55c8c0a.3868.ffffa4e1SMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 37+ messages in thread
From: Sasha Levin @ 2014-02-21  6:30 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
> queue_pages_range() does page table walking in its own way now,
> so this patch rewrites it with walk_page_range().
> One difficulty was that queue_pages_range() needed to check vmas
> to determine whether we queue pages from a given vma or skip it.
> Now we have test_walk() callback in mm_walk for that purpose,
> so we can do the replacement cleanly. queue_pages_test_walk()
> depends on not only the current vma but also the previous one,
> so we use queue_pages->prev to keep it.
>
> ChangeLog v2:
> - rebase onto mmots
> - add VM_PFNMAP check on queue_pages_test_walk()
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---

Hi Naoya,

I'm seeing another spew in today's -next, and it seems to be related to this patch. Here's the spew 
(with line numbers instead of kernel addresses):


[ 1411.889835] kernel BUG at mm/hugetlb.c:3580!
[ 1411.890108] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1411.890468] Dumping ftrace buffer:
[ 1411.890468]    (ftrace buffer empty)
[ 1411.890468] Modules linked in:
[ 1411.890468] CPU: 0 PID: 2653 Comm: trinity-c285 Tainted: G        W 
3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
[ 1411.890468] task: ffff8801be0cb000 ti: ffff8801e471c000 task.ti: ffff8801e471c000
[ 1411.890468] RIP: 0010:[<mm/hugetlb.c:3580>]  [<mm/hugetlb.c:3580>] isolate_huge_page+0x1c/0xb0
[ 1411.890468] RSP: 0018:ffff8801e471dae8  EFLAGS: 00010246
[ 1411.890468] RAX: ffff88012b900000 RBX: ffffea0000000000 RCX: 0000000000000000
[ 1411.890468] RDX: 0000000000000000 RSI: ffff8801be0cbd00 RDI: 0000000000000000
[ 1411.890468] RBP: ffff8801e471daf8 R08: 0000000000000000 R09: 0000000000000000
[ 1411.890468] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8801e471dcf8
[ 1411.890468] R13: ffffffff87d39120 R14: ffff8801e471dbc8 R15: 00007f30b1800000
[ 1411.890468] FS:  00007f30b50bb700(0000) GS:ffff88012bc00000(0000) knlGS:0000000000000000
[ 1411.890468] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1411.890468] CR2: 0000000001609a10 CR3: 00000001e4703000 CR4: 00000000000006f0
[ 1411.890468] Stack:
[ 1411.890468]  00007f30b1000000 00007f30b0e00000 ffff8801e471db08 ffffffff812a8d71
[ 1411.890468]  ffff8801e471db78 ffffffff81298fb1 00007f30b0d00000 ffff880478a16c38
[ 1411.890468]  ffff8802291c6060 ffffffffffe00000 ffffffffffe00000 ffff8804fd7fa7d0
[ 1411.890468] Call Trace:
[ 1411.890468]  [<mm/mempolicy.c:540>] queue_pages_hugetlb+0x81/0x90
[ 1411.890468]  [<include/linux/spinlock.h:343 mm/pagewalk.c:203>] walk_hugetlb_range+0x111/0x180
[ 1411.890468]  [<mm/pagewalk.c:254>] __walk_page_range+0x25/0x40
[ 1411.890468]  [<mm/pagewalk.c:332>] walk_page_range+0xf2/0x130
[ 1411.890468]  [<mm/mempolicy.c:637>] queue_pages_range+0x6c/0x90
[ 1411.890468]  [<mm/mempolicy.c:492>] ? queue_pages_hugetlb+0x90/0x90
[ 1411.890468]  [<mm/mempolicy.c:521>] ? queue_pages_range+0x90/0x90
[ 1411.890468]  [<mm/mempolicy.c:573>] ? change_prot_numa+0x30/0x30
[ 1411.890468]  [<mm/mempolicy.c:1004>] migrate_to_node+0x77/0xc0
[ 1411.890468]  [<mm/mempolicy.c:1110>] do_migrate_pages+0x1a8/0x230
[ 1411.890468]  [<mm/mempolicy.c:1461>] SYSC_migrate_pages+0x316/0x380
[ 1411.890468]  [<include/linux/rcupdate.h:799 mm/mempolicy.c:1407>] ? SYSC_migrate_pages+0xac/0x380
[ 1411.890468]  [<kernel/sched/cputime.c:681>] ? vtime_account_user+0x91/0xa0
[ 1411.890468]  [<mm/mempolicy.c:1381>] SyS_migrate_pages+0x9/0x10
[ 1411.890468]  [<arch/x86/ia32/ia32entry.S:430>] ia32_do_call+0x13/0x13
[ 1411.890468] Code: 4c 8b 6d f8 c9 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 54 49 89 f4 53 48 
89 fb 48 8b 07 f6 c4 40 75 13 31 f6 e8 84 48 fb ff <0f> 0b 66 90 eb fe 66 0f 1f 44 00 00 8b 4f 1c 48 
8d 77 1c 85 c9
[ 1411.890468] RIP  [<mm/hugetlb.c:3580>] isolate_huge_page+0x1c/0xb0
[ 1411.890468]  RSP <ffff8801e471dae8>


Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
       [not found]     ` <5306c629.012ce50a.6c48.ffff9844SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-02-21  6:43       ` Sasha Levin
  2014-02-21 16:35         ` Naoya Horiguchi
       [not found]         ` <1393000553-ocl81482@n-horiguchi@ah.jp.nec.com>
  0 siblings, 2 replies; 37+ messages in thread
From: Sasha Levin @ 2014-02-21  6:43 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

On 02/20/2014 10:20 PM, Naoya Horiguchi wrote:
> Hi Sasha,
> 
> On Thu, Feb 20, 2014 at 06:47:56PM -0500, Sasha Levin wrote:
>> Hi Naoya,
>>
>> This patch seems to trigger a NULL ptr deref here. I didn't have a change to look into it yet
>> but here's the spew:
> 
> Thanks for reporting.
> I'm not sure what caused this bug from the kernel message. But in my guessing,
> it seems that the NULL pointer is deep inside lockdep routine __lock_acquire(),
> so if we find out which pointer was NULL, it might be useful to bisect which
> the proble is (page table walker or lockdep, or both.)

This actually points to walk_pte_range() trying to lock a NULL spinlock. It happens when we call
pte_offset_map_lock() and get a NULL ptl out of pte_lockptr().
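
For reference, pte_offset_map_lock() is roughly the following macro
(paraphrased from include/linux/mm.h, not quoted verbatim), so whatever
pte_lockptr() hands back is passed straight to spin_lock():

	#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
	({							\
		spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
		pte_t *__pte = pte_offset_map(pmd, address);	\
		*(ptlp) = __ptl;				\
		spin_lock(__ptl);				\
		__pte;						\
	})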

> BTW, just from curiousity, in my build environment many of kernel functions
> are inlined, so should not be shown in kernel message. But in your report
> we can see the symbols like walk_pte_range() and __lock_acquire() which never
> appear in my kernel. How did you do it? I turned off CONFIG_OPTIMIZE_INLINING,
> but didn't make it.

I'm really not sure. I've got a bunch of debug options enabled and it just seems to do the trick.

Try CONFIG_READABLE_ASM maybe?


Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
  2014-02-21  6:43       ` Sasha Levin
@ 2014-02-21 16:35         ` Naoya Horiguchi
       [not found]         ` <1393000553-ocl81482@n-horiguchi@ah.jp.nec.com>
  1 sibling, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-21 16:35 UTC (permalink / raw)
  To: sasha.levin
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

On Fri, Feb 21, 2014 at 01:43:20AM -0500, Sasha Levin wrote:
> On 02/20/2014 10:20 PM, Naoya Horiguchi wrote:
> > Hi Sasha,
> > 
> > On Thu, Feb 20, 2014 at 06:47:56PM -0500, Sasha Levin wrote:
> >> Hi Naoya,
> >>
> >> This patch seems to trigger a NULL ptr deref here. I didn't have a change to look into it yet
> >> but here's the spew:
> > 
> > Thanks for reporting.
> > I'm not sure what caused this bug from the kernel message. But in my guessing,
> > it seems that the NULL pointer is deep inside lockdep routine __lock_acquire(),
> > so if we find out which pointer was NULL, it might be useful to bisect which
> > the proble is (page table walker or lockdep, or both.)
> 
> This actually points to walk_pte_range() trying to lock a NULL spinlock. It happens when we call
> pte_offset_map_lock() and get a NULL ptl out of pte_lockptr().

I don't think page->ptl itself was NULL, because if it were we would hit the
NULL pointer dereference outside __lock_acquire() (it's dereferenced in
__raw_spin_lock()). Maybe page->ptl's lockdep state (dep_map) was bad. I'll dig
into it more to find out how we failed to set that up.

> > BTW, just from curiousity, in my build environment many of kernel functions
> > are inlined, so should not be shown in kernel message. But in your report
> > we can see the symbols like walk_pte_range() and __lock_acquire() which never
> > appear in my kernel. How did you do it? I turned off CONFIG_OPTIMIZE_INLINING,
> > but didn't make it.
> 
> I'm really not sure. I've got a bunch of debug options enabled and it just seems to do the trick.
> 
> Try CONFIG_READABLE_ASM maybe?

Hmm, that makes no difference; can I have your config?

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
       [not found]         ` <1393000553-ocl81482@n-horiguchi@ah.jp.nec.com>
@ 2014-02-21 16:50           ` Sasha Levin
  0 siblings, 0 replies; 37+ messages in thread
From: Sasha Levin @ 2014-02-21 16:50 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2193 bytes --]

On 02/21/2014 11:35 AM, Naoya Horiguchi wrote:
> On Fri, Feb 21, 2014 at 01:43:20AM -0500, Sasha Levin wrote:
>> On 02/20/2014 10:20 PM, Naoya Horiguchi wrote:
>>> Hi Sasha,
>>>
>>> On Thu, Feb 20, 2014 at 06:47:56PM -0500, Sasha Levin wrote:
>>>> Hi Naoya,
>>>>
>>>> This patch seems to trigger a NULL ptr deref here. I didn't have a change to look into it yet
>>>> but here's the spew:
>>>
>>> Thanks for reporting.
>>> I'm not sure what caused this bug from the kernel message. But in my guessing,
>>> it seems that the NULL pointer is deep inside lockdep routine __lock_acquire(),
>>> so if we find out which pointer was NULL, it might be useful to bisect which
>>> the proble is (page table walker or lockdep, or both.)
>>
>> This actually points to walk_pte_range() trying to lock a NULL spinlock. It happens when we call
>> pte_offset_map_lock() and get a NULL ptl out of pte_lockptr().
> 
> I don't think page->ptl was NULL, because if so we hit NULL pointer dereference
> outside __lock_acquire() (it's derefered in __raw_spin_lock()).
> Maybe page->ptl->lock_dep was NULL. I'll digging it more to find out how we failed
> to set this lock_dep thing.

I don't see __raw_spin_lock() dereferencing it before calling __lock_acquire():

	static inline void __raw_spin_lock(raw_spinlock_t *lock)
	{
		preempt_disable();
		spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
		LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
	}

So after we disable preemption, spin_acquire() is basically a macro that ends up pointing to
lock_acquire().

__raw_spin_lock() would dereference 'lock' only after the lockdep call.

>>> BTW, just from curiousity, in my build environment many of kernel functions
>>> are inlined, so should not be shown in kernel message. But in your report
>>> we can see the symbols like walk_pte_range() and __lock_acquire() which never
>>> appear in my kernel. How did you do it? I turned off CONFIG_OPTIMIZE_INLINING,
>>> but didn't make it.
>>
>> I'm really not sure. I've got a bunch of debug options enabled and it just seems to do the trick.
>>
>> Try CONFIG_READABLE_ASM maybe?
> 
> Hmm, it makes no change, can I have your config?

Sure, attached.


Thanks,
Sasha


[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 39429 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
  2014-02-21  6:30   ` Sasha Levin
@ 2014-02-21 16:58     ` Naoya Horiguchi
       [not found]     ` <530785b2.d55c8c0a.3868.ffffa4e1SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-21 16:58 UTC (permalink / raw)
  To: levinsasha928
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

Hi Sasha,

On Fri, Feb 21, 2014 at 01:30:53AM -0500, Sasha Levin wrote:
> On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
> >queue_pages_range() does page table walking in its own way now,
> >so this patch rewrites it with walk_page_range().
> >One difficulty was that queue_pages_range() needed to check vmas
> >to determine whether we queue pages from a given vma or skip it.
> >Now we have test_walk() callback in mm_walk for that purpose,
> >so we can do the replacement cleanly. queue_pages_test_walk()
> >depends on not only the current vma but also the previous one,
> >so we use queue_pages->prev to keep it.
> >
> >ChangeLog v2:
> >- rebase onto mmots
> >- add VM_PFNMAP check on queue_pages_test_walk()
> >
> >Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >---
> 
> Hi Naoya,
> 
> I'm seeing another spew in today's -next, and it seems to be related
> to this patch. Here's the spew (with line numbers instead of kernel
> addresses):

Thanks. (The line-number translation is very helpful.)

This bug looks strange to me.
"kernel BUG at mm/hugetlb.c:3580" means we try to do isolate_huge_page()
for !PageHead page. But the caller queue_pages_hugetlb() gets the page
with "page = pte_page(huge_ptep_get(pte))", so it should be the head page!

mm/hugetlb.c:3580 is VM_BUG_ON_PAGE(!PageHead(page), page), so we expect
dump_page() output at that point; is it in your kernel log?
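
For context, the caller and callee in question look roughly like this (a
sketch pieced together from the hunks in this series, not verbatim source):

	/* mm/mempolicy.c: queue_pages_hugetlb() */
	page = pte_page(huge_ptep_get(pte));	/* expected to be a head page */
	...
	isolate_huge_page(page, qp->pagelist);

	/* mm/hugetlb.c:3580, inside isolate_huge_page() */
	VM_BUG_ON_PAGE(!PageHead(page), page);	/* fires on a tail or bogus page */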

Thanks,
Naoya Horiguchi

> 
> [ 1411.889835] kernel BUG at mm/hugetlb.c:3580!
> [ 1411.890108] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 1411.890468] Dumping ftrace buffer:
> [ 1411.890468]    (ftrace buffer empty)
> [ 1411.890468] Modules linked in:
> [ 1411.890468] CPU: 0 PID: 2653 Comm: trinity-c285 Tainted: G
> W 3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
> [ 1411.890468] task: ffff8801be0cb000 ti: ffff8801e471c000 task.ti: ffff8801e471c000
> [ 1411.890468] RIP: 0010:[<mm/hugetlb.c:3580>]  [<mm/hugetlb.c:3580>] isolate_huge_page+0x1c/0xb0
> [ 1411.890468] RSP: 0018:ffff8801e471dae8  EFLAGS: 00010246
> [ 1411.890468] RAX: ffff88012b900000 RBX: ffffea0000000000 RCX: 0000000000000000
> [ 1411.890468] RDX: 0000000000000000 RSI: ffff8801be0cbd00 RDI: 0000000000000000
> [ 1411.890468] RBP: ffff8801e471daf8 R08: 0000000000000000 R09: 0000000000000000
> [ 1411.890468] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8801e471dcf8
> [ 1411.890468] R13: ffffffff87d39120 R14: ffff8801e471dbc8 R15: 00007f30b1800000
> [ 1411.890468] FS:  00007f30b50bb700(0000) GS:ffff88012bc00000(0000) knlGS:0000000000000000
> [ 1411.890468] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 1411.890468] CR2: 0000000001609a10 CR3: 00000001e4703000 CR4: 00000000000006f0
> [ 1411.890468] Stack:
> [ 1411.890468]  00007f30b1000000 00007f30b0e00000 ffff8801e471db08 ffffffff812a8d71
> [ 1411.890468]  ffff8801e471db78 ffffffff81298fb1 00007f30b0d00000 ffff880478a16c38
> [ 1411.890468]  ffff8802291c6060 ffffffffffe00000 ffffffffffe00000 ffff8804fd7fa7d0
> [ 1411.890468] Call Trace:
> [ 1411.890468]  [<mm/mempolicy.c:540>] queue_pages_hugetlb+0x81/0x90
> [ 1411.890468]  [<include/linux/spinlock.h:343 mm/pagewalk.c:203>] walk_hugetlb_range+0x111/0x180
> [ 1411.890468]  [<mm/pagewalk.c:254>] __walk_page_range+0x25/0x40
> [ 1411.890468]  [<mm/pagewalk.c:332>] walk_page_range+0xf2/0x130
> [ 1411.890468]  [<mm/mempolicy.c:637>] queue_pages_range+0x6c/0x90
> [ 1411.890468]  [<mm/mempolicy.c:492>] ? queue_pages_hugetlb+0x90/0x90
> [ 1411.890468]  [<mm/mempolicy.c:521>] ? queue_pages_range+0x90/0x90
> [ 1411.890468]  [<mm/mempolicy.c:573>] ? change_prot_numa+0x30/0x30
> [ 1411.890468]  [<mm/mempolicy.c:1004>] migrate_to_node+0x77/0xc0
> [ 1411.890468]  [<mm/mempolicy.c:1110>] do_migrate_pages+0x1a8/0x230
> [ 1411.890468]  [<mm/mempolicy.c:1461>] SYSC_migrate_pages+0x316/0x380
> [ 1411.890468]  [<include/linux/rcupdate.h:799 mm/mempolicy.c:1407>] ? SYSC_migrate_pages+0xac/0x380
> [ 1411.890468]  [<kernel/sched/cputime.c:681>] ? vtime_account_user+0x91/0xa0
> [ 1411.890468]  [<mm/mempolicy.c:1381>] SyS_migrate_pages+0x9/0x10
> [ 1411.890468]  [<arch/x86/ia32/ia32entry.S:430>] ia32_do_call+0x13/0x13
> [ 1411.890468] Code: 4c 8b 6d f8 c9 c3 66 0f 1f 84 00 00 00 00 00 55
> 48 89 e5 41 54 49 89 f4 53 48 89 fb 48 8b 07 f6 c4 40 75 13 31 f6 e8
> 84 48 fb ff <0f> 0b 66 90 eb fe 66 0f 1f 44 00 00 8b 4f 1c 48 8d 77
> 1c 85 c9
> [ 1411.890468] RIP  [<mm/hugetlb.c:3580>] isolate_huge_page+0x1c/0xb0
> [ 1411.890468]  RSP <ffff8801e471dae8>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
       [not found]     ` <530785b2.d55c8c0a.3868.ffffa4e1SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-02-21 17:18       ` Sasha Levin
  2014-02-21 17:25         ` Naoya Horiguchi
       [not found]         ` <1393003512-qjyhnu0@n-horiguchi@ah.jp.nec.com>
  0 siblings, 2 replies; 37+ messages in thread
From: Sasha Levin @ 2014-02-21 17:18 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

On 02/21/2014 11:58 AM, Naoya Horiguchi wrote:
> Hi Sasha,
> 
> On Fri, Feb 21, 2014 at 01:30:53AM -0500, Sasha Levin wrote:
>> On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
>>> queue_pages_range() does page table walking in its own way now,
>>> so this patch rewrites it with walk_page_range().
>>> One difficulty was that queue_pages_range() needed to check vmas
>>> to determine whether we queue pages from a given vma or skip it.
>>> Now we have test_walk() callback in mm_walk for that purpose,
>>> so we can do the replacement cleanly. queue_pages_test_walk()
>>> depends on not only the current vma but also the previous one,
>>> so we use queue_pages->prev to keep it.
>>>
>>> ChangeLog v2:
>>> - rebase onto mmots
>>> - add VM_PFNMAP check on queue_pages_test_walk()
>>>
>>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>> ---
>>
>> Hi Naoya,
>>
>> I'm seeing another spew in today's -next, and it seems to be related
>> to this patch. Here's the spew (with line numbers instead of kernel
>> addresses):
> 
> Thanks. (line numbers translation is very helpful.)
> 
> This bug looks strange to me.
> "kernel BUG at mm/hugetlb.c:3580" means we try to do isolate_huge_page()
> for !PageHead page. But the caller queue_pages_hugetlb() gets the page
> with "page = pte_page(huge_ptep_get(pte))", so it should be the head page!
> 
> mm/hugetlb.c:3580 is VM_BUG_ON_PAGE(!PageHead(page), page), so we expect to
> have dump_page output at this point, is that in your kernel log?

This is usually a sign of a race between that code and thp splitting, see
https://lkml.org/lkml/2013/12/23/457 for example.

I forgot to add the dump_page output to my extraction process, and the complete logs are all long gone.
I'll grab it when it happens again.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
  2014-02-21 17:18       ` Sasha Levin
@ 2014-02-21 17:25         ` Naoya Horiguchi
       [not found]         ` <1393003512-qjyhnu0@n-horiguchi@ah.jp.nec.com>
  1 sibling, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-21 17:25 UTC (permalink / raw)
  To: sasha.levin
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

On Fri, Feb 21, 2014 at 12:18:11PM -0500, Sasha Levin wrote:
> On 02/21/2014 11:58 AM, Naoya Horiguchi wrote:
> > On Fri, Feb 21, 2014 at 01:30:53AM -0500, Sasha Levin wrote:
> >> On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
> >>> queue_pages_range() does page table walking in its own way now,
> >>> so this patch rewrites it with walk_page_range().
> >>> One difficulty was that queue_pages_range() needed to check vmas
> >>> to determine whether we queue pages from a given vma or skip it.
> >>> Now we have test_walk() callback in mm_walk for that purpose,
> >>> so we can do the replacement cleanly. queue_pages_test_walk()
> >>> depends on not only the current vma but also the previous one,
> >>> so we use queue_pages->prev to keep it.
> >>>
> >>> ChangeLog v2:
> >>> - rebase onto mmots
> >>> - add VM_PFNMAP check on queue_pages_test_walk()
> >>>
> >>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >>> ---
> >>
> >> Hi Naoya,
> >>
> >> I'm seeing another spew in today's -next, and it seems to be related
> >> to this patch. Here's the spew (with line numbers instead of kernel
> >> addresses):
> > 
> > Thanks. (line numbers translation is very helpful.)
> > 
> > This bug looks strange to me.
> > "kernel BUG at mm/hugetlb.c:3580" means we try to do isolate_huge_page()
> > for !PageHead page. But the caller queue_pages_hugetlb() gets the page
> > with "page = pte_page(huge_ptep_get(pte))", so it should be the head page!
> > 
> > mm/hugetlb.c:3580 is VM_BUG_ON_PAGE(!PageHead(page), page), so we expect to
> > have dump_page output at this point, is that in your kernel log?
> 
> This is usually a sign of a race between that code and thp splitting, see
> https://lkml.org/lkml/2013/12/23/457 for example.

queue_pages_hugetlb() is for hugetlbfs, not for thp, so I don't think that
it's related to thp splitting, but I agree it's a race.

> I forgot to add the dump_page output to my extraction process and the complete logs all long gone.
> I'll grab it when it happens again.

Thank you. It'll be useful.

Naoya

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
       [not found]         ` <1393003512-qjyhnu0@n-horiguchi@ah.jp.nec.com>
@ 2014-02-23 13:04           ` Sasha Levin
  2014-02-23 18:59             ` Naoya Horiguchi
  0 siblings, 1 reply; 37+ messages in thread
From: Sasha Levin @ 2014-02-23 13:04 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

On 02/21/2014 12:25 PM, Naoya Horiguchi wrote:
> On Fri, Feb 21, 2014 at 12:18:11PM -0500, Sasha Levin wrote:
>> On 02/21/2014 11:58 AM, Naoya Horiguchi wrote:
>>> On Fri, Feb 21, 2014 at 01:30:53AM -0500, Sasha Levin wrote:
>>>> On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
>>>>> queue_pages_range() does page table walking in its own way now,
>>>>> so this patch rewrites it with walk_page_range().
>>>>> One difficulty was that queue_pages_range() needed to check vmas
>>>>> to determine whether we queue pages from a given vma or skip it.
>>>>> Now we have test_walk() callback in mm_walk for that purpose,
>>>>> so we can do the replacement cleanly. queue_pages_test_walk()
>>>>> depends on not only the current vma but also the previous one,
>>>>> so we use queue_pages->prev to keep it.
>>>>>
>>>>> ChangeLog v2:
>>>>> - rebase onto mmots
>>>>> - add VM_PFNMAP check on queue_pages_test_walk()
>>>>>
>>>>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>>>> ---
>>>>
>>>> Hi Naoya,
>>>>
>>>> I'm seeing another spew in today's -next, and it seems to be related
>>>> to this patch. Here's the spew (with line numbers instead of kernel
>>>> addresses):
>>>
>>> Thanks. (line numbers translation is very helpful.)
>>>
>>> This bug looks strange to me.
>>> "kernel BUG at mm/hugetlb.c:3580" means we try to do isolate_huge_page()
>>> for !PageHead page. But the caller queue_pages_hugetlb() gets the page
>>> with "page = pte_page(huge_ptep_get(pte))", so it should be the head page!
>>>
>>> mm/hugetlb.c:3580 is VM_BUG_ON_PAGE(!PageHead(page), page), so we expect to
>>> have dump_page output at this point, is that in your kernel log?
>>
>> This is usually a sign of a race between that code and thp splitting, see
>> https://lkml.org/lkml/2013/12/23/457 for example.
> 
> queue_pages_hugetlb() is for hugetlbfs, not for thp, so I don't think that
> it's related to thp splitting, but I agree it's a race.
> 
>> I forgot to add the dump_page output to my extraction process and the complete logs all long gone.
>> I'll grab it when it happens again.
> 
> Thank you. It'll be useful.

And here it is:

[  755.524966] page:ffffea0000000000 count:0 mapcount:1 mapping:          (null) index:0x0
[  755.526067] page flags: 0x0()

Followed by the same stack trace as before.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
  2014-02-23 13:04           ` Sasha Levin
@ 2014-02-23 18:59             ` Naoya Horiguchi
  0 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-02-23 18:59 UTC (permalink / raw)
  To: sasha.levin
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

On Sun, Feb 23, 2014 at 08:04:56AM -0500, Sasha Levin wrote:
...
> And here it is:
> 
> [  755.524966] page:ffffea0000000000 count:0 mapcount:1 mapping:          (null) index:0x0
> [  755.526067] page flags: 0x0()
> 
> Followed by the same stack trace as before.

Thanks.

It seems that this page is pfn 0, so we might have an invalid value in the page
table entry (pointing to pfn 0). In this -next tree we have some updates around
the hugetlb fault code (like "mm, hugetlb: improve page-fault scalability"),
so I'll check whether there could be a race window from this viewpoint.
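
In other words, a cleared or half-updated hugetlb pte would explain the dump
above; the suspicion, as an illustrative sketch only:

	pte_t entry = huge_ptep_get(ptep);	/* racing with hugetlb fault/unmap? */
	struct page *page = pte_page(entry);	/* entry == 0 gives pfn 0, i.e.     */
						/* page ffffea0000000000 with       */
						/* count:0, flags:0x0 as dumped     */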

Naoya

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/11] madvise: redefine callback functions for page table walker
  2014-02-10 21:44 ` [PATCH 08/11] madvise: " Naoya Horiguchi
@ 2014-03-21  1:47   ` Sasha Levin
  2014-03-21  2:43     ` [PATCH] madvise: fix locking in force_swapin_readahead() (Re: [PATCH 08/11] madvise: redefine callback functions for page table walker) Naoya Horiguchi
  0 siblings, 1 reply; 37+ messages in thread
From: Sasha Levin @ 2014-03-21  1:47 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
> swapin_walk_pmd_entry() is defined as pmd_entry(), but it has no code
> about pmd handling (except pmd_none_or_trans_huge_or_clear_bad, but the
> same check are now done in core page table walk code).
> So let's move this function on pte_entry() as swapin_walk_pte_entry().
>
> Signed-off-by: Naoya Horiguchi<n-horiguchi@ah.jp.nec.com>

This patch seems to generate:

[  305.267354] =================================
[  305.268051] [ INFO: inconsistent lock state ]
[  305.268678] 3.14.0-rc7-next-20140320-sasha-00015-gd752393-dirty #261 Tainted: G        W
[  305.269992] ---------------------------------
[  305.270152] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[  305.270152] trinity-c57/13619 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  305.270152]  (&(ptlock_ptr(page))->rlock#2){+.+.?.}, at: walk_pte_range (include/linux/spinlock.h:303 mm/pagewalk.c:33)
[  305.270152] {IN-RECLAIM_FS-W} state was registered at:
[  305.270152]   mark_irqflags (kernel/locking/lockdep.c:2821)
[  305.270152]   __lock_acquire (kernel/locking/lockdep.c:3138)
[  305.270152]   lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[  305.270152]   _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[  305.270152]   __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:624)
[  305.270152]   page_referenced_one (mm/rmap.c:706)
[  305.270152]   rmap_walk_anon (mm/rmap.c:1613)
[  305.270152]   rmap_walk (mm/rmap.c:1685)
[  305.270152]   page_referenced (mm/rmap.c:802)
[  305.270152]   shrink_active_list (mm/vmscan.c:1704)
[  305.270152]   balance_pgdat (mm/vmscan.c:2741 mm/vmscan.c:2996)
[  305.270152]   kswapd (mm/vmscan.c:3296)
[  305.270152]   kthread (kernel/kthread.c:216)
[  305.270152]   ret_from_fork (arch/x86/kernel/entry_64.S:555)
[  305.270152] irq event stamp: 20863
[  305.270152] hardirqs last  enabled at (20863): alloc_pages_vma (arch/x86/include/asm/paravirt.h:809 include/linux/seqlock.h:81 include/linux/seqlock.h:146 include/linux/cpus
et.h:98 mm/mempolicy.c:1990)
[  305.270152] hardirqs last disabled at (20862): alloc_pages_vma (include/linux/seqlock.h:79 include/linux/seqlock.h:146 include/linux/cpuset.h:98 mm/mempolicy.c:1990)
[  305.270152] softirqs last  enabled at (19858): __do_softirq (arch/x86/include/asm/preempt.h:22 kernel/softirq.c:298)
[  305.270152] softirqs last disabled at (19855): irq_exit (kernel/softirq.c:348 kernel/softirq.c:389)
[  305.270152]
[  305.270152] other info that might help us debug this:
[  305.270152]  Possible unsafe locking scenario:
[  305.270152]
[  305.270152]        CPU0
[  305.270152]        ----
[  305.270152]   lock(&(ptlock_ptr(page))->rlock#2);
[  305.270152]   <Interrupt>
[  305.270152]     lock(&(ptlock_ptr(page))->rlock#2);
[  305.270152]
[  305.270152]  *** DEADLOCK ***
[  305.270152]
[  305.270152] 2 locks held by trinity-c57/13619:
[  305.270152]  #0:  (&mm->mmap_sem){++++++}, at: SyS_madvise (arch/x86/include/asm/current.h:14 mm/madvise.c:492 mm/madvise.c:448)
[  305.270152]  #1:  (&(ptlock_ptr(page))->rlock#2){+.+.?.}, at: walk_pte_range (include/linux/spinlock.h:303 mm/pagewalk.c:33)
[  305.270152]
[  305.270152] stack backtrace:
[  305.270152] CPU: 23 PID: 13619 Comm: trinity-c57 Tainted: G        W     3.14.0-rc7-next-20140320-sasha-00015-gd752393-dirty #261
[  305.270152]  ffff8804ab8e0d28 ffff8804ab9c5968 ffffffff844b76e7 0000000000000001
[  305.270152]  ffff8804ab8e0000 ffff8804ab9c59c8 ffffffff811a55f7 0000000000000000
[  305.270152]  0000000000000001 ffff880400000001 ffffffff87e18ed8 000000000000000a
[  305.270152] Call Trace:
[  305.270152]  dump_stack (lib/dump_stack.c:52)
[  305.270152]  print_usage_bug (kernel/locking/lockdep.c:2254)
[  305.270152]  ? check_usage_forwards (kernel/locking/lockdep.c:2371)
[  305.270152]  mark_lock_irq (kernel/locking/lockdep.c:2465)
[  305.270152]  mark_lock (kernel/locking/lockdep.c:2920)
[  305.270152]  mark_held_locks (kernel/locking/lockdep.c:2523)
[  305.270152]  lockdep_trace_alloc (kernel/locking/lockdep.c:2745 kernel/locking/lockdep.c:2760)
[  305.270152]  __alloc_pages_nodemask (mm/page_alloc.c:2722)
[  305.270152]  ? mark_held_locks (kernel/locking/lockdep.c:2523)
[  305.270152]  ? alloc_pages_vma (arch/x86/include/asm/paravirt.h:809 include/linux/seqlock.h:81 include/linux/seqlock.h:146 include/linux/cpuset.h:98 mm/mempolicy.c:1990)
[  305.270152]  alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:2006)
[  305.270152]  ? read_swap_cache_async (mm/swap_state.c:328)
[  305.270152]  ? __const_udelay (arch/x86/lib/delay.c:126)
[  305.270152]  read_swap_cache_async (mm/swap_state.c:328)
[  305.270152]  ? walk_pte_range (include/linux/spinlock.h:303 mm/pagewalk.c:33)
[  305.270152]  swapin_walk_pte_entry (mm/madvise.c:152)
[  305.270152]  walk_pte_range (mm/pagewalk.c:47)
[  305.270152]  ? sched_clock (arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:305)
[  305.270152]  walk_pmd_range (mm/pagewalk.c:90)
[  305.270152]  ? sched_clock (arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:305)
[  305.270152]  ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[  305.270152]  walk_pud_range (mm/pagewalk.c:128)
[  305.270152]  walk_pgd_range (mm/pagewalk.c:165)
[  305.270152]  __walk_page_range (mm/pagewalk.c:259)
[  305.270152]  walk_page_range (mm/pagewalk.c:333)
[  305.270152]  madvise_willneed (mm/madvise.c:167 mm/madvise.c:211)
[  305.270152]  ? madvise_hwpoison (mm/madvise.c:140)
[  305.270152]  madvise_vma (mm/madvise.c:369)
[  305.270152]  ? find_vma (mm/mmap.c:2021)
[  305.270152]  SyS_madvise (mm/madvise.c:518 mm/madvise.c:448)
[  305.270152]  ia32_do_call (arch/x86/ia32/ia32entry.S:430)


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] madvise: fix locking in force_swapin_readahead() (Re: [PATCH 08/11] madvise: redefine callback functions for page table walker)
  2014-03-21  1:47   ` Sasha Levin
@ 2014-03-21  2:43     ` Naoya Horiguchi
  2014-03-21  5:16       ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Naoya Horiguchi @ 2014-03-21  2:43 UTC (permalink / raw)
  To: sasha.levin
  Cc: linux-mm, akpm, mpm, cpw, kosaki.motohiro, hannes,
	kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel


On Thu, Mar 20, 2014 at 09:47:04PM -0400, Sasha Levin wrote:
> On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
> >swapin_walk_pmd_entry() is defined as pmd_entry(), but it has no code
> >about pmd handling (except pmd_none_or_trans_huge_or_clear_bad, but the
> >same check are now done in core page table walk code).
> >So let's move this function on pte_entry() as swapin_walk_pte_entry().
> >
> >Signed-off-by: Naoya Horiguchi<n-horiguchi@ah.jp.nec.com>
> 
> This patch seems to generate:

Sasha, thank you for reporting.
I forgot to unlock the ptlock before entering read_swap_cache_async(), which
takes the page lock internally. As a result the lock ordering rule (documented
in mm/rmap.c) was violated: we should take locks in the order
mmap_sem -> page lock -> ptlock.
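
For reference, the relevant part of the lock ordering comment at the top of
mm/rmap.c reads roughly like this (abridged):

	 * mm->mmap_sem
	 *   page->flags PG_locked (lock_page)
	 *     mapping->i_mmap_mutex
	 *       anon_vma->rwsem
	 *         mm->page_table_lock or pte_lock
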
The following patch should fix this. Could you test with it?

---

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] madvise: fix locking in force_swapin_readahead() (Re: [PATCH 08/11] madvise: redefine callback functions for page table walker)
  2014-03-21  2:43     ` [PATCH] madvise: fix locking in force_swapin_readahead() (Re: [PATCH 08/11] madvise: redefine callback functions for page table walker) Naoya Horiguchi
@ 2014-03-21  5:16       ` Hugh Dickins
  2014-03-21  6:22         ` Naoya Horiguchi
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2014-03-21  5:16 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Shaohua Li, sasha.levin, linux-mm, akpm, mpm, cpw,
	kosaki.motohiro, hannes, kamezawa.hiroyu, mhocko, aneesh.kumar,
	xemul, riel, kirill.shutemov, linux-kernel

On Thu, 20 Mar 2014, Naoya Horiguchi wrote:
> On Thu, Mar 20, 2014 at 09:47:04PM -0400, Sasha Levin wrote:
> > On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
> > >swapin_walk_pmd_entry() is defined as pmd_entry(), but it has no code
> > >about pmd handling (except pmd_none_or_trans_huge_or_clear_bad, but the
> > >same check are now done in core page table walk code).
> > >So let's move this function on pte_entry() as swapin_walk_pte_entry().
> > >
> > >Signed-off-by: Naoya Horiguchi<n-horiguchi@ah.jp.nec.com>
> > 
> > This patch seems to generate:
> 
> Sasha, thank you for reporting.
> I forgot to unlock ptlock before entering read_swap_cache_async() which
> holds page lock in it, as a result lock ordering rule (written in mm/rmap.c)
> was violated (we should take in the order of mmap_sem -> page lock -> ptlock.)
> The following patch should fix this. Could you test with it?
> 
> ---
> From c0d56af5874dc40467c9b3a0f9e53b39b3c4f1c5 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Thu, 20 Mar 2014 22:30:51 -0400
> Subject: [PATCH] madvise: fix locking in force_swapin_readahead()
> 
> We take mmap_sem and ptlock in walking over ptes with swapin_walk_pte_entry(),
> but inside it we call read_swap_cache_async() which holds page lock.
> So we should unlock ptlock to call read_swap_cache_async() to meet lock order
> rule (mmap_sem -> page lock -> ptlock).
> 
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

NAK.  You are now unlocking and relocking the spinlock, good; but on
arm, frv, or i386 with CONFIG_HIGHPTE you are leaving the page table atomically
kmapped across read_swap_cache_async(), which (never mind lock ordering)
is quite likely to block waiting to allocate memory.
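
(With CONFIG_HIGHPTE the page table itself lives in highmem and pte_offset_map()
is an atomic kmap; on i386 it is roughly:

	#define pte_offset_map(pmd, address)			\
		((pte_t *)kmap_atomic(pmd_page(*(pmd))) +	\
		 pte_index(address))

so nothing that may sleep can run between pte_offset_map_lock() and
pte_unmap_unlock().)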

I do not see
madvise-redefine-callback-functions-for-page-table-walker.patch
as an improvement.  I can see what's going on in Shaohua's original
code, whereas this style makes bugs more likely.  Please drop it.

Hugh

> ---
>  mm/madvise.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5e957b984c14..ed9c31e3b5ff 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -141,24 +141,35 @@ static int swapin_walk_pte_entry(pte_t *pte, unsigned long start,
>  	swp_entry_t entry;
>  	struct page *page;
>  	struct vm_area_struct *vma = walk->vma;
> +	spinlock_t *ptl = (spinlock_t *)walk->private;
>  
>  	if (pte_present(*pte) || pte_none(*pte) || pte_file(*pte))
>  		return 0;
>  	entry = pte_to_swp_entry(*pte);
>  	if (unlikely(non_swap_entry(entry)))
>  		return 0;
> +	spin_unlock(ptl);
>  	page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
>  				     vma, start);
> +	spin_lock(ptl);
>  	if (page)
>  		page_cache_release(page);
>  	return 0;
>  }
>  
> +static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> +	unsigned long end, struct mm_walk *walk)
> +{
> +	walk->private = pte_lockptr(walk->mm, pmd);
> +	return 0;
> +}
> +
>  static void force_swapin_readahead(struct vm_area_struct *vma,
>  		unsigned long start, unsigned long end)
>  {
>  	struct mm_walk walk = {
>  		.mm = vma->vm_mm,
> +		.pmd_entry = swapin_walk_pmd_entry,
>  		.pte_entry = swapin_walk_pte_entry,
>  	};
>  
> -- 
> 1.8.5.3
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] madvise: fix locking in force_swapin_readahead() (Re: [PATCH 08/11] madvise: redefine callback functions for page table walker)
  2014-03-21  5:16       ` Hugh Dickins
@ 2014-03-21  6:22         ` Naoya Horiguchi
  0 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-03-21  6:22 UTC (permalink / raw)
  To: hughd
  Cc: shli, sasha.levin, linux-mm, akpm, mpm, cpw, kosaki.motohiro,
	hannes, kamezawa.hiroyu, mhocko, aneesh.kumar, xemul, riel,
	kirill.shutemov, linux-kernel

On Thu, Mar 20, 2014 at 10:16:21PM -0700, Hugh Dickins wrote:
> On Thu, 20 Mar 2014, Naoya Horiguchi wrote:
> > On Thu, Mar 20, 2014 at 09:47:04PM -0400, Sasha Levin wrote:
> > > On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
> > > >swapin_walk_pmd_entry() is defined as pmd_entry(), but it has no code
> > > >about pmd handling (except pmd_none_or_trans_huge_or_clear_bad, but the
> > > >same check are now done in core page table walk code).
> > > >So let's move this function on pte_entry() as swapin_walk_pte_entry().
> > > >
> > > >Signed-off-by: Naoya Horiguchi<n-horiguchi@ah.jp.nec.com>
> > > 
> > > This patch seems to generate:
> > 
> > Sasha, thank you for reporting.
> > I forgot to unlock ptlock before entering read_swap_cache_async() which
> > holds page lock in it, as a result lock ordering rule (written in mm/rmap.c)
> > was violated (we should take in the order of mmap_sem -> page lock -> ptlock.)
> > The following patch should fix this. Could you test with it?
> > 
> > ---
> > From c0d56af5874dc40467c9b3a0f9e53b39b3c4f1c5 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Date: Thu, 20 Mar 2014 22:30:51 -0400
> > Subject: [PATCH] madvise: fix locking in force_swapin_readahead()
> > 
> > We take mmap_sem and ptlock in walking over ptes with swapin_walk_pte_entry(),
> > but inside it we call read_swap_cache_async() which holds page lock.
> > So we should unlock ptlock to call read_swap_cache_async() to meet lock order
> > rule (mmap_sem -> page lock -> ptlock).
> > 
> > Reported-by: Sasha Levin <sasha.levin@oracle.com>
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> NAK.  You are now unlocking and relocking the spinlock, good; but on
> arm frv or i386 CONFIG_HIGHPTE you are leaving the page table atomically
> kmapped across read_swap_cache_async(), which (never mind lock ordering)
> is quite likely to block waiting to allocate memory.

Thanks for pointing that out; you're right.
walk_pte_range() doesn't fit the pte loop in the original swapin_walk_pmd_entry(),
so I should not have changed this code.

> I do not see
> madvise-redefine-callback-functions-for-page-table-walker.patch
> as an improvement.  I can see what's going on in Shaohua's original
> code, whereas this style makes bugs more likely.  Please drop it.

OK, I agree with that.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
  2014-02-10 21:44 ` [PATCH 01/11] pagewalk: update page table walker core Naoya Horiguchi
  2014-02-12  5:39   ` Joonsoo Kim
  2014-02-20 23:47   ` Sasha Levin
@ 2014-06-02 23:49   ` Dave Hansen
  2014-06-03  0:29     ` Naoya Horiguchi
  2 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2014-06-02 23:49 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

On 02/10/2014 01:44 PM, Naoya Horiguchi wrote:
> When we try to use multiple callbacks in different levels, skip control is
> also important. For example we have thp enabled in normal configuration, and
> we are interested in doing some work for a thp. But sometimes we want to
> split it and handle as normal pages, and in another time user would handle
> both at pmd level and pte level.
> What we need is that when we've done pmd_entry() we want to decide whether
> to go down to pte level handling based on the pmd_entry()'s result. So this
> patch introduces a skip control flag in mm_walk.
> We can't use the returned value for this purpose, because we already
> defined the meaning of whole range of returned values (>0 is to terminate
> page table walk in caller's specific manner, =0 is to continue to walk,
> and <0 is to abort the walk in the general manner.)

This seems a bit complicated for a case which doesn't exist in practice
in the kernel today.  We don't even *have* a single ->pte_entry handler.
 Everybody just sets ->pmd_entry and does the splitting and handling of
individual pte entries in there.  The only reason it's needed is because
of the later patches in the series, which is kinda goofy.
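
That prevailing pattern has each pmd_entry() doing its own pte loop, along
these lines (a sketch of the common shape, not any particular caller):

	static int foo_pmd_entry(pmd_t *pmd, unsigned long addr,
				 unsigned long end, struct mm_walk *walk)
	{
		struct vm_area_struct *vma = walk->private; /* however it got stashed */
		pte_t *pte;
		spinlock_t *ptl;

		split_huge_page_pmd(vma, addr, pmd);
		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
			return 0;
		pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
		for (; addr != end; pte++, addr += PAGE_SIZE) {
			/* per-pte work goes here */
		}
		pte_unmap_unlock(pte - 1, ptl);
		return 0;
	}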

I'm biased, but I think the abstraction here is done in the wrong place.

Naoya, could you take a look at the new handler I proposed?  Would
that help make this simpler?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] pagewalk: update page table walker core
  2014-06-02 23:49   ` Dave Hansen
@ 2014-06-03  0:29     ` Naoya Horiguchi
  0 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-06-03  0:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, Matt Mackall, Cliff Wickman,
	KOSAKI Motohiro, Johannes Weiner, KAMEZAWA Hiroyuki,
	Michal Hocko, Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel,
	Kirill A. Shutemov, linux-kernel

On Mon, Jun 02, 2014 at 04:49:18PM -0700, Dave Hansen wrote:
> On 02/10/2014 01:44 PM, Naoya Horiguchi wrote:
> > When we try to use multiple callbacks in different levels, skip control is
> > also important. For example we have thp enabled in normal configuration, and
> > we are interested in doing some work for a thp. But sometimes we want to
> > split it and handle as normal pages, and in another time user would handle
> > both at pmd level and pte level.
> > What we need is that when we've done pmd_entry() we want to decide whether
> > to go down to pte level handling based on the pmd_entry()'s result. So this
> > patch introduces a skip control flag in mm_walk.
> > We can't use the returned value for this purpose, because we already
> > defined the meaning of whole range of returned values (>0 is to terminate
> > page table walk in caller's specific manner, =0 is to continue to walk,
> > and <0 is to abort the walk in the general manner.)
> 
> This seems a bit complicated for a case which doesn't exist in practice
> in the kernel today.  We don't even *have* a single ->pte_entry handler.

The following users get their own pte_entry() in the latter part of this patchset:
- queue_pages_range()
- mem_cgroup_count_precharge()
- show_numa_map()
- pagemap_read()
- clear_refs_write()
- show_smap()
- or1k_dma_alloc()
- or1k_dma_free()
- subpage_mark_vma_nohuge()

>  Everybody just sets ->pmd_entry and does the splitting and handling of
> individual pte entries in there.

Walking over every pte entry under a pmd is a common task, so unless there is
a good reason not to, we should do it on the mm/pagewalk.c side, not in each
pmd_entry() callback. (Callbacks should focus on their own task.)
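
For example, with the walker owning the pte loop and the pte lock, a user's
callback can stay as small as this (illustrative sketch, names made up):

	static int my_pte_entry(pte_t *pte, unsigned long addr,
				unsigned long next, struct mm_walk *walk)
	{
		/* called once per pte, with the pte lock already held */
		if (pte_present(*pte))
			handle_one_pte(walk->vma, addr, pte); /* hypothetical helper */
		return 0;
	}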

>  The only reason it's needed is because
> of the later patches in the series, which is kinda goofy.

Most current users use pte_entry() in the latest linux-mm tree.
Only a few callers (mem_cgroup_move_charge() and force_swapin_readahead())
make their pmd_entry() handle the pte-level walk in their own way.

BTW, we have some potential callers of the page table walker which currently
do the page walk completely in their own way. Here's the list:
- mincore()
- copy_page_range()
- remap_pfn_range()
- zap_page_range()
- free_pgtables()
- vmap_page_range_noflush()
- change_protection_range()
Yes, my work on cleaning up the page table walker is still in progress.

> I'm biased, but I think the abstraction here is done in the wrong place.
> 
> Naoya, could you take a looked at the new handler I proposed?  Would
> that help make this simpler?

I'll look through this series later and I'd like to add some of your
patches on top of this patchset.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
  2014-01-13 16:54 [PATCH 00/11 v4] " Naoya Horiguchi
@ 2014-01-13 16:54 ` Naoya Horiguchi
  0 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2014-01-13 16:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

queue_pages_range() does page table walking in its own way now,
so this patch rewrites it with walk_page_range().
One difficulty was that queue_pages_range() needed to check vmas
to determine whether we queue pages from a given vma or skip it.
Now we have test_walk() callback in mm_walk for that purpose,
so we can do the replacement cleanly. queue_pages_test_walk()
depends on not only the current vma but also the previous one,
so we use queue_pages->prev to keep it.

ChangeLog v2:
- rebase onto mmots
- add VM_PFNMAP check on queue_pages_test_walk()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 255 ++++++++++++++++++++++-----------------------------------
 1 file changed, 99 insertions(+), 156 deletions(-)

diff --git mmotm-2014-01-09-16-23.orig/mm/mempolicy.c mmotm-2014-01-09-16-23/mm/mempolicy.c
index 9bfb1a020aa6..1007bed55678 100644
--- mmotm-2014-01-09-16-23.orig/mm/mempolicy.c
+++ mmotm-2014-01-09-16-23/mm/mempolicy.c
@@ -476,140 +476,66 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags);
 
+struct queue_pages {
+	struct list_head *pagelist;
+	unsigned long flags;
+	nodemask_t *nmask;
+	struct vm_area_struct *prev;
+};
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
  */
-static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
+static int queue_pages_pte(pte_t *pte, unsigned long addr,
+			unsigned long next, struct mm_walk *walk)
 {
-	pte_t *orig_pte;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-		int nid;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page;
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
+	int nid;
 
-		if (!pte_present(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		/*
-		 * vm_normal_page() filters out zero pages, but there might
-		 * still be PageReserved pages to skip, perhaps in a VDSO.
-		 */
-		if (PageReserved(page))
-			continue;
-		nid = page_to_nid(page);
-		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-			continue;
+	if (!pte_present(*pte))
+		return 0;
+	page = vm_normal_page(vma, addr, *pte);
+	if (!page)
+		return 0;
+	/*
+	 * vm_normal_page() filters out zero pages, but there might
+	 * still be PageReserved pages to skip, perhaps in a VDSO.
+	 */
+	if (PageReserved(page))
+		return 0;
+	nid = page_to_nid(page);
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 
-		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
-			migrate_page_add(page, private, flags);
-		else
-			break;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(orig_pte, ptl);
-	return addr != end;
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_page_add(page, qp->pagelist, flags);
+	return 0;
 }
 
-static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
-		pmd_t *pmd, const nodemask_t *nodes, unsigned long flags,
-				    void *private)
+static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
+				unsigned long next, struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
 	int nid;
 	struct page *page;
-	spinlock_t *ptl;
 
-	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
-	page = pte_page(huge_ptep_get((pte_t *)pmd));
+	page = pte_page(huge_ptep_get(pte));
 	nid = page_to_nid(page);
-	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-		goto unlock;
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
-		isolate_huge_page(page, private);
-unlock:
-	spin_unlock(ptl);
+		isolate_huge_page(page, qp->pagelist);
 #else
 	BUG();
 #endif
-}
-
-static inline int queue_pages_pmd_range(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pmd_t *pmd;
-	unsigned long next;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		if (!pmd_present(*pmd))
-			continue;
-		if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
-			queue_pages_hugetlb_pmd_range(vma, pmd, nodes,
-						flags, private);
-			continue;
-		}
-		split_huge_page_pmd(vma, addr, pmd);
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-			continue;
-		if (queue_pages_pte_range(vma, pmd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pmd++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pud_t *pud;
-	unsigned long next;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_huge(*pud) && is_vm_hugetlb_page(vma))
-			continue;
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		if (queue_pages_pmd_range(vma, pud, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pud++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pgd_range(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pgd_t *pgd;
-	unsigned long next;
-
-	pgd = pgd_offset(vma->vm_mm, addr);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		if (queue_pages_pud_range(vma, pgd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pgd++, addr = next, addr != end);
 	return 0;
 }
 
@@ -643,6 +569,45 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
 
+static int queue_pages_test_walk(unsigned long start, unsigned long end,
+				struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct queue_pages *qp = walk->private;
+	unsigned long endvma = vma->vm_end;
+	unsigned long flags = qp->flags;
+
+	if (endvma > end)
+		endvma = end;
+	if (vma->vm_start > start)
+		start = vma->vm_start;
+
+	if (!(flags & MPOL_MF_DISCONTIG_OK)) {
+		if (!vma->vm_next && vma->vm_end < end)
+			return -EFAULT;
+		if (qp->prev && qp->prev->vm_end < vma->vm_start)
+			return -EFAULT;
+	}
+
+	qp->prev = vma;
+	walk->skip = 1;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 0;
+
+	if (flags & MPOL_MF_LAZY) {
+		change_prot_numa(vma, start, endvma);
+		return 0;
+	}
+
+	if ((flags & MPOL_MF_STRICT) ||
+	    ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
+	     vma_migratable(vma)))
+		/* queue pages from current vma */
+		walk->skip = 0;
+	return 0;
+}
+
 /*
  * Walk through page tables and collect pages to be migrated.
  *
@@ -652,51 +617,29 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
  */
 static struct vm_area_struct *
 queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags, void *private)
+		nodemask_t *nodes, unsigned long flags,
+		struct list_head *pagelist)
 {
 	int err;
-	struct vm_area_struct *first, *vma, *prev;
-
-
-	first = find_vma(mm, start);
-	if (!first)
-		return ERR_PTR(-EFAULT);
-	prev = NULL;
-	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
-		unsigned long endvma = vma->vm_end;
-
-		if (endvma > end)
-			endvma = end;
-		if (vma->vm_start > start)
-			start = vma->vm_start;
-
-		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
-			if (!vma->vm_next && vma->vm_end < end)
-				return ERR_PTR(-EFAULT);
-			if (prev && prev->vm_end < vma->vm_start)
-				return ERR_PTR(-EFAULT);
-		}
-
-		if (flags & MPOL_MF_LAZY) {
-			change_prot_numa(vma, start, endvma);
-			goto next;
-		}
-
-		if ((flags & MPOL_MF_STRICT) ||
-		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-		      vma_migratable(vma))) {
-
-			err = queue_pages_pgd_range(vma, start, endvma, nodes,
-						flags, private);
-			if (err) {
-				first = ERR_PTR(err);
-				break;
-			}
-		}
-next:
-		prev = vma;
-	}
-	return first;
+	struct queue_pages qp = {
+		.pagelist = pagelist,
+		.flags = flags,
+		.nmask = nodes,
+		.prev = NULL,
+	};
+	struct mm_walk queue_pages_walk = {
+		.hugetlb_entry = queue_pages_hugetlb,
+		.pte_entry = queue_pages_pte,
+		.test_walk = queue_pages_test_walk,
+		.mm = mm,
+		.private = &qp,
+	};
+
+	err = walk_page_range(start, end, &queue_pages_walk);
+	if (err < 0)
+		return ERR_PTR(err);
+	else
+		return find_vma(mm, start);
 }
 
 /*
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
  2013-12-11 22:08 [PATCH 00/11 v3] update page table walker Naoya Horiguchi
@ 2013-12-11 22:09 ` Naoya Horiguchi
  0 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2013-12-11 22:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

queue_pages_range() does page table walking in its own way now,
so this patch rewrites it with walk_page_range().
One difficulty was that queue_pages_range() needed to check vmas
to determine whether we queue pages from a given vma or skip it.
Now we have test_walk() callback in mm_walk for that purpose,
so we can do the replacement cleanly. queue_pages_test_walk()
depends on not only the current vma but also the previous one,
so we use queue_pages->prev to keep it.

ChangeLog v2:
- rebase onto mmots
- add VM_PFNMAP check on queue_pages_test_walk()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 255 ++++++++++++++++++++++-----------------------------------
 1 file changed, 99 insertions(+), 156 deletions(-)

diff --git v3.13-rc3-mmots-2013-12-10-16-38.orig/mm/mempolicy.c v3.13-rc3-mmots-2013-12-10-16-38/mm/mempolicy.c
index 9f73b29d304d..281fd12e9767 100644
--- v3.13-rc3-mmots-2013-12-10-16-38.orig/mm/mempolicy.c
+++ v3.13-rc3-mmots-2013-12-10-16-38/mm/mempolicy.c
@@ -476,140 +476,66 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags);
 
+struct queue_pages {
+	struct list_head *pagelist;
+	unsigned long flags;
+	nodemask_t *nmask;
+	struct vm_area_struct *prev;
+};
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
  */
-static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
+static int queue_pages_pte(pte_t *pte, unsigned long addr,
+			unsigned long next, struct mm_walk *walk)
 {
-	pte_t *orig_pte;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-		int nid;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page;
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
+	int nid;
 
-		if (!pte_present(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		/*
-		 * vm_normal_page() filters out zero pages, but there might
-		 * still be PageReserved pages to skip, perhaps in a VDSO.
-		 */
-		if (PageReserved(page))
-			continue;
-		nid = page_to_nid(page);
-		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-			continue;
+	if (!pte_present(*pte))
+		return 0;
+	page = vm_normal_page(vma, addr, *pte);
+	if (!page)
+		return 0;
+	/*
+	 * vm_normal_page() filters out zero pages, but there might
+	 * still be PageReserved pages to skip, perhaps in a VDSO.
+	 */
+	if (PageReserved(page))
+		return 0;
+	nid = page_to_nid(page);
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 
-		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
-			migrate_page_add(page, private, flags);
-		else
-			break;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(orig_pte, ptl);
-	return addr != end;
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_page_add(page, qp->pagelist, flags);
+	return 0;
 }
 
-static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
-		pmd_t *pmd, const nodemask_t *nodes, unsigned long flags,
-				    void *private)
+static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
+				unsigned long next, struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
 	int nid;
 	struct page *page;
-	spinlock_t *ptl;
 
-	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
-	page = pte_page(huge_ptep_get((pte_t *)pmd));
+	page = pte_page(huge_ptep_get(pte));
 	nid = page_to_nid(page);
-	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-		goto unlock;
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
-		isolate_huge_page(page, private);
-unlock:
-	spin_unlock(ptl);
+		isolate_huge_page(page, qp->pagelist);
 #else
 	BUG();
 #endif
-}
-
-static inline int queue_pages_pmd_range(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pmd_t *pmd;
-	unsigned long next;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		if (!pmd_present(*pmd))
-			continue;
-		if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
-			queue_pages_hugetlb_pmd_range(vma, pmd, nodes,
-						flags, private);
-			continue;
-		}
-		split_huge_page_pmd(vma, addr, pmd);
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-			continue;
-		if (queue_pages_pte_range(vma, pmd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pmd++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pud_t *pud;
-	unsigned long next;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_huge(*pud) && is_vm_hugetlb_page(vma))
-			continue;
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		if (queue_pages_pmd_range(vma, pud, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pud++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pgd_range(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pgd_t *pgd;
-	unsigned long next;
-
-	pgd = pgd_offset(vma->vm_mm, addr);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		if (queue_pages_pud_range(vma, pgd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pgd++, addr = next, addr != end);
 	return 0;
 }
 
@@ -642,6 +568,45 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+static int queue_pages_test_walk(unsigned long start, unsigned long end,
+				struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct queue_pages *qp = walk->private;
+	unsigned long endvma = vma->vm_end;
+	unsigned long flags = qp->flags;
+
+	if (endvma > end)
+		endvma = end;
+	if (vma->vm_start > start)
+		start = vma->vm_start;
+
+	if (!(flags & MPOL_MF_DISCONTIG_OK)) {
+		if (!vma->vm_next && vma->vm_end < end)
+			return -EFAULT;
+		if (qp->prev && qp->prev->vm_end < vma->vm_start)
+			return -EFAULT;
+	}
+
+	qp->prev = vma;
+	walk->skip = 1;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 0;
+
+	if (flags & MPOL_MF_LAZY) {
+		change_prot_numa(vma, start, endvma);
+		return 0;
+	}
+
+	if ((flags & MPOL_MF_STRICT) ||
+	    ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
+	     vma_migratable(vma)))
+		/* queue pages from current vma */
+		walk->skip = 0;
+	return 0;
+}
+
 /*
  * Walk through page tables and collect pages to be migrated.
  *
@@ -651,51 +616,29 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
  */
 static struct vm_area_struct *
 queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags, void *private)
+		nodemask_t *nodes, unsigned long flags,
+		struct list_head *pagelist)
 {
 	int err;
-	struct vm_area_struct *first, *vma, *prev;
-
-
-	first = find_vma(mm, start);
-	if (!first)
-		return ERR_PTR(-EFAULT);
-	prev = NULL;
-	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
-		unsigned long endvma = vma->vm_end;
-
-		if (endvma > end)
-			endvma = end;
-		if (vma->vm_start > start)
-			start = vma->vm_start;
-
-		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
-			if (!vma->vm_next && vma->vm_end < end)
-				return ERR_PTR(-EFAULT);
-			if (prev && prev->vm_end < vma->vm_start)
-				return ERR_PTR(-EFAULT);
-		}
-
-		if (flags & MPOL_MF_LAZY) {
-			change_prot_numa(vma, start, endvma);
-			goto next;
-		}
-
-		if ((flags & MPOL_MF_STRICT) ||
-		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-		      vma_migratable(vma))) {
-
-			err = queue_pages_pgd_range(vma, start, endvma, nodes,
-						flags, private);
-			if (err) {
-				first = ERR_PTR(err);
-				break;
-			}
-		}
-next:
-		prev = vma;
-	}
-	return first;
+	struct queue_pages qp = {
+		.pagelist = pagelist,
+		.flags = flags,
+		.nmask = nodes,
+		.prev = NULL,
+	};
+	struct mm_walk queue_pages_walk = {
+		.hugetlb_entry = queue_pages_hugetlb,
+		.pte_entry = queue_pages_pte,
+		.test_walk = queue_pages_test_walk,
+		.mm = mm,
+		.private = &qp,
+	};
+
+	err = walk_page_range(start, end, &queue_pages_walk);
+	if (err < 0)
+		return ERR_PTR(err);
+	else
+		return find_vma(mm, start);
 }
 
 /*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
  2013-10-30 21:44 [PATCH 00/11 v2] update page table walker Naoya Horiguchi
@ 2013-10-30 21:44 ` Naoya Horiguchi
  0 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2013-10-30 21:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, Rik van Riel, kirill.shutemov,
	linux-kernel

queue_pages_range() does page table walking in its own way now,
so this patch rewrites it with walk_page_range().
One difficulty was that queue_pages_range() needed to check vmas
to determine whether we queue pages from a given vma or skip it.
Now we have the test_walk() callback in mm_walk for that purpose,
so we can do the replacement cleanly. queue_pages_test_walk()
depends not only on the current vma but also on the previous one,
so we use queue_pages->prev to keep track of it.
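
A minimal sketch of that protocol (illustrative only, with hypothetical
example_* names; it assumes the walk->vma and walk->skip semantics
introduced earlier in this series, where test_walk() sets walk->skip to
1 to make the core skip the current vma and leaves it 0 so the entry
callbacks run):

#include <linux/mm.h>

/*
 * Illustrative sketch only: count the present ptes in the
 * non-PFN-mapped vmas of a range. The example_* names are
 * hypothetical; the caller is assumed to hold mmap_sem as usual
 * for walk_page_range().
 */
static int example_pte(pte_t *pte, unsigned long addr,
			unsigned long next, struct mm_walk *walk)
{
	unsigned long *count = walk->private;

	if (pte_present(*pte))
		(*count)++;
	return 0;
}

static int example_test_walk(unsigned long start, unsigned long end,
			struct mm_walk *walk)
{
	/* Skip PFN-mapped vmas, walk everything else. */
	walk->skip = !!(walk->vma->vm_flags & VM_PFNMAP);
	return 0;
}

static unsigned long example_count_present(struct mm_struct *mm,
			unsigned long start, unsigned long end)
{
	unsigned long count = 0;
	struct mm_walk walk = {
		.pte_entry = example_pte,
		.test_walk = example_test_walk,
		.mm = mm,
		.private = &count,
	};

	walk_page_range(start, end, &walk);
	return count;
}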

ChangeLog v2:
- rebase onto mmots
- add VM_PFNMAP check on queue_pages_test_walk()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 255 ++++++++++++++++++++++-----------------------------------
 1 file changed, 99 insertions(+), 156 deletions(-)

diff --git v3.12-rc7-mmots-2013-10-29-16-24.orig/mm/mempolicy.c v3.12-rc7-mmots-2013-10-29-16-24/mm/mempolicy.c
index f8f9790..913df80 100644
--- v3.12-rc7-mmots-2013-10-29-16-24.orig/mm/mempolicy.c
+++ v3.12-rc7-mmots-2013-10-29-16-24/mm/mempolicy.c
@@ -476,140 +476,66 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags);
 
+struct queue_pages {
+	struct list_head *pagelist;
+	unsigned long flags;
+	nodemask_t *nmask;
+	struct vm_area_struct *prev;
+};
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
  */
-static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
+static int queue_pages_pte(pte_t *pte, unsigned long addr,
+			unsigned long next, struct mm_walk *walk)
 {
-	pte_t *orig_pte;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-		int nid;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page;
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
+	int nid;
 
-		if (!pte_present(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		/*
-		 * vm_normal_page() filters out zero pages, but there might
-		 * still be PageReserved pages to skip, perhaps in a VDSO.
-		 */
-		if (PageReserved(page))
-			continue;
-		nid = page_to_nid(page);
-		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-			continue;
+	if (!pte_present(*pte))
+		return 0;
+	page = vm_normal_page(vma, addr, *pte);
+	if (!page)
+		return 0;
+	/*
+	 * vm_normal_page() filters out zero pages, but there might
+	 * still be PageReserved pages to skip, perhaps in a VDSO.
+	 */
+	if (PageReserved(page))
+		return 0;
+	nid = page_to_nid(page);
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 
-		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
-			migrate_page_add(page, private, flags);
-		else
-			break;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(orig_pte, ptl);
-	return addr != end;
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_page_add(page, qp->pagelist, flags);
+	return 0;
 }
 
-static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
-		pmd_t *pmd, const nodemask_t *nodes, unsigned long flags,
-				    void *private)
+static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
+				unsigned long next, struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
 	int nid;
 	struct page *page;
-	spinlock_t *ptl;
 
-	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
-	page = pte_page(huge_ptep_get((pte_t *)pmd));
+	page = pte_page(huge_ptep_get(pte));
 	nid = page_to_nid(page);
-	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-		goto unlock;
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
-		isolate_huge_page(page, private);
-unlock:
-	spin_unlock(ptl);
+		isolate_huge_page(page, qp->pagelist);
 #else
 	BUG();
 #endif
-}
-
-static inline int queue_pages_pmd_range(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pmd_t *pmd;
-	unsigned long next;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		if (!pmd_present(*pmd))
-			continue;
-		if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
-			queue_pages_hugetlb_pmd_range(vma, pmd, nodes,
-						flags, private);
-			continue;
-		}
-		split_huge_page_pmd(vma, addr, pmd);
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-			continue;
-		if (queue_pages_pte_range(vma, pmd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pmd++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pud_t *pud;
-	unsigned long next;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_huge(*pud) && is_vm_hugetlb_page(vma))
-			continue;
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		if (queue_pages_pmd_range(vma, pud, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pud++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pgd_range(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pgd_t *pgd;
-	unsigned long next;
-
-	pgd = pgd_offset(vma->vm_mm, addr);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		if (queue_pages_pud_range(vma, pgd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pgd++, addr = next, addr != end);
 	return 0;
 }
 
@@ -643,6 +569,45 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
 
+static int queue_pages_test_walk(unsigned long start, unsigned long end,
+				struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct queue_pages *qp = walk->private;
+	unsigned long endvma = vma->vm_end;
+	unsigned long flags = qp->flags;
+
+	if (endvma > end)
+		endvma = end;
+	if (vma->vm_start > start)
+		start = vma->vm_start;
+
+	if (!(flags & MPOL_MF_DISCONTIG_OK)) {
+		if (!vma->vm_next && vma->vm_end < end)
+			return -EFAULT;
+		if (qp->prev && qp->prev->vm_end < vma->vm_start)
+			return -EFAULT;
+	}
+
+	qp->prev = vma;
+	walk->skip = 1;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 0;
+
+	if (flags & MPOL_MF_LAZY) {
+		change_prot_numa(vma, start, endvma);
+		return 0;
+	}
+
+	if ((flags & MPOL_MF_STRICT) ||
+	    ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
+	     vma_migratable(vma)))
+		/* queue pages from current vma */
+		walk->skip = 0;
+	return 0;
+}
+
 /*
  * Walk through page tables and collect pages to be migrated.
  *
@@ -652,51 +617,29 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
  */
 static struct vm_area_struct *
 queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags, void *private)
+		nodemask_t *nodes, unsigned long flags,
+		struct list_head *pagelist)
 {
 	int err;
-	struct vm_area_struct *first, *vma, *prev;
-
-
-	first = find_vma(mm, start);
-	if (!first)
-		return ERR_PTR(-EFAULT);
-	prev = NULL;
-	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
-		unsigned long endvma = vma->vm_end;
-
-		if (endvma > end)
-			endvma = end;
-		if (vma->vm_start > start)
-			start = vma->vm_start;
-
-		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
-			if (!vma->vm_next && vma->vm_end < end)
-				return ERR_PTR(-EFAULT);
-			if (prev && prev->vm_end < vma->vm_start)
-				return ERR_PTR(-EFAULT);
-		}
-
-		if (flags & MPOL_MF_LAZY) {
-			change_prot_numa(vma, start, endvma);
-			goto next;
-		}
-
-		if ((flags & MPOL_MF_STRICT) ||
-		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-		      vma_migratable(vma))) {
-
-			err = queue_pages_pgd_range(vma, start, endvma, nodes,
-						flags, private);
-			if (err) {
-				first = ERR_PTR(err);
-				break;
-			}
-		}
-next:
-		prev = vma;
-	}
-	return first;
+	struct queue_pages qp = {
+		.pagelist = pagelist,
+		.flags = flags,
+		.nmask = nodes,
+		.prev = NULL,
+	};
+	struct mm_walk queue_pages_walk = {
+		.hugetlb_entry = queue_pages_hugetlb,
+		.pte_entry = queue_pages_pte,
+		.test_walk = queue_pages_test_walk,
+		.mm = mm,
+		.private = &qp,
+	};
+
+	err = walk_page_range(start, end, &queue_pages_walk);
+	if (err < 0)
+		return ERR_PTR(err);
+	else
+		return find_vma(mm, start);
 }
 
 /*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()
  2013-10-14 17:36 [PATCH 0/11] update page table walker Naoya Horiguchi
@ 2013-10-14 17:37 ` Naoya Horiguchi
  0 siblings, 0 replies; 37+ messages in thread
From: Naoya Horiguchi @ 2013-10-14 17:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Matt Mackall, Cliff Wickman, KOSAKI Motohiro,
	Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko,
	Aneesh Kumar K.V, Pavel Emelyanov, linux-kernel

queue_pages_range() does page table walking in its own way now,
so this patch rewrites it with walk_page_range().
One difficulty was that queue_pages_range() needed to check vmas
to determine whether we queue pages from a given vma or skip it.
Now we have the test_walk() callback in mm_walk for that purpose,
so we can do the replacement cleanly. queue_pages_test_walk()
depends not only on the current vma but also on the previous one,
so we use queue_pages->prev to remember it.
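
For context, a hypothetical caller of the converted interface might
look as follows (sketch only, assumed to live in mm/mempolicy.c where
queue_pages_range() is static; real callers such as do_mbind() pass
their own flags and then migrate or put back the collected pages):

/*
 * Hypothetical caller sketch, not part of the patch: the converted
 * queue_pages_range() fills @pagelist with the pages selected for
 * migration and returns the first vma of the range, or an ERR_PTR()
 * value on error.
 */
static int example_collect(struct mm_struct *mm, unsigned long start,
			unsigned long end, nodemask_t *nodes,
			unsigned long flags)
{
	LIST_HEAD(pagelist);
	struct vm_area_struct *vma;

	vma = queue_pages_range(mm, start, end, nodes, flags, &pagelist);
	if (IS_ERR(vma))
		return PTR_ERR(vma);

	/* a real caller would now migrate or put back @pagelist */
	return 0;
}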

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/mempolicy.c | 251 ++++++++++++++++++++++-----------------------------------
 1 file changed, 96 insertions(+), 155 deletions(-)

diff --git v3.12-rc4.orig/mm/mempolicy.c v3.12-rc4/mm/mempolicy.c
index 0472964..2f1889f 100644
--- v3.12-rc4.orig/mm/mempolicy.c
+++ v3.12-rc4/mm/mempolicy.c
@@ -476,139 +476,66 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags);
 
+struct queue_pages {
+	struct list_head *pagelist;
+	unsigned long flags;
+	nodemask_t *nmask;
+	struct vm_area_struct *prev;
+};
+
 /*
  * Scan through pages checking if pages follow certain conditions,
  * and move them to the pagelist if they do.
  */
-static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
+static int queue_pages_pte(pte_t *pte, unsigned long addr,
+			unsigned long next, struct mm_walk *walk)
 {
-	pte_t *orig_pte;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	do {
-		struct page *page;
-		int nid;
+	struct vm_area_struct *vma = walk->vma;
+	struct page *page;
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
+	int nid;
 
-		if (!pte_present(*pte))
-			continue;
-		page = vm_normal_page(vma, addr, *pte);
-		if (!page)
-			continue;
-		/*
-		 * vm_normal_page() filters out zero pages, but there might
-		 * still be PageReserved pages to skip, perhaps in a VDSO.
-		 */
-		if (PageReserved(page))
-			continue;
-		nid = page_to_nid(page);
-		if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-			continue;
+	if (!pte_present(*pte))
+		return 0;
+	page = vm_normal_page(vma, addr, *pte);
+	if (!page)
+		return 0;
+	/*
+	 * vm_normal_page() filters out zero pages, but there might
+	 * still be PageReserved pages to skip, perhaps in a VDSO.
+	 */
+	if (PageReserved(page))
+		return 0;
+	nid = page_to_nid(page);
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 
-		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
-			migrate_page_add(page, private, flags);
-		else
-			break;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(orig_pte, ptl);
-	return addr != end;
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_page_add(page, qp->pagelist, flags);
+	return 0;
 }
 
-static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
-		pmd_t *pmd, const nodemask_t *nodes, unsigned long flags,
-				    void *private)
+static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
+				unsigned long next, struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
+	struct queue_pages *qp = walk->private;
+	unsigned long flags = qp->flags;
 	int nid;
 	struct page *page;
 
-	spin_lock(&vma->vm_mm->page_table_lock);
-	page = pte_page(huge_ptep_get((pte_t *)pmd));
+	page = pte_page(huge_ptep_get(pte));
 	nid = page_to_nid(page);
-	if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
-		goto unlock;
+	if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+		return 0;
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
-		isolate_huge_page(page, private);
-unlock:
-	spin_unlock(&vma->vm_mm->page_table_lock);
+		isolate_huge_page(page, qp->pagelist);
 #else
 	BUG();
 #endif
-}
-
-static inline int queue_pages_pmd_range(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pmd_t *pmd;
-	unsigned long next;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		if (!pmd_present(*pmd))
-			continue;
-		if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
-			queue_pages_hugetlb_pmd_range(vma, pmd, nodes,
-						flags, private);
-			continue;
-		}
-		split_huge_page_pmd(vma, addr, pmd);
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-			continue;
-		if (queue_pages_pte_range(vma, pmd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pmd++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pud_t *pud;
-	unsigned long next;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_huge(*pud) && is_vm_hugetlb_page(vma))
-			continue;
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		if (queue_pages_pmd_range(vma, pud, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pud++, addr = next, addr != end);
-	return 0;
-}
-
-static inline int queue_pages_pgd_range(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags,
-		void *private)
-{
-	pgd_t *pgd;
-	unsigned long next;
-
-	pgd = pgd_offset(vma->vm_mm, addr);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		if (queue_pages_pud_range(vma, pgd, addr, next, nodes,
-				    flags, private))
-			return -EIO;
-	} while (pgd++, addr = next, addr != end);
 	return 0;
 }
 
@@ -642,6 +569,42 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
 
+static int queue_pages_test_walk(unsigned long start, unsigned long end,
+				struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct queue_pages *qp = walk->private;
+	unsigned long endvma = vma->vm_end;
+	unsigned long flags = qp->flags;
+
+	if (endvma > end)
+		endvma = end;
+	if (vma->vm_start > start)
+		start = vma->vm_start;
+
+	if (!(flags & MPOL_MF_DISCONTIG_OK)) {
+		if (!vma->vm_next && vma->vm_end < end)
+			return -EFAULT;
+		if (qp->prev && qp->prev->vm_end < vma->vm_start)
+			return -EFAULT;
+	}
+
+	qp->prev = vma;
+	walk->skip = 1;
+
+	if (flags & MPOL_MF_LAZY) {
+		change_prot_numa(vma, start, endvma);
+		return 0;
+	}
+
+	if ((flags & MPOL_MF_STRICT) ||
+	    ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
+	     vma_migratable(vma)))
+		/* queue pages from current vma */
+		walk->skip = 0;
+	return 0;
+}
+
 /*
  * Walk through page tables and collect pages to be migrated.
  *
@@ -651,51 +614,29 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
  */
 static struct vm_area_struct *
 queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
-		const nodemask_t *nodes, unsigned long flags, void *private)
+		nodemask_t *nodes, unsigned long flags,
+		struct list_head *pagelist)
 {
 	int err;
-	struct vm_area_struct *first, *vma, *prev;
-
-
-	first = find_vma(mm, start);
-	if (!first)
-		return ERR_PTR(-EFAULT);
-	prev = NULL;
-	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
-		unsigned long endvma = vma->vm_end;
-
-		if (endvma > end)
-			endvma = end;
-		if (vma->vm_start > start)
-			start = vma->vm_start;
-
-		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
-			if (!vma->vm_next && vma->vm_end < end)
-				return ERR_PTR(-EFAULT);
-			if (prev && prev->vm_end < vma->vm_start)
-				return ERR_PTR(-EFAULT);
-		}
-
-		if (flags & MPOL_MF_LAZY) {
-			change_prot_numa(vma, start, endvma);
-			goto next;
-		}
-
-		if ((flags & MPOL_MF_STRICT) ||
-		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-		      vma_migratable(vma))) {
-
-			err = queue_pages_pgd_range(vma, start, endvma, nodes,
-						flags, private);
-			if (err) {
-				first = ERR_PTR(err);
-				break;
-			}
-		}
-next:
-		prev = vma;
-	}
-	return first;
+	struct queue_pages qp = {
+		.pagelist = pagelist,
+		.flags = flags,
+		.nmask = nodes,
+		.prev = NULL,
+	};
+	struct mm_walk queue_pages_walk = {
+		.hugetlb_entry = queue_pages_hugetlb,
+		.pte_entry = queue_pages_pte,
+		.test_walk = queue_pages_test_walk,
+		.mm = mm,
+		.private = &qp,
+	};
+
+	err = walk_page_range(start, end, &queue_pages_walk);
+	if (err < 0)
+		return ERR_PTR(err);
+	else
+		return find_vma(mm, start);
 }
 
 /*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2014-06-03  0:34 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-10 21:44 [PATCH 00/11 v5] update page table walker Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 01/11] pagewalk: update page table walker core Naoya Horiguchi
2014-02-12  5:39   ` Joonsoo Kim
2014-02-12 15:40     ` Naoya Horiguchi
2014-02-20 23:47   ` Sasha Levin
2014-02-21  3:20     ` Naoya Horiguchi
2014-02-21  4:30     ` Sasha Levin
     [not found]     ` <5306c629.012ce50a.6c48.ffff9844SMTPIN_ADDED_BROKEN@mx.google.com>
2014-02-21  6:43       ` Sasha Levin
2014-02-21 16:35         ` Naoya Horiguchi
     [not found]         ` <1393000553-ocl81482@n-horiguchi@ah.jp.nec.com>
2014-02-21 16:50           ` Sasha Levin
2014-06-02 23:49   ` Dave Hansen
2014-06-03  0:29     ` Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 02/11] pagewalk: add walk_page_vma() Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 03/11] smaps: redefine callback functions for page table walker Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 04/11] clear_refs: " Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 05/11] pagemap: " Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 06/11] numa_maps: " Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 07/11] memcg: " Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 08/11] madvise: " Naoya Horiguchi
2014-03-21  1:47   ` Sasha Levin
2014-03-21  2:43     ` [PATCH] madvise: fix locking in force_swapin_readahead() (Re: [PATCH 08/11] madvise: redefine callback functions for page table walker) Naoya Horiguchi
2014-03-21  5:16       ` Hugh Dickins
2014-03-21  6:22         ` Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 09/11] arch/powerpc/mm/subpage-prot.c: use walk_page_vma() instead of walk_page_range() Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 10/11] pagewalk: remove argument hmask from hugetlb_entry() Naoya Horiguchi
2014-02-10 21:44 ` [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range() Naoya Horiguchi
2014-02-21  6:30   ` Sasha Levin
2014-02-21 16:58     ` Naoya Horiguchi
     [not found]     ` <530785b2.d55c8c0a.3868.ffffa4e1SMTPIN_ADDED_BROKEN@mx.google.com>
2014-02-21 17:18       ` Sasha Levin
2014-02-21 17:25         ` Naoya Horiguchi
     [not found]         ` <1393003512-qjyhnu0@n-horiguchi@ah.jp.nec.com>
2014-02-23 13:04           ` Sasha Levin
2014-02-23 18:59             ` Naoya Horiguchi
2014-02-10 22:42 ` [PATCH 00/11 v5] update page table walker Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2014-01-13 16:54 [PATCH 00/11 v4] " Naoya Horiguchi
2014-01-13 16:54 ` [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range() Naoya Horiguchi
2013-12-11 22:08 [PATCH 00/11 v3] update page table walker Naoya Horiguchi
2013-12-11 22:09 ` [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range() Naoya Horiguchi
2013-10-30 21:44 [PATCH 00/11 v2] update page table walker Naoya Horiguchi
2013-10-30 21:44 ` [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range() Naoya Horiguchi
2013-10-14 17:36 [PATCH 0/11] update page table walker Naoya Horiguchi
2013-10-14 17:37 ` [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range() Naoya Horiguchi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).