* [PATCH 0/6 v4] pagemap handles transparent hugepage
@ 2012-01-27 23:02 Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Hi,

I rebased the patchset onto 3.3-rc1 and made some fixes to the thp
optimization patch based on the feedback from Andrea.

Naoya Horiguchi (6):
  pagemap: avoid splitting thp when reading /proc/pid/pagemap
  thp: optimize away unnecessary page table locking
  pagemap: export KPF_THP
  pagemap: document KPF_THP and make page-types aware of it
  introduce thp_ptep_get()
  pagemap: introduce data structure for pagemap entry

 Documentation/vm/page-types.c     |    2 +
 Documentation/vm/pagemap.txt      |    4 +
 arch/x86/include/asm/pgtable.h    |    5 ++
 fs/proc/page.c                    |    2 +
 fs/proc/task_mmu.c                |  135 +++++++++++++++++++++----------------
 include/asm-generic/pgtable.h     |    4 +
 include/linux/huge_mm.h           |   17 +++++
 include/linux/kernel-page-flags.h |    1 +
 mm/huge_memory.c                  |  120 +++++++++++++++-----------------
 9 files changed, 169 insertions(+), 121 deletions(-)
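
For context, the interface this series optimizes is read by user space as an
array of 64-bit entries, one per virtual page (bit 63 = present, bits 0-54 =
pfn when present; see Documentation/vm/pagemap.txt). A minimal illustrative
reader, not part of this series (error handling kept short):

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *buf = malloc(4UL << 20);
	uint64_t entry;
	off_t off;
	int fd;

	if (!buf)
		return 1;
	buf[0] = 1;				/* fault the first page in */
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return 1;
	/* one 64-bit entry per virtual page, indexed by vaddr / page size */
	off = (off_t)((uintptr_t)buf / psize) * sizeof(entry);
	if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
		return 1;
	printf("present=%d pfn=0x%" PRIx64 "\n",
	       (int)(entry >> 63), entry & ((1ULL << 55) - 1));
	close(fd);
	return 0;
}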

Thanks,
Naoya


* [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-29 13:17   ` Hillf Danton
  2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Splitting a thp is not necessary if we explicitly check whether a pmd
maps a thp. This patch introduces that check and adds code to generate
pagemap entries directly from pmds mapping thps, which reduces the
performance impact of pagemap on thp.
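
To make the effect concrete (numbers assume x86_64 with 4KB pages and
2MB thps): a huge pmd covers PMD_SIZE/PAGE_SIZE = 512 pagemap entries,
and for each address addr in that range the new code emits

	offset = (addr & ~PAGEMAP_WALK_MASK) >> PAGE_SHIFT;	/* 0 .. 511 */
	entry  = PM_PFRAME(pmd_pfn(*pmd) + offset)
		 | PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;

so every subpage reports a consecutive pfn without splitting the thp.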

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changes since v3:
  - Generate pagemap entry directly from pmd to avoid messy casting

Changes since v2:
  - Add comment on if check in thp_pte_to_pagemap_entry()
  - Convert type of offset into unsigned long

Changes since v1:
  - Move pfn declaration to the beginning of pagemap_pte_range()
---
 fs/proc/task_mmu.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 47 insertions(+), 6 deletions(-)

diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
index e418c5a..cfbba8d 100644
--- 3.3-rc1.orig/fs/proc/task_mmu.c
+++ 3.3-rc1/fs/proc/task_mmu.c
@@ -600,6 +600,9 @@ struct pagemapread {
 	u64 *buffer;
 };
 
+#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
+#define PAGEMAP_WALK_MASK	(PMD_MASK)
+
 #define PM_ENTRY_BYTES      sizeof(u64)
 #define PM_STATUS_BITS      3
 #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
@@ -658,6 +661,27 @@ static u64 pte_to_pagemap_entry(pte_t pte)
 	return pme;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+{
+	u64 pme = 0;
+	/*
+	 * Currently pmd for thp is always present because thp can not be
+	 * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
+	 * This if-check is just to prepare for future implementation.
+	 */
+	if (pmd_present(pmd))
+		pme = PM_PFRAME(pmd_pfn(pmd) + offset)
+			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
+	return pme;
+}
+#else
+static inline u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+{
+	return 0;
+}
+#endif
+
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     struct mm_walk *walk)
 {
@@ -665,14 +689,33 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-
-	split_huge_page_pmd(walk->mm, pmd);
+	u64 pfn = PM_NOT_PRESENT;
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
-	for (; addr != end; addr += PAGE_SIZE) {
-		u64 pfn = PM_NOT_PRESENT;
 
+	spin_lock(&walk->mm->page_table_lock);
+	if (pmd_trans_huge(*pmd)) {
+		if (pmd_trans_splitting(*pmd)) {
+			spin_unlock(&walk->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+		} else {
+			for (; addr != end; addr += PAGE_SIZE) {
+				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
+					>> PAGE_SHIFT;
+				pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
+				err = add_to_pagemap(addr, pfn, pm);
+				if (err)
+					break;
+			}
+			spin_unlock(&walk->mm->page_table_lock);
+			return err;
+		}
+	} else {
+		spin_unlock(&walk->mm->page_table_lock);
+	}
+
+	for (; addr != end; addr += PAGE_SIZE) {
 		/* check to see if we've left 'vma' behind
 		 * and need a new, higher one */
 		if (vma && (addr >= vma->vm_end))
@@ -754,8 +797,6 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
-#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
-#define PAGEMAP_WALK_MASK	(PMD_MASK)
 static ssize_t pagemap_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
-- 
1.7.7.6



* [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-28 11:23   ` Hillf Danton
  2012-01-30  6:22   ` KAMEZAWA Hiroyuki
  2012-01-27 23:02 ` [PATCH 3/6] pagemap: export KPF_THP Naoya Horiguchi
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Currently, when we check whether we can handle a thp as it is or need
to split it into regular sized pages, we take the page table lock before
checking whether a given pmd maps a thp. Because of this, when the pmd
is not a "huge pmd" we suffer unnecessary lock/unlock overhead.
To remove it, this patch introduces an optimized check function and
replaces several similar code paths with it.
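
To summarize the calling convention introduced below (a sketch only;
handle_huge_pmd() stands in for each caller's thp-specific work):

	if (pmd_trans_huge_lock(pmd, vma) == 1) {
		/* *pmd maps a stable thp; page_table_lock is held here */
		handle_huge_pmd(pmd);
		spin_unlock(&vma->vm_mm->page_table_lock);
		return 0;
	}
	/*
	 * Not a huge pmd (or it was under splitting and the split has
	 * finished); the lock is not held, fall back to the pte-level path.
	 */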

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>

Changes since v3:
  - Fix likely/unlikely pattern in pmd_trans_huge_stable()
  - Change suffix from _stable to _lock
  - Introduce __pmd_trans_huge_lock() to avoid micro-regression
  - Return 1 when wait_split_huge_page path is taken

Changes since v2:
  - Fix missing "return 0" in "thp under splitting" path
  - Remove unneeded comment
  - Change the name of check function to describe what it does
  - Add VM_BUG_ON(mmap_sem)
---
 fs/proc/task_mmu.c      |   70 +++++++++------------------
 include/linux/huge_mm.h |   17 +++++++
 mm/huge_memory.c        |  120 ++++++++++++++++++++++-------------------------
 3 files changed, 96 insertions(+), 111 deletions(-)

diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
index cfbba8d..2622b64 100644
--- 3.3-rc1.orig/fs/proc/task_mmu.c
+++ 3.3-rc1/fs/proc/task_mmu.c
@@ -394,20 +394,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			smaps_pte_entry(*(pte_t *)pmd, addr,
-					HPAGE_PMD_SIZE, walk);
-			spin_unlock(&walk->mm->page_table_lock);
-			mss->anonymous_thp += HPAGE_PMD_SIZE;
-			return 0;
-		}
-	} else {
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
 		spin_unlock(&walk->mm->page_table_lock);
+		mss->anonymous_thp += HPAGE_PMD_SIZE;
+		return 0;
 	}
 	/*
 	 * The mmap_sem held all the way back in m_start() is what
@@ -694,25 +685,17 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			for (; addr != end; addr += PAGE_SIZE) {
-				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
-					>> PAGE_SHIFT;
-				pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
-				err = add_to_pagemap(addr, pfn, pm);
-				if (err)
-					break;
-			}
-			spin_unlock(&walk->mm->page_table_lock);
-			return err;
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		for (; addr != end; addr += PAGE_SIZE) {
+			unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
+				>> PAGE_SHIFT;
+			pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
+			err = add_to_pagemap(addr, pfn, pm);
+			if (err)
+				break;
 		}
-	} else {
 		spin_unlock(&walk->mm->page_table_lock);
+		return err;
 	}
 
 	for (; addr != end; addr += PAGE_SIZE) {
@@ -979,24 +962,17 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 
 	md = walk->private;
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(md->vma->anon_vma, pmd);
-		} else {
-			pte_t huge_pte = *(pte_t *)pmd;
-			struct page *page;
-
-			page = can_gather_numa_stats(huge_pte, md->vma, addr);
-			if (page)
-				gather_stats(page, md, pte_dirty(huge_pte),
-						HPAGE_PMD_SIZE/PAGE_SIZE);
-			spin_unlock(&walk->mm->page_table_lock);
-			return 0;
-		}
-	} else {
+
+	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
+		pte_t huge_pte = *(pte_t *)pmd;
+		struct page *page;
+
+		page = can_gather_numa_stats(huge_pte, md->vma, addr);
+		if (page)
+			gather_stats(page, md, pte_dirty(huge_pte),
+				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(&walk->mm->page_table_lock);
+		return 0;
 	}
 
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
diff --git 3.3-rc1.orig/include/linux/huge_mm.h 3.3-rc1/include/linux/huge_mm.h
index 1b92129..f56cacb 100644
--- 3.3-rc1.orig/include/linux/huge_mm.h
+++ 3.3-rc1/include/linux/huge_mm.h
@@ -113,6 +113,18 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    long adjust_next);
+extern int __pmd_trans_huge_lock(pmd_t *pmd,
+				 struct vm_area_struct *vma);
+/* mmap_sem must be held on entry */
+static inline int pmd_trans_huge_lock(pmd_t *pmd,
+				      struct vm_area_struct *vma)
+{
+	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+	if (pmd_trans_huge(*pmd))
+		return __pmd_trans_huge_lock(pmd, vma);
+	else
+		return 0;
+}
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
@@ -176,6 +188,11 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int pmd_trans_huge_lock(pmd_t *pmd,
+				      struct vm_area_struct *vma)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git 3.3-rc1.orig/mm/huge_memory.c 3.3-rc1/mm/huge_memory.c
index b3ffc21..cca6461 100644
--- 3.3-rc1.orig/mm/huge_memory.c
+++ 3.3-rc1/mm/huge_memory.c
@@ -1030,30 +1030,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	spin_lock(&tlb->mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&tlb->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma,
-					     pmd);
-		} else {
-			struct page *page;
-			pgtable_t pgtable;
-			pgtable = get_pmd_huge_pte(tlb->mm);
-			page = pmd_page(*pmd);
-			pmd_clear(pmd);
-			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-			page_remove_rmap(page);
-			VM_BUG_ON(page_mapcount(page) < 0);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-			VM_BUG_ON(!PageHead(page));
-			spin_unlock(&tlb->mm->page_table_lock);
-			tlb_remove_page(tlb, page);
-			pte_free(tlb->mm, pgtable);
-			ret = 1;
-		}
-	} else
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+		struct page *page;
+		pgtable_t pgtable;
+		pgtable = get_pmd_huge_pte(tlb->mm);
+		page = pmd_page(*pmd);
+		pmd_clear(pmd);
+		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		page_remove_rmap(page);
+		VM_BUG_ON(page_mapcount(page) < 0);
+		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		VM_BUG_ON(!PageHead(page));
 		spin_unlock(&tlb->mm->page_table_lock);
+		tlb_remove_page(tlb, page);
+		pte_free(tlb->mm, pgtable);
+		ret = 1;
+	}
 
 	return ret;
 }
@@ -1064,21 +1056,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	int ret = 0;
 
-	spin_lock(&vma->vm_mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		ret = !pmd_trans_splitting(*pmd);
-		spin_unlock(&vma->vm_mm->page_table_lock);
-		if (unlikely(!ret))
-			wait_split_huge_page(vma->anon_vma, pmd);
-		else {
-			/*
-			 * All logical pages in the range are present
-			 * if backed by a huge page.
-			 */
-			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
-		}
-	} else
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+		/*
+		 * All logical pages in the range are present
+		 * if backed by a huge page.
+		 */
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
+	}
 
 	return ret;
 }
@@ -1108,20 +1093,10 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		goto out;
 	}
 
-	spin_lock(&mm->page_table_lock);
-	if (likely(pmd_trans_huge(*old_pmd))) {
-		if (pmd_trans_splitting(*old_pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, old_pmd);
-			ret = -1;
-		} else {
-			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
-			VM_BUG_ON(!pmd_none(*new_pmd));
-			set_pmd_at(mm, new_addr, new_pmd, pmd);
-			spin_unlock(&mm->page_table_lock);
-			ret = 1;
-		}
-	} else {
+	if ((ret = __pmd_trans_huge_lock(old_pmd, vma)) == 1) {
+		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+		VM_BUG_ON(!pmd_none(*new_pmd));
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		spin_unlock(&mm->page_table_lock);
 	}
 out:
@@ -1134,24 +1109,41 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
 
-	spin_lock(&mm->page_table_lock);
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+		pmd_t entry;
+		entry = pmdp_get_and_clear(mm, addr, pmd);
+		entry = pmd_modify(entry, newprot);
+		set_pmd_at(mm, addr, pmd, entry);
+		spin_unlock(&vma->vm_mm->page_table_lock);
+		ret = 1;
+	}
+
+	return ret;
+}
+
+/*
+ * Returns 1 if a given pmd maps a stable (not under splitting) thp,
+ * -1 if the pmd maps thp under splitting, 0 if the pmd does not map thp.
+ *
+ * Note that if it returns 1, this routine returns without unlocking page
+ * table locks. So callers must unlock them.
+ */
+int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+{
+	spin_lock(&vma->vm_mm->page_table_lock);
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&mm->page_table_lock);
+			spin_unlock(&vma->vm_mm->page_table_lock);
 			wait_split_huge_page(vma->anon_vma, pmd);
+			return -1;
 		} else {
-			pmd_t entry;
-
-			entry = pmdp_get_and_clear(mm, addr, pmd);
-			entry = pmd_modify(entry, newprot);
-			set_pmd_at(mm, addr, pmd, entry);
-			spin_unlock(&vma->vm_mm->page_table_lock);
-			ret = 1;
+			/* Thp mapped by 'pmd' is stable, so we can
+			 * handle it as it is. */
+			return 1;
 		}
-	} else
-		spin_unlock(&vma->vm_mm->page_table_lock);
-
-	return ret;
+	}
+	spin_unlock(&vma->vm_mm->page_table_lock);
+	return 0;
 }
 
 pmd_t *page_check_address_pmd(struct page *page,
-- 
1.7.7.6



* [PATCH 3/6] pagemap: export KPF_THP
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 4/6] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

This flag shows that a given page is a subpage of a transparent hugepage.
It helps us debug and test the kernel by showing the physical addresses of thps.
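
For illustration (not part of this patch): once exported, the flag can be
checked from user space by reading /proc/kpageflags, which holds one 64-bit
flags word per pfn. A minimal sketch, assuming the pfn of interest is already
known (e.g. from /proc/pid/pagemap); the page-types tool updated in the next
patch does this more completely:

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define KPF_THP	22		/* bit exported by this patch */

int main(int argc, char **argv)
{
	uint64_t pfn, flags;
	int fd;

	if (argc < 2)
		return 1;
	pfn = strtoull(argv[1], NULL, 0);
	fd = open("/proc/kpageflags", O_RDONLY);
	if (fd < 0)
		return 1;
	/* one 64-bit flags word per pfn */
	if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags)) != sizeof(flags))
		return 1;
	printf("pfn 0x%" PRIx64 ": %sthp\n", pfn,
	       (flags & (1ULL << KPF_THP)) ? "" : "not ");
	close(fd);
	return 0;
}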

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Changes since v2:
  - replace if with else-if so that KPF_THP is not set for hugetlbfs pages

Changes since v1:
  - remove unnecessary ifdefs
  - fix confusing patch description
---
 fs/proc/page.c                    |    2 ++
 include/linux/kernel-page-flags.h |    1 +
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git 3.3-rc1.orig/fs/proc/page.c 3.3-rc1/fs/proc/page.c
index 6d8e6a9..7fcd0d6 100644
--- 3.3-rc1.orig/fs/proc/page.c
+++ 3.3-rc1/fs/proc/page.c
@@ -115,6 +115,8 @@ u64 stable_page_flags(struct page *page)
 		u |= 1 << KPF_COMPOUND_TAIL;
 	if (PageHuge(page))
 		u |= 1 << KPF_HUGE;
+	else if (PageTransCompound(page))
+		u |= 1 << KPF_THP;
 
 	/*
 	 * Caveats on high order pages: page->_count will only be set
diff --git 3.3-rc1.orig/include/linux/kernel-page-flags.h 3.3-rc1/include/linux/kernel-page-flags.h
index bd92a89..26a6571 100644
--- 3.3-rc1.orig/include/linux/kernel-page-flags.h
+++ 3.3-rc1/include/linux/kernel-page-flags.h
@@ -30,6 +30,7 @@
 #define KPF_NOPAGE		20
 
 #define KPF_KSM			21
+#define KPF_THP			22
 
 /* kernel hacking assistances
  * WARNING: subject to change, never rely on them!
-- 
1.7.7.6



* [PATCH 4/6] pagemap: document KPF_THP and make page-types aware of it
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2012-01-27 23:02 ` [PATCH 3/6] pagemap: export KPF_THP Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 5/6] introduce thp_ptep_get() Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 6/6] pagemap: introduce data structure for pagemap entry Naoya Horiguchi
  5 siblings, 0 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

page-types, a common user of pagemap, becomes aware of thp with this
patch. This helps system admins and kernel hackers understand how thp
works.
Here is sample output of page-types over a thp:

  $ page-types -p <pid> --raw --list

  voffset offset  len     flags
  ...
  7f9d40200       3f8400  1       ___U_lA____Ma_bH______t____________
  7f9d40201       3f8401  1ff     ________________T_____t____________

               flags      page-count       MB  symbolic-flags                     long-symbolic-flags
  0x0000000000410000             511        1  ________________T_____t____________        compound_tail,thp
  0x000000000040d868               1        0  ___U_lA____Ma_bH______t____________        uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Changes since v1:
  - fix misused word
---
 Documentation/vm/page-types.c |    2 ++
 Documentation/vm/pagemap.txt  |    4 ++++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git 3.3-rc1.orig/Documentation/vm/page-types.c 3.3-rc1/Documentation/vm/page-types.c
index 7445caa..0b13f02 100644
--- 3.3-rc1.orig/Documentation/vm/page-types.c
+++ 3.3-rc1/Documentation/vm/page-types.c
@@ -98,6 +98,7 @@
 #define KPF_HWPOISON		19
 #define KPF_NOPAGE		20
 #define KPF_KSM			21
+#define KPF_THP			22
 
 /* [32-] kernel hacking assistances */
 #define KPF_RESERVED		32
@@ -147,6 +148,7 @@ static const char *page_flag_names[] = {
 	[KPF_HWPOISON]		= "X:hwpoison",
 	[KPF_NOPAGE]		= "n:nopage",
 	[KPF_KSM]		= "x:ksm",
+	[KPF_THP]		= "t:thp",
 
 	[KPF_RESERVED]		= "r:reserved",
 	[KPF_MLOCKED]		= "m:mlocked",
diff --git 3.3-rc1.orig/Documentation/vm/pagemap.txt 3.3-rc1/Documentation/vm/pagemap.txt
index df09b96..4600cbe 100644
--- 3.3-rc1.orig/Documentation/vm/pagemap.txt
+++ 3.3-rc1/Documentation/vm/pagemap.txt
@@ -60,6 +60,7 @@ There are three components to pagemap:
     19. HWPOISON
     20. NOPAGE
     21. KSM
+    22. THP
 
 Short descriptions to the page flags:
 
@@ -97,6 +98,9 @@ Short descriptions to the page flags:
 21. KSM
     identical memory pages dynamically shared between one or more processes
 
+22. THP
+    contiguous pages which construct transparent hugepages
+
     [IO related page flags]
  1. ERROR     IO error occurred
  3. UPTODATE  page has up-to-date data
-- 
1.7.7.6



* [PATCH 5/6] introduce thp_ptep_get()
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2012-01-27 23:02 ` [PATCH 4/6] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-30  6:26   ` KAMEZAWA Hiroyuki
  2012-01-27 23:02 ` [PATCH 6/6] pagemap: introduce data structure for pagemap entry Naoya Horiguchi
  5 siblings, 1 reply; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Casting a pmd into a pte_t to handle thp is strongly architecture dependent.
This patch introduces a new function to separate this dependency from the
architecture-independent code.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 arch/x86/include/asm/pgtable.h |    5 +++++
 fs/proc/task_mmu.c             |    4 ++--
 include/asm-generic/pgtable.h  |    4 ++++
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git 3.3-rc1.orig/arch/x86/include/asm/pgtable.h 3.3-rc1/arch/x86/include/asm/pgtable.h
index 49afb3f..4cfcc7e 100644
--- 3.3-rc1.orig/arch/x86/include/asm/pgtable.h
+++ 3.3-rc1/arch/x86/include/asm/pgtable.h
@@ -165,6 +165,11 @@ static inline int has_transparent_hugepage(void)
 {
 	return cpu_has_pse;
 }
+
+static inline pte_t thp_ptep_get(pmd_t *pmd)
+{
+	return *(pte_t *)pmd;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
index 2622b64..e2063d9 100644
--- 3.3-rc1.orig/fs/proc/task_mmu.c
+++ 3.3-rc1/fs/proc/task_mmu.c
@@ -395,7 +395,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	spinlock_t *ptl;
 
 	if (pmd_trans_huge_lock(pmd, vma) == 1) {
-		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
+		smaps_pte_entry(thp_ptep_get(pmd), addr, HPAGE_PMD_SIZE, walk);
 		spin_unlock(&walk->mm->page_table_lock);
 		mss->anonymous_thp += HPAGE_PMD_SIZE;
 		return 0;
@@ -964,7 +964,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	md = walk->private;
 
 	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
-		pte_t huge_pte = *(pte_t *)pmd;
+		pte_t huge_pte = thp_ptep_get(pmd);
 		struct page *page;
 
 		page = can_gather_numa_stats(huge_pte, md->vma, addr);
diff --git 3.3-rc1.orig/include/asm-generic/pgtable.h 3.3-rc1/include/asm-generic/pgtable.h
index 76bff2b..f346bdc 100644
--- 3.3-rc1.orig/include/asm-generic/pgtable.h
+++ 3.3-rc1/include/asm-generic/pgtable.h
@@ -434,6 +434,10 @@ static inline int pmd_trans_splitting(pmd_t pmd)
 {
 	return 0;
 }
+static inline pte_t thp_ptep_get(pmd_t *pmd)
+{
+	return 0;
+}
 #ifndef __HAVE_ARCH_PMD_WRITE
 static inline int pmd_write(pmd_t pmd)
 {
-- 
1.7.7.6



* [PATCH 6/6] pagemap: introduce data structure for pagemap entry
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2012-01-27 23:02 ` [PATCH 5/6] introduce thp_ptep_get() Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-30  6:31   ` KAMEZAWA Hiroyuki
  5 siblings, 1 reply; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Currently the local variable holding a pagemap entry in pagemap_pte_range()
is named pfn and typed u64, which is not correct (a pfn should be
unsigned long).
This patch introduces a dedicated type for pagemap entries and converts
the code to use it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/proc/task_mmu.c |   66 +++++++++++++++++++++++++++------------------------
 1 files changed, 35 insertions(+), 31 deletions(-)

diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
index e2063d9..c2807a3 100644
--- 3.3-rc1.orig/fs/proc/task_mmu.c
+++ 3.3-rc1/fs/proc/task_mmu.c
@@ -586,9 +586,13 @@ const struct file_operations proc_clear_refs_operations = {
 	.llseek		= noop_llseek,
 };
 
+typedef struct {
+	u64 pme;
+} pme_t;
+
 struct pagemapread {
 	int pos, len;
-	u64 *buffer;
+	pme_t *buffer;
 };
 
 #define PAGEMAP_WALK_SIZE	(PMD_SIZE)
@@ -611,10 +615,15 @@ struct pagemapread {
 #define PM_NOT_PRESENT      PM_PSHIFT(PAGE_SHIFT)
 #define PM_END_OF_BUFFER    1
 
-static int add_to_pagemap(unsigned long addr, u64 pfn,
+static inline pme_t make_pme(u64 val)
+{
+	return (pme_t) { .pme = val };
+}
+
+static int add_to_pagemap(unsigned long addr, pme_t *pme,
 			  struct pagemapread *pm)
 {
-	pm->buffer[pm->pos++] = pfn;
+	pm->buffer[pm->pos++] = *pme;
 	if (pm->pos >= pm->len)
 		return PM_END_OF_BUFFER;
 	return 0;
@@ -626,8 +635,10 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	unsigned long addr;
 	int err = 0;
+	pme_t pme = make_pme(PM_NOT_PRESENT);
+
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
-		err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
+		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			break;
 	}
@@ -640,36 +651,31 @@ static u64 swap_pte_to_pagemap_entry(pte_t pte)
 	return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
 }
 
-static u64 pte_to_pagemap_entry(pte_t pte)
+static void pte_to_pagemap_entry(pme_t *pme, pte_t pte)
 {
-	u64 pme = 0;
 	if (is_swap_pte(pte))
-		pme = PM_PFRAME(swap_pte_to_pagemap_entry(pte))
-			| PM_PSHIFT(PAGE_SHIFT) | PM_SWAP;
+		*pme = make_pme(PM_PFRAME(swap_pte_to_pagemap_entry(pte))
+				| PM_PSHIFT(PAGE_SHIFT) | PM_SWAP);
 	else if (pte_present(pte))
-		pme = PM_PFRAME(pte_pfn(pte))
-			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
-	return pme;
+		*pme = make_pme(PM_PFRAME(pte_pfn(pte))
+				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+static void thp_pmd_to_pagemap_entry(pme_t *pme, pmd_t pmd, int offset)
 {
-	u64 pme = 0;
 	/*
 	 * Currently pmd for thp is always present because thp can not be
 	 * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
 	 * This if-check is just to prepare for future implementation.
 	 */
 	if (pmd_present(pmd))
-		pme = PM_PFRAME(pmd_pfn(pmd) + offset)
-			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
-	return pme;
+		*pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
+				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
 }
 #else
-static inline u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+static inline void thp_pmd_to_pagemap_entry(pme_t *pme, pmd_t pmd, int offset)
 {
-	return 0;
 }
 #endif
 
@@ -680,7 +686,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-	u64 pfn = PM_NOT_PRESENT;
+	pme_t pme = make_pme(PM_NOT_PRESENT);
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
@@ -689,8 +695,8 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		for (; addr != end; addr += PAGE_SIZE) {
 			unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
 				>> PAGE_SHIFT;
-			pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
-			err = add_to_pagemap(addr, pfn, pm);
+			thp_pmd_to_pagemap_entry(&pme, *pmd, offset);
+			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
 				break;
 		}
@@ -709,11 +715,11 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		if (vma && (vma->vm_start <= addr) &&
 		    !is_vm_hugetlb_page(vma)) {
 			pte = pte_offset_map(pmd, addr);
-			pfn = pte_to_pagemap_entry(*pte);
+			pte_to_pagemap_entry(&pme, *pte);
 			/* unmap before userspace copy */
 			pte_unmap(pte);
 		}
-		err = add_to_pagemap(addr, pfn, pm);
+		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			return err;
 	}
@@ -724,13 +730,11 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static u64 huge_pte_to_pagemap_entry(pte_t pte, int offset)
+static void huge_pte_to_pagemap_entry(pme_t *pme, pte_t pte, int offset)
 {
-	u64 pme = 0;
 	if (pte_present(pte))
-		pme = PM_PFRAME(pte_pfn(pte) + offset)
-			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
-	return pme;
+		*pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset)
+				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
 }
 
 /* This function walks within one hugetlb entry in the single call */
@@ -740,12 +744,12 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 {
 	struct pagemapread *pm = walk->private;
 	int err = 0;
-	u64 pfn;
+	pme_t pme = make_pme(PM_NOT_PRESENT);
 
 	for (; addr != end; addr += PAGE_SIZE) {
 		int offset = (addr & ~hmask) >> PAGE_SHIFT;
-		pfn = huge_pte_to_pagemap_entry(*pte, offset);
-		err = add_to_pagemap(addr, pfn, pm);
+		huge_pte_to_pagemap_entry(&pme, *pte, offset);
+		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			return err;
 	}
-- 
1.7.7.6



* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
@ 2012-01-28 11:23   ` Hillf Danton
  2012-01-28 22:33     ` Naoya Horiguchi
  2012-01-30  6:22   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 30+ messages in thread
From: Hillf Danton @ 2012-01-28 11:23 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, LKML,
	Hillf Danton

Hi Naoya

On Sat, Jan 28, 2012 at 7:02 AM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> Currently when we check if we can handle thp as it is or we need to
> split it into regular sized pages, we hold page table lock prior to
> check whether a given pmd is mapping thp or not. Because of this,
> when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> To remove it, this patch introduces a optimized check function and
> replace several similar logics with it.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: David Rientjes <rientjes@google.com>
>
> Changes since v3:
>  - Fix likely/unlikely pattern in pmd_trans_huge_stable()
>  - Change suffix from _stable to _lock
>  - Introduce __pmd_trans_huge_lock() to avoid micro-regression
>  - Return 1 when wait_split_huge_page path is taken
>
> Changes since v2:
>  - Fix missing "return 0" in "thp under splitting" path
>  - Remove unneeded comment
>  - Change the name of check function to describe what it does
>  - Add VM_BUG_ON(mmap_sem)
> ---
>  fs/proc/task_mmu.c      |   70 +++++++++------------------
>  include/linux/huge_mm.h |   17 +++++++
>  mm/huge_memory.c        |  120 ++++++++++++++++++++++-------------------------
>  3 files changed, 96 insertions(+), 111 deletions(-)
>
[...]

> @@ -1064,21 +1056,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  {
>        int ret = 0;
>
> -       spin_lock(&vma->vm_mm->page_table_lock);
> -       if (likely(pmd_trans_huge(*pmd))) {
> -               ret = !pmd_trans_splitting(*pmd);

Here the value of ret is either false or true,

> -               spin_unlock(&vma->vm_mm->page_table_lock);
> -               if (unlikely(!ret))
> -                       wait_split_huge_page(vma->anon_vma, pmd);
> -               else {
> -                       /*
> -                        * All logical pages in the range are present
> -                        * if backed by a huge page.
> -                        */
> -                       memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> -               }
> -       } else
> +       if (__pmd_trans_huge_lock(pmd, vma) == 1) {
> +               /*
> +                * All logical pages in the range are present
> +                * if backed by a huge page.
> +                */
>                spin_unlock(&vma->vm_mm->page_table_lock);
> +               memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> +       }
>
>        return ret;

what is the returned value of this function? /Hillf

>  }
> @@ -1108,20 +1093,10 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>                goto out;
>        }


* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-28 11:23   ` Hillf Danton
@ 2012-01-28 22:33     ` Naoya Horiguchi
  0 siblings, 0 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-28 22:33 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	LKML

Hi Hillf,

On Sat, Jan 28, 2012 at 07:23:47PM +0800, Hillf Danton wrote:
> Hi Naoya
> 
> On Sat, Jan 28, 2012 at 7:02 AM, Naoya Horiguchi
> <n-horiguchi@ah.jp.nec.com> wrote:
> > Currently when we check if we can handle thp as it is or we need to
> > split it into regular sized pages, we hold page table lock prior to
> > check whether a given pmd is mapping thp or not. Because of this,
> > when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> > To remove it, this patch introduces a optimized check function and
> > replace several similar logics with it.
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: David Rientjes <rientjes@google.com>
> >
> > Changes since v3:
> > - Fix likely/unlikely pattern in pmd_trans_huge_stable()
> > - Change suffix from _stable to _lock
> > - Introduce __pmd_trans_huge_lock() to avoid micro-regression
> > - Return 1 when wait_split_huge_page path is taken
> >
> > Changes since v2:
> > - Fix missing "return 0" in "thp under splitting" path
> > - Remove unneeded comment
> > - Change the name of check function to describe what it does
> > - Add VM_BUG_ON(mmap_sem)
> > ---
> > fs/proc/task_mmu.c   |  70 +++++++++------------------
> > include/linux/huge_mm.h |  17 +++++++
> > mm/huge_memory.c    | 120 ++++++++++++++++++++++-------------------------
> > 3 files changed, 96 insertions(+), 111 deletions(-)
> >
> [...]
> 
> > @@ -1064,21 +1056,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > {
> >    int ret = 0;
> >
> > -    spin_lock(&vma->vm_mm->page_table_lock);
> > -    if (likely(pmd_trans_huge(*pmd))) {
> > -        ret = !pmd_trans_splitting(*pmd);
> 
> Here the value of ret is either false or true,

You're right.

> > -        spin_unlock(&vma->vm_mm->page_table_lock);
> > -        if (unlikely(!ret))
> > -            wait_split_huge_page(vma->anon_vma, pmd);
> > -        else {
> > -            /*
> > -            * All logical pages in the range are present
> > -            * if backed by a huge page.
> > -            */
> > -            memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> > -        }
> > -    } else
> > +    if (__pmd_trans_huge_lock(pmd, vma) == 1) {
> > +        /*
> > +        * All logical pages in the range are present
> > +        * if backed by a huge page.
> > +        */
> >        spin_unlock(&vma->vm_mm->page_table_lock);
> > +        memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> > +    }
> >
> >    return ret;
> 
> what is the returned value of this function? /Hillf

In this patch, mincore_huge_pmd() always returns 0, which is obviously wrong.
We need to set ret to 1 in the if-block.
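
I.e., presumably something like:

	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
		/*
		 * All logical pages in the range are present
		 * if backed by a huge page.
		 */
		spin_unlock(&vma->vm_mm->page_table_lock);
		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
		ret = 1;
	}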

Thanks,
Naoya


* Re: [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
@ 2012-01-29 13:17   ` Hillf Danton
  2012-01-30 19:23     ` Naoya Horiguchi
  0 siblings, 1 reply; 30+ messages in thread
From: Hillf Danton @ 2012-01-29 13:17 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, LKML,
	Hillf Danton

Hi Naoya

On Sat, Jan 28, 2012 at 7:02 AM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> Thp split is not necessary if we explicitly check whether pmds are
> mapping thps or not. This patch introduces this check and adds code
> to generate pagemap entries for pmds mapping thps, which results in
> less performance impact of pagemap on thp.
>

Could the method proposed here cover the two cases of split THP in mem cgroup?

Thanks
Hillf


* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
  2012-01-28 11:23   ` Hillf Danton
@ 2012-01-30  6:22   ` KAMEZAWA Hiroyuki
  2012-02-02  5:27     ` Naoya Horiguchi
  1 sibling, 1 reply; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-30  6:22 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Fri, 27 Jan 2012 18:02:49 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Currently when we check if we can handle thp as it is or we need to
> split it into regular sized pages, we hold page table lock prior to
> check whether a given pmd is mapping thp or not. Because of this,
> when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> To remove it, this patch introduces a optimized check function and
> replace several similar logics with it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: David Rientjes <rientjes@google.com>
> 
> Changes since v3:
>   - Fix likely/unlikely pattern in pmd_trans_huge_stable()
>   - Change suffix from _stable to _lock
>   - Introduce __pmd_trans_huge_lock() to avoid micro-regression
>   - Return 1 when wait_split_huge_page path is taken
> 
> Changes since v2:
>   - Fix missing "return 0" in "thp under splitting" path
>   - Remove unneeded comment
>   - Change the name of check function to describe what it does
>   - Add VM_BUG_ON(mmap_sem)



> +/*
> + * Returns 1 if a given pmd maps a stable (not under splitting) thp,
> + * -1 if the pmd maps thp under splitting, 0 if the pmd does not map thp.
> + *
> + * Note that if it returns 1, this routine returns without unlocking page
> + * table locks. So callers must unlock them.
> + */


Seems like a nice cleanup, but... why do you need to return (-1, 0, 1)?

It seems the caller can't see the difference between -1 and 0.

Why not just return 0 (not locked) or 1 (thp found and locked)?

Thanks,
-Kame

> +int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> +{
> +	spin_lock(&vma->vm_mm->page_table_lock);
>  	if (likely(pmd_trans_huge(*pmd))) {
>  		if (unlikely(pmd_trans_splitting(*pmd))) {
> -			spin_unlock(&mm->page_table_lock);
> +			spin_unlock(&vma->vm_mm->page_table_lock);
>  			wait_split_huge_page(vma->anon_vma, pmd);
> +			return -1;
>  		} else {
> -			pmd_t entry;
> -
> -			entry = pmdp_get_and_clear(mm, addr, pmd);
> -			entry = pmd_modify(entry, newprot);
> -			set_pmd_at(mm, addr, pmd, entry);
> -			spin_unlock(&vma->vm_mm->page_table_lock);
> -			ret = 1;
> +			/* Thp mapped by 'pmd' is stable, so we can
> +			 * handle it as it is. */
> +			return 1;
>  		}
> -	} else
> -		spin_unlock(&vma->vm_mm->page_table_lock);
> -
> -	return ret;
> +	}
> +	spin_unlock(&vma->vm_mm->page_table_lock);
> +	return 0;
>  }
>  
>  pmd_t *page_check_address_pmd(struct page *page,
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 



* Re: [PATCH 5/6] introduce thp_ptep_get()
  2012-01-27 23:02 ` [PATCH 5/6] introduce thp_ptep_get() Naoya Horiguchi
@ 2012-01-30  6:26   ` KAMEZAWA Hiroyuki
  2012-01-30 19:24     ` Naoya Horiguchi
  0 siblings, 1 reply; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-30  6:26 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Fri, 27 Jan 2012 18:02:52 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Casting pmd into pte_t to handle thp is strongly architecture dependent.
> This patch introduces a new function to separate this dependency from
> independent part.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  arch/x86/include/asm/pgtable.h |    5 +++++
>  fs/proc/task_mmu.c             |    4 ++--
>  include/asm-generic/pgtable.h  |    4 ++++
>  3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git 3.3-rc1.orig/arch/x86/include/asm/pgtable.h 3.3-rc1/arch/x86/include/asm/pgtable.h
> index 49afb3f..4cfcc7e 100644
> --- 3.3-rc1.orig/arch/x86/include/asm/pgtable.h
> +++ 3.3-rc1/arch/x86/include/asm/pgtable.h
> @@ -165,6 +165,11 @@ static inline int has_transparent_hugepage(void)
>  {
>  	return cpu_has_pse;
>  }
> +
> +static inline pte_t thp_ptep_get(pmd_t *pmd)
> +{
> +	return *(pte_t *)pmd;
> +}

I'm sorry, but I don't think the name is good.
(But I know my sense of naming is bad ;)

How about pmd_to_pte_t?

And I wonder whether you could add an error check such as VM_BUG_ON(!pmd_trans_huge(*pmd));


Thanks,
-Kame




>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
> diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
> index 2622b64..e2063d9 100644
> --- 3.3-rc1.orig/fs/proc/task_mmu.c
> +++ 3.3-rc1/fs/proc/task_mmu.c
> @@ -395,7 +395,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	spinlock_t *ptl;
>  
>  	if (pmd_trans_huge_lock(pmd, vma) == 1) {
> -		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
> +		smaps_pte_entry(thp_ptep_get(pmd), addr, HPAGE_PMD_SIZE, walk);
>  		spin_unlock(&walk->mm->page_table_lock);
>  		mss->anonymous_thp += HPAGE_PMD_SIZE;
>  		return 0;
> @@ -964,7 +964,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
>  	md = walk->private;
>  
>  	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
> -		pte_t huge_pte = *(pte_t *)pmd;
> +		pte_t huge_pte = thp_ptep_get(pmd);
>  		struct page *page;
>  
>  		page = can_gather_numa_stats(huge_pte, md->vma, addr);
> diff --git 3.3-rc1.orig/include/asm-generic/pgtable.h 3.3-rc1/include/asm-generic/pgtable.h
> index 76bff2b..f346bdc 100644
> --- 3.3-rc1.orig/include/asm-generic/pgtable.h
> +++ 3.3-rc1/include/asm-generic/pgtable.h
> @@ -434,6 +434,10 @@ static inline int pmd_trans_splitting(pmd_t pmd)
>  {
>  	return 0;
>  }
> +static inline pte_t thp_ptep_get(pmd_t *pmd)
> +{
> +	return 0;
> +}
>  #ifndef __HAVE_ARCH_PMD_WRITE
>  static inline int pmd_write(pmd_t pmd)
>  {
> -- 
> 1.7.7.6
> 
> 



* Re: [PATCH 6/6] pagemap: introduce data structure for pagemap entry
  2012-01-27 23:02 ` [PATCH 6/6] pagemap: introduce data structure for pagemap entry Naoya Horiguchi
@ 2012-01-30  6:31   ` KAMEZAWA Hiroyuki
  2012-01-30 19:27     ` Naoya Horiguchi
  0 siblings, 1 reply; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-30  6:31 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Fri, 27 Jan 2012 18:02:53 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Currently a local variable of pagemap entry in pagemap_pte_range()
> is named pfn and typed with u64, but it's not correct (pfn should
> be unsigned long.)
> This patch introduces special type for pagemap entry and replace
> code with it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/proc/task_mmu.c |   66 +++++++++++++++++++++++++++------------------------
>  1 files changed, 35 insertions(+), 31 deletions(-)
> 
> diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
> index e2063d9..c2807a3 100644
> --- 3.3-rc1.orig/fs/proc/task_mmu.c
> +++ 3.3-rc1/fs/proc/task_mmu.c
> @@ -586,9 +586,13 @@ const struct file_operations proc_clear_refs_operations = {
>  	.llseek		= noop_llseek,
>  };
>  
> +typedef struct {
> +	u64 pme;
> +} pme_t;
> +

A nitpick:

How about pagemap_entry_t rather than pme_t?

At first glance, I wondered whether this was a new kind of page table entry type or not.

Thanks,
-Kame



* Re: [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-29 13:17   ` Hillf Danton
@ 2012-01-30 19:23     ` Naoya Horiguchi
  0 siblings, 0 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-30 19:23 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	LKML

On Sun, Jan 29, 2012 at 09:17:32PM +0800, Hillf Danton wrote:
> Hi Naoya
> 
> On Sat, Jan 28, 2012 at 7:02 AM, Naoya Horiguchi
> <n-horiguchi@ah.jp.nec.com> wrote:
> > Thp split is not necessary if we explicitly check whether pmds are
> > mapping thps or not. This patch introduces this check and adds code
> > to generate pagemap entries for pmds mapping thps, which results in
> > less performance impact of pagemap on thp.
> >
> 
> Could the method proposed here cover the two cases of split THP in mem cgroup?

Not for now, but yes once the "move charge" function supports THP.
I think that is a rather large step, so it will be the next piece of work.

Thanks,
Naoya


* Re: [PATCH 5/6] introduce thp_ptep_get()
  2012-01-30  6:26   ` KAMEZAWA Hiroyuki
@ 2012-01-30 19:24     ` Naoya Horiguchi
  0 siblings, 0 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-30 19:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	linux-kernel

On Mon, Jan 30, 2012 at 03:26:46PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jan 2012 18:02:52 -0500
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > Casting pmd into pte_t to handle thp is strongly architecture dependent.
> > This patch introduces a new function to separate this dependency from
> > independent part.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  arch/x86/include/asm/pgtable.h |    5 +++++
> >  fs/proc/task_mmu.c             |    4 ++--
> >  include/asm-generic/pgtable.h  |    4 ++++
> >  3 files changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git 3.3-rc1.orig/arch/x86/include/asm/pgtable.h 3.3-rc1/arch/x86/include/asm/pgtable.h
> > index 49afb3f..4cfcc7e 100644
> > --- 3.3-rc1.orig/arch/x86/include/asm/pgtable.h
> > +++ 3.3-rc1/arch/x86/include/asm/pgtable.h
> > @@ -165,6 +165,11 @@ static inline int has_transparent_hugepage(void)
> >  {
> >  	return cpu_has_pse;
> >  }
> > +
> > +static inline pte_t thp_ptep_get(pmd_t *pmd)
> > +{
> > +	return *(pte_t *)pmd;
> > +}
> 
> I'm sorry but I don't think the name is good. 
> (But I know my sense of naming is bad ;)

I named it after huge_ptep_get(), defined for hugetlbfs.
But on second thought, it's a bad name.

> 
> How about pmd_to_pte_t ?

OK, it looks straightforward.
I'll take it unless someone has a better idea.

> And, I wonder you can add error check as VM_BUG_ON(!pmd_trans_huge(*pmd));

Good, I'll add it.
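
So, presumably, the x86 helper would end up looking something like this
(under whatever name we settle on):

	static inline pte_t thp_ptep_get(pmd_t *pmd)
	{
		VM_BUG_ON(!pmd_trans_huge(*pmd));
		return *(pte_t *)pmd;
	}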

Thank you, KAMEZAWA-san.
Naoya


* Re: [PATCH 6/6] pagemap: introduce data structure for pagemap entry
  2012-01-30  6:31   ` KAMEZAWA Hiroyuki
@ 2012-01-30 19:27     ` Naoya Horiguchi
  0 siblings, 0 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-30 19:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	linux-kernel

On Mon, Jan 30, 2012 at 03:31:11PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jan 2012 18:02:53 -0500
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > Currently a local variable of pagemap entry in pagemap_pte_range()
> > is named pfn and typed with u64, but it's not correct (pfn should
> > be unsigned long.)
> > This patch introduces special type for pagemap entry and replace
> > code with it.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >  fs/proc/task_mmu.c |   66 +++++++++++++++++++++++++++------------------------
> >  1 files changed, 35 insertions(+), 31 deletions(-)
> > 
> > diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
> > index e2063d9..c2807a3 100644
> > --- 3.3-rc1.orig/fs/proc/task_mmu.c
> > +++ 3.3-rc1/fs/proc/task_mmu.c
> > @@ -586,9 +586,13 @@ const struct file_operations proc_clear_refs_operations = {
> >  	.llseek		= noop_llseek,
> >  };
> >  
> > +typedef struct {
> > +	u64 pme;
> > +} pme_t;
> > +
> 
> A nitpick..
> 
> How about pagemap_entry_t rather than pme_t ?

OK, I'll use this type name.

> At 1st look, I wondered whether this is a new kind of page table entry type or not ..

We had better avoid this type of confusion if possible.

Thanks,
Naoya


* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-30  6:22   ` KAMEZAWA Hiroyuki
@ 2012-02-02  5:27     ` Naoya Horiguchi
  2012-02-02  8:32       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 30+ messages in thread
From: Naoya Horiguchi @ 2012-02-02  5:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	linux-kernel

On Mon, Jan 30, 2012 at 03:22:12PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jan 2012 18:02:49 -0500
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> > Currently when we check if we can handle thp as it is or we need to
> > split it into regular sized pages, we hold page table lock prior to
> > check whether a given pmd is mapping thp or not. Because of this,
> > when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> > To remove it, this patch introduces a optimized check function and
> > replace several similar logics with it.
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: David Rientjes <rientjes@google.com>
> >
> > Changes since v3:
> >   - Fix likely/unlikely pattern in pmd_trans_huge_stable()
> >   - Change suffix from _stable to _lock
> >   - Introduce __pmd_trans_huge_lock() to avoid micro-regression
> >   - Return 1 when wait_split_huge_page path is taken
> >
> > Changes since v2:
> >   - Fix missing "return 0" in "thp under splitting" path
> >   - Remove unneeded comment
> >   - Change the name of check function to describe what it does
> >   - Add VM_BUG_ON(mmap_sem)
>
>
> > +/*
> > + * Returns 1 if a given pmd maps a stable (not under splitting) thp,
> > + * -1 if the pmd maps thp under splitting, 0 if the pmd does not map thp.
> > + *
> > + * Note that if it returns 1, this routine returns without unlocking page
> > + * table locks. So callers must unlock them.
> > + */
>
>
> Seems nice clean up but... why you need to return (-1, 0, 1) ?
>
> It seems the caller can't see the difference between -1 and 0.
>
> Why not just return 0 (not locked) or 1 (thp found and locked) ?

Sorry, I made a wrong change from v3.
We can do fine without the -1 return value if we remove the else-if (!err)
{...} block after the move_huge_pmd() call in move_page_tables(), right?
(split_huge_page_pmd() after wait_split_huge_page() does nothing...)


* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-02-02  5:27     ` Naoya Horiguchi
@ 2012-02-02  8:32       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-02  8:32 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Thu,  2 Feb 2012 00:27:58 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> On Mon, Jan 30, 2012 at 03:22:12PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Fri, 27 Jan 2012 18:02:49 -0500
> > Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> >
> > > Currently when we check if we can handle thp as it is or we need to
> > > split it into regular sized pages, we hold page table lock prior to
> > > check whether a given pmd is mapping thp or not. Because of this,
> > > when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> > > To remove it, this patch introduces a optimized check function and
> > > replace several similar logics with it.
> > >
> > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > Cc: David Rientjes <rientjes@google.com>
> > >
> > > Changes since v3:
> > >   - Fix likely/unlikely pattern in pmd_trans_huge_stable()
> > >   - Change suffix from _stable to _lock
> > >   - Introduce __pmd_trans_huge_lock() to avoid micro-regression
> > >   - Return 1 when wait_split_huge_page path is taken
> > >
> > > Changes since v2:
> > >   - Fix missing "return 0" in "thp under splitting" path
> > >   - Remove unneeded comment
> > >   - Change the name of check function to describe what it does
> > >   - Add VM_BUG_ON(mmap_sem)
> >
> >
> > > +/*
> > > + * Returns 1 if a given pmd maps a stable (not under splitting) thp,
> > > + * -1 if the pmd maps thp under splitting, 0 if the pmd does not map thp.
> > > + *
> > > + * Note that if it returns 1, this routine returns without unlocking page
> > > + * table locks. So callers must unlock them.
> > > + */
> >
> >
> > Seems like a nice cleanup, but... why do you need to return (-1, 0, 1)?
> >
> > It seems the caller can't see the difference between -1 and 0.
> >
> > Why not just return 0 (not locked) or 1 (thp found and locked)?
> 
> Sorry, I made a wrong change from v3.
> We can do fine without the -1 return value if we remove the else-if (!err)
> {...} block after the move_huge_pmd() call in move_page_tables(), right?
> (split_huge_page_pmd() after wait_split_huge_page() does nothing...)
> 

Hm ?

               if (pmd_trans_huge(*old_pmd)) {
                        int err = 0;
                        if (extent == HPAGE_PMD_SIZE)
                                err = move_huge_pmd(vma, new_vma, old_addr,
                                                    new_addr, old_end,
                                                    old_pmd, new_pmd);
                        if (err > 0) {
                                need_flush = true;
                                continue;
                        } else if (!err) {
                                split_huge_page_pmd(vma->vm_mm, old_pmd);
                        }
                        VM_BUG_ON(pmd_trans_huge(*old_pmd));
                }

I think you're right. The BUG_ON() in wait_split_huge_page()

 #define wait_split_huge_page(__anon_vma, __pmd)                         \
        do {                                                            \
                pmd_t *____pmd = (__pmd);                               \
                anon_vma_lock(__anon_vma);                              \
                anon_vma_unlock(__anon_vma);                            \
                BUG_ON(pmd_trans_splitting(*____pmd) ||                 \
                       pmd_trans_huge(*____pmd));                       \
        } while (0)

says the pmd is always split.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-02-20 11:38       ` Hugh Dickins
@ 2012-02-20 11:54         ` Jiri Slaby
  0 siblings, 0 replies; 30+ messages in thread
From: Jiri Slaby @ 2012-02-20 11:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

On 02/20/2012 12:38 PM, Hugh Dickins wrote:
> That fixes the case I hit, thank you.  Though I did have to apply
> the task_mmu.c part by hand, there are differences on neighbouring
> lines.

The same here.

> Jiri, your "Regression: Bad page map in process xyz" is actually
> on linux-next, isn't it?  I wonder if this patch will fix yours too
> (you were using zypper, I was updating with yast2).

Yes, it does. Thanks.

-- 
js
suse labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-02-20  7:28     ` Naoya Horiguchi
@ 2012-02-20 11:38       ` Hugh Dickins
  2012-02-20 11:54         ` Jiri Slaby
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2012-02-20 11:38 UTC (permalink / raw)
  To: Naoya Horiguchi, Jiri Slaby
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

On Mon, 20 Feb 2012, Naoya Horiguchi wrote:
> On Sun, Feb 19, 2012 at 01:21:02PM -0800, Hugh Dickins wrote:
> > On Wed, 8 Feb 2012, Naoya Horiguchi wrote:
> > > Currently when we check if we can handle thp as it is or we need to
> > > split it into regular sized pages, we hold page table lock prior to
> > > checking whether a given pmd is mapping thp or not. Because of this,
> > > when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> > > To remove it, this patch introduces an optimized check function and
> > > replaces several similar pieces of logic with it.
> > > 
> > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > Cc: David Rientjes <rientjes@google.com>
> > > 
> > > Changes since v4:
> > >   - Rethink returned value of __pmd_trans_huge_lock()
> > 
> > [snip]
> > 
> > > --- 3.3-rc2.orig/mm/mremap.c
> > > +++ 3.3-rc2/mm/mremap.c
> > > @@ -155,8 +155,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > >  			if (err > 0) {
> > >  				need_flush = true;
> > >  				continue;
> > > -			} else if (!err) {
> > > -				split_huge_page_pmd(vma->vm_mm, old_pmd);
> > >  			}
> > >  			VM_BUG_ON(pmd_trans_huge(*old_pmd));
> > >  		}
> 
> Thanks for reporting, 
> 
> > Is that what you intended to do there?
> 
> No. This is a bug.
> 
> > I just hit that VM_BUG_ON on rc3-next-20120217.
> 
> I found that when extent != HPAGE_PMD_SIZE, move_huge_pmd() is never
> called, so the thp is not split and we hit the VM_BUG_ON.
> The following patch cancels the v4->v5 change in the return value,
> and I confirmed it fixes the problem in my simple test.
> Andrew, could you add it on top of this optimization patch?
> 
> Naoya
> ----------------------------------------------------
> From 3c49816cab7d8cb072d9dffb97242e40f5124230 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date: Mon, 20 Feb 2012 01:48:12 -0500
> Subject: [PATCH] fix mremap bug of failing to split thp
> 
> The patch "thp: optimize away unnecessary page table locking" introduced
> a bug to move_page_tables(), where we fail to split thp when move_huge_pmd()
> is not called. To fix it, this patch reverts the return value changes and
> readd if (!err) block.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

That fixes the case I hit, thank you.  Though I did have to apply
the task_mmu.c part by hand, there are differences on neighbouring
lines.

Jiri, your "Regression: Bad page map in process xyz" is actually
on linux-next, isn't it?  I wonder if this patch will fix yours too
(you were using zypper, I was updating with yast2).

Hugh

> ---
>  fs/proc/task_mmu.c |    6 +++---
>  mm/huge_memory.c   |   13 ++++++-------
>  mm/mremap.c        |    2 ++
>  3 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 7810281..2d12325 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -394,7 +394,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> -	if (pmd_trans_huge_lock(pmd, vma)) {
> +	if (pmd_trans_huge_lock(pmd, vma) == 1) {
>  		smaps_pte_entry(pmd_to_pte_t(pmd), addr, HPAGE_PMD_SIZE, walk);
>  		spin_unlock(&walk->mm->page_table_lock);
>  		mss->anonymous_thp += HPAGE_PMD_SIZE;
> @@ -696,7 +696,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	/* find the first VMA at or above 'addr' */
>  	vma = find_vma(walk->mm, addr);
>  
> -	if (pmd_trans_huge_lock(pmd, vma)) {
> +	if (pmd_trans_huge_lock(pmd, vma) == 1) {
>  		for (; addr != end; addr += PAGE_SIZE) {
>  			unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
>  				>> PAGE_SHIFT;
> @@ -973,7 +973,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
>  
>  	md = walk->private;
>  
> -	if (pmd_trans_huge_lock(pmd, md->vma)) {
> +	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
>  		pte_t huge_pte = pmd_to_pte_t(pmd);
>  		struct page *page;
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bbf57b5..f342bb2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1030,7 +1030,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  {
>  	int ret = 0;
>  
> -	if (__pmd_trans_huge_lock(pmd, vma)) {
> +	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
>  		struct page *page;
>  		pgtable_t pgtable;
>  		pgtable = get_pmd_huge_pte(tlb->mm);
> @@ -1056,7 +1056,7 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  {
>  	int ret = 0;
>  
> -	if (__pmd_trans_huge_lock(pmd, vma)) {
> +	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
>  		/*
>  		 * All logical pages in the range are present
>  		 * if backed by a huge page.
> @@ -1094,12 +1094,11 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>  		goto out;
>  	}
>  
> -	if (__pmd_trans_huge_lock(old_pmd, vma)) {
> +	if ((ret = __pmd_trans_huge_lock(old_pmd, vma)) == 1) {
>  		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
>  		VM_BUG_ON(!pmd_none(*new_pmd));
>  		set_pmd_at(mm, new_addr, new_pmd, pmd);
>  		spin_unlock(&mm->page_table_lock);
> -		ret = 1;
>  	}
>  out:
>  	return ret;
> @@ -1111,7 +1110,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	struct mm_struct *mm = vma->vm_mm;
>  	int ret = 0;
>  
> -	if (__pmd_trans_huge_lock(pmd, vma)) {
> +	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
>  		pmd_t entry;
>  		entry = pmdp_get_and_clear(mm, addr, pmd);
>  		entry = pmd_modify(entry, newprot);
> @@ -1125,7 +1124,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  /*
>   * Returns 1 if a given pmd maps a stable (not under splitting) thp.
> - * Returns 0 otherwise.
> + * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
>   *
>   * Note that if it returns 1, this routine returns without unlocking page
>   * table locks. So callers must unlock them.
> @@ -1137,7 +1136,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
>  		if (unlikely(pmd_trans_splitting(*pmd))) {
>  			spin_unlock(&vma->vm_mm->page_table_lock);
>  			wait_split_huge_page(vma->anon_vma, pmd);
> -			return 0;
> +			return -1;
>  		} else {
>  			/* Thp mapped by 'pmd' is stable, so we can
>  			 * handle it as it is. */
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 22458b9..87bb839 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -155,6 +155,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  			if (err > 0) {
>  				need_flush = true;
>  				continue;
> +			} else if (!err) {
> +				split_huge_page_pmd(vma->vm_mm, old_pmd);
>  			}
>  			VM_BUG_ON(pmd_trans_huge(*old_pmd));
>  		}
> -- 
> 1.7.7.6

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-02-19 21:21   ` Hugh Dickins
@ 2012-02-20  7:28     ` Naoya Horiguchi
  2012-02-20 11:38       ` Hugh Dickins
  0 siblings, 1 reply; 30+ messages in thread
From: Naoya Horiguchi @ 2012-02-20  7:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

On Sun, Feb 19, 2012 at 01:21:02PM -0800, Hugh Dickins wrote:
> On Wed, 8 Feb 2012, Naoya Horiguchi wrote:
> > Currently when we check if we can handle thp as it is or we need to
> > split it into regular sized pages, we hold page table lock prior to
> > checking whether a given pmd is mapping thp or not. Because of this,
> > when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> > To remove it, this patch introduces an optimized check function and
> > replaces several similar pieces of logic with it.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: David Rientjes <rientjes@google.com>
> > 
> > Changes since v4:
> >   - Rethink returned value of __pmd_trans_huge_lock()
> 
> [snip]
> 
> > --- 3.3-rc2.orig/mm/mremap.c
> > +++ 3.3-rc2/mm/mremap.c
> > @@ -155,8 +155,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  			if (err > 0) {
> >  				need_flush = true;
> >  				continue;
> > -			} else if (!err) {
> > -				split_huge_page_pmd(vma->vm_mm, old_pmd);
> >  			}
> >  			VM_BUG_ON(pmd_trans_huge(*old_pmd));
> >  		}

Thanks for reporting, 

> Is that what you intended to do there?

No. This is a bug.

> I just hit that VM_BUG_ON on rc3-next-20120217.

I found that when extent != HPAGE_PMD_SIZE, move_huge_pmd() is never
called, so the thp is not split and we hit the VM_BUG_ON.
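
Roughly, the failing path is (a simplified sketch of the move_page_tables()
hunk with the v5 change applied, not the exact tree):

        /* inside the per-pmd loop of move_page_tables() */
        if (pmd_trans_huge(*old_pmd)) {
                int err = 0;
                if (extent == HPAGE_PMD_SIZE)   /* may be smaller near range boundaries */
                        err = move_huge_pmd(vma, new_vma, old_addr,
                                            new_addr, old_end,
                                            old_pmd, new_pmd);
                if (err > 0) {
                        need_flush = true;
                        continue;
                }
                /* err == 0 here, but the split_huge_page_pmd() fallback is
                 * gone, so the pmd can still map a thp ... */
                VM_BUG_ON(pmd_trans_huge(*old_pmd));    /* ... and this fires */
        }
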
The following patch cancels the v4->v5 change in the return value,
and I confirmed it fixes the problem in my simple test.
Andrew, could you add it on top of this optimization patch?

Naoya
----------------------------------------------------
From 3c49816cab7d8cb072d9dffb97242e40f5124230 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Mon, 20 Feb 2012 01:48:12 -0500
Subject: [PATCH] fix mremap bug of failing to split thp

The patch "thp: optimize away unnecessary page table locking" introduced
a bug to move_page_tables(), where we fail to split thp when move_huge_pmd()
is not called. To fix it, this patch reverts the return value changes and
readd if (!err) block.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 fs/proc/task_mmu.c |    6 +++---
 mm/huge_memory.c   |   13 ++++++-------
 mm/mremap.c        |    2 ++
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7810281..2d12325 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -394,7 +394,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma)) {
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
 		smaps_pte_entry(pmd_to_pte_t(pmd), addr, HPAGE_PMD_SIZE, walk);
 		spin_unlock(&walk->mm->page_table_lock);
 		mss->anonymous_thp += HPAGE_PMD_SIZE;
@@ -696,7 +696,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
 
-	if (pmd_trans_huge_lock(pmd, vma)) {
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
 		for (; addr != end; addr += PAGE_SIZE) {
 			unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
 				>> PAGE_SHIFT;
@@ -973,7 +973,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 
 	md = walk->private;
 
-	if (pmd_trans_huge_lock(pmd, md->vma)) {
+	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
 		pte_t huge_pte = pmd_to_pte_t(pmd);
 		struct page *page;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbf57b5..f342bb2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1030,7 +1030,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma)) {
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		struct page *page;
 		pgtable_t pgtable;
 		pgtable = get_pmd_huge_pte(tlb->mm);
@@ -1056,7 +1056,7 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma)) {
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		/*
 		 * All logical pages in the range are present
 		 * if backed by a huge page.
@@ -1094,12 +1094,11 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		goto out;
 	}
 
-	if (__pmd_trans_huge_lock(old_pmd, vma)) {
+	if ((ret = __pmd_trans_huge_lock(old_pmd, vma)) == 1) {
 		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
 		VM_BUG_ON(!pmd_none(*new_pmd));
 		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		spin_unlock(&mm->page_table_lock);
-		ret = 1;
 	}
 out:
 	return ret;
@@ -1111,7 +1110,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma)) {
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		pmd_t entry;
 		entry = pmdp_get_and_clear(mm, addr, pmd);
 		entry = pmd_modify(entry, newprot);
@@ -1125,7 +1124,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 /*
  * Returns 1 if a given pmd maps a stable (not under splitting) thp.
- * Returns 0 otherwise.
+ * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
  *
  * Note that if it returns 1, this routine returns without unlocking page
  * table locks. So callers must unlock them.
@@ -1137,7 +1136,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 		if (unlikely(pmd_trans_splitting(*pmd))) {
 			spin_unlock(&vma->vm_mm->page_table_lock);
 			wait_split_huge_page(vma->anon_vma, pmd);
-			return 0;
+			return -1;
 		} else {
 			/* Thp mapped by 'pmd' is stable, so we can
 			 * handle it as it is. */
diff --git a/mm/mremap.c b/mm/mremap.c
index 22458b9..87bb839 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -155,6 +155,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			if (err > 0) {
 				need_flush = true;
 				continue;
+			} else if (!err) {
+				split_huge_page_pmd(vma->vm_mm, old_pmd);
 			}
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-02-08 15:51 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
  2012-02-09  2:19   ` KAMEZAWA Hiroyuki
@ 2012-02-19 21:21   ` Hugh Dickins
  2012-02-20  7:28     ` Naoya Horiguchi
  1 sibling, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2012-02-19 21:21 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

On Wed, 8 Feb 2012, Naoya Horiguchi wrote:
> Currently when we check if we can handle thp as it is or we need to
> split it into regular sized pages, we hold page table lock prior to
> checking whether a given pmd is mapping thp or not. Because of this,
> when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> To remove it, this patch introduces an optimized check function and
> replaces several similar pieces of logic with it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: David Rientjes <rientjes@google.com>
> 
> Changes since v4:
>   - Rethink returned value of __pmd_trans_huge_lock()

[snip]

> --- 3.3-rc2.orig/mm/mremap.c
> +++ 3.3-rc2/mm/mremap.c
> @@ -155,8 +155,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  			if (err > 0) {
>  				need_flush = true;
>  				continue;
> -			} else if (!err) {
> -				split_huge_page_pmd(vma->vm_mm, old_pmd);
>  			}
>  			VM_BUG_ON(pmd_trans_huge(*old_pmd));
>  		}

Is that what you intended to do there?
I just hit that VM_BUG_ON on rc3-next-20120217.

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-02-08 15:51 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
@ 2012-02-09  2:19   ` KAMEZAWA Hiroyuki
  2012-02-19 21:21   ` Hugh Dickins
  1 sibling, 0 replies; 30+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-09  2:19 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Wed,  8 Feb 2012 10:51:38 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Currently when we check if we can handle thp as it is or we need to
> split it into regular sized pages, we hold page table lock prior to
> checking whether a given pmd is mapping thp or not. Because of this,
> when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> To remove it, this patch introduces an optimized check function and
> replaces several similar pieces of logic with it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: David Rientjes <rientjes@google.com>
> 

I prefer 'bool' for the return type... but this seems much better.

Thank you. I think this clean up is very good.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-02-08 15:51 [PATCH 0/6 v5] pagemap handles transparent hugepage Naoya Horiguchi
@ 2012-02-08 15:51 ` Naoya Horiguchi
  2012-02-09  2:19   ` KAMEZAWA Hiroyuki
  2012-02-19 21:21   ` Hugh Dickins
  0 siblings, 2 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-02-08 15:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Currently when we check if we can handle thp as it is or we need to
split it into regular sized pages, we hold page table lock prior to
checking whether a given pmd is mapping thp or not. Because of this,
when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
To remove it, this patch introduces an optimized check function and
replaces several similar pieces of logic with it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>

Changes since v4:
  - Rethink returned value of __pmd_trans_huge_lock()

Changes since v3:
  - Fix likely/unlikely pattern in pmd_trans_huge_stable()
  - Change suffix from _stable to _lock
  - Introduce __pmd_trans_huge_lock() to avoid micro-regression
  - Return 1 when wait_split_huge_page path is taken

Changes since v2:
  - Fix missing "return 0" in "thp under splitting" path
  - Remove unneeded comment
  - Change the name of check function to describe what it does
  - Add VM_BUG_ON(mmap_sem)
---
 fs/proc/task_mmu.c      |   70 +++++++++------------------
 include/linux/huge_mm.h |   17 +++++++
 mm/huge_memory.c        |  122 ++++++++++++++++++++++------------------------
 mm/mremap.c             |    2 -
 4 files changed, 98 insertions(+), 113 deletions(-)

diff --git 3.3-rc2.orig/fs/proc/task_mmu.c 3.3-rc2/fs/proc/task_mmu.c
index eb0a93e..9ff102a 100644
--- 3.3-rc2.orig/fs/proc/task_mmu.c
+++ 3.3-rc2/fs/proc/task_mmu.c
@@ -394,20 +394,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			smaps_pte_entry(*(pte_t *)pmd, addr,
-					HPAGE_PMD_SIZE, walk);
-			spin_unlock(&walk->mm->page_table_lock);
-			mss->anonymous_thp += HPAGE_PMD_SIZE;
-			return 0;
-		}
-	} else {
+	if (pmd_trans_huge_lock(pmd, vma)) {
+		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
 		spin_unlock(&walk->mm->page_table_lock);
+		mss->anonymous_thp += HPAGE_PMD_SIZE;
+		return 0;
 	}
 	/*
 	 * The mmap_sem held all the way back in m_start() is what
@@ -697,25 +688,17 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			for (; addr != end; addr += PAGE_SIZE) {
-				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
-					>> PAGE_SHIFT;
-				pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
-				err = add_to_pagemap(addr, pfn, pm);
-				if (err)
-					break;
-			}
-			spin_unlock(&walk->mm->page_table_lock);
-			return err;
+	if (pmd_trans_huge_lock(pmd, vma)) {
+		for (; addr != end; addr += PAGE_SIZE) {
+			unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
+				>> PAGE_SHIFT;
+			pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
+			err = add_to_pagemap(addr, pfn, pm);
+			if (err)
+				break;
 		}
-	} else {
 		spin_unlock(&walk->mm->page_table_lock);
+		return err;
 	}
 
 	for (; addr != end; addr += PAGE_SIZE) {
@@ -982,24 +965,17 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 
 	md = walk->private;
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(md->vma->anon_vma, pmd);
-		} else {
-			pte_t huge_pte = *(pte_t *)pmd;
-			struct page *page;
-
-			page = can_gather_numa_stats(huge_pte, md->vma, addr);
-			if (page)
-				gather_stats(page, md, pte_dirty(huge_pte),
-						HPAGE_PMD_SIZE/PAGE_SIZE);
-			spin_unlock(&walk->mm->page_table_lock);
-			return 0;
-		}
-	} else {
+
+	if (pmd_trans_huge_lock(pmd, md->vma)) {
+		pte_t huge_pte = *(pte_t *)pmd;
+		struct page *page;
+
+		page = can_gather_numa_stats(huge_pte, md->vma, addr);
+		if (page)
+			gather_stats(page, md, pte_dirty(huge_pte),
+				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(&walk->mm->page_table_lock);
+		return 0;
 	}
 
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
diff --git 3.3-rc2.orig/include/linux/huge_mm.h 3.3-rc2/include/linux/huge_mm.h
index 1b92129..f56cacb 100644
--- 3.3-rc2.orig/include/linux/huge_mm.h
+++ 3.3-rc2/include/linux/huge_mm.h
@@ -113,6 +113,18 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    long adjust_next);
+extern int __pmd_trans_huge_lock(pmd_t *pmd,
+				 struct vm_area_struct *vma);
+/* mmap_sem must be held on entry */
+static inline int pmd_trans_huge_lock(pmd_t *pmd,
+				      struct vm_area_struct *vma)
+{
+	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+	if (pmd_trans_huge(*pmd))
+		return __pmd_trans_huge_lock(pmd, vma);
+	else
+		return 0;
+}
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
@@ -176,6 +188,11 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int pmd_trans_huge_lock(pmd_t *pmd,
+				      struct vm_area_struct *vma)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git 3.3-rc2.orig/mm/huge_memory.c 3.3-rc2/mm/huge_memory.c
index b3ffc21..bbf57b5 100644
--- 3.3-rc2.orig/mm/huge_memory.c
+++ 3.3-rc2/mm/huge_memory.c
@@ -1030,30 +1030,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	spin_lock(&tlb->mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&tlb->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma,
-					     pmd);
-		} else {
-			struct page *page;
-			pgtable_t pgtable;
-			pgtable = get_pmd_huge_pte(tlb->mm);
-			page = pmd_page(*pmd);
-			pmd_clear(pmd);
-			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-			page_remove_rmap(page);
-			VM_BUG_ON(page_mapcount(page) < 0);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-			VM_BUG_ON(!PageHead(page));
-			spin_unlock(&tlb->mm->page_table_lock);
-			tlb_remove_page(tlb, page);
-			pte_free(tlb->mm, pgtable);
-			ret = 1;
-		}
-	} else
+	if (__pmd_trans_huge_lock(pmd, vma)) {
+		struct page *page;
+		pgtable_t pgtable;
+		pgtable = get_pmd_huge_pte(tlb->mm);
+		page = pmd_page(*pmd);
+		pmd_clear(pmd);
+		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		page_remove_rmap(page);
+		VM_BUG_ON(page_mapcount(page) < 0);
+		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		VM_BUG_ON(!PageHead(page));
 		spin_unlock(&tlb->mm->page_table_lock);
+		tlb_remove_page(tlb, page);
+		pte_free(tlb->mm, pgtable);
+		ret = 1;
+	}
 
 	return ret;
 }
@@ -1064,21 +1056,15 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	int ret = 0;
 
-	spin_lock(&vma->vm_mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		ret = !pmd_trans_splitting(*pmd);
-		spin_unlock(&vma->vm_mm->page_table_lock);
-		if (unlikely(!ret))
-			wait_split_huge_page(vma->anon_vma, pmd);
-		else {
-			/*
-			 * All logical pages in the range are present
-			 * if backed by a huge page.
-			 */
-			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
-		}
-	} else
+	if (__pmd_trans_huge_lock(pmd, vma)) {
+		/*
+		 * All logical pages in the range are present
+		 * if backed by a huge page.
+		 */
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
+		ret = 1;
+	}
 
 	return ret;
 }
@@ -1108,21 +1094,12 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		goto out;
 	}
 
-	spin_lock(&mm->page_table_lock);
-	if (likely(pmd_trans_huge(*old_pmd))) {
-		if (pmd_trans_splitting(*old_pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, old_pmd);
-			ret = -1;
-		} else {
-			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
-			VM_BUG_ON(!pmd_none(*new_pmd));
-			set_pmd_at(mm, new_addr, new_pmd, pmd);
-			spin_unlock(&mm->page_table_lock);
-			ret = 1;
-		}
-	} else {
+	if (__pmd_trans_huge_lock(old_pmd, vma)) {
+		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+		VM_BUG_ON(!pmd_none(*new_pmd));
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		spin_unlock(&mm->page_table_lock);
+		ret = 1;
 	}
 out:
 	return ret;
@@ -1134,24 +1111,41 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
 
-	spin_lock(&mm->page_table_lock);
+	if (__pmd_trans_huge_lock(pmd, vma)) {
+		pmd_t entry;
+		entry = pmdp_get_and_clear(mm, addr, pmd);
+		entry = pmd_modify(entry, newprot);
+		set_pmd_at(mm, addr, pmd, entry);
+		spin_unlock(&vma->vm_mm->page_table_lock);
+		ret = 1;
+	}
+
+	return ret;
+}
+
+/*
+ * Returns 1 if a given pmd maps a stable (not under splitting) thp.
+ * Returns 0 otherwise.
+ *
+ * Note that if it returns 1, this routine returns without unlocking page
+ * table locks. So callers must unlock them.
+ */
+int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+{
+	spin_lock(&vma->vm_mm->page_table_lock);
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&mm->page_table_lock);
+			spin_unlock(&vma->vm_mm->page_table_lock);
 			wait_split_huge_page(vma->anon_vma, pmd);
+			return 0;
 		} else {
-			pmd_t entry;
-
-			entry = pmdp_get_and_clear(mm, addr, pmd);
-			entry = pmd_modify(entry, newprot);
-			set_pmd_at(mm, addr, pmd, entry);
-			spin_unlock(&vma->vm_mm->page_table_lock);
-			ret = 1;
+			/* Thp mapped by 'pmd' is stable, so we can
+			 * handle it as it is. */
+			return 1;
 		}
-	} else
-		spin_unlock(&vma->vm_mm->page_table_lock);
-
-	return ret;
+	}
+	spin_unlock(&vma->vm_mm->page_table_lock);
+	return 0;
 }
 
 pmd_t *page_check_address_pmd(struct page *page,
diff --git 3.3-rc2.orig/mm/mremap.c 3.3-rc2/mm/mremap.c
index 87bb839..22458b9 100644
--- 3.3-rc2.orig/mm/mremap.c
+++ 3.3-rc2/mm/mremap.c
@@ -155,8 +155,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			if (err > 0) {
 				need_flush = true;
 				continue;
-			} else if (!err) {
-				split_huge_page_pmd(vma->vm_mm, old_pmd);
 			}
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
@ 2012-01-16 17:19 Naoya Horiguchi
  0 siblings, 0 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-16 17:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel

On Sat, Jan 14, 2012 at 06:19:56PM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 12, 2012 at 02:34:54PM -0500, Naoya Horiguchi wrote:
...
> > index 36b3d98..b7811df 100644
> > --- 3.2-rc5.orig/mm/huge_memory.c
> > +++ 3.2-rc5/mm/huge_memory.c
> > @@ -1001,29 +1001,21 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  {
> >  	int ret = 0;
> >  
> > -	spin_lock(&tlb->mm->page_table_lock);
> > -	if (likely(pmd_trans_huge(*pmd))) {
> > -		if (unlikely(pmd_trans_splitting(*pmd))) {
> > -			spin_unlock(&tlb->mm->page_table_lock);
> > -			wait_split_huge_page(vma->anon_vma,
> > -					     pmd);
> > -		} else {
> > -			struct page *page;
> > -			pgtable_t pgtable;
> > -			pgtable = get_pmd_huge_pte(tlb->mm);
> > -			page = pmd_page(*pmd);
> > -			pmd_clear(pmd);
> > -			page_remove_rmap(page);
> > -			VM_BUG_ON(page_mapcount(page) < 0);
> > -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > -			VM_BUG_ON(!PageHead(page));
> > -			spin_unlock(&tlb->mm->page_table_lock);
> > -			tlb_remove_page(tlb, page);
> > -			pte_free(tlb->mm, pgtable);
> > -			ret = 1;
> > -		}
> > -	} else
> > +	if (likely(pmd_trans_huge_stable(pmd, vma))) {
> > +		struct page *page;
> > +		pgtable_t pgtable;
> > +		pgtable = get_pmd_huge_pte(tlb->mm);
> > +		page = pmd_page(*pmd);
> > +		pmd_clear(pmd);
> > +		page_remove_rmap(page);
> > +		VM_BUG_ON(page_mapcount(page) < 0);
> > +		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > +		VM_BUG_ON(!PageHead(page));
> >  		spin_unlock(&tlb->mm->page_table_lock);
> > +		tlb_remove_page(tlb, page);
> > +		pte_free(tlb->mm, pgtable);
> > +		ret = 1;
> > +	}
> 
> This has been micro slowed down. I think you should use
> pmd_trans_huge_stable only in places where pmd_trans_huge cannot be
> set. I would back out the above as it's a micro-regression.

I guess this micro-regression happens because I failed to correctly replace
the likely()/unlikely() applied to pmd_trans_huge() and pmd_trans_splitting().
I should have kept them in pmd_trans_huge_stable() instead of applying
likely() to pmd_trans_huge_stable().

> Maybe what you could do if you want to clean it up further is to make
> a static inline in huge_mm of pmd_trans_huge_stable that only checks
> pmd_trans_huge and then calls __pmd_trans_huge_stable, and use
> __pmd_trans_huge_stable above.

OK, I agree.

> > @@ -1034,21 +1026,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  {
> >  	int ret = 0;
> >  
> > -	spin_lock(&vma->vm_mm->page_table_lock);
> > -	if (likely(pmd_trans_huge(*pmd))) {
> > -		ret = !pmd_trans_splitting(*pmd);
> > -		spin_unlock(&vma->vm_mm->page_table_lock);
> > -		if (unlikely(!ret))
> > -			wait_split_huge_page(vma->anon_vma, pmd);
> > -		else {
> > -			/*
> > -			 * All logical pages in the range are present
> > -			 * if backed by a huge page.
> > -			 */
> > -			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> > -		}
> > -	} else
> > +	if (likely(pmd_trans_huge_stable(pmd, vma))) {
> > +		/*
> > +		 * All logical pages in the range are present
> > +		 * if backed by a huge page.
> > +		 */
> >  		spin_unlock(&vma->vm_mm->page_table_lock);
> > +		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> > +	}
> >  
> >  	return ret;
> >  }
> 
> same slowdown here. Here even __pmd_trans_huge_stable wouldn't be
> enough to optimize it as it'd still generate more .text with two
> spin_unlock (one in __pmd_trans_huge_stable and one retained above)
> instead of just 1 in the original version.

Yes, the additional spin_unlock() increases the binary size of mincore_huge_pmd().
But replacing the open-coded check with __pmd_trans_huge_stable(), which unifies
duplicated code, reduces the binary size too. I think the size reduction is
larger than the growth from the additional spin_unlock().

> I'd avoid the cleanup for
> the above ultra optimized version.

Anyway if you don't like this replacement, I'll leave it as it is.

> > @@ -1078,21 +1063,12 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
> >  		goto out;
> >  	}
> >  
> > -	spin_lock(&mm->page_table_lock);
> > -	if (likely(pmd_trans_huge(*old_pmd))) {
> > -		if (pmd_trans_splitting(*old_pmd)) {
> > -			spin_unlock(&mm->page_table_lock);
> > -			wait_split_huge_page(vma->anon_vma, old_pmd);
> > -			ret = -1;
> > -		} else {
> > -			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
> > -			VM_BUG_ON(!pmd_none(*new_pmd));
> > -			set_pmd_at(mm, new_addr, new_pmd, pmd);
> > -			spin_unlock(&mm->page_table_lock);
> > -			ret = 1;
> > -		}
> > -	} else {
> > +	if (likely(pmd_trans_huge_stable(old_pmd, vma))) {
> > +		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
> > +		VM_BUG_ON(!pmd_none(*new_pmd));
> > +		set_pmd_at(mm, new_addr, new_pmd, pmd);
> >  		spin_unlock(&mm->page_table_lock);
> > +		ret = 1;
> >  	}
> 
> Same slowdown here, needs __pmd_trans_huge_stable as usual, but you
> are now forcing mremap to call split_huge_page even if it's not needed
> (i.e. after wait_split_huge_page).

I didn't think the behavior would change, but this can be a performance
regression from the additional if-check. As you commented below, we had better
change the return value of the wait_split_huge_page path in __pmd_trans_huge_stable().

> I'd like no-regression cleanups so
> I'd reverse the above and avoid changing already ultra-optimized code
> paths.

I agree.

> >  out:
> >  	return ret;
> > @@ -1104,27 +1080,48 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	int ret = 0;
> >  
> > -	spin_lock(&mm->page_table_lock);
> > -	if (likely(pmd_trans_huge(*pmd))) {
> > -		if (unlikely(pmd_trans_splitting(*pmd))) {
> > -			spin_unlock(&mm->page_table_lock);
> > -			wait_split_huge_page(vma->anon_vma, pmd);
> > -		} else {
> > -			pmd_t entry;
> > +	if (likely(pmd_trans_huge_stable(pmd, vma))) {
> > +		pmd_t entry;
> >  
> > -			entry = pmdp_get_and_clear(mm, addr, pmd);
> > -			entry = pmd_modify(entry, newprot);
> > -			set_pmd_at(mm, addr, pmd, entry);
> > -			spin_unlock(&vma->vm_mm->page_table_lock);
> > -			flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
> > -			ret = 1;
> > -		}
> > -	} else
> > +		entry = pmdp_get_and_clear(mm, addr, pmd);
> > +		entry = pmd_modify(entry, newprot);
> > +		set_pmd_at(mm, addr, pmd, entry);
> >  		spin_unlock(&vma->vm_mm->page_table_lock);
> > +		flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
> > +		ret = 1;
> > +	}
> >  
> >  	return ret;
> 
> Needs __pmd_trans_huge_stable. Ok to cleanup with that (no regression
> in this case with the __ version).
> 
> > diff --git 3.2-rc5.orig/mm/mremap.c 3.2-rc5/mm/mremap.c
> > index d6959cb..d534668 100644
> > --- 3.2-rc5.orig/mm/mremap.c
> > +++ 3.2-rc5/mm/mremap.c
> > @@ -155,9 +155,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  			if (err > 0) {
> >  				need_flush = true;
> >  				continue;
> > -			} else if (!err) {
> > -				split_huge_page_pmd(vma->vm_mm, old_pmd);
> >  			}
> > +			split_huge_page_pmd(vma->vm_mm, old_pmd);
> >  			VM_BUG_ON(pmd_trans_huge(*old_pmd));
> >  		}
> >  		if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,
> 
> regression. If you really want to optimize this and clean it up, you could
> make __pmd_trans_huge_stable return -1 if the wait_split_huge_page path
> was taken, then just change the other checks to == 1 and behave
> the same if it's 0 or -1, except in move_huge_pmd where you'll return
> -1 if __pmd_trans_huge_stable returned -1, to retain the above
> optimization.

All right.

> Maybe it's not much of an optimization anyway because we trade one
> branch for another, and both should be in l1 cache (though the retval
> is even guaranteed in a register, not only in l1 cache, so it's even
> better to check that for a branch), but to me it is more about keeping
> the code strict, which kind of self-documents it, because conceptually
> calling split_huge_page_pmd if wait_split_huge_page was called is
> superfluous (even if at runtime it won't make any difference).

OK, I'll cancel this change.

> Thanks for cleaning this up; especially where pmd_trans_huge_stable is a
> perfect fit, this is a nice cleanup.

Thank you for your valuable feedback!

Naoya

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-12 19:34 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
  2012-01-13 12:04   ` Hillf Danton
@ 2012-01-14 17:19   ` Andrea Arcangeli
  1 sibling, 0 replies; 30+ messages in thread
From: Andrea Arcangeli @ 2012-01-14 17:19 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-kernel

On Thu, Jan 12, 2012 at 02:34:54PM -0500, Naoya Horiguchi wrote:
> @@ -694,26 +686,18 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	/* find the first VMA at or above 'addr' */
>  	vma = find_vma(walk->mm, addr);
>  
> -	spin_lock(&walk->mm->page_table_lock);
> -	if (pmd_trans_huge(*pmd)) {
> -		if (pmd_trans_splitting(*pmd)) {
> -			spin_unlock(&walk->mm->page_table_lock);
> -			wait_split_huge_page(vma->anon_vma, pmd);
> -		} else {
> -			for (; addr != end; addr += PAGE_SIZE) {
> -				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
> -					>> PAGE_SHIFT;
> -				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
> -							       offset);
> -				err = add_to_pagemap(addr, pfn, pm);
> -				if (err)
> -					break;
> -			}
> -			spin_unlock(&walk->mm->page_table_lock);
> -			return err;
> +	if (pmd_trans_huge_stable(pmd, vma)) {
> +		for (; addr != end; addr += PAGE_SIZE) {
> +			unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
> +				>> PAGE_SHIFT;
> +			pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
> +						       offset);
> +			err = add_to_pagemap(addr, pfn, pm);
> +			if (err)
> +				break;
>  		}
> -	} else {
>  		spin_unlock(&walk->mm->page_table_lock);

This was already pointed out by Hillf; I didn't see a new submission, but I
guess it's in the works, thanks.

> index 36b3d98..b7811df 100644
> --- 3.2-rc5.orig/mm/huge_memory.c
> +++ 3.2-rc5/mm/huge_memory.c
> @@ -1001,29 +1001,21 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  {
>  	int ret = 0;
>  
> -	spin_lock(&tlb->mm->page_table_lock);
> -	if (likely(pmd_trans_huge(*pmd))) {
> -		if (unlikely(pmd_trans_splitting(*pmd))) {
> -			spin_unlock(&tlb->mm->page_table_lock);
> -			wait_split_huge_page(vma->anon_vma,
> -					     pmd);
> -		} else {
> -			struct page *page;
> -			pgtable_t pgtable;
> -			pgtable = get_pmd_huge_pte(tlb->mm);
> -			page = pmd_page(*pmd);
> -			pmd_clear(pmd);
> -			page_remove_rmap(page);
> -			VM_BUG_ON(page_mapcount(page) < 0);
> -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> -			VM_BUG_ON(!PageHead(page));
> -			spin_unlock(&tlb->mm->page_table_lock);
> -			tlb_remove_page(tlb, page);
> -			pte_free(tlb->mm, pgtable);
> -			ret = 1;
> -		}
> -	} else
> +	if (likely(pmd_trans_huge_stable(pmd, vma))) {
> +		struct page *page;
> +		pgtable_t pgtable;
> +		pgtable = get_pmd_huge_pte(tlb->mm);
> +		page = pmd_page(*pmd);
> +		pmd_clear(pmd);
> +		page_remove_rmap(page);
> +		VM_BUG_ON(page_mapcount(page) < 0);
> +		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +		VM_BUG_ON(!PageHead(page));
>  		spin_unlock(&tlb->mm->page_table_lock);
> +		tlb_remove_page(tlb, page);
> +		pte_free(tlb->mm, pgtable);
> +		ret = 1;
> +	}

This has been micro slowed down. I think you should use
pmd_trans_huge_stable only in places where pmd_trans_huge cannot be
set. I would back out the above as it's a micro-regression.

Maybe what you could do if you want to clean it up further is to make
a static inline in huge_mm of pmd_trans_huge_stable that only checks
pmd_trans_huge and then calls __pmd_trans_huge_stable, and use
__pmd_trans_huge_stable above.
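
Something along these lines (just a sketch, untested):

static inline int pmd_trans_huge_stable(pmd_t *pmd,
                                        struct vm_area_struct *vma)
{
        /* cheap check first, take the lock only for a real huge pmd */
        if (pmd_trans_huge(*pmd))
                return __pmd_trans_huge_stable(pmd, vma);
        return 0;
}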

> @@ -1034,21 +1026,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  {
>  	int ret = 0;
>  
> -	spin_lock(&vma->vm_mm->page_table_lock);
> -	if (likely(pmd_trans_huge(*pmd))) {
> -		ret = !pmd_trans_splitting(*pmd);
> -		spin_unlock(&vma->vm_mm->page_table_lock);
> -		if (unlikely(!ret))
> -			wait_split_huge_page(vma->anon_vma, pmd);
> -		else {
> -			/*
> -			 * All logical pages in the range are present
> -			 * if backed by a huge page.
> -			 */
> -			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> -		}
> -	} else
> +	if (likely(pmd_trans_huge_stable(pmd, vma))) {
> +		/*
> +		 * All logical pages in the range are present
> +		 * if backed by a huge page.
> +		 */
>  		spin_unlock(&vma->vm_mm->page_table_lock);
> +		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> +	}
>  
>  	return ret;
>  }

same slowdown here. Here even __pmd_trans_huge_stable wouldn't be
enough to optimize it as it'd still generate more .text with two
spin_unlock (one in __pmd_trans_huge_stable and one retained above)
instead of just 1 in the original version. I'd avoid the cleanup for
the above ultra optimized version.

> @@ -1078,21 +1063,12 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>  		goto out;
>  	}
>  
> -	spin_lock(&mm->page_table_lock);
> -	if (likely(pmd_trans_huge(*old_pmd))) {
> -		if (pmd_trans_splitting(*old_pmd)) {
> -			spin_unlock(&mm->page_table_lock);
> -			wait_split_huge_page(vma->anon_vma, old_pmd);
> -			ret = -1;
> -		} else {
> -			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
> -			VM_BUG_ON(!pmd_none(*new_pmd));
> -			set_pmd_at(mm, new_addr, new_pmd, pmd);
> -			spin_unlock(&mm->page_table_lock);
> -			ret = 1;
> -		}
> -	} else {
> +	if (likely(pmd_trans_huge_stable(old_pmd, vma))) {
> +		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
> +		VM_BUG_ON(!pmd_none(*new_pmd));
> +		set_pmd_at(mm, new_addr, new_pmd, pmd);
>  		spin_unlock(&mm->page_table_lock);
> +		ret = 1;
>  	}

Same slowdown here, needs __pmd_trans_huge_stable as usual, but you
are now forcing mremap to call split_huge_page even if it's not needed
(i.e. after wait_split_huge_page). I'd like no-regression cleanups so
I'd reverse the above and avoid changing already ultra-optimized code
paths.

>  out:
>  	return ret;
> @@ -1104,27 +1080,48 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	struct mm_struct *mm = vma->vm_mm;
>  	int ret = 0;
>  
> -	spin_lock(&mm->page_table_lock);
> -	if (likely(pmd_trans_huge(*pmd))) {
> -		if (unlikely(pmd_trans_splitting(*pmd))) {
> -			spin_unlock(&mm->page_table_lock);
> -			wait_split_huge_page(vma->anon_vma, pmd);
> -		} else {
> -			pmd_t entry;
> +	if (likely(pmd_trans_huge_stable(pmd, vma))) {
> +		pmd_t entry;
>  
> -			entry = pmdp_get_and_clear(mm, addr, pmd);
> -			entry = pmd_modify(entry, newprot);
> -			set_pmd_at(mm, addr, pmd, entry);
> -			spin_unlock(&vma->vm_mm->page_table_lock);
> -			flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
> -			ret = 1;
> -		}
> -	} else
> +		entry = pmdp_get_and_clear(mm, addr, pmd);
> +		entry = pmd_modify(entry, newprot);
> +		set_pmd_at(mm, addr, pmd, entry);
>  		spin_unlock(&vma->vm_mm->page_table_lock);
> +		flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
> +		ret = 1;
> +	}
>  
>  	return ret;

Needs __pmd_trans_huge_stable. Ok to cleanup with that (no regression
in this case with the __ version).

> diff --git 3.2-rc5.orig/mm/mremap.c 3.2-rc5/mm/mremap.c
> index d6959cb..d534668 100644
> --- 3.2-rc5.orig/mm/mremap.c
> +++ 3.2-rc5/mm/mremap.c
> @@ -155,9 +155,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  			if (err > 0) {
>  				need_flush = true;
>  				continue;
> -			} else if (!err) {
> -				split_huge_page_pmd(vma->vm_mm, old_pmd);
>  			}
> +			split_huge_page_pmd(vma->vm_mm, old_pmd);
>  			VM_BUG_ON(pmd_trans_huge(*old_pmd));
>  		}
>  		if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,

regression. If you really want to optimize this and clean it up, you could
make __pmd_trans_huge_stable return -1 if the wait_split_huge_page path
was taken, then just change the other checks to == 1 and behave
the same if it's 0 or -1, except in move_huge_pmd where you'll return
-1 if __pmd_trans_huge_stable returned -1, to retain the above
optimization.
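
Concretely, something roughly like this (sketch only):

        /* __pmd_trans_huge_stable(), splitting path */
        if (unlikely(pmd_trans_splitting(*pmd))) {
                spin_unlock(&vma->vm_mm->page_table_lock);
                wait_split_huge_page(vma->anon_vma, pmd);
                return -1;      /* already waited, no need to split again */
        }

        /* move_huge_pmd() then propagates the value so that mremap
         * can skip split_huge_page_pmd() after a wait */
        ret = __pmd_trans_huge_stable(old_pmd, vma);
        if (ret == 1) {
                /* move the huge pmd under page_table_lock, then unlock */
        }
        return ret;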

Maybe it's not much of an optimization anyway because we trade one
branch for another, and both should be in l1 cache (though the retval
is even guaranteed in a register, not only in l1 cache, so it's even
better to check that for a branch), but to me it is more about keeping
the code strict, which kind of self-documents it, because conceptually
calling split_huge_page_pmd if wait_split_huge_page was called is
superfluous (even if at runtime it won't make any difference).

Thanks for cleaning this up; especially where pmd_trans_huge_stable is a
perfect fit, this is a nice cleanup.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-13 15:14     ` Naoya Horiguchi
@ 2012-01-14  3:24       ` Hillf Danton
  0 siblings, 0 replies; 30+ messages in thread
From: Hillf Danton @ 2012-01-14  3:24 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, LKML

On Fri, Jan 13, 2012 at 11:14 PM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> Hi Hillf,
>
> (1/13/2012 7:04), Hillf Danton wrote:
> [...]
>>> +/*
>>> + * Returns 1 if a given pmd is mapping a thp and stable (not under splitting.)
>>> + * Returns 0 otherwise. Note that if it returns 1, this routine returns without
>>> + * unlocking page table locks. So callers must unlock them.
>>> + */
>>> +int pmd_trans_huge_stable(pmd_t *pmd, struct vm_area_struct *vma)
>>> +{
>>> +       VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
>>> +
>>> +       if (!pmd_trans_huge(*pmd))
>>> +               return 0;
>>> +
>>> +       spin_lock(&vma->vm_mm->page_table_lock);
>>> +       if (likely(pmd_trans_huge(*pmd))) {
>>> +               if (pmd_trans_splitting(*pmd)) {
>>> +                       spin_unlock(&vma->vm_mm->page_table_lock);
>>> +                       wait_split_huge_page(vma->anon_vma, pmd);
>>> +                       return 0;
>>> +               } else {
>>
>>                             spin_unlock(&vma->vm_mm->page_table_lock);     yes?
>
> No. Unlocking is supposed to be done by the caller as commented.
>
Thanks for the correction /Hillf

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-13 12:04   ` Hillf Danton
@ 2012-01-13 15:14     ` Naoya Horiguchi
  2012-01-14  3:24       ` Hillf Danton
  0 siblings, 1 reply; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-13 15:14 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, LKML

Hi Hillf,

(1/13/2012 7:04), Hillf Danton wrote:
[...]
>> +/*
>> + * Returns 1 if a given pmd is mapping a thp and stable (not under splitting.)
>> + * Returns 0 otherwise. Note that if it returns 1, this routine returns without
>> + * unlocking page table locks. So callers must unlock them.
>> + */
>> +int pmd_trans_huge_stable(pmd_t *pmd, struct vm_area_struct *vma)
>> +{
>> +       VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
>> +
>> +       if (!pmd_trans_huge(*pmd))
>> +               return 0;
>> +
>> +       spin_lock(&vma->vm_mm->page_table_lock);
>> +       if (likely(pmd_trans_huge(*pmd))) {
>> +               if (pmd_trans_splitting(*pmd)) {
>> +                       spin_unlock(&vma->vm_mm->page_table_lock);
>> +                       wait_split_huge_page(vma->anon_vma, pmd);
>> +                       return 0;
>> +               } else {
> 
>                             spin_unlock(&vma->vm_mm->page_table_lock);     yes?

No. Unlocking is supposed to be done by the caller as commented.

Thanks,
Naoya

> 
>> +                       /* Thp mapped by 'pmd' is stable, so we can
>> +                        * handle it as it is. */
>> +                       return 1;
>> +               }
>> +       }
>> +       spin_unlock(&vma->vm_mm->page_table_lock);
>> +       return 0;
>> +}
>> +


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-12 19:34 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
@ 2012-01-13 12:04   ` Hillf Danton
  2012-01-13 15:14     ` Naoya Horiguchi
  2012-01-14 17:19   ` Andrea Arcangeli
  1 sibling, 1 reply; 30+ messages in thread
From: Hillf Danton @ 2012-01-13 12:04 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, LKML

On Fri, Jan 13, 2012 at 3:34 AM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> Currently when we check if we can handle thp as it is or we need to
> split it into regular sized pages, we hold page table lock prior to
> checking whether a given pmd is mapping thp or not. Because of this,
> when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
> To remove it, this patch introduces an optimized check function and
> replaces several similar pieces of logic with it.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: David Rientjes <rientjes@google.com>
>
> Changes since v2:
>  - Fix missing "return 0" in "thp under splitting" path
>  - Remove unneeded comment
>  - Change the name of check function to describe what it does
>  - Add VM_BUG_ON(mmap_sem)
> ---
>  fs/proc/task_mmu.c      |   73 +++++++++------------------
>  include/linux/huge_mm.h |    7 +++
>  mm/huge_memory.c        |  127 +++++++++++++++++++++++------------------------
>  mm/mremap.c             |    3 +-
>  4 files changed, 95 insertions(+), 115 deletions(-)
>
[...]
> +/*
> + * Returns 1 if a given pmd is mapping a thp and stable (not under splitting.)
> + * Returns 0 otherwise. Note that if it returns 1, this routine returns without
> + * unlocking page table locks. So callers must unlock them.
> + */
> +int pmd_trans_huge_stable(pmd_t *pmd, struct vm_area_struct *vma)
> +{
> +       VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
> +
> +       if (!pmd_trans_huge(*pmd))
> +               return 0;
> +
> +       spin_lock(&vma->vm_mm->page_table_lock);
> +       if (likely(pmd_trans_huge(*pmd))) {
> +               if (pmd_trans_splitting(*pmd)) {
> +                       spin_unlock(&vma->vm_mm->page_table_lock);
> +                       wait_split_huge_page(vma->anon_vma, pmd);
> +                       return 0;
> +               } else {

                            spin_unlock(&vma->vm_mm->page_table_lock);     yes?

> +                       /* Thp mapped by 'pmd' is stable, so we can
> +                        * handle it as it is. */
> +                       return 1;
> +               }
> +       }
> +       spin_unlock(&vma->vm_mm->page_table_lock);
> +       return 0;
> +}
> +

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-12 19:34 [PATCH 0/6 v3] pagemap handles transparent hugepage Naoya Horiguchi
@ 2012-01-12 19:34 ` Naoya Horiguchi
  2012-01-13 12:04   ` Hillf Danton
  2012-01-14 17:19   ` Andrea Arcangeli
  0 siblings, 2 replies; 30+ messages in thread
From: Naoya Horiguchi @ 2012-01-12 19:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Currently when we check if we can handle thp as it is or we need to
split it into regular sized pages, we hold page table lock prior to
checking whether a given pmd is mapping thp or not. Because of this,
when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead.
To remove it, this patch introduces an optimized check function and
replaces several similar pieces of logic with it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>

Changes since v2:
  - Fix missing "return 0" in "thp under splitting" path
  - Remove unneeded comment
  - Change the name of check function to describe what it does
  - Add VM_BUG_ON(mmap_sem)
---
 fs/proc/task_mmu.c      |   73 +++++++++------------------
 include/linux/huge_mm.h |    7 +++
 mm/huge_memory.c        |  127 +++++++++++++++++++++++------------------------
 mm/mremap.c             |    3 +-
 4 files changed, 95 insertions(+), 115 deletions(-)

diff --git 3.2-rc5.orig/fs/proc/task_mmu.c 3.2-rc5/fs/proc/task_mmu.c
index bd19177..5db15eb 100644
--- 3.2-rc5.orig/fs/proc/task_mmu.c
+++ 3.2-rc5/fs/proc/task_mmu.c
@@ -394,20 +394,12 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			smaps_pte_entry(*(pte_t *)pmd, addr,
-					HPAGE_PMD_SIZE, walk);
-			spin_unlock(&walk->mm->page_table_lock);
-			mss->anonymous_thp += HPAGE_PMD_SIZE;
-			return 0;
-		}
-	} else {
+	if (pmd_trans_huge_stable(pmd, vma)) {
+		smaps_pte_entry(*(pte_t *)pmd, addr,
+				HPAGE_PMD_SIZE, walk);
 		spin_unlock(&walk->mm->page_table_lock);
+		mss->anonymous_thp += HPAGE_PMD_SIZE;
+		return 0;
 	}
 	/*
 	 * The mmap_sem held all the way back in m_start() is what
@@ -694,26 +686,18 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			for (; addr != end; addr += PAGE_SIZE) {
-				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
-					>> PAGE_SHIFT;
-				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
-							       offset);
-				err = add_to_pagemap(addr, pfn, pm);
-				if (err)
-					break;
-			}
-			spin_unlock(&walk->mm->page_table_lock);
-			return err;
+	if (pmd_trans_huge_stable(pmd, vma)) {
+		for (; addr != end; addr += PAGE_SIZE) {
+			unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
+				>> PAGE_SHIFT;
+			pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
+						       offset);
+			err = add_to_pagemap(addr, pfn, pm);
+			if (err)
+				break;
 		}
-	} else {
 		spin_unlock(&walk->mm->page_table_lock);
+		return err;
 	}
 
 	for (; addr != end; addr += PAGE_SIZE) {
@@ -980,24 +964,17 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 
 	md = walk->private;
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(md->vma->anon_vma, pmd);
-		} else {
-			pte_t huge_pte = *(pte_t *)pmd;
-			struct page *page;
-
-			page = can_gather_numa_stats(huge_pte, md->vma, addr);
-			if (page)
-				gather_stats(page, md, pte_dirty(huge_pte),
-						HPAGE_PMD_SIZE/PAGE_SIZE);
-			spin_unlock(&walk->mm->page_table_lock);
-			return 0;
-		}
-	} else {
+
+	if (pmd_trans_huge_stable(pmd, md->vma)) {
+		pte_t huge_pte = *(pte_t *)pmd;
+		struct page *page;
+
+		page = can_gather_numa_stats(huge_pte, md->vma, addr);
+		if (page)
+			gather_stats(page, md, pte_dirty(huge_pte),
+				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(&walk->mm->page_table_lock);
+		return 0;
 	}
 
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
diff --git 3.2-rc5.orig/include/linux/huge_mm.h 3.2-rc5/include/linux/huge_mm.h
index a9ace9c..bb87147 100644
--- 3.2-rc5.orig/include/linux/huge_mm.h
+++ 3.2-rc5/include/linux/huge_mm.h
@@ -113,6 +113,8 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    long adjust_next);
+extern int pmd_trans_huge_stable(pmd_t *pmd,
+				struct vm_area_struct *vma);
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
@@ -176,6 +178,11 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int pmd_trans_huge_stable(pmd_t *pmd,
+					struct vm_area_struct *vma)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git 3.2-rc5.orig/mm/huge_memory.c 3.2-rc5/mm/huge_memory.c
index 36b3d98..b7811df 100644
--- 3.2-rc5.orig/mm/huge_memory.c
+++ 3.2-rc5/mm/huge_memory.c
@@ -1001,29 +1001,21 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	spin_lock(&tlb->mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&tlb->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma,
-					     pmd);
-		} else {
-			struct page *page;
-			pgtable_t pgtable;
-			pgtable = get_pmd_huge_pte(tlb->mm);
-			page = pmd_page(*pmd);
-			pmd_clear(pmd);
-			page_remove_rmap(page);
-			VM_BUG_ON(page_mapcount(page) < 0);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-			VM_BUG_ON(!PageHead(page));
-			spin_unlock(&tlb->mm->page_table_lock);
-			tlb_remove_page(tlb, page);
-			pte_free(tlb->mm, pgtable);
-			ret = 1;
-		}
-	} else
+	if (likely(pmd_trans_huge_stable(pmd, vma))) {
+		struct page *page;
+		pgtable_t pgtable;
+		pgtable = get_pmd_huge_pte(tlb->mm);
+		page = pmd_page(*pmd);
+		pmd_clear(pmd);
+		page_remove_rmap(page);
+		VM_BUG_ON(page_mapcount(page) < 0);
+		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		VM_BUG_ON(!PageHead(page));
 		spin_unlock(&tlb->mm->page_table_lock);
+		tlb_remove_page(tlb, page);
+		pte_free(tlb->mm, pgtable);
+		ret = 1;
+	}
 
 	return ret;
 }
@@ -1034,21 +1026,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	int ret = 0;
 
-	spin_lock(&vma->vm_mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		ret = !pmd_trans_splitting(*pmd);
-		spin_unlock(&vma->vm_mm->page_table_lock);
-		if (unlikely(!ret))
-			wait_split_huge_page(vma->anon_vma, pmd);
-		else {
-			/*
-			 * All logical pages in the range are present
-			 * if backed by a huge page.
-			 */
-			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
-		}
-	} else
+	if (likely(pmd_trans_huge_stable(pmd, vma))) {
+		/*
+		 * All logical pages in the range are present
+		 * if backed by a huge page.
+		 */
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
+	}
 
 	return ret;
 }
@@ -1078,21 +1063,12 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		goto out;
 	}
 
-	spin_lock(&mm->page_table_lock);
-	if (likely(pmd_trans_huge(*old_pmd))) {
-		if (pmd_trans_splitting(*old_pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, old_pmd);
-			ret = -1;
-		} else {
-			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
-			VM_BUG_ON(!pmd_none(*new_pmd));
-			set_pmd_at(mm, new_addr, new_pmd, pmd);
-			spin_unlock(&mm->page_table_lock);
-			ret = 1;
-		}
-	} else {
+	if (likely(pmd_trans_huge_stable(old_pmd, vma))) {
+		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+		VM_BUG_ON(!pmd_none(*new_pmd));
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		spin_unlock(&mm->page_table_lock);
+		ret = 1;
 	}
 out:
 	return ret;
@@ -1104,27 +1080,48 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
 
-	spin_lock(&mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			pmd_t entry;
+	if (likely(pmd_trans_huge_stable(pmd, vma))) {
+		pmd_t entry;
 
-			entry = pmdp_get_and_clear(mm, addr, pmd);
-			entry = pmd_modify(entry, newprot);
-			set_pmd_at(mm, addr, pmd, entry);
-			spin_unlock(&vma->vm_mm->page_table_lock);
-			flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
-			ret = 1;
-		}
-	} else
+		entry = pmdp_get_and_clear(mm, addr, pmd);
+		entry = pmd_modify(entry, newprot);
+		set_pmd_at(mm, addr, pmd, entry);
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
+		ret = 1;
+	}
 
 	return ret;
 }
 
+/*
+ * Returns 1 if a given pmd is mapping a thp and stable (not under splitting.)
+ * Returns 0 otherwise. Note that if it returns 1, this routine returns without
+ * unlocking page table locks. So callers must unlock them.
+ */
+int pmd_trans_huge_stable(pmd_t *pmd, struct vm_area_struct *vma)
+{
+	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+
+	if (!pmd_trans_huge(*pmd))
+		return 0;
+
+	spin_lock(&vma->vm_mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (pmd_trans_splitting(*pmd)) {
+			spin_unlock(&vma->vm_mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+			return 0;
+		} else {
+			/* Thp mapped by 'pmd' is stable, so we can
+			 * handle it as it is. */
+			return 1;
+		}
+	}
+	spin_unlock(&vma->vm_mm->page_table_lock);
+	return 0;
+}
+
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
diff --git 3.2-rc5.orig/mm/mremap.c 3.2-rc5/mm/mremap.c
index d6959cb..d534668 100644
--- 3.2-rc5.orig/mm/mremap.c
+++ 3.2-rc5/mm/mremap.c
@@ -155,9 +155,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			if (err > 0) {
 				need_flush = true;
 				continue;
-			} else if (!err) {
-				split_huge_page_pmd(vma->vm_mm, old_pmd);
 			}
+			split_huge_page_pmd(vma->vm_mm, old_pmd);
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
 		if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,
-- 
1.7.6.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2012-02-20 11:54 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
2012-01-29 13:17   ` Hillf Danton
2012-01-30 19:23     ` Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
2012-01-28 11:23   ` Hillf Danton
2012-01-28 22:33     ` Naoya Horiguchi
2012-01-30  6:22   ` KAMEZAWA Hiroyuki
2012-02-02  5:27     ` Naoya Horiguchi
2012-02-02  8:32       ` KAMEZAWA Hiroyuki
2012-01-27 23:02 ` [PATCH 3/6] pagemap: export KPF_THP Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 4/6] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 5/6] introduce thp_ptep_get() Naoya Horiguchi
2012-01-30  6:26   ` KAMEZAWA Hiroyuki
2012-01-30 19:24     ` Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 6/6] pagemap: introduce data structure for pagemap entry Naoya Horiguchi
2012-01-30  6:31   ` KAMEZAWA Hiroyuki
2012-01-30 19:27     ` Naoya Horiguchi
  -- strict thread matches above, loose matches on Subject: below --
2012-02-08 15:51 [PATCH 0/6 v5] pagemap handles transparent hugepage Naoya Horiguchi
2012-02-08 15:51 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
2012-02-09  2:19   ` KAMEZAWA Hiroyuki
2012-02-19 21:21   ` Hugh Dickins
2012-02-20  7:28     ` Naoya Horiguchi
2012-02-20 11:38       ` Hugh Dickins
2012-02-20 11:54         ` Jiri Slaby
2012-01-16 17:19 Naoya Horiguchi
2012-01-12 19:34 [PATCH 0/6 v3] pagemap handles transparent hugepage Naoya Horiguchi
2012-01-12 19:34 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
2012-01-13 12:04   ` Hillf Danton
2012-01-13 15:14     ` Naoya Horiguchi
2012-01-14  3:24       ` Hillf Danton
2012-01-14 17:19   ` Andrea Arcangeli
