* [PATCH 0/6 v4] pagemap handles transparent hugepage
@ 2012-01-27 23:02 Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Hi,

I rebased the patchset onto 3.3-rc1 and made some fixes to the thp
optimization patch based on the feedback from Andrea.

Naoya Horiguchi (6):
  pagemap: avoid splitting thp when reading /proc/pid/pagemap
  thp: optimize away unnecessary page table locking
  pagemap: export KPF_THP
  pagemap: document KPF_THP and make page-types aware of it
  introduce thp_ptep_get()
  pagemap: introduce data structure for pagemap entry

 Documentation/vm/page-types.c     |    2 +
 Documentation/vm/pagemap.txt      |    4 +
 arch/x86/include/asm/pgtable.h    |    5 ++
 fs/proc/page.c                    |    2 +
 fs/proc/task_mmu.c                |  135 +++++++++++++++++++++----------------
 include/asm-generic/pgtable.h     |    4 +
 include/linux/huge_mm.h           |   17 +++++
 include/linux/kernel-page-flags.h |    1 +
 mm/huge_memory.c                  |  120 +++++++++++++++-----------------
 9 files changed, 169 insertions(+), 121 deletions(-)

Thanks,
Naoya


* [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-29 13:17   ` Hillf Danton
  2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

A thp split is not necessary if we explicitly check whether pmds are
mapping thps or not. This patch introduces this check and adds code
to generate pagemap entries for pmds mapping thps, which reduces the
performance impact of pagemap on thp.
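
For reference, each pagemap entry is a 64-bit word read from
/proc/pid/pagemap, one entry per virtual page. A minimal sketch of a
userspace reader (illustrative only: the helper name is made up, and
the bit layout -- pfn in bits 0-54, "present" in bit 63 -- is the one
described in Documentation/vm/pagemap.txt):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	/* Hypothetical helper: fetch the pagemap entry covering one
	 * virtual address of process 'pid'. */
	static int read_pagemap_entry(pid_t pid, unsigned long vaddr,
				      uint64_t *ent)
	{
		char path[64];
		int fd;
		off_t off = (off_t)(vaddr / sysconf(_SC_PAGESIZE))
				* sizeof(*ent);

		snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			return -1;
		if (pread(fd, ent, sizeof(*ent), off) != sizeof(*ent)) {
			close(fd);
			return -1;
		}
		close(fd);
		return 0;
	}

With this patch, reading the entries that cover a thp no longer
splits the huge page into regular pages as a side effect.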

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changes since v3:
  - Generate pagemap entry directly from pmd to avoid messy casting

Changes since v2:
  - Add comment on if check in thp_pte_to_pagemap_entry()
  - Convert type of offset into unsigned long

Changes since v1:
  - Move pfn declaration to the beginning of pagemap_pte_range()
---
 fs/proc/task_mmu.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 47 insertions(+), 6 deletions(-)

diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
index e418c5a..cfbba8d 100644
--- 3.3-rc1.orig/fs/proc/task_mmu.c
+++ 3.3-rc1/fs/proc/task_mmu.c
@@ -600,6 +600,9 @@ struct pagemapread {
 	u64 *buffer;
 };
 
+#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
+#define PAGEMAP_WALK_MASK	(PMD_MASK)
+
 #define PM_ENTRY_BYTES      sizeof(u64)
 #define PM_STATUS_BITS      3
 #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
@@ -658,6 +661,27 @@ static u64 pte_to_pagemap_entry(pte_t pte)
 	return pme;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+{
+	u64 pme = 0;
+	/*
+	 * Currently pmd for thp is always present because thp can not be
+	 * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
+	 * This if-check is just to prepare for future implementation.
+	 */
+	if (pmd_present(pmd))
+		pme = PM_PFRAME(pmd_pfn(pmd) + offset)
+			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
+	return pme;
+}
+#else
+static inline u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+{
+	return 0;
+}
+#endif
+
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     struct mm_walk *walk)
 {
@@ -665,14 +689,33 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-
-	split_huge_page_pmd(walk->mm, pmd);
+	u64 pfn = PM_NOT_PRESENT;
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
-	for (; addr != end; addr += PAGE_SIZE) {
-		u64 pfn = PM_NOT_PRESENT;
 
+	spin_lock(&walk->mm->page_table_lock);
+	if (pmd_trans_huge(*pmd)) {
+		if (pmd_trans_splitting(*pmd)) {
+			spin_unlock(&walk->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+		} else {
+			for (; addr != end; addr += PAGE_SIZE) {
+				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
+					>> PAGE_SHIFT;
+				pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
+				err = add_to_pagemap(addr, pfn, pm);
+				if (err)
+					break;
+			}
+			spin_unlock(&walk->mm->page_table_lock);
+			return err;
+		}
+	} else {
+		spin_unlock(&walk->mm->page_table_lock);
+	}
+
+	for (; addr != end; addr += PAGE_SIZE) {
 		/* check to see if we've left 'vma' behind
 		 * and need a new, higher one */
 		if (vma && (addr >= vma->vm_end))
@@ -754,8 +797,6 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
-#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
-#define PAGEMAP_WALK_MASK	(PMD_MASK)
 static ssize_t pagemap_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
-- 
1.7.7.6



* [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-28 11:23   ` Hillf Danton
  2012-01-30  6:22   ` KAMEZAWA Hiroyuki
  2012-01-27 23:02 ` [PATCH 3/6] pagemap: export KPF_THP Naoya Horiguchi
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Currently, when we check whether we can handle a thp as it is or need
to split it into regular-sized pages, we take the page table lock
before checking whether a given pmd is mapping a thp. Because of this,
when it's not a "huge pmd" we suffer unnecessary lock/unlock overhead.
To remove it, this patch introduces an optimized check function and
replaces several similar code paths with it.
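
The calling convention this introduces, in sketch form (taken from
the smaps hunk below; when the function returns 1, page_table_lock is
still held and the caller must drop it):

	if (pmd_trans_huge_lock(pmd, vma) == 1) {
		/* A stable thp: the lock is held, so the whole
		 * HPAGE_PMD_SIZE range can be handled in one go. */
		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
		spin_unlock(&walk->mm->page_table_lock);
		return 0;
	}
	/* Returned 0 (no thp) or -1 (a splitting thp we waited out):
	 * no lock is held, so fall through to the pte-level walk. */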

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>

Changes since v3:
  - Fix likely/unlikely pattern in pmd_trans_huge_stable()
  - Change suffix from _stable to _lock
  - Introduce __pmd_trans_huge_lock() to avoid micro-regression
  - Return 1 when wait_split_huge_page path is taken

Changes since v2:
  - Fix missing "return 0" in "thp under splitting" path
  - Remove unneeded comment
  - Change the name of check function to describe what it does
  - Add VM_BUG_ON(mmap_sem)
---
 fs/proc/task_mmu.c      |   70 +++++++++------------------
 include/linux/huge_mm.h |   17 +++++++
 mm/huge_memory.c        |  120 ++++++++++++++++++++++-------------------------
 3 files changed, 96 insertions(+), 111 deletions(-)

diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
index cfbba8d..2622b64 100644
--- 3.3-rc1.orig/fs/proc/task_mmu.c
+++ 3.3-rc1/fs/proc/task_mmu.c
@@ -394,20 +394,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			smaps_pte_entry(*(pte_t *)pmd, addr,
-					HPAGE_PMD_SIZE, walk);
-			spin_unlock(&walk->mm->page_table_lock);
-			mss->anonymous_thp += HPAGE_PMD_SIZE;
-			return 0;
-		}
-	} else {
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
 		spin_unlock(&walk->mm->page_table_lock);
+		mss->anonymous_thp += HPAGE_PMD_SIZE;
+		return 0;
 	}
 	/*
 	 * The mmap_sem held all the way back in m_start() is what
@@ -694,25 +685,17 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			for (; addr != end; addr += PAGE_SIZE) {
-				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
-					>> PAGE_SHIFT;
-				pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
-				err = add_to_pagemap(addr, pfn, pm);
-				if (err)
-					break;
-			}
-			spin_unlock(&walk->mm->page_table_lock);
-			return err;
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		for (; addr != end; addr += PAGE_SIZE) {
+			unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
+				>> PAGE_SHIFT;
+			pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
+			err = add_to_pagemap(addr, pfn, pm);
+			if (err)
+				break;
 		}
-	} else {
 		spin_unlock(&walk->mm->page_table_lock);
+		return err;
 	}
 
 	for (; addr != end; addr += PAGE_SIZE) {
@@ -979,24 +962,17 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 
 	md = walk->private;
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(md->vma->anon_vma, pmd);
-		} else {
-			pte_t huge_pte = *(pte_t *)pmd;
-			struct page *page;
-
-			page = can_gather_numa_stats(huge_pte, md->vma, addr);
-			if (page)
-				gather_stats(page, md, pte_dirty(huge_pte),
-						HPAGE_PMD_SIZE/PAGE_SIZE);
-			spin_unlock(&walk->mm->page_table_lock);
-			return 0;
-		}
-	} else {
+
+	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
+		pte_t huge_pte = *(pte_t *)pmd;
+		struct page *page;
+
+		page = can_gather_numa_stats(huge_pte, md->vma, addr);
+		if (page)
+			gather_stats(page, md, pte_dirty(huge_pte),
+				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(&walk->mm->page_table_lock);
+		return 0;
 	}
 
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
diff --git 3.3-rc1.orig/include/linux/huge_mm.h 3.3-rc1/include/linux/huge_mm.h
index 1b92129..f56cacb 100644
--- 3.3-rc1.orig/include/linux/huge_mm.h
+++ 3.3-rc1/include/linux/huge_mm.h
@@ -113,6 +113,18 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    long adjust_next);
+extern int __pmd_trans_huge_lock(pmd_t *pmd,
+				 struct vm_area_struct *vma);
+/* mmap_sem must be held on entry */
+static inline int pmd_trans_huge_lock(pmd_t *pmd,
+				      struct vm_area_struct *vma)
+{
+	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+	if (pmd_trans_huge(*pmd))
+		return __pmd_trans_huge_lock(pmd, vma);
+	else
+		return 0;
+}
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
@@ -176,6 +188,11 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int pmd_trans_huge_lock(pmd_t *pmd,
+				      struct vm_area_struct *vma)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git 3.3-rc1.orig/mm/huge_memory.c 3.3-rc1/mm/huge_memory.c
index b3ffc21..cca6461 100644
--- 3.3-rc1.orig/mm/huge_memory.c
+++ 3.3-rc1/mm/huge_memory.c
@@ -1030,30 +1030,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	spin_lock(&tlb->mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&tlb->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma,
-					     pmd);
-		} else {
-			struct page *page;
-			pgtable_t pgtable;
-			pgtable = get_pmd_huge_pte(tlb->mm);
-			page = pmd_page(*pmd);
-			pmd_clear(pmd);
-			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-			page_remove_rmap(page);
-			VM_BUG_ON(page_mapcount(page) < 0);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-			VM_BUG_ON(!PageHead(page));
-			spin_unlock(&tlb->mm->page_table_lock);
-			tlb_remove_page(tlb, page);
-			pte_free(tlb->mm, pgtable);
-			ret = 1;
-		}
-	} else
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+		struct page *page;
+		pgtable_t pgtable;
+		pgtable = get_pmd_huge_pte(tlb->mm);
+		page = pmd_page(*pmd);
+		pmd_clear(pmd);
+		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		page_remove_rmap(page);
+		VM_BUG_ON(page_mapcount(page) < 0);
+		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		VM_BUG_ON(!PageHead(page));
 		spin_unlock(&tlb->mm->page_table_lock);
+		tlb_remove_page(tlb, page);
+		pte_free(tlb->mm, pgtable);
+		ret = 1;
+	}
 
 	return ret;
 }
@@ -1064,21 +1056,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	int ret = 0;
 
-	spin_lock(&vma->vm_mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		ret = !pmd_trans_splitting(*pmd);
-		spin_unlock(&vma->vm_mm->page_table_lock);
-		if (unlikely(!ret))
-			wait_split_huge_page(vma->anon_vma, pmd);
-		else {
-			/*
-			 * All logical pages in the range are present
-			 * if backed by a huge page.
-			 */
-			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
-		}
-	} else
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+		/*
+		 * All logical pages in the range are present
+		 * if backed by a huge page.
+		 */
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
+	}
 
 	return ret;
 }
@@ -1108,20 +1093,10 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		goto out;
 	}
 
-	spin_lock(&mm->page_table_lock);
-	if (likely(pmd_trans_huge(*old_pmd))) {
-		if (pmd_trans_splitting(*old_pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, old_pmd);
-			ret = -1;
-		} else {
-			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
-			VM_BUG_ON(!pmd_none(*new_pmd));
-			set_pmd_at(mm, new_addr, new_pmd, pmd);
-			spin_unlock(&mm->page_table_lock);
-			ret = 1;
-		}
-	} else {
+	if ((ret = __pmd_trans_huge_lock(old_pmd, vma)) == 1) {
+		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+		VM_BUG_ON(!pmd_none(*new_pmd));
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		spin_unlock(&mm->page_table_lock);
 	}
 out:
@@ -1134,24 +1109,41 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
 
-	spin_lock(&mm->page_table_lock);
+	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+		pmd_t entry;
+		entry = pmdp_get_and_clear(mm, addr, pmd);
+		entry = pmd_modify(entry, newprot);
+		set_pmd_at(mm, addr, pmd, entry);
+		spin_unlock(&vma->vm_mm->page_table_lock);
+		ret = 1;
+	}
+
+	return ret;
+}
+
+/*
+ * Returns 1 if a given pmd maps a stable (not under splitting) thp,
+ * -1 if the pmd maps thp under splitting, 0 if the pmd does not map thp.
+ *
+ * Note that if it returns 1, this routine returns without unlocking page
+ * table locks. So callers must unlock them.
+ */
+int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+{
+	spin_lock(&vma->vm_mm->page_table_lock);
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&mm->page_table_lock);
+			spin_unlock(&vma->vm_mm->page_table_lock);
 			wait_split_huge_page(vma->anon_vma, pmd);
+			return -1;
 		} else {
-			pmd_t entry;
-
-			entry = pmdp_get_and_clear(mm, addr, pmd);
-			entry = pmd_modify(entry, newprot);
-			set_pmd_at(mm, addr, pmd, entry);
-			spin_unlock(&vma->vm_mm->page_table_lock);
-			ret = 1;
+			/* Thp mapped by 'pmd' is stable, so we can
+			 * handle it as it is. */
+			return 1;
 		}
-	} else
-		spin_unlock(&vma->vm_mm->page_table_lock);
-
-	return ret;
+	}
+	spin_unlock(&vma->vm_mm->page_table_lock);
+	return 0;
 }
 
 pmd_t *page_check_address_pmd(struct page *page,
-- 
1.7.7.6



* [PATCH 3/6] pagemap: export KPF_THP
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 4/6] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

This flag shows that a given page is a subpage of a transparent
hugepage. It helps us debug and test the kernel by showing the
physical addresses of thps.
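
A minimal sketch of how userspace could test the new flag through
/proc/kpageflags (illustrative only; takes a pfn on the command line
and needs root):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define KPF_THP	22

	int main(int argc, char **argv)
	{
		uint64_t pfn, flags;
		int fd;

		if (argc != 2)
			return 1;
		pfn = strtoull(argv[1], NULL, 0);
		fd = open("/proc/kpageflags", O_RDONLY);
		if (fd < 0)
			return 1;
		if (pread(fd, &flags, sizeof(flags),
			  pfn * sizeof(flags)) != sizeof(flags))
			return 1;
		printf("pfn %llu: %sa thp subpage\n",
		       (unsigned long long)pfn,
		       ((flags >> KPF_THP) & 1) ? "" : "not ");
		close(fd);
		return 0;
	}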

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Changes since v2:
  - replace if with else-if not to set KPF_THP for hugetlbfs page

Changes since v1:
  - remove unnecessary ifdefs
  - fix confusing patch description
---
 fs/proc/page.c                    |    2 ++
 include/linux/kernel-page-flags.h |    1 +
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git 3.3-rc1.orig/fs/proc/page.c 3.3-rc1/fs/proc/page.c
index 6d8e6a9..7fcd0d6 100644
--- 3.3-rc1.orig/fs/proc/page.c
+++ 3.3-rc1/fs/proc/page.c
@@ -115,6 +115,8 @@ u64 stable_page_flags(struct page *page)
 		u |= 1 << KPF_COMPOUND_TAIL;
 	if (PageHuge(page))
 		u |= 1 << KPF_HUGE;
+	else if (PageTransCompound(page))
+		u |= 1 << KPF_THP;
 
 	/*
 	 * Caveats on high order pages: page->_count will only be set
diff --git 3.3-rc1.orig/include/linux/kernel-page-flags.h 3.3-rc1/include/linux/kernel-page-flags.h
index bd92a89..26a6571 100644
--- 3.3-rc1.orig/include/linux/kernel-page-flags.h
+++ 3.3-rc1/include/linux/kernel-page-flags.h
@@ -30,6 +30,7 @@
 #define KPF_NOPAGE		20
 
 #define KPF_KSM			21
+#define KPF_THP			22
 
 /* kernel hacking assistances
  * WARNING: subject to change, never rely on them!
-- 
1.7.7.6



* [PATCH 4/6] pagemap: document KPF_THP and make page-types aware of it
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2012-01-27 23:02 ` [PATCH 3/6] pagemap: export KPF_THP Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 5/6] introduce thp_ptep_get() Naoya Horiguchi
  2012-01-27 23:02 ` [PATCH 6/6] pagemap: introduce data structure for pagemap entry Naoya Horiguchi
  5 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

page-types, a common user of pagemap, becomes aware of thp with this
patch. This helps system admins and kernel hackers understand how thp
works.
Here is a sample output of page-types over a thp:

  $ page-types -p <pid> --raw --list

  voffset offset  len     flags
  ...
  7f9d40200       3f8400  1       ___U_lA____Ma_bH______t____________
  7f9d40201       3f8401  1ff     ________________T_____t____________

               flags      page-count       MB  symbolic-flags                     long-symbolic-flags
  0x0000000000410000             511        1  ________________T_____t____________        compound_tail,thp
  0x000000000040d868               1        0  ___U_lA____Ma_bH______t____________        uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Changes since v1:
  - fix misused word
---
 Documentation/vm/page-types.c |    2 ++
 Documentation/vm/pagemap.txt  |    4 ++++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git 3.3-rc1.orig/Documentation/vm/page-types.c 3.3-rc1/Documentation/vm/page-types.c
index 7445caa..0b13f02 100644
--- 3.3-rc1.orig/Documentation/vm/page-types.c
+++ 3.3-rc1/Documentation/vm/page-types.c
@@ -98,6 +98,7 @@
 #define KPF_HWPOISON		19
 #define KPF_NOPAGE		20
 #define KPF_KSM			21
+#define KPF_THP			22
 
 /* [32-] kernel hacking assistances */
 #define KPF_RESERVED		32
@@ -147,6 +148,7 @@ static const char *page_flag_names[] = {
 	[KPF_HWPOISON]		= "X:hwpoison",
 	[KPF_NOPAGE]		= "n:nopage",
 	[KPF_KSM]		= "x:ksm",
+	[KPF_THP]		= "t:thp",
 
 	[KPF_RESERVED]		= "r:reserved",
 	[KPF_MLOCKED]		= "m:mlocked",
diff --git 3.3-rc1.orig/Documentation/vm/pagemap.txt 3.3-rc1/Documentation/vm/pagemap.txt
index df09b96..4600cbe 100644
--- 3.3-rc1.orig/Documentation/vm/pagemap.txt
+++ 3.3-rc1/Documentation/vm/pagemap.txt
@@ -60,6 +60,7 @@ There are three components to pagemap:
     19. HWPOISON
     20. NOPAGE
     21. KSM
+    22. THP
 
 Short descriptions to the page flags:
 
@@ -97,6 +98,9 @@ Short descriptions to the page flags:
 21. KSM
     identical memory pages dynamically shared between one or more processes
 
+22. THP
+    contiguous pages which construct transparent hugepages
+
     [IO related page flags]
  1. ERROR     IO error occurred
  3. UPTODATE  page has up-to-date data
-- 
1.7.7.6



* [PATCH 5/6] introduce thp_ptep_get()
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2012-01-27 23:02 ` [PATCH 4/6] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-30  6:26   ` KAMEZAWA Hiroyuki
  2012-01-27 23:02 ` [PATCH 6/6] pagemap: introduce data structure for pagemap entry Naoya Horiguchi
  5 siblings, 1 reply; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Casting a pmd into a pte_t to handle thp is strongly architecture
dependent. This patch introduces a new function to separate this
dependency from the architecture-independent code.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 arch/x86/include/asm/pgtable.h |    5 +++++
 fs/proc/task_mmu.c             |    4 ++--
 include/asm-generic/pgtable.h  |    4 ++++
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git 3.3-rc1.orig/arch/x86/include/asm/pgtable.h 3.3-rc1/arch/x86/include/asm/pgtable.h
index 49afb3f..4cfcc7e 100644
--- 3.3-rc1.orig/arch/x86/include/asm/pgtable.h
+++ 3.3-rc1/arch/x86/include/asm/pgtable.h
@@ -165,6 +165,11 @@ static inline int has_transparent_hugepage(void)
 {
 	return cpu_has_pse;
 }
+
+static inline pte_t thp_ptep_get(pmd_t *pmd)
+{
+	return *(pte_t *)pmd;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
index 2622b64..e2063d9 100644
--- 3.3-rc1.orig/fs/proc/task_mmu.c
+++ 3.3-rc1/fs/proc/task_mmu.c
@@ -395,7 +395,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	spinlock_t *ptl;
 
 	if (pmd_trans_huge_lock(pmd, vma) == 1) {
-		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
+		smaps_pte_entry(thp_ptep_get(pmd), addr, HPAGE_PMD_SIZE, walk);
 		spin_unlock(&walk->mm->page_table_lock);
 		mss->anonymous_thp += HPAGE_PMD_SIZE;
 		return 0;
@@ -964,7 +964,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	md = walk->private;
 
 	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
-		pte_t huge_pte = *(pte_t *)pmd;
+		pte_t huge_pte = thp_ptep_get(pmd);
 		struct page *page;
 
 		page = can_gather_numa_stats(huge_pte, md->vma, addr);
diff --git 3.3-rc1.orig/include/asm-generic/pgtable.h 3.3-rc1/include/asm-generic/pgtable.h
index 76bff2b..f346bdc 100644
--- 3.3-rc1.orig/include/asm-generic/pgtable.h
+++ 3.3-rc1/include/asm-generic/pgtable.h
@@ -434,6 +434,10 @@ static inline int pmd_trans_splitting(pmd_t pmd)
 {
 	return 0;
 }
+static inline pte_t thp_ptep_get(pmd_t *pmd)
+{
+	return 0;
+}
 #ifndef __HAVE_ARCH_PMD_WRITE
 static inline int pmd_write(pmd_t pmd)
 {
-- 
1.7.7.6



* [PATCH 6/6] pagemap: introduce data structure for pagemap entry
  2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2012-01-27 23:02 ` [PATCH 5/6] introduce thp_ptep_get() Naoya Horiguchi
@ 2012-01-27 23:02 ` Naoya Horiguchi
  2012-01-30  6:31   ` KAMEZAWA Hiroyuki
  5 siblings, 1 reply; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-27 23:02 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Currently the local variable for the pagemap entry in
pagemap_pte_range() is named pfn and typed as u64, but that's not
correct (a pfn should be unsigned long.)
This patch introduces a dedicated type for the pagemap entry and
converts the code to use it.
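
The point of wrapping the u64 in a one-member struct, rather than
using a plain typedef, is type safety: an entry can then only be
built through make_pme() and read through .pme, so a raw pfn can no
longer be passed silently where a pagemap entry is expected. A small
illustration of the effect (sketch; u64 is the kernel type):

	typedef struct {
		u64 pme;
	} pme_t;

	static inline pme_t make_pme(u64 val)
	{
		return (pme_t) { .pme = val };
	}

	/* pme_t e = pte_pfn(pte);  -- no longer compiles.        */
	/* pme_t e = make_pme(val); -- the only way to build one. */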

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/proc/task_mmu.c |   66 +++++++++++++++++++++++++++------------------------
 1 files changed, 35 insertions(+), 31 deletions(-)

diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
index e2063d9..c2807a3 100644
--- 3.3-rc1.orig/fs/proc/task_mmu.c
+++ 3.3-rc1/fs/proc/task_mmu.c
@@ -586,9 +586,13 @@ const struct file_operations proc_clear_refs_operations = {
 	.llseek		= noop_llseek,
 };
 
+typedef struct {
+	u64 pme;
+} pme_t;
+
 struct pagemapread {
 	int pos, len;
-	u64 *buffer;
+	pme_t *buffer;
 };
 
 #define PAGEMAP_WALK_SIZE	(PMD_SIZE)
@@ -611,10 +615,15 @@ struct pagemapread {
 #define PM_NOT_PRESENT      PM_PSHIFT(PAGE_SHIFT)
 #define PM_END_OF_BUFFER    1
 
-static int add_to_pagemap(unsigned long addr, u64 pfn,
+static inline pme_t make_pme(u64 val)
+{
+	return (pme_t) { .pme = val };
+}
+
+static int add_to_pagemap(unsigned long addr, pme_t *pme,
 			  struct pagemapread *pm)
 {
-	pm->buffer[pm->pos++] = pfn;
+	pm->buffer[pm->pos++] = *pme;
 	if (pm->pos >= pm->len)
 		return PM_END_OF_BUFFER;
 	return 0;
@@ -626,8 +635,10 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	unsigned long addr;
 	int err = 0;
+	pme_t pme = make_pme(PM_NOT_PRESENT);
+
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
-		err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
+		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			break;
 	}
@@ -640,36 +651,31 @@ static u64 swap_pte_to_pagemap_entry(pte_t pte)
 	return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
 }
 
-static u64 pte_to_pagemap_entry(pte_t pte)
+static void pte_to_pagemap_entry(pme_t *pme, pte_t pte)
 {
-	u64 pme = 0;
 	if (is_swap_pte(pte))
-		pme = PM_PFRAME(swap_pte_to_pagemap_entry(pte))
-			| PM_PSHIFT(PAGE_SHIFT) | PM_SWAP;
+		*pme = make_pme(PM_PFRAME(swap_pte_to_pagemap_entry(pte))
+				| PM_PSHIFT(PAGE_SHIFT) | PM_SWAP);
 	else if (pte_present(pte))
-		pme = PM_PFRAME(pte_pfn(pte))
-			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
-	return pme;
+		*pme = make_pme(PM_PFRAME(pte_pfn(pte))
+				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+static void thp_pmd_to_pagemap_entry(pme_t *pme, pmd_t pmd, int offset)
 {
-	u64 pme = 0;
 	/*
 	 * Currently pmd for thp is always present because thp can not be
 	 * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
 	 * This if-check is just to prepare for future implementation.
 	 */
 	if (pmd_present(pmd))
-		pme = PM_PFRAME(pmd_pfn(pmd) + offset)
-			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
-	return pme;
+		*pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
+				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
 }
 #else
-static inline u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+static inline void thp_pmd_to_pagemap_entry(pme_t *pme, pmd_t pmd, int offset)
 {
-	return 0;
 }
 #endif
 
@@ -680,7 +686,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-	u64 pfn = PM_NOT_PRESENT;
+	pme_t pme = make_pme(PM_NOT_PRESENT);
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
@@ -689,8 +695,8 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		for (; addr != end; addr += PAGE_SIZE) {
 			unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
 				>> PAGE_SHIFT;
-			pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
-			err = add_to_pagemap(addr, pfn, pm);
+			thp_pmd_to_pagemap_entry(&pme, *pmd, offset);
+			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
 				break;
 		}
@@ -709,11 +715,11 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		if (vma && (vma->vm_start <= addr) &&
 		    !is_vm_hugetlb_page(vma)) {
 			pte = pte_offset_map(pmd, addr);
-			pfn = pte_to_pagemap_entry(*pte);
+			pte_to_pagemap_entry(&pme, *pte);
 			/* unmap before userspace copy */
 			pte_unmap(pte);
 		}
-		err = add_to_pagemap(addr, pfn, pm);
+		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			return err;
 	}
@@ -724,13 +730,11 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static u64 huge_pte_to_pagemap_entry(pte_t pte, int offset)
+static void huge_pte_to_pagemap_entry(pme_t *pme, pte_t pte, int offset)
 {
-	u64 pme = 0;
 	if (pte_present(pte))
-		pme = PM_PFRAME(pte_pfn(pte) + offset)
-			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
-	return pme;
+		*pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset)
+				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
 }
 
 /* This function walks within one hugetlb entry in the single call */
@@ -740,12 +744,12 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
 {
 	struct pagemapread *pm = walk->private;
 	int err = 0;
-	u64 pfn;
+	pme_t pme = make_pme(PM_NOT_PRESENT);
 
 	for (; addr != end; addr += PAGE_SIZE) {
 		int offset = (addr & ~hmask) >> PAGE_SHIFT;
-		pfn = huge_pte_to_pagemap_entry(*pte, offset);
-		err = add_to_pagemap(addr, pfn, pm);
+		huge_pte_to_pagemap_entry(&pme, *pte, offset);
+		err = add_to_pagemap(addr, &pme, pm);
 		if (err)
 			return err;
 	}
-- 
1.7.7.6



* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
@ 2012-01-28 11:23   ` Hillf Danton
  2012-01-28 22:33     ` Naoya Horiguchi
  2012-01-30  6:22   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 24+ messages in thread
From: Hillf Danton @ 2012-01-28 11:23 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, LKML,
	Hillf Danton

Hi Naoya

On Sat, Jan 28, 2012 at 7:02 AM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> Currently, when we check whether we can handle a thp as it is or need
> to split it into regular-sized pages, we take the page table lock
> before checking whether a given pmd is mapping a thp. Because of this,
> when it's not a "huge pmd" we suffer unnecessary lock/unlock overhead.
> To remove it, this patch introduces an optimized check function and
> replaces several similar code paths with it.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: David Rientjes <rientjes@google.com>
>
> Changes since v3:
>  - Fix likely/unlikely pattern in pmd_trans_huge_stable()
>  - Change suffix from _stable to _lock
>  - Introduce __pmd_trans_huge_lock() to avoid micro-regression
>  - Return 1 when wait_split_huge_page path is taken
>
> Changes since v2:
>  - Fix missing "return 0" in "thp under splitting" path
>  - Remove unneeded comment
>  - Change the name of check function to describe what it does
>  - Add VM_BUG_ON(mmap_sem)
> ---
>  fs/proc/task_mmu.c      |   70 +++++++++------------------
>  include/linux/huge_mm.h |   17 +++++++
>  mm/huge_memory.c        |  120 ++++++++++++++++++++++-------------------------
>  3 files changed, 96 insertions(+), 111 deletions(-)
>
[...]

> @@ -1064,21 +1056,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  {
>        int ret = 0;
>
> -       spin_lock(&vma->vm_mm->page_table_lock);
> -       if (likely(pmd_trans_huge(*pmd))) {
> -               ret = !pmd_trans_splitting(*pmd);

Here the value of ret is either false or true,

> -               spin_unlock(&vma->vm_mm->page_table_lock);
> -               if (unlikely(!ret))
> -                       wait_split_huge_page(vma->anon_vma, pmd);
> -               else {
> -                       /*
> -                        * All logical pages in the range are present
> -                        * if backed by a huge page.
> -                        */
> -                       memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> -               }
> -       } else
> +       if (__pmd_trans_huge_lock(pmd, vma) == 1) {
> +               /*
> +                * All logical pages in the range are present
> +                * if backed by a huge page.
> +                */
>                spin_unlock(&vma->vm_mm->page_table_lock);
> +               memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> +       }
>
>        return ret;

what is the return value of this function? /Hillf

>  }
> @@ -1108,20 +1093,10 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>                goto out;
>        }


* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-28 11:23   ` Hillf Danton
@ 2012-01-28 22:33     ` Naoya Horiguchi
  0 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-28 22:33 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	LKML

Hi Hillf,

On Sat, Jan 28, 2012 at 07:23:47PM +0800, Hillf Danton wrote:
> Hi Naoya
> 
> On Sat, Jan 28, 2012 at 7:02 AM, Naoya Horiguchi
> <n-horiguchi@ah.jp.nec.com> wrote:
> > Currently, when we check whether we can handle a thp as it is or need
> > to split it into regular-sized pages, we take the page table lock
> > before checking whether a given pmd is mapping a thp. Because of this,
> > when it's not a "huge pmd" we suffer unnecessary lock/unlock overhead.
> > To remove it, this patch introduces an optimized check function and
> > replaces several similar code paths with it.
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: David Rientjes <rientjes@google.com>
> >
> > Changes since v3:
> > - Fix likely/unlikely pattern in pmd_trans_huge_stable()
> > - Change suffix from _stable to _lock
> > - Introduce __pmd_trans_huge_lock() to avoid micro-regression
> > - Return 1 when wait_split_huge_page path is taken
> >
> > Changes since v2:
> > - Fix missing "return 0" in "thp under splitting" path
> > - Remove unneeded comment
> > - Change the name of check function to describe what it does
> > - Add VM_BUG_ON(mmap_sem)
> > ---
> > fs/proc/task_mmu.c   |  70 +++++++++------------------
> > include/linux/huge_mm.h |  17 +++++++
> > mm/huge_memory.c    | 120 ++++++++++++++++++++++-------------------------
> > 3 files changed, 96 insertions(+), 111 deletions(-)
> >
> [...]
> 
> > @@ -1064,21 +1056,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > {
> >    int ret = 0;
> >
> > -    spin_lock(&vma->vm_mm->page_table_lock);
> > -    if (likely(pmd_trans_huge(*pmd))) {
> > -        ret = !pmd_trans_splitting(*pmd);
> 
> Here the value of ret is either false or true,

You're right.

> > -        spin_unlock(&vma->vm_mm->page_table_lock);
> > -        if (unlikely(!ret))
> > -            wait_split_huge_page(vma->anon_vma, pmd);
> > -        else {
> > -            /*
> > -            * All logical pages in the range are present
> > -            * if backed by a huge page.
> > -            */
> > -            memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> > -        }
> > -    } else
> > +    if (__pmd_trans_huge_lock(pmd, vma) == 1) {
> > +        /*
> > +        * All logical pages in the range are present
> > +        * if backed by a huge page.
> > +        */
> >        spin_unlock(&vma->vm_mm->page_table_lock);
> > +        memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> > +    }
> >
> >    return ret;
> 
> what is the return value of this function? /Hillf

In this patch, mincore_huge_pmd() always returns 0, which is obviously
wrong. We need to set ret to 1 in the if-block.
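
For illustration, the fixed block would read (sketch):

	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
		/*
		 * All logical pages in the range are present
		 * if backed by a huge page.
		 */
		spin_unlock(&vma->vm_mm->page_table_lock);
		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
		ret = 1;
	}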

Thanks,
Naoya


* Re: [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
@ 2012-01-29 13:17   ` Hillf Danton
  2012-01-30 19:23     ` Naoya Horiguchi
  0 siblings, 1 reply; 24+ messages in thread
From: Hillf Danton @ 2012-01-29 13:17 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, LKML,
	Hillf Danton

Hi Naoya

On Sat, Jan 28, 2012 at 7:02 AM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> A thp split is not necessary if we explicitly check whether pmds are
> mapping thps or not. This patch introduces this check and adds code
> to generate pagemap entries for pmds mapping thps, which reduces the
> performance impact of pagemap on thp.
>

Could the method proposed here cover the two cases of split THP in mem cgroup?

Thanks
Hillf


* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
  2012-01-28 11:23   ` Hillf Danton
@ 2012-01-30  6:22   ` KAMEZAWA Hiroyuki
  2012-02-02  5:27     ` Naoya Horiguchi
  1 sibling, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-30  6:22 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Fri, 27 Jan 2012 18:02:49 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Currently, when we check whether we can handle a thp as it is or need
> to split it into regular-sized pages, we take the page table lock
> before checking whether a given pmd is mapping a thp. Because of this,
> when it's not a "huge pmd" we suffer unnecessary lock/unlock overhead.
> To remove it, this patch introduces an optimized check function and
> replaces several similar code paths with it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: David Rientjes <rientjes@google.com>
> 
> Changes since v3:
>   - Fix likely/unlikely pattern in pmd_trans_huge_stable()
>   - Change suffix from _stable to _lock
>   - Introduce __pmd_trans_huge_lock() to avoid micro-regression
>   - Return 1 when wait_split_huge_page path is taken
> 
> Changes since v2:
>   - Fix missing "return 0" in "thp under splitting" path
>   - Remove unneeded comment
>   - Change the name of check function to describe what it does
>   - Add VM_BUG_ON(mmap_sem)



> +/*
> + * Returns 1 if a given pmd maps a stable (not under splitting) thp,
> + * -1 if the pmd maps thp under splitting, 0 if the pmd does not map thp.
> + *
> + * Note that if it returns 1, this routine returns without unlocking page
> + * table locks. So callers must unlock them.
> + */


Seems like a nice cleanup, but... why do you need to return (-1, 0, 1)?

It seems the caller can't see the difference between -1 and 0.

Why not just return 0 (not locked) or 1 (thp found and locked)?

Thanks,
-Kame

> +int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> +{
> +	spin_lock(&vma->vm_mm->page_table_lock);
>  	if (likely(pmd_trans_huge(*pmd))) {
>  		if (unlikely(pmd_trans_splitting(*pmd))) {
> -			spin_unlock(&mm->page_table_lock);
> +			spin_unlock(&vma->vm_mm->page_table_lock);
>  			wait_split_huge_page(vma->anon_vma, pmd);
> +			return -1;
>  		} else {
> -			pmd_t entry;
> -
> -			entry = pmdp_get_and_clear(mm, addr, pmd);
> -			entry = pmd_modify(entry, newprot);
> -			set_pmd_at(mm, addr, pmd, entry);
> -			spin_unlock(&vma->vm_mm->page_table_lock);
> -			ret = 1;
> +			/* Thp mapped by 'pmd' is stable, so we can
> +			 * handle it as it is. */
> +			return 1;
>  		}
> -	} else
> -		spin_unlock(&vma->vm_mm->page_table_lock);
> -
> -	return ret;
> +	}
> +	spin_unlock(&vma->vm_mm->page_table_lock);
> +	return 0;
>  }
>  
>  pmd_t *page_check_address_pmd(struct page *page,
> -- 
> 1.7.7.6
> 



* Re: [PATCH 5/6] introduce thp_ptep_get()
  2012-01-27 23:02 ` [PATCH 5/6] introduce thp_ptep_get() Naoya Horiguchi
@ 2012-01-30  6:26   ` KAMEZAWA Hiroyuki
  2012-01-30 19:24     ` Naoya Horiguchi
  0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-30  6:26 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Fri, 27 Jan 2012 18:02:52 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Casting a pmd into a pte_t to handle thp is strongly architecture
> dependent. This patch introduces a new function to separate this
> dependency from the architecture-independent code.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  arch/x86/include/asm/pgtable.h |    5 +++++
>  fs/proc/task_mmu.c             |    4 ++--
>  include/asm-generic/pgtable.h  |    4 ++++
>  3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git 3.3-rc1.orig/arch/x86/include/asm/pgtable.h 3.3-rc1/arch/x86/include/asm/pgtable.h
> index 49afb3f..4cfcc7e 100644
> --- 3.3-rc1.orig/arch/x86/include/asm/pgtable.h
> +++ 3.3-rc1/arch/x86/include/asm/pgtable.h
> @@ -165,6 +165,11 @@ static inline int has_transparent_hugepage(void)
>  {
>  	return cpu_has_pse;
>  }
> +
> +static inline pte_t thp_ptep_get(pmd_t *pmd)
> +{
> +	return *(pte_t *)pmd;
> +}

I'm sorry but I don't think the name is good. 
(But I know my sense of naming is bad ;)

How about pmd_to_pte_t ?

And, I wonder if you could add an error check like VM_BUG_ON(!pmd_trans_huge(*pmd));


Thanks,
-Kame




>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
> diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
> index 2622b64..e2063d9 100644
> --- 3.3-rc1.orig/fs/proc/task_mmu.c
> +++ 3.3-rc1/fs/proc/task_mmu.c
> @@ -395,7 +395,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	spinlock_t *ptl;
>  
>  	if (pmd_trans_huge_lock(pmd, vma) == 1) {
> -		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
> +		smaps_pte_entry(thp_ptep_get(pmd), addr, HPAGE_PMD_SIZE, walk);
>  		spin_unlock(&walk->mm->page_table_lock);
>  		mss->anonymous_thp += HPAGE_PMD_SIZE;
>  		return 0;
> @@ -964,7 +964,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
>  	md = walk->private;
>  
>  	if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
> -		pte_t huge_pte = *(pte_t *)pmd;
> +		pte_t huge_pte = thp_ptep_get(pmd);
>  		struct page *page;
>  
>  		page = can_gather_numa_stats(huge_pte, md->vma, addr);
> diff --git 3.3-rc1.orig/include/asm-generic/pgtable.h 3.3-rc1/include/asm-generic/pgtable.h
> index 76bff2b..f346bdc 100644
> --- 3.3-rc1.orig/include/asm-generic/pgtable.h
> +++ 3.3-rc1/include/asm-generic/pgtable.h
> @@ -434,6 +434,10 @@ static inline int pmd_trans_splitting(pmd_t pmd)
>  {
>  	return 0;
>  }
> +static inline pte_t thp_ptep_get(pmd_t *pmd)
> +{
> +	return 0;
> +}
>  #ifndef __HAVE_ARCH_PMD_WRITE
>  static inline int pmd_write(pmd_t pmd)
>  {
> -- 
> 1.7.7.6
> 
> 



* Re: [PATCH 6/6] pagemap: introduce data structure for pagemap entry
  2012-01-27 23:02 ` [PATCH 6/6] pagemap: introduce data structure for pagemap entry Naoya Horiguchi
@ 2012-01-30  6:31   ` KAMEZAWA Hiroyuki
  2012-01-30 19:27     ` Naoya Horiguchi
  0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-30  6:31 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Fri, 27 Jan 2012 18:02:53 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Currently the local variable for the pagemap entry in
> pagemap_pte_range() is named pfn and typed as u64, but that's not
> correct (a pfn should be unsigned long.)
> This patch introduces a dedicated type for the pagemap entry and
> converts the code to use it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/proc/task_mmu.c |   66 +++++++++++++++++++++++++++------------------------
>  1 files changed, 35 insertions(+), 31 deletions(-)
> 
> diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
> index e2063d9..c2807a3 100644
> --- 3.3-rc1.orig/fs/proc/task_mmu.c
> +++ 3.3-rc1/fs/proc/task_mmu.c
> @@ -586,9 +586,13 @@ const struct file_operations proc_clear_refs_operations = {
>  	.llseek		= noop_llseek,
>  };
>  
> +typedef struct {
> +	u64 pme;
> +} pme_t;
> +

A nitpick..

How about pagemap_entry_t rather than pme_t ?

At first look, I wondered whether this was a new kind of page table entry type or not.

Thanks,
-Kame



* Re: [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-29 13:17   ` Hillf Danton
@ 2012-01-30 19:23     ` Naoya Horiguchi
  0 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-30 19:23 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	LKML

On Sun, Jan 29, 2012 at 09:17:32PM +0800, Hillf Danton wrote:
> Hi Naoya
> 
> On Sat, Jan 28, 2012 at 7:02 AM, Naoya Horiguchi
> <n-horiguchi@ah.jp.nec.com> wrote:
> > A thp split is not necessary if we explicitly check whether pmds are
> > mapping thps or not. This patch introduces this check and adds code
> > to generate pagemap entries for pmds mapping thps, which reduces the
> > performance impact of pagemap on thp.
> >
> 
> Could the method proposed here cover the two cases of split THP in mem cgroup?

Not for now, but yes once the "move charge" function supports THP.
I think that would be a rather large step, so it is the next work.

Thanks,
Naoya


* Re: [PATCH 5/6] introduce thp_ptep_get()
  2012-01-30  6:26   ` KAMEZAWA Hiroyuki
@ 2012-01-30 19:24     ` Naoya Horiguchi
  0 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-30 19:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	linux-kernel

On Mon, Jan 30, 2012 at 03:26:46PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jan 2012 18:02:52 -0500
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > Casting a pmd into a pte_t to handle thp is strongly architecture
> > dependent. This patch introduces a new function to separate this
> > dependency from the architecture-independent code.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  arch/x86/include/asm/pgtable.h |    5 +++++
> >  fs/proc/task_mmu.c             |    4 ++--
> >  include/asm-generic/pgtable.h  |    4 ++++
> >  3 files changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git 3.3-rc1.orig/arch/x86/include/asm/pgtable.h 3.3-rc1/arch/x86/include/asm/pgtable.h
> > index 49afb3f..4cfcc7e 100644
> > --- 3.3-rc1.orig/arch/x86/include/asm/pgtable.h
> > +++ 3.3-rc1/arch/x86/include/asm/pgtable.h
> > @@ -165,6 +165,11 @@ static inline int has_transparent_hugepage(void)
> >  {
> >  	return cpu_has_pse;
> >  }
> > +
> > +static inline pte_t thp_ptep_get(pmd_t *pmd)
> > +{
> > +	return *(pte_t *)pmd;
> > +}
> 
> I'm sorry but I don't think the name is good. 
> (But I know my sense of naming is bad ;)

I named it after huge_ptep_get(), which is defined for hugetlbfs.
But on rethinking it, it's a bad name.

> 
> How about pmd_to_pte_t ?

OK, it looks straightforward.
I'll take it unless someone has a better idea.

> And, I wonder if you could add an error check like VM_BUG_ON(!pmd_trans_huge(*pmd));

Good, I'll add it.

Thank you, KAMEZAWA-san.
Naoya


* Re: [PATCH 6/6] pagemap: introduce data structure for pagemap entry
  2012-01-30  6:31   ` KAMEZAWA Hiroyuki
@ 2012-01-30 19:27     ` Naoya Horiguchi
  0 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-30 19:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	linux-kernel

On Mon, Jan 30, 2012 at 03:31:11PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jan 2012 18:02:53 -0500
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > Currently the local variable for the pagemap entry in
> > pagemap_pte_range() is named pfn and typed as u64, but that's not
> > correct (a pfn should be unsigned long.)
> > This patch introduces a dedicated type for the pagemap entry and
> > converts the code to use it.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >  fs/proc/task_mmu.c |   66 +++++++++++++++++++++++++++------------------------
> >  1 files changed, 35 insertions(+), 31 deletions(-)
> > 
> > diff --git 3.3-rc1.orig/fs/proc/task_mmu.c 3.3-rc1/fs/proc/task_mmu.c
> > index e2063d9..c2807a3 100644
> > --- 3.3-rc1.orig/fs/proc/task_mmu.c
> > +++ 3.3-rc1/fs/proc/task_mmu.c
> > @@ -586,9 +586,13 @@ const struct file_operations proc_clear_refs_operations = {
> >  	.llseek		= noop_llseek,
> >  };
> >  
> > +typedef struct {
> > +	u64 pme;
> > +} pme_t;
> > +
> 
> A nitpick..
> 
> How about pagemap_entry_t rather than pme_t ?

OK, I'll use this type name.

> At first look, I wondered whether this was a new kind of page table entry type or not.

We had better avoid this type of confusion if possible.

Thanks,
Naoya


* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-01-30  6:22   ` KAMEZAWA Hiroyuki
@ 2012-02-02  5:27     ` Naoya Horiguchi
  2012-02-02  8:32       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 24+ messages in thread
From: Naoya Horiguchi @ 2012-02-02  5:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	linux-kernel

On Mon, Jan 30, 2012 at 03:22:12PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 Jan 2012 18:02:49 -0500
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> > Currently, when we check whether we can handle a thp as it is or need
> > to split it into regular-sized pages, we take the page table lock
> > before checking whether a given pmd is mapping a thp. Because of this,
> > when it's not a "huge pmd" we suffer unnecessary lock/unlock overhead.
> > To remove it, this patch introduces an optimized check function and
> > replaces several similar code paths with it.
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: David Rientjes <rientjes@google.com>
> >
> > Changes since v3:
> >   - Fix likely/unlikely pattern in pmd_trans_huge_stable()
> >   - Change suffix from _stable to _lock
> >   - Introduce __pmd_trans_huge_lock() to avoid micro-regression
> >   - Return 1 when wait_split_huge_page path is taken
> >
> > Changes since v2:
> >   - Fix missing "return 0" in "thp under splitting" path
> >   - Remove unneeded comment
> >   - Change the name of check function to describe what it does
> >   - Add VM_BUG_ON(mmap_sem)
>
>
> > +/*
> > + * Returns 1 if a given pmd maps a stable (not under splitting) thp,
> > + * -1 if the pmd maps thp under splitting, 0 if the pmd does not map thp.
> > + *
> > + * Note that if it returns 1, this routine returns without unlocking page
> > + * table locks. So callers must unlock them.
> > + */
>
>
> Seems like a nice cleanup, but... why do you need to return (-1, 0, 1)?
>
> It seems the caller can't see the difference between -1 and 0.
>
> Why not just return 0 (not locked) or 1 (thp found and locked)?

Sorry, I changed this wrongly from v3.
We can do fine without the -1 return value if we remove the else-if
(!err) {...} block after the move_huge_pmd() call in
move_page_tables(), right?
(split_huge_page_pmd() after wait_split_huge_page() does nothing...)


* Re: [PATCH 2/6] thp: optimize away unnecessary page table locking
  2012-02-02  5:27     ` Naoya Horiguchi
@ 2012-02-02  8:32       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-02  8:32 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Thu,  2 Feb 2012 00:27:58 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> On Mon, Jan 30, 2012 at 03:22:12PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Fri, 27 Jan 2012 18:02:49 -0500
> > Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> >
> > > Currently, when we check whether we can handle a thp as it is or need
> > > to split it into regular-sized pages, we take the page table lock
> > > before checking whether a given pmd is mapping a thp. Because of this,
> > > when it's not a "huge pmd" we suffer unnecessary lock/unlock overhead.
> > > To remove it, this patch introduces an optimized check function and
> > > replaces several similar code paths with it.
> > >
> > > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > Cc: David Rientjes <rientjes@google.com>
> > >
> > > Changes since v3:
> > >   - Fix likely/unlikely pattern in pmd_trans_huge_stable()
> > >   - Change suffix from _stable to _lock
> > >   - Introduce __pmd_trans_huge_lock() to avoid micro-regression
> > >   - Return 1 when wait_split_huge_page path is taken
> > >
> > > Changes since v2:
> > >   - Fix missing "return 0" in "thp under splitting" path
> > >   - Remove unneeded comment
> > >   - Change the name of check function to describe what it does
> > >   - Add VM_BUG_ON(mmap_sem)
> >
> >
> > > +/*
> > > + * Returns 1 if a given pmd maps a stable (not under splitting) thp,
> > > + * -1 if the pmd maps thp under splitting, 0 if the pmd does not map thp.
> > > + *
> > > + * Note that if it returns 1, this routine returns without unlocking page
> > > + * table locks. So callers must unlock them.
> > > + */
> >
> >
> > Seems like a nice cleanup, but... why do you need to return (-1, 0, 1)?
> >
> > It seems the caller can't see the difference between -1 and 0.
> >
> > Why not just return 0 (not locked) or 1 (thp found and locked)?
> 
> Sorry, I changed this wrongly from v3.
> We can do fine without the -1 return value if we remove the else-if
> (!err) {...} block after the move_huge_pmd() call in move_page_tables(),
> right? (split_huge_page_pmd() after wait_split_huge_page() does nothing...)
> 

Hm ?

               if (pmd_trans_huge(*old_pmd)) {
                        int err = 0;
                        if (extent == HPAGE_PMD_SIZE)
                                err = move_huge_pmd(vma, new_vma, old_addr,
                                                    new_addr, old_end,
                                                    old_pmd, new_pmd);
                        if (err > 0) {
                                need_flush = true;
                                continue;
                        } else if (!err) {
                                split_huge_page_pmd(vma->vm_mm, old_pmd);
                        }
                        VM_BUG_ON(pmd_trans_huge(*old_pmd));
                }

I think you're right. The BUG_ON() in wait_split_huge_page()

 #define wait_split_huge_page(__anon_vma, __pmd)                         \
        do {                                                            \
                pmd_t *____pmd = (__pmd);                               \
                anon_vma_lock(__anon_vma);                              \
                anon_vma_unlock(__anon_vma);                            \
                BUG_ON(pmd_trans_splitting(*____pmd) ||                 \
                       pmd_trans_huge(*____pmd));                       \
        } while (0)

says the pmd is always split by then.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-02-08 15:51 [PATCH 0/6 v5] pagemap handles transparent hugepage Naoya Horiguchi
@ 2012-02-08 15:51 ` Naoya Horiguchi
  0 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-02-08 15:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Thp split is not necessary if we explicitly check whether pmds are
mapping thps or not. This patch introduces this check and adds code
to generate pagemap entries for pmds mapping thps, which reduces the
performance impact of pagemap on thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

ToDo:
  - Avoid thp splits in the two other split_huge_page_pmd() calls in mm/memcontrol.c

Changes since v3:
  - Generate pagemap entry directly from pmd to avoid messy casting

Changes since v2:
  - Add comment on if check in thp_pte_to_pagemap_entry()
  - Convert type of offset into unsigned long

Changes since v1:
  - Move pfn declaration to the beginning of pagemap_pte_range()
---
 fs/proc/task_mmu.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 47 insertions(+), 6 deletions(-)

diff --git 3.3-rc2.orig/fs/proc/task_mmu.c 3.3-rc2/fs/proc/task_mmu.c
index 7dcd2a2..eb0a93e 100644
--- 3.3-rc2.orig/fs/proc/task_mmu.c
+++ 3.3-rc2/fs/proc/task_mmu.c
@@ -603,6 +603,9 @@ struct pagemapread {
 	u64 *buffer;
 };
 
+#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
+#define PAGEMAP_WALK_MASK	(PMD_MASK)
+
 #define PM_ENTRY_BYTES      sizeof(u64)
 #define PM_STATUS_BITS      3
 #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
@@ -661,6 +664,27 @@ static u64 pte_to_pagemap_entry(pte_t pte)
 	return pme;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+{
+	u64 pme = 0;
+	/*
+	 * Currently pmd for thp is always present because thp can not be
+	 * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
+	 * This if-check is just to prepare for future implementation.
+	 */
+	if (pmd_present(pmd))
+		pme = PM_PFRAME(pmd_pfn(pmd) + offset)
+			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
+	return pme;
+}
+#else
+static inline u64 thp_pmd_to_pagemap_entry(pmd_t pmd, int offset)
+{
+	return 0;
+}
+#endif
+
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     struct mm_walk *walk)
 {
@@ -668,14 +692,33 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-
-	split_huge_page_pmd(walk->mm, pmd);
+	u64 pfn = PM_NOT_PRESENT;
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
-	for (; addr != end; addr += PAGE_SIZE) {
-		u64 pfn = PM_NOT_PRESENT;
 
+	spin_lock(&walk->mm->page_table_lock);
+	if (pmd_trans_huge(*pmd)) {
+		if (pmd_trans_splitting(*pmd)) {
+			spin_unlock(&walk->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+		} else {
+			for (; addr != end; addr += PAGE_SIZE) {
+				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
+					>> PAGE_SHIFT;
+				pfn = thp_pmd_to_pagemap_entry(*pmd, offset);
+				err = add_to_pagemap(addr, pfn, pm);
+				if (err)
+					break;
+			}
+			spin_unlock(&walk->mm->page_table_lock);
+			return err;
+		}
+	} else {
+		spin_unlock(&walk->mm->page_table_lock);
+	}
+
+	for (; addr != end; addr += PAGE_SIZE) {
 		/* check to see if we've left 'vma' behind
 		 * and need a new, higher one */
 		if (vma && (addr >= vma->vm_end))
@@ -757,8 +800,6 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
-#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
-#define PAGEMAP_WALK_MASK	(PMD_MASK)
 static ssize_t pagemap_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-15 12:06     ` Andrea Arcangeli
@ 2012-01-16 17:18       ` Naoya Horiguchi
  0 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-16 17:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel

On Sun, Jan 15, 2012 at 01:06:05PM +0100, Andrea Arcangeli wrote:
> On Sat, Jan 14, 2012 at 06:00:26PM +0100, Andrea Arcangeli wrote:
> > Why don't you pass the pmd and then do "if (pmd_present(pmd))
> > page_to_pfn(pmd_page(pmd))"? What's the argument for the cast? I'm
> 
> Of course I meant pmd_pfn above... in short, as a replacement for the
> pte_pfn in your patch.
> 
> About the _stable function, I am now thinking a _lock suffix is more
> appropriate than _stable, because that function's effective objective
> is to take the page_table_lock in the most efficient way, and not much
> else. I'd also add a comment, in the inline version in the .h file
> (the one that then calls the __ version in the .c file), that it's
> only safe to call with the mmap_sem held.

OK, I will use _lock suffix and add the comment in the next post.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
@ 2012-01-16 17:18 Naoya Horiguchi
  0 siblings, 0 replies; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-16 17:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel

On Sat, Jan 14, 2012 at 06:00:26PM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 12, 2012 at 02:34:53PM -0500, Naoya Horiguchi wrote:
> > +		if (pmd_trans_splitting(*pmd)) {
> > +			spin_unlock(&walk->mm->page_table_lock);
> > +			wait_split_huge_page(vma->anon_vma, pmd);
> > +		} else {
> > +			for (; addr != end; addr += PAGE_SIZE) {
> > +				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
> > +					>> PAGE_SHIFT;
> > +				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
> > +							       offset);
> 
> What is this that then morphs into a pme (which still has a cast
> inside its creation)? thp_pte_to_pagemap_entry doesn't seem to be
> passed ptes either. The only case where it is valid to do a cast like
> that is when a function is used by both ptes and pmds and the code
> tends to work for both with minimal modification to differentiate the
> two cases. Considering the function that gets the cast is called thp_,
> this is hardly the case here.

Agreed.

> Why don't you pass the pmd and then do "if (pmd_present(pmd))
> page_to_pfn(pmd_page(pmd))"? What's the argument for the cast? I'm
> just reviewing this series and maybe it was covered in previous
> versions.

OK, I can do this by introducing pmd_pte as you commented in another email.

> I don't get this pme thing for something as trivial as above that
> shouldn't require any cast at all.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-14 17:00   ` Andrea Arcangeli
@ 2012-01-15 12:06     ` Andrea Arcangeli
  2012-01-16 17:18       ` Naoya Horiguchi
  0 siblings, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2012-01-15 12:06 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-kernel

On Sat, Jan 14, 2012 at 06:00:26PM +0100, Andrea Arcangeli wrote:
> Why don't you pass the pmd and then do "if (pmd_present(pmd))
> page_to_pfn(pmd_page(pmd))"? What's the argument for the cast? I'm

Of course I meant pmd_pfn above... in short, as a replacement for the
pte_pfn in your patch.

About the _stable function, I am now thinking a _lock suffix is more
appropriate than _stable, because that function's effective objective
is to take the page_table_lock in the most efficient way, and not much
else. I'd also add a comment, in the inline version in the .h file
(the one that then calls the __ version in the .c file), that it's
only safe to call with the mmap_sem held.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-12 19:34 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
@ 2012-01-14 17:00   ` Andrea Arcangeli
  2012-01-15 12:06     ` Andrea Arcangeli
  0 siblings, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2012-01-14 17:00 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-kernel

On Thu, Jan 12, 2012 at 02:34:53PM -0500, Naoya Horiguchi wrote:
> +		if (pmd_trans_splitting(*pmd)) {
> +			spin_unlock(&walk->mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma, pmd);
> +		} else {
> +			for (; addr != end; addr += PAGE_SIZE) {
> +				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
> +					>> PAGE_SHIFT;
> +				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
> +							       offset);

What is this that then morphs into a pme (which still has a cast
inside its creation)? thp_pte_to_pagemap_entry doesn't seem to be
passed ptes either. The only case where it is valid to do a cast like
that is when a function is used by both ptes and pmds and the code
tends to work for both with minimal modification to differentiate the
two cases. Considering the function that gets the cast is called thp_,
this is hardly the case here.

Why don't you pass the pmd and then do "if (pmd_present(pmd))
page_to_pfn(pmd_page(pmd))"? What's the argument for the cast? I'm
just reviewing this series and maybe it was covered in previous
versions.

I don't get this pme thing for something as trivial as above that
shouldn't require any cast at all.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-12 19:34 [PATCH 0/6 v3] pagemap handles transparent hugepage Naoya Horiguchi
@ 2012-01-12 19:34 ` Naoya Horiguchi
  2012-01-14 17:00   ` Andrea Arcangeli
  0 siblings, 1 reply; 24+ messages in thread
From: Naoya Horiguchi @ 2012-01-12 19:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Thp split is not necessary if we explicitly check whether pmds are
mapping thps or not. This patch introduces the check and the code
to generate pagemap entries for pmds mapping thps, which reduces the
performance impact of pagemap on thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changes since v2:
  - Add comment on if check in thp_pte_to_pagemap_entry()
  - Convert type of offset into unsigned long

Changes since v1:
  - Move pfn declaration to the beginning of pagemap_pte_range()
---
 fs/proc/task_mmu.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 48 insertions(+), 6 deletions(-)

diff --git 3.2-rc5.orig/fs/proc/task_mmu.c 3.2-rc5/fs/proc/task_mmu.c
index e418c5a..bd19177 100644
--- 3.2-rc5.orig/fs/proc/task_mmu.c
+++ 3.2-rc5/fs/proc/task_mmu.c
@@ -600,6 +600,9 @@ struct pagemapread {
 	u64 *buffer;
 };
 
+#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
+#define PAGEMAP_WALK_MASK	(PMD_MASK)
+
 #define PM_ENTRY_BYTES      sizeof(u64)
 #define PM_STATUS_BITS      3
 #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
@@ -658,6 +661,27 @@ static u64 pte_to_pagemap_entry(pte_t pte)
 	return pme;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
+{
+	u64 pme = 0;
+	/*
+	 * Currently pte for thp is always present because thp can not be
+	 * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
+	 * This if-check is just to prepare for future implementation.
+	 */
+	if (pte_present(pte))
+		pme = PM_PFRAME(pte_pfn(pte) + offset)
+			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
+	return pme;
+}
+#else
+static inline u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
+{
+	return 0;
+}
+#endif
+
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     struct mm_walk *walk)
 {
@@ -665,14 +689,34 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-
-	split_huge_page_pmd(walk->mm, pmd);
+	u64 pfn = PM_NOT_PRESENT;
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
-	for (; addr != end; addr += PAGE_SIZE) {
-		u64 pfn = PM_NOT_PRESENT;
 
+	spin_lock(&walk->mm->page_table_lock);
+	if (pmd_trans_huge(*pmd)) {
+		if (pmd_trans_splitting(*pmd)) {
+			spin_unlock(&walk->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+		} else {
+			for (; addr != end; addr += PAGE_SIZE) {
+				unsigned long offset = (addr & ~PAGEMAP_WALK_MASK)
+					>> PAGE_SHIFT;
+				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
+							       offset);
+				err = add_to_pagemap(addr, pfn, pm);
+				if (err)
+					break;
+			}
+			spin_unlock(&walk->mm->page_table_lock);
+			return err;
+		}
+	} else {
+		spin_unlock(&walk->mm->page_table_lock);
+	}
+
+	for (; addr != end; addr += PAGE_SIZE) {
 		/* check to see if we've left 'vma' behind
 		 * and need a new, higher one */
 		if (vma && (addr >= vma->vm_end))
@@ -754,8 +798,6 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
-#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
-#define PAGEMAP_WALK_MASK	(PMD_MASK)
 static ssize_t pagemap_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
-- 
1.7.6.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2012-02-08 15:53 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-27 23:02 [PATCH 0/6 v4] pagemap handles transparent hugepage Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
2012-01-29 13:17   ` Hillf Danton
2012-01-30 19:23     ` Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 2/6] thp: optimize away unnecessary page table locking Naoya Horiguchi
2012-01-28 11:23   ` Hillf Danton
2012-01-28 22:33     ` Naoya Horiguchi
2012-01-30  6:22   ` KAMEZAWA Hiroyuki
2012-02-02  5:27     ` Naoya Horiguchi
2012-02-02  8:32       ` KAMEZAWA Hiroyuki
2012-01-27 23:02 ` [PATCH 3/6] pagemap: export KPF_THP Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 4/6] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 5/6] introduce thp_ptep_get() Naoya Horiguchi
2012-01-30  6:26   ` KAMEZAWA Hiroyuki
2012-01-30 19:24     ` Naoya Horiguchi
2012-01-27 23:02 ` [PATCH 6/6] pagemap: introduce data structure for pagemap entry Naoya Horiguchi
2012-01-30  6:31   ` KAMEZAWA Hiroyuki
2012-01-30 19:27     ` Naoya Horiguchi
  -- strict thread matches above, loose matches on Subject: below --
2012-02-08 15:51 [PATCH 0/6 v5] pagemap handles transparent hugepage Naoya Horiguchi
2012-02-08 15:51 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
2012-01-16 17:18 Naoya Horiguchi
2012-01-12 19:34 [PATCH 0/6 v3] pagemap handles transparent hugepage Naoya Horiguchi
2012-01-12 19:34 ` [PATCH 1/6] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
2012-01-14 17:00   ` Andrea Arcangeli
2012-01-15 12:06     ` Andrea Arcangeli
2012-01-16 17:18       ` Naoya Horiguchi

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).