linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4 v2] pagemap handles transparent hugepage
@ 2011-12-21 22:23 Naoya Horiguchi
  2011-12-21 22:23 ` [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2011-12-21 22:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Hi everyone,

Thanks for all your reviews and comments on the previous post.
I made some changes based on the feedback and added another patch
(patch 2) to optimize the huge pmd check, as recommended by David.

Thanks,
Naoya


* [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2011-12-21 22:23 [PATCH 0/4 v2] pagemap handles transparent hugepage Naoya Horiguchi
@ 2011-12-21 22:23 ` Naoya Horiguchi
  2011-12-26  8:26   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2011-12-21 22:23 ` [PATCH 2/4] thp: optimize away unnecessary page table locking Naoya Horiguchi
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2011-12-21 22:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Thp split is not necessary if we explicitly check whether pmds are
mapping thps or not. This patch introduces the check and the code
to generate pagemap entries for pmds mapping thps, which reduces
the performance impact of pagemap on thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>

Changes since v1:
  - move pfn declaration to the beginning of pagemap_pte_range()
---
 fs/proc/task_mmu.c |   49 +++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 43 insertions(+), 6 deletions(-)

diff --git 3.2-rc5.orig/fs/proc/task_mmu.c 3.2-rc5/fs/proc/task_mmu.c
index e418c5a..0df61ab 100644
--- 3.2-rc5.orig/fs/proc/task_mmu.c
+++ 3.2-rc5/fs/proc/task_mmu.c
@@ -600,6 +600,9 @@ struct pagemapread {
 	u64 *buffer;
 };
 
+#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
+#define PAGEMAP_WALK_MASK	(PMD_MASK)
+
 #define PM_ENTRY_BYTES      sizeof(u64)
 #define PM_STATUS_BITS      3
 #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
@@ -658,6 +661,22 @@ static u64 pte_to_pagemap_entry(pte_t pte)
 	return pme;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
+{
+	u64 pme = 0;
+	if (pte_present(pte))
+		pme = PM_PFRAME(pte_pfn(pte) + offset)
+			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
+	return pme;
+}
+#else
+static inline u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
+{
+	return 0;
+}
+#endif
+
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     struct mm_walk *walk)
 {
@@ -665,14 +684,34 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct pagemapread *pm = walk->private;
 	pte_t *pte;
 	int err = 0;
-
-	split_huge_page_pmd(walk->mm, pmd);
+	u64 pfn = PM_NOT_PRESENT;
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
-	for (; addr != end; addr += PAGE_SIZE) {
-		u64 pfn = PM_NOT_PRESENT;
 
+	spin_lock(&walk->mm->page_table_lock);
+	if (pmd_trans_huge(*pmd)) {
+		if (pmd_trans_splitting(*pmd)) {
+			spin_unlock(&walk->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+		} else {
+			for (; addr != end; addr += PAGE_SIZE) {
+				int offset = (addr & ~PAGEMAP_WALK_MASK)
+					>> PAGE_SHIFT;
+				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
+							       offset);
+				err = add_to_pagemap(addr, pfn, pm);
+				if (err)
+					break;
+			}
+			spin_unlock(&walk->mm->page_table_lock);
+			return err;
+		}
+	} else {
+		spin_unlock(&walk->mm->page_table_lock);
+	}
+
+	for (; addr != end; addr += PAGE_SIZE) {
 		/* check to see if we've left 'vma' behind
 		 * and need a new, higher one */
 		if (vma && (addr >= vma->vm_end))
@@ -754,8 +793,6 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
-#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
-#define PAGEMAP_WALK_MASK	(PMD_MASK)
 static ssize_t pagemap_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
-- 
1.7.7.3
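
For context, here is a minimal userspace sketch of how the entries
produced by this walker are consumed (a hypothetical test program, not
part of this series; it assumes the 3.2-era entry layout from
Documentation/vm/pagemap.txt: bits 0-54 hold the pfn and bit 63 the
present flag):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	uint64_t entry;
	unsigned long vaddr;
	long pagesize = sysconf(_SC_PAGESIZE);
	int fd;

	if (argc < 2)
		return 1;
	vaddr = strtoul(argv[1], NULL, 0);
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return 1;
	/* pagemap holds one 64-bit entry per virtual page */
	if (pread(fd, &entry, sizeof(entry),
		  (vaddr / pagesize) * sizeof(entry)) != sizeof(entry))
		return 1;
	printf("present=%d pfn=0x%llx\n",
	       (int)(entry >> 63),
	       (unsigned long long)(entry & ((1ULL << 55) - 1)));
	close(fd);
	return 0;
}

With the patch applied, every small-page entry inside a thp-backed
range reports its pfn like this without the read forcing a split.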



* [PATCH 2/4] thp: optimize away unnecessary page table locking
  2011-12-21 22:23 [PATCH 0/4 v2] pagemap handles transparent hugepage Naoya Horiguchi
  2011-12-21 22:23 ` [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
@ 2011-12-21 22:23 ` Naoya Horiguchi
  2011-12-30  3:59   ` KOSAKI Motohiro
  2011-12-21 22:23 ` [PATCH 3/4] pagemap: export KPF_THP Naoya Horiguchi
  2011-12-21 22:23 ` [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
  3 siblings, 1 reply; 20+ messages in thread
From: Naoya Horiguchi @ 2011-12-21 22:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

Currently, when we check whether we can handle a thp as it is or need to
split it into regular sized pages, we hold the page table lock before
checking whether a given pmd is mapping a thp. Because of this, when it's
not a "huge pmd" we suffer unnecessary lock/unlock overhead. To remove
it, this patch introduces an optimized check function and replaces
several instances of similar logic with it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
---
 fs/proc/task_mmu.c      |   74 ++++++++++------------------
 include/linux/huge_mm.h |    7 +++
 mm/huge_memory.c        |  124 ++++++++++++++++++++++------------------------
 mm/mremap.c             |    3 +-
 4 files changed, 93 insertions(+), 115 deletions(-)

diff --git 3.2-rc5.orig/fs/proc/task_mmu.c 3.2-rc5/fs/proc/task_mmu.c
index 0df61ab..3b79dd4 100644
--- 3.2-rc5.orig/fs/proc/task_mmu.c
+++ 3.2-rc5/fs/proc/task_mmu.c
@@ -394,20 +394,12 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			smaps_pte_entry(*(pte_t *)pmd, addr,
-					HPAGE_PMD_SIZE, walk);
-			spin_unlock(&walk->mm->page_table_lock);
-			mss->anonymous_thp += HPAGE_PMD_SIZE;
-			return 0;
-		}
-	} else {
+	if (check_and_wait_split_huge_pmd(pmd, vma)) {
+		smaps_pte_entry(*(pte_t *)pmd, addr,
+				HPAGE_PMD_SIZE, walk);
 		spin_unlock(&walk->mm->page_table_lock);
+		mss->anonymous_thp += HPAGE_PMD_SIZE;
+		return 0;
 	}
 	/*
 	 * The mmap_sem held all the way back in m_start() is what
@@ -689,26 +681,19 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
 
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			for (; addr != end; addr += PAGE_SIZE) {
-				int offset = (addr & ~PAGEMAP_WALK_MASK)
-					>> PAGE_SHIFT;
-				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
-							       offset);
-				err = add_to_pagemap(addr, pfn, pm);
-				if (err)
-					break;
-			}
-			spin_unlock(&walk->mm->page_table_lock);
-			return err;
+	/* David comment */
+	if (check_and_wait_split_huge_pmd(pmd, vma)) {
+		for (; addr != end; addr += PAGE_SIZE) {
+			int offset = (addr & ~PAGEMAP_WALK_MASK)
+				>> PAGE_SHIFT;
+			pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
+						       offset);
+			err = add_to_pagemap(addr, pfn, pm);
+			if (err)
+				break;
 		}
-	} else {
 		spin_unlock(&walk->mm->page_table_lock);
+		return err;
 	}
 
 	for (; addr != end; addr += PAGE_SIZE) {
@@ -975,24 +960,17 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 
 	md = walk->private;
-	spin_lock(&walk->mm->page_table_lock);
-	if (pmd_trans_huge(*pmd)) {
-		if (pmd_trans_splitting(*pmd)) {
-			spin_unlock(&walk->mm->page_table_lock);
-			wait_split_huge_page(md->vma->anon_vma, pmd);
-		} else {
-			pte_t huge_pte = *(pte_t *)pmd;
-			struct page *page;
-
-			page = can_gather_numa_stats(huge_pte, md->vma, addr);
-			if (page)
-				gather_stats(page, md, pte_dirty(huge_pte),
-						HPAGE_PMD_SIZE/PAGE_SIZE);
-			spin_unlock(&walk->mm->page_table_lock);
-			return 0;
-		}
-	} else {
+
+	if (check_and_wait_split_huge_pmd(pmd, md->vma)) {
+		pte_t huge_pte = *(pte_t *)pmd;
+		struct page *page;
+
+		page = can_gather_numa_stats(huge_pte, md->vma, addr);
+		if (page)
+			gather_stats(page, md, pte_dirty(huge_pte),
+				     HPAGE_PMD_SIZE/PAGE_SIZE);
 		spin_unlock(&walk->mm->page_table_lock);
+		return 0;
 	}
 
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
diff --git 3.2-rc5.orig/include/linux/huge_mm.h 3.2-rc5/include/linux/huge_mm.h
index a9ace9c..477c8e3 100644
--- 3.2-rc5.orig/include/linux/huge_mm.h
+++ 3.2-rc5/include/linux/huge_mm.h
@@ -113,6 +113,8 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    long adjust_next);
+extern int check_and_wait_split_huge_pmd(pmd_t *pmd,
+				struct vm_area_struct *vma);
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
@@ -176,6 +178,11 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
+static inline int check_and_wait_split_huge_pmd(pmd_t *pmd,
+					struct vm_area_struct *vma)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git 3.2-rc5.orig/mm/huge_memory.c 3.2-rc5/mm/huge_memory.c
index 36b3d98..b73c744 100644
--- 3.2-rc5.orig/mm/huge_memory.c
+++ 3.2-rc5/mm/huge_memory.c
@@ -1001,29 +1001,21 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	spin_lock(&tlb->mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&tlb->mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma,
-					     pmd);
-		} else {
-			struct page *page;
-			pgtable_t pgtable;
-			pgtable = get_pmd_huge_pte(tlb->mm);
-			page = pmd_page(*pmd);
-			pmd_clear(pmd);
-			page_remove_rmap(page);
-			VM_BUG_ON(page_mapcount(page) < 0);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-			VM_BUG_ON(!PageHead(page));
-			spin_unlock(&tlb->mm->page_table_lock);
-			tlb_remove_page(tlb, page);
-			pte_free(tlb->mm, pgtable);
-			ret = 1;
-		}
-	} else
+	if (likely(check_and_wait_split_huge_pmd(pmd, vma))) {
+		struct page *page;
+		pgtable_t pgtable;
+		pgtable = get_pmd_huge_pte(tlb->mm);
+		page = pmd_page(*pmd);
+		pmd_clear(pmd);
+		page_remove_rmap(page);
+		VM_BUG_ON(page_mapcount(page) < 0);
+		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		VM_BUG_ON(!PageHead(page));
 		spin_unlock(&tlb->mm->page_table_lock);
+		tlb_remove_page(tlb, page);
+		pte_free(tlb->mm, pgtable);
+		ret = 1;
+	}
 
 	return ret;
 }
@@ -1034,21 +1026,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	int ret = 0;
 
-	spin_lock(&vma->vm_mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		ret = !pmd_trans_splitting(*pmd);
-		spin_unlock(&vma->vm_mm->page_table_lock);
-		if (unlikely(!ret))
-			wait_split_huge_page(vma->anon_vma, pmd);
-		else {
-			/*
-			 * All logical pages in the range are present
-			 * if backed by a huge page.
-			 */
-			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
-		}
-	} else
+	if (likely(check_and_wait_split_huge_pmd(pmd, vma))) {
+		/*
+		 * All logical pages in the range are present
+		 * if backed by a huge page.
+		 */
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
+	}
 
 	return ret;
 }
@@ -1078,21 +1063,12 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		goto out;
 	}
 
-	spin_lock(&mm->page_table_lock);
-	if (likely(pmd_trans_huge(*old_pmd))) {
-		if (pmd_trans_splitting(*old_pmd)) {
-			spin_unlock(&mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, old_pmd);
-			ret = -1;
-		} else {
-			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
-			VM_BUG_ON(!pmd_none(*new_pmd));
-			set_pmd_at(mm, new_addr, new_pmd, pmd);
-			spin_unlock(&mm->page_table_lock);
-			ret = 1;
-		}
-	} else {
+	if (likely(check_and_wait_split_huge_pmd(old_pmd, vma))) {
+		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+		VM_BUG_ON(!pmd_none(*new_pmd));
+		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		spin_unlock(&mm->page_table_lock);
+		ret = 1;
 	}
 out:
 	return ret;
@@ -1104,27 +1080,45 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	int ret = 0;
 
-	spin_lock(&mm->page_table_lock);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(&mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
-		} else {
-			pmd_t entry;
+	if (likely(check_and_wait_split_huge_pmd(pmd, vma))) {
+		pmd_t entry;
 
-			entry = pmdp_get_and_clear(mm, addr, pmd);
-			entry = pmd_modify(entry, newprot);
-			set_pmd_at(mm, addr, pmd, entry);
-			spin_unlock(&vma->vm_mm->page_table_lock);
-			flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
-			ret = 1;
-		}
-	} else
+		entry = pmdp_get_and_clear(mm, addr, pmd);
+		entry = pmd_modify(entry, newprot);
+		set_pmd_at(mm, addr, pmd, entry);
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
+		ret = 1;
+	}
 
 	return ret;
 }
 
+/*
+ * Returns 1 if a given pmd is mapping a thp and stable (not under splitting.)
+ * Returns 0 otherwise. Note that if it returns 1, this routine returns without
+ * unlocking page table locks. So callers must unlock them.
+ */
+int check_and_wait_split_huge_pmd(pmd_t *pmd, struct vm_area_struct *vma)
+{
+	if (!pmd_trans_huge(*pmd))
+		return 0;
+
+	spin_lock(&vma->vm_mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (pmd_trans_splitting(*pmd)) {
+			spin_unlock(&vma->vm_mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+		} else {
+			/* Thp mapped by 'pmd' is stable, so we can
+			 * handle it as it is. */
+			return 1;
+		}
+	}
+	spin_unlock(&vma->vm_mm->page_table_lock);
+	return 0;
+}
+
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
diff --git 3.2-rc5.orig/mm/mremap.c 3.2-rc5/mm/mremap.c
index d6959cb..d534668 100644
--- 3.2-rc5.orig/mm/mremap.c
+++ 3.2-rc5/mm/mremap.c
@@ -155,9 +155,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			if (err > 0) {
 				need_flush = true;
 				continue;
-			} else if (!err) {
-				split_huge_page_pmd(vma->vm_mm, old_pmd);
 			}
+			split_huge_page_pmd(vma->vm_mm, old_pmd);
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
 		if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,
-- 
1.7.7.3
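
The locking contract of the new helper is subtle, so here is the caller
pattern condensed from the hunks above; do_huge_work() and walk_ptes()
are placeholders, not functions from this series:

static int some_pmd_walker(pmd_t *pmd, struct vm_area_struct *vma)
{
	if (check_and_wait_split_huge_pmd(pmd, vma)) {
		/*
		 * *pmd maps a stable thp and page_table_lock is still
		 * held on our behalf; we must drop it ourselves.
		 */
		do_huge_work(pmd);		/* placeholder */
		spin_unlock(&vma->vm_mm->page_table_lock);
		return 0;
	}
	/*
	 * No thp here (any splitting has been waited out), and the
	 * lock has already been released; fall back to the per-pte
	 * path.
	 */
	return walk_ptes(pmd, vma);		/* placeholder */
}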



* [PATCH 3/4] pagemap: export KPF_THP
  2011-12-21 22:23 [PATCH 0/4 v2] pagemap handles transparent hugepage Naoya Horiguchi
  2011-12-21 22:23 ` [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
  2011-12-21 22:23 ` [PATCH 2/4] thp: optimize away unnecessary page table locking Naoya Horiguchi
@ 2011-12-21 22:23 ` Naoya Horiguchi
  2011-12-26  8:40   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2011-12-21 22:23 ` [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
  3 siblings, 3 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2011-12-21 22:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

This flag shows that a given page is a subpage of a transparent hugepage.
It helps us debug and test the kernel by showing the physical address of a thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Nacked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

Changes since v1:
  - remove unnecessary ifdefs
  - fix confusing patch description
---
 fs/proc/page.c                    |    2 ++
 include/linux/kernel-page-flags.h |    1 +
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git 3.2-rc5.orig/fs/proc/page.c 3.2-rc5/fs/proc/page.c
index 6d8e6a9..cb2dcea 100644
--- 3.2-rc5.orig/fs/proc/page.c
+++ 3.2-rc5/fs/proc/page.c
@@ -115,6 +115,8 @@ u64 stable_page_flags(struct page *page)
 		u |= 1 << KPF_COMPOUND_TAIL;
 	if (PageHuge(page))
 		u |= 1 << KPF_HUGE;
+	if (PageTransCompound(page))
+		u |= 1 << KPF_THP;
 
 	/*
 	 * Caveats on high order pages: page->_count will only be set
diff --git 3.2-rc5.orig/include/linux/kernel-page-flags.h 3.2-rc5/include/linux/kernel-page-flags.h
index bd92a89..26a6571 100644
--- 3.2-rc5.orig/include/linux/kernel-page-flags.h
+++ 3.2-rc5/include/linux/kernel-page-flags.h
@@ -30,6 +30,7 @@
 #define KPF_NOPAGE		20
 
 #define KPF_KSM			21
+#define KPF_THP			22
 
 /* kernel hacking assistances
  * WARNING: subject to change, never rely on them!
-- 
1.7.7.3
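
A minimal sketch of how the new bit can be consumed from userspace (a
hypothetical checker, separate from the page-types support added in
patch 4/4; it hard-codes the KPF_THP value exported above):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define KPF_THP 22	/* matches include/linux/kernel-page-flags.h */

int main(int argc, char **argv)
{
	uint64_t flags;
	unsigned long pfn;
	int fd;

	if (argc < 2)
		return 1;
	pfn = strtoul(argv[1], NULL, 0);
	fd = open("/proc/kpageflags", O_RDONLY);
	if (fd < 0)
		return 1;
	/* /proc/kpageflags holds one u64 flag word per pfn */
	if (pread(fd, &flags, sizeof(flags),
		  pfn * sizeof(flags)) != sizeof(flags))
		return 1;
	printf("pfn 0x%lx: %sthp\n", pfn,
	       (flags >> KPF_THP) & 1 ? "" : "not ");
	close(fd);
	return 0;
}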



* [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it
  2011-12-21 22:23 [PATCH 0/4 v2] pagemap handles transparent hugepage Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2011-12-21 22:23 ` [PATCH 3/4] pagemap: export KPF_THP Naoya Horiguchi
@ 2011-12-21 22:23 ` Naoya Horiguchi
  2011-12-26  8:42   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  3 siblings, 3 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2011-12-21 22:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel, Naoya Horiguchi

page-types, which is a common user of pagemap, becomes aware of thp
with this patch. This helps system admins and kernel hackers understand
how thp works.
Here is a sample output of page-types over a thp:

  $ page-types -p <pid> --raw --list

  voffset offset  len     flags
  ...
  7f9d40200       3f8400  1       ___U_lA____Ma_bH______t____________
  7f9d40201       3f8401  1ff     ________________T_____t____________

               flags      page-count       MB  symbolic-flags                     long-symbolic-flags
  0x0000000000410000             511        1  ________________T_____t____________        compound_tail,thp
  0x000000000040d868               1        0  ___U_lA____Ma_bH______t____________        uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>

Changes since v1:
  - fix misused word
---
 Documentation/vm/page-types.c |    2 ++
 Documentation/vm/pagemap.txt  |    4 ++++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git 3.2-rc5.orig/Documentation/vm/page-types.c 3.2-rc5/Documentation/vm/page-types.c
index 7445caa..0b13f02 100644
--- 3.2-rc5.orig/Documentation/vm/page-types.c
+++ 3.2-rc5/Documentation/vm/page-types.c
@@ -98,6 +98,7 @@
 #define KPF_HWPOISON		19
 #define KPF_NOPAGE		20
 #define KPF_KSM			21
+#define KPF_THP			22
 
 /* [32-] kernel hacking assistances */
 #define KPF_RESERVED		32
@@ -147,6 +148,7 @@ static const char *page_flag_names[] = {
 	[KPF_HWPOISON]		= "X:hwpoison",
 	[KPF_NOPAGE]		= "n:nopage",
 	[KPF_KSM]		= "x:ksm",
+	[KPF_THP]		= "t:thp",
 
 	[KPF_RESERVED]		= "r:reserved",
 	[KPF_MLOCKED]		= "m:mlocked",
diff --git 3.2-rc5.orig/Documentation/vm/pagemap.txt 3.2-rc5/Documentation/vm/pagemap.txt
index df09b96..4600cbe 100644
--- 3.2-rc5.orig/Documentation/vm/pagemap.txt
+++ 3.2-rc5/Documentation/vm/pagemap.txt
@@ -60,6 +60,7 @@ There are three components to pagemap:
     19. HWPOISON
     20. NOPAGE
     21. KSM
+    22. THP
 
 Short descriptions to the page flags:
 
@@ -97,6 +98,9 @@ Short descriptions to the page flags:
 21. KSM
     identical memory pages dynamically shared between one or more processes
 
+22. THP
+    contiguous pages which construct transparent hugepages
+
     [IO related page flags]
  1. ERROR     IO error occurred
  3. UPTODATE  page has up-to-date data
-- 
1.7.7.3
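
To reproduce output like the sample above, a thp-backed mapping is
needed; here is a minimal sketch of one way to create it (a
hypothetical helper, assuming x86_64 2MB thps and THP enabled):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define THP_SIZE (2UL * 1024 * 1024)

int main(void)
{
	void *p;

	/* a 2MB-aligned anonymous region is a thp candidate */
	if (posix_memalign(&p, THP_SIZE, THP_SIZE))
		return 1;
	madvise(p, THP_SIZE, MADV_HUGEPAGE);
	memset(p, 1, THP_SIZE);		/* fault the region in */
	printf("pid %d vaddr %p\n", getpid(), p);
	pause();	/* keep the mapping alive for page-types -p */
	return 0;
}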



* Re: [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2011-12-21 22:23 ` [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
@ 2011-12-26  8:26   ` KAMEZAWA Hiroyuki
  2011-12-30  3:39   ` KOSAKI Motohiro
  2012-01-04 23:50   ` Andrew Morton
  2 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-26  8:26 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Wed, 21 Dec 2011 17:23:45 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Thp split is not necessary if we explicitly check whether pmds are
> mapping thps or not. This patch introduces the check and the code
> to generate pagemap entries for pmds mapping thps, which reduces
> the performance impact of pagemap on thp.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> 
> Changes since v1:
>   - move pfn declaration to the beginning of pagemap_pte_range()
> ---

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>




* Re: [PATCH 3/4] pagemap: export KPF_THP
  2011-12-21 22:23 ` [PATCH 3/4] pagemap: export KPF_THP Naoya Horiguchi
@ 2011-12-26  8:40   ` KAMEZAWA Hiroyuki
  2011-12-30  4:01   ` KOSAKI Motohiro
  2012-01-04 23:55   ` Andrew Morton
  2 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-26  8:40 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Wed, 21 Dec 2011 17:23:47 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> This flag shows that a given page is a subpage of a transparent hugepage.
> It helps us debug and test the kernel by showing the physical address of a thp.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Nacked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



* Re: [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it
  2011-12-21 22:23 ` [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
@ 2011-12-26  8:42   ` KAMEZAWA Hiroyuki
  2011-12-30  4:02   ` KOSAKI Motohiro
  2012-01-04 23:57   ` Andrew Morton
  2 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-26  8:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro, linux-kernel

On Wed, 21 Dec 2011 17:23:48 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> page-types, which is a common user of pagemap, becomes aware of thp
> with this patch. This helps system admins and kernel hackers understand
> how thp works.
> Here is a sample output of page-types over a thp:
> 
>   $ page-types -p <pid> --raw --list
> 
>   voffset offset  len     flags
>   ...
>   7f9d40200       3f8400  1       ___U_lA____Ma_bH______t____________
>   7f9d40201       3f8401  1ff     ________________T_____t____________
> 
>                flags      page-count       MB  symbolic-flags                     long-symbolic-flags
>   0x0000000000410000             511        1  ________________T_____t____________        compound_tail,thp
>   0x000000000040d868               1        0  ___U_lA____Ma_bH______t____________        uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>
> 
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



* Re: [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2011-12-21 22:23 ` [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
  2011-12-26  8:26   ` KAMEZAWA Hiroyuki
@ 2011-12-30  3:39   ` KOSAKI Motohiro
  2012-01-03 20:07     ` Naoya Horiguchi
  2012-01-04 23:50   ` Andrew Morton
  2 siblings, 1 reply; 20+ messages in thread
From: KOSAKI Motohiro @ 2011-12-30  3:39 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

(12/21/11 5:23 PM), Naoya Horiguchi wrote:
> Thp split is not necessary if we explicitly check whether pmds are
> mapping thps or not. This patch introduces the check and the code
> to generate pagemap entries for pmds mapping thps, which reduces
> the performance impact of pagemap on thp.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> 
> Changes since v1:
>    - move pfn declaration to the beginning of pagemap_pte_range()
> ---
>   fs/proc/task_mmu.c |   49 +++++++++++++++++++++++++++++++++++++++++++------
>   1 files changed, 43 insertions(+), 6 deletions(-)
> 
> diff --git 3.2-rc5.orig/fs/proc/task_mmu.c 3.2-rc5/fs/proc/task_mmu.c
> index e418c5a..0df61ab 100644
> --- 3.2-rc5.orig/fs/proc/task_mmu.c
> +++ 3.2-rc5/fs/proc/task_mmu.c
> @@ -600,6 +600,9 @@ struct pagemapread {
>   	u64 *buffer;
>   };
> 
> +#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
> +#define PAGEMAP_WALK_MASK	(PMD_MASK)
> +
>   #define PM_ENTRY_BYTES      sizeof(u64)
>   #define PM_STATUS_BITS      3
>   #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
> @@ -658,6 +661,22 @@ static u64 pte_to_pagemap_entry(pte_t pte)
>   	return pme;
>   }
> 
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
> +{
> +	u64 pme = 0;
> +	if (pte_present(pte))

When does pte_present() return 0?

> +		pme = PM_PFRAME(pte_pfn(pte) + offset)
> +			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
> +	return pme;
> +}
> +#else
> +static inline u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
> +{
> +	return 0;
> +}
> +#endif
> +
>   static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   			     struct mm_walk *walk)
>   {
> @@ -665,14 +684,34 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   	struct pagemapread *pm = walk->private;
>   	pte_t *pte;
>   	int err = 0;
> -
> -	split_huge_page_pmd(walk->mm, pmd);
> +	u64 pfn = PM_NOT_PRESENT;
> 
>   	/* find the first VMA at or above 'addr' */
>   	vma = find_vma(walk->mm, addr);
> -	for (; addr != end; addr += PAGE_SIZE) {
> -		u64 pfn = PM_NOT_PRESENT;
> 
> +	spin_lock(&walk->mm->page_table_lock);
> +	if (pmd_trans_huge(*pmd)) {
> +		if (pmd_trans_splitting(*pmd)) {
> +			spin_unlock(&walk->mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma, pmd);
> +		} else {
> +			for (; addr != end; addr += PAGE_SIZE) {
> +				int offset = (addr & ~PAGEMAP_WALK_MASK)
> +					>> PAGE_SHIFT;

implicit narrowing conversion. offset should be unsigned long.


> +				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
> +							       offset);

This (pte_t *) cast looks like it introduces a new implicit assumption.
Please don't put an x86 assumption here directly.




> +				err = add_to_pagemap(addr, pfn, pm);
> +				if (err)
> +					break;
> +			}
> +			spin_unlock(&walk->mm->page_table_lock);
> +			return err;
> +		}
> +	} else {
> +		spin_unlock(&walk->mm->page_table_lock);
> +	}

coding standard violation. plz run checkpatch.pl.




* Re: [PATCH 2/4] thp: optimize away unnecessary page table locking
  2011-12-21 22:23 ` [PATCH 2/4] thp: optimize away unnecessary page table locking Naoya Horiguchi
@ 2011-12-30  3:59   ` KOSAKI Motohiro
  2012-01-03 20:08     ` Naoya Horiguchi
  0 siblings, 1 reply; 20+ messages in thread
From: KOSAKI Motohiro @ 2011-12-30  3:59 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

(12/21/11 5:23 PM), Naoya Horiguchi wrote:
> Currently, when we check whether we can handle a thp as it is or need to
> split it into regular sized pages, we hold the page table lock before
> checking whether a given pmd is mapping a thp. Because of this, when it's
> not a "huge pmd" we suffer unnecessary lock/unlock overhead. To remove
> it, this patch introduces an optimized check function and replaces
> several instances of similar logic with it.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: David Rientjes <rientjes@google.com>

ok, this looks like a valuable patch.


> ---
>   fs/proc/task_mmu.c      |   74 ++++++++++------------------
>   include/linux/huge_mm.h |    7 +++
>   mm/huge_memory.c        |  124 ++++++++++++++++++++++------------------------
>   mm/mremap.c             |    3 +-
>   4 files changed, 93 insertions(+), 115 deletions(-)
> 
> diff --git 3.2-rc5.orig/fs/proc/task_mmu.c 3.2-rc5/fs/proc/task_mmu.c
> index 0df61ab..3b79dd4 100644
> --- 3.2-rc5.orig/fs/proc/task_mmu.c
> +++ 3.2-rc5/fs/proc/task_mmu.c
> @@ -394,20 +394,12 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   	pte_t *pte;
>   	spinlock_t *ptl;
> 
> -	spin_lock(&walk->mm->page_table_lock);
> -	if (pmd_trans_huge(*pmd)) {
> -		if (pmd_trans_splitting(*pmd)) {
> -			spin_unlock(&walk->mm->page_table_lock);
> -			wait_split_huge_page(vma->anon_vma, pmd);
> -		} else {
> -			smaps_pte_entry(*(pte_t *)pmd, addr,
> -					HPAGE_PMD_SIZE, walk);
> -			spin_unlock(&walk->mm->page_table_lock);
> -			mss->anonymous_thp += HPAGE_PMD_SIZE;
> -			return 0;
> -		}
> -	} else {
> +	if (check_and_wait_split_huge_pmd(pmd, vma)) {
> +		smaps_pte_entry(*(pte_t *)pmd, addr,
> +				HPAGE_PMD_SIZE, walk);
>   		spin_unlock(&walk->mm->page_table_lock);
> +		mss->anonymous_thp += HPAGE_PMD_SIZE;
> +		return 0;
>   	}
>   	/*
>   	 * The mmap_sem held all the way back in m_start() is what
> @@ -689,26 +681,19 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   	/* find the first VMA at or above 'addr' */
>   	vma = find_vma(walk->mm, addr);
> 
> -	spin_lock(&walk->mm->page_table_lock);
> -	if (pmd_trans_huge(*pmd)) {
> -		if (pmd_trans_splitting(*pmd)) {
> -			spin_unlock(&walk->mm->page_table_lock);
> -			wait_split_huge_page(vma->anon_vma, pmd);
> -		} else {
> -			for (; addr != end; addr += PAGE_SIZE) {
> -				int offset = (addr & ~PAGEMAP_WALK_MASK)
> -					>> PAGE_SHIFT;
> -				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
> -							       offset);
> -				err = add_to_pagemap(addr, pfn, pm);
> -				if (err)
> -					break;
> -			}
> -			spin_unlock(&walk->mm->page_table_lock);
> -			return err;
> +	/* David comment */

This comment doesn't explain anything.


> +	if (check_and_wait_split_huge_pmd(pmd, vma)) {
> +		for (; addr != end; addr += PAGE_SIZE) {
> +			int offset = (addr & ~PAGEMAP_WALK_MASK)
> +				>> PAGE_SHIFT;
> +			pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
> +						       offset);
> +			err = add_to_pagemap(addr, pfn, pm);
> +			if (err)
> +				break;
>   		}
> -	} else {
>   		spin_unlock(&walk->mm->page_table_lock);
> +		return err;
>   	}
> 
>   	for (; addr != end; addr += PAGE_SIZE) {
> @@ -975,24 +960,17 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
>   	pte_t *pte;
> 
>   	md = walk->private;
> -	spin_lock(&walk->mm->page_table_lock);
> -	if (pmd_trans_huge(*pmd)) {
> -		if (pmd_trans_splitting(*pmd)) {
> -			spin_unlock(&walk->mm->page_table_lock);
> -			wait_split_huge_page(md->vma->anon_vma, pmd);
> -		} else {
> -			pte_t huge_pte = *(pte_t *)pmd;
> -			struct page *page;
> -
> -			page = can_gather_numa_stats(huge_pte, md->vma, addr);
> -			if (page)
> -				gather_stats(page, md, pte_dirty(huge_pte),
> -						HPAGE_PMD_SIZE/PAGE_SIZE);
> -			spin_unlock(&walk->mm->page_table_lock);
> -			return 0;
> -		}
> -	} else {
> +
> +	if (check_and_wait_split_huge_pmd(pmd, md->vma)) {
> +		pte_t huge_pte = *(pte_t *)pmd;
> +		struct page *page;
> +
> +		page = can_gather_numa_stats(huge_pte, md->vma, addr);
> +		if (page)
> +			gather_stats(page, md, pte_dirty(huge_pte),
> +				     HPAGE_PMD_SIZE/PAGE_SIZE);
>   		spin_unlock(&walk->mm->page_table_lock);
> +		return 0;
>   	}
> 
>   	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> diff --git 3.2-rc5.orig/include/linux/huge_mm.h 3.2-rc5/include/linux/huge_mm.h
> index a9ace9c..477c8e3 100644
> --- 3.2-rc5.orig/include/linux/huge_mm.h
> +++ 3.2-rc5/include/linux/huge_mm.h
> @@ -113,6 +113,8 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>   				    unsigned long start,
>   				    unsigned long end,
>   				    long adjust_next);
> +extern int check_and_wait_split_huge_pmd(pmd_t *pmd,
> +				struct vm_area_struct *vma);
>   static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>   					 unsigned long start,
>   					 unsigned long end,
> @@ -176,6 +178,11 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>   					 long adjust_next)
>   {
>   }
> +static inline int check_and_wait_split_huge_pmd(pmd_t *pmd,
> +					struct vm_area_struct *vma)
> +{
> +	return 0;
> +}
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> 
>   #endif /* _LINUX_HUGE_MM_H */
> diff --git 3.2-rc5.orig/mm/huge_memory.c 3.2-rc5/mm/huge_memory.c
> index 36b3d98..b73c744 100644
> --- 3.2-rc5.orig/mm/huge_memory.c
> +++ 3.2-rc5/mm/huge_memory.c
> @@ -1001,29 +1001,21 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   {
>   	int ret = 0;
> 
> -	spin_lock(&tlb->mm->page_table_lock);
> -	if (likely(pmd_trans_huge(*pmd))) {
> -		if (unlikely(pmd_trans_splitting(*pmd))) {
> -			spin_unlock(&tlb->mm->page_table_lock);
> -			wait_split_huge_page(vma->anon_vma,
> -					     pmd);
> -		} else {
> -			struct page *page;
> -			pgtable_t pgtable;
> -			pgtable = get_pmd_huge_pte(tlb->mm);
> -			page = pmd_page(*pmd);
> -			pmd_clear(pmd);
> -			page_remove_rmap(page);
> -			VM_BUG_ON(page_mapcount(page) < 0);
> -			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> -			VM_BUG_ON(!PageHead(page));
> -			spin_unlock(&tlb->mm->page_table_lock);
> -			tlb_remove_page(tlb, page);
> -			pte_free(tlb->mm, pgtable);
> -			ret = 1;
> -		}
> -	} else
> +	if (likely(check_and_wait_split_huge_pmd(pmd, vma))) {
> +		struct page *page;
> +		pgtable_t pgtable;
> +		pgtable = get_pmd_huge_pte(tlb->mm);
> +		page = pmd_page(*pmd);
> +		pmd_clear(pmd);
> +		page_remove_rmap(page);
> +		VM_BUG_ON(page_mapcount(page) < 0);
> +		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +		VM_BUG_ON(!PageHead(page));
>   		spin_unlock(&tlb->mm->page_table_lock);
> +		tlb_remove_page(tlb, page);
> +		pte_free(tlb->mm, pgtable);
> +		ret = 1;
> +	}
> 
>   	return ret;
>   }
> @@ -1034,21 +1026,14 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   {
>   	int ret = 0;
> 
> -	spin_lock(&vma->vm_mm->page_table_lock);
> -	if (likely(pmd_trans_huge(*pmd))) {
> -		ret = !pmd_trans_splitting(*pmd);
> -		spin_unlock(&vma->vm_mm->page_table_lock);
> -		if (unlikely(!ret))
> -			wait_split_huge_page(vma->anon_vma, pmd);
> -		else {
> -			/*
> -			 * All logical pages in the range are present
> -			 * if backed by a huge page.
> -			 */
> -			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> -		}
> -	} else
> +	if (likely(check_and_wait_split_huge_pmd(pmd, vma))) {
> +		/*
> +		 * All logical pages in the range are present
> +		 * if backed by a huge page.
> +		 */
>   		spin_unlock(&vma->vm_mm->page_table_lock);
> +		memset(vec, 1, (end - addr) >> PAGE_SHIFT);
> +	}
> 
>   	return ret;
>   }
> @@ -1078,21 +1063,12 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>   		goto out;
>   	}
> 
> -	spin_lock(&mm->page_table_lock);
> -	if (likely(pmd_trans_huge(*old_pmd))) {
> -		if (pmd_trans_splitting(*old_pmd)) {
> -			spin_unlock(&mm->page_table_lock);
> -			wait_split_huge_page(vma->anon_vma, old_pmd);
> -			ret = -1;
> -		} else {
> -			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
> -			VM_BUG_ON(!pmd_none(*new_pmd));
> -			set_pmd_at(mm, new_addr, new_pmd, pmd);
> -			spin_unlock(&mm->page_table_lock);
> -			ret = 1;
> -		}
> -	} else {
> +	if (likely(check_and_wait_split_huge_pmd(old_pmd, vma))) {
> +		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
> +		VM_BUG_ON(!pmd_none(*new_pmd));
> +		set_pmd_at(mm, new_addr, new_pmd, pmd);
>   		spin_unlock(&mm->page_table_lock);
> +		ret = 1;
>   	}
>   out:
>   	return ret;
> @@ -1104,27 +1080,45 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   	struct mm_struct *mm = vma->vm_mm;
>   	int ret = 0;
> 
> -	spin_lock(&mm->page_table_lock);
> -	if (likely(pmd_trans_huge(*pmd))) {
> -		if (unlikely(pmd_trans_splitting(*pmd))) {
> -			spin_unlock(&mm->page_table_lock);
> -			wait_split_huge_page(vma->anon_vma, pmd);
> -		} else {
> -			pmd_t entry;
> +	if (likely(check_and_wait_split_huge_pmd(pmd, vma))) {
> +		pmd_t entry;
> 
> -			entry = pmdp_get_and_clear(mm, addr, pmd);
> -			entry = pmd_modify(entry, newprot);
> -			set_pmd_at(mm, addr, pmd, entry);
> -			spin_unlock(&vma->vm_mm->page_table_lock);
> -			flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
> -			ret = 1;
> -		}
> -	} else
> +		entry = pmdp_get_and_clear(mm, addr, pmd);
> +		entry = pmd_modify(entry, newprot);
> +		set_pmd_at(mm, addr, pmd, entry);
>   		spin_unlock(&vma->vm_mm->page_table_lock);
> +		flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
> +		ret = 1;
> +	}
> 
>   	return ret;
>   }
> 
> +/*
> + * Returns 1 if a given pmd is mapping a thp and stable (not under splitting.)
> + * Returns 0 otherwise. Note that if it returns 1, this routine returns without
> + * unlocking page table locks. So callers must unlock them.
> + */
> +int check_and_wait_split_huge_pmd(pmd_t *pmd, struct vm_area_struct *vma)

We should always avoid the name "check". It doesn't explain what the
function does.


> +{

VM_BUG_ON(!rwsem_is_locked(vma->mm)) here?

> +	if (!pmd_trans_huge(*pmd))
> +		return 0;
> +
> +	spin_lock(&vma->vm_mm->page_table_lock);
> +	if (likely(pmd_trans_huge(*pmd))) {
> +		if (pmd_trans_splitting(*pmd)) {
> +			spin_unlock(&vma->vm_mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma, pmd);
> +		} else {
> +			/* Thp mapped by 'pmd' is stable, so we can
> +			 * handle it as it is. */
> +			return 1;
> +		}
> +	}
> +	spin_unlock(&vma->vm_mm->page_table_lock);
> +	return 0;
> +}
> +
>   pmd_t *page_check_address_pmd(struct page *page,
>   			      struct mm_struct *mm,
>   			      unsigned long address,
> diff --git 3.2-rc5.orig/mm/mremap.c 3.2-rc5/mm/mremap.c
> index d6959cb..d534668 100644
> --- 3.2-rc5.orig/mm/mremap.c
> +++ 3.2-rc5/mm/mremap.c
> @@ -155,9 +155,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>   			if (err > 0) {
>   				need_flush = true;
>   				continue;
> -			} else if (!err) {
> -				split_huge_page_pmd(vma->vm_mm, old_pmd);
>   			}
> +			split_huge_page_pmd(vma->vm_mm, old_pmd);

unrelated hunk?



>   			VM_BUG_ON(pmd_trans_huge(*old_pmd));
>   		}
>   		if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,



* Re: [PATCH 3/4] pagemap: export KPF_THP
  2011-12-21 22:23 ` [PATCH 3/4] pagemap: export KPF_THP Naoya Horiguchi
  2011-12-26  8:40   ` KAMEZAWA Hiroyuki
@ 2011-12-30  4:01   ` KOSAKI Motohiro
  2012-01-04 23:55   ` Andrew Morton
  2 siblings, 0 replies; 20+ messages in thread
From: KOSAKI Motohiro @ 2011-12-30  4:01 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

(12/21/11 5:23 PM), Naoya Horiguchi wrote:
> This flag shows that a given page is a subpage of a transparent hugepage.
> It helps us debug and test the kernel by showing the physical address of a thp.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Nacked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

ok, many people like this patch, so I won't argue about it anymore.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>






* Re: [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it
  2011-12-21 22:23 ` [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
  2011-12-26  8:42   ` KAMEZAWA Hiroyuki
@ 2011-12-30  4:02   ` KOSAKI Motohiro
  2012-01-04 23:57   ` Andrew Morton
  2 siblings, 0 replies; 20+ messages in thread
From: KOSAKI Motohiro @ 2011-12-30  4:02 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

(12/21/11 5:23 PM), Naoya Horiguchi wrote:
> page-types, which is a common user of pagemap, becomes aware of thp
> with this patch. This helps system admins and kernel hackers understand
> how thp works.
> Here is a sample output of page-types over a thp:
> 
>    $ page-types -p <pid> --raw --list
> 
>    voffset offset  len     flags
>    ...
>    7f9d40200       3f8400  1       ___U_lA____Ma_bH______t____________
>    7f9d40201       3f8401  1ff     ________________T_____t____________
> 
>                 flags      page-count       MB  symbolic-flags                     long-symbolic-flags
>    0x0000000000410000             511        1  ________________T_____t____________        compound_tail,thp
>    0x000000000040d868               1        0  ___U_lA____Ma_bH______t____________        uptodate,lru,active,mmap,anonymous,swapbacked,compound_head,thp
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




* Re: [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2011-12-30  3:39   ` KOSAKI Motohiro
@ 2012-01-03 20:07     ` Naoya Horiguchi
  2012-01-03 21:06       ` KOSAKI Motohiro
  0 siblings, 1 reply; 20+ messages in thread
From: Naoya Horiguchi @ 2012-01-03 20:07 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

Hi,

Thank you for your review.

On Thu, Dec 29, 2011 at 10:39:18PM -0500, KOSAKI Motohiro wrote:
...
> > --- 3.2-rc5.orig/fs/proc/task_mmu.c
> > +++ 3.2-rc5/fs/proc/task_mmu.c
> > @@ -600,6 +600,9 @@ struct pagemapread {
> >   	u64 *buffer;
> >   };
> > 
> > +#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
> > +#define PAGEMAP_WALK_MASK	(PMD_MASK)
> > +
> >   #define PM_ENTRY_BYTES      sizeof(u64)
> >   #define PM_STATUS_BITS      3
> >   #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
> > @@ -658,6 +661,22 @@ static u64 pte_to_pagemap_entry(pte_t pte)
> >   	return pme;
> >   }
> > 
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
> > +{
> > +	u64 pme = 0;
> > +	if (pte_present(pte))
> 
> When does pte_present() return 0?

It does when the page pointed to by the pte is swapped out, under page
migration, or HWPOISONed. But currently it can't happen on thp, because
a thp is split before these operations are processed.
So this if-statement is not necessary for now, but I think it's not a bad
idea to put it in now to prepare for a future implementation.

>
> > +		pme = PM_PFRAME(pte_pfn(pte) + offset)
> > +			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
> > +	return pme;
> > +}
> > +#else
> > +static inline u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >   static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >   			     struct mm_walk *walk)
> >   {
> > @@ -665,14 +684,34 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >   	struct pagemapread *pm = walk->private;
> >   	pte_t *pte;
> >   	int err = 0;
> > -
> > -	split_huge_page_pmd(walk->mm, pmd);
> > +	u64 pfn = PM_NOT_PRESENT;
> > 
> >   	/* find the first VMA at or above 'addr' */
> >   	vma = find_vma(walk->mm, addr);
> > -	for (; addr != end; addr += PAGE_SIZE) {
> > -		u64 pfn = PM_NOT_PRESENT;
> > 
> > +	spin_lock(&walk->mm->page_table_lock);
> > +	if (pmd_trans_huge(*pmd)) {
> > +		if (pmd_trans_splitting(*pmd)) {
> > +			spin_unlock(&walk->mm->page_table_lock);
> > +			wait_split_huge_page(vma->anon_vma, pmd);
> > +		} else {
> > +			for (; addr != end; addr += PAGE_SIZE) {
> > +				int offset = (addr & ~PAGEMAP_WALK_MASK)
> > +					>> PAGE_SHIFT;
> 
> implicit narrowing conversion. offset should be unsigned long.

OK.

> 
> 
> > +				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
> > +							       offset);
> 
> This (pte_t *) cast looks like it introduces a new implicit assumption.
> Please don't put an x86 assumption here directly.

OK, I think it's better to write a separate patch for this job, because a
similar assumption is used in smaps_pte_range() and gather_pte_stats().

> 
> 
> > +				err = add_to_pagemap(addr, pfn, pm);
> > +				if (err)
> > +					break;
> > +			}
> > +			spin_unlock(&walk->mm->page_table_lock);
> > +			return err;
> > +		}
> > +	} else {
> > +		spin_unlock(&walk->mm->page_table_lock);
> > +	}
> 
> coding standard violation. plz run checkpatch.pl.

checkpatch.pl says nothing here. According to Documentation/CodingStyle,
the "no braces for a single statement" rule does not apply to else-blocks
with one statement when the corresponding if-block has multiple statements.


* Re: [PATCH 2/4] thp: optimize away unnecessary page table locking
  2011-12-30  3:59   ` KOSAKI Motohiro
@ 2012-01-03 20:08     ` Naoya Horiguchi
  0 siblings, 0 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2012-01-03 20:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Rientjes,
	Andi Kleen, Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

On Thu, Dec 29, 2011 at 10:59:53PM -0500, KOSAKI Motohiro wrote:
...
> > @@ -689,26 +681,19 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >   	/* find the first VMA at or above 'addr' */
> >   	vma = find_vma(walk->mm, addr);
> > 
> > -	spin_lock(&walk->mm->page_table_lock);
> > -	if (pmd_trans_huge(*pmd)) {
> > -		if (pmd_trans_splitting(*pmd)) {
> > -			spin_unlock(&walk->mm->page_table_lock);
> > -			wait_split_huge_page(vma->anon_vma, pmd);
> > -		} else {
> > -			for (; addr != end; addr += PAGE_SIZE) {
> > -				int offset = (addr & ~PAGEMAP_WALK_MASK)
> > -					>> PAGE_SHIFT;
> > -				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
> > -							       offset);
> > -				err = add_to_pagemap(addr, pfn, pm);
> > -				if (err)
> > -					break;
> > -			}
> > -			spin_unlock(&walk->mm->page_table_lock);
> > -			return err;
> > +	/* David comment */
> 
> This comment doesn't explain anything.

Sorry, I forgot to remove.

...
> > diff --git 3.2-rc5.orig/mm/huge_memory.c 3.2-rc5/mm/huge_memory.c
> > index 36b3d98..b73c744 100644
> > --- 3.2-rc5.orig/mm/huge_memory.c
> > +++ 3.2-rc5/mm/huge_memory.c
...
> > @@ -1104,27 +1080,45 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >   	struct mm_struct *mm = vma->vm_mm;
> >   	int ret = 0;
> > 
> > -	spin_lock(&mm->page_table_lock);
> > -	if (likely(pmd_trans_huge(*pmd))) {
> > -		if (unlikely(pmd_trans_splitting(*pmd))) {
> > -			spin_unlock(&mm->page_table_lock);
> > -			wait_split_huge_page(vma->anon_vma, pmd);
> > -		} else {
> > -			pmd_t entry;
> > +	if (likely(check_and_wait_split_huge_pmd(pmd, vma))) {
> > +		pmd_t entry;
> > 
> > -			entry = pmdp_get_and_clear(mm, addr, pmd);
> > -			entry = pmd_modify(entry, newprot);
> > -			set_pmd_at(mm, addr, pmd, entry);
> > -			spin_unlock(&vma->vm_mm->page_table_lock);
> > -			flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
> > -			ret = 1;
> > -		}
> > -	} else
> > +		entry = pmdp_get_and_clear(mm, addr, pmd);
> > +		entry = pmd_modify(entry, newprot);
> > +		set_pmd_at(mm, addr, pmd, entry);
> >   		spin_unlock(&vma->vm_mm->page_table_lock);
> > +		flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
> > +		ret = 1;
> > +	}
> > 
> >   	return ret;
> >   }
> > 
> > +/*
> > + * Returns 1 if a given pmd is mapping a thp and stable (not under splitting.)
> > + * Returns 0 otherwise. Note that if it returns 1, this routine returns without
> > + * unlocking page table locks. So callers must unlock them.
> > + */
> > +int check_and_wait_split_huge_pmd(pmd_t *pmd, struct vm_area_struct *vma)
> 
> We should always avoid the name "check". It doesn't explain what the
> function does.

How about pmd_trans_huge_stable()?

> 
> > +{
> 
> VM_BUG_ON(!rwsem_is_locked(vma->mm)) here?

OK, I will add VM_BUG_ON(!rwsem_is_locked(vma->mm->mmap_sem)),
which helps us make sure that any new user of this function holds mmap_sem.

> > +	if (!pmd_trans_huge(*pmd))
> > +		return 0;
> > +
> > +	spin_lock(&vma->vm_mm->page_table_lock);
> > +	if (likely(pmd_trans_huge(*pmd))) {
> > +		if (pmd_trans_splitting(*pmd)) {
> > +			spin_unlock(&vma->vm_mm->page_table_lock);
> > +			wait_split_huge_page(vma->anon_vma, pmd);
> > +		} else {
> > +			/* Thp mapped by 'pmd' is stable, so we can
> > +			 * handle it as it is. */
> > +			return 1;
> > +		}
> > +	}
> > +	spin_unlock(&vma->vm_mm->page_table_lock);
> > +	return 0;
> > +}
> > +
> >   pmd_t *page_check_address_pmd(struct page *page,
> >   			      struct mm_struct *mm,
> >   			      unsigned long address,
> > diff --git 3.2-rc5.orig/mm/mremap.c 3.2-rc5/mm/mremap.c
> > index d6959cb..d534668 100644
> > --- 3.2-rc5.orig/mm/mremap.c
> > +++ 3.2-rc5/mm/mremap.c
> > @@ -155,9 +155,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >   			if (err > 0) {
> >   				need_flush = true;
> >   				continue;
> > -			} else if (!err) {
> > -				split_huge_page_pmd(vma->vm_mm, old_pmd);
> >   			}
> > +			split_huge_page_pmd(vma->vm_mm, old_pmd);
> 
> unrelated hunk?

All users (except one) of the logic which I want to replace with
check_and_wait_split_huge_pmd() expect it to return:
  1: when pmd maps thp and is not under splitting,
  0: when pmd maps thp and is under splitting,
  0: when pmd doesn't map thp.

But only move_huge_pmd() expects differently:
  1: when pmd maps thp and is not under splitting,
 -1: when pmd maps thp and is under splitting,
  0: when pmd doesn't map thp.

move_huge_pmd() is used only around the above hunk, so I chose to change
the caller. It makes no behavioral change because split_huge_page_pmd()
does nothing when old_pmd doesn't map thp.
Is it better to separate the return value change into another patch?
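
Putting the rename and the assertion discussed above together, the
helper might end up looking like this (just a sketch of the planned
change, not a posted patch):

int pmd_trans_huge_stable(pmd_t *pmd, struct vm_area_struct *vma)
{
	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));

	if (!pmd_trans_huge(*pmd))
		return 0;

	spin_lock(&vma->vm_mm->page_table_lock);
	if (likely(pmd_trans_huge(*pmd))) {
		if (pmd_trans_splitting(*pmd)) {
			spin_unlock(&vma->vm_mm->page_table_lock);
			wait_split_huge_page(vma->anon_vma, pmd);
		} else {
			/* Thp mapped by 'pmd' is stable; return with
			 * page_table_lock held for the caller. */
			return 1;
		}
	}
	spin_unlock(&vma->vm_mm->page_table_lock);
	return 0;
}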


* Re: [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-03 20:07     ` Naoya Horiguchi
@ 2012-01-03 21:06       ` KOSAKI Motohiro
  2012-01-03 21:31         ` Naoya Horiguchi
  0 siblings, 1 reply; 20+ messages in thread
From: KOSAKI Motohiro @ 2012-01-03 21:06 UTC (permalink / raw)
  To: n-horiguchi
  Cc: kosaki.motohiro, linux-mm, akpm, rientjes, andi, fengguang.wu,
	aarcange, kamezawa.hiroyu, linux-kernel

On 1/3/2012 3:07 PM, Naoya Horiguchi wrote:
> Hi,
> 
> Thank you for your review.
> 
> On Thu, Dec 29, 2011 at 10:39:18PM -0500, KOSAKI Motohiro wrote:
> ...
>>> --- 3.2-rc5.orig/fs/proc/task_mmu.c
>>> +++ 3.2-rc5/fs/proc/task_mmu.c
>>> @@ -600,6 +600,9 @@ struct pagemapread {
>>>   	u64 *buffer;
>>>   };
>>>
>>> +#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
>>> +#define PAGEMAP_WALK_MASK	(PMD_MASK)
>>> +
>>>   #define PM_ENTRY_BYTES      sizeof(u64)
>>>   #define PM_STATUS_BITS      3
>>>   #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
>>> @@ -658,6 +661,22 @@ static u64 pte_to_pagemap_entry(pte_t pte)
>>>   	return pme;
>>>   }
>>>
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +static u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
>>> +{
>>> +	u64 pme = 0;
>>> +	if (pte_present(pte))
>>
>> When does pte_present() return 0?
> 
> It does when the page pointed to by the pte is swapped out, under page
> migration, or HWPOISONed. But currently it can't happen on thp, because
> a thp is split before these operations are processed.
> So this if-statement is not necessary for now, but I think it's not a bad
> idea to put it in now to prepare for a future implementation.

You certainly need to add a comment. Otherwise you add *unnecessary*
complexity and people are going to be puzzled.


>>> +				pfn = thp_pte_to_pagemap_entry(*(pte_t *)pmd,
>>> +							       offset);
>>
>> This (pte_t *) cast looks like it introduces a new implicit assumption.
>> Please don't put an x86 assumption here directly.
> 
> OK, I think it's better to write a separate patch for this job, because a
> similar assumption is used in smaps_pte_range() and gather_pte_stats().

Sounds sane.


>>> +	} else {
>>> +		spin_unlock(&walk->mm->page_table_lock);
>>> +	}
>>
>> coding standard violation. plz run checkpatch.pl.
> 
> checkpatch.pl says nothing here. According to Documentation/CodingStyle,
> the "no braces for a single statement" rule does not apply to else-blocks
> with one statement when the corresponding if-block has multiple statements.


ok


* Re: [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-03 21:06       ` KOSAKI Motohiro
@ 2012-01-03 21:31         ` Naoya Horiguchi
  0 siblings, 0 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2012-01-03 21:31 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, linux-mm, akpm, rientjes, andi, fengguang.wu,
	aarcange, kamezawa.hiroyu, linux-kernel

(1/3/2012 16:06), KOSAKI Motohiro wrote:
> On 1/3/2012 3:07 PM, Naoya Horiguchi wrote:
>> On Thu, Dec 29, 2011 at 10:39:18PM -0500, KOSAKI Motohiro wrote:
>> ...
>>>> --- 3.2-rc5.orig/fs/proc/task_mmu.c
>>>> +++ 3.2-rc5/fs/proc/task_mmu.c
>>>> @@ -600,6 +600,9 @@ struct pagemapread {
>>>>   	u64 *buffer;
>>>>   };
>>>>
>>>> +#define PAGEMAP_WALK_SIZE	(PMD_SIZE)
>>>> +#define PAGEMAP_WALK_MASK	(PMD_MASK)
>>>> +
>>>>   #define PM_ENTRY_BYTES      sizeof(u64)
>>>>   #define PM_STATUS_BITS      3
>>>>   #define PM_STATUS_OFFSET    (64 - PM_STATUS_BITS)
>>>> @@ -658,6 +661,22 @@ static u64 pte_to_pagemap_entry(pte_t pte)
>>>>   	return pme;
>>>>   }
>>>>
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +static u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
>>>> +{
>>>> +	u64 pme = 0;
>>>> +	if (pte_present(pte))
>>>
>>> When does pte_present() return 0?
>>
>> It does when the page pointed to by the pte is swapped out, under page
>> migration, or HWPOISONed. But currently it can't happen on thp, because
>> a thp is split before these operations are processed.
>> So this if-statement is not necessary for now, but I think it's not a bad
>> idea to put it in now to prepare for a future implementation.
> 
> You certainly need to add a comment. Otherwise you add *unnecessary*
> complexity and people are going to be puzzled.

OK, I'll take care of that.
Thank you.
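
Something like this, perhaps (a sketch of the comment I plan to fold
into the next version, not a posted patch):

static u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
{
	u64 pme = 0;
	/*
	 * pte_present() is currently always true here, because a thp
	 * is split before being swapped out, migrated or hwpoisoned.
	 * The check is kept so this code stays correct if a future
	 * implementation allows non-present huge pmds.
	 */
	if (pte_present(pte))
		pme = PM_PFRAME(pte_pfn(pte) + offset)
			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
	return pme;
}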


* Re: [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2011-12-21 22:23 ` [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
  2011-12-26  8:26   ` KAMEZAWA Hiroyuki
  2011-12-30  3:39   ` KOSAKI Motohiro
@ 2012-01-04 23:50   ` Andrew Morton
  2012-01-05 16:28     ` Naoya Horiguchi
  2 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2012-01-04 23:50 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel

On Wed, 21 Dec 2011 17:23:45 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Thp split is not necessary if we explicitly check whether pmds are
> mapping thps or not. This patch introduces the check and the code
> to generate pagemap entries for pmds mapping thps, which results in
> less performance impact of pagemap on thp.
> 
>
> ...

The type choices seem odd:

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
> +{
> +	u64 pme = 0;

Why are these u64?

Should we have a pme_t, matching pte_t, pmd_t, etc?

> +	if (pte_present(pte))
> +		pme = PM_PFRAME(pte_pfn(pte) + offset)
> +			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
> +	return pme;
> +}
> +#else
> +static inline u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
> +{
> +	return 0;
> +}
> +#endif
> +
>  static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  			     struct mm_walk *walk)
>  {
> @@ -665,14 +684,34 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	struct pagemapread *pm = walk->private;
>  	pte_t *pte;
>  	int err = 0;
> -
> -	split_huge_page_pmd(walk->mm, pmd);
> +	u64 pfn = PM_NOT_PRESENT;

Again, why a u64?  pfn's are usually unsigned long.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] pagemap: export KPF_THP
  2011-12-21 22:23 ` [PATCH 3/4] pagemap: export KPF_THP Naoya Horiguchi
  2011-12-26  8:40   ` KAMEZAWA Hiroyuki
  2011-12-30  4:01   ` KOSAKI Motohiro
@ 2012-01-04 23:55   ` Andrew Morton
  2 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2012-01-04 23:55 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel

On Wed, 21 Dec 2011 17:23:47 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> This flag shows that a given page is a subpage of a transparent hugepage.
> It helps us debug and test the kernel by showing the physical address of
> a thp.
>
> ...
>
> --- 3.2-rc5.orig/include/linux/kernel-page-flags.h
> +++ 3.2-rc5/include/linux/kernel-page-flags.h
> @@ -30,6 +30,7 @@
>  #define KPF_NOPAGE		20
>  
>  #define KPF_KSM			21
> +#define KPF_THP			22
>  

Please also update and test Documentation/vm/page-types.c.
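For anyone wanting to poke at the new bit directly, here is a minimal
userspace sketch (error handling kept minimal; it assumes the KPF_THP bit
number from this patch, and that /proc/kpageflags holds one 64-bit flags
word per pfn and requires root to read):

	#include <stdint.h>
	#include <stdio.h>

	#define KPF_THP	22	/* from this patch */

	/* Return 1 if the given pfn belongs to a thp, 0 if not, -1 on error. */
	static int pfn_is_thp(FILE *kpageflags, uint64_t pfn)
	{
		uint64_t flags;

		if (fseeko(kpageflags, pfn * sizeof(flags), SEEK_SET) != 0)
			return -1;
		if (fread(&flags, sizeof(flags), 1, kpageflags) != 1)
			return -1;
		return (flags >> KPF_THP) & 1;
	}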

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it
  2011-12-21 22:23 ` [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
  2011-12-26  8:42   ` KAMEZAWA Hiroyuki
  2011-12-30  4:02   ` KOSAKI Motohiro
@ 2012-01-04 23:57   ` Andrew Morton
  2 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2012-01-04 23:57 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, David Rientjes, Andi Kleen, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	linux-kernel

On Wed, 21 Dec 2011 17:23:48 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> page-types, which is a common user of pagemap, becomes aware of thp
> with this patch. This helps system admins and kernel hackers understand
> how thp works.
> Here is a sample output of page-types over a thp:

Oh, there it is.

It would be nice to generate a /proc/pid/pagemap test for the forthcoming
tools/testing/selftests/.  But I guess page-types.c is good
enough for now.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap
  2012-01-04 23:50   ` Andrew Morton
@ 2012-01-05 16:28     ` Naoya Horiguchi
  0 siblings, 0 replies; 20+ messages in thread
From: Naoya Horiguchi @ 2012-01-05 16:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Naoya Horiguchi, linux-mm, David Rientjes, Andi Kleen,
	Wu Fengguang, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-kernel

On Wed, Jan 04, 2012 at 03:50:42PM -0800, Andrew Morton wrote:
> On Wed, 21 Dec 2011 17:23:45 -0500
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > Thp split is not necessary if we explicitly check whether pmds are
> > mapping thps or not. This patch introduces the check and the code
> > to generate pagemap entries for pmds mapping thps, which results in
> > less performance impact of pagemap on thp.
> > 
> >
> > ...
> 
> The type choices seem odd:
> 
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
> > +{
> > +	u64 pme = 0;
> 
> Why are these u64?

I guess (I just copied this type choice from other *pte_to_pagemap_entry()
type functions) it's because each entry in /proc/pid/pagemap is in a
fixed-size (64-bit) format, as described in the comment above pagemap_read().
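For reference, a sketch of consuming that fixed 64-bit format from userspace,
assuming the PM_* layout in this file at the time of this thread (pfn in the
low 55 bits, page shift above it, present flag in the top status bits):

	#include <stdint.h>

	#define PM_ENTRY_PRESENT	(1ULL << 63)
	#define PM_ENTRY_PFN_MASK	((1ULL << 55) - 1)

	/* Each virtual page has one 8-byte entry in /proc/pid/pagemap,
	 * at file offset (vaddr / page_size) * 8. */
	static inline int pm_present(uint64_t entry)
	{
		return (entry & PM_ENTRY_PRESENT) != 0;
	}

	static inline uint64_t pm_pfn(uint64_t entry)
	{
		return entry & PM_ENTRY_PFN_MASK;
	}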

> Should we have a pme_t, matching pte_t, pmd_t, etc?

Yes, it makes the code's meaning clearer.
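As a hedged sketch of what that could look like (names hypothetical):

	/* Wrap the raw 64-bit pagemap entry in its own type so it can't
	 * be silently mixed up with a pfn or a raw pte value. */
	typedef struct {
		u64 pme;
	} pme_t;

	static inline pme_t make_pme(u64 val)
	{
		pme_t pme = { .pme = val };
		return pme;
	}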

> 
> > +	if (pte_present(pte))
> > +		pme = PM_PFRAME(pte_pfn(pte) + offset)
> > +			| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT;
> > +	return pme;
> > +}
> > +#else
> > +static inline u64 thp_pte_to_pagemap_entry(pte_t pte, int offset)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >  			     struct mm_walk *walk)
> >  {
> > @@ -665,14 +684,34 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >  	struct pagemapread *pm = walk->private;
> >  	pte_t *pte;
> >  	int err = 0;
> > -
> > -	split_huge_page_pmd(walk->mm, pmd);
> > +	u64 pfn = PM_NOT_PRESENT;
> 
> Again, why a u64?  pfn's are usually unsigned long.

I think the variable's name 'pfn' is wrong rather than the type choice,
because this variable stores a pagemap entry, which is not a pure pfn.
There's room for improvement, so I'll try it in the next version.

Thanks,
Naoya

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-01-05 16:29 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-21 22:23 [PATCH 0/4 v2] pagemap handles transparent hugepage Naoya Horiguchi
2011-12-21 22:23 ` [PATCH 1/4] pagemap: avoid splitting thp when reading /proc/pid/pagemap Naoya Horiguchi
2011-12-26  8:26   ` KAMEZAWA Hiroyuki
2011-12-30  3:39   ` KOSAKI Motohiro
2012-01-03 20:07     ` Naoya Horiguchi
2012-01-03 21:06       ` KOSAKI Motohiro
2012-01-03 21:31         ` Naoya Horiguchi
2012-01-04 23:50   ` Andrew Morton
2012-01-05 16:28     ` Naoya Horiguchi
2011-12-21 22:23 ` [PATCH 2/4] thp: optimize away unnecessary page table locking Naoya Horiguchi
2011-12-30  3:59   ` KOSAKI Motohiro
2012-01-03 20:08     ` Naoya Horiguchi
2011-12-21 22:23 ` [PATCH 3/4] pagemap: export KPF_THP Naoya Horiguchi
2011-12-26  8:40   ` KAMEZAWA Hiroyuki
2011-12-30  4:01   ` KOSAKI Motohiro
2012-01-04 23:55   ` Andrew Morton
2011-12-21 22:23 ` [PATCH 4/4] pagemap: document KPF_THP and make page-types aware of it Naoya Horiguchi
2011-12-26  8:42   ` KAMEZAWA Hiroyuki
2011-12-30  4:02   ` KOSAKI Motohiro
2012-01-04 23:57   ` Andrew Morton
