* [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics
@ 2011-02-01  0:33 Dave Hansen
  2011-02-01  0:33 ` [RFC][PATCH 1/6] count transparent hugepage splits Dave Hansen
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-01  0:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Michael J Wolf, Andrea Arcangeli, Dave Hansen

I'm working on some more reports that transparent huge pages and
KSM do not play nicely together.  Basically, whenever THP's are
present along with KSM, there is a lot of attrition over time,
and we do not see much overall progress keeping THP's around:

	http://sr71.net/~dave/ibm/038_System_Anonymous_Pages.png

(That's Karl Rister's graph, thanks Karl!)

However, I realized that we do not currently have a nice way to
find out where individual THP's might be on the system.  We
have an overall count, but no way of telling which processes or
VMAs they might be in.

I started to implement this in the /proc/$pid/smaps code, but
quickly realized that the mm/pagewalk.c code unconditionally
splits THPs up.  This set reworks that code a bit and, in the
end, gives you a per-map count of the number of huge pages.
It also makes it possible for page walks to _not_ split THPs.
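
For illustration only, here is a rough userspace sketch of how the
per-VMA counts could be consumed once the series is applied; it simply
sums the AnonHugePages field that patch 6/6 adds to smaps.  The program
and its details are hypothetical and not part of the series:

	#include <stdio.h>

	int main(int argc, char **argv)
	{
		char path[64], line[256];
		unsigned long kb, total = 0;
		FILE *f;

		/* default to the current process if no pid is given */
		snprintf(path, sizeof(path), "/proc/%s/smaps",
			 argc > 1 ? argv[1] : "self");
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			return 1;
		}
		/* sum every per-VMA "AnonHugePages:  N kB" line */
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
				total += kb;
		fclose(f);
		printf("AnonHugePages total: %lu kB\n", total);
		return 0;
	}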



* [RFC][PATCH 1/6] count transparent hugepage splits
  2011-02-01  0:33 [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Dave Hansen
@ 2011-02-01  0:33 ` Dave Hansen
  2011-02-01  9:58   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  2011-02-01  0:33 ` [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary Dave Hansen
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-01  0:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Michael J Wolf, Andrea Arcangeli, Dave Hansen


The khugepaged process collapses transparent hugepages for us.  Whenever
it collapses a page into a transparent hugepage, we increment a nice
global counter exported in sysfs:

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

But, transparent hugepages also get broken down in quite a few
places in the kernel.  We do not have a good idea how many of
those collapsed pages are "new" versus how many are just fixing up
spots that got split a moment before.

Note: "splits" and "collapses" are opposites in this context.

This patch adds a new sysfs file:

	/sys/kernel/mm/transparent_hugepage/pages_split

It is global, like "pages_collapsed", and is incremented whenever any
transparent hugepage on the system has been broken down in to normal
PAGE_SIZE base pages.  This way, we can get an idea how well khugepaged
is keeping up collapsing pages that have been split.

I put it under /sys/kernel/mm/transparent_hugepage/ instead of the
khugepaged/ directory since it is not strictly related to
khugepaged; it can get incremented on pages other than those
collapsed by khugepaged.

The variable storing this is a plain integer.  It needs the same
amount of locking that 'khugepaged_pages_collapsed' has, for
instance.
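
As a rough illustration (a hypothetical userspace sketch, not part of
this patch), the balance between the two counters could be watched
like this:

	#include <stdio.h>

	/* read one counter from the transparent_hugepage sysfs directory */
	static unsigned long thp_counter(const char *name)
	{
		char path[128];
		unsigned long val = 0;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/transparent_hugepage/%s", name);
		f = fopen(path, "r");
		if (f) {
			if (fscanf(f, "%lu", &val) != 1)
				val = 0;
			fclose(f);
		}
		return val;
	}

	int main(void)
	{
		printf("pages_split: %lu pages_collapsed: %lu\n",
		       thp_counter("pages_split"),
		       thp_counter("khugepaged/pages_collapsed"));
		return 0;
	}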

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/Documentation/vm/transhuge.txt |    8 ++++++++
 linux-2.6.git-dave/mm/huge_memory.c               |   12 ++++++++++++
 2 files changed, 20 insertions(+)

diff -puN mm/huge_memory.c~count-thp-splits mm/huge_memory.c
--- linux-2.6.git/mm/huge_memory.c~count-thp-splits	2011-01-31 11:05:51.484526127 -0800
+++ linux-2.6.git-dave/mm/huge_memory.c	2011-01-31 11:05:51.508526113 -0800
@@ -38,6 +38,8 @@ unsigned long transparent_hugepage_flags
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
 
+static unsigned int huge_pages_split;
+
 /* default scan 8*512 pte (or vmas) every 30 second */
 static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*8;
 static unsigned int khugepaged_pages_collapsed;
@@ -307,12 +309,20 @@ static struct kobj_attribute debug_cow_a
 	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
 #endif /* CONFIG_DEBUG_VM */
 
+static ssize_t pages_split_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", huge_pages_split);
+}
+static struct kobj_attribute pages_split_attr = __ATTR_RO(pages_split);
+
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
 #ifdef CONFIG_DEBUG_VM
 	&debug_cow_attr.attr,
 #endif
+	&pages_split_attr.attr,
 	NULL,
 };
 
@@ -1314,6 +1324,8 @@ static int __split_huge_page_map(struct 
 	}
 	spin_unlock(&mm->page_table_lock);
 
+	if (ret)
+		huge_pages_split++;
 	return ret;
 }
 
diff -puN fs/proc/meminfo.c~count-thp-splits fs/proc/meminfo.c
diff -puN Documentation/vm/transhuge.txt~count-thp-splits Documentation/vm/transhuge.txt
--- linux-2.6.git/Documentation/vm/transhuge.txt~count-thp-splits	2011-01-31 11:05:51.500526118 -0800
+++ linux-2.6.git-dave/Documentation/vm/transhuge.txt	2011-01-31 11:05:51.508526113 -0800
@@ -120,6 +120,14 @@ khugepaged will be automatically started
 transparent_hugepage/enabled is set to "always" or "madvise, and it'll
 be automatically shutdown if it's set to "never".
 
+Not all kernel code is aware of transparent hugepages.  Sometimes,
+it is necessary to fall back to small pages so that this kernel
+code can deal with small pages.  This might also happen if, for
+instance, munmap() was called in the middle of a transparent huge
+page.  We track these splits in:
+
+	/sys/kernel/mm/transparent_hugepage/pages_split
+
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
_


* [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary
  2011-02-01  0:33 [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Dave Hansen
  2011-02-01  0:33 ` [RFC][PATCH 1/6] count transparent hugepage splits Dave Hansen
@ 2011-02-01  0:33 ` Dave Hansen
  2011-02-01 10:04   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  2011-02-01  0:34 ` [RFC][PATCH 3/6] break out smaps_pte_entry() from smaps_pte_range() Dave Hansen
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-01  0:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Michael J Wolf, Andrea Arcangeli, Dave Hansen


Right now, if a mm_walk has either ->pte_entry or ->pmd_entry
set, it will unconditionally split any transparent huge pages
it runs into.  In practice, that means that anyone doing a

	cat /proc/$pid/smaps

will unconditionally break down every huge page in the process
and depend on khugepaged to re-collapse it later.  This is
fairly suboptimal.

This patch changes that behavior.  It teaches each ->pmd_entry
handler (there are three) that they must break down the THPs
themselves.  Also, the _generic_ code will never break down
a THP unless a ->pte_entry handler is actually set.

This means that the ->pmd_entry handlers can now choose to
deal with THPs without breaking them down.
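
For example, a handler that only understands ptes must now do the
split itself.  A rough sketch of that pattern (mirroring what the diff
below does to the three in-tree handlers; this is not a new handler):

	static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
				     unsigned long end, struct mm_walk *walk)
	{
		pte_t *pte;
		spinlock_t *ptl;

		/* opt out of THP awareness by splitting any huge pmd here */
		split_huge_page_pmd(walk->mm, pmd);
		if (pmd_none_or_clear_bad(pmd))
			return 0;

		pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
		for (; addr != end; pte++, addr += PAGE_SIZE) {
			/* ... per-pte work on *pte ... */
		}
		pte_unmap_unlock(pte - 1, ptl);
		return 0;
	}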


---

 linux-2.6.git-dave/fs/proc/task_mmu.c |    6 ++++++
 linux-2.6.git-dave/mm/pagewalk.c      |   24 ++++++++++++++++++++----
 2 files changed, 26 insertions(+), 4 deletions(-)

diff -puN mm/pagewalk.c~pagewalk-dont-always-split-thp mm/pagewalk.c
--- linux-2.6.git/mm/pagewalk.c~pagewalk-dont-always-split-thp	2011-01-27 10:57:02.309914973 -0800
+++ linux-2.6.git-dave/mm/pagewalk.c	2011-01-27 10:57:02.317914965 -0800
@@ -33,19 +33,35 @@ static int walk_pmd_range(pud_t *pud, un
 
 	pmd = pmd_offset(pud, addr);
 	do {
+	again:
 		next = pmd_addr_end(addr, end);
-		split_huge_page_pmd(walk->mm, pmd);
-		if (pmd_none_or_clear_bad(pmd)) {
+		if (pmd_none(*pmd)) {
 			if (walk->pte_hole)
 				err = walk->pte_hole(addr, next, walk);
 			if (err)
 				break;
 			continue;
 		}
+		/*
+		 * This implies that each ->pmd_entry() handler
+		 * needs to know about pmd_trans_huge() pmds
+		 */
 		if (walk->pmd_entry)
 			err = walk->pmd_entry(pmd, addr, next, walk);
-		if (!err && walk->pte_entry)
-			err = walk_pte_range(pmd, addr, next, walk);
+		if (err)
+			break;
+
+		/*
+		 * Check this here so we only break down trans_huge
+		 * pages when we _need_ to
+		 */
+		if (!walk->pte_entry)
+			continue;
+
+		split_huge_page_pmd(walk->mm, pmd);
+		if (pmd_none_or_clear_bad(pmd))
+			goto again;
+		err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
diff -puN fs/proc/task_mmu.c~pagewalk-dont-always-split-thp fs/proc/task_mmu.c
--- linux-2.6.git/fs/proc/task_mmu.c~pagewalk-dont-always-split-thp	2011-01-27 10:57:02.313914969 -0800
+++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-01-27 10:57:02.321914961 -0800
@@ -343,6 +343,8 @@ static int smaps_pte_range(pmd_t *pmd, u
 	struct page *page;
 	int mapcount;
 
+	split_huge_page_pmd(walk->mm, pmd);
+
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
@@ -467,6 +469,8 @@ static int clear_refs_pte_range(pmd_t *p
 	spinlock_t *ptl;
 	struct page *page;
 
+	split_huge_page_pmd(walk->mm, pmd);
+
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
@@ -623,6 +627,8 @@ static int pagemap_pte_range(pmd_t *pmd,
 	pte_t *pte;
 	int err = 0;
 
+	split_huge_page_pmd(walk->mm, pmd);
+
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
 	for (; addr != end; addr += PAGE_SIZE) {
_


* [RFC][PATCH 3/6] break out smaps_pte_entry() from smaps_pte_range()
  2011-02-01  0:33 [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Dave Hansen
  2011-02-01  0:33 ` [RFC][PATCH 1/6] count transparent hugepage splits Dave Hansen
  2011-02-01  0:33 ` [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary Dave Hansen
@ 2011-02-01  0:34 ` Dave Hansen
  2011-02-01 10:08   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  2011-02-01  0:34 ` [RFC][PATCH 4/6] pass pte size argument in to smaps_pte_entry() Dave Hansen
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-01  0:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Michael J Wolf, Andrea Arcangeli, Dave Hansen


We will use smaps_pte_entry() in a moment to handle both small
and transparent huge pages.  But, we must break it out of
smaps_pte_range() first.


---

 linux-2.6.git-dave/fs/proc/task_mmu.c |   85 ++++++++++++++++++----------------
 1 file changed, 46 insertions(+), 39 deletions(-)

diff -puN fs/proc/task_mmu.c~break-out-smaps_pte_entry fs/proc/task_mmu.c
--- linux-2.6.git/fs/proc/task_mmu.c~break-out-smaps_pte_entry	2011-01-27 11:03:06.761548697 -0800
+++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-01-27 11:03:06.773548685 -0800
@@ -333,56 +333,63 @@ struct mem_size_stats {
 	u64 pss;
 };
 
-static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			   struct mm_walk *walk)
+
+static void smaps_pte_entry(pte_t ptent, unsigned long addr,
+		struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = mss->vma;
-	pte_t *pte, ptent;
-	spinlock_t *ptl;
 	struct page *page;
 	int mapcount;
 
-	split_huge_page_pmd(walk->mm, pmd);
-
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	for (; addr != end; pte++, addr += PAGE_SIZE) {
-		ptent = *pte;
+	if (is_swap_pte(ptent)) {
+		mss->swap += PAGE_SIZE;
+		return;
+	}
 
-		if (is_swap_pte(ptent)) {
-			mss->swap += PAGE_SIZE;
-			continue;
-		}
+	if (!pte_present(ptent))
+		return;
 
-		if (!pte_present(ptent))
-			continue;
+	page = vm_normal_page(vma, addr, ptent);
+	if (!page)
+		return;
+
+	if (PageAnon(page))
+		mss->anonymous += PAGE_SIZE;
+
+	mss->resident += PAGE_SIZE;
+	/* Accumulate the size in pages that have been accessed. */
+	if (pte_young(ptent) || PageReferenced(page))
+		mss->referenced += PAGE_SIZE;
+	mapcount = page_mapcount(page);
+	if (mapcount >= 2) {
+		if (pte_dirty(ptent) || PageDirty(page))
+			mss->shared_dirty += PAGE_SIZE;
+		else
+			mss->shared_clean += PAGE_SIZE;
+		mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
+	} else {
+		if (pte_dirty(ptent) || PageDirty(page))
+			mss->private_dirty += PAGE_SIZE;
+		else
+			mss->private_clean += PAGE_SIZE;
+		mss->pss += (PAGE_SIZE << PSS_SHIFT);
+	}
+}
 
-		page = vm_normal_page(vma, addr, ptent);
-		if (!page)
-			continue;
+static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+			   struct mm_walk *walk)
+{
+	struct mem_size_stats *mss = walk->private;
+	struct vm_area_struct *vma = mss->vma;
+	pte_t *pte;
+	spinlock_t *ptl;
 
-		if (PageAnon(page))
-			mss->anonymous += PAGE_SIZE;
+	split_huge_page_pmd(walk->mm, pmd);
 
-		mss->resident += PAGE_SIZE;
-		/* Accumulate the size in pages that have been accessed. */
-		if (pte_young(ptent) || PageReferenced(page))
-			mss->referenced += PAGE_SIZE;
-		mapcount = page_mapcount(page);
-		if (mapcount >= 2) {
-			if (pte_dirty(ptent) || PageDirty(page))
-				mss->shared_dirty += PAGE_SIZE;
-			else
-				mss->shared_clean += PAGE_SIZE;
-			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
-		} else {
-			if (pte_dirty(ptent) || PageDirty(page))
-				mss->private_dirty += PAGE_SIZE;
-			else
-				mss->private_clean += PAGE_SIZE;
-			mss->pss += (PAGE_SIZE << PSS_SHIFT);
-		}
-	}
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE)
+		smaps_pte_entry(*pte, addr, walk);
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
 	return 0;
diff -puN mm/huge_memory.c~break-out-smaps_pte_entry mm/huge_memory.c
_


* [RFC][PATCH 4/6] pass pte size argument in to smaps_pte_entry()
  2011-02-01  0:33 [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Dave Hansen
                   ` (2 preceding siblings ...)
  2011-02-01  0:34 ` [RFC][PATCH 3/6] break out smaps_pte_entry() from smaps_pte_range() Dave Hansen
@ 2011-02-01  0:34 ` Dave Hansen
  2011-02-01 10:09   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  2011-02-01  0:34 ` [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds Dave Hansen
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-01  0:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Michael J Wolf, Andrea Arcangeli, Dave Hansen


This patch adds an argument to the new smaps_pte_entry()
function to let it account in units other than PAGE_SIZE.
I changed all of the PAGE_SIZE sites, even though
not all of them can be reached for transparent huge pages,
just so this will continue to work without changes as THPs
are improved.


---

 linux-2.6.git-dave/fs/proc/task_mmu.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff -puN fs/proc/task_mmu.c~give-smaps_pte_range-a-size-arg fs/proc/task_mmu.c
--- linux-2.6.git/fs/proc/task_mmu.c~give-smaps_pte_range-a-size-arg	2011-01-31 11:14:20.400194563 -0800
+++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-01-31 11:15:02.680162163 -0800
@@ -335,7 +335,7 @@ struct mem_size_stats {
 
 
 static void smaps_pte_entry(pte_t ptent, unsigned long addr,
-		struct mm_walk *walk)
+		unsigned long ptent_size, struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = mss->vma;
@@ -343,7 +343,7 @@ static void smaps_pte_entry(pte_t ptent,
 	int mapcount;
 
 	if (is_swap_pte(ptent)) {
-		mss->swap += PAGE_SIZE;
+		mss->swap += ptent_size;
 		return;
 	}
 
@@ -355,25 +355,25 @@ static void smaps_pte_entry(pte_t ptent,
 		return;
 
 	if (PageAnon(page))
-		mss->anonymous += PAGE_SIZE;
+		mss->anonymous += ptent_size;
 
-	mss->resident += PAGE_SIZE;
+	mss->resident += ptent_size;
 	/* Accumulate the size in pages that have been accessed. */
 	if (pte_young(ptent) || PageReferenced(page))
-		mss->referenced += PAGE_SIZE;
+		mss->referenced += ptent_size;
 	mapcount = page_mapcount(page);
 	if (mapcount >= 2) {
 		if (pte_dirty(ptent) || PageDirty(page))
-			mss->shared_dirty += PAGE_SIZE;
+			mss->shared_dirty += ptent_size;
 		else
-			mss->shared_clean += PAGE_SIZE;
-		mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
+			mss->shared_clean += ptent_size;
+		mss->pss += (ptent_size << PSS_SHIFT) / mapcount;
 	} else {
 		if (pte_dirty(ptent) || PageDirty(page))
-			mss->private_dirty += PAGE_SIZE;
+			mss->private_dirty += ptent_size;
 		else
-			mss->private_clean += PAGE_SIZE;
-		mss->pss += (PAGE_SIZE << PSS_SHIFT);
+			mss->private_clean += ptent_size;
+		mss->pss += (ptent_size << PSS_SHIFT);
 	}
 }
 
@@ -389,7 +389,7 @@ static int smaps_pte_range(pmd_t *pmd, u
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
-		smaps_pte_entry(*pte, addr, walk);
+		smaps_pte_entry(*pte, addr, PAGE_SIZE, walk);
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
 	return 0;
_


* [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds
  2011-02-01  0:33 [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Dave Hansen
                   ` (3 preceding siblings ...)
  2011-02-01  0:34 ` [RFC][PATCH 4/6] pass pte size argument in to smaps_pte_entry() Dave Hansen
@ 2011-02-01  0:34 ` Dave Hansen
  2011-02-01 10:11   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  2011-02-01  0:34 ` [RFC][PATCH 6/6] have smaps show transparent huge pages Dave Hansen
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-01  0:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Michael J Wolf, Andrea Arcangeli, Dave Hansen


This adds code to explicitly detect and handle
pmd_trans_huge() pmds.  It then passes HPAGE_SIZE units
into the smaps_pte_entry() function instead of PAGE_SIZE.

This means that using /proc/$pid/smaps will no longer
cause THPs to be broken down into small pages.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/fs/proc/task_mmu.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff -puN fs/proc/task_mmu.c~teach-smaps_pte_range-about-thp-pmds fs/proc/task_mmu.c
--- linux-2.6.git/fs/proc/task_mmu.c~teach-smaps_pte_range-about-thp-pmds	2011-01-31 15:10:55.387856566 -0800
+++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-01-31 15:25:12.231239775 -0800
@@ -1,5 +1,6 @@
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
+#include <linux/huge_mm.h>
 #include <linux/mount.h>
 #include <linux/seq_file.h>
 #include <linux/highmem.h>
@@ -385,6 +386,17 @@ static int smaps_pte_range(pmd_t *pmd, u
 	pte_t *pte;
 	spinlock_t *ptl;
 
+	if (pmd_trans_huge(*pmd)) {
+		if (pmd_trans_splitting(*pmd)) {
+			spin_unlock(&walk->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+			spin_lock(&walk->mm->page_table_lock);
+			goto normal_ptes;
+		}
+		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
+		return 0;
+	}
+normal_ptes:
 	split_huge_page_pmd(walk->mm, pmd);
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
diff -puN mm/vmscan.c~teach-smaps_pte_range-about-thp-pmds mm/vmscan.c
diff -puN include/trace/events/vmscan.h~teach-smaps_pte_range-about-thp-pmds include/trace/events/vmscan.h
diff -puN mm/pagewalk.c~teach-smaps_pte_range-about-thp-pmds mm/pagewalk.c
diff -puN mm/huge_memory.c~teach-smaps_pte_range-about-thp-pmds mm/huge_memory.c
diff -puN mm/memory.c~teach-smaps_pte_range-about-thp-pmds mm/memory.c
diff -puN include/linux/huge_mm.h~teach-smaps_pte_range-about-thp-pmds include/linux/huge_mm.h
diff -puN mm/internal.h~teach-smaps_pte_range-about-thp-pmds mm/internal.h
_


* [RFC][PATCH 6/6] have smaps show transparent huge pages
  2011-02-01  0:33 [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Dave Hansen
                   ` (4 preceding siblings ...)
  2011-02-01  0:34 ` [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds Dave Hansen
@ 2011-02-01  0:34 ` Dave Hansen
  2011-02-01 10:12   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  2011-02-01 15:38 ` [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Andrea Arcangeli
  2011-02-03 21:54 ` David Rientjes
  7 siblings, 2 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-01  0:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, Michael J Wolf, Andrea Arcangeli, Dave Hansen


Now that the mere act of _looking_ at /proc/$pid/smaps will not
destroy transparent huge pages, tell how much of the VMA is
actually mapped with them.

This way, we can make sure that we're getting THPs where we
expect to see them.
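
As an illustration, a THP-backed anonymous VMA's smaps entry would
then contain a line like this (values made up):

	Anonymous:        204800 kB
	AnonHugePages:    200704 kB
	Swap:                  0 kB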

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/fs/proc/task_mmu.c |    4 ++++
 1 file changed, 4 insertions(+)

diff -puN fs/proc/task_mmu.c~teach-smaps-thp fs/proc/task_mmu.c
--- linux-2.6.git/fs/proc/task_mmu.c~teach-smaps-thp	2011-01-31 16:25:15.948685305 -0800
+++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-01-31 16:25:34.600671992 -0800
@@ -331,6 +331,7 @@ struct mem_size_stats {
 	unsigned long private_dirty;
 	unsigned long referenced;
 	unsigned long anonymous;
+	unsigned long anonymous_thp;
 	unsigned long swap;
 	u64 pss;
 };
@@ -394,6 +395,7 @@ static int smaps_pte_range(pmd_t *pmd, u
 			spin_lock(&walk->mm->page_table_lock);
 			goto normal_ptes;
 		}
+		mss->anonymous_thp += HPAGE_SIZE;
 		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
 		return 0;
 	}
@@ -438,6 +440,7 @@ static int show_smap(struct seq_file *m,
 		   "Private_Dirty:  %8lu kB\n"
 		   "Referenced:     %8lu kB\n"
 		   "Anonymous:      %8lu kB\n"
+		   "AnonHugePages:  %8lu kB\n"
 		   "Swap:           %8lu kB\n"
 		   "KernelPageSize: %8lu kB\n"
 		   "MMUPageSize:    %8lu kB\n"
@@ -451,6 +454,7 @@ static int show_smap(struct seq_file *m,
 		   mss.private_dirty >> 10,
 		   mss.referenced >> 10,
 		   mss.anonymous >> 10,
+		   mss.anonymous_thp >> 10,
 		   mss.swap >> 10,
 		   vma_kernel_pagesize(vma) >> 10,
 		   vma_mmu_pagesize(vma) >> 10,
_


* Re: [RFC][PATCH 1/6] count transparent hugepage splits
  2011-02-01  0:33 ` [RFC][PATCH 1/6] count transparent hugepage splits Dave Hansen
@ 2011-02-01  9:58   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  1 sibling, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2011-02-01  9:58 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, Jan 31, 2011 at 04:33:58PM -0800, Dave Hansen wrote:
> 
> The khugepaged process collapses transparent hugepages for us.  Whenever
> it collapses a page into a transparent hugepage, we increment a nice
> global counter exported in sysfs:
> 
> 	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
> 
> But, transparent hugepages also get broken down in quite a few
> places in the kernel.  We do not have a good idea how many of
> those collapsed pages are "new" versus how many are just fixing up
> spots that got split a moment before.
> 
> Note: "splits" and "collapses" are opposites in this context.
> 
> This patch adds a new sysfs file:
> 
> 	/sys/kernel/mm/transparent_hugepage/pages_split
> 
> It is global, like "pages_collapsed", and is incremented whenever any
> transparent hugepage on the system has been broken down in to normal
> PAGE_SIZE base pages.  This way, we can get an idea how well khugepaged
> is keeping up collapsing pages that have been split.
> 
> I put it under /sys/kernel/mm/transparent_hugepage/ instead of the
> khugepaged/ directory since it is not strictly related to
> khugepaged; it can get incremented on pages other than those
> collapsed by khugepaged.
> 
> The variable storing this is a plain integer.  It needs the same
> amount of locking that 'khugepaged_pages_collapsed' has, for
> instance.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary
  2011-02-01  0:33 ` [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary Dave Hansen
@ 2011-02-01 10:04   ` Johannes Weiner
  2011-02-01 15:03     ` Dave Hansen
  2011-02-03 21:22   ` David Rientjes
  1 sibling, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2011-02-01 10:04 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, Jan 31, 2011 at 04:33:59PM -0800, Dave Hansen wrote:
> 
> Right now, if a mm_walk has either ->pte_entry or ->pmd_entry
> set, it will unconditionally split any transparent huge pages
> it runs into.  In practice, that means that anyone doing a
> 
> 	cat /proc/$pid/smaps
> 
> will unconditionally break down every huge page in the process
> and depend on khugepaged to re-collapse it later.  This is
> fairly suboptimal.
> 
> This patch changes that behavior.  It teaches each ->pmd_entry
> handler (there are three) that they must break down the THPs
> themselves.  Also, the _generic_ code will never break down
> a THP unless a ->pte_entry handler is actually set.
> 
> This means that the ->pmd_entry handlers can now choose to
> deal with THPs without breaking them down.

Makes perfect sense.  But you forgot to push down the splitting into
the two handlers in mm/memcontrol.c.


* Re: [RFC][PATCH 3/6] break out smaps_pte_entry() from smaps_pte_range()
  2011-02-01  0:34 ` [RFC][PATCH 3/6] break out smaps_pte_entry() from smaps_pte_range() Dave Hansen
@ 2011-02-01 10:08   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  1 sibling, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2011-02-01 10:08 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, Jan 31, 2011 at 04:34:01PM -0800, Dave Hansen wrote:
> 
> We will use smaps_pte_entry() in a moment to handle both small
> and transparent huge pages.  But, we must break it out of
> smaps_pte_range() first.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [RFC][PATCH 4/6] pass pte size argument in to smaps_pte_entry()
  2011-02-01  0:34 ` [RFC][PATCH 4/6] pass pte size argument in to smaps_pte_entry() Dave Hansen
@ 2011-02-01 10:09   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  1 sibling, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2011-02-01 10:09 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, Jan 31, 2011 at 04:34:02PM -0800, Dave Hansen wrote:
> 
> This patch adds an argument to the new smaps_pte_entry()
> function to let it account in units other than PAGE_SIZE.
> I changed all of the PAGE_SIZE sites, even though
> not all of them can be reached for transparent huge pages,
> just so this will continue to work without changes as THPs
> are improved.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds
  2011-02-01  0:34 ` [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds Dave Hansen
@ 2011-02-01 10:11   ` Johannes Weiner
  2011-02-01 15:02     ` Dave Hansen
  2011-02-03 21:22   ` David Rientjes
  1 sibling, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2011-02-01 10:11 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, Jan 31, 2011 at 04:34:03PM -0800, Dave Hansen wrote:
> 
> This adds code to explicitly detect and handle
> pmd_trans_huge() pmds.  It then passes HPAGE_SIZE units
> into the smaps_pte_entry() function instead of PAGE_SIZE.
> 
> This means that using /proc/$pid/smaps will no longer
> cause THPs to be broken down into small pages.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
> 
>  linux-2.6.git-dave/fs/proc/task_mmu.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff -puN fs/proc/task_mmu.c~teach-smaps_pte_range-about-thp-pmds fs/proc/task_mmu.c
> --- linux-2.6.git/fs/proc/task_mmu.c~teach-smaps_pte_range-about-thp-pmds	2011-01-31 15:10:55.387856566 -0800
> +++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-01-31 15:25:12.231239775 -0800
> @@ -1,5 +1,6 @@
>  #include <linux/mm.h>
>  #include <linux/hugetlb.h>
> +#include <linux/huge_mm.h>
>  #include <linux/mount.h>
>  #include <linux/seq_file.h>
>  #include <linux/highmem.h>
> @@ -385,6 +386,17 @@ static int smaps_pte_range(pmd_t *pmd, u
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> +	if (pmd_trans_huge(*pmd)) {
> +		if (pmd_trans_splitting(*pmd)) {
> +			spin_unlock(&walk->mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma, pmd);
> +			spin_lock(&walk->mm->page_table_lock);
> +			goto normal_ptes;
> +		}
> +		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
> +		return 0;
> +	}
> +normal_ptes:
>  	split_huge_page_pmd(walk->mm, pmd);

This line can go away now...?

Looks good to me, otherwise.


* Re: [RFC][PATCH 6/6] have smaps show transparent huge pages
  2011-02-01  0:34 ` [RFC][PATCH 6/6] have smaps show transparent huge pages Dave Hansen
@ 2011-02-01 10:12   ` Johannes Weiner
  2011-02-03 21:22   ` David Rientjes
  1 sibling, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2011-02-01 10:12 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, Jan 31, 2011 at 04:34:05PM -0800, Dave Hansen wrote:
> 
> Now that the mere act of _looking_ at /proc/$pid/smaps will not
> destroy transparent huge pages, tell how much of the VMA is
> actually mapped with them.
> 
> This way, we can make sure that we're getting THPs where we
> expect to see them.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds
  2011-02-01 10:11   ` Johannes Weiner
@ 2011-02-01 15:02     ` Dave Hansen
  2011-02-01 16:09       ` Andrea Arcangeli
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2011-02-01 15:02 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Tue, 2011-02-01 at 11:11 +0100, Johannes Weiner wrote:
> On Mon, Jan 31, 2011 at 04:34:03PM -0800, Dave Hansen wrote:
> > +	if (pmd_trans_huge(*pmd)) {
> > +		if (pmd_trans_splitting(*pmd)) {
> > +			spin_unlock(&walk->mm->page_table_lock);
> > +			wait_split_huge_page(vma->anon_vma, pmd);
> > +			spin_lock(&walk->mm->page_table_lock);
> > +			goto normal_ptes;
> > +		}
> > +		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
> > +		return 0;
> > +	}
> > +normal_ptes:
> >  	split_huge_page_pmd(walk->mm, pmd);
> 
> This line can go away now...?

I did this because I was unsure what keeps khugepaged away from the
newly-split ptes between the wait_split_huge_page() and the
reacquisition of the mm->page_table_lock.  mmap_sem, perhaps?

Looking at follow_page() and some of the other wait_split_huge_page()
callers, it looks like this is unnecessary.

-- Dave



* Re: [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary
  2011-02-01 10:04   ` Johannes Weiner
@ 2011-02-01 15:03     ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-01 15:03 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Tue, 2011-02-01 at 11:04 +0100, Johannes Weiner wrote:
> On Mon, Jan 31, 2011 at 04:33:59PM -0800, Dave Hansen wrote:
> > Right now, if a mm_walk has either ->pte_entry or ->pmd_entry
> > set, it will unconditionally split any transparent huge pages
> > it runs into.  In practice, that means that anyone doing a
> > 
> >       cat /proc/$pid/smaps
> > 
> > will unconditionally break down every huge page in the process
> > and depend on khugepaged to re-collapse it later.  This is
> > fairly suboptimal.
> > 
> > This patch changes that behavior.  It teaches each ->pmd_entry
> > handler (there are three) that they must break down the THPs
> > themselves.  Also, the _generic_ code will never break down
> > a THP unless a ->pte_entry handler is actually set.
> > 
> > This means that the ->pmd_entry handlers can now choose to
> > deal with THPs without breaking them down.
> 
> Makes perfect sense.  But you forgot to push down the splitting into
> the two handlers in mm/memcontrol.c. 

I did indeed.  I'll go fix those up.  Thanks for the review!

-- Dave



* Re: [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics
  2011-02-01  0:33 [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Dave Hansen
                   ` (5 preceding siblings ...)
  2011-02-01  0:34 ` [RFC][PATCH 6/6] have smaps show transparent huge pages Dave Hansen
@ 2011-02-01 15:38 ` Andrea Arcangeli
  2011-02-01 17:15   ` Dave Hansen
  2011-02-03 21:54 ` David Rientjes
  7 siblings, 1 reply; 38+ messages in thread
From: Andrea Arcangeli @ 2011-02-01 15:38 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf

On Mon, Jan 31, 2011 at 04:33:57PM -0800, Dave Hansen wrote:
> I'm working on some more reports that transparent huge pages and
> KSM do not play nicely together.  Basically, whenever THP's are
> present along with KSM, there is a lot of attrition over time,
> and we do not see much overall progress keeping THP's around:
> 
> 	http://sr71.net/~dave/ibm/038_System_Anonymous_Pages.png
> 
> (That's Karl Rister's graph, thanks Karl!)

Well if the pages_sharing/pages_shared count goes up, this is a
feature not a bug.... You need to print that too in the chart to show
this is not ok.

KSM will also slow down performance during copy-on-writes when
pages_sharing goes up, not only because of the non-linearity it creates
inside 2m chunks (which makes it mandatory to use ptes and not a huge
pmd; unfortunately that's not an inefficiency of some sort that can be
optimized away). We sure could change KSM to merge 2M pages instead of
4k pages, but then the memory density would decrease by several orders
of magnitude, making the KSM scan almost useless (ok, with guests
heavily using THP that may change, but all pagecache is still 4k... so
for now it'd be next to useless).

I'm in the process of adding a no-ksm option to qemu-kvm command line
so you can selectively choose which VM runs with KSM or not (otherwise
you can switch ksm off globally to be sure not to degrade
performance).

> However, I realized that we do not currently have a nice way to find
> out where individual THP's might be on the system.  We have an
> overall count, but no way of telling which processes or VMAs they
> might be in.
> 
> I started to implement this in the /proc/$pid/smaps code, but
> quickly realized that the mm/pagewalk.c code unconditionally
> splits THPs up.  This set reworks that code a bit and, in the
> end, gives you a per-map count of the number of huge pages.
> It also makes it possible for page walks to _not_ split THPs.

That's indeed something on the TODO list; thanks a lot for working on
this (I think we discussed it earlier too).

I would prefer to close the issues that you previously reported,
something with mmap_sem and issues like that, before adding more
features, but I don't want to defer things either, so it's up to
you.


* Re: [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds
  2011-02-01 15:02     ` Dave Hansen
@ 2011-02-01 16:09       ` Andrea Arcangeli
  0 siblings, 0 replies; 38+ messages in thread
From: Andrea Arcangeli @ 2011-02-01 16:09 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Johannes Weiner, linux-kernel, linux-mm, Michael J Wolf

On Tue, Feb 01, 2011 at 07:02:30AM -0800, Dave Hansen wrote:
> On Tue, 2011-02-01 at 11:11 +0100, Johannes Weiner wrote:
> > On Mon, Jan 31, 2011 at 04:34:03PM -0800, Dave Hansen wrote:
> > > +	if (pmd_trans_huge(*pmd)) {
> > > +		if (pmd_trans_splitting(*pmd)) {
> > > +			spin_unlock(&walk->mm->page_table_lock);
> > > +			wait_split_huge_page(vma->anon_vma, pmd);
> > > +			spin_lock(&walk->mm->page_table_lock);
> > > +			goto normal_ptes;
> > > +		}
> > > +		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
> > > +		return 0;
> > > +	}
> > > +normal_ptes:
> > >  	split_huge_page_pmd(walk->mm, pmd);
> > 
> > This line can go away now...?
> 
> I did this because I was unsure what keeps khugepaged away from the
> newly-split ptes between the wait_split_huge_page() and the
> reacquisition of the mm->page_table_lock.  mmap_sem, perhaps?

Any of mmap_sem read mode, PG_lock and anon_vma_lock keeps khugepaged
away.

> Looking at follow_page() and some of the other wait_split_huge_page()
> callers, it looks like this is unnecessary.

When wait_split_huge_page returns after the pmd was splitting, the pmd
can't return huge under you as long as you hold any of the above.


* Re: [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics
  2011-02-01 15:38 ` [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Andrea Arcangeli
@ 2011-02-01 17:15   ` Dave Hansen
  2011-02-01 20:39     ` Andrea Arcangeli
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2011-02-01 17:15 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm, Michael J Wolf

On Tue, 2011-02-01 at 16:38 +0100, Andrea Arcangeli wrote:
> On Mon, Jan 31, 2011 at 04:33:57PM -0800, Dave Hansen wrote:
> > I'm working on some more reports that transparent huge pages and
> > KSM do not play nicely together.  Basically, whenever THP's are
> > present along with KSM, there is a lot of attrition over time,
> > and we do not see much overall progress keeping THP's around:
> > 
> > 	http://sr71.net/~dave/ibm/038_System_Anonymous_Pages.png
> > 
> > (That's Karl Rister's graph, thanks Karl!)
> 
> Well if the pages_sharing/pages_shared count goes up, this is a
> feature not a bug.... You need to print that too in the chart to show
> this is not ok

Here are the KSM sharing bits for the same run:

	http://sr71.net/~dave/ibm/009_KSM_Pages.png

It bounces around a little bit on the ends, but it's fairly static
during the test, even when there's a good downward slope on the THP's.

Hot off the presses, Karl also managed to do a run last night with the
khugepaged scanning rates turned all the way up:

	http://sr71.net/~dave/ibm/038_System_Anonymous_Pages-scan-always.png

The THP's there are a lot more stable.  I'd read that as saying that the
scanning probably just isn't keeping up with whatever is breaking the
pages up.

> KSM will slowdown performance also during copy-on-writes when
> pages_sharing goes up, not only because of creating non-linearity
> inside 2m chunks (which makes mandatory to use ptes and not hugepmd,
> it's not an inefficiency of some sort that can be optimized away
> unfortunately). We sure could change KSM to merge 2M pages instead of
> 4k pages, but then the memory-density would decrease of several order
> of magnitudes making the KSM scan almost useless (ok, with guest
> heavily using THP that may change, but all pagecache is still 4k... so
> for now it'd be next to useless).

Yup, unless we do something special, the odds of sharing those 2MB
suckers are near zero.

> I would prefer to close the issues that you just previously reported,
> sometime with mmap_sem and issues like that, before adding more
> features though but I don't want to defer things either so it's up to
> you.

I'm happy to hold on to them for another release.  I'm actually going to
go look at the freezes I saw now that I have these out in the wild.
I'll probably stick them in a git tree and keep them up to date.

Are there any other THP issues you're chasing at the moment?

-- Dave



* Re: [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics
  2011-02-01 17:15   ` Dave Hansen
@ 2011-02-01 20:39     ` Andrea Arcangeli
  2011-02-01 20:56       ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Andrea Arcangeli @ 2011-02-01 20:39 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrew Morton

On Tue, Feb 01, 2011 at 09:15:47AM -0800, Dave Hansen wrote:
> On Tue, 2011-02-01 at 16:38 +0100, Andrea Arcangeli wrote:
> > On Mon, Jan 31, 2011 at 04:33:57PM -0800, Dave Hansen wrote:
> > > I'm working on some more reports that transparent huge pages and
> > > KSM do not play nicely together.  Basically, whenever THP's are
> > > present along with KSM, there is a lot of attrition over time,
> > > and we do not see much overall progress keeping THP's around:
> > > 
> > > 	http://sr71.net/~dave/ibm/038_System_Anonymous_Pages.png
> > > 
> > > (That's Karl Rister's graph, thanks Karl!)
> > 
> > Well if the pages_sharing/pages_shared count goes up, this is a
> > feature not a bug.... You need to print that too in the chart to show
> > this is not ok
> 
> Here are the KSM sharing bits for the same run:
> 
> 	http://sr71.net/~dave/ibm/009_KSM_Pages.png
> 
> It bounces around a little bit on the ends, but it's fairly static
> during the test, even when there's a good downward slope on the THP's.
> 
> Hot off the presses, Karl also managed to do a run last night with the
> khugepaged scanning rates turned all the way up:
> 
> 	http://sr71.net/~dave/ibm/038_System_Anonymous_Pages-scan-always.png
> 
> The THP's there are a lot more stable.  I'd read that as saying that the
> scanning probably just isn't keeping up with whatever is breaking the
> pages up.

This is exactly the case.  But note that it's not obvious that keeping
the hugepage count steady is beneficial in terms of final
performance: what happens now is that you split and collapse (collapse
requires copy so it's more costly than split) hugepages at higher
frequency than before (the hugepages are still split but now they're
collapsed faster and ksm has to split them again). So now the speedup
from hugepages needs to also offset the cost of the more frequent
split/collapse events that didn't happen before.

So I guess, considering the run time is on the order of 2/3 hours and
there are "only" 88G of memory, speeding up khugepaged is going to be
beneficial considering how big a boost hugepages give to the guest with
NPT/EPT and an even bigger boost for regular shadow paging, but it also
depends on the guest. In short, khugepaged by default is tuned in a way
that it can't get in the way of the CPU.

I recall Avi once suggested that someday we'd need khugepaged running
at 100% cpu load, as may happen for ksmd. It's not needed yet though,
but this is just to say you're perfectly right in assuming the default
scanning rate may be too slow. But it's really meant to be tuned
depending on the size of the system, how many cores, how many gb of
ram to scan, whether KSM is on, etc...

KSM's internal checksum check before adding rmap_items to the unstable
tree should already prevent false sharing between ksm and
khugepaged. Plus, khugepaged will stay away if the page is shared by
KSM. Because of this, the risk of false sharing is low even when
running both at 100% load on the same VM.

It may also be a good idea to increase khugepaged's default scan rate
for large systems with lots of cpus and lots of memory; the default is
super paranoid, tuned optimistically so that netbooks and cellphones
still get a net gain if any collapse happens ;).

I'm generally very pleased to see these charts.

It reminds me that Microsoft and Oracle can't support hugepages and
KSM simultaneously, and at least Oracle Xen doesn't even have KSM at
all (there's some attempt to share guest cache or find equal I/O or
stuff like that but it won't help for guest anonymous memory which
incidentally is where HPC at places like LHC needs KSM running). Xen
doesn't even support swapping (it's easy to use THP in a hypervisor if
you don't support swapping; of course Oracle Xen wouldn't support
swapping even if it used 4k pages.. ;).

So the current status of KSM over THP, where KSM scans inside THP pages,
and where 2m pages, 4k regular pages, and 4k KSM nonlinear pages with
special rmap_items are mixed in the same VMA, and where both regular 4k
pages, 2m pages and even the ksm pages are all swappable, and with the
mmu notifier keeping the shadow MMU or NPT/EPT in full synchrony with
the Linux VM so that both the KSM and THP algorithms are 100%
secondary-mmu agnostic, is quite a milestone for 2.6.38, and I hope
this will be good proof of how the KVM design is superior and of how,
with less effort, we can achieve things that they may never support with
equal or better performance (in the KSM case our shared guest
filesystem cache remains readonly mapped and natively handled by the
guest without any paravirt, as if KSM didn't merge anything).

> I'm happy to hold on to them for another release.  I'm actually going to
> go look at the freezes I saw now that I have these out in the wild.

I'm also going to have a closer look. For the other report, search
lkml for the subject "khugepaged" from Jindřich (I think
compaction is too heavy; it's a walk in the park compared to lumpy
reclaim, but we need to make it more latency friendly. I also got a
report that latency increases with heavy I/O, which I think is the
same thing that Jindřich sees).

Yet another report shows a full hang with khugepaged waiting on
mmap_sem, but I think that is not related to THP; maybe something hung
on the mmap_sem, and khugepaged doesn't seem to be the holder of it.

I myself had an issue with PROVE_LOCKING and I deadlocked inside
_raw_spin_unlock_irq (how can I possibly deadlock inside unlock? must
be prove locking bug? I didn't yet check if there have been updates in
that area).

There is also a known bug that I haven't fixed yet, and it's the next
thing I need to address: the pgd_lock held with irqs disabled while
taking the page_table_lock. That is buggy with THP=off too, but it
only triggers with NR_CPUS small enough to disable PT locks (for
small smp builds the PT lock becomes the page_table_lock, and then it'll
deadlock when sending an IPI with the page_table_lock held, because the
vmalloc fault will take pgd_lock with irqs disabled). I've no idea why
pgd_lock is taken with irqs disabled.

No other issues. It's not as bad as it seems; likely adjusting
congestion_wait in compaction, breaking out of the compaction loop, and
a few other tweaks to compaction will solve the latency issue with lots
of dirty cache. The pgd_lock is likely easily fixable too. Your issue
and the mmap_sem read mode hang I didn't look into yet, but they may
not be related to this. Also, your issue looked like the pgd_lock bug
that I already know about, so maybe it's nothing new (maybe the
mmap_sem one too, I've no idea at the moment). I never reproduced this
myself yet. Anyway there's nothing really concerning: no mm corruption
or weird oops whatsoever has ever been reported, and a spinlock or
mmap_sem deadlock is not enough to worry me ;).

> I'll probably stick them in a git tree and keep them up to date.
> 
> Are there any other THP issues you're chasing at the moment?

I reviewed it too, and I see no problems. So if you fix those two bits
that Johannes pointed out (remove the split that is superfluous after
wait_ if you hold mmap_sem or rmap lock and add the split to the other
walkers not yet covered) I think they can go in now without waiting.


* Re: [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics
  2011-02-01 20:39     ` Andrea Arcangeli
@ 2011-02-01 20:56       ` Dave Hansen
  2011-02-02  0:07         ` Andrea Arcangeli
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2011-02-01 20:56 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrew Morton

On Tue, 2011-02-01 at 21:39 +0100, Andrea Arcangeli wrote:
> So now the speedup
> from hugepages needs to also offset the cost of the more frequent
> split/collapse events that didn't happen before.

My concern here is the downward slope.  I interpret that as saying that
we'll eventually have _zero_ THPs.  Plus, the benefits are decreasing
constantly, even though the scanning overhead is fixed (or increasing
even).

> So I guess, considering the run time is on the order of 2/3 hours and
> there are "only" 88G of memory, speeding up khugepaged is going to be
> beneficial considering how big a boost hugepages give to the guest with
> NPT/EPT and an even bigger boost for regular shadow paging, but it also
> depends on the guest. In short, khugepaged by default is tuned in a way
> that it can't get in the way of the CPU.

I guess we could also try and figure out whether the khugepaged CPU
overhead really comes from the scanning or the collapsing operations
themselves.  Should be as easy as some oprofiling.

If it really is the scanning, I bet we could be a lot more efficient
with khugepaged as well.  In the case of KVM guests, we're going to have
awfully fixed virtual addresses and processes where collapsing can take
place.

It might make sense to just have split_huge_page() stick the vaddr and
the mm in a queue.  khugepaged could scan those addresses first instead
of just going after the system as a whole.
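
A purely hypothetical sketch of that queue (all names are invented
here; this is not a real patch) might look something like:

	/*
	 * split_huge_page() callers would log the (mm, address) pair so
	 * khugepaged could revisit recently-split ranges before falling
	 * back to its full mm scan.
	 */
	struct split_hint {
		struct list_head list;
		struct mm_struct *mm;
		unsigned long address;
	};

	static LIST_HEAD(split_hint_list);
	static DEFINE_SPINLOCK(split_hint_lock);

	static void note_split(struct mm_struct *mm, unsigned long address)
	{
		struct split_hint *hint = kmalloc(sizeof(*hint), GFP_ATOMIC);

		if (!hint)
			return;	/* hints are best-effort only */
		hint->mm = mm;
		hint->address = address & HPAGE_PMD_MASK;
		spin_lock(&split_hint_lock);
		list_add_tail(&hint->list, &split_hint_list);
		spin_unlock(&split_hint_lock);
	}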

For cases where the page got split, but wasn't modified, should we have
a non-copying, non-allocating fastpath to re-merge it?

-- Dave



* Re: [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics
  2011-02-01 20:56       ` Dave Hansen
@ 2011-02-02  0:07         ` Andrea Arcangeli
  2011-02-08 17:54           ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Andrea Arcangeli @ 2011-02-02  0:07 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrew Morton

On Tue, Feb 01, 2011 at 12:56:41PM -0800, Dave Hansen wrote:
> On Tue, 2011-02-01 at 21:39 +0100, Andrea Arcangeli wrote:
> > So now the speedup
> > from hugepages needs to also offset the cost of the more frequent
> > split/collapse events that didn't happen before.
> 
> My concern here is the downward slope.  I interpret that as saying that
> we'll eventually have _zero_ THPs.  Plus, the benefits are decreasing
> constantly, even though the scanning overhead is fixed (or increasing
> even).

It doesn't seem to be a downward slope; it seems to level out at 900000
pages. As shown by the other chart, a faster khugepaged scan rate would
make it level out at a higher percentage of memory being huge, with an
increased cost in khugepaged, but it may very well pay off for the
final performance (especially on many-core systems).

> I guess we could also try and figure out whether the khugepaged CPU
> overhead really comes from the scanning or the collapsing operations
> themselves.  Should be as easy as some oprofiling.

Actually I already know: the scanning is super fast. So it's no real
big deal to increase the scanning. It's a big deal only if there are
plenty more collapses/splits. Compared to the KSM scan, the
khugepaged scan costs nothing.

> If it really is the scanning, I bet we could be a lot more efficient
> with khugepaged as well.  In the case of KVM guests, we're going to have
> awfully fixed virtual addresses and processes where collapsing can take
> place.
> 
> It might make sense to just have split_huge_page() stick the vaddr and
> the mm in a queue.  khugepaged could scan those addresses first instead
> of just going after the system as a whole.

That would apply to KSM and swapping only though, not to all
split_huge_page callers. It may not be a bad idea. But the scanning
really is fast, so it may not be necessary. Clearly the more memory you
have, the faster the scanning has to be to obtain the same percentage
of memory in hugepages in the presence of KSM.

Also note: did you tune the ksmd scanning values, or did you only run
echo 1 >run?  Clearly, if you increased the ksmd scanning values by
decreasing scan_sleep_millisecs or increasing the pages scanned, you
have to increase the khugepaged scanning values accordingly. I'm not
saying the current default is ok for such a huge system as the one
you're using. But I doubt the ksm default is ok either for such a huge
system. So if you go and tweak ksmd to 100% cpu load (which will also
cause more false sharing, as the interval between the cksum comparisons
before adding to the unstable tree decreases significantly) and
khugepaged doesn't collapse the false-sharing regions, it's normal. (In
that case either slowing down ksm or speeding up khugepaged would help;
slowing down ksm may actually lead to better performance and not much
less memory used.)

In fact, keeping track of split_huge_page locations for khugepaged
may actually hide issues in KSM: if there's a piece of memory flipped
back and forth quickly but that stays at the same value all the time,
the cksum heuristic that decides if the page is constant enough to be
added to the unstable tree may get false positives from the cksum. If
we notice in KSM that the page cows quickly after having been shared a
couple of times, we should stop merging it. It'd be better to improve
KSM's intelligence to avoid false sharing in that case. Now I don't
have enough data (I don't even know what runs in the guest) to clearly
tell if this could ever be because of 1) undetectable false sharing
from KSM through the cksum, 2) a too-fast ksm scan invalidating the
cksum, 3) or a genuinely too-slow khugepaged scan not keeping up with
optimal KSM changes (which would be perfectly normal if the ksmd scan
rate has been increased a lot but khugepaged's wasn't accordingly).

In short, I don't see issues at the moment with this workload if
increasing khugepaged (or slowing down ksm) optimizes it.

> For cases where the page got split, but wasn't modified, should we have
> a non-copying, non-allocating fastpath to re-merge it?

Even if it's modified it's ok: as long as the pages are still
physically contiguous, the collapse can happen in place. I never
dreamed of attempting it, only because of the increased complexity;
__split_huge_page_refcount is complex enough, so since I could avoid
converting from regular pages to a hugepage in place, I happily avoided
it ;). That BUG_ON(page_mapcount != mapcount) gives me nightmares at
night, so I was pleased not to have more of those. Anyway, I think
it's not an important optimization; we should spend more energy on
making sure split_huge_page is never called for nothing. However, in
the mid term I'm not against it; it may well happen sometimes that
the hugepage is split by memory pressure but then the subpages
aren't swapped out.

There's also one other thing to optimize that I think has higher
priority (it's not going to benefit KVM though), which is to collapse
readonly shared anon pages, which khugepaged currently can't do.


* Re: [RFC][PATCH 1/6] count transparent hugepage splits
  2011-02-01  0:33 ` [RFC][PATCH 1/6] count transparent hugepage splits Dave Hansen
  2011-02-01  9:58   ` Johannes Weiner
@ 2011-02-03 21:22   ` David Rientjes
  2011-02-04 21:18     ` Andrea Arcangeli
  1 sibling, 1 reply; 38+ messages in thread
From: David Rientjes @ 2011-02-03 21:22 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, 31 Jan 2011, Dave Hansen wrote:

> 
> The khugepaged process collapses transparent hugepages for us.  Whenever
> it collapses a page into a transparent hugepage, we increment a nice
> global counter exported in sysfs:
> 
> 	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
> 
> But, transparent hugepages also get broken down in quite a few
> places in the kernel.  We do not have a good idea how many of
> those collapsed pages are "new" versus how many are just fixing up
> spots that got split a moment before.
> 
> Note: "splits" and "collapses" are opposites in this context.
> 
> This patch adds a new sysfs file:
> 
> 	/sys/kernel/mm/transparent_hugepage/pages_split
> 
> It is global, like "pages_collapsed", and is incremented whenever any
> transparent hugepage on the system has been broken down in to normal
> PAGE_SIZE base pages.  This way, we can get an idea how well khugepaged
> is keeping up collapsing pages that have been split.
> 
> I put it under /sys/kernel/mm/transparent_hugepage/ instead of the
> khugepaged/ directory since it is not strictly related to
> khugepaged; it can get incremented on pages other than those
> collapsed by khugepaged.
> 
> The variable storing this is a plain integer.  I needs the same
> amount of locking that 'khugepaged_pages_collapsed' has, for
> instance.

i.e. no global locking, but we've accepted the occasional off-by-one 
error (even though splitting of hugepages isn't by any means lightning 
fast and the overhead of atomic ops would be negligible).
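
As an aside, a minimal sketch of the two counter styles being
contrasted here (names are illustrative, not taken from the patch):

	static unsigned long thp_split_count;		/* plain int, as in the patch: racy */
	static atomic_long_t thp_split_count_atomic;	/* the atomic alternative */

	static void note_thp_split(void)
	{
		thp_split_count++;			   /* concurrent splits may lose an increment */
		atomic_long_inc(&thp_split_count_atomic); /* locked RMW: never loses, slightly costlier */
	}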

> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary
  2011-02-01  0:33 ` [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary Dave Hansen
  2011-02-01 10:04   ` Johannes Weiner
@ 2011-02-03 21:22   ` David Rientjes
  2011-02-03 21:33     ` Dave Hansen
  1 sibling, 1 reply; 38+ messages in thread
From: David Rientjes @ 2011-02-03 21:22 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, 31 Jan 2011, Dave Hansen wrote:

> 
> Right now, if a mm_walk has either ->pte_entry or ->pmd_entry
> set, it will unconditionally split any transparent huge pages
> it runs into.  In practice, that means that anyone doing a
> 
> 	cat /proc/$pid/smaps
> 
> will unconditionally break down every huge page in the process
> and depend on khugepaged to re-collapse it later.  This is
> fairly suboptimal.
> 
> This patch changes that behavior.  It teaches each ->pmd_entry
> handler (there are three) that they must break down the THPs
> themselves.  Also, the _generic_ code will never break down
> a THP unless a ->pte_entry handler is actually set.
> 
> This means that the ->pmd_entry handlers can now choose to
> deal with THPs without breaking them down.
> 
> 

No sign-off (goes for patches 3 and 4, as well)?

> ---
> 
>  linux-2.6.git-dave/fs/proc/task_mmu.c |    6 ++++++
>  linux-2.6.git-dave/mm/pagewalk.c      |   24 ++++++++++++++++++++----
>  2 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff -puN mm/pagewalk.c~pagewalk-dont-always-split-thp mm/pagewalk.c
> --- linux-2.6.git/mm/pagewalk.c~pagewalk-dont-always-split-thp	2011-01-27 10:57:02.309914973 -0800
> +++ linux-2.6.git-dave/mm/pagewalk.c	2011-01-27 10:57:02.317914965 -0800
> @@ -33,19 +33,35 @@ static int walk_pmd_range(pud_t *pud, un
>  
>  	pmd = pmd_offset(pud, addr);
>  	do {
> +	again:

checkpatch will warn about the indent.

>  		next = pmd_addr_end(addr, end);
> -		split_huge_page_pmd(walk->mm, pmd);
> -		if (pmd_none_or_clear_bad(pmd)) {
> +		if (pmd_none(*pmd)) {

Not sure why this has been changed from pmd_none_or_clear_bad(), that's 
been done even prior to THP.

>  			if (walk->pte_hole)
>  				err = walk->pte_hole(addr, next, walk);
>  			if (err)
>  				break;
>  			continue;
>  		}
> +		/*
> +		 * This implies that each ->pmd_entry() handler
> +		 * needs to know about pmd_trans_huge() pmds
> +		 */

Probably needs to be documented somewhere for users of pagewalk?

>  		if (walk->pmd_entry)
>  			err = walk->pmd_entry(pmd, addr, next, walk);
> -		if (!err && walk->pte_entry)
> -			err = walk_pte_range(pmd, addr, next, walk);
> +		if (err)
> +			break;
> +
> +		/*
> +		 * Check this here so we only break down trans_huge
> +		 * pages when we _need_ to
> +		 */
> +		if (!walk->pte_entry)
> +			continue;
> +
> +		split_huge_page_pmd(walk->mm, pmd);
> +		if (pmd_none_or_clear_bad(pmd))
> +			goto again;
> +		err = walk_pte_range(pmd, addr, next, walk);
>  		if (err)
>  			break;
>  	} while (pmd++, addr = next, addr != end);
> diff -puN fs/proc/task_mmu.c~pagewalk-dont-always-split-thp fs/proc/task_mmu.c
> --- linux-2.6.git/fs/proc/task_mmu.c~pagewalk-dont-always-split-thp	2011-01-27 10:57:02.313914969 -0800
> +++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-01-27 10:57:02.321914961 -0800
> @@ -343,6 +343,8 @@ static int smaps_pte_range(pmd_t *pmd, u
>  	struct page *page;
>  	int mapcount;
>  
> +	split_huge_page_pmd(walk->mm, pmd);
> +
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> @@ -467,6 +469,8 @@ static int clear_refs_pte_range(pmd_t *p
>  	spinlock_t *ptl;
>  	struct page *page;
>  
> +	split_huge_page_pmd(walk->mm, pmd);
> +
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> @@ -623,6 +627,8 @@ static int pagemap_pte_range(pmd_t *pmd,
>  	pte_t *pte;
>  	int err = 0;
>  
> +	split_huge_page_pmd(walk->mm, pmd);
> +
>  	/* find the first VMA at or above 'addr' */
>  	vma = find_vma(walk->mm, addr);
>  	for (; addr != end; addr += PAGE_SIZE) {

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/6] break out smaps_pte_entry() from smaps_pte_range()
  2011-02-01  0:34 ` [RFC][PATCH 3/6] break out smaps_pte_entry() from smaps_pte_range() Dave Hansen
  2011-02-01 10:08   ` Johannes Weiner
@ 2011-02-03 21:22   ` David Rientjes
  2011-02-03 21:40     ` Dave Hansen
  1 sibling, 1 reply; 38+ messages in thread
From: David Rientjes @ 2011-02-03 21:22 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, 31 Jan 2011, Dave Hansen wrote:

> 
> We will use smaps_pte_entry() in a moment to handle both small
> and transparent large pages.  But, we must break it out of
> smaps_pte_range() first.
> 

The extraction from smaps_pte_range() looks good.  What's the performance 
impact on very frequent consumers of /proc/pid/smaps, though, from the 
calls throughout the iteration if smaps_pte_entry() doesn't get inlined 
(presumably because you'll be reusing the extracted function again 
elsewhere)?

> 
> ---
> 
>  linux-2.6.git-dave/fs/proc/task_mmu.c |   85 ++++++++++++++++++----------------
>  1 file changed, 46 insertions(+), 39 deletions(-)
> 
> diff -puN fs/proc/task_mmu.c~break-out-smaps_pte_entry fs/proc/task_mmu.c
> --- linux-2.6.git/fs/proc/task_mmu.c~break-out-smaps_pte_entry	2011-01-27 11:03:06.761548697 -0800
> +++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-01-27 11:03:06.773548685 -0800
> @@ -333,56 +333,63 @@ struct mem_size_stats {
>  	u64 pss;
>  };
>  
> -static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> -			   struct mm_walk *walk)
> +
> +static void smaps_pte_entry(pte_t ptent, unsigned long addr,
> +		struct mm_walk *walk)
>  {
>  	struct mem_size_stats *mss = walk->private;
>  	struct vm_area_struct *vma = mss->vma;
> -	pte_t *pte, ptent;
> -	spinlock_t *ptl;
>  	struct page *page;
>  	int mapcount;
>  
> -	split_huge_page_pmd(walk->mm, pmd);
> -
> -	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> -	for (; addr != end; pte++, addr += PAGE_SIZE) {
> -		ptent = *pte;
> +	if (is_swap_pte(ptent)) {
> +		mss->swap += PAGE_SIZE;
> +		return;
> +	}
>  
> -		if (is_swap_pte(ptent)) {
> -			mss->swap += PAGE_SIZE;
> -			continue;
> -		}
> +	if (!pte_present(ptent))
> +		return;
>  
> -		if (!pte_present(ptent))
> -			continue;
> +	page = vm_normal_page(vma, addr, ptent);
> +	if (!page)
> +		return;
> +
> +	if (PageAnon(page))
> +		mss->anonymous += PAGE_SIZE;
> +
> +	mss->resident += PAGE_SIZE;
> +	/* Accumulate the size in pages that have been accessed. */
> +	if (pte_young(ptent) || PageReferenced(page))
> +		mss->referenced += PAGE_SIZE;
> +	mapcount = page_mapcount(page);
> +	if (mapcount >= 2) {
> +		if (pte_dirty(ptent) || PageDirty(page))
> +			mss->shared_dirty += PAGE_SIZE;
> +		else
> +			mss->shared_clean += PAGE_SIZE;
> +		mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
> +	} else {
> +		if (pte_dirty(ptent) || PageDirty(page))
> +			mss->private_dirty += PAGE_SIZE;
> +		else
> +			mss->private_clean += PAGE_SIZE;
> +		mss->pss += (PAGE_SIZE << PSS_SHIFT);
> +	}
> +}
>  
> -		page = vm_normal_page(vma, addr, ptent);
> -		if (!page)
> -			continue;
> +static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> +			   struct mm_walk *walk)
> +{
> +	struct mem_size_stats *mss = walk->private;
> +	struct vm_area_struct *vma = mss->vma;
> +	pte_t *pte;
> +	spinlock_t *ptl;
>  
> -		if (PageAnon(page))
> -			mss->anonymous += PAGE_SIZE;
> +	split_huge_page_pmd(walk->mm, pmd);
>  
> -		mss->resident += PAGE_SIZE;
> -		/* Accumulate the size in pages that have been accessed. */
> -		if (pte_young(ptent) || PageReferenced(page))
> -			mss->referenced += PAGE_SIZE;
> -		mapcount = page_mapcount(page);
> -		if (mapcount >= 2) {
> -			if (pte_dirty(ptent) || PageDirty(page))
> -				mss->shared_dirty += PAGE_SIZE;
> -			else
> -				mss->shared_clean += PAGE_SIZE;
> -			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
> -		} else {
> -			if (pte_dirty(ptent) || PageDirty(page))
> -				mss->private_dirty += PAGE_SIZE;
> -			else
> -				mss->private_clean += PAGE_SIZE;
> -			mss->pss += (PAGE_SIZE << PSS_SHIFT);
> -		}
> -	}
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE)
> +		smaps_pte_entry(*pte, addr, walk);
>  	pte_unmap_unlock(pte - 1, ptl);
>  	cond_resched();
>  	return 0;
> diff -puN mm/huge_memory.c~break-out-smaps_pte_entry mm/huge_memory.c
> _

Is there a missing change to mm/huge_memory.c?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 4/6] pass pte size argument in to smaps_pte_entry()
  2011-02-01  0:34 ` [RFC][PATCH 4/6] pass pte size argument in to smaps_pte_entry() Dave Hansen
  2011-02-01 10:09   ` Johannes Weiner
@ 2011-02-03 21:22   ` David Rientjes
  1 sibling, 0 replies; 38+ messages in thread
From: David Rientjes @ 2011-02-03 21:22 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, 31 Jan 2011, Dave Hansen wrote:

> 
> This patch adds an argument to the new smaps_pte_entry()
> function to let it account in things other than PAGE_SIZE
> units.  I changed all of the PAGE_SIZE sites, even though
> not all of them can be reached for transparent huge pages,
> just so this will continue to work without changes as THPs
> are improved.
> 

When signed-off:

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds
  2011-02-01  0:34 ` [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds Dave Hansen
  2011-02-01 10:11   ` Johannes Weiner
@ 2011-02-03 21:22   ` David Rientjes
  2011-02-03 21:34     ` Dave Hansen
  1 sibling, 1 reply; 38+ messages in thread
From: David Rientjes @ 2011-02-03 21:22 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, 31 Jan 2011, Dave Hansen wrote:

> 
> This adds code to explicitly detect and handle
> pmd_trans_huge() pmds.  It then passes HPAGE_SIZE units
> into the smaps_pte_entry() function instead of PAGE_SIZE.
> 
> This means that using /proc/$pid/smaps now will no longer
> cause THPs to be broken down into small pages.
> 

Nice!

> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
> 
>  linux-2.6.git-dave/fs/proc/task_mmu.c |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff -puN fs/proc/task_mmu.c~teach-smaps_pte_range-about-thp-pmds fs/proc/task_mmu.c
> --- linux-2.6.git/fs/proc/task_mmu.c~teach-smaps_pte_range-about-thp-pmds	2011-01-31 15:10:55.387856566 -0800
> +++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-01-31 15:25:12.231239775 -0800
> @@ -1,5 +1,6 @@
>  #include <linux/mm.h>
>  #include <linux/hugetlb.h>
> +#include <linux/huge_mm.h>
>  #include <linux/mount.h>
>  #include <linux/seq_file.h>
>  #include <linux/highmem.h>
> @@ -385,6 +386,17 @@ static int smaps_pte_range(pmd_t *pmd, u
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> +	if (pmd_trans_huge(*pmd)) {
> +		if (pmd_trans_splitting(*pmd)) {
> +			spin_unlock(&walk->mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma, pmd);
> +			spin_lock(&walk->mm->page_table_lock);
> +			goto normal_ptes;
> +		}
> +		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
> +		return 0;
> +	}
> +normal_ptes:

Small nitpick: the label isn't necessary, just use an else-clause on your 
nested conditional.
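
For reference, one way the hunk above could be restructured without
the label, along the lines of this suggestion (a sketch, not tested):

	if (pmd_trans_huge(*pmd)) {
		if (pmd_trans_splitting(*pmd)) {
			/* wait for the split, then fall through to the pte walk */
			spin_unlock(&walk->mm->page_table_lock);
			wait_split_huge_page(vma->anon_vma, pmd);
			spin_lock(&walk->mm->page_table_lock);
		} else {
			/* account the whole huge pmd in one go */
			smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
			return 0;
		}
	}

	split_huge_page_pmd(walk->mm, pmd);
	/* ...regular pte walk as before... */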

>  	split_huge_page_pmd(walk->mm, pmd);
>  
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> diff -puN mm/vmscan.c~teach-smaps_pte_range-about-thp-pmds mm/vmscan.c
> diff -puN include/trace/events/vmscan.h~teach-smaps_pte_range-about-thp-pmds include/trace/events/vmscan.h
> diff -puN mm/pagewalk.c~teach-smaps_pte_range-about-thp-pmds mm/pagewalk.c
> diff -puN mm/huge_memory.c~teach-smaps_pte_range-about-thp-pmds mm/huge_memory.c
> diff -puN mm/memory.c~teach-smaps_pte_range-about-thp-pmds mm/memory.c
> diff -puN include/linux/huge_mm.h~teach-smaps_pte_range-about-thp-pmds include/linux/huge_mm.h
> diff -puN mm/internal.h~teach-smaps_pte_range-about-thp-pmds mm/internal.h
> _

What are all these?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 6/6] have smaps show transparent huge pages
  2011-02-01  0:34 ` [RFC][PATCH 6/6] have smaps show transparent huge pages Dave Hansen
  2011-02-01 10:12   ` Johannes Weiner
@ 2011-02-03 21:22   ` David Rientjes
  1 sibling, 0 replies; 38+ messages in thread
From: David Rientjes @ 2011-02-03 21:22 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, 31 Jan 2011, Dave Hansen wrote:

> 
> Now that the mere act of _looking_ at /proc/$pid/smaps will not
> destroy transparent huge pages, tell how much of the VMA is
> actually mapped with them.
> 
> This way, we can make sure that we're getting THPs where we
> expect to see them.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary
  2011-02-03 21:22   ` David Rientjes
@ 2011-02-03 21:33     ` Dave Hansen
  2011-02-03 21:46       ` David Rientjes
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2011-02-03 21:33 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Thu, 2011-02-03 at 13:22 -0800, David Rientjes wrote:
> > diff -puN mm/pagewalk.c~pagewalk-dont-always-split-thp mm/pagewalk.c
> > --- linux-2.6.git/mm/pagewalk.c~pagewalk-dont-always-split-thp	2011-01-27 10:57:02.309914973 -0800
> > +++ linux-2.6.git-dave/mm/pagewalk.c	2011-01-27 10:57:02.317914965 -0800
> > @@ -33,19 +33,35 @@ static int walk_pmd_range(pud_t *pud, un
> >  
> >  	pmd = pmd_offset(pud, addr);
> >  	do {
> > +	again:
> 
> checkpatch will warn about the indent.
> 
> >  		next = pmd_addr_end(addr, end);
> > -		split_huge_page_pmd(walk->mm, pmd);
> > -		if (pmd_none_or_clear_bad(pmd)) {
> > +		if (pmd_none(*pmd)) {
> 
> Not sure why this has been changed from pmd_none_or_clear_bad(), that's 
> been done even prior to THP.

The bad check will trigger on huge pmds.  We can not use it here.  We
can, however, use pmd_none().  The bad check was moved below to where we
actually dereference the pmd.
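
Condensed, the resulting flow in walk_pmd_range() that this describes
(paraphrased from the hunk quoted above, not the literal patch):

	again:
		next = pmd_addr_end(addr, end);
		if (pmd_none(*pmd)) {			/* safe even on a huge pmd */
			/* ->pte_hole handling, then continue */
		}
		if (walk->pmd_entry)
			err = walk->pmd_entry(pmd, addr, next, walk);
		if (walk->pte_entry) {
			split_huge_page_pmd(walk->mm, pmd);
			if (pmd_none_or_clear_bad(pmd))	/* only once the pmd is regular */
				goto again;
			err = walk_pte_range(pmd, addr, next, walk);
		}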

> >  			if (walk->pte_hole)
> >  				err = walk->pte_hole(addr, next, walk);
> >  			if (err)
> >  				break;
> >  			continue;
> >  		}
> > +		/*
> > +		 * This implies that each ->pmd_entry() handler
> > +		 * needs to know about pmd_trans_huge() pmds
> > +		 */
> 
> Probably needs to be documented somewhere for users of pagewalk?

Probably, but we don't currently have any central documentation for it.
Guess we could make some, or just ensure that all the users got updated.
Any ideas where to put it other than the mm_walk struct?

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds
  2011-02-03 21:22   ` David Rientjes
@ 2011-02-03 21:34     ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-03 21:34 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Thu, 2011-02-03 at 13:22 -0800, David Rientjes wrote:
> > @@ -385,6 +386,17 @@ static int smaps_pte_range(pmd_t *pmd, u
> >  	pte_t *pte;
> >  	spinlock_t *ptl;
> >  
> > +	if (pmd_trans_huge(*pmd)) {
> > +		if (pmd_trans_splitting(*pmd)) {
> > +			spin_unlock(&walk->mm->page_table_lock);
> > +			wait_split_huge_page(vma->anon_vma, pmd);
> > +			spin_lock(&walk->mm->page_table_lock);
> > +			goto normal_ptes;
> > +		}
> > +		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
> > +		return 0;
> > +	}
> > +normal_ptes:
> 
> Small nitpick: the label isn't necessary, just use an else-clause on your 
> nested conditional.

Works for me.

> >  	split_huge_page_pmd(walk->mm, pmd);
> >  
> >  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > diff -puN mm/vmscan.c~teach-smaps_pte_range-about-thp-pmds mm/vmscan.c
> > diff -puN include/trace/events/vmscan.h~teach-smaps_pte_range-about-thp-pmds include/trace/events/vmscan.h
> > diff -puN mm/pagewalk.c~teach-smaps_pte_range-about-thp-pmds mm/pagewalk.c
> > diff -puN mm/huge_memory.c~teach-smaps_pte_range-about-thp-pmds mm/huge_memory.c
> > diff -puN mm/memory.c~teach-smaps_pte_range-about-thp-pmds mm/memory.c
> > diff -puN include/linux/huge_mm.h~teach-smaps_pte_range-about-thp-pmds include/linux/huge_mm.h
> > diff -puN mm/internal.h~teach-smaps_pte_range-about-thp-pmds mm/internal.h
> > _
> 
> What are all these?

Junk.  I'll pull them out.

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 3/6] break out smaps_pte_entry() from smaps_pte_range()
  2011-02-03 21:22   ` David Rientjes
@ 2011-02-03 21:40     ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-03 21:40 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Thu, 2011-02-03 at 13:22 -0800, David Rientjes wrote:
> On Mon, 31 Jan 2011, Dave Hansen wrote:
> > We will use smaps_pte_entry() in a moment to handle both small
> > and transparent large pages.  But, we must break it out of
> > smaps_pte_range() first.
> 
> The extraction from smaps_pte_range() looks good.  What's the performance 
> impact on very frequent consumers of /proc/pid/smaps, though, as the 
> result of the calls throughout the iteration if smaps_pte_entry() doesn't 
> get inlined (supposedly because you'll be reusing the extracted function 
> again elsewhere)?

We could try to coerce the compiler into always inlining it, I guess.  I just
can't imagine this changes the cost _that_ much.  Unless I have some
specific concerns, I tend to leave this up to the compiler, and none of
the users look particularly fastpathy or performance sensitive to me.
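
If it ever did matter, forcing the inlining would be a one-liner; a
sketch, assuming the kernel's __always_inline annotation:

	/* force the per-pte helper to be inlined into its callers */
	static __always_inline void smaps_pte_entry(pte_t ptent, unsigned long addr,
						    struct mm_walk *walk)
	{
		/* body exactly as in the patch */
	}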

...
> > -	}
> > +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (; addr != end; pte++, addr += PAGE_SIZE)
> > +		smaps_pte_entry(*pte, addr, walk);
> >  	pte_unmap_unlock(pte - 1, ptl);
> >  	cond_resched();
> >  	return 0;
> > diff -puN mm/huge_memory.c~break-out-smaps_pte_entry mm/huge_memory.c
> > _
> 
> Is there a missing change to mm/huge_memory.c?

Nope, it was just more of those empty diffs like in the last patch.
It's cruft from patch-scripts and some code that I use to ensure I don't
miss file edits when making patches.  I'll pull them out.

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary
  2011-02-03 21:33     ` Dave Hansen
@ 2011-02-03 21:46       ` David Rientjes
  2011-02-04 17:19         ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: David Rientjes @ 2011-02-03 21:46 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Thu, 3 Feb 2011, Dave Hansen wrote:

> > > diff -puN mm/pagewalk.c~pagewalk-dont-always-split-thp mm/pagewalk.c
> > > --- linux-2.6.git/mm/pagewalk.c~pagewalk-dont-always-split-thp	2011-01-27 10:57:02.309914973 -0800
> > > +++ linux-2.6.git-dave/mm/pagewalk.c	2011-01-27 10:57:02.317914965 -0800
> > > @@ -33,19 +33,35 @@ static int walk_pmd_range(pud_t *pud, un
> > >  
> > >  	pmd = pmd_offset(pud, addr);
> > >  	do {
> > > +	again:
> > 
> > checkpatch will warn about the indent.
> > 
> > >  		next = pmd_addr_end(addr, end);
> > > -		split_huge_page_pmd(walk->mm, pmd);
> > > -		if (pmd_none_or_clear_bad(pmd)) {
> > > +		if (pmd_none(*pmd)) {
> > 
> > Not sure why this has been changed from pmd_none_or_clear_bad(), that's 
> > been done even prior to THP.
> 
> The bad check will trigger on huge pmds.  We can not use it here.  We
> can, however, use pmd_none().  The bad check was moved below to where we
> actually dereference the pmd.
> 

Ah, right, thanks.

> > >  			if (walk->pte_hole)
> > >  				err = walk->pte_hole(addr, next, walk);
> > >  			if (err)
> > >  				break;
> > >  			continue;
> > >  		}
> > > +		/*
> > > +		 * This implies that each ->pmd_entry() handler
> > > +		 * needs to know about pmd_trans_huge() pmds
> > > +		 */
> > 
> > Probably needs to be documented somewhere for users of pagewalk?
> 
> Probably, but we don't currently have any central documentation for it.
> Guess we could make some, or just ensure that all the users got updated.
> Any ideas where to put it other than the mm_walk struct?
> 

I think noting it where struct mm_walk is declared would be best (just a 
"/* must handle pmd_trans_huge() */" would be sufficient) although 
eventually it might be cleaner to add a ->pmd_huge_entry().
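
Something like this where struct mm_walk is declared (include/linux/mm.h
at the time); a sketch with the surrounding fields abbreviated:

	struct mm_walk {
		/* other callbacks (pgd/pud/pte_hole/...) omitted */
		int (*pmd_entry)(pmd_t *, unsigned long, unsigned long,
				 struct mm_walk *);	/* must handle pmd_trans_huge() pmds */
		int (*pte_entry)(pte_t *, unsigned long, unsigned long,
				 struct mm_walk *);
		struct mm_struct *mm;
		void *private;
	};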

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics
  2011-02-01  0:33 [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Dave Hansen
                   ` (6 preceding siblings ...)
  2011-02-01 15:38 ` [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Andrea Arcangeli
@ 2011-02-03 21:54 ` David Rientjes
  7 siblings, 0 replies; 38+ messages in thread
From: David Rientjes @ 2011-02-03 21:54 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Mon, 31 Jan 2011, Dave Hansen wrote:

> I'm working on some more reports that transparent huge pages and
> KSM do not play nicely together.  Basically, whenever THP's are
> present along with KSM, there is a lot of attrition over time,
> and we do not see much overall progress keeping THP's around:
> 
> 	http://sr71.net/~dave/ibm/038_System_Anonymous_Pages.png
> 
> (That's Karl Rister's graph, thanks Karl!)
> 
> However, I realized that we do not currently have a nice way to
> find out where individual THP's might be on the system.  We
> have an overall count, but no way of telling which processes or
> VMAs they might be in.
> 
> I started to implement this in the /proc/$pid/smaps code, but
> quickly realized that the lib/pagewalk.c code unconditionally
> splits THPs up.  This set reworks that code a bit and, in the
> end, gives you a per-map count of the numbers of huge pages.
> It also makes it possible for page walks to _not_ split THPs.
> 

Nice!  I'd like to start using this patchset immediately; I'm hoping 
you'll re-propose it with the fixes soon.

Thanks Dave.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary
  2011-02-03 21:46       ` David Rientjes
@ 2011-02-04 17:19         ` Dave Hansen
  2011-02-04 21:10           ` Andrea Arcangeli
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2011-02-04 17:19 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli

On Thu, 2011-02-03 at 13:46 -0800, David Rientjes wrote:
> > Probably, but we don't currently have any central documentation for it.
> > Guess we could make some, or just ensure that all the users got updated.
> > Any ideas where to put it other than the mm_walk struct?
> 
> I think noting it where struct mm_walk is declared would be best (just a 
> "/* must handle pmd_trans_huge() */" would be sufficient) although 
> eventually it might be cleaner to add a ->pmd_huge_entry(). 

For code maintenance, I really like _not_ hiding this in the API
somewhere.  This way, we have a great, self-explanatory tag wherever
code (possibly) hasn't properly dealt with THPs.  We get a nice,
greppable, cscope'able:

	split_huge_page_pmd()

wherever we need to "teach" the code about THP.

It's kinda like the BKL. :)

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary
  2011-02-04 17:19         ` Dave Hansen
@ 2011-02-04 21:10           ` Andrea Arcangeli
  0 siblings, 0 replies; 38+ messages in thread
From: Andrea Arcangeli @ 2011-02-04 21:10 UTC (permalink / raw)
  To: Dave Hansen; +Cc: David Rientjes, linux-kernel, linux-mm, Michael J Wolf

On Fri, Feb 04, 2011 at 09:19:12AM -0800, Dave Hansen wrote:
> For code maintenance, I really like _not_ hiding this in the API
> somewhere.  This way, we have a great, self-explanatory tag wherever
> code (possibly) hasn't properly dealt with THPs.  We get a nice,
> greppable, cscope'able:
> 
> 	split_huge_page_pmd()
> 
> wherever we need to "teach" the code about THP.
> 
> It's kinda like the BKL. :)

It is in my view too ;).

However, currently it's not that greppable unless we differentiate it
a bit from the legitimate/optimal usages. split_huge_page_pmd isn't
always a sign of code that's not THP aware. It is a sign of
non-THP-aware code only for cases like smaps that are readonly in
terms of vma/pte mangling. mprotect, for example, isn't a readonly
thing and its code is already fully THP aware, yet split_huge_page_pmd
still comes in very handy there when userland asks to create a vma
that can't fit a hugepmd (and there are several other places like
that). When that happens (say, changing the protection of only the
last 4k of a hugepage-backed mapping), replacing the hugepmd with a
regular pmd pointing to a pte (where we can alter the protection of
only that last 4k) becomes compulsory. So it's not always a sign of a
missing optimization; sometimes it's needed and userland should be
optimized instead ;).
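
To make that userland pattern concrete, a minimal, hypothetical
userspace snippet (sizes assume 2M transparent hugepages; nothing here
is taken from the thread's patches):

	#include <sys/mman.h>
	#include <stddef.h>

	int main(void)
	{
		size_t huge = 2UL * 1024 * 1024;
		/* anonymous mapping large enough to (potentially) be THP-backed */
		char *p = mmap(NULL, huge, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		/* changing only the last 4k cannot be expressed with a single
		 * hugepmd, so the kernel must split it here */
		mprotect(p + huge - 4096, 4096, PROT_READ);
		return 0;
	}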

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 1/6] count transparent hugepage splits
  2011-02-03 21:22   ` David Rientjes
@ 2011-02-04 21:18     ` Andrea Arcangeli
  2011-02-04 21:28       ` Dave Hansen
  0 siblings, 1 reply; 38+ messages in thread
From: Andrea Arcangeli @ 2011-02-04 21:18 UTC (permalink / raw)
  To: David Rientjes; +Cc: Dave Hansen, linux-kernel, linux-mm, Michael J Wolf

On Thu, Feb 03, 2011 at 01:22:14PM -0800, David Rientjes wrote:
> i.e. no global locking, but we've accepted the occassional off-by-one 
> error (even though splitting of hugepages isn't by any means lightning 
> fast and the overhead of atomic ops would be negligible).

Agreed, losing an increment is not a problem, but on very large
systems this will become a bottleneck. It's not super urgent, but I
think it needs to become a per-cpu counter sooner rather than later
(not needed immediately, but I would appreciate an incremental patch
soon to address it). split_huge_page is already fully SMP scalable if
the rmap isn't shared (i.e. fully SMP scalable across different
execves), and I'd like it to stay that way because split_huge_page can
run at high frequency at times from different processes, so on very
large systems that cacheline bouncing around 1024 cpus may be
measurable. pages_collapsed is not a problem because it's only
incremented by one kernel thread, so it can't be contended. Again, not
super urgent, but better to optimize it ;).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 1/6] count transparent hugepage splits
  2011-02-04 21:18     ` Andrea Arcangeli
@ 2011-02-04 21:28       ` Dave Hansen
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2011-02-04 21:28 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David Rientjes, linux-kernel, linux-mm, Michael J Wolf

On Fri, 2011-02-04 at 22:18 +0100, Andrea Arcangeli wrote:
> On Thu, Feb 03, 2011 at 01:22:14PM -0800, David Rientjes wrote:
> > i.e. no global locking, but we've accepted the occassional off-by-one 
> > error (even though splitting of hugepages isn't by any means lightning 
> > fast and the overhead of atomic ops would be negligible).
> 
> Agreed losing an increment is not a problem, but in very large systems
> it will become a bottleneck. It's not super urgent, but I think it
> needs to become a per-cpu counter sooner than later (not needed
> immediately but I would appreciate an incremental patch soon to
> address that). 

Seems like something that would be fairly trivial with the existing
count_vm_event() infrastructure.  Any reason not to use that?  I'll be
happy to tack a patch on to my series.
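
Roughly what that would look like, as a sketch against the existing
vmstat event machinery (the event name here is made up):

	/* include/linux/vmstat.h: add an event to enum vm_event_item */
	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
		THP_SPLIT_EVENT,	/* hypothetical name */
	#endif

	/* mm/huge_memory.c: bump the per-cpu counter where the split happens */
	count_vm_event(THP_SPLIT_EVENT);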

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics
  2011-02-02  0:07         ` Andrea Arcangeli
@ 2011-02-08 17:54           ` Dave Hansen
  2011-02-08 18:17             ` Andrea Arcangeli
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2011-02-08 17:54 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrew Morton

On Wed, 2011-02-02 at 01:07 +0100, Andrea Arcangeli wrote:
> > I guess we could also try and figure out whether the khugepaged CPU
> > overhead really comes from the scanning or the collapsing operations
> > themselves.  Should be as easy as some oprofiling.
> 
> Actually I already know, the scanning is super fast. So it's no real
> big deal to increase the scanning. It's big deal only if there are
> plenty more of collapse/split. Compared to the KSM scan, the
> khugepaged scan costs nothing.

Just FYI, I did some profiling on a workload that constantly split and
joined pages.  Very little of the overhead was in the scanning itself,
so I think you're dead-on here.

-- Dave


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics
  2011-02-08 17:54           ` Dave Hansen
@ 2011-02-08 18:17             ` Andrea Arcangeli
  0 siblings, 0 replies; 38+ messages in thread
From: Andrea Arcangeli @ 2011-02-08 18:17 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrew Morton

Hello,

On Tue, Feb 08, 2011 at 09:54:34AM -0800, Dave Hansen wrote:
> Just FYI, I did some profiling on a workload that constantly split and
> joined pages.  Very little of the overhead was in the scanning itself,
> so I think you're dead-on here.

Yep, my way to deduce it has been to set both to 100% and check the
rate of increase of
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans vs
/sys/kernel/mm/ksm/full_scans, and the difference is enormous. So a
100% CPU ksmd scan can probably be followed more than adequately by a
1% CPU khugepaged scan, and probably achieve the exact same hugepage
ratio as a 100% khugepaged scan. The default khugepaged scan is super
paranoid (it has to be, considering the default ksm scan is
zero). Maybe we can still increase the default pages_to_scan a bit. I
suspect most of the current cost is in the scheduler, and that only
amounts to 1 kthread schedule event every 10 sec.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2011-02-08 18:17 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-01  0:33 [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Dave Hansen
2011-02-01  0:33 ` [RFC][PATCH 1/6] count transparent hugepage splits Dave Hansen
2011-02-01  9:58   ` Johannes Weiner
2011-02-03 21:22   ` David Rientjes
2011-02-04 21:18     ` Andrea Arcangeli
2011-02-04 21:28       ` Dave Hansen
2011-02-01  0:33 ` [RFC][PATCH 2/6] pagewalk: only split huge pages when necessary Dave Hansen
2011-02-01 10:04   ` Johannes Weiner
2011-02-01 15:03     ` Dave Hansen
2011-02-03 21:22   ` David Rientjes
2011-02-03 21:33     ` Dave Hansen
2011-02-03 21:46       ` David Rientjes
2011-02-04 17:19         ` Dave Hansen
2011-02-04 21:10           ` Andrea Arcangeli
2011-02-01  0:34 ` [RFC][PATCH 3/6] break out smaps_pte_entry() from smaps_pte_range() Dave Hansen
2011-02-01 10:08   ` Johannes Weiner
2011-02-03 21:22   ` David Rientjes
2011-02-03 21:40     ` Dave Hansen
2011-02-01  0:34 ` [RFC][PATCH 4/6] pass pte size argument in to smaps_pte_entry() Dave Hansen
2011-02-01 10:09   ` Johannes Weiner
2011-02-03 21:22   ` David Rientjes
2011-02-01  0:34 ` [RFC][PATCH 5/6] teach smaps_pte_range() about THP pmds Dave Hansen
2011-02-01 10:11   ` Johannes Weiner
2011-02-01 15:02     ` Dave Hansen
2011-02-01 16:09       ` Andrea Arcangeli
2011-02-03 21:22   ` David Rientjes
2011-02-03 21:34     ` Dave Hansen
2011-02-01  0:34 ` [RFC][PATCH 6/6] have smaps show transparent huge pages Dave Hansen
2011-02-01 10:12   ` Johannes Weiner
2011-02-03 21:22   ` David Rientjes
2011-02-01 15:38 ` [RFC][PATCH 0/6] more detailed per-process transparent hugepage statistics Andrea Arcangeli
2011-02-01 17:15   ` Dave Hansen
2011-02-01 20:39     ` Andrea Arcangeli
2011-02-01 20:56       ` Dave Hansen
2011-02-02  0:07         ` Andrea Arcangeli
2011-02-08 17:54           ` Dave Hansen
2011-02-08 18:17             ` Andrea Arcangeli
2011-02-03 21:54 ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).