* [RFC PATCH 1/3] Add flags for temporary compound pages
From: Alex Thorlton @ 2013-12-12 18:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel, linux-mm

This chunk of the patch just adds the GFP flags needed to support what
I call a temporary compound page.  Temporary compound pages give us the
ability to turn a chunk of pages into a proper THP at some point after
pulling them off the free list, while still allowing us to fault in
single 4K pages from the chunk.
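
As a rough sketch of the intended use (an illustration only; the real
call site is added in patch 3/3 via alloc_pages_vma()), the new mask is
meant to be handed to the page allocator in place of GFP_TRANSHUGE:

#include <linux/gfp.h>		/* GFP_TEMP_TRANSHUGE (added below), alloc_pages() */
#include <linux/huge_mm.h>	/* HPAGE_PMD_ORDER */

/*
 * Sketch: grab a 512-page (order-9) chunk that may later be promoted
 * to a THP.  The chunk is not marked compound up front, so individual
 * 4K pages can still be handed out from it.  grab_temp_huge_chunk()
 * is a made-up name, not part of the patch.
 */
static struct page *grab_temp_huge_chunk(void)
{
	return alloc_pages(GFP_TEMP_TRANSHUGE, HPAGE_PMD_ORDER);
}

Since ___GFP_COMP_TEMP claims bit 25, the __GFP_BITS_SHIFT value sitting
just below these definitions (25 at this point in the tree) would
presumably need to be bumped to 26, as the in-file comment suggests.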

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Rientjes <rientjes@google.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org

---
 include/linux/gfp.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd49..6ec63dd 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+#define ___GFP_COMP_TEMP	0x2000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -77,6 +78,7 @@ struct vm_area_struct;
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY) /* See above */
 #define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)	/* Add compound page metadata */
+#define __GFP_COMP_TEMP	((__force gfp_t)___GFP_COMP_TEMP)	/* Treat as temp compound page */
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)	/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
 							 * This takes precedence over the
@@ -121,6 +123,9 @@ struct vm_area_struct;
 #define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
 			 __GFP_NO_KSWAPD)
+#define GFP_TEMP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP_TEMP | \
+				 __GFP_NOMEMALLOC | __GFP_NORETRY | \
+				 __GFP_NOWARN | __GFP_NO_KSWAPD)
 
 #ifdef CONFIG_NUMA
 #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
-- 
1.7.12.4



* [RFC PATCH 2/3] Add tunable to control THP behavior
From: Alex Thorlton @ 2013-12-12 18:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel, linux-mm

This part of the patch adds a tunable to
/sys/kernel/mm/transparent_hugepage called threshold.  The threshold
determines how many 4K pages a process must fault in from a single node,
within a single 2M region, before a temporary compound page is turned
into a THP.
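
To make the behavior concrete, here is a minimal sketch of how the value
is meant to be consumed on the fault path (mm->thp_threshold and
thp_threshold_check() come from this patch, struct temp_hugepage and its
ref_count come from patch 3/3, and promote_temp_chunk() is a made-up
name standing in for the promotion logic there):

/*
 * Sketch only, not patch code: count 4K faults into one 2M chunk and
 * promote it once the per-mm threshold is reached.
 */
static void temp_chunk_fault(struct mm_struct *mm,
			     struct temp_hugepage *temp_thp)
{
	if (!mm->thp_threshold)
		mm->thp_threshold = thp_threshold_check();

	if (++temp_thp->ref_count >= mm->thp_threshold)
		promote_temp_chunk(temp_thp);	/* hypothetical helper */
}

From userspace the tunable is a plain sysfs write, e.g.
echo 64 > /sys/kernel/mm/transparent_hugepage/threshold; threshold_store()
rejects values outside 1..HPAGE_PMD_NR, and the default of 1 preserves
the existing one-fault-gets-a-THP behavior.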

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Rientjes <rientjes@google.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org

---
 include/linux/huge_mm.h  |  2 ++
 include/linux/mm_types.h |  1 +
 mm/huge_memory.c         | 30 ++++++++++++++++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428..0943b1b6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -177,6 +177,8 @@ static inline struct page *compound_trans_head(struct page *page)
 extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
+extern int thp_threshold_check(void);
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851ee..b5efa23 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -408,6 +408,7 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+	int thp_threshold;
 #endif
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cca80d9..5d388e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -44,6 +44,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
 
+/* default to 1 page threshold for handing out thps; maintains old behavior */
+static int transparent_hugepage_threshold = 1;
+
 /* default scan 8*512 pte (or vmas) every 30 second */
 static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*8;
 static unsigned int khugepaged_pages_collapsed;
@@ -237,6 +240,11 @@ static struct shrinker huge_zero_page_shrinker = {
 	.seeks = DEFAULT_SEEKS,
 };
 
+int thp_threshold_check(void)
+{
+	return transparent_hugepage_threshold;
+}
+
 #ifdef CONFIG_SYSFS
 
 static ssize_t double_flag_show(struct kobject *kobj,
@@ -376,6 +384,27 @@ static ssize_t use_zero_page_store(struct kobject *kobj,
 }
 static struct kobj_attribute use_zero_page_attr =
 	__ATTR(use_zero_page, 0644, use_zero_page_show, use_zero_page_store);
+static ssize_t threshold_show(struct kobject *kobj,
+			      struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", transparent_hugepage_threshold);
+}
+static ssize_t threshold_store(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       const char *buf, size_t count)
+{
+	int err, value;
+
+	err = kstrtoint(buf, 10, &value);
+	if (err || value < 1 || value > HPAGE_PMD_NR)
+		return -EINVAL;
+
+	transparent_hugepage_threshold = value;
+
+	return count;
+}
+static struct kobj_attribute threshold_attr =
+	__ATTR(threshold, 0644, threshold_show, threshold_store);
 #ifdef CONFIG_DEBUG_VM
 static ssize_t debug_cow_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf)
@@ -397,6 +426,7 @@ static struct kobj_attribute debug_cow_attr =
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
+	&threshold_attr.attr,
 	&use_zero_page_attr.attr,
 #ifdef CONFIG_DEBUG_VM
 	&debug_cow_attr.attr,
-- 
1.7.12.4




* [RFC PATCH 3/3] Change THP behavior
From: Alex Thorlton @ 2013-12-12 18:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel, linux-mm

This patch implements the functionality we're really going for here.
It adds the decision-making behavior that determines when to grab a
temporary compound page, and whether to fault in single pages or to
turn the temporary page into a THP.  This one is rather large; it might
be split up a bit more for later versions.

I've left most of my comments in here just to provide people with some
insight into what I may have been thinking when I chose to do something
in a certain way.  These will probably be trimmed down in later versions
of the patch.
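
For orientation before reading the diff, the heart of
do_huge_pmd_temp_page() below boils down to roughly the following
(heavily condensed: locking, the zero-page path and error handling are
omitted, and record_temp_chunk()/promote_temp_chunk()/map_single_page()
are made-up names for steps that are open-coded in the real function):

	temp_thp = find_pmd_mm_freelist(mm, pmd);

	if (!temp_thp) {
		/* first fault in this 2M region: reserve a 512-page chunk */
		page = alloc_temp_hugepage_vma(transparent_hugepage_defrag(vma),
					       vma, haddr, numa_node_id(), 0);
		temp_thp = record_temp_chunk(mm, pmd, page);
	} else if (temp_thp->node == numa_node_id()) {
		temp_thp->ref_count++;		/* same node: count the fault */
	} else {
		temp_thp->node = -1;		/* crossed nodes: never promote */
	}

	if (temp_thp->node != -1 && temp_thp->ref_count >= mm->thp_threshold)
		promote_temp_chunk(temp_thp);	/* make_compound_page() + set_pmd_at() */
	else
		map_single_page(temp_thp, address);	/* hand out one 4K page */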

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Rientjes <rientjes@google.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org

---
 include/linux/huge_mm.h  |   6 +
 include/linux/mm_types.h |  13 +++
 kernel/fork.c            |   1 +
 mm/huge_memory.c         | 283 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h            |   1 +
 mm/memory.c              |  29 ++++-
 mm/page_alloc.c          |  66 ++++++++++-
 7 files changed, 392 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0943b1b6..c1e407d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -5,6 +5,12 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
 				      struct vm_area_struct *vma,
 				      unsigned long address, pmd_t *pmd,
 				      unsigned int flags);
+extern struct temp_hugepage *find_pmd_mm_freelist(struct mm_struct *mm,
+						 pmd_t *pmd);
+extern int do_huge_pmd_temp_page(struct mm_struct *mm,
+				 struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmd,
+				 unsigned int flags);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 			 struct vm_area_struct *vma);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b5efa23..d48c6ab 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -322,6 +322,17 @@ struct mm_rss_stat {
 	atomic_long_t count[NR_MM_COUNTERS];
 };
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+struct temp_hugepage {
+	pmd_t *pmd;
+	struct page *page;
+	spinlock_t temp_hugepage_lock;
+	int node;			/* node id of the first page in the chunk */
+	int ref_count;			/* number of pages faulted in from the chunk */
+	struct list_head list;		/* pointers to next/prev chunks */
+};
+#endif
+
 struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
@@ -408,7 +419,9 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+	spinlock_t thp_list_lock; /* lock to protect thp_temp_list */
 	int thp_threshold;
+	struct list_head thp_temp_list; /* list of 512 page chunks for THPs */
 #endif
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73..a3ccf857 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -816,6 +816,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
+	INIT_LIST_HEAD(&mm->thp_temp_list);
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	mm->first_nid = NUMA_PTE_SCAN_INIT;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d388e4..43ea095 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -788,6 +788,20 @@ static inline struct page *alloc_hugepage_vma(int defrag,
 			       HPAGE_PMD_ORDER, vma, haddr, nd);
 }
 
+static inline gfp_t alloc_temp_hugepage_gfpmask(gfp_t extra_gfp)
+{
+	return GFP_TEMP_TRANSHUGE | extra_gfp;
+}
+
+static inline struct page *alloc_temp_hugepage_vma(int defrag,
+					      struct vm_area_struct *vma,
+					      unsigned long haddr, int nd,
+					      gfp_t extra_gfp)
+{
+	return alloc_pages_vma(alloc_temp_hugepage_gfpmask(extra_gfp),
+			       HPAGE_PMD_ORDER, vma, haddr, nd);
+}
+
 #ifndef CONFIG_NUMA
 static inline struct page *alloc_hugepage(int defrag)
 {
@@ -871,6 +885,275 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
+/* We need to hold mm->thp_list_lock during this search */
+struct temp_hugepage *find_pmd_mm_freelist(struct mm_struct *mm, pmd_t *pmd)
+{
+	struct temp_hugepage *temp_thp;
+	/*
+	 * we need to check to make sure that the PMD isn't already
+	 * on the list. return the temp_hugepage struct if we find one
+	 * otherwise we just return NULL
+	 */
+	list_for_each_entry(temp_thp, &mm->thp_temp_list, list) {
+		if (temp_thp->pmd == pmd) {
+			return temp_thp;
+		}
+	}
+
+	return NULL;
+}
+
+int do_huge_pmd_temp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       unsigned int flags)
+{
+	int i;
+	spinlock_t *ptl;
+	struct page *page;
+	pte_t *pte;
+	pte_t entry;
+	struct temp_hugepage *temp_thp;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		return VM_FAULT_FALLBACK;
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+	if (unlikely(khugepaged_enter(vma)))
+		return VM_FAULT_OOM;
+	/*
+	 * we're not going to handle this case yet, for now
+	 * we'll just fall back to regular pages
+	 */
+	if (!(flags & FAULT_FLAG_WRITE) &&
+			transparent_hugepage_use_zero_page()) {
+		pgtable_t pgtable;
+		struct page *zero_page;
+		bool set;
+		pgtable = pte_alloc_one(mm, haddr);
+		if (unlikely(!pgtable))
+			return VM_FAULT_OOM;
+		zero_page = get_huge_zero_page();
+		if (unlikely(!zero_page)) {
+			pte_free(mm, pgtable);
+			count_vm_event(THP_FAULT_FALLBACK);
+			return VM_FAULT_FALLBACK;
+		}
+		spin_lock(&mm->page_table_lock);
+		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+				zero_page);
+		spin_unlock(&mm->page_table_lock);
+		if (!set) {
+			pte_free(mm, pgtable);
+			put_huge_zero_page();
+		}
+		return 0;
+	}
+
+	/*
+	 * Here's where we either need to store the PMD on the list
+	 * and give them a regular page, or make the decision to flip
+	 * the PSE bit and send them back with a hugepage
+	 *
+	 * + First we call find_pmd_mm_freelist to determine if the pmd
+	 *   we're interested in has already been faulted into
+	 */
+	spin_lock(&mm->thp_list_lock);
+	temp_thp = find_pmd_mm_freelist(mm, pmd);
+
+	/* this is a temporary workaround to avoid putting the pages back on the freelist */
+	if (temp_thp && temp_thp->node == -1) {
+		spin_unlock(&mm->thp_list_lock);
+		goto single_fault;
+	}
+
+	/*
+	 * we need to create a list entry and add it to the
+	 * new per-mm free list if we didn't find an existing
+	 * entry
+	 */
+	if (!temp_thp && pmd_none(*pmd)) {
+		/* try to get 512 pages from the freelist */
+		page = alloc_temp_hugepage_vma(transparent_hugepage_defrag(vma),
+					  vma, haddr, numa_node_id(), 0);
+
+		if (unlikely(!page)) {
+			/* we should probably change the VM event here? */
+			count_vm_event(THP_FAULT_FALLBACK);
+			return VM_FAULT_FALLBACK;
+		}
+
+		/* do this here instead of below, to get the whole page ready */
+		clear_huge_page(page, haddr, HPAGE_PMD_NR);
+
+		/* add a new temp_hugepage entry to the local freelist */
+		temp_thp = kmalloc(sizeof(struct temp_hugepage), GFP_KERNEL);
+		if (!temp_thp)
+			return VM_FAULT_OOM;
+		temp_thp->pmd = pmd;
+		temp_thp->page = page;
+		temp_thp->node = numa_node_id();
+		temp_thp->ref_count = 1;
+		list_add(&temp_thp->list, &mm->thp_temp_list);
+	/*
+	 * otherwise we increment the reference count, and decide whether
+	 * or not to create a THP
+	 */
+	} else if (temp_thp && !pmd_none(*pmd) && temp_thp->node == numa_node_id()) {
+		temp_thp->ref_count++;
+	/* if they allocated from a different node, they don't get a thp */
+	} else if (temp_thp && !pmd_none(*pmd) && temp_thp->node != numa_node_id()) {
+		/*
+		 * for now we handle this by pushing the rest of the faults through our
+		 * custom fault code below, eventually we will want to put the unused
+		 * pages from our temp_hugepage back on the freelist, so they can be
+		 * faulted in by the normal code paths
+		 */
+
+		temp_thp->node = -1;
+	} else {
+		spin_unlock(&mm->thp_list_lock);
+		return VM_FAULT_FALLBACK;
+	}
+
+	spin_unlock(&mm->thp_list_lock);
+
+	/*
+	 * now that we've done the accounting work, we check to see if
+	 * we've exceeded our threshold
+	 */
+	if (temp_thp->ref_count >= mm->thp_threshold) {
+		pmd_t pmd_entry;
+		pgtable_t pgtable;
+
+		/*
+		 * we'll do all of the following beneath the big ptl for now
+		 * this will need to be modified to work with the split ptl
+		 */
+		spin_lock(&mm->page_table_lock);
+
+		/*
+		 * once we get past the lock we have to make sure that somebody
+		 * else hasn't already turned this guy into a THP, if they have,
+		 * then the page we need is already faulted in as part of the THP
+		 * they created
+		 */
+		if (PageTransHuge(temp_thp->page)) {
+			spin_unlock(&mm->page_table_lock);
+			return 0;
+		}
+
+		pgtable = pte_alloc_one(mm, haddr);
+		if (unlikely(!pgtable)) {
+			spin_unlock(&mm->page_table_lock);
+			return VM_FAULT_OOM;
+		}
+
+		/* might wanna move this? */
+		__SetPageUptodate(temp_thp->page);
+
+		/* turn the pages into one compound page */
+		make_compound_page(temp_thp->page, HPAGE_PMD_ORDER);
+
+		/* set up the pmd */
+		pmd_entry = mk_huge_pmd(temp_thp->page, vma->vm_page_prot);
+		pmd_entry = maybe_pmd_mkwrite(pmd_mkdirty(pmd_entry), vma);
+
+		/* remap the new page since we cleared the mappings */
+		page_add_anon_rmap(temp_thp->page, vma, address);
+
+		/* deposit the thp */
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+
+		set_pmd_at(mm, haddr, pmd, pmd_entry);
+		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR - mm->thp_threshold + 1);
+		/* mm->nr_ptes++; */
+
+		/* delete the reference to this compound page from our list */
+		spin_lock(&mm->thp_list_lock);
+		list_del(&temp_thp->list);
+		spin_unlock(&mm->thp_list_lock);
+
+		spin_unlock(&mm->page_table_lock);
+		return 0;
+	} else {
+single_fault:
+		/* fault in the page */
+		if (pmd_none(*pmd) && __pte_alloc(mm, vma, temp_thp->pmd, address))
+			return VM_FAULT_OOM;
+
+		/*
+		 * we'll do all of the following beneath the big ptl for now
+		 * this will need to be modified to work with the split ptl
+		 */
+		spin_lock(&mm->page_table_lock);
+
+		page = temp_thp->page + (int) pte_index(address);
+
+		/* set the page's refcount */
+		set_page_refcounted(page);
+		pte = pte_offset_map(temp_thp->pmd, address);
+
+		/* might wanna move this? */
+		__SetPageUptodate(page);
+
+		if (!pte_present(*pte)) {
+			if (pte_none(*pte)) {
+				pte_unmap(pte);
+
+				if (unlikely(anon_vma_prepare(vma))) {
+					spin_unlock(&mm->page_table_lock);
+					return VM_FAULT_OOM;
+				}
+
+				entry = mk_pte(page, vma->vm_page_prot);
+				if (vma->vm_flags & VM_WRITE)
+					entry = pte_mkwrite(pte_mkdirty(entry));
+
+				pte = pte_offset_map_lock(mm, temp_thp->pmd, address, &ptl);
+
+				page_add_new_anon_rmap(page, vma, haddr);
+				add_mm_counter(mm, MM_ANONPAGES, 1);
+
+				set_pte_at(mm, address, pte, entry);
+
+				pte_unmap_unlock(pte, ptl);
+				spin_unlock(&mm->page_table_lock);
+
+				return 0;
+			}
+		} else {
+			spin_unlock(&mm->page_table_lock);
+			return VM_FAULT_FALLBACK;
+		}
+	}
+
+	/* I don't know what this does right now.  I'm leaving it */
+	if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK);
+		return VM_FAULT_FALLBACK;
+	}
+
+	/*
+	 * here's the important piece, where we actually make our 512
+	 * page chunk into a THP, by setting the PSE bit.  This is the
+	 * spot we really need to change.  In the end, we could probably
+	 * spin this up into the old function, but we'll keep them separate
+	 * for now
+	 */
+	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
+		mem_cgroup_uncharge_page(page);
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK);
+		return VM_FAULT_FALLBACK;
+	}
+
+	/* again, probably want a different VM event here */
+	count_vm_event(THP_FAULT_ALLOC);
+	return 0;
+}
+
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma)
diff --git a/mm/internal.h b/mm/internal.h
index 684f7aa..8fc296b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -98,6 +98,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
  */
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned long order);
+extern void make_compound_page(struct page *page, unsigned long order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index d176154..014d9ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3764,13 +3764,30 @@ retry:
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
-	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+	if (transparent_hugepage_enabled(vma)) {
 		int ret = VM_FAULT_FALLBACK;
-		if (!vma->vm_ops)
-			ret = do_huge_pmd_anonymous_page(mm, vma, address,
-					pmd, flags);
-		if (!(ret & VM_FAULT_FALLBACK))
-			return ret;
+		/*
+		 * This is a temporary location for this code, just to get things
+		 * up and running.  I'll come up with a better way to handle this
+		 * later
+		 */
+		if (!mm->thp_threshold)
+			mm->thp_threshold = thp_threshold_check();
+		if (!mm->thp_temp_list.next && !mm->thp_temp_list.prev)
+			INIT_LIST_HEAD(&mm->thp_temp_list);
+		if (mm->thp_threshold > 1) {
+			if (!vma->vm_ops)
+				ret = do_huge_pmd_temp_page(mm, vma, address,
+							    pmd, flags);
+			if (!(ret & VM_FAULT_FALLBACK))
+				return ret;
+		} else if (pmd_none(*pmd)) {
+			if (!vma->vm_ops)
+				ret = do_huge_pmd_anonymous_page(mm, vma, address,
+								 pmd, flags);
+			if (!(ret & VM_FAULT_FALLBACK))
+				return ret;
+		}
 	} else {
 		pmd_t orig_pmd = *pmd;
 		int ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..48e13fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -375,6 +375,65 @@ void prep_compound_page(struct page *page, unsigned long order)
 	}
 }
 
+/*
+ * This function is used to create a proper compound page from a chunk of
+ * contiguous pages, most likely allocated as a temporary hugepage
+ */
+void make_compound_page(struct page *page, unsigned long order)
+{
+	int i, max_count = 0, max_mapcount = 0;
+	int nr_pages = 1 << order;
+
+	set_compound_page_dtor(page, free_compound_page);
+	set_compound_order(page, order);
+
+	__SetPageHead(page);
+
+	/*
+	 * we clear all the mappings here, so we have to remember to set
+	 * them back up!
+	 */
+	page->mapping = NULL;
+
+	max_count = (int) atomic_read(&page->_count);
+	max_mapcount = (int) atomic_read(&page->_mapcount);
+
+	for (i = 1; i < nr_pages; i++) {
+		int cur_count, cur_mapcount;
+		struct page *p = page + i;
+		p->flags = 0; /* this seems dumb */
+		__SetPageTail(p);
+		set_page_count(p, 0);
+		p->first_page = page;
+		p->mapping = NULL;
+
+		cur_count = (int) atomic_read(&page->_count);
+		cur_mapcount = (int) atomic_read(&page->_mapcount);
+		atomic_set(&page->_count, 0);
+		atomic_set(&page->_mapcount, -1);
+		if (cur_count > max_count)
+			max_count = cur_count;
+		if (cur_mapcount > max_mapcount)
+			max_mapcount = cur_mapcount;
+
+		/*
+		 * poison the LRU entries for all the tail pages (aside from the
+		 * first one), the entries for the head page should be okay
+		 */
+		if (i != 1) {
+			p->lru.next = LIST_POISON1;
+			p->lru.prev = LIST_POISON2;
+		}
+	}
+
+	atomic_set(&page->_count, max_count);
+	/*
+	 * we set to max_mapcount - 1 here because we're going to
+	 * map this page again later.  This definitely doesn't seem right.
+	 */
+	atomic_set(&page->_mapcount, max_mapcount - 1);
+}
+
 /* update __split_huge_page_refcount if you change this function */
 static int destroy_compound_page(struct page *page, unsigned long order)
 {
@@ -865,7 +924,12 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 	}
 
 	set_page_private(page, 0);
-	set_page_refcounted(page);
+	 /*
+	  * We don't want to set _count for temporary compound pages, since
+	  * we may not immediately fault in the first page
+	  */
+	if (!(gfp_flags & __GFP_COMP_TEMP))
+		set_page_refcounted(page);
 
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
-- 
1.7.12.4



* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
From: Andy Lutomirski @ 2013-12-12 20:11 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

On Thu, Dec 12, 2013 at 10:00 AM, Alex Thorlton <athorlton@sgi.com> wrote:
> This part of the patch adds a tunable to
> /sys/kernel/mm/transparent_hugepage called threshold.  This threshold
> determines how many pages a user must fault in from a single node before
> a temporary compound page is turned into a THP.

Is there a setting that will turn off the must-be-the-same-node
behavior?  There are workloads where TLB matters more than cross-node
traffic (or where all the pages are hopelessly shared between nodes,
but hugepages are still useful).

--Andy


* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 20:11     ` Andy Lutomirski
@ 2013-12-12 20:49       ` Alex Thorlton
  -1 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-12 20:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

> Is there a setting that will turn off the must-be-the-same-node
> behavior?  There are workloads where TLB matters more than cross-node
> traffic (or where all the pages are hopelessly shared between nodes,
> but hugepages are still useful).

That's pretty much how THPs already behave in the kernel, so if you want
to allow THPs to be handed out to one node, but referenced from many
others, you'd just set the threshold to 1, and let the existing code
take over.

As for the must-be-the-same-node behavior:  I'd actually say it's more
like a "must have so much on one node" behavior, in that, if you set the
threshold to 16, for example, 16 4K pages must be faulted in on the same
node, in the same contiguous 2M chunk, before a THP will be created.
What happens after that THP is created is out of our control; it could
be referenced from anywhere.
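
A rough sketch of the bookkeeping that description implies (purely
illustrative, not code from the patch; only ref_count and the threshold
value correspond to names that actually show up in patch 3/3, the rest
is made up):

	/*
	 * Hypothetical per-2M-chunk tracker, one per temporary compound
	 * page handed out to the process.
	 */
	struct temp_thp_tracker {
		int first_node;		/* node that took the first 4K fault */
		int ref_count;		/* 4K pages faulted in so far */
		bool abandoned;		/* a second node touched the chunk */
	};

	/* Called for each 4K fault that lands inside the chunk. */
	static bool temp_thp_account_fault(struct temp_thp_tracker *t,
					   int this_node, int threshold)
	{
		if (t->abandoned)
			return false;
		if (t->ref_count == 0)
			t->first_node = this_node;
		else if (t->first_node != this_node) {
			/* two nodes touched the chunk: give up on a THP */
			t->abandoned = true;
			return false;
		}
		return ++t->ref_count >= threshold;	/* collapse now? */
	}

With the threshold set to 1 the very first fault satisfies the check,
which is why that setting falls back to the existing behavior.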

The idea here is that we can tune things so that jobs that behave poorly
with THP on will not be given THPs, but the jobs that like THPs can
still get them.  Granted, there are still issues with this approach, but
I think it's a bit better than just handing out a THP because we touched
one byte in a 2M chunk.

- Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 20:49       ` Alex Thorlton
@ 2013-12-12 20:52         ` Andy Lutomirski
  -1 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2013-12-12 20:52 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

On Thu, Dec 12, 2013 at 12:49 PM, Alex Thorlton <athorlton@sgi.com> wrote:
>
> > Is there a setting that will turn off the must-be-the-same-node
> > behavior?  There are workloads where TLB matters more than cross-node
> > traffic (or where all the pages are hopelessly shared between nodes,
> > but hugepages are still useful).
>
> That's pretty much how THPs already behave in the kernel, so if you want
> to allow THPs to be handed out to one node, but referenced from many
> others, you'd just set the threshold to 1, and let the existing code
> take over.
>

Right.  I like that behavior for my workload.  (Although I currently
allocate huge pages -- when I wrote that code, THP interacted so badly
with pagecache that it was a non-starter.  I think it's fixed now,
though.)


>
> As for the must-be-the-same-node behavior:  I'd actually say it's more
> like a "must have so much on one node" behavior, in that, if you set the
> threshold to 16, for example, 16 4K pages must be faulted in on the same
> node, in the same contiguous 2M chunk, before a THP will be created.
> What happens after that THP is created is out of our control, it could
> be referenced from anywhere.

In that case, I guess I misunderstood your description.  Are you saying
that, once any node accesses this many pages in the potential THP,
then the whole THP will be mapped?

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 20:52         ` Andy Lutomirski
@ 2013-12-12 21:04           ` Alex Thorlton
  -1 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-12 21:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

> Right.  I like that behavior for my workload.  (Although I currently
> allocate huge pages -- when I wrote that code, THP interacted so badly
> with pagecache that it was a non-starter.  I think it's fixed now,
> though.)

In that case, it's probably best to just stick with current behavior,
and leave the threshold at 1, unless we implement something like what I
discuss below.

> In that case, I guess I misunderstood your description.  Are saying
> that, once any node accesses this many pages in the potential THP,
> then the whole THP will be mapped?

Well, right now, this patch completely gives up on mapping a THP if two
different nodes take a page from our chunk before the threshold is hit,
so I think you're mostly understanding it correctly.

One thing we could consider is adding an option to map the THP on
the node with the *most* references to the potential THP, instead of
giving up on mapping the THP when multiple nodes reference it.  That
might be a good middle ground, but I can see some performance issues
coming into play there if the threshold is set too high, since we'll
have to move all the pages in the chunk to the node that hit the
threshold.
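
One possible shape for that (illustrative only; MAX_NUMNODES and
nr_node_ids are real kernel symbols, everything else here is made up)
would be per-node reference counters plus a helper that picks the
busiest node:

	struct temp_thp_node_refs {
		int refs[MAX_NUMNODES];	/* 4K faults seen from each node */
	};

	/* Pick the node that has referenced the chunk the most so far. */
	static int temp_thp_busiest_node(struct temp_thp_node_refs *t)
	{
		int nid, best = 0;

		for (nid = 1; nid < nr_node_ids; nid++)
			if (t->refs[nid] > t->refs[best])
				best = nid;
		return best;
	}

The collapse path would then have to migrate any 4K pages sitting on
other nodes over to the winning node before building the compound page,
which is exactly the cost described above.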

- Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 18:00   ` Alex Thorlton
@ 2013-12-12 21:37     ` Rik van Riel
  -1 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2013-12-12 21:37 UTC (permalink / raw)
  To: Alex Thorlton, linux-mm
  Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Wanpeng Li, Mel Gorman, Michel Lespinasse, Benjamin LaHaise,
	Oleg Nesterov, Eric W. Biederman, Andy Lutomirski, Al Viro,
	David Rientjes, Zhang Yanfei, Peter Zijlstra, Johannes Weiner,
	Michal Hocko, Jiang Liu, Cody P Schafer, Glauber Costa,
	Kamezawa Hiroyuki, Naoya Horiguchi, linux-kernel,
	Andrea Arcangeli

On 12/12/2013 01:00 PM, Alex Thorlton wrote:
> This part of the patch adds a tunable to
> /sys/kernel/mm/transparent_hugepage called threshold.  This threshold
> determines how many pages a user must fault in from a single node before
> a temporary compound page is turned into a THP.

> +++ b/mm/huge_memory.c
> @@ -44,6 +44,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
>   	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>   	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
>
> +/* default to 1 page threshold for handing out thps; maintains old behavior */
> +static int transparent_hugepage_threshold = 1;

I assume the motivation for writing all this code is that "1"
was not a good value in your tests.

That makes me wonder, why should 1 be the default value with
your patches?

If there is a better value, why should we not use that?

What is the upside of using a better value?

What is the downside?

Is there a value that would bound the downside, so it
is almost always smaller than the upside?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 21:37     ` Rik van Riel
@ 2013-12-12 23:17       ` Alex Thorlton
  -1 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-12 23:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes,
	Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko,
	Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel, Andrea Arcangeli

On Thu, Dec 12, 2013 at 04:37:11PM -0500, Rik van Riel wrote:
> On 12/12/2013 01:00 PM, Alex Thorlton wrote:
> >This part of the patch adds a tunable to
> >/sys/kernel/mm/transparent_hugepage called threshold.  This threshold
> >determines how many pages a user must fault in from a single node before
> >a temporary compound page is turned into a THP.
> 
> >+++ b/mm/huge_memory.c
> >@@ -44,6 +44,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
> >  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
> >  	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
> >
> >+/* default to 1 page threshold for handing out thps; maintains old behavior */
> >+static int transparent_hugepage_threshold = 1;
> 
> I assume the motivation for writing all this code is that "1"
> was not a good value in your tests.

Yes, that's correct.

> That makes me wonder, why should 1 be the default value with
> your patches?

The main reason I set the default to 1 was because the majority of
jobs aren't hurt by the existing THP behavior.  I figured it would be
best to default to having things behave the same as they do now, but
provide the option to increase the threshold on systems that run jobs
that could be adversely affected by the current behavior.
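
For reference, the sysfs side of such a tunable usually boils down to a
store handler along these lines; this is a guess at the shape of patch
2/3 rather than its actual code, though transparent_hugepage_threshold
and HPAGE_PMD_NR are names that really exist:

	static ssize_t threshold_store(struct kobject *kobj,
				       struct kobj_attribute *attr,
				       const char *buf, size_t count)
	{
		unsigned long value;
		int ret;

		ret = kstrtoul(buf, 10, &value);
		if (ret < 0)
			return ret;
		/* 1 keeps today's behavior; HPAGE_PMD_NR demands a full 2M */
		if (value < 1 || value > HPAGE_PMD_NR)
			return -EINVAL;

		transparent_hugepage_threshold = value;
		return count;
	}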

> If there is a better value, why should we not use that?
>
> What is the upside of using a better value?
>
> What is the downside?

The problem here is that what the "better" value is can vary greatly
depending on how a particular task allocates memory.  Setting the
threshold too high can negatively affect the performance of jobs that
behave well with the current behavior, while setting it too low won't
yield a performance increase for the jobs that are hurt by the current
behavior.  With some more thorough testing, I'm sure that we could
arrive at a value that will help out jobs which behave poorly under
current conditions, while having a minimal effect on jobs that already
perform well.  At this point, I'm looking more to ensure that everybody
likes this approach to solving the problem before putting the finishing
touches on the patches, and doing testing to find a good middle ground.
 
> Is there a value that would to bound the downside, so it
> is almost always smaller than the upside?

Again, the problem here is that, to find a good value, we have to know
quite a bit about why a particular value is bad for a particular job.
While, as stated above, I think we can probably find a good middle
ground to use as a default, in the end it will be the job of individual
sysadmins to determine what value works best for their particular
applications, and tune things accordingly.

- Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [RFC PATCH 3/3] Change THP behavior
  2013-12-12 18:00   ` Alex Thorlton
@ 2013-12-13 13:13     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 24+ messages in thread
From: Kirill A. Shutemov @ 2013-12-13 13:13 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes,
	Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko,
	Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel, linux-mm

Alex Thorlton wrote:
> +	/*
> +	 * now that we've done the accounting work, we check to see if
> +	 * we've exceeded our threshold
> +	 */
> +	if (temp_thp->ref_count >= mm->thp_threshold) {
> +		pmd_t pmd_entry;
> +		pgtable_t pgtable;
> +
> +		/*
> +		 * we'll do all of the following beneath the big ptl for now
> +		 * this will need to be modified to work with the split ptl
> +		 */
> +		spin_lock(&mm->page_table_lock);
> +
> +		/*
> +		 * once we get past the lock we have to make sure that somebody
> +		 * else hasn't already turned this guy into a THP, if they have,
> +		 * then the page we need is already faulted in as part of the THP
> +		 * they created
> +		 */
> +		if (PageTransHuge(temp_thp->page)) {
> +			spin_unlock(&mm->page_table_lock);
> +			return 0;
> +		}
> +
> +		pgtable = pte_alloc_one(mm, haddr);
> +		if (unlikely(!pgtable)) {
> +			spin_unlock(&mm->page_table_lock);
> +			return VM_FAULT_OOM;
> +		}
> +
> +		/* might wanna move this? */
> +		__SetPageUptodate(temp_thp->page);
> +
> +		/* turn the pages into one compound page */
> +		make_compound_page(temp_thp->page, HPAGE_PMD_ORDER);
> +
> +		/* set up the pmd */
> +		pmd_entry = mk_huge_pmd(temp_thp->page, vma->vm_page_prot);
> +		pmd_entry = maybe_pmd_mkwrite(pmd_mkdirty(pmd_entry), vma);
> +
> +		/* remap the new page since we cleared the mappings */
> +		page_add_anon_rmap(temp_thp->page, vma, address);
> +
> +		/* deposit the thp */
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +
> +		set_pmd_at(mm, haddr, pmd, pmd_entry);
> +		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR - mm->thp_threshold + 1);
> +		/* mm->nr_ptes++; */
> +
> +		/* delete the reference to this compound page from our list */
> +		spin_lock(&mm->thp_list_lock);
> +		list_del(&temp_thp->list);
> +		spin_unlock(&mm->thp_list_lock);
> +
> +		spin_unlock(&mm->page_table_lock);
> +		return 0;

Hm. I think this part is not correct: you collapse the temp thp page
into a real one only for the current process. What will happen if a
process with temp thp pages is forked?

And I don't think this problem is an easy one. khugepaged can't collapse
pages with page->_count != 1 for the same reason: to do it properly you
need to take mmap_sem for all processes and collapse all pages at once.
And if a page is pinned, we also can't collapse.
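
To make the constraint concrete, the check is conceptually something
like this (an illustration, not khugepaged's actual code, and it
ignores transient references such as LRU pagevecs or the swap cache):

	/*
	 * A 4K page is only safe to fold into a huge page if the single
	 * mapping we are about to replace is the only thing referencing it.
	 */
	static bool safe_to_collapse(struct page *page)
	{
		if (page_mapcount(page) != 1)
			return false;	/* mapped by more than one mm, e.g. after fork */
		if (page_count(page) != page_mapcount(page))
			return false;	/* extra reference: GUP pin, O_DIRECT, ... */
		return true;
	}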

Sorry, I don't think the whole idea has much potential. :(

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 3/3] Change THP behavior
  2013-12-12 18:00   ` Alex Thorlton
@ 2013-12-13 18:12     ` Oleg Nesterov
  -1 siblings, 0 replies; 24+ messages in thread
From: Oleg Nesterov @ 2013-12-13 18:12 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

I know almost nothing about thp, so it's unlikely that I understand
this patch correctly...

But, afaics, the main idea is that until we have mm->thp_threshold
faults we install the tail pages of temp_hugepage->page as normal
anonymous pages, and then we actually add the Head/Tail metadata and
the necessary huge_pmd/etc.

I simply can't understand how this can work until make_compound_page()
is called. Just for example, what happens after sys_munmap() ? If
nothing else, who will find and free temp_hugepage connected to this
area? Or, what if sys_mremap() moves this vma? find_pmd_mm_freelist()
won't find the right temp_thp after that? Or split_vma, etc.

And make_compound_page() itself looks suspicious,

On 12/12, Alex Thorlton wrote:
>
> +void make_compound_page(struct page *page, unsigned long order)
> +{
> +	int i, max_count = 0, max_mapcount = 0;
> +	int nr_pages = 1 << order;
> +
> +	set_compound_page_dtor(page, free_compound_page);
> +	set_compound_order(page, order);
> +
> +	__SetPageHead(page);
> +
> +	/*
> +	 * we clear all the mappings here, so we have to remember to set
> +	 * them back up!
> +	 */
> +	page->mapping = NULL;
> +
> +	max_count = (int) atomic_read(&page->_count);
> +	max_mapcount = (int) atomic_read(&page->_mapcount);
> +
> +	for (i = 1; i < nr_pages; i++) {
> +		int cur_count, cur_mapcount;
> +		struct page *p = page + i;
> +		p->flags = 0; /* this seems dumb */
> +		__SetPageTail(p);

Just for example, what if put_page(p) or get_page(p) is called after
__SetPageTail()?

Afaics, this page was already visible to, say, get_user_pages() and
it can have external references.
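
Concretely, the interleaving in question would look roughly like this
(illustrative only):

	/*
	 *   CPU A (fault path)                 CPU B (e.g. direct I/O)
	 *
	 *   make_compound_page(head, order)    get_user_pages(...) pins p
	 *     __SetPageTail(p)
	 *     set_page_count(p, 0)             get_page(p) / put_page(p)
	 *                                      now operate on a tail page
	 *                                      whose refcount was just
	 *                                      clobbered behind its back
	 */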

In fact everything else looks suspicious too but let me repeat that
I do not really understand this code. So please do not count this as
a review, but perhaps the changelog should say more about what this
patch actually does and how this all should work?

Oleg.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 3/3] Change THP behavior
  2013-12-13 13:13     ` Kirill A. Shutemov
@ 2013-12-16 17:37       ` Alex Thorlton
  -1 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-16 17:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Andrew Morton, Benjamin Herrenschmidt, Rik van Riel,
	Wanpeng Li, Mel Gorman, Michel Lespinasse, Benjamin LaHaise,
	Oleg Nesterov, Eric W. Biederman, Andy Lutomirski, Al Viro,
	David Rientjes, Zhang Yanfei, Peter Zijlstra, Johannes Weiner,
	Michal Hocko, Jiang Liu, Cody P Schafer, Glauber Costa,
	Kamezawa Hiroyuki, Naoya Horiguchi, linux-kernel

> Hm. I think this part is not correct: you collapse the temp thp page
> into a real one only for the current process. What will happen if a
> process with temp thp pages is forked?

That's a scenario that I hadn't yet addressed, but definitely something
I'll consider going forward.  I think we can come up with a way to
appropriately handle that situation.

> And I don't think this problem is an easy one. khugepaged can't collapse
> pages with page->_count != 1 for the same reason: to do it properly you
> need to take mmap_sem for all processes and collapse all pages at once.
> And if a page is pinned, we also can't collapse.

Again, a few things here that I hadn't taken into account.  I'll look
for a way to address these concerns.

> Sorry, I don't think the whole idea has much potential. :(

I understand that there are some issues with this initial pass at the
idea, but I think we can get things corrected and come up with a
workable solution here.  When it comes down to it, there are relatively
few options to correct the issue that I'm focusing on here, and this one
seems to be the most appropriate one that we've tried so far.  We've
looked at disabling THP on a per-cpuset/per-process basis, and that has
been met with fairly strong resistance, for good reason.  I'll take
another pass at this and hopefully be able to address some of your
concerns with the idea.

Thanks for taking the time to look the patch over!

- Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread
