* [RFC PATCH 1/3] Add flags for temporary compound pages
  [not found] <cover.1386790423.git.athorlton@sgi.com>
@ 2013-12-12 18:00 ` Alex Thorlton
  2 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-12 18:00 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
    Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
    Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
    Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
    Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
    Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
    Naoya Horiguchi, linux-kernel, linux-mm

This chunk of the patch adds the GFP flags needed to support what I
call a temporary compound page. A temporary compound page gives us the
ability to turn a chunk of pages into a proper THP at some point after
pulling them off the free list, while still allowing single 4K pages to
be faulted in from the chunk.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Rientjes <rientjes@google.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/gfp.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9b4dd49..6ec63dd 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+#define ___GFP_COMP_TEMP	0x2000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -77,6 +78,7 @@ struct vm_area_struct;
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY) /* See above */
 #define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)	/* Add compound page metadata */
+#define __GFP_COMP_TEMP	((__force gfp_t)___GFP_COMP_TEMP) /* Treat as temp compound page */
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)	/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
							 * This takes precedence over the
@@ -121,6 +123,9 @@ struct vm_area_struct;
 #define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
			 __GFP_NO_KSWAPD)
+#define GFP_TEMP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP_TEMP | \
+				 __GFP_NOMEMALLOC | __GFP_NORETRY | \
+				 __GFP_NOWARN | __GFP_NO_KSWAPD)
 
 #ifdef CONFIG_NUMA
 #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* [RFC PATCH 2/3] Add tunable to control THP behavior
  [not found] <cover.1386790423.git.athorlton@sgi.com>
@ 2013-12-12 18:00 ` Alex Thorlton
  2 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-12 18:00 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
    Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
    Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
    Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
    Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
    Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
    Naoya Horiguchi, linux-kernel, linux-mm

This part of the patch adds a tunable to
/sys/kernel/mm/transparent_hugepage called threshold. This threshold
determines how many pages a user must fault in from a single node
before a temporary compound page is turned into a THP.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Rientjes <rientjes@google.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/huge_mm.h  |  2 ++
 include/linux/mm_types.h |  1 +
 mm/huge_memory.c         | 30 ++++++++++++++++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428..0943b1b6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -177,6 +177,8 @@ static inline struct page *compound_trans_head(struct page *page)
 extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
+extern int thp_threshold_check(void);
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851ee..b5efa23 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -408,6 +408,7 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+	int thp_threshold;
 #endif
 #ifdef CONFIG_CPUMASK_OFFSTACK
	struct cpumask cpumask_allocation;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cca80d9..5d388e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -44,6 +44,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
 
+/* default to 1 page threshold for handing out thps; maintains old behavior */
+static int transparent_hugepage_threshold = 1;
+
 /* default scan 8*512 pte (or vmas) every 30 second */
 static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*8;
 static unsigned int khugepaged_pages_collapsed;
@@ -237,6 +240,11 @@ static struct shrinker huge_zero_page_shrinker = {
	.seeks = DEFAULT_SEEKS,
 };
 
+int thp_threshold_check(void)
+{
+	return transparent_hugepage_threshold;
+}
+
 #ifdef CONFIG_SYSFS
 static ssize_t double_flag_show(struct kobject *kobj,
@@ -376,6 +384,27 @@ static ssize_t use_zero_page_store(struct kobject *kobj,
 }
 static struct kobj_attribute use_zero_page_attr =
	__ATTR(use_zero_page, 0644, use_zero_page_show, use_zero_page_store);
+static ssize_t threshold_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", transparent_hugepage_threshold);
+}
+static ssize_t threshold_store(struct kobject *kobj,
+		struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	int err, value;
+
+	err = kstrtoint(buf, 10, &value);
+	if (err || value < 1 || value > HPAGE_PMD_NR)
+		return -EINVAL;
+
+	transparent_hugepage_threshold = value;
+
+	return count;
+}
+static struct kobj_attribute threshold_attr =
+	__ATTR(threshold, 0644, threshold_show, threshold_store);
 #ifdef CONFIG_DEBUG_VM
 static ssize_t debug_cow_show(struct kobject *kobj,
			struct kobj_attribute *attr, char *buf)
@@ -397,6 +426,7 @@ static struct kobj_attribute debug_cow_attr =
 static struct attribute *hugepage_attr[] = {
	&enabled_attr.attr,
	&defrag_attr.attr,
+	&threshold_attr.attr,
	&use_zero_page_attr.attr,
 #ifdef CONFIG_DEBUG_VM
	&debug_cow_attr.attr,
-- 
1.7.12.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 18:00 ` Alex Thorlton
@ 2013-12-12 20:11 ` Andy Lutomirski
  0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2013-12-12 20:11 UTC (permalink / raw)
To: Alex Thorlton
Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
    Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
    Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
    Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
    Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
    Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
    Naoya Horiguchi, linux-kernel

On Thu, Dec 12, 2013 at 10:00 AM, Alex Thorlton <athorlton@sgi.com> wrote:
> This part of the patch adds a tunable to
> /sys/kernel/mm/transparent_hugepage called threshold. This threshold
> determines how many pages a user must fault in from a single node before
> a temporary compound page is turned into a THP.

Is there a setting that will turn off the must-be-the-same-node
behavior? There are workloads where TLB matters more than cross-node
traffic (or where all the pages are hopelessly shared between nodes,
but hugepages are still useful).

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 20:11 ` Andy Lutomirski
@ 2013-12-12 20:49 ` Alex Thorlton
  0 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-12 20:49 UTC (permalink / raw)
To: Andy Lutomirski
Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
    Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
    Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
    Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
    Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
    Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
    Naoya Horiguchi, linux-kernel

> Is there a setting that will turn off the must-be-the-same-node
> behavior? There are workloads where TLB matters more than cross-node
> traffic (or where all the pages are hopelessly shared between nodes,
> but hugepages are still useful).

That's pretty much how THPs already behave in the kernel, so if you want
to allow THPs to be handed out to one node, but referenced from many
others, you'd just set the threshold to 1, and let the existing code
take over.

As for the must-be-the-same-node behavior: I'd actually say it's more
like a "must have so much on one node" behavior, in that, if you set the
threshold to 16, for example, 16 4K pages must be faulted in on the same
node, in the same contiguous 2M chunk, before a THP will be created.
What happens after that THP is created is out of our control; it could
be referenced from anywhere.

The idea here is that we can tune things so that jobs that behave poorly
with THP on will not be given THPs, but the jobs that like THPs can
still get them. Granted, there are still issues with this approach, but
I think it's a bit better than just handing out a THP because we touched
one byte in a 2M chunk.

- Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 20:49 ` Alex Thorlton
@ 2013-12-12 20:52 ` Andy Lutomirski
  0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2013-12-12 20:52 UTC (permalink / raw)
To: Alex Thorlton
Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
    Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
    Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
    Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
    Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
    Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
    Naoya Horiguchi, linux-kernel

On Thu, Dec 12, 2013 at 12:49 PM, Alex Thorlton <athorlton@sgi.com> wrote:
>
> > Is there a setting that will turn off the must-be-the-same-node
> > behavior? There are workloads where TLB matters more than cross-node
> > traffic (or where all the pages are hopelessly shared between nodes,
> > but hugepages are still useful).
>
> That's pretty much how THPs already behave in the kernel, so if you want
> to allow THPs to be handed out to one node, but referenced from many
> others, you'd just set the threshold to 1, and let the existing code
> take over.

Right. I like that behavior for my workload. (Although I currently
allocate huge pages -- when I wrote that code, THP interacted so badly
with pagecache that it was a non-starter. I think it's fixed now,
though.)

> As for the must-be-the-same-node behavior: I'd actually say it's more
> like a "must have so much on one node" behavior, in that, if you set the
> threshold to 16, for example, 16 4K pages must be faulted in on the same
> node, in the same contiguous 2M chunk, before a THP will be created.
> What happens after that THP is created is out of our control, it could
> be referenced from anywhere.

In that case, I guess I misunderstood your description. Are you saying
that, once any node accesses this many pages in the potential THP, then
the whole THP will be mapped?

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 20:52 ` Andy Lutomirski
@ 2013-12-12 21:04 ` Alex Thorlton
  0 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-12 21:04 UTC (permalink / raw)
To: Andy Lutomirski
Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
    Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
    Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
    Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
    Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
    Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
    Naoya Horiguchi, linux-kernel

> Right. I like that behavior for my workload. (Although I currently
> allocate huge pages -- when I wrote that code, THP interacted so badly
> with pagecache that it was a non-starter. I think it's fixed now,
> though.)

In that case, it's probably best to just stick with current behavior,
and leave the threshold at 1, unless we implement something like I
discuss below.

> In that case, I guess I misunderstood your description. Are saying
> that, once any node accesses this many pages in the potential THP,
> then the whole THP will be mapped?

Well, right now, this patch completely gives up on mapping a THP if two
different nodes take a page from our chunk before the threshold is hit,
so I think you're mostly understanding it correctly.

One thing we could consider is adding an option to map the THP on the
node with the *most* references to the potential THP, instead of giving
up on mapping the THP when multiple nodes reference it. That might be a
good middle ground, but I can see some performance issues coming into
play there if the threshold is set too high, since we'll have to move
all the pages in the chunk to the node that hit the threshold.

- Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 18:00 ` Alex Thorlton
@ 2013-12-12 21:37   ` Rik van Riel
  -1 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2013-12-12 21:37 UTC (permalink / raw)
To: Alex Thorlton, linux-mm
Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
    Wanpeng Li, Mel Gorman, Michel Lespinasse, Benjamin LaHaise,
    Oleg Nesterov, Eric W. Biederman, Andy Lutomirski, Al Viro,
    David Rientjes, Zhang Yanfei, Peter Zijlstra, Johannes Weiner,
    Michal Hocko, Jiang Liu, Cody P Schafer, Glauber Costa,
    Kamezawa Hiroyuki, Naoya Horiguchi, linux-kernel, Andrea Arcangeli

On 12/12/2013 01:00 PM, Alex Thorlton wrote:
> This part of the patch adds a tunable to
> /sys/kernel/mm/transparent_hugepage called threshold. This threshold
> determines how many pages a user must fault in from a single node before
> a temporary compound page is turned into a THP.

> +++ b/mm/huge_memory.c
> @@ -44,6 +44,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
>  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>  	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
>
> +/* default to 1 page threshold for handing out thps; maintains old behavior */
> +static int transparent_hugepage_threshold = 1;

I assume the motivation for writing all this code is that "1" was not a
good value in your tests.

That makes me wonder, why should 1 be the default value with your
patches? If there is a better value, why should we not use that?

What is the upside of using a better value? What is the downside?

Is there a value that would bound the downside, so it is almost always
smaller than the upside?

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC PATCH 2/3] Add tunable to control THP behavior
  2013-12-12 21:37 ` Rik van Riel
@ 2013-12-12 23:17   ` Alex Thorlton
  -1 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-12 23:17 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-mm, Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
    Wanpeng Li, Mel Gorman, Michel Lespinasse, Benjamin LaHaise,
    Oleg Nesterov, Eric W. Biederman, Andy Lutomirski, Al Viro,
    David Rientjes, Zhang Yanfei, Peter Zijlstra, Johannes Weiner,
    Michal Hocko, Jiang Liu, Cody P Schafer, Glauber Costa,
    Kamezawa Hiroyuki, Naoya Horiguchi, linux-kernel, Andrea Arcangeli

On Thu, Dec 12, 2013 at 04:37:11PM -0500, Rik van Riel wrote:
> On 12/12/2013 01:00 PM, Alex Thorlton wrote:
> > This part of the patch adds a tunable to
> > /sys/kernel/mm/transparent_hugepage called threshold. This threshold
> > determines how many pages a user must fault in from a single node before
> > a temporary compound page is turned into a THP.
>
> > +++ b/mm/huge_memory.c
> > @@ -44,6 +44,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
> >  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
> >  	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
> >
> > +/* default to 1 page threshold for handing out thps; maintains old behavior */
> > +static int transparent_hugepage_threshold = 1;
>
> I assume the motivation for writing all this code is that "1"
> was not a good value in your tests.

Yes, that's correct.

> That makes me wonder, why should 1 be the default value with
> your patches?

The main reason I set the default to 1 was that the majority of jobs
aren't hurt by the existing THP behavior. I figured it would be best to
default to having things behave the same as they do now, but provide
the option to increase the threshold on systems that run jobs that
could be adversely affected by the current behavior.

> If there is a better value, why should we not use that?
>
> What is the upside of using a better value?
>
> What is the downside?

The problem here is that the "better" value can vary greatly depending
on how a particular task allocates memory. Setting the threshold too
high can negatively affect the performance of jobs that behave well
with the current behavior; setting it too low won't yield a performance
increase for the jobs that are hurt by the current behavior.

With some more thorough testing, I'm sure we could arrive at a value
that helps out jobs which behave poorly under current conditions, while
having a minimal effect on jobs that already perform well. At this
point, I'm looking more to ensure that everybody likes this approach to
solving the problem before putting the finishing touches on the patches
and doing the testing to find a good middle ground.

> Is there a value that would bound the downside, so it is almost always
> smaller than the upside?

Again, the problem here is that, to find a good value, we have to know
quite a bit about why a particular value is bad for a particular job.
While, as stated above, I think we can probably find a good middle
ground to use as a default, in the end it will be the job of individual
sysadmins to determine what value works best for their particular
applications, and to tune things accordingly.

- Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread
* [RFC PATCH 3/3] Change THP behavior [not found] <cover.1386790423.git.athorlton@sgi.com> @ 2013-12-12 18:00 ` Alex Thorlton 2013-12-12 18:00 ` Alex Thorlton 2013-12-12 18:00 ` Alex Thorlton 2 siblings, 0 replies; 24+ messages in thread From: Alex Thorlton @ 2013-12-12 18:00 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi, linux-kernel, linux-mm This patch implements the functionality we're really going for here. It adds the decision making behavior to determine when to grab a temporary compound page, and whether or not to fault in single pages or to turn the temporary page into a THP. This one is rather large, might split it up a bit more for later versions I've left most of my comments in here just to provide people with some insight into what I may have been thinking when I chose to do something in a certain way. These will probably be trimmed down in later versions of the patch. Signed-off-by: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nate Zimmer <nzimmer@sgi.com> Cc: Cliff Wickman <cpw@sgi.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. 
Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Rientjes <rientjes@google.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- include/linux/huge_mm.h | 6 + include/linux/mm_types.h | 13 +++ kernel/fork.c | 1 + mm/huge_memory.c | 283 +++++++++++++++++++++++++++++++++++++++++++++++ mm/internal.h | 1 + mm/memory.c | 29 ++++- mm/page_alloc.c | 66 ++++++++++- 7 files changed, 392 insertions(+), 7 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0943b1b6..c1e407d 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -5,6 +5,12 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, unsigned int flags); +extern struct temp_hugepage *find_pmd_mm_freelist(struct mm_struct *mm, + pmd_t *pmd); +extern int do_huge_pmd_temp_page(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags); extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *vma); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index b5efa23..d48c6ab 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -322,6 +322,17 @@ struct mm_rss_stat { atomic_long_t count[NR_MM_COUNTERS]; }; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +struct temp_hugepage { + pmd_t *pmd; + struct page *page; + spinlock_t 
temp_hugepage_lock; + int node; /* node id of the first page in the chunk */ + int ref_count; /* number of pages faulted in from the chunk */ + struct list_head list; /* pointers to next/prev chunks */ +}; +#endif + struct kioctx_table; struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ @@ -408,7 +419,9 @@ struct mm_struct { #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE pgtable_t pmd_huge_pte; /* protected by page_table_lock */ + spinlock_t thp_list_lock; /* lock to protect thp_temp_list */ int thp_threshold; + struct list_head thp_temp_list; /* list of 512 page chunks for THPs */ #endif #ifdef CONFIG_CPUMASK_OFFSTACK struct cpumask cpumask_allocation; diff --git a/kernel/fork.c b/kernel/fork.c index 086fe73..a3ccf857 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -816,6 +816,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk) #ifdef CONFIG_TRANSPARENT_HUGEPAGE mm->pmd_huge_pte = NULL; + INIT_LIST_HEAD(&mm->thp_temp_list); #endif #ifdef CONFIG_NUMA_BALANCING mm->first_nid = NUMA_PTE_SCAN_INIT; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5d388e4..43ea095 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -788,6 +788,20 @@ static inline struct page *alloc_hugepage_vma(int defrag, HPAGE_PMD_ORDER, vma, haddr, nd); } +static inline gfp_t alloc_temp_hugepage_gfpmask(gfp_t extra_gfp) +{ + return GFP_TEMP_TRANSHUGE | extra_gfp; +} + +static inline struct page *alloc_temp_hugepage_vma(int defrag, + struct vm_area_struct *vma, + unsigned long haddr, int nd, + gfp_t extra_gfp) +{ + return alloc_pages_vma(alloc_temp_hugepage_gfpmask(extra_gfp), + HPAGE_PMD_ORDER, vma, haddr, nd); +} + #ifndef CONFIG_NUMA static inline struct page *alloc_hugepage(int defrag) { @@ -871,6 +885,275 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, return 0; } +/* We need to hold mm->thp_list_lock during this search */ +struct temp_hugepage *find_pmd_mm_freelist(struct mm_struct *mm, pmd_t *pmd) +{ + struct temp_hugepage 
*temp_thp; + /* + * we need to check to make sure that the PMD isn't already + * on the list. return the temp_hugepage struct if we find one + * otherwise we just return NULL + */ + list_for_each_entry(temp_thp, &mm->thp_temp_list, list) { + if (temp_thp->pmd == pmd) { + return temp_thp; + } + } + + return NULL; +} + +int do_huge_pmd_temp_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags) +{ + int i; + spinlock_t *ptl; + struct page *page; + pte_t *pte; + pte_t entry; + struct temp_hugepage *temp_thp; + unsigned long haddr = address & HPAGE_PMD_MASK; + + if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) + return VM_FAULT_FALLBACK; + if (unlikely(anon_vma_prepare(vma))) + return VM_FAULT_OOM; + if (unlikely(khugepaged_enter(vma))) + return VM_FAULT_OOM; + /* + * we're not going to handle this case yet, for now + * we'll just fall back to regular pages + */ + if (!(flags & FAULT_FLAG_WRITE) && + transparent_hugepage_use_zero_page()) { + pgtable_t pgtable; + struct page *zero_page; + bool set; + pgtable = pte_alloc_one(mm, haddr); + if (unlikely(!pgtable)) + return VM_FAULT_OOM; + zero_page = get_huge_zero_page(); + if (unlikely(!zero_page)) { + pte_free(mm, pgtable); + count_vm_event(THP_FAULT_FALLBACK); + return VM_FAULT_FALLBACK; + } + spin_lock(&mm->page_table_lock); + set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd, + zero_page); + spin_unlock(&mm->page_table_lock); + if (!set) { + pte_free(mm, pgtable); + put_huge_zero_page(); + } + return 0; + } + + /* + * Here's where we either need to store the PMD on the list + * and give them a regular page, or make the decision to flip + * the PSE bit and send them back with a hugepage + * + * + First we call find_pmd_mm_freelist to determine if the pmd + * we're interested in has already been faulted into + */ + spin_lock(&mm->thp_list_lock); + temp_thp = find_pmd_mm_freelist(mm, pmd); + + /* this is a temporary workaround to avoid 
putting the pages back on the freelist */ + if (temp_thp && temp_thp->node == -1) { + spin_unlock(&mm->thp_list_lock); + goto single_fault; + } + + /* + * we need to create a list entry and add it to the + * new per-mm free list if we didn't find an existing + * entry + */ + if (!temp_thp && pmd_none(*pmd)) { + /* try to get 512 pages from the freelist */ + page = alloc_temp_hugepage_vma(transparent_hugepage_defrag(vma), + vma, haddr, numa_node_id(), 0); + + if (unlikely(!page)) { + /* we should probably change the VM event here? */ + count_vm_event(THP_FAULT_FALLBACK); + return VM_FAULT_FALLBACK; + } + + /* do this here instead of below, to get the whole page ready */ + clear_huge_page(page, haddr, HPAGE_PMD_NR); + + /* add a new temp_hugepage entry to the local freelist */ + temp_thp = kmalloc(sizeof(struct temp_hugepage), GFP_KERNEL); + if (!temp_thp) + return VM_FAULT_OOM; + temp_thp->pmd = pmd; + temp_thp->page = page; + temp_thp->node = numa_node_id(); + temp_thp->ref_count = 1; + list_add(&temp_thp->list, &mm->thp_temp_list); + /* + * otherwise we increment the reference count, and decide whether + * or not to create a THP + */ + } else if (temp_thp && !pmd_none(*pmd) && temp_thp->node == numa_node_id()) { + temp_thp->ref_count++; + /* if they allocated from a different node, they don't get a thp */ + } else if (temp_thp && !pmd_none(*pmd) && temp_thp->node != numa_node_id()) { + /* + * for now we handle this by pushing the rest of the faults through our + * custom fault code below, eventually we will want to put the unused + * pages from out temp_hugepage back on the freelist, so they can be + * faulted in by the normal code paths + */ + + temp_thp->node = -1; + } else { + spin_unlock(&mm->thp_list_lock); + return VM_FAULT_FALLBACK; + } + + spin_unlock(&mm->thp_list_lock); + + /* + * now that we've done the accounting work, we check to see if + * we've exceeded our threshold + */ + if (temp_thp->ref_count >= mm->thp_threshold) { + pmd_t pmd_entry; + 
pgtable_t pgtable; + + /* + * we'll do all of the following beneath the big ptl for now + * this will need to be modified to work with the split ptl + */ + spin_lock(&mm->page_table_lock); + + /* + * once we get past the lock we have to make sure that somebody + * else hasn't already turned this guy into a THP, if they have, + * then the page we need is already faulted in as part of the THP + * they created + */ + if (PageTransHuge(temp_thp->page)) { + spin_unlock(&mm->page_table_lock); + return 0; + } + + pgtable = pte_alloc_one(mm, haddr); + if (unlikely(!pgtable)) { + spin_unlock(&mm->page_table_lock); + return VM_FAULT_OOM; + } + + /* might wanna move this? */ + __SetPageUptodate(temp_thp->page); + + /* turn the pages into one compound page */ + make_compound_page(temp_thp->page, HPAGE_PMD_ORDER); + + /* set up the pmd */ + pmd_entry = mk_huge_pmd(temp_thp->page, vma->vm_page_prot); + pmd_entry = maybe_pmd_mkwrite(pmd_mkdirty(pmd_entry), vma); + + /* remap the new page since we cleared the mappings */ + page_add_anon_rmap(temp_thp->page, vma, address); + + /* deposit the thp */ + pgtable_trans_huge_deposit(mm, pmd, pgtable); + + set_pmd_at(mm, haddr, pmd, pmd_entry); + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR - mm->thp_threshold + 1); + /* mm->nr_ptes++; */ + + /* delete the reference to this compound page from our list */ + spin_lock(&mm->thp_list_lock); + list_del(&temp_thp->list); + spin_unlock(&mm->thp_list_lock); + + spin_unlock(&mm->page_table_lock); + return 0; + } else { +single_fault: + /* fault in the page */ + if (pmd_none(*pmd) && __pte_alloc(mm, vma, temp_thp->pmd, address)) + return VM_FAULT_OOM; + + /* + * we'll do all of the following beneath the big ptl for now + * this will need to be modified to work with the split ptl + */ + spin_lock(&mm->page_table_lock); + + page = temp_thp->page + (int) pte_index(address); + + /* set the page's refcount */ + set_page_refcounted(page); + pte = pte_offset_map(temp_thp->pmd, address); + + /* might 
wanna move this? */ + __SetPageUptodate(page); + + if (!pte_present(*pte)) { + if (pte_none(*pte)) { + pte_unmap(pte); + + if (unlikely(anon_vma_prepare(vma))) { + spin_unlock(&mm->page_table_lock); + return VM_FAULT_OOM; + } + + entry = mk_pte(page, vma->vm_page_prot); + if (vma->vm_flags & VM_WRITE) + entry = pte_mkwrite(pte_mkdirty(entry)); + + pte = pte_offset_map_lock(mm, temp_thp->pmd, address, &ptl); + + page_add_new_anon_rmap(page, vma, haddr); + add_mm_counter(mm, MM_ANONPAGES, 1); + + set_pte_at(mm, address, pte, entry); + + pte_unmap_unlock(pte, ptl); + spin_unlock(&mm->page_table_lock); + + return 0; + } + } else { + spin_unlock(&mm->page_table_lock); + return VM_FAULT_FALLBACK; + } + } + + /* I don't know what this does right now. I'm leaving it */ + if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) { + put_page(page); + count_vm_event(THP_FAULT_FALLBACK); + return VM_FAULT_FALLBACK; + } + + /* + * here's the important piece, where we actually make our 512 + * page chunk into a THP, by setting the PSE bit. This is the + * spot we really need to change. 
In the end, we could probably + * spin this up into the old function, but we'll keep them separate + * for now + */ + if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) { + mem_cgroup_uncharge_page(page); + put_page(page); + count_vm_event(THP_FAULT_FALLBACK); + return VM_FAULT_FALLBACK; + } + + /* again, probably want a different VM event here */ + count_vm_event(THP_FAULT_ALLOC); + return 0; +} + int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *vma) diff --git a/mm/internal.h b/mm/internal.h index 684f7aa..8fc296b 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -98,6 +98,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address); */ extern void __free_pages_bootmem(struct page *page, unsigned int order); extern void prep_compound_page(struct page *page, unsigned long order); +extern void make_compound_page(struct page *page, unsigned long order); #ifdef CONFIG_MEMORY_FAILURE extern bool is_free_buddy_page(struct page *page); #endif diff --git a/mm/memory.c b/mm/memory.c index d176154..014d9ba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3764,13 +3764,30 @@ retry: pmd = pmd_alloc(mm, pud, address); if (!pmd) return VM_FAULT_OOM; - if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) { + if (transparent_hugepage_enabled(vma)) { int ret = VM_FAULT_FALLBACK; - if (!vma->vm_ops) - ret = do_huge_pmd_anonymous_page(mm, vma, address, - pmd, flags); - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + /* + * This is a temporary location for this code, just to get things + * up and running. 
I'll come up with a better way to handle this + * later + */ + if (!mm->thp_threshold) + mm->thp_threshold = thp_threshold_check(); + if (!mm->thp_temp_list.next && !mm->thp_temp_list.prev) + INIT_LIST_HEAD(&mm->thp_temp_list); + if (mm->thp_threshold > 1) { + if (!vma->vm_ops) + ret = do_huge_pmd_temp_page(mm, vma, address, + pmd, flags); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } else if (pmd_none(*pmd)) { + if (!vma->vm_ops) + ret = do_huge_pmd_anonymous_page(mm, vma, address, + pmd, flags); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } else { pmd_t orig_pmd = *pmd; int ret; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dd886fa..48e13fc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -375,6 +375,65 @@ void prep_compound_page(struct page *page, unsigned long order) } } +/* + * This function is used to create a proper compound page from a chunk of + * contiguous pages, most likely allocated as a temporary hugepage + */ +void make_compound_page(struct page *page, unsigned long order) +{ + int i, max_count = 0, max_mapcount = 0; + int nr_pages = 1 << order; + + set_compound_page_dtor(page, free_compound_page); + set_compound_order(page, order); + + __SetPageHead(page); + + /* + * we clear all the mappings here, so we have to remember to set + * them back up! 
+ */ + page->mapping = NULL; + + max_count = (int) atomic_read(&page->_count); + max_mapcount = (int) atomic_read(&page->_mapcount); + + for (i = 1; i < nr_pages; i++) { + int cur_count, cur_mapcount; + struct page *p = page + i; + p->flags = 0; /* this seems dumb */ + __SetPageTail(p); + set_page_count(p, 0); + p->first_page = page; + p->mapping = NULL; + + cur_count = (int) atomic_read(&page->_count); + cur_mapcount = (int) atomic_read(&page->_mapcount); + atomic_set(&page->_count, 0); + atomic_set(&page->_mapcount, -1); + if (cur_count > max_count) + max_count = cur_count; + if (cur_mapcount > max_mapcount) + max_mapcount = cur_mapcount; + + /* + * poison the LRU entries for all the tail pages (aside from the + * first one), the entries for the head page should be okay + */ + if (i != 1) { + p->lru.next = LIST_POISON1; + p->lru.prev = LIST_POISON2; + } + } + + atomic_set(&page->_count, max_count); + /* + * we set to max_mapcount - 1 here because we're going to + * map this page again later. This definitely doesn't seem right. + */ + atomic_set(&page->_mapcount, max_mapcount - 1); +} + /* update __split_huge_page_refcount if you change this function */ static int destroy_compound_page(struct page *page, unsigned long order) { @@ -865,7 +924,12 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) } set_page_private(page, 0); - set_page_refcounted(page); + /* + * We don't want to set _count for temporary compound pages, since + * we may not immediately fault in the first page + */ + if (!(gfp_flags & __GFP_COMP_TEMP)) + set_page_refcounted(page); arch_alloc_page(page, order); kernel_map_pages(page, 1 << order, 1); -- 1.7.12.4 ^ permalink raw reply related [flat|nested] 24+ messages in thread
pgtable_t pgtable; + + /* + * we'll do all of the following beneath the big ptl for now + * this will need to be modified to work with the split ptl + */ + spin_lock(&mm->page_table_lock); + + /* + * once we get past the lock we have to make sure that somebody + * else hasn't already turned this guy into a THP, if they have, + * then the page we need is already faulted in as part of the THP + * they created + */ + if (PageTransHuge(temp_thp->page)) { + spin_unlock(&mm->page_table_lock); + return 0; + } + + pgtable = pte_alloc_one(mm, haddr); + if (unlikely(!pgtable)) { + spin_unlock(&mm->page_table_lock); + return VM_FAULT_OOM; + } + + /* might wanna move this? */ + __SetPageUptodate(temp_thp->page); + + /* turn the pages into one compound page */ + make_compound_page(temp_thp->page, HPAGE_PMD_ORDER); + + /* set up the pmd */ + pmd_entry = mk_huge_pmd(temp_thp->page, vma->vm_page_prot); + pmd_entry = maybe_pmd_mkwrite(pmd_mkdirty(pmd_entry), vma); + + /* remap the new page since we cleared the mappings */ + page_add_anon_rmap(temp_thp->page, vma, address); + + /* deposit the thp */ + pgtable_trans_huge_deposit(mm, pmd, pgtable); + + set_pmd_at(mm, haddr, pmd, pmd_entry); + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR - mm->thp_threshold + 1); + /* mm->nr_ptes++; */ + + /* delete the reference to this compound page from our list */ + spin_lock(&mm->thp_list_lock); + list_del(&temp_thp->list); + spin_unlock(&mm->thp_list_lock); + + spin_unlock(&mm->page_table_lock); + return 0; + } else { +single_fault: + /* fault in the page */ + if (pmd_none(*pmd) && __pte_alloc(mm, vma, temp_thp->pmd, address)) + return VM_FAULT_OOM; + + /* + * we'll do all of the following beneath the big ptl for now + * this will need to be modified to work with the split ptl + */ + spin_lock(&mm->page_table_lock); + + page = temp_thp->page + (int) pte_index(address); + + /* set the page's refcount */ + set_page_refcounted(page); + pte = pte_offset_map(temp_thp->pmd, address); + + /* might 
wanna move this? */ + __SetPageUptodate(page); + + if (!pte_present(*pte)) { + if (pte_none(*pte)) { + pte_unmap(pte); + + if (unlikely(anon_vma_prepare(vma))) { + spin_unlock(&mm->page_table_lock); + return VM_FAULT_OOM; + } + + entry = mk_pte(page, vma->vm_page_prot); + if (vma->vm_flags & VM_WRITE) + entry = pte_mkwrite(pte_mkdirty(entry)); + + pte = pte_offset_map_lock(mm, temp_thp->pmd, address, &ptl); + + page_add_new_anon_rmap(page, vma, haddr); + add_mm_counter(mm, MM_ANONPAGES, 1); + + set_pte_at(mm, address, pte, entry); + + pte_unmap_unlock(pte, ptl); + spin_unlock(&mm->page_table_lock); + + return 0; + } + } else { + spin_unlock(&mm->page_table_lock); + return VM_FAULT_FALLBACK; + } + } + + /* I don't know what this does right now. I'm leaving it */ + if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) { + put_page(page); + count_vm_event(THP_FAULT_FALLBACK); + return VM_FAULT_FALLBACK; + } + + /* + * here's the important piece, where we actually make our 512 + * page chunk into a THP, by setting the PSE bit. This is the + * spot we really need to change. 
In the end, we could probably + * spin this up into the old function, but we'll keep them separate + * for now + */ + if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) { + mem_cgroup_uncharge_page(page); + put_page(page); + count_vm_event(THP_FAULT_FALLBACK); + return VM_FAULT_FALLBACK; + } + + /* again, probably want a different VM event here */ + count_vm_event(THP_FAULT_ALLOC); + return 0; +} + int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *vma) diff --git a/mm/internal.h b/mm/internal.h index 684f7aa..8fc296b 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -98,6 +98,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address); */ extern void __free_pages_bootmem(struct page *page, unsigned int order); extern void prep_compound_page(struct page *page, unsigned long order); +extern void make_compound_page(struct page *page, unsigned long order); #ifdef CONFIG_MEMORY_FAILURE extern bool is_free_buddy_page(struct page *page); #endif diff --git a/mm/memory.c b/mm/memory.c index d176154..014d9ba 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3764,13 +3764,30 @@ retry: pmd = pmd_alloc(mm, pud, address); if (!pmd) return VM_FAULT_OOM; - if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) { + if (transparent_hugepage_enabled(vma)) { int ret = VM_FAULT_FALLBACK; - if (!vma->vm_ops) - ret = do_huge_pmd_anonymous_page(mm, vma, address, - pmd, flags); - if (!(ret & VM_FAULT_FALLBACK)) - return ret; + /* + * This is a temporary location for this code, just to get things + * up and running. 
I'll come up with a better way to handle this + * later + */ + if (!mm->thp_threshold) + mm->thp_threshold = thp_threshold_check(); + if (!mm->thp_temp_list.next && !mm->thp_temp_list.prev) + INIT_LIST_HEAD(&mm->thp_temp_list); + if (mm->thp_threshold > 1) { + if (!vma->vm_ops) + ret = do_huge_pmd_temp_page(mm, vma, address, + pmd, flags); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } else if (pmd_none(*pmd)) { + if (!vma->vm_ops) + ret = do_huge_pmd_anonymous_page(mm, vma, address, + pmd, flags); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } } else { pmd_t orig_pmd = *pmd; int ret; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dd886fa..48e13fc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -375,6 +375,65 @@ void prep_compound_page(struct page *page, unsigned long order) } } +/* + * This function is used to create a proper compound page from a chunk of + * contiguous pages, most likely allocated as a temporary hugepage + */ +void make_compound_page(struct page *page, unsigned long order) +{ + int i, max_count = 0, max_mapcount = 0; + int nr_pages = 1 << order; + + set_compound_page_dtor(page, free_compound_page); + set_compound_order(page, order); + + __SetPageHead(page); + + /* + * we clear all the mappings here, so we have to remember to set + * them back up! 
+ */ + page->mapping = NULL; + + max_count = (int) atomic_read(&page->_count); + max_mapcount = (int) atomic_read(&page->_mapcount); + + for (i = 1; i < nr_pages; i++) { + int cur_count, cur_mapcount; + struct page *p = page + i; + p->flags = 0; /* this seems dumb */ + __SetPageTail(p); + set_page_count(p, 0); + p->first_page = page; + p->mapping = NULL; + + cur_count = (int) atomic_read(&page->_count); + cur_mapcount = (int) atomic_read(&page->_mapcount); + atomic_set(&page->_count, 0); + atomic_set(&page->_mapcount, -1); + if (cur_count > max_count) + max_count = cur_count; + if (cur_mapcount > max_mapcount) + max_mapcount = cur_mapcount; + + /* + * poison the LRU entries for all the tail pages (aside from the + * first one), the entries for the head page should be okay + */ + if (i != 1) { + p->lru.next = LIST_POISON1; + p->lru.prev = LIST_POISON2; + } + } + + atomic_set(&page->_count, max_count); + /* + * we set to max_mapcount - 1 here because we're going to + * map this page again later. This definitely doesn't seem right. + */ + atomic_set(&page->_mapcount, max_mapcount - 1); +} + /* update __split_huge_page_refcount if you change this function */ static int destroy_compound_page(struct page *page, unsigned long order) { @@ -865,7 +924,12 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) } set_page_private(page, 0); - set_page_refcounted(page); + /* + * We don't want to set _count for temporary compound pages, since + * we may not immediately fault in the first page + */ + if (!(gfp_flags & __GFP_COMP_TEMP)) + set_page_refcounted(page); arch_alloc_page(page, order); kernel_map_pages(page, 1 << order, 1); -- 1.7.12.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 24+ messages in thread
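[Editorial note: the fault-counting scheme in do_huge_pmd_temp_page() above — count distinct 4K faults under one PMD, then promote the chunk to a huge page once mm->thp_threshold is reached — can be sketched as a userspace toy model. This is an illustration only; the names (temp_chunk, fault_page) are invented here, and all of the real locking, page-table, and refcounting work is omitted.]

```c
#include <assert.h>

#define HPAGE_SHIFT  21                                 /* 2MB huge page */
#define PAGE_SHIFT   12                                 /* 4KB base page */
#define PTES_PER_PMD (1 << (HPAGE_SHIFT - PAGE_SHIFT))  /* 512 */

/* one tracked 512-page chunk, loosely modeled on struct temp_hugepage */
struct temp_chunk {
    unsigned char faulted[PTES_PER_PMD]; /* which 4K pages are mapped */
    int ref_count;                       /* pages faulted in so far */
    int promoted;                        /* chunk became a "real" THP */
};

/* fault one 4K page; returns 1 if this fault promotes the chunk */
static int fault_page(struct temp_chunk *c, unsigned long addr, int threshold)
{
    unsigned long idx = (addr >> PAGE_SHIFT) & (PTES_PER_PMD - 1);

    if (c->promoted)
        return 0;                    /* region already mapped huge */
    if (!c->faulted[idx]) {          /* count each 4K page only once */
        c->faulted[idx] = 1;
        c->ref_count++;
    }
    if (c->ref_count >= threshold) { /* the mm->thp_threshold check */
        c->promoted = 1;             /* stand-in for make_compound_page() */
        return 1;
    }
    return 0;
}
```

With a threshold of 3, the third distinct 4K fault within the 2MB region promotes the chunk; repeated faults of the same page do not advance the count.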
* Re: [RFC PATCH 3/3] Change THP behavior
  2013-12-12 18:00 ` Alex Thorlton
@ 2013-12-13 13:13 ` Kirill A. Shutemov
  0 siblings, 0 replies; 24+ messages in thread
From: Kirill A. Shutemov @ 2013-12-13 13:13 UTC (permalink / raw)
To: Alex Thorlton
Cc: linux-mm, Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel, linux-mm

Alex Thorlton wrote:
> +	/*
> +	 * now that we've done the accounting work, we check to see if
> +	 * we've exceeded our threshold
> +	 */
> +	if (temp_thp->ref_count >= mm->thp_threshold) {
> +		pmd_t pmd_entry;
> +		pgtable_t pgtable;
> +
> +		/*
> +		 * we'll do all of the following beneath the big ptl for now
> +		 * this will need to be modified to work with the split ptl
> +		 */
> +		spin_lock(&mm->page_table_lock);
> +
> +		/*
> +		 * once we get past the lock we have to make sure that somebody
> +		 * else hasn't already turned this guy into a THP, if they have,
> +		 * then the page we need is already faulted in as part of the THP
> +		 * they created
> +		 */
> +		if (PageTransHuge(temp_thp->page)) {
> +			spin_unlock(&mm->page_table_lock);
> +			return 0;
> +		}
> +
> +		pgtable = pte_alloc_one(mm, haddr);
> +		if (unlikely(!pgtable)) {
> +			spin_unlock(&mm->page_table_lock);
> +			return VM_FAULT_OOM;
> +		}
> +
> +		/* might wanna move this? */
> +		__SetPageUptodate(temp_thp->page);
> +
> +		/* turn the pages into one compound page */
> +		make_compound_page(temp_thp->page, HPAGE_PMD_ORDER);
> +
> +		/* set up the pmd */
> +		pmd_entry = mk_huge_pmd(temp_thp->page, vma->vm_page_prot);
> +		pmd_entry = maybe_pmd_mkwrite(pmd_mkdirty(pmd_entry), vma);
> +
> +		/* remap the new page since we cleared the mappings */
> +		page_add_anon_rmap(temp_thp->page, vma, address);
> +
> +		/* deposit the thp */
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +
> +		set_pmd_at(mm, haddr, pmd, pmd_entry);
> +		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR - mm->thp_threshold + 1);
> +		/* mm->nr_ptes++; */
> +
> +		/* delete the reference to this compound page from our list */
> +		spin_lock(&mm->thp_list_lock);
> +		list_del(&temp_thp->list);
> +		spin_unlock(&mm->thp_list_lock);
> +
> +		spin_unlock(&mm->page_table_lock);
> +		return 0;

Hm. I think this part is not correct: you collapse the temp thp page into
a real one only for the current process. What will happen if a process
with temp thp pages is forked?

And I don't think this problem is an easy one. khugepaged can't collapse
pages with page->_count != 1 for the same reason: to do it properly you
need to take mmap_sem for all processes and collapse all pages at once.
And if a page is pinned, we also can't collapse.

Sorry, I don't think the whole idea has much potential. :(

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 24+ messages in thread
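[Editorial note: the collapse constraint Kirill describes — a chunk may only be collapsed into a compound page if no constituent page carries extra references — can be modeled in a few lines of plain C. This is a hypothetical sketch (can_collapse is not a kernel function); it only illustrates why a single pinned or fork-shared page blocks the whole collapse.]

```c
#include <assert.h>

#define NR_SMALL_PAGES 512  /* 4K pages in one 2MB chunk */

/*
 * Toy version of the khugepaged-style check: refcount == 1 means the page
 * is mapped exactly once by the faulting process, refcount == 0 means it
 * was never faulted in. Any page with refcount > 1 (a get_user_pages()
 * pin, or a mapping in a forked child) makes the collapse unsafe.
 */
static int can_collapse(const int refcount[NR_SMALL_PAGES])
{
    for (int i = 0; i < NR_SMALL_PAGES; i++) {
        if (refcount[i] > 1) /* pinned or shared: cannot collapse */
            return 0;
    }
    return 1;
}
```

The real kernel check must additionally hold mmap_sem for every process mapping the pages, which is exactly the difficulty raised above.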
* Re: [RFC PATCH 3/3] Change THP behavior
  2013-12-13 13:13 ` Kirill A. Shutemov
@ 2013-12-16 17:37 ` Alex Thorlton
  0 siblings, 0 replies; 24+ messages in thread
From: Alex Thorlton @ 2013-12-16 17:37 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: linux-mm, Andrew Morton, Benjamin Herrenschmidt, Rik van Riel,
	Wanpeng Li, Mel Gorman, Michel Lespinasse, Benjamin LaHaise,
	Oleg Nesterov, Eric W. Biederman, Andy Lutomirski, Al Viro,
	David Rientjes, Zhang Yanfei, Peter Zijlstra, Johannes Weiner,
	Michal Hocko, Jiang Liu, Cody P Schafer, Glauber Costa,
	Kamezawa Hiroyuki, Naoya Horiguchi, linux-kernel

> Hm. I think this part is not correct: you collapse temp thp page
> into real one only for current procees. What will happen if a process with
> temp thp pages was forked?

That's a scenario that I hadn't yet addressed, but it is definitely
something I'll consider going forward.  I think we can come up with a
way to handle that situation appropriately.

> And I don't think this problem is an easy one. khugepaged can't collapse
> pages with page->_count != 1 for the same reason: to make it properly you
> need to take mmap_sem for all processes and collapse all pages at once.
> And if a page is pinned, we also can't collapse.

Again, a few things here that I hadn't taken into account.  I'll look
for a way to address these concerns.

> Sorry, I don't think the whole idea has much potential. :(

I understand that there are some issues with this initial pass at the
idea, but I think we can get things corrected and come up with a
workable solution.  When it comes down to it, there are relatively few
options for correcting the issue I'm focusing on here, and this one
seems to be the most appropriate of those we've tried so far.  We've
looked at disabling THP on a per-cpuset/per-process basis, and that has
been met with fairly strong resistance, for good reason.  I'll take
another pass at this and hopefully be able to address some of your
concerns with the idea.

Thanks for taking the time to look the patch over!

- Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC PATCH 3/3] Change THP behavior
  2013-12-12 18:00 ` Alex Thorlton
@ 2013-12-13 18:12 ` Oleg Nesterov
  0 siblings, 0 replies; 24+ messages in thread
From: Oleg Nesterov @ 2013-12-13 18:12 UTC (permalink / raw)
To: Alex Thorlton
Cc: linux-mm, Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Eric W. Biederman, Andy Lutomirski, Al Viro,
	David Rientjes, Zhang Yanfei, Peter Zijlstra, Johannes Weiner,
	Michal Hocko, Jiang Liu, Cody P Schafer, Glauber Costa,
	Kamezawa Hiroyuki, Naoya Horiguchi, linux-kernel

I know almost nothing about thp, so it is unlikely that I understand
this patch correctly... But, afaics, the main idea is that until we have
mm->thp_threshold faults we install the tail pages of temp_hugepage->page
as normal anonymous pages, and then we actually add the Head/Tail
metadata and the necessary huge_pmd/etc.

I simply can't understand how this can work until make_compound_page()
is called.

Just for example, what happens after sys_munmap()?  If nothing else, who
will find and free the temp_hugepage connected to this area?

Or, what if sys_mremap() moves this vma?  find_pmd_mm_freelist() won't
find the right temp_thp after that?  Or split_vma, etc.

And make_compound_page() itself looks suspicious.

On 12/12, Alex Thorlton wrote:
>
> +void make_compound_page(struct page *page, unsigned long order)
> +{
> +	int i, max_count = 0, max_mapcount = 0;
> +	int nr_pages = 1 << order;
> +
> +	set_compound_page_dtor(page, free_compound_page);
> +	set_compound_order(page, order);
> +
> +	__SetPageHead(page);
> +
> +	/*
> +	 * we clear all the mappings here, so we have to remember to set
> +	 * them back up!
> +	 */
> +	page->mapping = NULL;
> +
> +	max_count = (int) atomic_read(&page->_count);
> +	max_mapcount = (int) atomic_read(&page->_mapcount);
> +
> +	for (i = 1; i < nr_pages; i++) {
> +		int cur_count, cur_mapcount;
> +		struct page *p = page + i;
> +		p->flags = 0;	/* this seems dumb */
> +		__SetPageTail(p);

Just for example, what if put_page(p) or get_page(p) is called after
__SetPageTail()?  Afaics, this page was already visible to, say,
get_user_pages() and it can have external references.

In fact everything else looks suspicious too, but let me repeat that I
do not really understand this code.  So please do not count this as
review, but perhaps the changelog should tell more to explain what this
patch actually does and how this all should work?

Oleg.

^ permalink raw reply	[flat|nested] 24+ messages in thread
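[Editorial note: the Head/Tail metadata Oleg is questioning — tail pages pointing back at a head page, with lookups resolving through that link — can be sketched in userspace. toy_page and the helpers below are invented names; this deliberately ignores the refcount and flag subtleties that make the real in-place conversion dangerous.]

```c
#include <assert.h>
#include <stddef.h>

/* minimal stand-in for struct page: just the compound-page fields */
struct toy_page {
    int is_head;
    int is_tail;
    struct toy_page *first_page; /* tail pages point back at the head */
};

/* rough analogue of make_compound_page(): link all tails to the head */
static void toy_make_compound(struct toy_page *pages, int nr)
{
    pages[0].is_head = 1;
    for (int i = 1; i < nr; i++) {
        pages[i].is_tail = 1;
        pages[i].first_page = &pages[0];
    }
}

/* analogue of compound_head(): any page in the chunk resolves to its head */
static struct toy_page *toy_compound_head(struct toy_page *p)
{
    return p->is_tail ? p->first_page : p;
}
```

Oleg's point, in these terms: once is_tail is set, any caller still holding a direct reference to a tail page (a get_user_pages() user, or put_page()) suddenly resolves to the head instead, so the conversion is only safe if no such references can exist.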
end of thread, other threads:[~2013-12-16 17:37 UTC | newest]

Thread overview: 24+ messages
     [not found] <cover.1386790423.git.athorlton@sgi.com>
2013-12-12 18:00 ` [RFC PATCH 1/3] Add flags for temporary compound pages Alex Thorlton
2013-12-12 18:00 ` [RFC PATCH 2/3] Add tunable to control THP behavior Alex Thorlton
2013-12-12 20:11   ` Andy Lutomirski
2013-12-12 20:49     ` Alex Thorlton
2013-12-12 20:52       ` Andy Lutomirski
2013-12-12 21:04         ` Alex Thorlton
2013-12-12 21:37   ` Rik van Riel
2013-12-12 23:17     ` Alex Thorlton
2013-12-12 18:00 ` [RFC PATCH 3/3] Change THP behavior Alex Thorlton
2013-12-13 13:13   ` Kirill A. Shutemov
2013-12-16 17:37     ` Alex Thorlton
2013-12-13 18:12   ` Oleg Nesterov