* [PATCH 0/4] Reclaim page capture v3
From: Andy Whitcroft @ 2008-09-05 10:19 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman, Andy Whitcroft

For some time we have been looking at mechanisms for improving the availability
of larger allocations under load.  One of the options we have explored is
capturing pages freed under direct reclaim, in order to increase the chances
that free pages coalesce before they are subject to reallocation by racing
allocators.

Following this email is a patch stack implementing page capture during
direct reclaim.  It consists of four patches.  The first two simply pull
existing code out into helpers for reuse.  The third makes buddy's use of
struct page explicit.  The fourth contains the meat of the changes, and
its leader contains a much fuller description of the feature.

This update represents a rebase to -mm and incorporates feedback from
KOSAKI Motohiro.  It also incorporates an accounting fix which was
preventing some captures.

I have done a lot of comparative testing with and without this patch set,
and in broad brush I am seeing improvements in huge page allocation success
(at worst case sizes) on all of my test systems.  These tests consist of
placing a constant stream of high order allocations on the system, at
varying rates.  The results for these various runs are then averaged to
give an overall improvement:

	            Absolute  Effective
	x86-64         2.48%      4.58%
	powerpc        5.55%     25.22%

x86-64 has a relatively small huge page size and so is always much more
effective at allocating huge pages.  Even there we get a measurable
improvement.  On powerpc the huge pages are much larger and much harder
to recover.  Here we see a full 25% increase in page recovery.
It should be noted that this is worst case testing, and it is very
aggressive, taking every possible page in the system.  It would be helpful
to get wider testing in -mm.

Against: 2.6.27-rc1-mm1

Andrew, please consider for -mm.

-apw

Changes since V2:
 - Incorporates review feedback from Christoph Lameter,
 - Incorporates review feedback from Peter Zijlstra, and
 - Checkpatch fixes.

Changes since V1:
 - Incorporates review feedback from KOSAKI Motohiro,
 - fixes up accounting when checking watermarks for captured pages,
 - rebase 2.6.27-rc1-mm1,
 - Incorporates review feedback from Mel.

Andy Whitcroft (4):
  pull out the page pre-release and sanity check logic for reuse
  pull out zone cpuset and watermark checks for reuse
  buddy: explicitly identify buddy field use in struct page
  capture pages freed during direct reclaim for allocation by the reclaimer

 include/linux/mm_types.h   |    4 +
 include/linux/page-flags.h |    4 +
 mm/internal.h              |    8 +-
 mm/page_alloc.c            |  263 ++++++++++++++++++++++++++++++++++++++------
 mm/vmscan.c                |  115 ++++++++++++++++----
 5 files changed, 338 insertions(+), 56 deletions(-)
* [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse 2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft @ 2008-09-05 10:19 ` Andy Whitcroft 2008-09-08 14:11 ` MinChan Kim 2008-09-05 10:20 ` [PATCH 2/4] pull out zone cpuset and watermark checks " Andy Whitcroft ` (3 subsequent siblings) 4 siblings, 1 reply; 16+ messages in thread From: Andy Whitcroft @ 2008-09-05 10:19 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman, Andy Whitcroft When we are about to release a page we perform a number of actions on that page. We clear down any anonymous mappings, confirm that the page is safe to release, check for freeing locks, before mapping the page should that be required. Pull this processing out into a helper function for reuse in a later patch. Note that we do not convert the similar cleardown in free_hot_cold_page() as the optimiser is unable to squash the loops during the inline. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> --- mm/page_alloc.c | 43 ++++++++++++++++++++++++++++++------------- 1 files changed, 30 insertions(+), 13 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f52fcf1..b2a2c2b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page) } /* + * Prepare this page for release to the buddy. Sanity check the page. + * Returns 1 if the page is safe to free. 
+ */ +static inline int free_page_prepare(struct page *page, int order) +{ + int i; + int reserved = 0; + + if (PageAnon(page)) + page->mapping = NULL; + + for (i = 0 ; i < (1 << order) ; ++i) + reserved += free_pages_check(page + i); + if (reserved) + return 0; + + if (!PageHighMem(page)) { + debug_check_no_locks_freed(page_address(page), + PAGE_SIZE << order); + debug_check_no_obj_freed(page_address(page), + PAGE_SIZE << order); + } + arch_free_page(page, order); + kernel_map_pages(page, 1 << order, 0); + + return 1; +} + +/* * Frees a list of pages. * Assumes all pages on list are in same zone, and of same order. * count is the number of pages to free. @@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order) static void __free_pages_ok(struct page *page, unsigned int order) { unsigned long flags; - int i; - int reserved = 0; - for (i = 0 ; i < (1 << order) ; ++i) - reserved += free_pages_check(page + i); - if (reserved) + if (!free_page_prepare(page, order)) return; - if (!PageHighMem(page)) { - debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order); - debug_check_no_obj_freed(page_address(page), - PAGE_SIZE << order); - } - arch_free_page(page, order); - kernel_map_pages(page, 1 << order, 0); - local_irq_save(flags); __count_vm_events(PGFREE, 1 << order); free_one_page(page_zone(page), page, order); -- 1.6.0.rc1.258.g80295 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse 2008-09-05 10:19 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft @ 2008-09-08 14:11 ` MinChan Kim 2008-09-08 15:14 ` Andy Whitcroft 0 siblings, 1 reply; 16+ messages in thread From: MinChan Kim @ 2008-09-08 14:11 UTC (permalink / raw) To: Andy Whitcroft Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Fri, Sep 5, 2008 at 7:19 PM, Andy Whitcroft <apw@shadowen.org> wrote: > When we are about to release a page we perform a number of actions > on that page. We clear down any anonymous mappings, confirm that > the page is safe to release, check for freeing locks, before mapping > the page should that be required. Pull this processing out into a > helper function for reuse in a later patch. > > Note that we do not convert the similar cleardown in free_hot_cold_page() > as the optimiser is unable to squash the loops during the inline. > > Signed-off-by: Andy Whitcroft <apw@shadowen.org> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > Reviewed-by: Rik van Riel <riel@redhat.com> > --- > mm/page_alloc.c | 43 ++++++++++++++++++++++++++++++------------- > 1 files changed, 30 insertions(+), 13 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index f52fcf1..b2a2c2b 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page) > } > > /* > + * Prepare this page for release to the buddy. Sanity check the page. > + * Returns 1 if the page is safe to free. > + */ > +static inline int free_page_prepare(struct page *page, int order) > +{ > + int i; > + int reserved = 0; > + > + if (PageAnon(page)) > + page->mapping = NULL; Why do you need to clear down anonymous mapping ? 
I think that if you are not converting the similar cleardown in
free_hot_cold_page(), you don't need this here.  And if you do clear the
mapping at this point, bad_page() cannot do its job.

> +	for (i = 0 ; i < (1 << order) ; ++i)
> +		reserved += free_pages_check(page + i);
> +	if (reserved)
> +		return 0;
> +
> +	if (!PageHighMem(page)) {
> +		debug_check_no_locks_freed(page_address(page),
> +					   PAGE_SIZE << order);
> +		debug_check_no_obj_freed(page_address(page),
> +					   PAGE_SIZE << order);
> +	}
> +	arch_free_page(page, order);
> +	kernel_map_pages(page, 1 << order, 0);
> +
> +	return 1;
> +}
> +
> +/*
>  * Frees a list of pages.
>  * Assumes all pages on list are in same zone, and of same order.
>  * count is the number of pages to free.
> @@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
>  static void __free_pages_ok(struct page *page, unsigned int order)
>  {
>  	unsigned long flags;
> -	int i;
> -	int reserved = 0;
>
> -	for (i = 0 ; i < (1 << order) ; ++i)
> -		reserved += free_pages_check(page + i);
> -	if (reserved)
> +	if (!free_page_prepare(page, order))
>  		return;
>
> -	if (!PageHighMem(page)) {
> -		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
> -		debug_check_no_obj_freed(page_address(page),
> -					 PAGE_SIZE << order);
> -	}
> -	arch_free_page(page, order);
> -	kernel_map_pages(page, 1 << order, 0);
> -
>  	local_irq_save(flags);
>  	__count_vm_events(PGFREE, 1 << order);
>  	free_one_page(page_zone(page), page, order);
> --
> 1.6.0.rc1.258.g80295

--
Thanks,
MinChan Kim
* Re: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse 2008-09-08 14:11 ` MinChan Kim @ 2008-09-08 15:14 ` Andy Whitcroft 0 siblings, 0 replies; 16+ messages in thread From: Andy Whitcroft @ 2008-09-08 15:14 UTC (permalink / raw) To: MinChan Kim Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Mon, Sep 08, 2008 at 11:11:22PM +0900, MinChan Kim wrote: > On Fri, Sep 5, 2008 at 7:19 PM, Andy Whitcroft <apw@shadowen.org> wrote: > > When we are about to release a page we perform a number of actions > > on that page. We clear down any anonymous mappings, confirm that > > the page is safe to release, check for freeing locks, before mapping > > the page should that be required. Pull this processing out into a > > helper function for reuse in a later patch. > > > > Note that we do not convert the similar cleardown in free_hot_cold_page() > > as the optimiser is unable to squash the loops during the inline. > > > > Signed-off-by: Andy Whitcroft <apw@shadowen.org> > > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > > Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > > Reviewed-by: Rik van Riel <riel@redhat.com> > > --- > > mm/page_alloc.c | 43 ++++++++++++++++++++++++++++++------------- > > 1 files changed, 30 insertions(+), 13 deletions(-) > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index f52fcf1..b2a2c2b 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page) > > } > > > > /* > > + * Prepare this page for release to the buddy. Sanity check the page. > > + * Returns 1 if the page is safe to free. > > + */ > > +static inline int free_page_prepare(struct page *page, int order) > > +{ > > + int i; > > + int reserved = 0; > > + > > + if (PageAnon(page)) > > + page->mapping = NULL; > > Why do you need to clear down anonymous mapping ? 
> I think if you don't convert this cleardown in free_hot_cold_page(), > you don't need it. > > If you do it, bad_page can't do his role. Yeah that has slipped through from where originally this patch used to merge two different instances of this code. Good spot. Will sort that out. > > + for (i = 0 ; i < (1 << order) ; ++i) > > + reserved += free_pages_check(page + i); > > + if (reserved) > > + return 0; > > + > > + if (!PageHighMem(page)) { > > + debug_check_no_locks_freed(page_address(page), > > + PAGE_SIZE << order); > > + debug_check_no_obj_freed(page_address(page), > > + PAGE_SIZE << order); > > + } > > + arch_free_page(page, order); > > + kernel_map_pages(page, 1 << order, 0); > > + > > + return 1; > > +} > > + > > +/* > > * Frees a list of pages. > > * Assumes all pages on list are in same zone, and of same order. > > * count is the number of pages to free. > > @@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order) > > static void __free_pages_ok(struct page *page, unsigned int order) > > { > > unsigned long flags; > > - int i; > > - int reserved = 0; > > > > - for (i = 0 ; i < (1 << order) ; ++i) > > - reserved += free_pages_check(page + i); > > - if (reserved) > > + if (!free_page_prepare(page, order)) > > return; > > > > - if (!PageHighMem(page)) { > > - debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order); > > - debug_check_no_obj_freed(page_address(page), > > - PAGE_SIZE << order); > > - } > > - arch_free_page(page, order); > > - kernel_map_pages(page, 1 << order, 0); > > - > > local_irq_save(flags); > > __count_vm_events(PGFREE, 1 << order); > > free_one_page(page_zone(page), page, order); -apw ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 2/4] pull out zone cpuset and watermark checks for reuse 2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft 2008-09-05 10:19 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft @ 2008-09-05 10:20 ` Andy Whitcroft 2008-09-05 10:20 ` [PATCH 3/4] buddy: explicitly identify buddy field use in struct page Andy Whitcroft ` (2 subsequent siblings) 4 siblings, 0 replies; 16+ messages in thread From: Andy Whitcroft @ 2008-09-05 10:20 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman, Andy Whitcroft When allocating we need to confirm that the zone we are about to allocate from is acceptable to the CPUSET we are in, and that it does not violate the zone watermarks. Pull these checks out so we can reuse them in a later patch. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> --- mm/page_alloc.c | 62 ++++++++++++++++++++++++++++++++++++++---------------- 1 files changed, 43 insertions(+), 19 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b2a2c2b..2c3874e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1274,6 +1274,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, return 1; } +/* + * Return 1 if this zone is an acceptable source given the cpuset + * constraints. + */ +static inline int zone_cpuset_permits(struct zone *zone, + int alloc_flags, gfp_t gfp_mask) +{ + if ((alloc_flags & ALLOC_CPUSET) && + !cpuset_zone_allowed_softwall(zone, gfp_mask)) + return 0; + return 1; +} + +/* + * Return 1 if this zone is within the watermarks specified by the + * allocation flags. 
+ */ +static inline int zone_watermark_permits(struct zone *zone, int order, + int classzone_idx, int alloc_flags, gfp_t gfp_mask) +{ + if (!(alloc_flags & ALLOC_NO_WATERMARKS)) { + unsigned long mark; + if (alloc_flags & ALLOC_WMARK_MIN) + mark = zone->pages_min; + else if (alloc_flags & ALLOC_WMARK_LOW) + mark = zone->pages_low; + else + mark = zone->pages_high; + if (!zone_watermark_ok(zone, order, mark, + classzone_idx, alloc_flags)) { + if (!zone_reclaim_mode || + !zone_reclaim(zone, gfp_mask, order)) + return 0; + } + } + return 1; +} + #ifdef CONFIG_NUMA /* * zlc_setup - Setup for "zonelist cache". Uses cached zone data to @@ -1427,25 +1465,11 @@ zonelist_scan: if (NUMA_BUILD && zlc_active && !zlc_zone_worth_trying(zonelist, z, allowednodes)) continue; - if ((alloc_flags & ALLOC_CPUSET) && - !cpuset_zone_allowed_softwall(zone, gfp_mask)) - goto try_next_zone; - - if (!(alloc_flags & ALLOC_NO_WATERMARKS)) { - unsigned long mark; - if (alloc_flags & ALLOC_WMARK_MIN) - mark = zone->pages_min; - else if (alloc_flags & ALLOC_WMARK_LOW) - mark = zone->pages_low; - else - mark = zone->pages_high; - if (!zone_watermark_ok(zone, order, mark, - classzone_idx, alloc_flags)) { - if (!zone_reclaim_mode || - !zone_reclaim(zone, gfp_mask, order)) - goto this_zone_full; - } - } + if (!zone_cpuset_permits(zone, alloc_flags, gfp_mask)) + goto try_next_zone; + if (!zone_watermark_permits(zone, order, classzone_idx, + alloc_flags, gfp_mask)) + goto this_zone_full; page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask); if (page) -- 1.6.0.rc1.258.g80295 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/4] buddy: explicitly identify buddy field use in struct page 2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft 2008-09-05 10:19 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft 2008-09-05 10:20 ` [PATCH 2/4] pull out zone cpuset and watermark checks " Andy Whitcroft @ 2008-09-05 10:20 ` Andy Whitcroft 2008-09-05 10:20 ` [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer Andy Whitcroft 2008-09-08 6:08 ` [PATCH 0/4] Reclaim page capture v3 Nick Piggin 4 siblings, 0 replies; 16+ messages in thread From: Andy Whitcroft @ 2008-09-05 10:20 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman, Andy Whitcroft Explicitly define the struct page fields which buddy uses when it owns pages. Defines a new anonymous struct to allow additional fields to be defined in a later patch. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> --- include/linux/mm_types.h | 3 +++ mm/internal.h | 2 +- mm/page_alloc.c | 4 ++-- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 995c588..906d8e0 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -70,6 +70,9 @@ struct page { #endif struct kmem_cache *slab; /* SLUB: Pointer to slab */ struct page *first_page; /* Compound tail pages */ + struct { + unsigned long buddy_order; /* buddy: free page order */ + }; }; union { pgoff_t index; /* Our offset within mapping. 
*/ diff --git a/mm/internal.h b/mm/internal.h index c0e4859..fcedcd0 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -58,7 +58,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order); static inline unsigned long page_order(struct page *page) { VM_BUG_ON(!PageBuddy(page)); - return page_private(page); + return page->buddy_order; } extern int mlock_vma_pages_range(struct vm_area_struct *vma, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2c3874e..db0dbd6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -331,7 +331,7 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags) static inline void set_page_order(struct page *page, int order) { - set_page_private(page, order); + page->buddy_order = order; __SetPageBuddy(page); #ifdef CONFIG_PAGE_OWNER page->order = -1; @@ -341,7 +341,7 @@ static inline void set_page_order(struct page *page, int order) static inline void rmv_page_order(struct page *page) { __ClearPageBuddy(page); - set_page_private(page, 0); + page->buddy_order = 0; } /* -- 1.6.0.rc1.258.g80295 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer
From: Andy Whitcroft @ 2008-09-05 10:20 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman, Andy Whitcroft

When a process enters direct reclaim it will expend effort identifying
and releasing pages in the hope of obtaining a page.  However, as these
pages are released asynchronously, there is every possibility that the
pages will have been consumed by other allocators before the reclaimer
gets a look in.  This is particularly problematic where the reclaimer is
attempting to allocate a higher order page.  It is highly likely that
a parallel allocation will consume lower order constituent pages as we
release them, preventing them coalescing into the higher order page the
reclaimer desires.

This patch set attempts to address this for allocations above
PAGE_ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing
onto a local free list.  Instead of freeing them to the main buddy lists,
pages are collected and coalesced on this per direct reclaimer free list.
Pages which are freed by other processes are also considered: where they
coalesce with a page already under capture they will be moved to the
capture list.  When pressure has been applied to a zone we then consult
the capture list, and if there is an appropriately sized page available
it is taken immediately and the remainder returned to the free pool.
Capture is only enabled when the reclaimer's allocation order exceeds ALLOC_COSTLY_ORDER as free pages below this order should naturally occur in large numbers following regular reclaim. Thanks go to Mel Gorman for numerous discussions during the development of this patch and for his repeated reviews. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- include/linux/mm_types.h | 1 + include/linux/page-flags.h | 4 + mm/internal.h | 6 ++ mm/page_alloc.c | 154 +++++++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 115 +++++++++++++++++++++++++++------ 5 files changed, 259 insertions(+), 21 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 906d8e0..cd2b549 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -72,6 +72,7 @@ struct page { struct page *first_page; /* Compound tail pages */ struct { unsigned long buddy_order; /* buddy: free page order */ + struct list_head *buddy_free; /* buddy: free list pointer */ }; }; union { diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index c0ac9e0..016e604 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -117,6 +117,9 @@ enum pageflags { /* SLUB */ PG_slub_frozen = PG_active, PG_slub_debug = PG_error, + + /* BUDDY overlays. */ + PG_buddy_capture = PG_owner_priv_1, }; #ifndef __GENERATING_BOUNDS_H @@ -208,6 +211,7 @@ __PAGEFLAG(SlubDebug, slub_debug) */ TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback) __PAGEFLAG(Buddy, buddy) +__PAGEFLAG(BuddyCapture, buddy_capture) /* A buddy page, but reserved. 
*/ PAGEFLAG(MappedToDisk, mappedtodisk) /* PG_readahead is only used for file reads; PG_reclaim is only for writes */ diff --git a/mm/internal.h b/mm/internal.h index fcedcd0..f266a35 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -251,4 +251,10 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int flags, struct page **pages, struct vm_area_struct **vmas); +extern struct page *capture_alloc_or_return(struct zone *, struct zone *, + struct list_head *, int, int, gfp_t); +void capture_one_page(struct list_head *, struct zone *, struct page *, int); +unsigned long try_to_free_pages_capture(struct page **, struct zonelist *, + nodemask_t *, int, gfp_t, int); + #endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index db0dbd6..fab0e72 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -428,6 +428,51 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, * -- wli */ +static inline void __capture_one_page(struct list_head *capture_list, + struct page *page, struct zone *zone, unsigned int order) +{ + unsigned long page_idx; + unsigned long order_size = 1UL << order; + + if (unlikely(PageCompound(page))) + destroy_compound_page(page, order); + + page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1); + + VM_BUG_ON(page_idx & (order_size - 1)); + VM_BUG_ON(bad_range(zone, page)); + + while (order < MAX_ORDER-1) { + unsigned long combined_idx; + struct page *buddy; + + buddy = __page_find_buddy(page, page_idx, order); + if (!page_is_buddy(page, buddy, order)) + break; + + /* Our buddy is free, merge with it and move up one order. 
*/ + list_del(&buddy->lru); + if (PageBuddyCapture(buddy)) { + buddy->buddy_free = 0; + __ClearPageBuddyCapture(buddy); + } else { + zone->free_area[order].nr_free--; + __mod_zone_page_state(zone, + NR_FREE_PAGES, -(1UL << order)); + } + rmv_page_order(buddy); + combined_idx = __find_combined_index(page_idx, order); + page = page + (combined_idx - page_idx); + page_idx = combined_idx; + order++; + } + set_page_order(page, order); + __SetPageBuddyCapture(page); + page->buddy_free = capture_list; + + list_add(&page->lru, capture_list); +} + static inline void __free_one_page(struct page *page, struct zone *zone, unsigned int order) { @@ -451,6 +496,12 @@ static inline void __free_one_page(struct page *page, buddy = __page_find_buddy(page, page_idx, order); if (!page_is_buddy(page, buddy, order)) break; + if (PageBuddyCapture(buddy)) { + __mod_zone_page_state(zone, + NR_FREE_PAGES, -(1UL << order)); + return __capture_one_page(buddy->buddy_free, + page, zone, order); + } /* Our buddy is free, merge with it and move up one order. */ list_del(&buddy->lru); @@ -555,6 +606,19 @@ static void free_one_page(struct zone *zone, struct page *page, int order) spin_unlock(&zone->lock); } +void capture_one_page(struct list_head *free_list, + struct zone *zone, struct page *page, int order) +{ + unsigned long flags; + + if (!free_page_prepare(page, order)) + return; + + spin_lock_irqsave(&zone->lock, flags); + __capture_one_page(free_list, page, zone, order); + spin_unlock_irqrestore(&zone->lock, flags); +} + static void __free_pages_ok(struct page *page, unsigned int order) { unsigned long flags; @@ -629,6 +693,23 @@ static inline void expand(struct zone *zone, struct page *page, } /* + * Convert the passed page from actual_order to desired_order. + * Given a page of actual_order release all but the desired_order sized + * buddy at its start. 
+ */ +void __carve_off(struct page *page, unsigned long actual_order, + unsigned long desired_order) +{ + int migratetype = get_pageblock_migratetype(page); + struct zone *zone = page_zone(page); + struct free_area *area = &(zone->free_area[actual_order]); + + __mod_zone_page_state(zone, NR_FREE_PAGES, + (1UL << actual_order) - (1UL << desired_order)); + expand(zone, page, desired_order, actual_order, area, migratetype); +} + +/* * This page is about to be returned from the page allocator */ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) @@ -1667,11 +1748,15 @@ nofail_alloc: reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - did_some_progress = try_to_free_pages(zonelist, order, gfp_mask); + did_some_progress = try_to_free_pages_capture(&page, zonelist, nodemask, + order, gfp_mask, alloc_flags); p->reclaim_state = NULL; p->flags &= ~PF_MEMALLOC; + if (page) + goto got_pg; + cond_resched(); if (order != 0) @@ -4815,6 +4900,73 @@ out: spin_unlock_irqrestore(&zone->lock, flags); } +#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru)) + +/* + * Run through the accumulated list of captures pages and take the first + * which is big enough to satisfy the original allocation. Trim the selected + * page to size, and free the remainder. + */ +struct page *capture_alloc_or_return(struct zone *zone, + struct zone *preferred_zone, struct list_head *capture_list, + int order, int alloc_flags, gfp_t gfp_mask) +{ + struct page *capture_page = 0; + unsigned long flags; + int classzone_idx = zone_idx(preferred_zone); + + spin_lock_irqsave(&zone->lock, flags); + + while (!list_empty(capture_list)) { + struct page *page; + int pg_order; + + page = lru_to_page(capture_list); + list_del(&page->lru); + pg_order = page_order(page); + + /* + * Clear out our buddy size and list information before + * releasing or allocating the page. 
+ */ + rmv_page_order(page); + page->buddy_free = 0; + __ClearPageBuddyCapture(page); + + if (!capture_page && pg_order >= order) { + __carve_off(page, pg_order, order); + capture_page = page; + } else + __free_one_page(page, zone, pg_order); + } + + /* + * Ensure that this capture would not violate the watermarks. + * Subtle, we actually already have the page outside the watermarks + * so check if we can allocate an order 0 page. + */ + if (capture_page && + (!zone_cpuset_permits(zone, alloc_flags, gfp_mask) || + !zone_watermark_permits(zone, 0, classzone_idx, + alloc_flags, gfp_mask))) { + __free_one_page(capture_page, zone, order); + capture_page = NULL; + } + + if (capture_page) + __count_zone_vm_events(PGALLOC, zone, 1 << order); + + zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE); + zone->pages_scanned = 0; + + spin_unlock_irqrestore(&zone->lock, flags); + + if (capture_page) + prep_new_page(capture_page, order, gfp_mask); + + return capture_page; +} + #ifdef CONFIG_MEMORY_HOTREMOVE /* * All pages in the range must be isolated before calling this. diff --git a/mm/vmscan.c b/mm/vmscan.c index 85ce427..7b4767f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -56,6 +56,8 @@ struct scan_control { /* This context's GFP mask */ gfp_t gfp_mask; + int alloc_flags; + int may_writepage; /* Can pages be swapped as part of reclaim? */ @@ -81,6 +83,12 @@ struct scan_control { unsigned long *scanned, int order, int mode, struct zone *z, struct mem_cgroup *mem_cont, int active, int file); + + /* Captured page. */ + struct page **capture; + + /* Nodemask for acceptable allocations. 
*/ + nodemask_t *nodemask; }; #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru)) @@ -560,7 +568,8 @@ void putback_lru_page(struct page *page) /* * shrink_page_list() returns the number of reclaimed pages */ -static unsigned long shrink_page_list(struct list_head *page_list, +static unsigned long shrink_page_list(struct list_head *free_list, + struct list_head *page_list, struct scan_control *sc, enum pageout_io sync_writeback) { @@ -742,9 +751,13 @@ static unsigned long shrink_page_list(struct list_head *page_list, unlock_page(page); free_it: nr_reclaimed++; - if (!pagevec_add(&freed_pvec, page)) { - __pagevec_free(&freed_pvec); - pagevec_reinit(&freed_pvec); + if (free_list) { + capture_one_page(free_list, page_zone(page), page, 0); + } else { + if (!pagevec_add(&freed_pvec, page)) { + __pagevec_free(&freed_pvec); + pagevec_reinit(&freed_pvec); + } } continue; @@ -1024,7 +1037,8 @@ int isolate_lru_page(struct page *page) * shrink_inactive_list() is a helper for shrink_zone(). 
It returns the number * of reclaimed pages */ -static unsigned long shrink_inactive_list(unsigned long max_scan, +static unsigned long shrink_inactive_list(struct list_head *free_list, + unsigned long max_scan, struct zone *zone, struct scan_control *sc, int priority, int file) { @@ -1083,7 +1097,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan, spin_unlock_irq(&zone->lru_lock); nr_scanned += nr_scan; - nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); + nr_freed = shrink_page_list(free_list, &page_list, + sc, PAGEOUT_IO_ASYNC); /* * If we are direct reclaiming for contiguous pages and we do @@ -1102,8 +1117,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan, nr_active = clear_active_flags(&page_list, count); count_vm_events(PGDEACTIVATE, nr_active); - nr_freed += shrink_page_list(&page_list, sc, - PAGEOUT_IO_SYNC); + nr_freed += shrink_page_list(free_list, &page_list, + sc, PAGEOUT_IO_SYNC); } nr_reclaimed += nr_freed; @@ -1337,7 +1352,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, pagevec_release(&pvec); } -static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, +static unsigned long shrink_list(struct list_head *free_list, + enum lru_list lru, unsigned long nr_to_scan, struct zone *zone, struct scan_control *sc, int priority) { int file = is_file_lru(lru); @@ -1352,7 +1368,8 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, shrink_active_list(nr_to_scan, zone, sc, priority, file); return 0; } - return shrink_inactive_list(nr_to_scan, zone, sc, priority, file); + return shrink_inactive_list(free_list, nr_to_scan, zone, + sc, priority, file); } /* @@ -1444,7 +1461,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc, * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. 
*/ static unsigned long shrink_zone(int priority, struct zone *zone, - struct scan_control *sc) + struct zone *preferred_zone, struct scan_control *sc) { unsigned long nr[NR_LRU_LISTS]; unsigned long nr_to_scan; @@ -1452,6 +1469,23 @@ static unsigned long shrink_zone(int priority, struct zone *zone, unsigned long percent[2]; /* anon @ 0; file @ 1 */ enum lru_list l; + struct list_head __capture_list; + struct list_head *capture_list = NULL; + struct page *capture_page; + + /* + * When direct reclaimers are asking for larger orders + * capture pages for them. There is no point if we already + * have an acceptable page or if this zone is not within the + * nodemask. + */ + if (sc->order > PAGE_ALLOC_COSTLY_ORDER && + sc->capture && !*(sc->capture) && (sc->nodemask == NULL || + node_isset(zone_to_nid(zone), *sc->nodemask))) { + capture_list = &__capture_list; + INIT_LIST_HEAD(capture_list); + } + get_scan_ratio(zone, sc, percent); for_each_evictable_lru(l) { @@ -1481,6 +1515,8 @@ static unsigned long shrink_zone(int priority, struct zone *zone, } } + capture_page = NULL; + while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) { for_each_evictable_lru(l) { @@ -1489,10 +1525,18 @@ static unsigned long shrink_zone(int priority, struct zone *zone, (unsigned long)sc->swap_cluster_max); nr[l] -= nr_to_scan; - nr_reclaimed += shrink_list(l, nr_to_scan, + nr_reclaimed += shrink_list(capture_list, + l, nr_to_scan, zone, sc, priority); } } + if (capture_list) { + capture_page = capture_alloc_or_return(zone, + preferred_zone, capture_list, sc->order, + sc->alloc_flags, sc->gfp_mask); + if (capture_page) + capture_list = NULL; + } } /* @@ -1504,6 +1548,9 @@ static unsigned long shrink_zone(int priority, struct zone *zone, else if (!scan_global_lru(sc)) shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0); + if (capture_page) + *(sc->capture) = capture_page; + throttle_vm_writeout(sc->gfp_mask); return nr_reclaimed; } @@ -1525,7 +1572,7 @@ static 
unsigned long shrink_zone(int priority, struct zone *zone, * scan then give up on it. */ static unsigned long shrink_zones(int priority, struct zonelist *zonelist, - struct scan_control *sc) + struct zone *preferred_zone, struct scan_control *sc) { enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); unsigned long nr_reclaimed = 0; @@ -1559,7 +1606,7 @@ static unsigned long shrink_zones(int priority, struct zonelist *zonelist, priority); } - nr_reclaimed += shrink_zone(priority, zone, sc); + nr_reclaimed += shrink_zone(priority, zone, preferred_zone, sc); } return nr_reclaimed; @@ -1592,8 +1639,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, unsigned long lru_pages = 0; struct zoneref *z; struct zone *zone; + struct zone *preferred_zone; enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); + /* This should never fail as we should be scanning a real zonelist. */ + (void)first_zones_zonelist(zonelist, high_zoneidx, sc->nodemask, + &preferred_zone); + BUG_ON(!preferred_zone); + delayacct_freepages_start(); if (scan_global_lru(sc)) @@ -1615,7 +1668,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, sc->nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zonelist, sc); + nr_reclaimed += shrink_zones(priority, zonelist, + preferred_zone, sc); /* * Don't shrink slabs when reclaiming memory from * over limit cgroups @@ -1680,11 +1734,13 @@ out: return ret; } -unsigned long try_to_free_pages(struct zonelist *zonelist, int order, - gfp_t gfp_mask) +unsigned long try_to_free_pages_capture(struct page **capture_pagep, + struct zonelist *zonelist, nodemask_t *nodemask, + int order, gfp_t gfp_mask, int alloc_flags) { struct scan_control sc = { .gfp_mask = gfp_mask, + .alloc_flags = alloc_flags, .may_writepage = !laptop_mode, .swap_cluster_max = SWAP_CLUSTER_MAX, .may_swap = 1, @@ -1692,17 +1748,28 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .order = order, 
.mem_cgroup = NULL, .isolate_pages = isolate_pages_global, + .capture = capture_pagep, + .nodemask = nodemask, }; return do_try_to_free_pages(zonelist, &sc); } +unsigned long try_to_free_pages(struct zonelist *zonelist, + int order, gfp_t gfp_mask) +{ + return try_to_free_pages_capture(NULL, zonelist, NULL, + order, gfp_mask, 0); +} + #ifdef CONFIG_CGROUP_MEM_RES_CTLR unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, gfp_t gfp_mask) { struct scan_control sc = { + .gfp_mask = gfp_mask, + .alloc_flags = 0, .may_writepage = !laptop_mode, .may_swap = 1, .swap_cluster_max = SWAP_CLUSTER_MAX, @@ -1710,6 +1777,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, .order = 0, .mem_cgroup = mem_cont, .isolate_pages = mem_cgroup_isolate_pages, + .capture = NULL, + .nodemask = NULL, }; struct zonelist *zonelist; @@ -1751,12 +1820,15 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) struct reclaim_state *reclaim_state = current->reclaim_state; struct scan_control sc = { .gfp_mask = GFP_KERNEL, + .alloc_flags = 0, .may_swap = 1, .swap_cluster_max = SWAP_CLUSTER_MAX, .swappiness = vm_swappiness, .order = order, .mem_cgroup = NULL, .isolate_pages = isolate_pages_global, + .capture = NULL, + .nodemask = NULL, }; /* * temp_priority is used to remember the scanning priority at which @@ -1852,7 +1924,8 @@ loop_again: */ if (!zone_watermark_ok(zone, order, 8*zone->pages_high, end_zone, 0)) - nr_reclaimed += shrink_zone(priority, zone, &sc); + nr_reclaimed += shrink_zone(priority, + zone, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages); @@ -2054,7 +2127,7 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int prio, nr_to_scan = min(nr_pages, zone_page_state(zone, NR_LRU_BASE + l)); - ret += shrink_list(l, nr_to_scan, zone, + ret += shrink_list(NULL, l, nr_to_scan, zone, sc, prio); if (ret >= nr_pages) return ret; @@ -2081,6 +2154,7 @@ unsigned long 
shrink_all_memory(unsigned long nr_pages) struct reclaim_state reclaim_state; struct scan_control sc = { .gfp_mask = GFP_KERNEL, + .alloc_flags = 0, .may_swap = 0, .swap_cluster_max = nr_pages, .may_writepage = 1, @@ -2269,6 +2343,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) .swap_cluster_max = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX), .gfp_mask = gfp_mask, + .alloc_flags = 0, .swappiness = vm_swappiness, .isolate_pages = isolate_pages_global, }; @@ -2295,7 +2370,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) priority = ZONE_RECLAIM_PRIORITY; do { note_zone_scanning_priority(zone, priority); - nr_reclaimed += shrink_zone(priority, zone, &sc); + nr_reclaimed += shrink_zone(priority, zone, zone, &sc); priority--; } while (priority >= 0 && nr_reclaimed < nr_pages); } -- 1.6.0.rc1.258.g80295 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 0/4] Reclaim page capture v3 2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft ` (3 preceding siblings ...) 2008-09-05 10:20 ` [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer Andy Whitcroft @ 2008-09-08 6:08 ` Nick Piggin 2008-09-08 11:44 ` Andy Whitcroft 4 siblings, 1 reply; 16+ messages in thread From: Nick Piggin @ 2008-09-08 6:08 UTC (permalink / raw) To: Andy Whitcroft Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Friday 05 September 2008 20:19, Andy Whitcroft wrote: > For sometime we have been looking at mechanisms for improving the > availability of larger allocations under load. One of the options we have > explored is the capturing of pages freed under direct reclaim in order to > increase the chances of free pages coelescing before they are subject to > reallocation by racing allocators. > > Following this email is a patch stack implementing page capture during > direct reclaim. It consits of four patches. The first two simply pull > out existing code into helpers for reuse. The third makes buddy's use > of struct page explicit. The fourth contains the meat of the changes, > and its leader contains a much fuller description of the feature. > > This update represents a rebase to -mm and incorporates feedback from > KOSAKI Motohiro. It also incorporates an accounting fix which was > preventing some captures. > > I have done a lot of comparitive testing with and without this patch > set and in broad brush I am seeing improvements in hugepage allocations > (worst case size) success on all of my test systems. These tests consist > of placing a constant stream of high order allocations on the system, > at varying rates. The results for these various runs are then averaged > to give an overall improvement. 
> > Absolute Effective > x86-64 2.48% 4.58% > powerpc 5.55% 25.22% These are the numbers for the improvement of hugepage allocation success? Then what do you mean by absolute and effective? What sort of constant stream of high order allocations are you talking about? In what "real" situations are you seeing higher order page allocation failures, and in those cases, how much do these patches help? I must say I don't really like this approach. IMO it might be better to put some sort of queue in the page allocator, so if memory becomes low, then processes will start queueing up and not be allowed to jump the queue and steal memory that has been freed by hard work of a direct reclaimer. That would improve a lot of fairness problems as well as improve coalescing for higher order allocations without introducing this capture stuff. > > x86-64 has a relatively small huge page size and so is always much more > effective at allocating huge pages. Even there we get a measurable > improvement. On powerpc the huge pages are much larger and much harder > to recover. Here we see a full 25% increase in page recovery. > > It should be noted that these are worst case testing, and very agressive > taking every possible page in the system. It would be helpful to get > wider testing in -mm. > > Against: 2.6.27-rc1-mm1 > > Andrew, please consider for -mm. > > -apw > > Changes since V2: > - Incorporates review feedback from Christoph Lameter, > - Incorporates review feedback from Peter Zijlstra, and > - Checkpatch fixes. > > Changes since V1: > - Incorporates review feedback from KOSAKI Motohiro, > - fixes up accounting when checking watermarks for captured pages, > - rebase 2.6.27-rc1-mm1, > - Incorporates review feedback from Mel. 
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/4] Reclaim page capture v3 2008-09-08 6:08 ` [PATCH 0/4] Reclaim page capture v3 Nick Piggin @ 2008-09-08 11:44 ` Andy Whitcroft 2008-09-08 13:59 ` Nick Piggin 0 siblings, 1 reply; 16+ messages in thread From: Andy Whitcroft @ 2008-09-08 11:44 UTC (permalink / raw) To: Nick Piggin Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Mon, Sep 08, 2008 at 04:08:03PM +1000, Nick Piggin wrote: > On Friday 05 September 2008 20:19, Andy Whitcroft wrote: > > For sometime we have been looking at mechanisms for improving the > > availability of larger allocations under load. One of the options we have > > explored is the capturing of pages freed under direct reclaim in order to > > increase the chances of free pages coelescing before they are subject to > > reallocation by racing allocators. > > > > Following this email is a patch stack implementing page capture during > > direct reclaim. It consits of four patches. The first two simply pull > > out existing code into helpers for reuse. The third makes buddy's use > > of struct page explicit. The fourth contains the meat of the changes, > > and its leader contains a much fuller description of the feature. > > > > This update represents a rebase to -mm and incorporates feedback from > > KOSAKI Motohiro. It also incorporates an accounting fix which was > > preventing some captures. > > > > I have done a lot of comparitive testing with and without this patch > > set and in broad brush I am seeing improvements in hugepage allocations > > (worst case size) success on all of my test systems. These tests consist > > of placing a constant stream of high order allocations on the system, > > at varying rates. The results for these various runs are then averaged > > to give an overall improvement. > > > > Absolute Effective > > x86-64 2.48% 4.58% > > powerpc 5.55% 25.22% > > These are the numbers for the improvement of hugepage allocation success? 
> Then what do you mean by absolute and effective?

The absolute improvement is the literal change in success percentage, the literal percentage of all memory which may be allocated as huge pages. The effective improvement is the percentage of the baseline success rate that this change represents; for the powerpc results we get some 20% of memory allocatable without the patches and 25% with, that is, 25% more pages are allocatable with the patches.

> What sort of constant stream of high order allocations are you talking
> about? In what "real" situations are you seeing higher order page allocation
> failures, and in those cases, how much do these patches help?

The test case simulates a constant demand for huge pages, at various rates. This is intended to replicate the scenario where a system is used with mixed small page and huge page applications, with a dynamic huge page pool. Particularly we are examining the effect of starting a very large huge page job on a busy system. Obviously starting hugepage applications depends on hugepage availability as they are not swappable. This test was chosen because it both tests the initial large page demand and then pushes the system to the limit.

> I must say I don't really like this approach. IMO it might be better to put
> some sort of queue in the page allocator, so if memory becomes low, then
> processes will start queueing up and not be allowed to jump the queue and
> steal memory that has been freed by hard work of a direct reclaimer. That
> would improve a lot of fairness problems as well as improve coalescing for
> higher order allocations without introducing this capture stuff.

The problem with queueing all allocators is twofold. Firstly, allocations are obviously required for reclaim and those would have to be exempt from the queue, and these are as likely to prevent coalescing of pages as any other.
Secondly, where a very large allocation is requested all allocators would be held while reclaim at that size is performed, majorly increasing latency for those allocations. Reclaim for an order-0 page may target on the order of 32 pages, whereas reclaim for x86_64 hugepages is 1024 pages minimum. It would be grossly unfair for a single large allocation to hold up normal allocations.

> > x86-64 has a relatively small huge page size and so is always much more
> > effective at allocating huge pages. Even there we get a measurable
> > improvement. On powerpc the huge pages are much larger and much harder
> > to recover. Here we see a full 25% increase in page recovery.
> >
> > It should be noted that these are worst case testing, and very agressive
> > taking every possible page in the system. It would be helpful to get
> > wider testing in -mm.
> >
> > Against: 2.6.27-rc1-mm1
> >
> > Andrew, please consider for -mm.
> >
> > -apw
> >
> > Changes since V2:
> > - Incorporates review feedback from Christoph Lameter,
> > - Incorporates review feedback from Peter Zijlstra, and
> > - Checkpatch fixes.
> >
> > Changes since V1:
> > - Incorporates review feedback from KOSAKI Motohiro,
> > - fixes up accounting when checking watermarks for captured pages,
> > - rebase 2.6.27-rc1-mm1,
> > - Incorporates review feedback from Mel.
-apw

^ permalink raw reply [flat|nested] 16+ messages in thread
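As an editorial aside, the absolute/effective distinction Andy describes above reduces to simple arithmetic: absolute is the change in success rate in percentage points; effective is that same change expressed relative to the baseline. A minimal sketch (function names are illustrative, not from the patch set), using the rounded powerpc figures of roughly 20% baseline and 25% patched:

```c
#include <assert.h>

/* Absolute improvement: the raw change in success rate, in percentage
 * points of all memory allocatable as huge pages. */
static double absolute_improvement(double baseline_pct, double patched_pct)
{
	return patched_pct - baseline_pct;
}

/* Effective improvement: the same change expressed as a percentage of
 * the baseline success rate. */
static double effective_improvement(double baseline_pct, double patched_pct)
{
	return (patched_pct - baseline_pct) * 100.0 / baseline_pct;
}
```

With 20% -> 25%, the absolute improvement is 5 points but the effective improvement is 25%, which is why the two columns in the table differ so sharply on powerpc.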
* Re: [PATCH 0/4] Reclaim page capture v3 2008-09-08 11:44 ` Andy Whitcroft @ 2008-09-08 13:59 ` Nick Piggin 2008-09-08 16:41 ` Andy Whitcroft 0 siblings, 1 reply; 16+ messages in thread From: Nick Piggin @ 2008-09-08 13:59 UTC (permalink / raw) To: Andy Whitcroft Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Monday 08 September 2008 21:44, Andy Whitcroft wrote: > On Mon, Sep 08, 2008 at 04:08:03PM +1000, Nick Piggin wrote: > > On Friday 05 September 2008 20:19, Andy Whitcroft wrote: > > > Absolute Effective > > > x86-64 2.48% 4.58% > > > powerpc 5.55% 25.22% > > > > These are the numbers for the improvement of hugepage allocation success? > > Then what do you mean by absolute and effective? > > The absolute improvement is the literal change in success percentage, > the literal percentage of all memory which may be allocated as huge > pages. The effective improvement is percentage of the baseline success > rates that this change represents; for the powerpc results we get some > 20% of memory allocatable without the patches, and 25% with, 25% more > pages are allocatable with the patches. OK. > > What sort of constant stream of high order allocations are you talking > > about? In what "real" situations are you seeing higher order page > > allocation failures, and in those cases, how much do these patches help? > > The test case simulates a constant demand for huge pages, at various > rates. This is intended to replicate the scenario where a system is > used with mixed small page and huge page applications, with a dynamic > huge page pool. Particularly we are examining the effect of starting a > very large huge page job on a busy system. Obviously starting hugepage > applications depends on hugepage availability as they are not swappable. > This test was chosen because it both tests the initial large page demand > and then pushes the system to the limit. So... what does the non-simulation version (ie. 
the real app) say?

> > I must say I don't really like this approach. IMO it might be better to
> > put some sort of queue in the page allocator, so if memory becomes low,
> > then processes will start queueing up and not be allowed to jump the
> > queue and steal memory that has been freed by hard work of a direct
> > reclaimer. That would improve a lot of fairness problems as well as
> > improve coalescing for higher order allocations without introducing this
> > capture stuff.
>
> The problem with queueing all allocators is twofold. Firstly, allocations
> are obviously required for reclaim and those would have to be exempt
> from the queue, and these are as likely to prevent coalescing of pages
> as any other.

*Much* less likely, actually. Because there should be very little allocation required for reclaim (only dirty pages, and only when backed by filesystems that do silly things like not ensuring their own reserves before allowing the page to be dirtied).

Also, your scheme still doesn't avoid allocation for reclaim so I don't see how you can use that as a point against queueing but not capturing!

> Secondly, where a very large allocation is requested
> all allocators would be held while reclaim at that size is performed,
> majorly increasing latency for those allocations. Reclaim for an order-0
> page may target on the order of 32 pages, whereas reclaim for x86_64
> hugepages is 1024 pages minimum. It would be grossly unfair for a single
> large allocation to hold up normal allocations.

I don't see why it should be unfair to allow a process to allocate 1024 order-0 pages ahead of one order-10 page (actually, yes, the order-10 page is I guess somewhat more valuable than the same number of fragmented pages, but you see where I'm coming from).
At least with queueing you are basing the idea on *some* reasonable policy, rather than a purely random "whoever happens to take this lock at the right time will win" strategy, which, might I add, can even be much more unfair than you might say queueing is.

However, queueing would still be able to allow some flexibility in priority. For example:

	if (must_queue) {
		if (queue_head_prio == 0)
			join_queue(1<<order);
		else {
			queue_head_prio -= 1<<order;
			skip_queue();
		}
	}

^ permalink raw reply [flat|nested] 16+ messages in thread
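As an editorial aside, Nick's fragment can be fleshed out into a small self-contained model of the skip-ahead budget it implies. Everything here (the struct, the helper names) is hypothetical and intended only to show the mechanics, not a proposed implementation:

```c
#include <assert.h>

/* Toy model of queue skipping: while a large request waits at the head
 * of the queue, smaller allocations may jump ahead until they have
 * consumed a budget of pages equal to the head's request size. */
struct alloc_queue {
	unsigned long head_budget;	/* pages the queued head still "owns" */
};

/* Return 1 if an order-`order` allocation may skip the queue,
 * consuming budget; 0 if it must join the queue behind the head. */
static int may_skip_queue(struct alloc_queue *q, unsigned int order)
{
	unsigned long pages = 1UL << order;

	if (q->head_budget < pages)
		return 0;
	q->head_budget -= pages;
	return 1;
}

/* Count how many of `attempts` order-`order` allocations get to skip
 * ahead of a single queued order-`head_order` request. */
static int count_skips(unsigned int head_order, unsigned int order,
		       int attempts)
{
	struct alloc_queue q = { .head_budget = 1UL << head_order };
	int i, skipped = 0;

	for (i = 0; i < attempts; i++)
		skipped += may_skip_queue(&q, order);
	return skipped;
}
```

Under this policy an order-10 request at the head lets exactly 1024 order-0 allocations through before later ones must queue behind it.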
* Re: [PATCH 0/4] Reclaim page capture v3 2008-09-08 13:59 ` Nick Piggin @ 2008-09-08 16:41 ` Andy Whitcroft 2008-09-09 3:31 ` Nick Piggin 0 siblings, 1 reply; 16+ messages in thread From: Andy Whitcroft @ 2008-09-08 16:41 UTC (permalink / raw) To: Nick Piggin Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Mon, Sep 08, 2008 at 11:59:54PM +1000, Nick Piggin wrote: > On Monday 08 September 2008 21:44, Andy Whitcroft wrote: > > On Mon, Sep 08, 2008 at 04:08:03PM +1000, Nick Piggin wrote: > > > On Friday 05 September 2008 20:19, Andy Whitcroft wrote: > > > > > Absolute Effective > > > > x86-64 2.48% 4.58% > > > > powerpc 5.55% 25.22% > > > > > > These are the numbers for the improvement of hugepage allocation success? > > > Then what do you mean by absolute and effective? > > > > The absolute improvement is the literal change in success percentage, > > the literal percentage of all memory which may be allocated as huge > > pages. The effective improvement is percentage of the baseline success > > rates that this change represents; for the powerpc results we get some > > 20% of memory allocatable without the patches, and 25% with, 25% more > > pages are allocatable with the patches. > > OK. > > > > What sort of constant stream of high order allocations are you talking > > > about? In what "real" situations are you seeing higher order page > > > allocation failures, and in those cases, how much do these patches help? > > > > The test case simulates a constant demand for huge pages, at various > > rates. This is intended to replicate the scenario where a system is > > used with mixed small page and huge page applications, with a dynamic > > huge page pool. Particularly we are examining the effect of starting a > > very large huge page job on a busy system. Obviously starting hugepage > > applications depends on hugepage availability as they are not swappable. 
> > This test was chosen because it both tests the initial large page demand
> > and then pushes the system to the limit.
>
> So... what does the non-simulation version (ie. the real app) say?

In the common use model, use of huge pages is all or nothing. Either there are sufficient pages allocatable at application start time or there are not. As huge pages are not swappable, once allocated they stay there. Applications either start using huge pages or they fall back to small pages and continue. This makes the real metric: how often does the customer get annoyed because their application has fallen back to small pages and is slow, or how often does their database fail to start. It is very hard to directly measure that and thus to get a comparative figure. Any attempt to replicate that seems as necessarily artificial as the current test.

> > > I must say I don't really like this approach. IMO it might be better to
> > > put some sort of queue in the page allocator, so if memory becomes low,
> > > then processes will start queueing up and not be allowed to jump the
> > > queue and steal memory that has been freed by hard work of a direct
> > > reclaimer. That would improve a lot of fairness problems as well as
> > > improve coalescing for higher order allocations without introducing this
> > > capture stuff.
> >
> > The problem with queueing all allocators is twofold. Firstly, allocations
> > are obviously required for reclaim and those would have to be exempt
> > from the queue, and these are as likely to prevent coalescing of pages
> > as any other.
>
> *Much* less likely, actually. Because there should be very little allocation
> required for reclaim (only dirty pages, and only when backed by filesystems
> that do silly things like not ensuring their own reserves before allowing
> the page to be dirtied).

Well yes and no. A lot of filesystems do do such stupid things. Allocating things like journal pages which have relatively long lives during reclaim.
We have seen these getting placed into the memory we have just freed and preventing higher order coalescing.

> Also, your scheme still doesn't avoid allocation for reclaim so I don't see
> how you can use that as a point against queueing but not capturing!

Obviously we cannot prevent allocations during reclaim. But we can avoid those allocations falling within the captured range. All pages under capture are marked. Any page returning to the allocator that merges with a buddy under capture, or that is itself a buddy under capture, is kept separately, such that any allocations from within reclaim will necessarily take pages from elsewhere in the pool. The key thing about capture is that it effectively marks ranges of pages out of use for allocations for the period of the reclaim, so we have a number of small ranges blocked out, not the whole pool. This allows parallel allocations (reclaim and otherwise) to succeed against the reserves (refilled by kswapd etc), whilst marking the pages under capture out and preventing them from being used.

> > Secondly, where a very large allocation is requested
> > all allocators would be held while reclaim at that size is performed,
> > majorly increasing latency for those allocations. Reclaim for an order-0
> > page may target on the order of 32 pages, whereas reclaim for x86_64
> > hugepages is 1024 pages minimum. It would be grossly unfair for a single
> > large allocation to hold up normal allocations.
>
> I don't see why it should be unfair to allow a process to allocate 1024
> order-0 pages ahead of one order-10 page (actually, yes, the order-10 page
> is I guess somewhat more valuable than the same number of fragmented
> pages, but you see where I'm coming from).

I think we have our wires crossed here.
I was saying it would seem unfair to block the allocator from giving out order-0 pages while we are struggling to get an order-10 page for one process. Having a queue would seem to generate such behaviour. What I am trying to achieve with capture is to push areas likely to return us a page of the requested size out of use, while we try and reclaim it without blocking the rest of the allocator.

> At least with queueing you are basing the idea on *some* reasonable policy,
> rather than purely random "whoever happens to take this lock at the right
> time will win" strategy, which might I add, can even be much more unfair
> than you might say queueing is.
>
> However, queueing would still be able to allow some flexibility in priority.
> For example:
>
>	if (must_queue) {
>		if (queue_head_prio == 0)
>			join_queue(1<<order);
>		else {
>			queue_head_prio -= 1<<order;
>			skip_queue();
>		}
>	}

Sure you would be able to make some kind of more flexible decisions, but that still seems like a heavy-handed approach. You are important enough to take pages (possibly from our mostly free high order page) or not. In a perfect world we would be able to know in advance that an order-N region would definitely come free if reclaimed, allocate that preemptively to the requestor, apply reclaim to it, and then actually allocate the page. Obviously the world is not quite that perfect, but we are trying to follow that model with capture: picking areas of the appropriate order, applying reclaim to just those areas, and preventing any freed pages getting back into the general population. This behaviour should take the minimum possible memory out of use.

-apw

^ permalink raw reply [flat|nested] 16+ messages in thread
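As an editorial aside, the capture behaviour Andy describes — frees that fall inside a range under capture are diverted to the capturing reclaimer rather than returned to the general pool — can be modelled in miniature. All names below are hypothetical and heavily simplified; the real patches work on the buddy free lists with marked pages, not a single range check:

```c
#include <assert.h>

/* Miniature model: a zone with one captured pfn range. Pages freed
 * inside the range are kept for the capturing reclaimer; everything
 * else returns to the shared pool where racing allocators may take it. */
struct toy_zone {
	unsigned long shared_free;	/* pages on the ordinary free lists */
	unsigned long captured_free;	/* pages held for the capturer */
	unsigned long capture_start;	/* captured range: [start, start+len) */
	unsigned long capture_len;
};

static void toy_free_page(struct toy_zone *z, unsigned long pfn)
{
	if (z->capture_len &&
	    pfn >= z->capture_start &&
	    pfn < z->capture_start + z->capture_len)
		z->captured_free++;	/* kept out of the general pool */
	else
		z->shared_free++;	/* visible to racing allocators */
}

static void toy_free_range(struct toy_zone *z, unsigned long start,
			   unsigned long end)
{
	while (start < end)
		toy_free_page(z, start++);
}

/* The capture succeeds once every page of the range has been freed
 * into it, i.e. the contiguous high-order block is complete. */
static int toy_capture_complete(const struct toy_zone *z)
{
	return z->captured_free == z->capture_len;
}
```

Freeing pfns 14-21 into a zone capturing [16, 20) sends four pages to the shared pool and completes a four-page capture, however quickly parallel allocators drain the shared pool in the meantime.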
* Re: [PATCH 0/4] Reclaim page capture v3 2008-09-08 16:41 ` Andy Whitcroft @ 2008-09-09 3:31 ` Nick Piggin 2008-09-09 16:35 ` Andy Whitcroft 0 siblings, 1 reply; 16+ messages in thread From: Nick Piggin @ 2008-09-09 3:31 UTC (permalink / raw) To: Andy Whitcroft Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Tuesday 09 September 2008 02:41, Andy Whitcroft wrote: > On Mon, Sep 08, 2008 at 11:59:54PM +1000, Nick Piggin wrote: > > So... what does the non-simulation version (ie. the real app) say? > > In the common use model, use of huge pages is all or nothing. Either > there are sufficient pages allocatable at application start time or there > are not. As huge pages are not swappable once allocated they stay there. > Applications either start using huge pages or they fallback to small pages > and continue. This makes the real metric, how often does the customer get > annoyed becuase their application has fallen back to small pages and is > slow, or how often does their database fail to start. It is very hard to > directly measure that and thus to get a comparitive figure. Any attempt > to replicate that seems as necessarly artificial as the current test. But you have customers telling you they're getting annoyed because of this? Or you have your own "realistic" workloads that allocate hugepages on demand (OK, when I say realistic, something like specjbb or whatever is obviously a reasonable macrobenchmark even if it isn't strictly realistics). > > *Much* less likely, actually. Because there should be very little > > allocation required for reclaim (only dirty pages, and only when backed > > by filesystems that do silly things like not ensuring their own reserves > > before allowing the page to be dirtied). > > Well yes and no. A lot of filesystems do do such stupid things. > Allocating things like journal pages which have relativly long lives > during reclaim. 
We have seen these getting placed into the memory we > have just freed and preventing higher order coelesce. They shouldn't, because that's sad and deadlocky. But yes I agree it happens sometimes. > > Also, your scheme still doesn't avoid allocation for reclaim so I don't > > see how you can use that as a point against queueing but not capturing! > > Obviously we cannot prevent allocations during reclaim. But we can avoid > those allocations falling within the captured range. All pages under > capture are marked. Any page returning to the allocator that merges with a > buddy under capture, or that is a buddy under capture are kept separatly. > Such that any allocations from within reclaim will necessarily take > pages from elsewhere in the pool. The key thing about capture is that it > effectivly marks ranges of pages out of use for allocations for the period > of the reclaim, so we have a number of small ranges blocked out, not the > whole pool. This allows parallel allocations (reclaim and otherwise) > to succeed against the reserves (refilled by kswapd etc), whilst marking > the pages under capture out and preventing them from being used. Yeah, but blocking the whole pool gives a *much* bigger chance to coalesce freed pages. And I'm not just talking about massive order-10 allocations or something where you have the targetted reclaim which improves chances of getting a free page within that range, but also for order-1/2/3 pages that might be more commonly used in normal kernel workloads but can still have a fairly high latency to succeed if there is a lot of other parallel allocation happening. > > I don't see why it should be unfair to allow a process to allocate 1024 > > order-0 pages ahead of one order-10 page (actually, yes the order 10 page > > is I guess somewhat more valueable than the same number of fragmented > > pages, but you see where I'm coming from). > > I think we have our wires crossed here. 
I was saying it would seem > unfair to block the allocator from giving out order-0 pages while we are > struggling to get an order-10 page for one process. Having a queue > would seem to generate such behaviour. What I am trying to achieve with > capture is to push areas likely to return us a page of the requested > size out of use, while we try and reclaim it without blocking the rest > of the allocator. We don't have our wires crossed. I just don't agree that it is unfair. It might be unfair to allow order-10 allocations at the same *rate* as order-0 allocations, which is why you could allow some priority in the queue. But when you decide you want to satisfy an order-10 allocation, do you want the other guys potentially mopping up your coalescing candidates? (and right, for targeted reclaim, maybe this is less of a problem, but I'm also thinking about a general solution for all orders of pages not just hugepages). > > At least with queueing you are basing the idea on *some* reasonable > > policy, rather than a purely random "whoever happens to take this lock at > > the right time will win" strategy, which, might I add, can even be much > > more unfair than you might say queueing is. > > > > However, queueing would still be able to allow some flexibility in > > priority. For example: > > > > if (must_queue) { > > if (queue_head_prio == 0) > > join_queue(1<<order); > > else { > > queue_head_prio -= 1<<order; > > skip_queue(); > > } > > } > > Sure you would be able to make some kind of more flexible decisions, but > that still seems like a heavy-handed approach. You are important enough > to take pages (possibly from our mostly free high order page) or not. I don't understand the thought process that leads you to these assertions. Where do you get your idea of fairness or importance? I would say that allowing 2^N order-0 allocations for every order-N allocation if both allocators are in a tight loop (and reserves are low, ie.
reclaim is required) is a completely reasonable starting point for fairness. Do you disagree with that? How is it less fair than your approach? > In a perfect world we would be able to know in advance that an order-N > region would definitely come free if reclaimed and allocate that preemptively > to the requestor, apply reclaim to it, and then actually allocate the page. But with my queueing you get effectively the same thing without having an oracle. Because you will wait for *any* order-N region to become free. The "tradeoff" is blocking other allocators. But IMO that is actually a good thing because that equates to a general fairness model in our allocator for all allocation types in place of the existing, fairly random game of chance (especially for order-2/3+) that we have now. Even for hugepages: if, with your capture patches, process 0 comes in and does all this reclaim work and has nearly freed up some linear region of memory, then do you think it is reasonable if process-1 happens to come in and get lucky and find a subsequently coalesced hugepage and allocate it? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/4] Reclaim page capture v3 2008-09-09 3:31 ` Nick Piggin @ 2008-09-09 16:35 ` Andy Whitcroft 2008-09-10 3:19 ` Nick Piggin 0 siblings, 1 reply; 16+ messages in thread From: Andy Whitcroft @ 2008-09-09 16:35 UTC (permalink / raw) To: Nick Piggin Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Tue, Sep 09, 2008 at 01:31:43PM +1000, Nick Piggin wrote: > On Tuesday 09 September 2008 02:41, Andy Whitcroft wrote: > > On Mon, Sep 08, 2008 at 11:59:54PM +1000, Nick Piggin wrote: > > > > So... what does the non-simulation version (ie. the real app) say? > > > > In the common use model, use of huge pages is all or nothing. Either > > there are sufficient pages allocatable at application start time or there > > are not. As huge pages are not swappable once allocated they stay there. > > Applications either start using huge pages or they fall back to small pages > > and continue. This makes the real metric, how often does the customer get > > annoyed because their application has fallen back to small pages and is > > slow, or how often does their database fail to start. It is very hard to > > directly measure that and thus to get a comparative figure. Any attempt > > to replicate that seems as necessarily artificial as the current test. > > But you have customers telling you they're getting annoyed because of > this? Or you have your own "realistic" workloads that allocate hugepages > on demand (OK, when I say realistic, something like specjbb or whatever > is obviously a reasonable macrobenchmark even if it isn't strictly > realistic). We have customers complaining about the usability of hugepages. They are keen to obtain the performance benefits from their use, but put off by the difficulty in managing systems with them enabled. That has led to a series of changes to improve the availability and manageability of hugepages, including targeted reclaim, anti-fragmentation, and the dynamic pool.
We obviously test hugepages with a series of benchmarks, but generally in controlled environments where allocation of hugepages is guaranteed, to understand the performance benefits of their use. This is an acceptable configuration for hugepage diehards, but limits their utility to the wider community. The focus of this work is the availability of hugepages and how that applies to manageability. A common use model for big systems is batch oriented, running a mix of jobs. For this to work reliably we need to move pages into and out of the hugepage pool on a regular basis. For this we need effective high order reclaim. > > > *Much* less likely, actually. Because there should be very little > > > allocation required for reclaim (only dirty pages, and only when backed > > > by filesystems that do silly things like not ensuring their own reserves > > > before allowing the page to be dirtied). > > > > Well yes and no. A lot of filesystems do do such stupid things. > > Allocating things like journal pages which have relatively long lives > > during reclaim. We have seen these getting placed into the memory we > > have just freed and preventing higher order coalescing. > > They shouldn't, because that's sad and deadlocky. But yes I agree it > happens sometimes. > > > > Also, your scheme still doesn't avoid allocation for reclaim so I don't > > > see how you can use that as a point against queueing but not capturing! > > > > Obviously we cannot prevent allocations during reclaim. But we can avoid > > those allocations falling within the captured range. All pages under > > capture are marked. Any page returning to the allocator that merges with a > > buddy under capture, or that is a buddy under capture, is kept separately. > > Such that any allocations from within reclaim will necessarily take > > pages from elsewhere in the pool.
The key thing about capture is that it > > effectively marks ranges of pages out of use for allocations for the period > > of the reclaim, so we have a number of small ranges blocked out, not the > > whole pool. This allows parallel allocations (reclaim and otherwise) > > to succeed against the reserves (refilled by kswapd etc), whilst marking > > the pages under capture out and preventing them from being used. > > Yeah, but blocking the whole pool gives a *much* bigger chance to coalesce > freed pages. And I'm not just talking about massive order-10 allocations > or something where you have the targeted reclaim which improves chances of > getting a free page within that range, but also for order-1/2/3 pages > that might be more commonly used in normal kernel workloads but can still > have a fairly high latency to succeed if there is a lot of other parallel > allocation happening. Yes, in isolation, blocking the whole pool would give us a higher chance of coalescing higher order pages. The problem here is that we cannot block the whole pool, we have to allow allocations for reclaim. If we have those occurring we risk loss of pages from the areas we are trying to reclaim. > > > I don't see why it should be unfair to allow a process to allocate 1024 > > > order-0 pages ahead of one order-10 page (actually, yes the order 10 page > > > is I guess somewhat more valuable than the same number of fragmented > > > pages, but you see where I'm coming from). > > > > I think we have our wires crossed here. I was saying it would seem > > unfair to block the allocator from giving out order-0 pages while we are > > struggling to get an order-10 page for one process. Having a queue > > would seem to generate such behaviour. What I am trying to achieve with > > capture is to push areas likely to return us a page of the requested > > size out of use, while we try and reclaim it without blocking the rest > > of the allocator. > > We don't have our wires crossed.
I just don't agree that it is unfair. > It might be unfair to allow order-10 allocations at the same *rate* as > order-0 allocations, which is why you could allow some priority in the > queue. But when you decide you want to satisfy an order-10 allocation, > do you want the other guys potentially mopping up your coalescing > candidates? (and right, for targeted reclaim, maybe this is less of a > problem, but I'm also thinking about a general solution for all orders > of pages not just hugepages). When I am attempting to satisfy an order-10 allocation I clearly want to protect that order-10 area from allocation by other allocators. And that is the key thrust of the capture stuff, that is what it does. Capture puts the pages in areas under targeted reclaim out of the allocator during the reclaim pass. Protecting those areas from any parallel allocators, either regular or PF_MEMALLOC; but only protecting those areas. I think we have differing views on what queueing alone can deliver. Given that the allocator will have to allow allocations made during reclaim in parallel with reclaim for higher orders, it does not seem sufficient to have a 'fair queue' alone. That will not stop the smaller necessary allocations from stealing our partial free buddies, allocations that we know are not always short lived. [continued below] > > > At least with queueing you are basing the idea on *some* reasonable > > > policy, rather than a purely random "whoever happens to take this lock at > > > the right time will win" strategy, which, might I add, can even be much > > > more unfair than you might say queueing is. > > > > > > However, queueing would still be able to allow some flexibility in > > > priority.
For example: > > > > > > if (must_queue) { > > > if (queue_head_prio == 0) > > > join_queue(1<<order); > > > else { > > > queue_head_prio -= 1<<order; > > > skip_queue(); > > > } > > > } > > > > Sure you would be able to make some kind of more flexible decisions, but > > that still seems like a heavy-handed approach. You are important enough > > to take pages (possibly from our mostly free high order page) or not. > > I don't understand the thought process that leads you to these assertions. > Where do you get your idea of fairness or importance? > > I would say that allowing 2^N order-0 allocations for every order-N > allocation if both allocators are in a tight loop (and reserves are low, > ie. reclaim is required) is a completely reasonable starting point for > fairness. Do you disagree with that? How is it less fair than your > approach? What I was trying to convey here is that, given the limitation that we need to allow allocation from within reclaim for reclaim to complete, even stopping other small allocations for the period of the high order allocation is insufficient, as those reclaim allocations are potentially as damaging. Even with a priority queue on the allocator you would need something to prevent allocations falling within an area under reclaim without preventing allocations completely; it is this problem that I am trying to solve: trying to get the most beneficial result from the effort put in during reclaim. I actually think your idea of a priority queue and what capture is trying to do are orthogonal. The current model uses direct reclaim as the 'fairness control'. If you are allocating a lot and the pool runs dry, you enter reclaim and are penalised for your greed. "If you want something big, you work to get it." Someone entering reclaim for a higher order is independent of lower order allocators, and may even take pages from the pool without reclaim. What you are proposing with the priority queue seems to be a replacement for that.
> > In a perfect world we would be able to know in advance that an order-N > > region would definitely come free if reclaimed and allocate that preemptively > > to the requestor, apply reclaim to it, and then actually allocate the page. > > But with my queueing you get effectively the same thing without having an > oracle. Because you will wait for *any* order-N region to become free. > > The "tradeoff" is blocking other allocators. But IMO that is actually a > good thing because that equates to a general fairness model in our allocator > for all allocation types in place of the existing, fairly random game of > chance (especially for order-2/3+) that we have now. > > Even for hugepages: if, with your capture patches, process 0 comes in > and does all this reclaim work and has nearly freed up some linear region > of memory, then do you think it is reasonable if process-1 happens to come > in and get lucky and find a subsequently coalesced hugepage and allocate > it? Are you saying that if capture fails to obtain the high order page, so all the reclaim effort was wasted for that process, but then a subsequent reclaimer comes in and reclaims a few pages, gets the whole page and effectively steals the work? It is unfortunate for sure. Is it better that the work was not completely wasted? Yes. By containing and protecting the partial results we work to reduce the likelihood of this happening. -apw ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/4] Reclaim page capture v3 2008-09-09 16:35 ` Andy Whitcroft @ 2008-09-10 3:19 ` Nick Piggin 2008-09-10 6:56 ` Andy Whitcroft 2008-09-12 21:30 ` Andy Whitcroft 0 siblings, 2 replies; 16+ messages in thread From: Nick Piggin @ 2008-09-10 3:19 UTC (permalink / raw) To: Andy Whitcroft Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Wednesday 10 September 2008 02:35, Andy Whitcroft wrote: > On Tue, Sep 09, 2008 at 01:31:43PM +1000, Nick Piggin wrote: > > Yeah, but blocking the whole pool gives a *much* bigger chance to > > coalesce freed pages. And I'm not just talking about massive order-10 > > allocations or something where you have the targeted reclaim which > > improves chances of getting a free page within that range, but also for > > order-1/2/3 pages that might be more commonly used in normal kernel > > workloads but can still have a fairly high latency to succeed if there is > > a lot of other parallel allocation happening. > > Yes, in isolation, blocking the whole pool would give us a higher chance > of coalescing higher order pages. The problem here is that we cannot > block the whole pool, we have to allow allocations for reclaim. If we > have those occurring we risk loss of pages from the areas we are trying > to reclaim. Allocations from reclaim should be pretty rare, I think you're too worried about them. Actually, if you have queueing, you can do a simple form of capture just in the targeted reclaim path without all the stuff added to page alloc etc. Because you can hold off from freeing the targeted pages into the page allocator until you have built up a large number of them. That way, they won't be subject to recursive allocations and they will coalesce in the allocator; the queueing will ensure they don't get stolen or broken into smaller pages under us. Yes, I really think queueing (perhaps with a much simplified form of capture) is the way to go.
> > > I think we have our wires crossed here. I was saying it would seem > > > unfair to block the allocator from giving out order-0 pages while we > > > are struggling to get an order-10 page for one process. Having a queue > > > would seem to generate such behaviour. What I am trying to achieve > > > with capture is to push areas likely to return us a page of the > > > requested size out of use, while we try and reclaim it without blocking > > > the rest of the allocator. > > > > We don't have our wires crossed. I just don't agree that it is unfair. > > It might be unfair to allow order-10 allocations at the same *rate* as > > order-0 allocations, which is why you could allow some priority in the > > queue. But when you decide you want to satisfy an order-10 allocation, > > do you want the other guys potentially mopping up your coalescing > > candidates? (and right, for targeted reclaim, maybe this is less of a > > problem, but I'm also thinking about a general solution for all orders > > of pages not just hugepages). > > When I am attempting to satisfy an order-10 allocation I clearly want to > protect that order-10 area from allocation by other allocators. > And that is the key thrust of the capture stuff, that is what it does. > Capture puts the pages in areas under targeted reclaim out of the > allocator during the reclaim pass. Protecting those areas from any > parallel allocators, either regular or PF_MEMALLOC; but only > protecting those areas. And I've shown that by a principle of fairness, you don't need to protect just one area, but you can hold off reclaim on all areas, which should greatly increase your chances of getting a hugepage, and will allow you to do so fairly. > I think we have differing views on what queueing alone can deliver. > Given that the allocator will have to allow allocations made during reclaim > in parallel with reclaim for higher orders, it does not seem sufficient > to have a 'fair queue' alone.
That will not stop the smaller necessary > allocations from stealing our partial free buddies, allocations that we > know are not always short lived. [continued below] Queueing is a general and natural step which (hopefully) benefits everyone, so I think it should be tried before capture (which introduces new complexity solely for big allocations). > > > Sure you would be able to make some kind of more flexible decisions, > > > but that still seems like a heavy-handed approach. You are important > > > enough to take pages (possibly from our mostly free high order page) > > > or not. > > > > I don't understand the thought process that leads you to these > > assertions. Where do you get your idea of fairness or importance? > > > > I would say that allowing 2^N order-0 allocations for every order-N > > allocation if both allocators are in a tight loop (and reserves are low, > > ie. reclaim is required) is a completely reasonable starting point for > > fairness. Do you disagree with that? How is it less fair than your > > approach? > > What I was trying to convey here is that, given the limitation that we need > to allow allocation from within reclaim for reclaim to complete, even > stopping other small allocations for the period of the high order > allocation is insufficient, as those reclaim allocations are potentially > as damaging. Even with a priority queue on the allocator you would > need something to prevent allocations falling within an area under reclaim > without preventing allocations completely; it is this problem that I > am trying to solve: trying to get the most beneficial result from the > effort put in during reclaim. I actually think your idea of a priority > queue and what capture is trying to do are orthogonal. Yes they are orthogonal, but I would like to see if we can do without page capture. I'm not saying the theory or practice of it is wrong. > The current model uses direct reclaim as the 'fairness control'.
If you > are allocating a lot and the pool runs dry, you enter reclaim and are > penalised for your greed. "If you want something big, you work to get it." > Someone entering reclaim for a higher order is independent of lower order > allocators, and may even take pages from the pool without reclaim. > What you are proposing with the priority queue seems to be a replacement > for that. For order-2,3 etc allocations that can be commonly used by the kernel but are not always in large supply, that fairness model doesn't really work. You can work to free things, then have another process allocate them, then exit direct reclaim to find none left etc. Even for order-0 pages there is a problem in theory (although statistically it probably evens out much better). > > Even for hugepages: if, with your capture patches, process 0 comes in > > and does all this reclaim work and has nearly freed up some linear region > > of memory, then do you think it is reasonable if process-1 happens to > > come in and get lucky and find a subsequently coalesced hugepage and > > allocate it? > > Are you saying that if capture fails to obtain the high order page, so all > the reclaim effort was wasted for that process, but then a subsequent > reclaimer comes in and reclaims a few pages, gets the whole page and > effectively steals the work? It is unfortunate for sure. Is it better > that the work was not completely wasted? Yes. By containing and protecting > the partial results we work to reduce the likelihood of this happening. Yes. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/4] Reclaim page capture v3 2008-09-10 3:19 ` Nick Piggin @ 2008-09-10 6:56 ` Andy Whitcroft 2008-09-12 21:30 ` Andy Whitcroft 1 sibling, 0 replies; 16+ messages in thread From: Andy Whitcroft @ 2008-09-10 6:56 UTC (permalink / raw) To: Nick Piggin Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman Am flying today, so I won't get to respond to this one in detail immediately. -apw ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/4] Reclaim page capture v3 2008-09-10 3:19 ` Nick Piggin 2008-09-10 6:56 ` Andy Whitcroft @ 2008-09-12 21:30 ` Andy Whitcroft 1 sibling, 0 replies; 16+ messages in thread From: Andy Whitcroft @ 2008-09-12 21:30 UTC (permalink / raw) To: Nick Piggin Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter, Rik van Riel, Mel Gorman On Wed, Sep 10, 2008 at 01:19:01PM +1000, Nick Piggin wrote: > On Wednesday 10 September 2008 02:35, Andy Whitcroft wrote: > > On Tue, Sep 09, 2008 at 01:31:43PM +1000, Nick Piggin wrote: > > > > Yeah, but blocking the whole pool gives a *much* bigger chance to > > > coalesce freed pages. And I'm not just talking about massive order-10 > > > allocations or something where you have the targeted reclaim which > > > improves chances of getting a free page within that range, but also for > > > order-1/2/3 pages that might be more commonly used in normal kernel > > > workloads but can still have a fairly high latency to succeed if there is > > > a lot of other parallel allocation happening. > > > > Yes, in isolation, blocking the whole pool would give us a higher chance > > of coalescing higher order pages. The problem here is that we cannot > > block the whole pool, we have to allow allocations for reclaim. If we > > have those occurring we risk loss of pages from the areas we are trying > > to reclaim. > > Allocations from reclaim should be pretty rare, I think you're too > worried about them. They are certainly there, particularly with journalled filesystems in the mix. But although I do worry about them, my solution has been to allow all allocations in parallel with reclaim so they fall into the same bucket. > Actually, if you have queueing, you can do a simple form of capture > just in the targeted reclaim path without all the stuff added to page > alloc etc. Because you can hold off from freeing the targeted pages > into the page allocator until you have built up a large number of them.
> That way, they won't be subject to recursive allocations and they will > coalesce in the allocator; the queueing will ensure they don't get > stolen or broken into smaller pages under us. Yes, I really think queueing > (perhaps with a much simplified form of capture) is the way to go. It almost feels as if you think these patches do much more than they actually do. The primary source of captured pages is through collection of pages as we complete synchronous targeted reclaim on them. However we do also collect and merge pages which are already free, and those which are freed from outside the reclaim during the period of the capture. From your description I am getting the impression that your proposal is that there be a queue on the allocator plus capture only for pages we release during targeted reclaim. We would then release our list and try to allocate from the buddy, relying on buddy to coalesce them, and then allocate from there. The queue protects us from any stealing. In the development and testing of capture, we actually started out with a simple version of capture, without the additional ability to collect the already free pages and those which are freed during capture. (It is these parts which bring the apparent complexity of change to page_alloc.c and, I am assuming, are the bit you dislike particularly.) Testing on this simple version showed the chances of actually finding areas of the size requested were close to zero and capture statistically worthless. That seems to indicate that for queue+simple capture to have any statistical chance of improving things, all allocations would need to be stopped for the entire period of the capture. Even if we assume reclaim allocations never occur (which we cannot queue), that still leaves us with a single-threaded allocator once reclaim starts. Bearing in mind that for very high order pages that would be periods equivalent to the time it takes to write thousands of base pages. That seems unreasonable at best.
> > > > I think we have our wires crossed here. I was saying it would seem > > > > unfair to block the allocator from giving out order-0 pages while we > > > > are struggling to get an order-10 page for one process. Having a queue > > > > would seem to generate such behaviour. What I am trying to achieve > > > > with capture is to push areas likely to return us a page of the > > > > requested size out of use, while we try and reclaim it without blocking > > > > the rest of the allocator. > > > > > > We don't have our wires crossed. I just don't agree that it is unfair. > > > It might be unfair to allow order-10 allocations at the same *rate* as > > > order-0 allocations, which is why you could allow some priority in the > > > queue. But when you decide you want to satisfy an order-10 allocation, > > > do you want the other guys potentially mopping up your coalescing > > > candidates? (and right, for targeted reclaim, maybe this is less of a > > > problem, but I'm also thinking about a general solution for all orders > > > of pages not just hugepages). > > > > When I am attempting to satisfy an order-10 allocation I clearly want to > > protect that order-10 area from allocation by other allocators. > > And that is the key thrust of the capture stuff, that is what it does. > > Capture puts the pages in areas under targeted reclaim out of the > > allocator during the reclaim pass. Protecting those areas from any > > parallel allocators, either regular or PF_MEMALLOC; but only > > protecting those areas. > > And I've shown that by a principle of fairness, you don't need to protect > just one area, but you can hold off reclaim on all areas, which should > greatly increase your chances of getting a hugepage, and will allow you > to do so fairly. > > > > I think we have differing views on what queueing alone can deliver.
> > Given that the allocator will have to allow allocations made during reclaim > > in parallel with reclaim for higher orders, it does not seem sufficient > > to have a 'fair queue' alone. That will not stop the smaller necessary > > allocations from stealing our partial free buddies, allocations that we > > know are not always short lived. [continued below] > > Queueing is a general and natural step which (hopefully) benefits everyone, > so I think it should be tried before capture (which introduces new complexity > solely for big allocations). To be completely honest here, I do not think I have yet figured out what form of queueing you really envisage. Its name implies an ordering of allocations based on request 'start' time. The discussions in previous emails imply more of a balancing of rates of allocations, at least in terms of fairness. Could you perhaps flesh out your queue idea here so that we are arguing about the same thing? Some details of that would also help me understand the version of fairness you are trying to generate, as there are a number of often conflicting definitions of fairness which are valid in their own senses. For example I could state that it is 'unfair that an allocator wanting an order-10 page would stop me getting a single order-0 page', and yet also state 'it is unfair that I cannot allocate a single order-10 page, when you have allocated 1024 order-0 pages'; both are reasonable statements and yet very likely to be conflicting in implementation. > > > > Sure you would be able to make some kind of more flexible decisions, > > > > but that still seems like a heavy-handed approach. You are important > > > > enough to take pages (possibly from our mostly free high order page) > > > > or not. > > > > > > I don't understand the thought process that leads you to these > > > assertions. Where do you get your idea of fairness or importance?
> > > > > > I would say that allowing 2^N order-0 allocations for every order-N > > > allocation if both allocators are in a tight loop (and reserves are low, > > > ie. reclaim is required) is a completely reasonable starting point for > > > fairness. Do you disagree with that? How is it less fair than your > > > approach? > > > > What I was trying to convey here is that, given the limitation that we need > > to allow allocation from within reclaim for reclaim to complete, even > > stopping other small allocations for the period of the high order > > allocation is insufficient, as those reclaim allocations are potentially > > as damaging. Even with a priority queue on the allocator you would > > need something to prevent allocations falling within an area under reclaim > > without preventing allocations completely; it is this problem that I > > am trying to solve: trying to get the most beneficial result from the > > effort put in during reclaim. I actually think your idea of a priority > > queue and what capture is trying to do are orthogonal. > > Yes they are orthogonal, but I would like to see if we can do without > page capture. I'm not saying the theory or practice of it is wrong. > > > The current model uses direct reclaim as the 'fairness control'. If you > > are allocating a lot and the pool runs dry, you enter reclaim and are > > penalised for your greed. "If you want something big, you work to get it." > > Someone entering reclaim for a higher order is independent of lower order > > allocators, and may even take pages from the pool without reclaim. > > What you are proposing with the priority queue seems to be a replacement > > for that. > > For order-2,3 etc allocations that can be commonly used by the kernel but > are not always in large supply, that fairness model doesn't really work. > You can work to free things, then have another process allocate them, then > exit direct reclaim to find none left etc.
> > Even for order-0 pages there is a problem in theory (although statistically > it probably evens out much better). Yes, that is entirely true. As it happens we only apply capture here to orders above COSTLY_ORDER, in line with the kernel's assumption that statistically those orders are released during reclaim 'relatively easily'. If that assumption is wrong then it would seem logical to apply capture down to order 0. > > > Even for hugepages: if, with your capture patches, process 0 comes in > > > and does all this reclaim work and has nearly freed up some linear region > > > of memory, then do you think it is reasonable if process-1 happens to > > > come in and get lucky and find a subsequently coalesced hugepage and > > > allocate it? > > > > Are you saying that if capture fails to obtain the high order page, so all > > the reclaim effort was wasted for that process, but then a subsequent > > reclaimer comes in and reclaims a few pages, gets the whole page and > > effectively steals the work? It is unfortunate for sure. Is it better > > that the work was not completely wasted? Yes. By containing and protecting > > the partial results we work to reduce the likelihood of this happening. > > Yes. -apw ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2008-09-12 21:30 UTC | newest] Thread overview: 16+ messages -- 2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft 2008-09-05 10:19 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft 2008-09-08 14:11 ` MinChan Kim 2008-09-08 15:14 ` Andy Whitcroft 2008-09-05 10:20 ` [PATCH 2/4] pull out zone cpuset and watermark checks " Andy Whitcroft 2008-09-05 10:20 ` [PATCH 3/4] buddy: explicitly identify buddy field use in struct page Andy Whitcroft 2008-09-05 10:20 ` [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer Andy Whitcroft 2008-09-08 6:08 ` [PATCH 0/4] Reclaim page capture v3 Nick Piggin 2008-09-08 11:44 ` Andy Whitcroft 2008-09-08 13:59 ` Nick Piggin 2008-09-08 16:41 ` Andy Whitcroft 2008-09-09 3:31 ` Nick Piggin 2008-09-09 16:35 ` Andy Whitcroft 2008-09-10 3:19 ` Nick Piggin 2008-09-10 6:56 ` Andy Whitcroft 2008-09-12 21:30 ` Andy Whitcroft