linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] Reclaim page capture v3
@ 2008-09-05 10:19 Andy Whitcroft
  2008-09-05 10:19 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-05 10:19 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andy Whitcroft

For some time we have been looking at mechanisms for improving the availability
of larger allocations under load.  One of the options we have explored is
the capturing of pages freed under direct reclaim in order to increase the
chances of free pages coalescing before they are subject to reallocation
by racing allocators.

Following this email is a patch stack implementing page capture during
direct reclaim.  It consists of four patches.  The first two simply pull
out existing code into helpers for reuse.  The third makes buddy's use
of struct page explicit.  The fourth contains the meat of the changes,
and its leader contains a much fuller description of the feature.

This update represents a rebase to -mm and incorporates feedback from
KOSAKI Motohiro.  It also incorporates an accounting fix which was
preventing some captures.

I have done a lot of comparative testing with and without this patch
set and, in broad brush, I am seeing improvements in hugepage allocation
success (worst case size) on all of my test systems.  These tests consist
of placing a constant stream of high order allocations on the system,
at varying rates.  The results for these various runs are then averaged
to give an overall improvement.

		Absolute	Effective
x86-64		2.48%		 4.58%
powerpc		5.55%		25.22%

x86-64 has a relatively small huge page size and so is always much more
effective at allocating huge pages.  Even there we get a measurable
improvement.  On powerpc the huge pages are much larger and much harder
to recover.  Here we see a full 25% increase in page recovery.

It should be noted that this is worst case testing, and very aggressive,
taking every possible page in the system.  It would be helpful to get
wider testing in -mm.

Against: 2.6.27-rc1-mm1

Andrew, please consider for -mm.

-apw

Changes since V2:
 - Incorporates review feedback from Christoph Lameter,
 - Incorporates review feedback from Peter Zijlstra, and
 - Checkpatch fixes.

Changes since V1:
 - Incorporates review feedback from KOSAKI Motohiro,
 - fixes up accounting when checking watermarks for captured pages,
 - rebase 2.6.27-rc1-mm1,
 - Incorporates review feedback from Mel.


Andy Whitcroft (4):
  pull out the page pre-release and sanity check logic for reuse
  pull out zone cpuset and watermark checks for reuse
  buddy: explicitly identify buddy field use in struct page
  capture pages freed during direct reclaim for allocation by the
    reclaimer

 include/linux/mm_types.h   |    4 +
 include/linux/page-flags.h |    4 +
 mm/internal.h              |    8 +-
 mm/page_alloc.c            |  263 ++++++++++++++++++++++++++++++++++++++------
 mm/vmscan.c                |  115 ++++++++++++++++----
 5 files changed, 338 insertions(+), 56 deletions(-)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse
  2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft
@ 2008-09-05 10:19 ` Andy Whitcroft
  2008-09-08 14:11   ` MinChan Kim
  2008-09-05 10:20 ` [PATCH 2/4] pull out zone cpuset and watermark checks " Andy Whitcroft
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-05 10:19 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andy Whitcroft

When we are about to release a page we perform a number of actions
on that page.  We clear down any anonymous mappings, confirm that
the page is safe to release, check for locks or objects being freed
within it, and unmap the page should that be required.  Pull this
processing out into a helper function for reuse in a later patch.

Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during inlining.
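
For reference, the reuse this enables appears in patch 4 of this series,
where the new helper is called before a freed page is placed on a
reclaimer's private capture list:

	void capture_one_page(struct list_head *free_list,
				struct zone *zone, struct page *page, int order)
	{
		unsigned long flags;

		if (!free_page_prepare(page, order))
			return;

		spin_lock_irqsave(&zone->lock, flags);
		__capture_one_page(free_list, page, zone, order);
		spin_unlock_irqrestore(&zone->lock, flags);
	}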

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c |   43 ++++++++++++++++++++++++++++++-------------
 1 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..b2a2c2b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
+ * Prepare this page for release to the buddy.  Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+	int i;
+	int reserved = 0;
+
+	if (PageAnon(page))
+		page->mapping = NULL;
+
+	for (i = 0 ; i < (1 << order) ; ++i)
+		reserved += free_pages_check(page + i);
+	if (reserved)
+		return 0;
+
+	if (!PageHighMem(page)) {
+		debug_check_no_locks_freed(page_address(page),
+							PAGE_SIZE << order);
+		debug_check_no_obj_freed(page_address(page),
+					   PAGE_SIZE << order);
+	}
+	arch_free_page(page, order);
+	kernel_map_pages(page, 1 << order, 0);
+
+	return 1;
+}
+
+/*
  * Frees a list of pages. 
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
@@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
-	int i;
-	int reserved = 0;
 
-	for (i = 0 ; i < (1 << order) ; ++i)
-		reserved += free_pages_check(page + i);
-	if (reserved)
+	if (!free_page_prepare(page, order))
 		return;
 
-	if (!PageHighMem(page)) {
-		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
-		debug_check_no_obj_freed(page_address(page),
-					   PAGE_SIZE << order);
-	}
-	arch_free_page(page, order);
-	kernel_map_pages(page, 1 << order, 0);
-
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, order);
-- 
1.6.0.rc1.258.g80295


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/4] pull out zone cpuset and watermark checks for reuse
  2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft
  2008-09-05 10:19 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
@ 2008-09-05 10:20 ` Andy Whitcroft
  2008-09-05 10:20 ` [PATCH 3/4] buddy: explicitly identify buddy field use in struct page Andy Whitcroft
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-05 10:20 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andy Whitcroft

When allocating we need to confirm that the zone we are about to allocate
from is acceptable to the CPUSET we are in, and that it does not violate
the zone watermarks.  Pull these checks out so we can reuse them in a
later patch.
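
For reference, patch 4 of this series reuses these helpers to validate a
captured page against the cpuset and watermark constraints before handing
it to the reclaimer:

	if (capture_page &&
	    (!zone_cpuset_permits(zone, alloc_flags, gfp_mask) ||
	     !zone_watermark_permits(zone, 0, classzone_idx,
					     alloc_flags, gfp_mask))) {
		__free_one_page(capture_page, zone, order);
		capture_page = NULL;
	}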

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c |   62 ++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2a2c2b..2c3874e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1274,6 +1274,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 	return 1;
 }
 
+/*
+ * Return 1 if this zone is an acceptable source given the cpuset
+ * constraints.
+ */
+static inline int zone_cpuset_permits(struct zone *zone,
+					int alloc_flags, gfp_t gfp_mask)
+{
+	if ((alloc_flags & ALLOC_CPUSET) &&
+	    !cpuset_zone_allowed_softwall(zone, gfp_mask))
+		return 0;
+	return 1;
+}
+
+/*
+ * Return 1 if this zone is within the watermarks specified by the
+ * allocation flags.
+ */
+static inline int zone_watermark_permits(struct zone *zone, int order,
+			int classzone_idx, int alloc_flags, gfp_t gfp_mask)
+{
+	if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
+		unsigned long mark;
+		if (alloc_flags & ALLOC_WMARK_MIN)
+			mark = zone->pages_min;
+		else if (alloc_flags & ALLOC_WMARK_LOW)
+			mark = zone->pages_low;
+		else
+			mark = zone->pages_high;
+		if (!zone_watermark_ok(zone, order, mark,
+			    classzone_idx, alloc_flags)) {
+			if (!zone_reclaim_mode ||
+					!zone_reclaim(zone, gfp_mask, order))
+				return 0;
+		}
+	}
+	return 1;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
@@ -1427,25 +1465,11 @@ zonelist_scan:
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed_softwall(zone, gfp_mask))
-				goto try_next_zone;
-
-		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
-			unsigned long mark;
-			if (alloc_flags & ALLOC_WMARK_MIN)
-				mark = zone->pages_min;
-			else if (alloc_flags & ALLOC_WMARK_LOW)
-				mark = zone->pages_low;
-			else
-				mark = zone->pages_high;
-			if (!zone_watermark_ok(zone, order, mark,
-				    classzone_idx, alloc_flags)) {
-				if (!zone_reclaim_mode ||
-				    !zone_reclaim(zone, gfp_mask, order))
-					goto this_zone_full;
-			}
-		}
+		if (!zone_cpuset_permits(zone, alloc_flags, gfp_mask))
+			goto try_next_zone;
+		if (!zone_watermark_permits(zone, order, classzone_idx,
+							alloc_flags, gfp_mask))
+			goto this_zone_full;
 
 		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
 		if (page)
-- 
1.6.0.rc1.258.g80295


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/4] buddy: explicitly identify buddy field use in struct page
  2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft
  2008-09-05 10:19 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
  2008-09-05 10:20 ` [PATCH 2/4] pull out zone cpuset and watermark checks " Andy Whitcroft
@ 2008-09-05 10:20 ` Andy Whitcroft
  2008-09-05 10:20 ` [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer Andy Whitcroft
  2008-09-08  6:08 ` [PATCH 0/4] Reclaim page capture v3 Nick Piggin
  4 siblings, 0 replies; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-05 10:20 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andy Whitcroft

Explicitly define the struct page fields which buddy uses when it owns
pages.  Define a new anonymous struct to allow additional fields to
be defined in a later patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/mm_types.h |    3 +++
 mm/internal.h            |    2 +-
 mm/page_alloc.c          |    4 ++--
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 995c588..906d8e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -70,6 +70,9 @@ struct page {
 #endif
 	    struct kmem_cache *slab;	/* SLUB: Pointer to slab */
 	    struct page *first_page;	/* Compound tail pages */
+	    struct {
+		unsigned long buddy_order;     /* buddy: free page order */
+	    };
 	};
 	union {
 		pgoff_t index;		/* Our offset within mapping. */
diff --git a/mm/internal.h b/mm/internal.h
index c0e4859..fcedcd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,7 +58,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order);
 static inline unsigned long page_order(struct page *page)
 {
 	VM_BUG_ON(!PageBuddy(page));
-	return page_private(page);
+	return page->buddy_order;
 }
 
 extern int mlock_vma_pages_range(struct vm_area_struct *vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c3874e..db0dbd6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,7 +331,7 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
 
 static inline void set_page_order(struct page *page, int order)
 {
-	set_page_private(page, order);
+	page->buddy_order = order;
 	__SetPageBuddy(page);
 #ifdef CONFIG_PAGE_OWNER
 		page->order = -1;
@@ -341,7 +341,7 @@ static inline void set_page_order(struct page *page, int order)
 static inline void rmv_page_order(struct page *page)
 {
 	__ClearPageBuddy(page);
-	set_page_private(page, 0);
+	page->buddy_order = 0;
 }
 
 /*
-- 
1.6.0.rc1.258.g80295


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer
  2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft
                   ` (2 preceding siblings ...)
  2008-09-05 10:20 ` [PATCH 3/4] buddy: explicitly identify buddy field use in struct page Andy Whitcroft
@ 2008-09-05 10:20 ` Andy Whitcroft
  2008-09-08  6:08 ` [PATCH 0/4] Reclaim page capture v3 Nick Piggin
  4 siblings, 0 replies; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-05 10:20 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andy Whitcroft

When a process enters direct reclaim it will expend effort identifying
and releasing pages in the hope of obtaining a page.  However, as these
pages are released asynchronously there is every possibility that the
pages will have been consumed by other allocators before the reclaimer
gets a look in.  This is particularly problematic where the reclaimer is
attempting to allocate a higher order page.  It is highly likely that
a parallel allocation will consume lower order constituent pages as we
release them, preventing them from coalescing into the higher order page
the reclaimer desires.

This patch set attempts to address this for allocations above
PAGE_ALLOC_COSTLY_ORDER by temporarily collecting the pages we are
releasing onto a local free list.  Instead of freeing them to the main
buddy lists, pages are collected and coalesced on this per-direct-reclaimer
free list.  Pages which are freed by other processes are also considered:
where they coalesce with a page already under capture they will be moved
to the capture list.  When pressure has been applied to a zone we then
consult the capture list and if there is an appropriately sized page
available it is taken immediately and the remainder returned to the free
pool.  Capture is only enabled when the reclaimer's allocation order
exceeds PAGE_ALLOC_COSTLY_ORDER as free pages below this order should
naturally occur in large numbers following regular reclaim.
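
In outline, the flow for one zone looks roughly as follows.  This is a
conceptual sketch only, using the helpers introduced by this patch; the
real glue (locking, accounting and the trigger conditions) is in the
diff below.

	/* Sketch: direct reclaim above PAGE_ALLOC_COSTLY_ORDER. */
	LIST_HEAD(capture_list);
	struct page *capture_page = NULL;

	/* Pages freed by this reclaimer coalesce on the private list. */
	capture_one_page(&capture_list, page_zone(page), page, 0);

	/*
	 * Once pressure has been applied, take a suitably sized buddy;
	 * the remainder of the list is returned to the free pool.
	 */
	capture_page = capture_alloc_or_return(zone, preferred_zone,
				&capture_list, sc->order,
				sc->alloc_flags, sc->gfp_mask);
	if (capture_page)
		*(sc->capture) = capture_page;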

Thanks go to Mel Gorman for numerous discussions during the development
of this patch and for his repeated reviews.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mm_types.h   |    1 +
 include/linux/page-flags.h |    4 +
 mm/internal.h              |    6 ++
 mm/page_alloc.c            |  154 +++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |  115 +++++++++++++++++++++++++++------
 5 files changed, 259 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 906d8e0..cd2b549 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -72,6 +72,7 @@ struct page {
 	    struct page *first_page;	/* Compound tail pages */
 	    struct {
 		unsigned long buddy_order;     /* buddy: free page order */
+		struct list_head *buddy_free;  /* buddy: free list pointer */
 	    };
 	};
 	union {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c0ac9e0..016e604 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -117,6 +117,9 @@ enum pageflags {
 	/* SLUB */
 	PG_slub_frozen = PG_active,
 	PG_slub_debug = PG_error,
+
+	/* BUDDY overlays. */
+	PG_buddy_capture = PG_owner_priv_1,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -208,6 +211,7 @@ __PAGEFLAG(SlubDebug, slub_debug)
  */
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 __PAGEFLAG(Buddy, buddy)
+__PAGEFLAG(BuddyCapture, buddy_capture)	/* A buddy page, but reserved. */
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
diff --git a/mm/internal.h b/mm/internal.h
index fcedcd0..f266a35 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -251,4 +251,10 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
 		     struct page **pages, struct vm_area_struct **vmas);
 
+extern struct page *capture_alloc_or_return(struct zone *, struct zone *,
+					struct list_head *, int, int, gfp_t);
+void capture_one_page(struct list_head *, struct zone *, struct page *, int);
+unsigned long try_to_free_pages_capture(struct page **, struct zonelist *,
+					nodemask_t *, int, gfp_t, int);
+
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index db0dbd6..fab0e72 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -428,6 +428,51 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  * -- wli
  */
 
+static inline void __capture_one_page(struct list_head *capture_list,
+		struct page *page, struct zone *zone, unsigned int order)
+{
+	unsigned long page_idx;
+	unsigned long order_size = 1UL << order;
+
+	if (unlikely(PageCompound(page)))
+		destroy_compound_page(page, order);
+
+	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+
+	VM_BUG_ON(page_idx & (order_size - 1));
+	VM_BUG_ON(bad_range(zone, page));
+
+	while (order < MAX_ORDER-1) {
+		unsigned long combined_idx;
+		struct page *buddy;
+
+		buddy = __page_find_buddy(page, page_idx, order);
+		if (!page_is_buddy(page, buddy, order))
+			break;
+
+		/* Our buddy is free, merge with it and move up one order. */
+		list_del(&buddy->lru);
+		if (PageBuddyCapture(buddy)) {
+			buddy->buddy_free = 0;
+			__ClearPageBuddyCapture(buddy);
+		} else {
+			zone->free_area[order].nr_free--;
+			__mod_zone_page_state(zone,
+					NR_FREE_PAGES, -(1UL << order));
+		}
+		rmv_page_order(buddy);
+		combined_idx = __find_combined_index(page_idx, order);
+		page = page + (combined_idx - page_idx);
+		page_idx = combined_idx;
+		order++;
+	}
+	set_page_order(page, order);
+	__SetPageBuddyCapture(page);
+	page->buddy_free = capture_list;
+
+	list_add(&page->lru, capture_list);
+}
+
 static inline void __free_one_page(struct page *page,
 		struct zone *zone, unsigned int order)
 {
@@ -451,6 +496,12 @@ static inline void __free_one_page(struct page *page,
 		buddy = __page_find_buddy(page, page_idx, order);
 		if (!page_is_buddy(page, buddy, order))
 			break;
+		if (PageBuddyCapture(buddy)) {
+			__mod_zone_page_state(zone,
+					NR_FREE_PAGES, -(1UL << order));
+			return __capture_one_page(buddy->buddy_free,
+							page, zone, order);
+		}
 
 		/* Our buddy is free, merge with it and move up one order. */
 		list_del(&buddy->lru);
@@ -555,6 +606,19 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
 	spin_unlock(&zone->lock);
 }
 
+void capture_one_page(struct list_head *free_list,
+			struct zone *zone, struct page *page, int order)
+{
+	unsigned long flags;
+
+	if (!free_page_prepare(page, order))
+		return;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	__capture_one_page(free_list, page, zone, order);
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
@@ -629,6 +693,23 @@ static inline void expand(struct zone *zone, struct page *page,
 }
 
 /*
+ * Convert the passed page from actual_order to desired_order.
+ * Given a page of actual_order release all but the desired_order sized
+ * buddy at its start.
+ */
+void __carve_off(struct page *page, unsigned long actual_order,
+					unsigned long desired_order)
+{
+	int migratetype = get_pageblock_migratetype(page);
+	struct zone *zone = page_zone(page);
+	struct free_area *area = &(zone->free_area[actual_order]);
+
+	__mod_zone_page_state(zone, NR_FREE_PAGES,
+				(1UL << actual_order) - (1UL << desired_order));
+	expand(zone, page, desired_order, actual_order, area, migratetype);
+}
+
+/*
  * This page is about to be returned from the page allocator
  */
 static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
@@ -1667,11 +1748,15 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+	did_some_progress = try_to_free_pages_capture(&page, zonelist, nodemask,
+						order, gfp_mask, alloc_flags);
 
 	p->reclaim_state = NULL;
 	p->flags &= ~PF_MEMALLOC;
 
+	if (page)
+		goto got_pg;
+
 	cond_resched();
 
 	if (order != 0)
@@ -4815,6 +4900,73 @@ out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
+#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
+
+/*
+ * Run through the accumulated list of captured pages and take the first
+ * which is big enough to satisfy the original allocation.  Trim the selected
+ * page to size, and free the remainder.
+ */
+struct page *capture_alloc_or_return(struct zone *zone,
+		struct zone *preferred_zone, struct list_head *capture_list,
+		int order, int alloc_flags, gfp_t gfp_mask)
+{
+	struct page *capture_page = 0;
+	unsigned long flags;
+	int classzone_idx = zone_idx(preferred_zone);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	while (!list_empty(capture_list)) {
+		struct page *page;
+		int pg_order;
+
+		page = lru_to_page(capture_list);
+		list_del(&page->lru);
+		pg_order = page_order(page);
+
+		/*
+		 * Clear out our buddy size and list information before
+		 * releasing or allocating the page.
+		 */
+		rmv_page_order(page);
+		page->buddy_free = 0;
+		__ClearPageBuddyCapture(page);
+
+		if (!capture_page && pg_order >= order) {
+			__carve_off(page, pg_order, order);
+			capture_page = page;
+		} else
+			__free_one_page(page, zone, pg_order);
+	}
+
+	/*
+	 * Ensure that this capture would not violate the watermarks.
+	 * Subtle, we actually already have the page outside the watermarks
+	 * so check if we can allocate an order 0 page.
+	 */
+	if (capture_page &&
+	    (!zone_cpuset_permits(zone, alloc_flags, gfp_mask) ||
+	     !zone_watermark_permits(zone, 0, classzone_idx,
+					     alloc_flags, gfp_mask))) {
+		__free_one_page(capture_page, zone, order);
+		capture_page = NULL;
+	}
+
+	if (capture_page)
+		__count_zone_vm_events(PGALLOC, zone, 1 << order);
+
+	zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
+	zone->pages_scanned = 0;
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	if (capture_page)
+		prep_new_page(capture_page, order, gfp_mask);
+
+	return capture_page;
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
  * All pages in the range must be isolated before calling this.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85ce427..7b4767f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,8 @@ struct scan_control {
 	/* This context's GFP mask */
 	gfp_t gfp_mask;
 
+	int alloc_flags;
+
 	int may_writepage;
 
 	/* Can pages be swapped as part of reclaim? */
@@ -81,6 +83,12 @@ struct scan_control {
 			unsigned long *scanned, int order, int mode,
 			struct zone *z, struct mem_cgroup *mem_cont,
 			int active, int file);
+
+	/* Captured page. */
+	struct page **capture;
+
+	/* Nodemask for acceptable allocations. */
+	nodemask_t *nodemask;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -560,7 +568,8 @@ void putback_lru_page(struct page *page)
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
-static unsigned long shrink_page_list(struct list_head *page_list,
+static unsigned long shrink_page_list(struct list_head *free_list,
+					struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback)
 {
@@ -742,9 +751,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		unlock_page(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
+		if (free_list) {
+			capture_one_page(free_list, page_zone(page), page, 0);
+		} else {
+			if (!pagevec_add(&freed_pvec, page)) {
+				__pagevec_free(&freed_pvec);
+				pagevec_reinit(&freed_pvec);
+			}
 		}
 		continue;
 
@@ -1024,7 +1037,8 @@ int isolate_lru_page(struct page *page)
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(struct list_head *free_list,
+			unsigned long max_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
@@ -1083,7 +1097,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		spin_unlock_irq(&zone->lru_lock);
 
 		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+		nr_freed = shrink_page_list(free_list, &page_list,
+							sc, PAGEOUT_IO_ASYNC);
 
 		/*
 		 * If we are direct reclaiming for contiguous pages and we do
@@ -1102,8 +1117,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 			nr_active = clear_active_flags(&page_list, count);
 			count_vm_events(PGDEACTIVATE, nr_active);
 
-			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
+			nr_freed += shrink_page_list(free_list, &page_list,
+							sc, PAGEOUT_IO_SYNC);
 		}
 
 		nr_reclaimed += nr_freed;
@@ -1337,7 +1352,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	pagevec_release(&pvec);
 }
 
-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
+static unsigned long shrink_list(struct list_head *free_list,
+	enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
@@ -1352,7 +1368,8 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
-	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
+	return shrink_inactive_list(free_list, nr_to_scan, zone,
+							sc, priority, file);
 }
 
 /*
@@ -1444,7 +1461,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static unsigned long shrink_zone(int priority, struct zone *zone,
-				struct scan_control *sc)
+			struct zone *preferred_zone, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -1452,6 +1469,23 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 	unsigned long percent[2];	/* anon @ 0; file @ 1 */
 	enum lru_list l;
 
+	struct list_head __capture_list;
+	struct list_head *capture_list = NULL;
+	struct page *capture_page;
+
+	/*
+	 * When direct reclaimers are asking for larger orders
+	 * capture pages for them.  There is no point if we already
+	 * have an acceptable page or if this zone is not within the
+	 * nodemask.
+	 */
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
+	    sc->capture && !*(sc->capture) && (sc->nodemask == NULL ||
+	    node_isset(zone_to_nid(zone), *sc->nodemask))) {
+		capture_list = &__capture_list;
+		INIT_LIST_HEAD(capture_list);
+	}
+
 	get_scan_ratio(zone, sc, percent);
 
 	for_each_evictable_lru(l) {
@@ -1481,6 +1515,8 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 		}
 	}
 
+	capture_page = NULL;
+
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(l) {
@@ -1489,10 +1525,18 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 					(unsigned long)sc->swap_cluster_max);
 				nr[l] -= nr_to_scan;
 
-				nr_reclaimed += shrink_list(l, nr_to_scan,
+				nr_reclaimed += shrink_list(capture_list,
+							l, nr_to_scan,
 							zone, sc, priority);
 			}
 		}
+		if (capture_list) {
+			capture_page = capture_alloc_or_return(zone,
+				preferred_zone, capture_list, sc->order,
+				sc->alloc_flags, sc->gfp_mask);
+			if (capture_page)
+				capture_list = NULL;
+		}
 	}
 
 	/*
@@ -1504,6 +1548,9 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 	else if (!scan_global_lru(sc))
 		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
 
+	if (capture_page)
+		*(sc->capture) = capture_page;
+
 	throttle_vm_writeout(sc->gfp_mask);
 	return nr_reclaimed;
 }
@@ -1525,7 +1572,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
  * scan then give up on it.
  */
 static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
-					struct scan_control *sc)
+		struct zone *preferred_zone, struct scan_control *sc)
 {
 	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long nr_reclaimed = 0;
@@ -1559,7 +1606,7 @@ static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
 							priority);
 		}
 
-		nr_reclaimed += shrink_zone(priority, zone, sc);
+		nr_reclaimed += shrink_zone(priority, zone, preferred_zone, sc);
 	}
 
 	return nr_reclaimed;
@@ -1592,8 +1639,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	struct zoneref *z;
 	struct zone *zone;
+	struct zone *preferred_zone;
 	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 
+	/* This should never fail as we should be scanning a real zonelist. */
+	(void)first_zones_zonelist(zonelist, high_zoneidx, sc->nodemask,
+							&preferred_zone);
+	BUG_ON(!preferred_zone);
+
 	delayacct_freepages_start();
 
 	if (scan_global_lru(sc))
@@ -1615,7 +1668,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zonelist, sc);
+		nr_reclaimed += shrink_zones(priority, zonelist,
+							preferred_zone, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
@@ -1680,11 +1734,13 @@ out:
 	return ret;
 }
 
-unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-								gfp_t gfp_mask)
+unsigned long try_to_free_pages_capture(struct page **capture_pagep,
+		struct zonelist *zonelist, nodemask_t *nodemask,
+		int order, gfp_t gfp_mask, int alloc_flags)
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
+		.alloc_flags = alloc_flags,
 		.may_writepage = !laptop_mode,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
 		.may_swap = 1,
@@ -1692,17 +1748,28 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.order = order,
 		.mem_cgroup = NULL,
 		.isolate_pages = isolate_pages_global,
+		.capture = capture_pagep,
+		.nodemask = nodemask,
 	};
 
 	return do_try_to_free_pages(zonelist, &sc);
 }
 
+unsigned long try_to_free_pages(struct zonelist *zonelist,
+						int order, gfp_t gfp_mask)
+{
+	return try_to_free_pages_capture(NULL, zonelist, NULL,
+							order, gfp_mask, 0);
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 						gfp_t gfp_mask)
 {
 	struct scan_control sc = {
+		.gfp_mask = gfp_mask,
+		.alloc_flags = 0,
 		.may_writepage = !laptop_mode,
 		.may_swap = 1,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
@@ -1710,6 +1777,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,
+		.capture = NULL,
+		.nodemask = NULL,
 	};
 	struct zonelist *zonelist;
 
@@ -1751,12 +1820,15 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
+		.alloc_flags = 0,
 		.may_swap = 1,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
 		.swappiness = vm_swappiness,
 		.order = order,
 		.mem_cgroup = NULL,
 		.isolate_pages = isolate_pages_global,
+		.capture = NULL,
+		.nodemask = NULL,
 	};
 	/*
 	 * temp_priority is used to remember the scanning priority at which
@@ -1852,7 +1924,8 @@ loop_again:
 			 */
 			if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
 						end_zone, 0))
-				nr_reclaimed += shrink_zone(priority, zone, &sc);
+				nr_reclaimed += shrink_zone(priority,
+							zone, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
 						lru_pages);
@@ -2054,7 +2127,7 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int prio,
 				nr_to_scan = min(nr_pages,
 					zone_page_state(zone,
 							NR_LRU_BASE + l));
-				ret += shrink_list(l, nr_to_scan, zone,
+				ret += shrink_list(NULL, l, nr_to_scan, zone,
 								sc, prio);
 				if (ret >= nr_pages)
 					return ret;
@@ -2081,6 +2154,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
 	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
+		.alloc_flags = 0,
 		.may_swap = 0,
 		.swap_cluster_max = nr_pages,
 		.may_writepage = 1,
@@ -2269,6 +2343,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		.swap_cluster_max = max_t(unsigned long, nr_pages,
 					SWAP_CLUSTER_MAX),
 		.gfp_mask = gfp_mask,
+		.alloc_flags = 0,
 		.swappiness = vm_swappiness,
 		.isolate_pages = isolate_pages_global,
 	};
@@ -2295,7 +2370,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
 			note_zone_scanning_priority(zone, priority);
-			nr_reclaimed += shrink_zone(priority, zone, &sc);
+			nr_reclaimed += shrink_zone(priority, zone, zone, &sc);
 			priority--;
 		} while (priority >= 0 && nr_reclaimed < nr_pages);
 	}
-- 
1.6.0.rc1.258.g80295


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Reclaim page capture v3
  2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft
                   ` (3 preceding siblings ...)
  2008-09-05 10:20 ` [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer Andy Whitcroft
@ 2008-09-08  6:08 ` Nick Piggin
  2008-09-08 11:44   ` Andy Whitcroft
  4 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2008-09-08  6:08 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Friday 05 September 2008 20:19, Andy Whitcroft wrote:
> For sometime we have been looking at mechanisms for improving the
> availability of larger allocations under load.  One of the options we have
> explored is the capturing of pages freed under direct reclaim in order to
> increase the chances of free pages coelescing before they are subject to
> reallocation by racing allocators.
>
> Following this email is a patch stack implementing page capture during
> direct reclaim.  It consits of four patches.  The first two simply pull
> out existing code into helpers for reuse.  The third makes buddy's use
> of struct page explicit.  The fourth contains the meat of the changes,
> and its leader contains a much fuller description of the feature.
>
> This update represents a rebase to -mm and incorporates feedback from
> KOSAKI Motohiro.  It also incorporates an accounting fix which was
> preventing some captures.
>
> I have done a lot of comparitive testing with and without this patch
> set and in broad brush I am seeing improvements in hugepage allocations
> (worst case size) success on all of my test systems.  These tests consist
> of placing a constant stream of high order allocations on the system,
> at varying rates.  The results for these various runs are then averaged
> to give an overall improvement.
>
> 		Absolute	Effective
> x86-64		2.48%		 4.58%
> powerpc		5.55%		25.22%

These are the numbers for the improvement of hugepage allocation success?
Then what do you mean by absolute and effective?

What sort of constant stream of high order allocations are you talking
about? In what "real" situations are you seeing higher order page allocation
failures, and in those cases, how much do these patches help?

I must say I don't really like this approach. IMO it might be better to put
some sort of queue in the page allocator, so if memory becomes low, then
processes will start queueing up and not be allowed to jump the queue and
steal memory that has been freed by the hard work of a direct reclaimer. That
would improve a lot of fairness problems as well as improve coalescing for
higher order allocations without introducing this capture stuff.


>
> x86-64 has a relatively small huge page size and so is always much more
> effective at allocating huge pages.  Even there we get a measurable
> improvement.  On powerpc the huge pages are much larger and much harder
> to recover.  Here we see a full 25% increase in page recovery.
>
> It should be noted that these are worst case testing, and very agressive
> taking every possible page in the system.  It would be helpful to get
> wider testing in -mm.
>
> Against: 2.6.27-rc1-mm1
>
> Andrew, please consider for -mm.
>
> -apw
>
> Changes since V2:
>  - Incorporates review feedback from Christoph Lameter,
>  - Incorporates review feedback from Peter Zijlstra, and
>  - Checkpatch fixes.
>
> Changes since V1:
>  - Incorporates review feedback from KOSAKI Motohiro,
>  - fixes up accounting when checking watermarks for captured pages,
>  - rebase 2.6.27-rc1-mm1,
>  - Incorporates review feedback from Mel.
>
>
> Andy Whitcroft (4):
>   pull out the page pre-release and sanity check logic for reuse
>   pull out zone cpuset and watermark checks for reuse
>   buddy: explicitly identify buddy field use in struct page
>   capture pages freed during direct reclaim for allocation by the
>     reclaimer
>
>  include/linux/mm_types.h   |    4 +
>  include/linux/page-flags.h |    4 +
>  mm/internal.h              |    8 +-
>  mm/page_alloc.c            |  263 ++++++++++++++++++++++++++++++++++++++------
>  mm/vmscan.c                |  115 ++++++++++++++++----
>  5 files changed, 338 insertions(+), 56 deletions(-)
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Reclaim page capture v3
  2008-09-08  6:08 ` [PATCH 0/4] Reclaim page capture v3 Nick Piggin
@ 2008-09-08 11:44   ` Andy Whitcroft
  2008-09-08 13:59     ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-08 11:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Mon, Sep 08, 2008 at 04:08:03PM +1000, Nick Piggin wrote:
> On Friday 05 September 2008 20:19, Andy Whitcroft wrote:
> > For sometime we have been looking at mechanisms for improving the
> > availability of larger allocations under load.  One of the options we have
> > explored is the capturing of pages freed under direct reclaim in order to
> > increase the chances of free pages coelescing before they are subject to
> > reallocation by racing allocators.
> >
> > Following this email is a patch stack implementing page capture during
> > direct reclaim.  It consits of four patches.  The first two simply pull
> > out existing code into helpers for reuse.  The third makes buddy's use
> > of struct page explicit.  The fourth contains the meat of the changes,
> > and its leader contains a much fuller description of the feature.
> >
> > This update represents a rebase to -mm and incorporates feedback from
> > KOSAKI Motohiro.  It also incorporates an accounting fix which was
> > preventing some captures.
> >
> > I have done a lot of comparitive testing with and without this patch
> > set and in broad brush I am seeing improvements in hugepage allocations
> > (worst case size) success on all of my test systems.  These tests consist
> > of placing a constant stream of high order allocations on the system,
> > at varying rates.  The results for these various runs are then averaged
> > to give an overall improvement.
> >
> > 		Absolute	Effective
> > x86-64		2.48%		 4.58%
> > powerpc		5.55%		25.22%
> 
> These are the numbers for the improvement of hugepage allocation success?
> Then what do you mean by absolute and effective?

The absolute improvement is the literal change in success percentage,
that is, in the percentage of all memory which may be allocated as huge
pages.  The effective improvement is the percentage of the baseline
success rate that this change represents; for the powerpc results we get
some 20% of memory allocatable without the patches and 25% with, so 25%
more pages are allocatable with the patches.
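
To make the relationship concrete (the baselines here are approximate,
back-derived from the percentages above rather than measured directly):

	effective = absolute / baseline

	powerpc:  5.55% / ~22%  ~= 25.2%   (~22% -> ~27.5% of memory)
	x86-64:   2.48% / ~54%  ~=  4.6%   (~54% -> ~56.5% of memory)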

> What sort of constant stream of high order allocations are you talking
> about? In what "real" situations are you seeing higher order page allocation
> failures, and in those cases, how much do these patches help?

The test case simulates a constant demand for huge pages, at various
rates.  This is intended to replicate the scenario where a system is
used with mixed small page and huge page applications, with a dynamic
huge page pool.  In particular we are examining the effect of starting a
very large huge page job on a busy system.  Obviously starting hugepage
applications depends on hugepage availability as they are not swappable.
This test was chosen because it both tests the initial large page demand
and then pushes the system to the limit.

> I must say I don't really like this approach. IMO it might be better to put
> some sort of queue in the page allocator, so if memory becomes low, then
> processes will start queueing up and not be allowed to jump the queue and
> steal memory that has been freed by hard work of a direct reclaimer. That
> would improve a lot of fairness problems as well as improve coalescing for
> higher order allocations without introducing this capture stuff.

The problem with queuing all allocators is twofold.  Firstly, allocations
are obviously required for reclaim and those would have to be exempt
from the queue, and these are as likely to prevent coalescing of pages
as any other.  Secondly, where a very large allocation is requested
all allocators would be held while reclaim at that size is performed,
majorly increasing latency for those allocations.  Reclaim for an order
0 page may target on the order of 32 pages, whereas reclaim for x86_64
hugepages is 1024 pages minimum.  It would be grossly unfair for a single
large allocation to hold up normal allocations.

> > x86-64 has a relatively small huge page size and so is always much more
> > effective at allocating huge pages.  Even there we get a measurable
> > improvement.  On powerpc the huge pages are much larger and much harder
> > to recover.  Here we see a full 25% increase in page recovery.
> >
> > It should be noted that these are worst case testing, and very agressive
> > taking every possible page in the system.  It would be helpful to get
> > wider testing in -mm.
> >
> > Against: 2.6.27-rc1-mm1
> >
> > Andrew, please consider for -mm.
> >
> > -apw
> >
> > Changes since V2:
> >  - Incorporates review feedback from Christoph Lameter,
> >  - Incorporates review feedback from Peter Zijlstra, and
> >  - Checkpatch fixes.
> >
> > Changes since V1:
> >  - Incorporates review feedback from KOSAKI Motohiro,
> >  - fixes up accounting when checking watermarks for captured pages,
> >  - rebase 2.6.27-rc1-mm1,
> >  - Incorporates review feedback from Mel.
> >
> >
> > Andy Whitcroft (4):
> >   pull out the page pre-release and sanity check logic for reuse
> >   pull out zone cpuset and watermark checks for reuse
> >   buddy: explicitly identify buddy field use in struct page
> >   capture pages freed during direct reclaim for allocation by the
> >     reclaimer
> >
> >  include/linux/mm_types.h   |    4 +
> >  include/linux/page-flags.h |    4 +
> >  mm/internal.h              |    8 +-
> >  mm/page_alloc.c            |  263 ++++++++++++++++++++++++++++++++++++++------
> >  mm/vmscan.c                |  115 ++++++++++++++++----
> >  5 files changed, 338 insertions(+), 56 deletions(-)
> >

-apw

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Reclaim page capture v3
  2008-09-08 11:44   ` Andy Whitcroft
@ 2008-09-08 13:59     ` Nick Piggin
  2008-09-08 16:41       ` Andy Whitcroft
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2008-09-08 13:59 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Monday 08 September 2008 21:44, Andy Whitcroft wrote:
> On Mon, Sep 08, 2008 at 04:08:03PM +1000, Nick Piggin wrote:
> > On Friday 05 September 2008 20:19, Andy Whitcroft wrote:

> > > 		Absolute	Effective
> > > x86-64		2.48%		 4.58%
> > > powerpc		5.55%		25.22%
> >
> > These are the numbers for the improvement of hugepage allocation success?
> > Then what do you mean by absolute and effective?
>
> The absolute improvement is the literal change in success percentage,
> the literal percentage of all memory which may be allocated as huge
> pages.  The effective improvement is percentage of the baseline success
> rates that this change represents; for the powerpc results we get some
> 20% of memory allocatable without the patches, and 25% with, 25% more
> pages are allocatable with the patches.

OK.


> > What sort of constant stream of high order allocations are you talking
> > about? In what "real" situations are you seeing higher order page
> > allocation failures, and in those cases, how much do these patches help?
>
> The test case simulates a constant demand for huge pages, at various
> rates.  This is intended to replicate the scenario where a system is
> used with mixed small page and huge page applications, with a dynamic
> huge page pool.  Particularly we are examining the effect of starting a
> very large huge page job on a busy system.  Obviously starting hugepage
> applications depends on hugepage availability as they are not swappable.
> This test was chosen because it both tests the initial large page demand
> and then pushes the system to the limit.

So... what does the non-simulation version (i.e. the real app) say?


> > I must say I don't really like this approach. IMO it might be better to
> > put some sort of queue in the page allocator, so if memory becomes low,
> > then processes will start queueing up and not be allowed to jump the
> > queue and steal memory that has been freed by hard work of a direct
> > reclaimer. That would improve a lot of fairness problems as well as
> > improve coalescing for higher order allocations without introducing this
> > capture stuff.
>
> The problem with queuing all allocators is two fold.  Firstly allocations
> are obviously required for reclaim and those would have to be exempt
> from the queue, and these are as likely to prevent coelesce of pages
> as any other.

*Much* less likely, actually. Because there should be very little allocation
required for reclaim (only dirty pages, and only when backed by filesystems
that do silly things like not ensuring their own reserves before allowing
the page to be dirtied).

Also, your scheme still doesn't avoid allocation for reclaim so I don't see
how you can use that as a point against queueing but not capturing!


> Secondly where a very large allocation is requested 
> all allocators would be held while reclaim at that size is performed,
> majorly increasing latency for those allocations.  Reclaim for an order
> 0 page may target of the order of 32 pages, whereas reclaim for x86_64
> hugepages is 1024 pages minimum.  It would be grosly unfair for a single
> large allocation to hold up normal allocations.

I don't see why it should be unfair to allow a process to allocate 1024
order-0 pages ahead of one order-10 page (actually, yes the order 10 page
is I guess somewhat more valueable than the same number of fragmented
pages, but you see where I'm coming from).

At least with queueing you are basing the idea on *some* reasonable policy,
rather than a purely random "whoever happens to take this lock at the right
time will win" strategy, which, might I add, can even be much more unfair
than you might say queueing is.

However, queueing would still be able to allow some flexibility in priority.
For example:

if (must_queue) {
  if (queue_head_prio == 0)
    join_queue(1<<order);
  else {
    queue_head_prio -= 1<<order;
    skip_queue();
  }
}


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse
  2008-09-05 10:19 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
@ 2008-09-08 14:11   ` MinChan Kim
  2008-09-08 15:14     ` Andy Whitcroft
  0 siblings, 1 reply; 22+ messages in thread
From: MinChan Kim @ 2008-09-08 14:11 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Fri, Sep 5, 2008 at 7:19 PM, Andy Whitcroft <apw@shadowen.org> wrote:
> When we are about to release a page we perform a number of actions
> on that page.  We clear down any anonymous mappings, confirm that
> the page is safe to release, check for freeing locks, before mapping
> the page should that be required.  Pull this processing out into a
> helper function for reuse in a later patch.
>
> Note that we do not convert the similar cleardown in free_hot_cold_page()
> as the optimiser is unable to squash the loops during the inline.
>
> Signed-off-by: Andy Whitcroft <apw@shadowen.org>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/page_alloc.c |   43 ++++++++++++++++++++++++++++++-------------
>  1 files changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f52fcf1..b2a2c2b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page)
>  }
>
>  /*
> + * Prepare this page for release to the buddy.  Sanity check the page.
> + * Returns 1 if the page is safe to free.
> + */
> +static inline int free_page_prepare(struct page *page, int order)
> +{
> +       int i;
> +       int reserved = 0;
> +
> +       if (PageAnon(page))
> +               page->mapping = NULL;

Why do you need to clear down the anonymous mapping?
I think that if you don't convert this cleardown in free_hot_cold_page(),
you don't need it here.

If you do it, bad_page() can't do its job.

> +       for (i = 0 ; i < (1 << order) ; ++i)
> +               reserved += free_pages_check(page + i);
> +       if (reserved)
> +               return 0;
> +
> +       if (!PageHighMem(page)) {
> +               debug_check_no_locks_freed(page_address(page),
> +                                                       PAGE_SIZE << order);
> +               debug_check_no_obj_freed(page_address(page),
> +                                          PAGE_SIZE << order);
> +       }
> +       arch_free_page(page, order);
> +       kernel_map_pages(page, 1 << order, 0);
> +
> +       return 1;
> +}
> +
> +/*
>  * Frees a list of pages.
>  * Assumes all pages on list are in same zone, and of same order.
>  * count is the number of pages to free.
> @@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
>  static void __free_pages_ok(struct page *page, unsigned int order)
>  {
>        unsigned long flags;
> -       int i;
> -       int reserved = 0;
>
> -       for (i = 0 ; i < (1 << order) ; ++i)
> -               reserved += free_pages_check(page + i);
> -       if (reserved)
> +       if (!free_page_prepare(page, order))
>                return;
>
> -       if (!PageHighMem(page)) {
> -               debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
> -               debug_check_no_obj_freed(page_address(page),
> -                                          PAGE_SIZE << order);
> -       }
> -       arch_free_page(page, order);
> -       kernel_map_pages(page, 1 << order, 0);
> -
>        local_irq_save(flags);
>        __count_vm_events(PGFREE, 1 << order);
>        free_one_page(page_zone(page), page, order);
> --
> 1.6.0.rc1.258.g80295
>



-- 
Thanks,
MinChan Kim

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse
  2008-09-08 14:11   ` MinChan Kim
@ 2008-09-08 15:14     ` Andy Whitcroft
  0 siblings, 0 replies; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-08 15:14 UTC (permalink / raw)
  To: MinChan Kim
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Mon, Sep 08, 2008 at 11:11:22PM +0900, MinChan Kim wrote:
> On Fri, Sep 5, 2008 at 7:19 PM, Andy Whitcroft <apw@shadowen.org> wrote:
> > When we are about to release a page we perform a number of actions
> > on that page.  We clear down any anonymous mappings, confirm that
> > the page is safe to release, check for freeing locks, before mapping
> > the page should that be required.  Pull this processing out into a
> > helper function for reuse in a later patch.
> >
> > Note that we do not convert the similar cleardown in free_hot_cold_page()
> > as the optimiser is unable to squash the loops during the inline.
> >
> > Signed-off-by: Andy Whitcroft <apw@shadowen.org>
> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > ---
> >  mm/page_alloc.c |   43 ++++++++++++++++++++++++++++++-------------
> >  1 files changed, 30 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f52fcf1..b2a2c2b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page)
> >  }
> >
> >  /*
> > + * Prepare this page for release to the buddy.  Sanity check the page.
> > + * Returns 1 if the page is safe to free.
> > + */
> > +static inline int free_page_prepare(struct page *page, int order)
> > +{
> > +       int i;
> > +       int reserved = 0;
> > +
> > +       if (PageAnon(page))
> > +               page->mapping = NULL;
> 
> Why do you need to clear down anonymous mapping ?
> I think if you don't convert this cleardown in free_hot_cold_page(),
> you don't need it.
> 
> If you do it, bad_page can't do his role.

Yeah, that has slipped through from when this patch originally merged
two different instances of this code.  Good spot.  Will sort that
out.

> > +       for (i = 0 ; i < (1 << order) ; ++i)
> > +               reserved += free_pages_check(page + i);
> > +       if (reserved)
> > +               return 0;
> > +
> > +       if (!PageHighMem(page)) {
> > +               debug_check_no_locks_freed(page_address(page),
> > +                                                       PAGE_SIZE << order);
> > +               debug_check_no_obj_freed(page_address(page),
> > +                                          PAGE_SIZE << order);
> > +       }
> > +       arch_free_page(page, order);
> > +       kernel_map_pages(page, 1 << order, 0);
> > +
> > +       return 1;
> > +}
> > +
> > +/*
> >  * Frees a list of pages.
> >  * Assumes all pages on list are in same zone, and of same order.
> >  * count is the number of pages to free.
> > @@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
> >  static void __free_pages_ok(struct page *page, unsigned int order)
> >  {
> >        unsigned long flags;
> > -       int i;
> > -       int reserved = 0;
> >
> > -       for (i = 0 ; i < (1 << order) ; ++i)
> > -               reserved += free_pages_check(page + i);
> > -       if (reserved)
> > +       if (!free_page_prepare(page, order))
> >                return;
> >
> > -       if (!PageHighMem(page)) {
> > -               debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
> > -               debug_check_no_obj_freed(page_address(page),
> > -                                          PAGE_SIZE << order);
> > -       }
> > -       arch_free_page(page, order);
> > -       kernel_map_pages(page, 1 << order, 0);
> > -
> >        local_irq_save(flags);
> >        __count_vm_events(PGFREE, 1 << order);
> >        free_one_page(page_zone(page), page, order);

-apw

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Reclaim page capture v3
  2008-09-08 13:59     ` Nick Piggin
@ 2008-09-08 16:41       ` Andy Whitcroft
  2008-09-09  3:31         ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-08 16:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Mon, Sep 08, 2008 at 11:59:54PM +1000, Nick Piggin wrote:
> On Monday 08 September 2008 21:44, Andy Whitcroft wrote:
> > On Mon, Sep 08, 2008 at 04:08:03PM +1000, Nick Piggin wrote:
> > > On Friday 05 September 2008 20:19, Andy Whitcroft wrote:
> 
> > > > 		Absolute	Effective
> > > > x86-64		2.48%		 4.58%
> > > > powerpc		5.55%		25.22%
> > >
> > > These are the numbers for the improvement of hugepage allocation success?
> > > Then what do you mean by absolute and effective?
> >
> > The absolute improvement is the literal change in success percentage,
> > the literal percentage of all memory which may be allocated as huge
> > pages.  The effective improvement is the percentage of the baseline
> > success rate that this change represents; for the powerpc results some
> > 20% of memory is allocatable as huge pages without the patches and 25%
> > with, so 25% more pages are allocatable with the patches.
> 
> OK.
> 
> > > What sort of constant stream of high order allocations are you talking
> > > about? In what "real" situations are you seeing higher order page
> > > allocation failures, and in those cases, how much do these patches help?
> >
> > The test case simulates a constant demand for huge pages, at various
> > rates.  This is intended to replicate the scenario where a system is
> > used with mixed small page and huge page applications, with a dynamic
> > huge page pool.  Particularly we are examining the effect of starting a
> > very large huge page job on a busy system.  Obviously starting hugepage
> > applications depends on hugepage availability as they are not swappable.
> > This test was chosen because it both tests the initial large page demand
> > and then pushes the system to the limit.
> 
> So... what does the non-simulation version (ie. the real app) say?

In the common use model, use of huge pages is all or nothing.  Either
there are sufficient pages allocatable at application start time or there
are not.  As huge pages are not swappable, once allocated they stay there.
Applications either start using huge pages or they fall back to small
pages and continue.  This makes the real metric how often the customer
gets annoyed because their application has fallen back to small pages and
is slow, or how often their database fails to start.  It is very hard to
measure that directly and thus to get a comparative figure.  Any attempt
to replicate it seems as necessarily artificial as the current test.

> > > I must say I don't really like this approach. IMO it might be better to
> > > put some sort of queue in the page allocator, so if memory becomes low,
> > > then processes will start queueing up and not be allowed to jump the
> > > queue and steal memory that has been freed by hard work of a direct
> > > reclaimer. That would improve a lot of fairness problems as well as
> > > improve coalescing for higher order allocations without introducing this
> > > capture stuff.
> >
> > The problem with queuing all allocators is two fold.  Firstly allocations
> > are obviously required for reclaim and those would have to be exempt
> > from the queue, and these are as likely to prevent coelesce of pages
> > as any other.
> 
> *Much* less likely, actually. Because there should be very little allocation
> required for reclaim (only dirty pages, and only when backed by filesystems
> that do silly things like not ensuring their own reserves before allowing
> the page to be dirtied).

Well, yes and no.  A lot of filesystems do do such stupid things,
allocating things like journal pages which have relatively long lives
during reclaim.  We have seen these getting placed into the memory we
have just freed, preventing higher order coalescing.

> Also, your scheme still doesn't avoid allocation for reclaim so I don't see
> how you can use that as a point against queueing but not capturing!

Obviously we cannot prevent allocations during reclaim.  But we can avoid
those allocations falling within the captured range.  All pages under
capture are marked.  Any page returning to the allocator that merges with
a buddy under capture, or that is itself a buddy under capture, is kept
separately, so that any allocations from within reclaim will necessarily
take pages from elsewhere in the pool.  The key thing about capture is
that it effectively marks ranges of pages out of use for allocations for
the period of the reclaim, so we have a number of small ranges blocked
out, not the whole pool.  This allows parallel allocations (reclaim and
otherwise) to succeed against the reserves (refilled by kswapd etc),
whilst marking the pages under capture out and preventing them from
being used.
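
As a very rough sketch of that free-path hook (illustrative only --
PageCaptured(), find_buddy(), page_is_free_buddy() and
add_to_capture_list() are made-up names here, not the actual patch
interfaces):

	/*
	 * Illustrative only: when a page comes back to the allocator and
	 * either it or the buddy it would merge with is under capture,
	 * divert it to the capturer's private list instead of the zone
	 * free lists, so parallel allocators cannot take it.
	 */
	static void free_one_page_capture_aware(struct zone *zone,
						struct page *page, int order)
	{
		struct page *buddy = find_buddy(page, order);

		if (PageCaptured(page) ||
		    (page_is_free_buddy(buddy, order) && PageCaptured(buddy))) {
			add_to_capture_list(zone, page, order);
			return;
		}

		__free_one_page(page, zone, order);	/* normal buddy free */
	}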

> > Secondly where a very large allocation is requested 
> > all allocators would be held while reclaim at that size is performed,
> > majorly increasing latency for those allocations.  Reclaim for an order
> > 0 page may target of the order of 32 pages, whereas reclaim for x86_64
> > hugepages is 1024 pages minimum.  It would be grosly unfair for a single
> > large allocation to hold up normal allocations.
> 
> I don't see why it should be unfair to allow a process to allocate 1024
> order-0 pages ahead of one order-10 page (actually, yes the order 10 page
> is I guess somewhat more valueable than the same number of fragmented
> pages, but you see where I'm coming from).

I think we have our wires crossed here.  I was saying it would seem
unfair to block the allocator from giving out order-0 pages while we are
struggling to get an order-10 page for one process.  Having a queue
would seem to generate such behaviour.  What I am trying to achieve with
capture is to push areas likely to return us a page of the requested
size out of use, while we try and reclaim it without blocking the rest
of the allocator.

> At least with queueing you are basing the idea on *some* reasonable policy,
> rather than purely random "whoever happens to take this lock at the right
> time will win" strategy, which might I add, can even be much more unfair
> than you might say queueing is.
> 
> However, queueing would still be able to allow some flexibility in priority.
> For example:
> 
> if (must_queue) {
>   if (queue_head_prio == 0)
>     join_queue(1<<order);
>   else {
>     queue_head_prio -= 1<<order;
>     skip_queue();
>   }
> }

Sure, you would be able to make some kind of more flexible decision, but
that still seems like a heavy-handed approach.  Either you are important
enough to take pages (possibly from our mostly free high order page) or
you are not.

In a perfect world we would be able to know in advance that an order-N
region would definitely come free if reclaimed, allocate that region
preemptively to the requestor, apply reclaim to it, and then actually
allocate the page.

Obviously the world is not quite that perfect, but we are trying to
follow that model with capture: picking areas of the appropriate order,
applying reclaim to just those areas, and preventing any freed pages from
getting back into the general population.  This behaviour should take the
minimum possible amount of memory out of use.
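
In other words, the flow we are aiming for looks roughly like this
(pseudo-code only; none of these helpers exist under these names):

	/* Illustrative flow only, not the actual patch interfaces. */
	struct page *alloc_pages_with_capture(struct zone *zone, int order)
	{
		struct list_head capture_list;

		INIT_LIST_HEAD(&capture_list);

		/* pick a 1 << order aligned range that looks reclaimable
		 * and mark its pages as being under capture */
		capture_begin(zone, order, &capture_list);

		/* run targetted reclaim against just that range; pages
		 * freed there are diverted onto capture_list, not back
		 * onto the zone free lists */
		reclaim_captured_range(zone, &capture_list);

		/* if the range coalesced, return it; otherwise release
		 * the partial results back to the buddy allocator */
		return capture_end(zone, order, &capture_list);
	}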

-apw

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Reclaim page capture v3
  2008-09-08 16:41       ` Andy Whitcroft
@ 2008-09-09  3:31         ` Nick Piggin
  2008-09-09 16:35           ` Andy Whitcroft
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2008-09-09  3:31 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Tuesday 09 September 2008 02:41, Andy Whitcroft wrote:
> On Mon, Sep 08, 2008 at 11:59:54PM +1000, Nick Piggin wrote:

> > So... what does the non-simulation version (ie. the real app) say?
>
> In the common use model, use of huge pages is all or nothing.  Either
> there are sufficient pages allocatable at application start time or there
> are not.  As huge pages are not swappable once allocated they stay there.
> Applications either start using huge pages or they fallback to small pages
> and continue.  This makes the real metric, how often does the customer get
> annoyed becuase their application has fallen back to small pages and is
> slow, or how often does their database fail to start.  It is very hard to
> directly measure that and thus to get a comparitive figure.  Any attempt
> to replicate that seems as necessarly artificial as the current test.

But you have customers telling you they're getting annoyed because of
this? Or you have your own "realistic" workloads that allocate hugepages
on demand (OK, when I say realistic, something like specjbb or whatever
is obviously a reasonable macrobenchmark even if it isn't strictly
realistic).


> > *Much* less likely, actually. Because there should be very little
> > allocation required for reclaim (only dirty pages, and only when backed
> > by filesystems that do silly things like not ensuring their own reserves
> > before allowing the page to be dirtied).
>
> Well yes and no.  A lot of filesystems do do such stupid things.
> Allocating things like journal pages which have relativly long lives
> during reclaim.  We have seen these getting placed into the memory we
> have just freed and preventing higher order coelesce.

They shouldn't, because that's sad and deadlocky. But yes I agree it
happens sometimes.


> > Also, your scheme still doesn't avoid allocation for reclaim so I don't
> > see how you can use that as a point against queueing but not capturing!
>
> Obviously we cannot prevent allocations during reclaim.  But we can avoid
> those allocations falling within the captured range.  All pages under
> capture are marked.  Any page returning to the allocator that merges with a
> buddy under capture, or that is a buddy under capture are kept separatly.
> Such that any allocations from within reclaim will necessarily take
> pages from elsewhere in the pool.  The key thing about capture is that it
> effectivly marks ranges of pages out of use for allocations for the period
> of the reclaim, so we have a number of small ranges blocked out, not the
> whole pool.  This allows parallel allocations (reclaim and otherwise)
> to succeed against the reserves (refilled by kswapd etc), whilst marking
> the pages under capture out and preventing them from being used.

Yeah, but blocking the whole pool gives a *much* bigger chance to coalesce
freed pages. And I'm not just talking about massive order-10 allocations
or something where you have the targetted reclaim which improves chances of
getting a free page within that range, but also about order-1/2/3 pages
that might be more commonly used in normal kernel workloads but can still
have a fairly high latency to succeed if there is a lot of other parallel
allocation happening.


> > I don't see why it should be unfair to allow a process to allocate 1024
> > order-0 pages ahead of one order-10 page (actually, yes the order 10 page
> > is I guess somewhat more valueable than the same number of fragmented
> > pages, but you see where I'm coming from).
>
> I think we have our wires crossed here.  I was saying it would seem
> unfair to block the allocator from giving out order-0 pages while we are
> struggling to get an order-10 page for one process.  Having a queue
> would seem to generate such behaviour.  What I am trying to achieve with
> capture is to push areas likely to return us a page of the requested
> size out of use, while we try and reclaim it without blocking the rest
> of the allocator.

We don't have our wires crossed. I just don't agree that it is unfair.
It might be unfair to allow order-10 allocations at the same *rate* as
order-0 allocations, which is why you could allow some priority in the
queue. But when you decide you want to satisfy an order-10 allocation,
do you want the other guys potentially mopping up your coalescing
candidates? (and right, for targetted reclaim, maybe this is less of a
problem, but I'm also thinking about a general solution for all orders
of pages not just hugepages).


> > At least with queueing you are basing the idea on *some* reasonable
> > policy, rather than purely random "whoever happens to take this lock at
> > the right time will win" strategy, which might I add, can even be much
> > more unfair than you might say queueing is.
> >
> > However, queueing would still be able to allow some flexibility in
> > priority. For example:
> >
> > if (must_queue) {
> >   if (queue_head_prio == 0)
> >     join_queue(1<<order);
> >   else {
> >     queue_head_prio -= 1<<order;
> >     skip_queue();
> >   }
> > }
>
> Sure you would be able to make some kind of more flexible decisions, but
> that still seems like a heavy handed approach.  You are important enough
> to takes pages (possibly from our mostly free high order page) or not.

I don't understand the thought process that leads you to these assertions. 
Where do you get your idea of fairness or importance?

I would say that allowing 2^N order-0 allocations for every order-N
allocation if both allocators are in a tight loop (and reserves are low,
ie. reclaim is required) is a completely reasonable starting point for
fairness. Do you disagree with that? How is it less fair than your
approach?


> In a perfect world we would be able to know in advance that an order-N
> region would definatly come free if reclaimed and allocate that preemptivly
> to the requestor, apply reclaim to it, and then actually allocate the page.

But with my queueing you get effectively the same thing without having an
oracle. Because you will wait for *any* order-N region to become free.

The "tradeoff" is blocking other allocators. But IMO that is actually a
good thing because that equates to a general fairness model in our allocator
for all allocation types in place of the existing, fairly random game of
chance (especially for order-2/3+) that we have now.

Even for hugepages: if, with your capture patches, process 0 comes in
and does all this reclaim work and has nearly freed up some linear region
of memory, then do you think it is reasonable if process 1 happens to
come in, gets lucky, finds a subsequently coalesced hugepage and
allocates it?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Reclaim page capture v3
  2008-09-09  3:31         ` Nick Piggin
@ 2008-09-09 16:35           ` Andy Whitcroft
  2008-09-10  3:19             ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-09 16:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Tue, Sep 09, 2008 at 01:31:43PM +1000, Nick Piggin wrote:
> On Tuesday 09 September 2008 02:41, Andy Whitcroft wrote:
> > On Mon, Sep 08, 2008 at 11:59:54PM +1000, Nick Piggin wrote:
> 
> > > So... what does the non-simulation version (ie. the real app) say?
> >
> > In the common use model, use of huge pages is all or nothing.  Either
> > there are sufficient pages allocatable at application start time or there
> > are not.  As huge pages are not swappable once allocated they stay there.
> > Applications either start using huge pages or they fallback to small pages
> > and continue.  This makes the real metric, how often does the customer get
> > annoyed becuase their application has fallen back to small pages and is
> > slow, or how often does their database fail to start.  It is very hard to
> > directly measure that and thus to get a comparitive figure.  Any attempt
> > to replicate that seems as necessarly artificial as the current test.
> 
> But you have customers telling you they're getting annoyed because of
> this? Or you have your own "realistic" workloads that allocate hugepages
> on demand (OK, when I say realistic, something like specjbb or whatever
> is obviously a reasonable macrobenchmark even if it isn't strictly
> realistics).

We have customers complaining about the usability of hugepages.  They are
keen to obtain the performance benefits of their use, but are put off by
the difficulty of managing systems with them enabled.  That has led to a
series of changes to improve the availability and manageability of
hugepages, including targetted reclaim, anti-fragmentation, and the
dynamic pool.  We obviously test hugepages with a series of benchmarks,
but generally in controlled environments where allocation of hugepages is
guaranteed, in order to understand the performance benefits of their use.
This is an acceptable configuration for hugepage diehards, but limits
their utility to the wider community.  The focus of this work is the
availability of hugepages and how that applies to manageability.  A
common use model for big systems is batch oriented, running a mix of
jobs.  For that to work reliably we need to move pages into and out of
the hugepage pool on a regular basis, and for that we need effective
high order reclaim.

> > > *Much* less likely, actually. Because there should be very little
> > > allocation required for reclaim (only dirty pages, and only when backed
> > > by filesystems that do silly things like not ensuring their own reserves
> > > before allowing the page to be dirtied).
> >
> > Well yes and no.  A lot of filesystems do do such stupid things.
> > Allocating things like journal pages which have relativly long lives
> > during reclaim.  We have seen these getting placed into the memory we
> > have just freed and preventing higher order coelesce.
> 
> They shouldn't, because that's sad and deadlocky. But yes I agree it
> happens sometimes.
> 
> > > Also, your scheme still doesn't avoid allocation for reclaim so I don't
> > > see how you can use that as a point against queueing but not capturing!
> >
> > Obviously we cannot prevent allocations during reclaim.  But we can avoid
> > those allocations falling within the captured range.  All pages under
> > capture are marked.  Any page returning to the allocator that merges with a
> > buddy under capture, or that is a buddy under capture are kept separatly.
> > Such that any allocations from within reclaim will necessarily take
> > pages from elsewhere in the pool.  The key thing about capture is that it
> > effectivly marks ranges of pages out of use for allocations for the period
> > of the reclaim, so we have a number of small ranges blocked out, not the
> > whole pool.  This allows parallel allocations (reclaim and otherwise)
> > to succeed against the reserves (refilled by kswapd etc), whilst marking
> > the pages under capture out and preventing them from being used.
> 
> Yeah, but blocking the whole pool gives a *much* bigger chance to coalesce
> freed pages. And I'm not just talking about massive order-10 allocations
> or something where you have the targetted reclaim which improves chances of
> getting a free page within that range, but also for order-1/2/3 pages
> that might be more commonly used in normal kernel workloads but can still
> have a fairly high latency to succeed if there is a lot of other parallel
> allocation happening.

Yes, in isolation blocking the whole pool would give us a higher chance
of coalescing higher order pages.  The problem here is that we cannot
block the whole pool; we have to allow allocations for reclaim.  If those
are occurring we risk losing pages from the areas we are trying to
reclaim.

> > > I don't see why it should be unfair to allow a process to allocate 1024
> > > order-0 pages ahead of one order-10 page (actually, yes the order 10 page
> > > is I guess somewhat more valueable than the same number of fragmented
> > > pages, but you see where I'm coming from).
> >
> > I think we have our wires crossed here.  I was saying it would seem
> > unfair to block the allocator from giving out order-0 pages while we are
> > struggling to get an order-10 page for one process.  Having a queue
> > would seem to generate such behaviour.  What I am trying to achieve with
> > capture is to push areas likely to return us a page of the requested
> > size out of use, while we try and reclaim it without blocking the rest
> > of the allocator.
> 
> We don't have our wires crossed. I just don't agree that it is unfair.
> It might be unfair to allow order-10 allocations at the same *rate* at
> order-0 allocations, which is why you could allow some priority in the
> queue. But when you decide you want to satisfy an order-10 allocation,
> do you want the other guys potentially mopping up your coalescing
> candidates? (and right, for targetted reclaim, maybe this is less of a
> problem, but I'm also thinking about a general solution for all orders
> of pages not just hugepages).

When I am attempting to satisfy an order-10 allocation I clearly want to
protect that order-10 area from use for allocation by other allocators.
That is the key thrust of the capture stuff; that is what it does.
Capture takes the pages in areas under targetted reclaim out of the
allocator during the reclaim pass, protecting those areas from any
parallel allocators, either regular or PF_MEMALLOC, but protecting only
those areas.

I think we have differing views on what queueing alone can deliver.
Given that the allocator will have to allow allocations made during
reclaim in parallel with reclaim for higher orders, it does not seem
sufficient to have a 'fair queue' alone.  That will not stop the smaller
necessary allocations from stealing our partially free buddies,
allocations that we know are not always short lived.  [continued below]

> > > At least with queueing you are basing the idea on *some* reasonable
> > > policy, rather than purely random "whoever happens to take this lock at
> > > the right time will win" strategy, which might I add, can even be much
> > > more unfair than you might say queueing is.
> > >
> > > However, queueing would still be able to allow some flexibility in
> > > priority. For example:
> > >
> > > if (must_queue) {
> > >   if (queue_head_prio == 0)
> > >     join_queue(1<<order);
> > >   else {
> > >     queue_head_prio -= 1<<order;
> > >     skip_queue();
> > >   }
> > > }
> >
> > Sure you would be able to make some kind of more flexible decisions, but
> > that still seems like a heavy handed approach.  You are important enough
> > to takes pages (possibly from our mostly free high order page) or not.
> 
> I don't understand the thought process that leads you to these assertions. 
> Where do you get your idea of fairness or importance?
> 
> I would say that allowing 2^N order-0 allocations for every order-N
> allocations if both allocators are in a tight loop (and reserves are low,
> ie. reclaim is required) is a completely reasonable starting point for
> fairness. Do you disagree with that? How is it less fair than your
> approach?

What I was trying to convey here is that, given the limitation that we
need to allow allocation from within reclaim for reclaim to complete,
even stopping other small allocations for the period of the high order
allocation is insufficient, as those reclaim allocations are potentially
just as damaging.  Even with a priority queue on the allocator you would
need something to prevent allocations falling within an area under
reclaim without preventing allocations completely; it is this problem
that I am trying to solve, trying to get the most beneficial result from
the effort put in during reclaim.  I actually think your idea of a
priority queue and what capture is trying to do are orthogonal.

The current model uses direct reclaim as the 'fairness control'.  If you
are allocating a lot and the pool runs dry, you enter reclaim and are
penalised for your greed.  "If you want something big, you work to get
it."  Someone entering reclaim for a higher order is independent of lower
order allocators, and may even take pages from the pool without reclaim.
What you are proposing with the priority queue seems to be a replacement
for that.

> > In a perfect world we would be able to know in advance that an order-N
> > region would definatly come free if reclaimed and allocate that preemptivly
> > to the requestor, apply reclaim to it, and then actually allocate the page.
> 
> But with my queueing you get effectively the same thing without having an
> oracle. Because you will wait for *any* order-N region to become free.
> 
> The "tradeoff" is blocking other allocators. But IMO that is actually a
> good thing because that equates to a general fairness model in our allocator
> for all allocation types in place of the existing, fairly random game of
> chance (especially for order-2/3+) that we have now.
> 
> Even for hugepages: if, with your capture patches, if process 0 comes in
> and does all this reclaim work and has nearly freed up some linear region
> of memory; then do you think it is reasonable if process-1 happens to come
> in and get lucky and find an subsequently coalesced hugepage and allocate
> it?

Are you saying that if capture fails to obtain the high order page, so
all the reclaim effort was wasted for that process, a subsequent
reclaimer can then come in, reclaim a few pages, get the whole page and
effectively steal the work?  That is unfortunate for sure.  Is it better
that the work was not completely wasted?  Yes.  By containing and
protecting the partial results we work to reduce the likelihood of this
happening.

-apw

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Reclaim page capture v3
  2008-09-09 16:35           ` Andy Whitcroft
@ 2008-09-10  3:19             ` Nick Piggin
  2008-09-10  6:56               ` Andy Whitcroft
  2008-09-12 21:30               ` Andy Whitcroft
  0 siblings, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2008-09-10  3:19 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Wednesday 10 September 2008 02:35, Andy Whitcroft wrote:
> On Tue, Sep 09, 2008 at 01:31:43PM +1000, Nick Piggin wrote:

> > Yeah, but blocking the whole pool gives a *much* bigger chance to
> > coalesce freed pages. And I'm not just talking about massive order-10
> > allocations or something where you have the targetted reclaim which
> > improves chances of getting a free page within that range, but also for
> > order-1/2/3 pages that might be more commonly used in normal kernel
> > workloads but can still have a fairly high latency to succeed if there is
> > a lot of other parallel allocation happening.
>
> Yes in isolation blocking the whole pool would give us a higher chance
> of coalescing higher order pages.  The problem here is that we cannot
> block the whole pool, we have to allow allocations for reclaim.  If we
> have those occuring we risk loss of pages from the areas we are trying
> to reclaim.

Allocations from reclaim should be pretty rare; I think you're too
worried about them.

Actually, if you have queueing, you can do a simple form of capture
just in the targetted reclaim path without all the stuff added to page
alloc etc., because you can hold off from freeing the targetted pages
into the page allocator until you have built up a large number of them.
That way, they won't be subject to recursive allocations and they will
coalesce in the allocator; the queueing will ensure they don't get
stolen or broken into smaller pages under us. Yes, I really think queueing
(perhaps with a much simplified form of capture) is the way to go.
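
Roughly (a sketch only; the helpers here are invented, not existing
interfaces), the targetted reclaim path would hold its frees on a private
list and give them back in one go:

	/*
	 * Illustrative only: batch the pages freed by targetted reclaim
	 * and release them together, so they can coalesce in the buddy
	 * allocator while the queue stops other allocators taking them.
	 */
	static void targetted_reclaim_batched(struct zone *zone,
					      struct page *start, int order)
	{
		LIST_HEAD(freed);

		/* reclaim the range, collecting freed pages on a private
		 * list rather than returning them one at a time */
		reclaim_range_onto_list(zone, start, 1 << order, &freed);

		/* now hand the whole batch back at once */
		release_page_list(&freed);
	}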


> > > I think we have our wires crossed here.  I was saying it would seem
> > > unfair to block the allocator from giving out order-0 pages while we
> > > are struggling to get an order-10 page for one process.  Having a queue
> > > would seem to generate such behaviour.  What I am trying to achieve
> > > with capture is to push areas likely to return us a page of the
> > > requested size out of use, while we try and reclaim it without blocking
> > > the rest of the allocator.
> >
> > We don't have our wires crossed. I just don't agree that it is unfair.
> > It might be unfair to allow order-10 allocations at the same *rate* at
> > order-0 allocations, which is why you could allow some priority in the
> > queue. But when you decide you want to satisfy an order-10 allocation,
> > do you want the other guys potentially mopping up your coalescing
> > candidates? (and right, for targetted reclaim, maybe this is less of a
> > problem, but I'm also thinking about a general solution for all orders
> > of pages not just hugepages).
>
> When I am attempting to satisfy an order-10 allocation I clearly want to
> protect that order-10 area from use for allocation to other allocators.
> And that is the key thrust of the capture stuff, that is what it does.
> Capture puts the pages in areas under targetted reclaim out of the
> allocator during the reclaim pass.  Protecting those areas from any
> parallel parallel allocators, either regular or PF_MEMALLOC; but only
> protecting those areas.

And I've shown that by a principle of fairness, you don't need to protect
just one area, but you can hold off reclaim on all areas, which should
greatly increase your chances of getting a hugepage, and will allow you
to do so fairly.


> I think we have differing views on what a queuing alone can deliver.
> Given that the allocator will have to allow allocations made during reclaim
> in parallel with reclaim for higher orders it does not seem sufficient
> to have a 'fair queue' alone.  That will not stop the smaller necessary
> allocations from stealing our partial free buddies, allocations that we
> know are not always short lived.  [continued below]

Queueing is a general and natural step which (hopefully) benefits everyone,
so I think it should be tried before capture (which introduces new complexity
solely for big allocations).


> > > Sure you would be able to make some kind of more flexible decisions,
> > > but that still seems like a heavy handed approach.  You are important
> > > enough to takes pages (possibly from our mostly free high order page)
> > > or not.
> >
> > I don't understand the thought process that leads you to these
> > assertions. Where do you get your idea of fairness or importance?
> >
> > I would say that allowing 2^N order-0 allocations for every order-N
> > allocations if both allocators are in a tight loop (and reserves are low,
> > ie. reclaim is required) is a completely reasonable starting point for
> > fairness. Do you disagree with that? How is it less fair than your
> > approach?
>
> What I was trying to convey here is that given the limitation that we need
> to allow allocation from within reclaim for reclaim to complete, that
> even stopping other small allocations for the period of the high order
> allocation is insufficient as those reclaim allocations are potentially
> as damaging.  That even with a priority queue on the allocator you would
> need something to prevent allocations falling within an area under reclaim
> without preventing allocations completely, it is this problem that I
> am trying to solve.  Trying to get the most beneficial result from the
> effort put in during reclaim.  I actually think your idea of a priority
> queue and what capture is trying to do are othogonal.

Yes they are orthogonal, but I would like to see if we can do without
page capture. I'm not saying the theory or practice of it is wrong.


> The current model uses direct reclaim as the 'fairness control'.  If you
> are allocating a lot and the pool runs dry, you enter reclaim and are
> penalised for your greed.  "If you want something big, you work to get it."
> Someone entering reclaim for higher order is independant of lower order
> allocators, and may even take pages from the pool without reclaim.
> What you are proposing with the priority queue seems to be a replacement
> for that.

For order-2,3 etc allocations that can be commonly used by the kernel but
are not always in large supply, that fairness model doesn't really work.
You can work to free things, then have another process allocate them, then
exit direct reclaim to find none left etc.

Even for order-0 pages there is a problem in theory (although statistically
it probably evens out much better).


> > Even for hugepages: if, with your capture patches, if process 0 comes in
> > and does all this reclaim work and has nearly freed up some linear region
> > of memory; then do you think it is reasonable if process-1 happens to
> > come in and get lucky and find an subsequently coalesced hugepage and
> > allocate it?
>
> Are you saying of capture fails to obtain the high order page, so all
> the reclaim effort was wasted for that process, but then a subsequent
> reclaimer comes in and reclaims a few pages, gets the whole page and
> effectivly steals the work?  It is unfortuanate for sure.  Is it better
> that the work was not completely wasted, yes.  By containing and protecting
> the partial results we work to reduce the likely hood of this happening.

Yes.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Reclaim page capture v3
  2008-09-10  3:19             ` Nick Piggin
@ 2008-09-10  6:56               ` Andy Whitcroft
  2008-09-12 21:30               ` Andy Whitcroft
  1 sibling, 0 replies; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-10  6:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

Am flying today, so I won't get to respond to this one in detail
immediately.

-apw

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Reclaim page capture v3
  2008-09-10  3:19             ` Nick Piggin
  2008-09-10  6:56               ` Andy Whitcroft
@ 2008-09-12 21:30               ` Andy Whitcroft
  1 sibling, 0 replies; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-12 21:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman

On Wed, Sep 10, 2008 at 01:19:01PM +1000, Nick Piggin wrote:
> On Wednesday 10 September 2008 02:35, Andy Whitcroft wrote:
> > On Tue, Sep 09, 2008 at 01:31:43PM +1000, Nick Piggin wrote:
> 
> > > Yeah, but blocking the whole pool gives a *much* bigger chance to
> > > coalesce freed pages. And I'm not just talking about massive order-10
> > > allocations or something where you have the targetted reclaim which
> > > improves chances of getting a free page within that range, but also for
> > > order-1/2/3 pages that might be more commonly used in normal kernel
> > > workloads but can still have a fairly high latency to succeed if there is
> > > a lot of other parallel allocation happening.
> >
> > Yes in isolation blocking the whole pool would give us a higher chance
> > of coalescing higher order pages.  The problem here is that we cannot
> > block the whole pool, we have to allow allocations for reclaim.  If we
> > have those occuring we risk loss of pages from the areas we are trying
> > to reclaim.
> 
> Allocations from reclaim should be pretty rare, I think you're too
> worried about them.

They are certainly there, particularly with journalled filesystems in
the mix.  But although I do worry about them, my solution has been to
allow all allocations in parallel with reclaim so they fall into the
same bucket.

> Actually, if you have queueing, you can do a simple form of capture
> just in the targetted reclaim path without all the stuff added to page
> alloc etc. Because you can hold off from freeing the targetted pages
> into the page allocator until you have built up a large number of them.
> That way, they won't be subject to recursive allocations and they will
> coalesce in the allocator; the queueing will ensure they don't get
> stolen or broken into smaller pages under us. Yes I really think queueing
> (perhaps with a much simplified form of capture) is the way to go.

It almost feels as if you think these patches do much more than they
actually do.  The primary source of captured pages is through collection
of pages as we complete synchronous targetted reclaim on them.  However we
do also collect and merge pages which are already free, and those which
are freed from outside the reclaim during the period of the capture.

From your description I am getting the impression that your proposal is
that there be a queue on the allocator plus capture only for the pages we
release during targetted reclaim.  We would then release our list and
try to allocate from the buddy, relying on buddy to coalesce the pages
before we allocate from there.  The queue protects us from any stealing.

In the development and testing of capture, we actually started out with a
simple version of capture, without the additional ability to collect the
already free pages and those which are freed during the capture.  (It is
these parts which bring the apparent complexity of the change to
page_alloc.c, and which I am assuming are the bits you particularly
dislike.)  Testing on this simple version showed the chances of actually
finding areas of the size requested were close to zero, making capture
statistically worthless.  That seems to indicate that for queue+simple
capture to have any statistical chance of improving things, all
allocations would need to be stopped for the entire period of the
capture.  Even if we assume reclaim allocations (which we cannot queue)
never occur, that still leaves us with a single-threaded allocator once
reclaim starts.  Bear in mind that for very high order pages that would
be a period equivalent to the time it takes to write thousands of base
pages.  That seems unreasonable at best.

> > > > I think we have our wires crossed here.  I was saying it would seem
> > > > unfair to block the allocator from giving out order-0 pages while we
> > > > are struggling to get an order-10 page for one process.  Having a queue
> > > > would seem to generate such behaviour.  What I am trying to achieve
> > > > with capture is to push areas likely to return us a page of the
> > > > requested size out of use, while we try and reclaim it without blocking
> > > > the rest of the allocator.
> > >
> > > We don't have our wires crossed. I just don't agree that it is unfair.
> > > It might be unfair to allow order-10 allocations at the same *rate* at
> > > order-0 allocations, which is why you could allow some priority in the
> > > queue. But when you decide you want to satisfy an order-10 allocation,
> > > do you want the other guys potentially mopping up your coalescing
> > > candidates? (and right, for targetted reclaim, maybe this is less of a
> > > problem, but I'm also thinking about a general solution for all orders
> > > of pages not just hugepages).
> >
> > When I am attempting to satisfy an order-10 allocation I clearly want to
> > protect that order-10 area from use for allocation to other allocators.
> > And that is the key thrust of the capture stuff, that is what it does.
> > Capture puts the pages in areas under targetted reclaim out of the
> > allocator during the reclaim pass.  Protecting those areas from any
> > parallel parallel allocators, either regular or PF_MEMALLOC; but only
> > protecting those areas.
> 
> And I've shown that by a principle of fairness, you don't need to protect
> just one area, but you can hold off reclaim on all areas, which should
> greatly increase your chances of getting a hugepage, and will allow you
> to do so fairly.
> 
> 
> > I think we have differing views on what a queuing alone can deliver.
> > Given that the allocator will have to allow allocations made during reclaim
> > in parallel with reclaim for higher orders it does not seem sufficient
> > to have a 'fair queue' alone.  That will not stop the smaller necessary
> > allocations from stealing our partial free buddies, allocations that we
> > know are not always short lived.  [continued below]
> 
> Queueing is a general and natural step which (hopefully) benefits everyone,
> so I think it should be tried before capture (which introduces new complexity
> solely for big allocations).

To be completely honest, I do not think I have yet figured out what form
of queueing you really envisage here.  Its name implies an ordering of
allocations based on request 'start' time.  The discussion in previous
emails implies more of a balancing of allocation rates, at least in terms
of fairness.  Could you perhaps flesh out your queue idea here so that we
are arguing about the same thing?

Some details of that would also help me understand the version of
fairness you are trying to generate, as there are a number of often
conflicting definitions of fairness which are all valid in their own
sense.  For example I could state that it is 'unfair that an allocator
wanting an order-10 page would stop me getting a single order-0 page',
and yet also state 'it is unfair that I cannot allocate a single order-10
page when you have allocated 1024 order-0 pages'; both are reasonable
statements and yet very likely to conflict in implementation.

> > > > Sure you would be able to make some kind of more flexible decisions,
> > > > but that still seems like a heavy handed approach.  You are important
> > > > enough to takes pages (possibly from our mostly free high order page)
> > > > or not.
> > >
> > > I don't understand the thought process that leads you to these
> > > assertions. Where do you get your idea of fairness or importance?
> > >
> > > I would say that allowing 2^N order-0 allocations for every order-N
> > > allocations if both allocators are in a tight loop (and reserves are low,
> > > ie. reclaim is required) is a completely reasonable starting point for
> > > fairness. Do you disagree with that? How is it less fair than your
> > > approach?
> >
> > What I was trying to convey here is that given the limitation that we need
> > to allow allocation from within reclaim for reclaim to complete, that
> > even stopping other small allocations for the period of the high order
> > allocation is insufficient as those reclaim allocations are potentially
> > as damaging.  That even with a priority queue on the allocator you would
> > need something to prevent allocations falling within an area under reclaim
> > without preventing allocations completely, it is this problem that I
> > am trying to solve.  Trying to get the most beneficial result from the
> > effort put in during reclaim.  I actually think your idea of a priority
> > queue and what capture is trying to do are othogonal.
> 
> Yes they are orthogonal, but I would like to see if we can do without
> page capture. I'm not saying the theory or practice of it is wrong.
> 
> 
> > The current model uses direct reclaim as the 'fairness control'.  If you
> > are allocating a lot and the pool runs dry, you enter reclaim and are
> > penalised for your greed.  "If you want something big, you work to get it."
> > Someone entering reclaim for higher order is independant of lower order
> > allocators, and may even take pages from the pool without reclaim.
> > What you are proposing with the priority queue seems to be a replacement
> > for that.
> 
> For order-2,3 etc allocations that can be commonly used by the kernel but
> are not always in large supply, that fairness model doesn't really work.
> You can work to free things, then have another process allocate them, then
> exit direct reclaim to find none left etc.
>
> Even for order-0 pages there is a problem in theory (although statistically
> it probably evens out much better).

Yes, that is entirely true.  As it happens we only apply capture here to
orders above COSTLY_ORDER, in line with the kernel's assumption that
orders at or below that are statistically released 'relatively easily'
during reclaim.  If that assumption is wrong then it would seem logical
to apply capture down to order 0.
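
In code terms that amounts to something like the following gate in the
direct reclaim path, using the kernel's PAGE_ALLOC_COSTLY_ORDER
(illustrative only; try_capture() is not the real interface):

	/* Illustrative only: set up a capture only for expensive orders. */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		capture = try_capture(zone, order);
	else
		capture = NULL;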

> > > Even for hugepages: if, with your capture patches, if process 0 comes in
> > > and does all this reclaim work and has nearly freed up some linear region
> > > of memory; then do you think it is reasonable if process-1 happens to
> > > come in and get lucky and find an subsequently coalesced hugepage and
> > > allocate it?
> >
> > Are you saying of capture fails to obtain the high order page, so all
> > the reclaim effort was wasted for that process, but then a subsequent
> > reclaimer comes in and reclaims a few pages, gets the whole page and
> > effectivly steals the work?  It is unfortuanate for sure.  Is it better
> > that the work was not completely wasted, yes.  By containing and protecting
> > the partial results we work to reduce the likely hood of this happening.
> 
> Yes.

-apw

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse
  2008-10-01 12:30 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
@ 2008-10-02  7:05   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-02  7:05 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Peter Zijlstra,
	Christoph Lameter, Rik van Riel, Mel Gorman, Nick Piggin,
	Andrew Morton

On Wed,  1 Oct 2008 13:30:58 +0100
Andy Whitcroft <apw@shadowen.org> wrote:

> When we are about to release a page we perform a number of actions
> on that page.  We clear down any anonymous mappings, confirm that
> the page is safe to release, check for freeing locks, before mapping
> the page should that be required.  Pull this processing out into a
> helper function for reuse in a later patch.
> 
> Note that we do not convert the similar cleardown in free_hot_cold_page()
> as the optimiser is unable to squash the loops during the inline.
> 
> Signed-off-by: Andy Whitcroft <apw@shadowen.org>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiruyo@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse
  2008-10-01 12:30 [PATCH 0/4] Reclaim page capture v4 Andy Whitcroft
@ 2008-10-01 12:30 ` Andy Whitcroft
  2008-10-02  7:05   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Whitcroft @ 2008-10-01 12:30 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, KOSAKI Motohiro, Peter Zijlstra, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andy Whitcroft, Nick Piggin,
	Andrew Morton

When we are about to release a page we perform a number of actions
on that page.  We clear down any anonymous mappings, confirm that
the page is safe to release, check for freeing locks, before mapping
the page should that be required.  Pull this processing out into a
helper function for reuse in a later patch.

Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during the inline.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c |   40 +++++++++++++++++++++++++++-------------
 1 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..55d8d9b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,32 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
+ * Prepare this page for release to the buddy.  Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+	int i;
+	int reserved = 0;
+
+	for (i = 0 ; i < (1 << order) ; ++i)
+		reserved += free_pages_check(page + i);
+	if (reserved)
+		return 0;
+
+	if (!PageHighMem(page)) {
+		debug_check_no_locks_freed(page_address(page),
+							PAGE_SIZE << order);
+		debug_check_no_obj_freed(page_address(page),
+					   PAGE_SIZE << order);
+	}
+	arch_free_page(page, order);
+	kernel_map_pages(page, 1 << order, 0);
+
+	return 1;
+}
+
+/*
  * Frees a list of pages. 
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
@@ -529,22 +555,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
-	int i;
-	int reserved = 0;
 
-	for (i = 0 ; i < (1 << order) ; ++i)
-		reserved += free_pages_check(page + i);
-	if (reserved)
+	if (!free_page_prepare(page, order))
 		return;
 
-	if (!PageHighMem(page)) {
-		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
-		debug_check_no_obj_freed(page_address(page),
-					   PAGE_SIZE << order);
-	}
-	arch_free_page(page, order);
-	kernel_map_pages(page, 1 << order, 0);
-
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, order);
-- 
1.6.0.1.451.gc8d31


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse
  2008-09-03 18:44 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
  2008-09-04  1:24   ` Rik van Riel
@ 2008-09-05  1:52   ` KOSAKI Motohiro
  1 sibling, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2008-09-05  1:52 UTC (permalink / raw)
  To: Andy Whitcroft; +Cc: kosaki.motohiro, linux-mm, linux-kernel, Mel Gorman

> When we are about to release a page we perform a number of actions
> on that page.  We clear down any anonymous mappings, confirm that
> the page is safe to release, check for freeing locks, before mapping
> the page should that be required.  Pull this processing out into a
> helper function for reuse in a later patch.
> 
> Note that we do not convert the similar cleardown in free_hot_cold_page()
> as the optimiser is unable to squash the loops during the inline.
> 
> Signed-off-by: Andy Whitcroft <apw@shadowen.org>

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>






^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse
  2008-09-03 18:44 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
@ 2008-09-04  1:24   ` Rik van Riel
  2008-09-05  1:52   ` KOSAKI Motohiro
  1 sibling, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2008-09-04  1:24 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Mel Gorman, Andy Whitcroft

On Wed,  3 Sep 2008 19:44:09 +0100
Andy Whitcroft <apw@shadowen.org> wrote:

> Signed-off-by: Andy Whitcroft <apw@shadowen.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse
  2008-09-03 18:44 [RFC PATCH 0/4] Reclaim page capture v2 Andy Whitcroft
@ 2008-09-03 18:44 ` Andy Whitcroft
  2008-09-04  1:24   ` Rik van Riel
  2008-09-05  1:52   ` KOSAKI Motohiro
  0 siblings, 2 replies; 22+ messages in thread
From: Andy Whitcroft @ 2008-09-03 18:44 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, KOSAKI Motohiro, Mel Gorman, Andy Whitcroft

When we are about to release a page we perform a number of actions
on that page.  We clear down any anonymous mappings, confirm that
the page is safe to release, check for freeing locks, before mapping
the page should that be required.  Pull this processing out into a
helper function for reuse in a later patch.

Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during the inline.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
 mm/page_alloc.c |   43 ++++++++++++++++++++++++++++++-------------
 1 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..b2a2c2b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,35 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
+ * Prepare this page for release to the buddy.  Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+	int i;
+	int reserved = 0;
+
+	if (PageAnon(page))
+		page->mapping = NULL;
+
+	for (i = 0 ; i < (1 << order) ; ++i)
+		reserved += free_pages_check(page + i);
+	if (reserved)
+		return 0;
+
+	if (!PageHighMem(page)) {
+		debug_check_no_locks_freed(page_address(page),
+							PAGE_SIZE << order);
+		debug_check_no_obj_freed(page_address(page),
+					   PAGE_SIZE << order);
+	}
+	arch_free_page(page, order);
+	kernel_map_pages(page, 1 << order, 0);
+
+	return 1;
+}
+
+/*
  * Frees a list of pages. 
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
@@ -529,22 +558,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
-	int i;
-	int reserved = 0;
 
-	for (i = 0 ; i < (1 << order) ; ++i)
-		reserved += free_pages_check(page + i);
-	if (reserved)
+	if (!free_page_prepare(page, order))
 		return;
 
-	if (!PageHighMem(page)) {
-		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
-		debug_check_no_obj_freed(page_address(page),
-					   PAGE_SIZE << order);
-	}
-	arch_free_page(page, order);
-	kernel_map_pages(page, 1 << order, 0);
-
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, order);
-- 
1.6.0.rc1.258.g80295


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse
  2008-07-01 17:58 [RFC PATCH 0/4] Reclaim page capture v1 Andy Whitcroft
@ 2008-07-01 17:58 ` Andy Whitcroft
  0 siblings, 0 replies; 22+ messages in thread
From: Andy Whitcroft @ 2008-07-01 17:58 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Mel Gorman, Andy Whitcroft

When we are about to release a page we perform a number of actions
on that page.  We clear down any anonymous mappings, confirm that
the page is safe to release, check for freeing locks, before mapping
the page should that be required.  Pull this processing out into a
helper function for reuse in a later patch.

Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops during the inline.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
---
 mm/page_alloc.c |   43 ++++++++++++++++++++++++++++++-------------
 1 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8aa93f3..758ecf1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -468,6 +468,35 @@ static inline int free_pages_check(struct page *page)
 }
 
 /*
+ * Prepare this page for release to the buddy.  Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+	int i;
+	int reserved = 0;
+
+	if (PageAnon(page))
+		page->mapping = NULL;
+
+	for (i = 0 ; i < (1 << order) ; ++i)
+		reserved += free_pages_check(page + i);
+	if (reserved)
+		return 0;
+
+	if (!PageHighMem(page)) {
+		debug_check_no_locks_freed(page_address(page),
+							PAGE_SIZE << order);
+		debug_check_no_obj_freed(page_address(page),
+					   PAGE_SIZE << order);
+	}
+	arch_free_page(page, order);
+	kernel_map_pages(page, 1 << order, 0);
+
+	return 1;
+}
+
+/*
  * Frees a list of pages. 
  * Assumes all pages on list are in same zone, and of same order.
  * count is the number of pages to free.
@@ -508,22 +537,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
-	int i;
-	int reserved = 0;
 
-	for (i = 0 ; i < (1 << order) ; ++i)
-		reserved += free_pages_check(page + i);
-	if (reserved)
+	if (!free_page_prepare(page, order))
 		return;
 
-	if (!PageHighMem(page)) {
-		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
-		debug_check_no_obj_freed(page_address(page),
-					   PAGE_SIZE << order);
-	}
-	arch_free_page(page, order);
-	kernel_map_pages(page, 1 << order, 0);
-
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
 	free_one_page(page_zone(page), page, order);
-- 
1.5.6.1.201.g3e7d3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2008-10-02  7:00 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-05 10:19 [PATCH 0/4] Reclaim page capture v3 Andy Whitcroft
2008-09-05 10:19 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
2008-09-08 14:11   ` MinChan Kim
2008-09-08 15:14     ` Andy Whitcroft
2008-09-05 10:20 ` [PATCH 2/4] pull out zone cpuset and watermark checks " Andy Whitcroft
2008-09-05 10:20 ` [PATCH 3/4] buddy: explicitly identify buddy field use in struct page Andy Whitcroft
2008-09-05 10:20 ` [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer Andy Whitcroft
2008-09-08  6:08 ` [PATCH 0/4] Reclaim page capture v3 Nick Piggin
2008-09-08 11:44   ` Andy Whitcroft
2008-09-08 13:59     ` Nick Piggin
2008-09-08 16:41       ` Andy Whitcroft
2008-09-09  3:31         ` Nick Piggin
2008-09-09 16:35           ` Andy Whitcroft
2008-09-10  3:19             ` Nick Piggin
2008-09-10  6:56               ` Andy Whitcroft
2008-09-12 21:30               ` Andy Whitcroft
  -- strict thread matches above, loose matches on Subject: below --
2008-10-01 12:30 [PATCH 0/4] Reclaim page capture v4 Andy Whitcroft
2008-10-01 12:30 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
2008-10-02  7:05   ` KAMEZAWA Hiroyuki
2008-09-03 18:44 [RFC PATCH 0/4] Reclaim page capture v2 Andy Whitcroft
2008-09-03 18:44 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
2008-09-04  1:24   ` Rik van Riel
2008-09-05  1:52   ` KOSAKI Motohiro
2008-07-01 17:58 [RFC PATCH 0/4] Reclaim page capture v1 Andy Whitcroft
2008-07-01 17:58 ` [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse Andy Whitcroft
