* Re: [PATCH 13/16] mm: fix minor scan count bugs
  2005-12-07 10:48 ` [PATCH 13/16] mm: fix minor scan count bugs Wu Fengguang
@ 2005-12-07 10:32   ` Nick Piggin
  2005-12-07 11:02   ` Wu Fengguang
  1 sibling, 0 replies; 35+ messages in thread
From: Nick Piggin @ 2005-12-07 10:32 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Marcelo Tosatti, Magnus Damm, Nick Piggin,
	Andrea Arcangeli

Wu Fengguang wrote:
> - in isolate_lru_pages(): reports one more scan. Fix it.
> - in shrink_cache(): 0 pages taken does not mean 0 pages scanned. Fix it.
> 

This looks good, although in the first hunk it might be nicer
to turn it into the more familiar for-loop.

   for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
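
For illustration only, the whole first hunk reworked with the suggested for
loop might read (loop body otherwise unchanged from the patch):

   for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
           page = lru_to_page(src);
           prefetchw_prev_lru_page(page, src, flags);
           /* ... rest of the loop body as in isolate_lru_pages() ... */
   }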

> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> ---
> 
>  mm/vmscan.c |   10 ++++++----
>  1 files changed, 6 insertions(+), 4 deletions(-)
> 
> --- linux.orig/mm/vmscan.c
> +++ linux/mm/vmscan.c
> @@ -864,7 +864,8 @@ static int isolate_lru_pages(int nr_to_s
>  	struct page *page;
>  	int scan = 0;
>  
> -	while (scan++ < nr_to_scan && !list_empty(src)) {
> +	while (scan < nr_to_scan && !list_empty(src)) {
> +		scan++;
>  		page = lru_to_page(src);
>  		prefetchw_prev_lru_page(page, src, flags);
>  
> @@ -911,14 +912,15 @@ static void shrink_cache(struct zone *zo
>  	update_zone_age(zone, nr_scan);
>  	spin_unlock_irq(&zone->lru_lock);
>  
> -	if (nr_taken == 0)
> -		return;
> -
>  	sc->nr_scanned += nr_scan;
>  	if (current_is_kswapd())
>  		mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
>  	else
>  		mod_page_state_zone(zone, pgscan_direct, nr_scan);
> +
> +	if (nr_taken == 0)
> +		return;
> +
>  	nr_freed = shrink_list(&page_list, sc);
>  	if (current_is_kswapd())
>  		mod_page_state(kswapd_steal, nr_freed);
> 

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags
  2005-12-07 10:48 ` [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags Wu Fengguang
@ 2005-12-07 10:36   ` Nick Piggin
  2005-12-07 11:11     ` Wu Fengguang
  2005-12-07 11:15   ` Wu Fengguang
  1 sibling, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2005-12-07 10:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Marcelo Tosatti, Magnus Damm, Nick Piggin,
	Andrea Arcangeli

Wu Fengguang wrote:
> Fold bool values into flags to make struct scan_control more compact.
> 

Probably not a bad idea (although you haven't done anything for 64-bit
archs, yet)... do we wait until one more flag wants to be added?

> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> ---
> 
>  mm/vmscan.c |   22 ++++++++++------------
>  1 files changed, 10 insertions(+), 12 deletions(-)
> 
> --- linux.orig/mm/vmscan.c
> +++ linux/mm/vmscan.c
> @@ -72,12 +72,12 @@ struct scan_control {
>  	/* This context's GFP mask */
>  	gfp_t gfp_mask;
>  
> -	int may_writepage;
> -
> -	/* Can pages be swapped as part of reclaim? */
> -	int may_swap;
> +	unsigned long flags;
>  };
>  
> +#define SC_MAY_WRITEPAGE	0x1
> +#define SC_MAY_SWAP		0x2	/* Can pages be swapped as part of reclaim? */
> +
>  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
>  
>  #ifdef ARCH_HAS_PREFETCH
> @@ -488,7 +488,7 @@ static int shrink_list(struct list_head 
>  		 * Try to allocate it some swap space here.
>  		 */
>  		if (PageAnon(page) && !PageSwapCache(page)) {
> -			if (!sc->may_swap)
> +			if (!(sc->flags & SC_MAY_SWAP))
>  				goto keep_locked;
>  			if (!add_to_swap(page, GFP_ATOMIC))
>  				goto activate_locked;
> @@ -519,7 +519,7 @@ static int shrink_list(struct list_head 
>  				goto keep_locked;
>  			if (!may_enter_fs)
>  				goto keep_locked;
> -			if (laptop_mode && !sc->may_writepage)
> +			if (laptop_mode && !(sc->flags & SC_MAY_WRITEPAGE))
>  				goto keep_locked;
>  
>  			/* Page is dirty, try to write it out here */
> @@ -1238,8 +1238,7 @@ int try_to_free_pages(struct zone **zone
>  	delay_prefetch();
>  
>  	sc.gfp_mask = gfp_mask;
> -	sc.may_writepage = 0;
> -	sc.may_swap = 1;
> +	sc.flags = SC_MAY_SWAP;
>  	sc.nr_scanned = 0;
>  	sc.nr_reclaimed = 0;
>  
> @@ -1287,7 +1286,7 @@ int try_to_free_pages(struct zone **zone
>  		 */
>  		if (sc.nr_scanned > SWAP_CLUSTER_MAX * 3 / 2) {
>  			wakeup_pdflush(laptop_mode ? 0 : sc.nr_scanned);
> -			sc.may_writepage = 1;
> +			sc.flags |= SC_MAY_WRITEPAGE;
>  		}
>  
>  		/* Take a nap, wait for some writeback to complete */
> @@ -1343,8 +1342,7 @@ static int balance_pgdat(pg_data_t *pgda
>  
>  loop_again:
>  	sc.gfp_mask = GFP_KERNEL;
> -	sc.may_writepage = 0;
> -	sc.may_swap = 1;
> +	sc.flags = SC_MAY_SWAP;
>  	sc.nr_mapped = read_page_state(nr_mapped);
>  	sc.nr_scanned = 0;
>  	sc.nr_reclaimed = 0;
> @@ -1439,7 +1437,7 @@ scan_swspd:
>  		 */
>  		if (sc.nr_scanned > SWAP_CLUSTER_MAX * 2 &&
>  		    sc.nr_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> -			sc.may_writepage = 1;
> +			sc.flags |= SC_MAY_WRITEPAGE;
>  
>  		if (nr_pages && to_free > sc.nr_reclaimed)
>  			continue;	/* swsusp: need to do more work */
> 

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 00/16] Balancing the scan rate of major caches V3
@ 2005-12-07 10:47 Wu Fengguang
  2005-12-07 10:47 ` [PATCH 01/16] mm: restore sc.nr_to_reclaim Wu Fengguang
                   ` (15 more replies)
  0 siblings, 16 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra

Changes since V2:
- fix divide error in shrink_slab()
- more debug/accounting code
- fine grained priority/scan quantity
- reluctant to reclaim lowest zone if it is out of sync with highest zone

Changes since V1:
- better breakup of the patches
- replace pages_more_aged with age_ge/age_gt
- expanded shrink_slab interface
- rewrite kswapd rebalance logic to be simple and robust


This patchset balances the aging rates of active_list/inactive_list/slab.

It started out as an effort to enable adaptive read-ahead to handle a large
number of concurrent readers. Then I found that it involves much more, and
deserves a standalone patchset to address the balancing problem as a whole.


The whole picture of balancing:

- In each node, inactive_list scan rates are synced with each other
  It is done in the direct/kswapd reclaim path.

- In each zone, active_list scan rate always follows that of inactive_list

- Slab cache scan rates always follow that of the current node.
  Since shrink_slab() can be called from different CPUs, this effectively syncs
  slab cache scan rates with that of the most scanned node.


The patches are grouped as follows:

- balancing work
mm-revert-vmscan-balancing-fix.patch
mm-simplify-kswapd-reclaim-code.patch
mm-balance-zone-aging-supporting-facilities.patch
mm-balance-zone-aging-in-direct-reclaim.patch
mm-balance-zone-aging-in-kswapd-reclaim.patch
mm-balance-slab-aging.patch
mm-balance-active-inactive-list-aging.patch
mm-fine-grained-scan-priority.patch

- pure code cleanups
mm-remove-unnecessary-variable-and-loop.patch
mm-remove-swap-cluster-max-from-scan-control.patch
mm-accumulate-nr-scanned-reclaimed-in-scan-control.patch
mm-fold-bool-variables-into-flags-in-scan-control.patch

- minor fix
mm-scan-accounting-fix.patch

- debug code
mm-account-zone-aging-rounds.patch
mm-page-reclaim-debug-traces.patch
mm-kswapd-reclaim-debug-trace.patch

Thanks,
Wu Fengguang

--
Dept. Automation                University of Science and Technology of China

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 01/16] mm: restore sc.nr_to_reclaim
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
@ 2005-12-07 10:47 ` Wu Fengguang
  2005-12-07 10:47 ` [PATCH 02/16] mm: simplify kswapd reclaim code Wu Fengguang
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-revert-vmscan-balancing-fix.patch --]
[-- Type: text/plain, Size: 1301 bytes --]

Keep it until the real fine grained scan patch is ready :)

The following patches really need small scan quantities, at least in
normal situations.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |    8 ++++++++
 1 files changed, 8 insertions(+)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -63,6 +63,9 @@ struct scan_control {
 
 	unsigned long nr_mapped;	/* From page_state */
 
+	/* How many pages shrink_cache() should reclaim */
+	int nr_to_reclaim;
+
 	/* Ask shrink_caches, or shrink_zone to scan at this priority */
 	unsigned int priority;
 
@@ -898,6 +901,7 @@ static void shrink_cache(struct zone *zo
 		if (current_is_kswapd())
 			mod_page_state(kswapd_steal, nr_freed);
 		mod_page_state_zone(zone, pgsteal, nr_freed);
+		sc->nr_to_reclaim -= nr_freed;
 
 		spin_lock_irq(&zone->lru_lock);
 		/*
@@ -1097,6 +1101,8 @@ shrink_zone(struct zone *zone, struct sc
 	else
 		nr_inactive = 0;
 
+	sc->nr_to_reclaim = sc->swap_cluster_max;
+
 	while (nr_active || nr_inactive) {
 		if (nr_active) {
 			sc->nr_to_scan = min(nr_active,
@@ -1110,6 +1116,8 @@ shrink_zone(struct zone *zone, struct sc
 					(unsigned long)sc->swap_cluster_max);
 			nr_inactive -= sc->nr_to_scan;
 			shrink_cache(zone, sc);
+			if (sc->nr_to_reclaim <= 0)
+				break;
 		}
 	}
 

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 02/16] mm: simplify kswapd reclaim code
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
  2005-12-07 10:47 ` [PATCH 01/16] mm: restore sc.nr_to_reclaim Wu Fengguang
@ 2005-12-07 10:47 ` Wu Fengguang
  2005-12-07 10:47 ` [PATCH 03/16] mm: supporting variables and functions for balanced zone aging Wu Fengguang
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Nick Piggin, Wu Fengguang

[-- Attachment #1: mm-simplify-kswapd-reclaim-code.patch --]
[-- Type: text/plain, Size: 4393 bytes --]

Simplify the kswapd reclaim code for the new balancing logic.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


 mm/vmscan.c |  100 ++++++++++++++++++++----------------------------------------
 1 files changed, 34 insertions(+), 66 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1309,47 +1309,18 @@ loop_again:
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
 
+		all_zones_ok = 1;
+		sc.nr_scanned = 0;
+		sc.nr_reclaimed = 0;
+		sc.priority = priority;
+		sc.swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;
+
 		/* The swap token gets in the way of swapout... */
 		if (!priority)
 			disable_swap_token();
 
-		all_zones_ok = 1;
-
-		if (nr_pages == 0) {
-			/*
-			 * Scan in the highmem->dma direction for the highest
-			 * zone which needs scanning
-			 */
-			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
-				struct zone *zone = pgdat->node_zones + i;
-
-				if (!populated_zone(zone))
-					continue;
-
-				if (zone->all_unreclaimable &&
-						priority != DEF_PRIORITY)
-					continue;
-
-				if (!zone_watermark_ok(zone, order,
-						zone->pages_high, 0, 0)) {
-					end_zone = i;
-					goto scan;
-				}
-			}
-			goto out;
-		} else {
-			end_zone = pgdat->nr_zones - 1;
-		}
-scan:
-		for (i = 0; i <= end_zone; i++) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			lru_pages += zone->nr_active + zone->nr_inactive;
-		}
-
 		/*
 		 * Now scan the zone in the dma->highmem direction, stopping
 		 * at the last zone which needs scanning.
@@ -1359,51 +1330,49 @@ scan:
 		 * pages behind kswapd's direction of progress, which would
 		 * cause too much scanning of the lower zones.
 		 */
-		for (i = 0; i <= end_zone; i++) {
+		for (i = 0; i < pgdat->nr_zones; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab;
 
 			if (!populated_zone(zone))
 				continue;
 
+			if (nr_pages == 0) {	/* Not software suspend */
+				if (zone_watermark_ok(zone, order,
+					zone->pages_high, 0, 0))
+					continue;
+
+				all_zones_ok = 0;
+			}
+
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			if (nr_pages == 0) {	/* Not software suspend */
-				if (!zone_watermark_ok(zone, order,
-						zone->pages_high, end_zone, 0))
-					all_zones_ok = 0;
-			}
 			zone->temp_priority = priority;
 			if (zone->prev_priority > priority)
 				zone->prev_priority = priority;
-			sc.nr_scanned = 0;
-			sc.nr_reclaimed = 0;
-			sc.priority = priority;
-			sc.swap_cluster_max = nr_pages? nr_pages : SWAP_CLUSTER_MAX;
-			atomic_inc(&zone->reclaim_in_progress);
+			lru_pages += zone->nr_active + zone->nr_inactive;
+
 			shrink_zone(zone, &sc);
-			atomic_dec(&zone->reclaim_in_progress);
-			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
-						lru_pages);
-			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-			total_reclaimed += sc.nr_reclaimed;
-			total_scanned += sc.nr_scanned;
-			if (zone->all_unreclaimable)
-				continue;
-			if (nr_slab == 0 && zone->pages_scanned >=
+
+			if (zone->pages_scanned >=
 				    (zone->nr_active + zone->nr_inactive) * 4)
 				zone->all_unreclaimable = 1;
-			/*
-			 * If we've done a decent amount of scanning and
-			 * the reclaim ratio is low, start doing writepage
-			 * even in laptop mode
-			 */
-			if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
-			    total_scanned > total_reclaimed+total_reclaimed/2)
-				sc.may_writepage = 1;
 		}
+		reclaim_state->reclaimed_slab = 0;
+		shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
+		sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+		total_reclaimed += sc.nr_reclaimed;
+		total_scanned += sc.nr_scanned;
+
+		/*
+		 * If we've done a decent amount of scanning and
+		 * the reclaim ratio is low, start doing writepage
+		 * even in laptop mode
+		 */
+		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    total_scanned > total_reclaimed+total_reclaimed/2)
+			sc.may_writepage = 1;
+
 		if (nr_pages && to_free > total_reclaimed)
 			continue;	/* swsusp: need to do more work */
 		if (all_zones_ok)
@@ -1424,7 +1393,6 @@ scan:
 		if ((total_reclaimed >= SWAP_CLUSTER_MAX) && (!nr_pages))
 			break;
 	}
-out:
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 03/16] mm: supporting variables and functions for balanced zone aging
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
  2005-12-07 10:47 ` [PATCH 01/16] mm: restore sc.nr_to_reclaim Wu Fengguang
  2005-12-07 10:47 ` [PATCH 02/16] mm: simplify kswapd reclaim code Wu Fengguang
@ 2005-12-07 10:47 ` Wu Fengguang
  2005-12-11 22:36   ` Marcelo Tosatti
  2005-12-07 10:47 ` [PATCH 04/16] mm: balance zone aging in direct reclaim path Wu Fengguang
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-balance-zone-aging-supporting-facilities.patch --]
[-- Type: text/plain, Size: 5621 bytes --]

The zone aging rates are currently imbalanced; the gap can be as large as 3
times, which can severely damage read-ahead requests and shorten their
effective lifetime.

This patch adds three variables to struct zone
	- aging_total
	- aging_milestone
	- page_age
to keep track of the page aging rate, and keep it in sync at page reclaim time.

The aging_total is just a per-zone counterpart to the per-cpu
pgscan_{kswapd,direct}_{zone name}. But it is not directly comparable between
zones, so aging_milestone/page_age are maintained based on aging_total.

The page_age is a normalized value that can be directly compared between zones
with the helper macros age_ge/age_gt. The goal of the balancing logic is to
keep this normalized value in sync between zones.
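
As an illustration (not part of the patch), the wrap-around age comparison can
be exercised in a tiny userspace program.  It assumes the 32-bit
PAGE_AGE_SHIFT of 12 from the patch below, so ages live in 0-4095 and the
"close enough" gap is 4096/8 = 512; the helpers here take plain values rather
than zone pointers:

    #include <stdio.h>

    #define PAGE_AGE_SIZE   (1UL << 12)
    #define PAGE_AGE_MASK   (PAGE_AGE_SIZE - 1)

    /* same arithmetic as the patch's age_ge()/age_gt() macros */
    static int age_ge(unsigned long a, unsigned long b)
    {
            return ((a - b) & PAGE_AGE_MASK) < PAGE_AGE_SIZE / 8;
    }

    static int age_gt(unsigned long a, unsigned long b)
    {
            return ((b - a) & PAGE_AGE_MASK) > PAGE_AGE_SIZE * 7 / 8;
    }

    int main(void)
    {
            /* an age of 100 that has wrapped past 4000 still counts as ahead */
            printf("%d %d\n", age_ge(100, 4000), age_gt(100, 4000));   /* 1 1 */
            /* a gap of 2500 >= 512: out of sync, neither side is ahead */
            printf("%d %d\n", age_ge(3000, 500), age_gt(3000, 500));   /* 0 0 */
            return 0;
    }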

One can check the balanced aging progress by running:
                        tar c / | cat > /dev/null &
                        watch -n1 'grep "age " /proc/zoneinfo'

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mmzone.h |   14 ++++++++++++++
 mm/page_alloc.c        |   11 +++++++++++
 mm/vmscan.c            |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+)

--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -149,6 +149,20 @@ struct zone {
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
+	/* Fields for balanced page aging:
+	 * aging_total     - The accumulated number of activities that may
+	 *                   cause page aging, that is, make some pages closer
+	 *                   to the tail of inactive_list.
+	 * aging_milestone - A snapshot of total_scan every time a full
+	 *                   inactive_list of pages become aged.
+	 * page_age        - A normalized value showing the percent of pages
+	 *                   have been aged.  It is compared between zones to
+	 *                   balance the rate of page aging.
+	 */
+	unsigned long		aging_total;
+	unsigned long		aging_milestone;
+	unsigned long		page_age;
+
 	/*
 	 * Does the allocator try to reclaim pages from the zone as soon
 	 * as it fails a watermark_ok() in __alloc_pages?
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -123,6 +123,53 @@ static long total_memory;
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
+#ifdef CONFIG_HIGHMEM64G
+#define		PAGE_AGE_SHIFT  8
+#elif BITS_PER_LONG == 32
+#define		PAGE_AGE_SHIFT  12
+#elif BITS_PER_LONG == 64
+#define		PAGE_AGE_SHIFT  20
+#else
+#error unknown BITS_PER_LONG
+#endif
+#define		PAGE_AGE_SIZE   (1 << PAGE_AGE_SHIFT)
+#define		PAGE_AGE_MASK   (PAGE_AGE_SIZE - 1)
+
+/*
+ * The simplified code is:
+ * 	age_ge: (@a->page_age >= @b->page_age)
+ * 	age_gt: (@a->page_age > @b->page_age)
+ * The complexity deals with the wrap-around problem.
+ * Two page ages not close enough(gap >= 1/8) should also be ignored:
+ * they are out of sync and the comparison may be nonsense.
+ *
+ * Return value depends on the position of @a relative to @b:
+ * -1/8       b      +1/8
+ *   |--------|--------|-----------------------------------------------|
+ *       0        1                           0
+ */
+#define age_ge(a, b) \
+	(((a->page_age - b->page_age) & PAGE_AGE_MASK) < PAGE_AGE_SIZE / 8)
+#define age_gt(a, b) \
+	(((b->page_age - a->page_age) & PAGE_AGE_MASK) > PAGE_AGE_SIZE * 7 / 8)
+
+/*
+ * Keep track of the percent of cold pages that have been scanned / aged.
+ * It's not really ##%, but a high resolution normalized value.
+ */
+static inline void update_zone_age(struct zone *z, int nr_scan)
+{
+	unsigned long len = z->nr_inactive | 1;
+
+	z->aging_total += nr_scan;
+
+	if (z->aging_total - z->aging_milestone > len)
+		z->aging_milestone += len;
+
+	z->page_age = ((z->aging_total - z->aging_milestone)
+						<< PAGE_AGE_SHIFT) / len;
+}
+
 /*
  * Add a shrinker callback to be called from the vm
  */
@@ -887,6 +934,7 @@ static void shrink_cache(struct zone *zo
 					     &page_list, &nr_scan);
 		zone->nr_inactive -= nr_taken;
 		zone->pages_scanned += nr_scan;
+		update_zone_age(zone, nr_scan);
 		spin_unlock_irq(&zone->lru_lock);
 
 		if (nr_taken == 0)
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1522,6 +1522,8 @@ void show_free_areas(void)
 			" active:%lukB"
 			" inactive:%lukB"
 			" present:%lukB"
+			" aging:%lukB"
+			" age:%lu"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
 			"\n",
@@ -1533,6 +1535,8 @@ void show_free_areas(void)
 			K(zone->nr_active),
 			K(zone->nr_inactive),
 			K(zone->present_pages),
+			K(zone->aging_total),
+			zone->page_age,
 			zone->pages_scanned,
 			(zone->all_unreclaimable ? "yes" : "no")
 			);
@@ -2144,6 +2148,9 @@ static void __init free_area_init_core(s
 		zone->nr_scan_inactive = 0;
 		zone->nr_active = 0;
 		zone->nr_inactive = 0;
+		zone->aging_total = 0;
+		zone->aging_milestone = 0;
+		zone->page_age = 0;
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)
 			continue;
@@ -2292,6 +2299,8 @@ static int zoneinfo_show(struct seq_file
 			   "\n        high     %lu"
 			   "\n        active   %lu"
 			   "\n        inactive %lu"
+			   "\n        aging    %lu"
+			   "\n        age      %lu"
 			   "\n        scanned  %lu (a: %lu i: %lu)"
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
@@ -2301,6 +2310,8 @@ static int zoneinfo_show(struct seq_file
 			   zone->pages_high,
 			   zone->nr_active,
 			   zone->nr_inactive,
+			   zone->aging_total,
+			   zone->page_age,
 			   zone->pages_scanned,
 			   zone->nr_scan_active, zone->nr_scan_inactive,
 			   zone->spanned_pages,

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 04/16] mm: balance zone aging in direct reclaim path
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (2 preceding siblings ...)
  2005-12-07 10:47 ` [PATCH 03/16] mm: supporting variables and functions for balanced zone aging Wu Fengguang
@ 2005-12-07 10:47 ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 05/16] mm: balance zone aging in kswapd " Wu Fengguang
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-balance-zone-aging-in-direct-reclaim.patch --]
[-- Type: text/plain, Size: 2360 bytes --]

Add 10 extra priorities to the direct page reclaim path, which makes 10 rounds of
balancing effort (reclaiming only from the least aged local/headless zone) before
falling back to the reclaim-all scheme.

Ten rounds should be enough to get enough free pages in normal cases, which
prevents unnecessarily disturbing remote nodes. If we further restrict the first
round of page allocation to local zones, we might get what the early zone
reclaim patch wants: memory affinity/locality.
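
In the patch below this takes the form of priorities above DEF_PRIORITY: in
those extra rounds shrink_caches() walks the zonelist only as far as the
allocating node's own zones plus any following zones belonging to CPU-less
(headless) nodes, picks the least aged of them, and calls shrink_zone() on
that single zone; only when the priority drops to DEF_PRIORITY and below does
it fall back to reclaiming from every eligible zone.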

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |   31 ++++++++++++++++++++++++++++---
 1 files changed, 28 insertions(+), 3 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1194,6 +1194,7 @@ static void
 shrink_caches(struct zone **zones, struct scan_control *sc)
 {
 	int i;
+	struct zone *z = NULL;
 
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];
@@ -1208,11 +1209,34 @@ shrink_caches(struct zone **zones, struc
 		if (zone->prev_priority > sc->priority)
 			zone->prev_priority = sc->priority;
 
-		if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY)
+		if (zone->all_unreclaimable && sc->priority < DEF_PRIORITY)
 			continue;	/* Let kswapd poll it */
 
+		/*
+		 * Balance page aging in local zones and following headless
+		 * zones.
+		 */
+		if (sc->priority > DEF_PRIORITY) {
+			if (zone->zone_pgdat != zones[0]->zone_pgdat) {
+				cpumask_t cpu = node_to_cpumask(
+						zone->zone_pgdat->node_id);
+				if (!cpus_empty(cpu))
+					break;
+			}
+
+			if (!z)
+				z = zone;
+			else if (age_gt(z, zone))
+				z = zone;
+
+			continue;
+		}
+
 		shrink_zone(zone, sc);
 	}
+
+	if (z)
+		shrink_zone(z, sc);
 }
  
 /*
@@ -1256,7 +1280,8 @@ int try_to_free_pages(struct zone **zone
 		lru_pages += zone->nr_active + zone->nr_inactive;
 	}
 
-	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+	/* The added 10 priorities are for scan rate balancing */
+	for (priority = DEF_PRIORITY + 10; priority >= 0; priority--) {
 		sc.nr_mapped = read_page_state(nr_mapped);
 		sc.nr_scanned = 0;
 		sc.nr_reclaimed = 0;
@@ -1290,7 +1315,7 @@ int try_to_free_pages(struct zone **zone
 		}
 
 		/* Take a nap, wait for some writeback to complete */
-		if (sc.nr_scanned && priority < DEF_PRIORITY - 2)
+		if (sc.nr_scanned && priority < DEF_PRIORITY)
 			blk_congestion_wait(WRITE, HZ/10);
 	}
 out:

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 05/16] mm: balance zone aging in kswapd reclaim path
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (3 preceding siblings ...)
  2005-12-07 10:47 ` [PATCH 04/16] mm: balance zone aging in direct reclaim path Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 10:58   ` Wu Fengguang
  2005-12-07 13:32   ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 06/16] mm: balance slab aging Wu Fengguang
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-balance-zone-aging-in-kswapd-reclaim.patch --]
[-- Type: text/plain, Size: 4431 bytes --]

The vm subsystem is rather complex. System memory is divided into zones, and
lower zones act as fallbacks for higher zones in memory allocation.  The page
reclaim algorithm should generally keep zone aging rates in sync. But if a
zone under its watermark has many unreclaimable pages, it has to be scanned
much more to get enough free pages. While doing this,

- lower zones should also be scanned more, since their pages are also usable
  for higher zone allocations.
- higher zones should not be scanned just to keep the aging in sync, which
  can evict a large number of pages without solving the problem (and may well
  worsen it).

With that in mind, the patch does the rebalance in kswapd as follows:
1) reclaim from the lowest zone when
	- under pages_high
	- under pages_high+lowmem_reserve, and less/equal aged than the highest
	  zone (or out of sync with it)
2) reclaim from higher zones when
	- under pages_high+lowmem_reserve, and less/equal aged than its
	  immediate lower neighbor (or out of sync with it)
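
A condensed sketch of the per-zone decision (illustration only; the helper
name and its boolean parameters are made up here, standing in for the two
zone_watermark_ok() tests and the age_ge()/age_gt() comparisons in the hunk
below, where prev_zone starts out as the highest populated zone and then
trails the loop by one zone):

    static int should_scan(int high_plus_reserve_ok, int pages_high_ok,
                           int is_below_prev_zone, int prev_age_ge,
                           int age_gt_prev)
    {
            if (high_plus_reserve_ok)
                    return 0;               /* enough free pages, skip */
            if (is_below_prev_zone)         /* lowest zone vs the highest */
                    return !pages_high_ok || prev_age_ge;
            return !age_gt_prev;            /* keep pace with lower neighbor */
    }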

Note that the zone age is a normalized value in range 0-4096 on i386/4G. 4096
corresponds to a full scan of one zone. The comparison of two ages is only
deemed ok if the gap is less than 4096/8; otherwise they are regarded as out
of sync.

On exit, the code ensures:
1) the lowest zone will be pages_high ok
2) at least one zone will be pages_high+lowmem_reserve ok
3) a very strong force of rebalancing, with the exceptions of
	- some lower zones being unreclaimable: we must let them go ahead
	  alone, leaving higher zones back
	- shrink_zone() scanning too much and creating huge imbalance in one
	  run (Nick is working on this)

The logic can deal with known normal/abnormal situations gracefully:
1) Normal case
	- zone ages are cyclically tied together: taking over each other, and
	  keeping close enough

2) A zone is unreclaimable, scanned much more, and becomes out of sync
	- if ever a troublesome zone is being overscanned, the logic brings
	  its lower neighbors ahead together, leaving higher neighbors back.
	- the aging tie between the two groups is broken, and the relevant
	  zones are reclaimed when pages_high+lowmem_reserve is not ok, just as
	  before the patch.
	- at some point the zone ages meet again and things go back to normal
	- a possibly better strategy, once the pressure disappears, might be
	  to be reluctant to reclaim from the already overscanned lower
	  group, and let the higher group slowly catch up.

3) A zone is truncated
	- the logic will not reclaim from it until it falls under its watermark

With this patch, the meaning of zone->pages_high+lowmem_reserve changes from
the _required_ watermark to the _recommended_ watermark. Someone might be
willing to increase them somewhat.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |   34 +++++++++++++++++++++++++++++-----
 1 files changed, 29 insertions(+), 5 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1364,6 +1364,7 @@ static int balance_pgdat(pg_data_t *pgda
 	int total_scanned, total_reclaimed;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
+	struct zone *prev_zone = pgdat->node_zones;
 
 loop_again:
 	total_scanned = 0;
@@ -1379,6 +1380,9 @@ loop_again:
 		struct zone *zone = pgdat->node_zones + i;
 
 		zone->temp_priority = DEF_PRIORITY;
+
+		if (populated_zone(zone))
+			prev_zone = zone;
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -1409,14 +1413,34 @@ loop_again:
 			if (!populated_zone(zone))
 				continue;
 
-			if (nr_pages == 0) {	/* Not software suspend */
-				if (zone_watermark_ok(zone, order,
-					zone->pages_high, 0, 0))
-					continue;
+			if (nr_pages) 	/* software suspend */
+				goto scan_swspd;
 
-				all_zones_ok = 0;
+			if (zone_watermark_ok(zone, order,
+						zone->pages_high,
+						pgdat->nr_zones - 1, 0)) {
+				/* free pages enough, no reclaim */
+			} else if (zone < prev_zone) {
+				if (!zone_watermark_ok(zone, order,
+						zone->pages_high, 0, 0)) {
+					/* have to scan for free pages */
+					goto scan;
+				}
+				if (age_ge(prev_zone, zone)) {
+					/* catch up if falls behind */
+					goto scan;
+				}
+			} else if (!age_gt(zone, prev_zone)) {
+				/* catch up if falls behind or out of sync */
+				goto scan;
 			}
 
+			prev_zone = zone;
+			continue;
+scan:
+			prev_zone = zone;
+			all_zones_ok = 0;
+scan_swspd:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 06/16] mm: balance slab aging
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (4 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 05/16] mm: balance zone aging in kswapd " Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 11:08   ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 07/16] mm: balance active/inactive list scan rates Wu Fengguang
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-balance-slab-aging.patch --]
[-- Type: text/plain, Size: 8590 bytes --]

The current slab shrinking code is way too fragile.
Let it manage aging pace by itself, and provide a simple and robust interface.

The design considerations:
- use the same syncing facilities as that of the zones
- keep the age of slabs in line with that of the largest zone
  this in effect makes the aging rate of slabs follow that of the most aged node.

- reserve a minimal number of unused slabs
  the size of reservation depends on vm pressure (see the note after this list)

- shrink more slab caches only when vm pressure is high
  the old logic, `mmap pages found' - `shrink more caches' - `avoid swapping',
  does not sound quite logical, so the code is removed.

- let sc->nr_scanned record the exact number of cold pages scanned
  it is no longer used by the slab cache shrinking algorithm, but is still
  good for other algorithms (e.g. the active_list/inactive_list balancing).
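
On the reservation point above: SLAB_RESERVE is 1000 in the patch, and the
shrink loop stops as soon as a cache reports no more than
SLAB_RESERVE * priority / DEF_PRIORITY freeable objects.  So under the
lightest pressure (priority == DEF_PRIORITY) roughly 1000 objects are left in
reserve, while at priority 0 the reserve drops to zero and a cache can be
shrunk as far as its aging allows.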

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 fs/drop-pagecache.c |    2 
 include/linux/mm.h  |    7 +--
 mm/vmscan.c         |  106 +++++++++++++++++++++-------------------------------
 3 files changed, 48 insertions(+), 67 deletions(-)

--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -798,7 +798,9 @@ struct shrinker {
 	shrinker_t		shrinker;
 	struct list_head	list;
 	int			seeks;	/* seeks to recreate an obj */
-	long			nr;	/* objs pending delete */
+	unsigned long		aging_total;
+	unsigned long		aging_milestone;
+	unsigned long		page_age;
 	struct shrinker_stats	*s_stats;
 };
 
@@ -1080,8 +1082,7 @@ int in_gate_area_no_task(unsigned long a
 
 int drop_pagecache_sysctl_handler(struct ctl_table *, int, struct file *,
 					void __user *, size_t *, loff_t *);
-int shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-			unsigned long lru_pages);
+int shrink_slab(struct zone *zone, int priority, gfp_t gfp_mask);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -170,6 +170,18 @@ static inline void update_zone_age(struc
 						<< PAGE_AGE_SHIFT) / len;
 }
 
+static inline void update_slab_age(struct shrinker *s,
+					unsigned long len, int nr_scan)
+{
+	s->aging_total += nr_scan;
+
+	if (s->aging_total - s->aging_milestone > len)
+		s->aging_milestone += len;
+
+	s->page_age = ((s->aging_total - s->aging_milestone)
+						<< PAGE_AGE_SHIFT) / len;
+}
+
 /*
  * Add a shrinker callback to be called from the vm
  */
@@ -181,7 +193,9 @@ struct shrinker *set_shrinker(int seeks,
         if (shrinker) {
 	        shrinker->shrinker = theshrinker;
 	        shrinker->seeks = seeks;
-	        shrinker->nr = 0;
+	        shrinker->aging_total = 0;
+	        shrinker->aging_milestone = 0;
+	        shrinker->page_age = 0;
 		shrinker->s_stats = alloc_percpu(struct shrinker_stats);
 		if (!shrinker->s_stats) {
 			kfree(shrinker);
@@ -209,6 +223,7 @@ void remove_shrinker(struct shrinker *sh
 EXPORT_SYMBOL(remove_shrinker);
 
 #define SHRINK_BATCH 128
+#define SLAB_RESERVE 1000
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -217,76 +232,49 @@ EXPORT_SYMBOL(remove_shrinker);
  * percentages of the lru and ageable caches.  This should balance the seeks
  * generated by these structures.
  *
- * If the vm encounted mapped pages on the LRU it increase the pressure on
- * slab to avoid swapping.
+ * @priority reflects the vm pressure, the lower the value, the more to
+ * shrink.
  *
- * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
- *
- * `lru_pages' represents the number of on-LRU pages in all the zones which
- * are eligible for the caller's allocation attempt.  It is used for balancing
- * slab reclaim versus page reclaim.
+ * @zone is better to be the least over-scanned one (normally the highest
+ * zone).
  *
  * Returns the number of slab objects which we shrunk.
  */
-int shrink_slab(unsigned long scanned, gfp_t gfp_mask, unsigned long lru_pages)
+int shrink_slab(struct zone *zone, int priority, gfp_t gfp_mask)
 {
 	struct shrinker *shrinker;
 	int ret = 0;
 
-	if (scanned == 0)
-		scanned = SWAP_CLUSTER_MAX;
-
 	if (!down_read_trylock(&shrinker_rwsem))
 		return 1;	/* Assume we'll be able to shrink next time */
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		unsigned long long delta;
-		unsigned long total_scan;
-		unsigned long max_pass = (*shrinker->shrinker)(0, gfp_mask);
-
-		delta = (4 * scanned) / shrinker->seeks;
-		delta *= max_pass;
-		do_div(delta, lru_pages + 1);
-		shrinker->nr += delta;
-		if (shrinker->nr < 0) {
-			printk(KERN_ERR "%s: nr=%ld\n",
-					__FUNCTION__, shrinker->nr);
-			shrinker->nr = max_pass;
-		}
-
-		/*
-		 * Avoid risking looping forever due to too large nr value:
-		 * never try to free more than twice the estimate number of
-		 * freeable entries.
-		 */
-		if (shrinker->nr > max_pass * 2)
-			shrinker->nr = max_pass * 2;
-
-		total_scan = shrinker->nr;
-		shrinker->nr = 0;
-
-		while (total_scan >= SHRINK_BATCH) {
-			long this_scan = SHRINK_BATCH;
-			int shrink_ret;
+		while (!zone || age_gt(zone, shrinker)) {
 			int nr_before;
+			int nr_after;
 
 			nr_before = (*shrinker->shrinker)(0, gfp_mask);
-			shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
-			if (shrink_ret == -1)
+			if (nr_before <= SLAB_RESERVE * priority / DEF_PRIORITY)
+				break;
+
+			nr_after = (*shrinker->shrinker)(SHRINK_BATCH, gfp_mask);
+			if (nr_after == -1)
 				break;
-			if (shrink_ret < nr_before) {
-				ret += nr_before - shrink_ret;
-				shrinker_stat_add(shrinker, nr_freed,
-					(nr_before - shrink_ret));
+
+			if (nr_after < nr_before) {
+				int nr_freed = nr_before - nr_after;
+
+				ret += nr_freed;
+				shrinker_stat_add(shrinker, nr_freed, nr_freed);
 			}
-			shrinker_stat_add(shrinker, nr_req, this_scan);
-			mod_page_state(slabs_scanned, this_scan);
-			total_scan -= this_scan;
+			shrinker_stat_add(shrinker, nr_req, SHRINK_BATCH);
+			mod_page_state(slabs_scanned, SHRINK_BATCH);
+			update_slab_age(shrinker, nr_before * DEF_PRIORITY * 2,
+					SHRINK_BATCH * shrinker->seeks *
+					(DEF_PRIORITY + priority));
 
 			cond_resched();
 		}
-
-		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
 	return ret;
@@ -492,11 +480,6 @@ static int shrink_list(struct list_head 
 
 		BUG_ON(PageActive(page));
 
-		sc->nr_scanned++;
-		/* Double the slab pressure for mapped and swapcache pages */
-		if (page_mapped(page) || PageSwapCache(page))
-			sc->nr_scanned++;
-
 		if (PageWriteback(page))
 			goto keep_locked;
 
@@ -941,6 +924,7 @@ static void shrink_cache(struct zone *zo
 			goto done;
 
 		max_scan -= nr_scan;
+		sc->nr_scanned += nr_scan;
 		if (current_is_kswapd())
 			mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
 		else
@@ -1259,7 +1243,6 @@ int try_to_free_pages(struct zone **zone
 	int total_scanned = 0, total_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
-	unsigned long lru_pages = 0;
 	int i;
 
 	delay_prefetch();
@@ -1277,7 +1260,6 @@ int try_to_free_pages(struct zone **zone
 			continue;
 
 		zone->temp_priority = DEF_PRIORITY;
-		lru_pages += zone->nr_active + zone->nr_inactive;
 	}
 
 	/* The added 10 priorities are for scan rate balancing */
@@ -1290,7 +1272,8 @@ int try_to_free_pages(struct zone **zone
 		if (!priority)
 			disable_swap_token();
 		shrink_caches(zones, &sc);
-		shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
+		if (zone_idx(zones[0]))
+			shrink_slab(zones[0], priority, gfp_mask);
 		if (reclaim_state) {
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -1386,8 +1369,6 @@ loop_again:
 	}
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		unsigned long lru_pages = 0;
-
 		all_zones_ok = 1;
 		sc.nr_scanned = 0;
 		sc.nr_reclaimed = 0;
@@ -1447,7 +1428,6 @@ scan_swspd:
 			zone->temp_priority = priority;
 			if (zone->prev_priority > priority)
 				zone->prev_priority = priority;
-			lru_pages += zone->nr_active + zone->nr_inactive;
 
 			shrink_zone(zone, &sc);
 
@@ -1456,7 +1436,7 @@ scan_swspd:
 				zone->all_unreclaimable = 1;
 		}
 		reclaim_state->reclaimed_slab = 0;
-		shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
+		shrink_slab(prev_zone, priority, GFP_KERNEL);
 		sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 		total_reclaimed += sc.nr_reclaimed;
 		total_scanned += sc.nr_scanned;
--- linux.orig/fs/drop-pagecache.c
+++ linux/fs/drop-pagecache.c
@@ -47,7 +47,7 @@ static void drop_slab(void)
 	int nr_objects;
 
 	do {
-		nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+		nr_objects = shrink_slab(NULL, 0, GFP_KERNEL);
 	} while (nr_objects > 10);
 }
 

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 07/16] mm: balance active/inactive list scan rates
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (5 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 06/16] mm: balance slab aging Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 08/16] mm: fine grained scan priority Wu Fengguang
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli

[-- Attachment #1: mm-balance-active-inactive-list-aging.patch --]
[-- Type: text/plain, Size: 9397 bytes --]

shrink_zone() has two major design goals:
1) let active/inactive lists have equal scan rates
2) do the scans in small chunks

But the implementation has some problems:
- it is reluctant to scan small zones
  the callers often have to dip into low priorities to free memory.

- the balance is quite rough
  the break statement in the loop upsets it.

- it may scan only a few pages in one batch
  refill_inactive_zone() can be called twice to scan 32 pages and then 1 page.

The new design:
1) keep perfect balance
   let active_list follow inactive_list in scan rate

2) always scan in SWAP_CLUSTER_MAX sized chunks
   simple and efficient

3) will scan at least one chunk
   the expected behavior from the callers

The perfect balance may or may not yield better performance, though it
a) is a more understandable and dependable behavior
b) together with inter-zone balancing, makes behavior consistent across the
   memory zones
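
Concretely, the coupling works by crediting the active list with
nr_active/nr_inactive of a scan (kept in 1/1024 units in nr_scan_active) for
every inactive page scanned.  For example, in a zone with 40,000 active and
160,000 inactive pages, each 32-page batch taken off the inactive list adds
about 8 pages of active-list scan credit, so both lists cycle through in
roughly the same time.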

The atomic reclaim_in_progress is there to prevent most concurrent reclaims.
If concurrent reclaims do happen, there will be no fatal errors.


I tested the patch with the following commands:
	dd if=/dev/zero of=hot bs=1M seek=800 count=1
	dd if=/dev/zero of=cold bs=1M seek=50000 count=1
	./test-aging.sh; ./active-inactive-aging-rate.sh

Before the patch:
-----------------------------------------------------------------------------
active/inactive sizes on 2.6.14-2-686-smp:
0/1000          = 0 / 1241
563/1000        = 73343 / 130108
887/1000        = 137348 / 154816

active/inactive scan rates:
dma      38/1000        = 7731 / (198924 + 0)
normal   465/1000       = 2979780 / (6394740 + 0)
high     680/1000       = 4354230 / (6396786 + 0)

             total       used       free     shared    buffers     cached
Mem:          2027       1978         49          0          4       1923
-/+ buffers/cache:         49       1977
Swap:            0          0          0
-----------------------------------------------------------------------------

After the patch, the scan rates and the size ratios are kept roughly the same
for all zones:
-----------------------------------------------------------------------------
active/inactive sizes on 2.6.15-rc3-mm1:
0/1000          = 0 / 961
236/1000        = 38385 / 162429
319/1000        = 70607 / 221101

active/inactive scan rates:
dma      0/1000         = 0 / (42176 + 0)
normal   234/1000       = 1714688 / (7303456 + 1088)
high     317/1000       = 3151936 / (9933792 + 96)
             
             total       used       free     shared    buffers     cached
Mem:          2020       1969         50          0          5       1908
-/+ buffers/cache:         54       1965
Swap:            0          0          0
-----------------------------------------------------------------------------

script test-aging.sh:
------------------------------
#!/bin/zsh
cp cold /dev/null&

while {pidof cp > /dev/null};
do
        cp hot /dev/null
done
------------------------------

script active-inactive-aging-rate.sh:
-----------------------------------------------------------------------------
#!/bin/sh

echo active/inactive sizes on `uname -r`:
egrep '(active|inactive)' /proc/zoneinfo |
while true
do
	read name value
	[[ -z $name ]] && break
	eval $name=$value
	[[ $name = "inactive" ]] && echo -e "$((active * 1000 / (1 + inactive)))/1000  \t= $active / $inactive"
done

while true
do
	read name value
	[[ -z $name ]] && break
	eval $name=$value
done < /proc/vmstat

echo
echo active/inactive scan rates:
echo -e "dma \t $((pgrefill_dma * 1000 / (1 + pgscan_kswapd_dma + pgscan_direct_dma)))/1000 \t= $pgrefill_dma / ($pgscan_kswapd_dma + $pgscan_direct_dma)"
echo -e "normal \t $((pgrefill_normal * 1000 / (1 + pgscan_kswapd_normal + pgscan_direct_normal)))/1000 \t= $pgrefill_normal / ($pgscan_kswapd_normal + $pgscan_direct_normal)"
echo -e "high \t $((pgrefill_high * 1000 / (1 + pgscan_kswapd_high + pgscan_direct_high)))/1000 \t= $pgrefill_high / ($pgscan_kswapd_high + $pgscan_direct_high)"

echo
free -m
-----------------------------------------------------------------------------

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mmzone.h |    3 --
 include/linux/swap.h   |    2 -
 mm/page_alloc.c        |    5 +---
 mm/vmscan.c            |   52 +++++++++++++++++++++++++++----------------------
 4 files changed, 33 insertions(+), 29 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -912,7 +912,7 @@ static void shrink_cache(struct zone *zo
 		int nr_scan;
 		int nr_freed;
 
-		nr_taken = isolate_lru_pages(sc->swap_cluster_max,
+		nr_taken = isolate_lru_pages(sc->nr_to_scan,
 					     &zone->inactive_list,
 					     &page_list, &nr_scan);
 		zone->nr_inactive -= nr_taken;
@@ -1106,56 +1106,56 @@ refill_inactive_zone(struct zone *zone, 
 
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
+ * The reclaim process:
+ * a) scan always in batch of SWAP_CLUSTER_MAX pages
+ * b) scan inactive list at least one batch
+ * c) balance the scan rate of active/inactive list
+ * d) finish on either scanned or reclaimed enough pages
  */
 static void
 shrink_zone(struct zone *zone, struct scan_control *sc)
 {
+	unsigned long long next_scan_active;
 	unsigned long nr_active;
 	unsigned long nr_inactive;
 
 	atomic_inc(&zone->reclaim_in_progress);
 
+	next_scan_active = sc->nr_scanned;
+
 	/*
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
 	 * slowly sift through the active list.
 	 */
-	zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
-	nr_active = zone->nr_scan_active;
-	if (nr_active >= sc->swap_cluster_max)
-		zone->nr_scan_active = 0;
-	else
-		nr_active = 0;
-
-	zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
-		zone->nr_scan_inactive = 0;
-	else
-		nr_inactive = 0;
+	nr_active = zone->nr_scan_active + 1;
+	nr_inactive = (zone->nr_inactive >> sc->priority) + SWAP_CLUSTER_MAX;
+	nr_inactive &= ~(SWAP_CLUSTER_MAX - 1);
 
+	sc->nr_to_scan = SWAP_CLUSTER_MAX;
 	sc->nr_to_reclaim = sc->swap_cluster_max;
 
-	while (nr_active || nr_inactive) {
-		if (nr_active) {
-			sc->nr_to_scan = min(nr_active,
-					(unsigned long)sc->swap_cluster_max);
-			nr_active -= sc->nr_to_scan;
+	while (nr_active >= SWAP_CLUSTER_MAX * 1024 || nr_inactive) {
+		if (nr_active >= SWAP_CLUSTER_MAX * 1024) {
+			nr_active -= SWAP_CLUSTER_MAX * 1024;
 			refill_inactive_zone(zone, sc);
 		}
 
 		if (nr_inactive) {
-			sc->nr_to_scan = min(nr_inactive,
-					(unsigned long)sc->swap_cluster_max);
-			nr_inactive -= sc->nr_to_scan;
+			nr_inactive -= SWAP_CLUSTER_MAX;
 			shrink_cache(zone, sc);
 			if (sc->nr_to_reclaim <= 0)
 				break;
 		}
 	}
 
-	throttle_vm_writeout();
+	next_scan_active = (sc->nr_scanned - next_scan_active) * 1024ULL *
+					(unsigned long long)zone->nr_active;
+	do_div(next_scan_active, zone->nr_inactive | 1);
+	zone->nr_scan_active = nr_active + (unsigned long)next_scan_active;
 
 	atomic_dec(&zone->reclaim_in_progress);
+
+	throttle_vm_writeout();
 }
 
 /*
@@ -1196,6 +1196,9 @@ shrink_caches(struct zone **zones, struc
 		if (zone->all_unreclaimable && sc->priority < DEF_PRIORITY)
 			continue;	/* Let kswapd poll it */
 
+		if (atomic_read(&zone->reclaim_in_progress))
+			continue;
+
 		/*
 		 * Balance page aging in local zones and following headless
 		 * zones.
@@ -1425,6 +1428,9 @@ scan_swspd:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
+			if (atomic_read(&zone->reclaim_in_progress))
+				continue;
+
 			zone->temp_priority = priority;
 			if (zone->prev_priority > priority)
 				zone->prev_priority = priority;
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -2145,7 +2145,6 @@ static void __init free_area_init_core(s
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
 		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
 		zone->nr_active = 0;
 		zone->nr_inactive = 0;
 		zone->aging_total = 0;
@@ -2301,7 +2300,7 @@ static int zoneinfo_show(struct seq_file
 			   "\n        inactive %lu"
 			   "\n        aging    %lu"
 			   "\n        age      %lu"
-			   "\n        scanned  %lu (a: %lu i: %lu)"
+			   "\n        scanned  %lu (a: %lu)"
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
 			   zone->free_pages,
@@ -2313,7 +2312,7 @@ static int zoneinfo_show(struct seq_file
 			   zone->aging_total,
 			   zone->page_age,
 			   zone->pages_scanned,
-			   zone->nr_scan_active, zone->nr_scan_inactive,
+			   zone->nr_scan_active / 1024,
 			   zone->spanned_pages,
 			   zone->present_pages);
 		seq_printf(m,
--- linux.orig/include/linux/swap.h
+++ linux/include/linux/swap.h
@@ -111,7 +111,7 @@ enum {
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
 
-#define SWAP_CLUSTER_MAX 32
+#define SWAP_CLUSTER_MAX 32		/* must be power of 2 */
 
 #define SWAP_MAP_MAX	0x7fff
 #define SWAP_MAP_BAD	0x8000
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -142,8 +142,7 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct list_head	active_list;
 	struct list_head	inactive_list;
-	unsigned long		nr_scan_active;
-	unsigned long		nr_scan_inactive;
+	unsigned long		nr_scan_active;	/* x1024 to be more precise */
 	unsigned long		nr_active;
 	unsigned long		nr_inactive;
 	unsigned long		pages_scanned;	   /* since last reclaim */

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 08/16] mm: fine grained scan priority
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (6 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 07/16] mm: balance active/inactive list scan rates Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 09/16] mm: remove unnecessary variable and loop Wu Fengguang
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Nick Piggin, Wu Fengguang

[-- Attachment #1: mm-fine-grained-scan-priority.patch --]
[-- Type: text/plain, Size: 3311 bytes --]

Limit max scan fraction to 1/64. The scan fractions will be 
	1/4096, 64x1/2048, 64x1/1024, ..., 64x1/64

The old ones are
	1/4096, 1/2048, 1/1024, ..., 1/1
the largest of which scans far too much and can create major imbalance in the
aging rates.
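
Following the arithmetic in the hunk below, on a zone with 1,000,000 inactive
pages a single shrink_zone() call at priority == DEF_PRIORITY now scans just
one SWAP_CLUSTER_MAX chunk (1,000,000/64 >> 12 rounds to a 32-page batch),
and even at priority 0 it scans about 15,600 pages, roughly 1/64 of the list,
rather than potentially the whole list as before.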

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mmzone.h |    9 ++++++---
 mm/vmscan.c            |   18 ++++++++++++------
 2 files changed, 18 insertions(+), 9 deletions(-)

--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -251,10 +251,13 @@ struct zone {
 
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
- * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
- * queues ("queue_length >> 12") during an aging round.
+ * go. A value of 12 for DEF_PRIORITY/PRIORITY_STEPS implies that we will
+ * scan 1/4096th of the queues ("queue_length >> 12") during an aging round.
+ * Typically we will first try to scan 1/4096, then 64 times 1/2048, then 64
+ * times 1/1024, ..., at last 64 times 1/64.
  */
-#define DEF_PRIORITY 12
+#define PRIORITY_STEPS	64
+#define DEF_PRIORITY	(12*PRIORITY_STEPS)
 
 /*
  * One allocation request operates on a zonelist. A zonelist
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1006,7 +1006,7 @@ refill_inactive_zone(struct zone *zone, 
 	 * `distress' is a measure of how much trouble we're having reclaiming
 	 * pages.  0 -> no problems.  100 -> great trouble.
 	 */
-	distress = 100 >> zone->prev_priority;
+	distress = 100 >> (zone->prev_priority / PRIORITY_STEPS);
 
 	/*
 	 * The point of this algorithm is to decide when to start reclaiming
@@ -1128,7 +1128,8 @@ shrink_zone(struct zone *zone, struct sc
 	 * slowly sift through the active list.
 	 */
 	nr_active = zone->nr_scan_active + 1;
-	nr_inactive = (zone->nr_inactive >> sc->priority) + SWAP_CLUSTER_MAX;
+	nr_inactive = ((zone->nr_inactive / PRIORITY_STEPS) >>
+			(sc->priority / PRIORITY_STEPS)) + SWAP_CLUSTER_MAX;
 	nr_inactive &= ~(SWAP_CLUSTER_MAX - 1);
 
 	sc->nr_to_scan = SWAP_CLUSTER_MAX;
@@ -1265,8 +1266,13 @@ int try_to_free_pages(struct zone **zone
 		zone->temp_priority = DEF_PRIORITY;
 	}
 
-	/* The added 10 priorities are for scan rate balancing */
-	for (priority = DEF_PRIORITY + 10; priority >= 0; priority--) {
+	/*
+	 * The first PRIORITY_STEPS priorities are for scan rate balancing.
+	 * One run of shrink_zone() can create at most 1/64 imbalance, here
+	 * we first scan about 64 times 1/4096 for aging, just enough to
+	 * rebalance it, before creating new imbalance.
+	 */
+	for (priority = DEF_PRIORITY + PRIORITY_STEPS; priority >= 0; priority--) {
 		sc.nr_mapped = read_page_state(nr_mapped);
 		sc.nr_scanned = 0;
 		sc.nr_reclaimed = 0;
@@ -1301,7 +1307,7 @@ int try_to_free_pages(struct zone **zone
 		}
 
 		/* Take a nap, wait for some writeback to complete */
-		if (sc.nr_scanned && priority < DEF_PRIORITY)
+		if (sc.nr_scanned && priority < DEF_PRIORITY - PRIORITY_STEPS)
 			blk_congestion_wait(WRITE, HZ/10);
 	}
 out:
@@ -1464,7 +1470,7 @@ scan_swspd:
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
 		 * another pass across the zones.
 		 */
-		if (total_scanned && priority < DEF_PRIORITY - 2)
+		if (total_scanned && priority < DEF_PRIORITY - PRIORITY_STEPS)
 			blk_congestion_wait(WRITE, HZ/10);
 
 		/*

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 09/16] mm: remove unnecessary variable and loop
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (7 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 08/16] mm: fine grained scan priority Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2006-01-05 19:21   ` Marcelo Tosatti
  2005-12-07 10:48 ` [PATCH 10/16] mm: remove swap_cluster_max from scan_control Wu Fengguang
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-remove-unnecessary-variable-and-loop.patch --]
[-- Type: text/plain, Size: 3826 bytes --]

shrink_cache() and refill_inactive_zone() do not need loops.

Simplify them to scan one chunk at a time.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |   92 ++++++++++++++++++++++++++++--------------------------------
 1 files changed, 43 insertions(+), 49 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -900,63 +900,58 @@ static void shrink_cache(struct zone *zo
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
-	int max_scan = sc->nr_to_scan;
+	struct page *page;
+	int nr_taken;
+	int nr_scan;
+	int nr_freed;
 
 	pagevec_init(&pvec, 1);
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	while (max_scan > 0) {
-		struct page *page;
-		int nr_taken;
-		int nr_scan;
-		int nr_freed;
-
-		nr_taken = isolate_lru_pages(sc->nr_to_scan,
-					     &zone->inactive_list,
-					     &page_list, &nr_scan);
-		zone->nr_inactive -= nr_taken;
-		zone->pages_scanned += nr_scan;
-		update_zone_age(zone, nr_scan);
-		spin_unlock_irq(&zone->lru_lock);
+	nr_taken = isolate_lru_pages(sc->nr_to_scan,
+				     &zone->inactive_list,
+				     &page_list, &nr_scan);
+	zone->nr_inactive -= nr_taken;
+	zone->pages_scanned += nr_scan;
+	update_zone_age(zone, nr_scan);
+	spin_unlock_irq(&zone->lru_lock);
 
-		if (nr_taken == 0)
-			goto done;
+	if (nr_taken == 0)
+		return;
 
-		max_scan -= nr_scan;
-		sc->nr_scanned += nr_scan;
-		if (current_is_kswapd())
-			mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
-		else
-			mod_page_state_zone(zone, pgscan_direct, nr_scan);
-		nr_freed = shrink_list(&page_list, sc);
-		if (current_is_kswapd())
-			mod_page_state(kswapd_steal, nr_freed);
-		mod_page_state_zone(zone, pgsteal, nr_freed);
-		sc->nr_to_reclaim -= nr_freed;
+	sc->nr_scanned += nr_scan;
+	if (current_is_kswapd())
+		mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
+	else
+		mod_page_state_zone(zone, pgscan_direct, nr_scan);
+	nr_freed = shrink_list(&page_list, sc);
+	if (current_is_kswapd())
+		mod_page_state(kswapd_steal, nr_freed);
+	mod_page_state_zone(zone, pgsteal, nr_freed);
+	sc->nr_to_reclaim -= nr_freed;
 
-		spin_lock_irq(&zone->lru_lock);
-		/*
-		 * Put back any unfreeable pages.
-		 */
-		while (!list_empty(&page_list)) {
-			page = lru_to_page(&page_list);
-			if (TestSetPageLRU(page))
-				BUG();
-			list_del(&page->lru);
-			if (PageActive(page))
-				add_page_to_active_list(zone, page);
-			else
-				add_page_to_inactive_list(zone, page);
-			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
-				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
-			}
+	spin_lock_irq(&zone->lru_lock);
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	while (!list_empty(&page_list)) {
+		page = lru_to_page(&page_list);
+		if (TestSetPageLRU(page))
+			BUG();
+		list_del(&page->lru);
+		if (PageActive(page))
+			add_page_to_active_list(zone, page);
+		else
+			add_page_to_inactive_list(zone, page);
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
 		}
-  	}
+	}
 	spin_unlock_irq(&zone->lru_lock);
-done:
+
 	pagevec_release(&pvec);
 }
 
@@ -983,7 +978,6 @@ refill_inactive_zone(struct zone *zone, 
 	int pgmoved;
 	int pgdeactivate = 0;
 	int pgscanned;
-	int nr_pages = sc->nr_to_scan;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
 	LIST_HEAD(l_inactive);	/* Pages to go onto the inactive_list */
 	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
@@ -996,7 +990,7 @@ refill_inactive_zone(struct zone *zone, 
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
+	pgmoved = isolate_lru_pages(sc->nr_to_scan, &zone->active_list,
 				    &l_hold, &pgscanned);
 	zone->pages_scanned += pgscanned;
 	zone->nr_active -= pgmoved;

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 10/16] mm: remove swap_cluster_max from scan_control
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (8 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 09/16] mm: remove unnecessary variable and loop Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 11/16] mm: let sc.nr_scanned/sc.nr_reclaimed accumulate Wu Fengguang
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-remove-swap-cluster-max-from-scan-control.patch --]
[-- Type: text/plain, Size: 2354 bytes --]

The use of sc.swap_cluster_max is weird and redundant.

The callers should just set sc.priority/sc.nr_to_reclaim, and let
shrink_zone() decide the proper loop parameters.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |   15 ++++-----------
 1 files changed, 4 insertions(+), 11 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -76,12 +76,6 @@ struct scan_control {
 
 	/* Can pages be swapped as part of reclaim? */
 	int may_swap;
-
-	/* This context's SWAP_CLUSTER_MAX. If freeing memory for
-	 * suspend, we effectively ignore SWAP_CLUSTER_MAX.
-	 * In this context, it doesn't matter that we scan the
-	 * whole list at once. */
-	int swap_cluster_max;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -1127,7 +1121,6 @@ shrink_zone(struct zone *zone, struct sc
 	nr_inactive &= ~(SWAP_CLUSTER_MAX - 1);
 
 	sc->nr_to_scan = SWAP_CLUSTER_MAX;
-	sc->nr_to_reclaim = sc->swap_cluster_max;
 
 	while (nr_active >= SWAP_CLUSTER_MAX * 1024 || nr_inactive) {
 		if (nr_active >= SWAP_CLUSTER_MAX * 1024) {
@@ -1271,7 +1264,7 @@ int try_to_free_pages(struct zone **zone
 		sc.nr_scanned = 0;
 		sc.nr_reclaimed = 0;
 		sc.priority = priority;
-		sc.swap_cluster_max = SWAP_CLUSTER_MAX;
+		sc.nr_to_reclaim = SWAP_CLUSTER_MAX;
 		if (!priority)
 			disable_swap_token();
 		shrink_caches(zones, &sc);
@@ -1283,7 +1276,7 @@ int try_to_free_pages(struct zone **zone
 		}
 		total_scanned += sc.nr_scanned;
 		total_reclaimed += sc.nr_reclaimed;
-		if (total_reclaimed >= sc.swap_cluster_max) {
+		if (total_reclaimed >= SWAP_CLUSTER_MAX) {
 			ret = 1;
 			goto out;
 		}
@@ -1295,7 +1288,7 @@ int try_to_free_pages(struct zone **zone
 		 * that's undesirable in laptop mode, where we *want* lumpy
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
-		if (total_scanned > sc.swap_cluster_max + sc.swap_cluster_max/2) {
+		if (total_scanned > SWAP_CLUSTER_MAX * 3 / 2) {
 			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
 			sc.may_writepage = 1;
 		}
@@ -1376,7 +1369,7 @@ loop_again:
 		sc.nr_scanned = 0;
 		sc.nr_reclaimed = 0;
 		sc.priority = priority;
-		sc.swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;
+		sc.nr_to_reclaim = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;
 
 		/* The swap token gets in the way of swapout... */
 		if (!priority)

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 11/16] mm: let sc.nr_scanned/sc.nr_reclaimed accumulate
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (9 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 10/16] mm: remove swap_cluster_max from scan_control Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags Wu Fengguang
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-accumulate-nr-scanned-reclaimed-in-scan-control.patch --]
[-- Type: text/plain, Size: 4599 bytes --]

Now that there's no need to keep track of nr_scanned/nr_reclaimed for every
single round of shrink_zone(), remove the total_scanned/total_reclaimed and
let nr_scanned/nr_reclaimed accumulate between shrink_zone() calls.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |   36 ++++++++++++++----------------------
 1 files changed, 14 insertions(+), 22 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1231,7 +1231,6 @@ int try_to_free_pages(struct zone **zone
 {
 	int priority;
 	int ret = 0;
-	int total_scanned = 0, total_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
 	int i;
@@ -1241,6 +1240,8 @@ int try_to_free_pages(struct zone **zone
 	sc.gfp_mask = gfp_mask;
 	sc.may_writepage = 0;
 	sc.may_swap = 1;
+	sc.nr_scanned = 0;
+	sc.nr_reclaimed = 0;
 
 	inc_page_state(allocstall);
 
@@ -1261,8 +1262,6 @@ int try_to_free_pages(struct zone **zone
 	 */
 	for (priority = DEF_PRIORITY + PRIORITY_STEPS; priority >= 0; priority--) {
 		sc.nr_mapped = read_page_state(nr_mapped);
-		sc.nr_scanned = 0;
-		sc.nr_reclaimed = 0;
 		sc.priority = priority;
 		sc.nr_to_reclaim = SWAP_CLUSTER_MAX;
 		if (!priority)
@@ -1274,9 +1273,7 @@ int try_to_free_pages(struct zone **zone
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
 		}
-		total_scanned += sc.nr_scanned;
-		total_reclaimed += sc.nr_reclaimed;
-		if (total_reclaimed >= SWAP_CLUSTER_MAX) {
+		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) {
 			ret = 1;
 			goto out;
 		}
@@ -1288,13 +1285,13 @@ int try_to_free_pages(struct zone **zone
 		 * that's undesirable in laptop mode, where we *want* lumpy
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
-		if (total_scanned > SWAP_CLUSTER_MAX * 3 / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+		if (sc.nr_scanned > SWAP_CLUSTER_MAX * 3 / 2) {
+			wakeup_pdflush(laptop_mode ? 0 : sc.nr_scanned);
 			sc.may_writepage = 1;
 		}
 
 		/* Take a nap, wait for some writeback to complete */
-		if (sc.nr_scanned && priority < DEF_PRIORITY - PRIORITY_STEPS)
+		if (priority < DEF_PRIORITY - PRIORITY_STEPS)
 			blk_congestion_wait(WRITE, HZ/10);
 	}
 out:
@@ -1340,18 +1337,17 @@ static int balance_pgdat(pg_data_t *pgda
 	int all_zones_ok;
 	int priority;
 	int i;
-	int total_scanned, total_reclaimed;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc;
 	struct zone *prev_zone = pgdat->node_zones;
 
 loop_again:
-	total_scanned = 0;
-	total_reclaimed = 0;
 	sc.gfp_mask = GFP_KERNEL;
 	sc.may_writepage = 0;
 	sc.may_swap = 1;
 	sc.nr_mapped = read_page_state(nr_mapped);
+	sc.nr_scanned = 0;
+	sc.nr_reclaimed = 0;
 
 	inc_page_state(pageoutrun);
 
@@ -1366,8 +1362,6 @@ loop_again:
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		all_zones_ok = 1;
-		sc.nr_scanned = 0;
-		sc.nr_reclaimed = 0;
 		sc.priority = priority;
 		sc.nr_to_reclaim = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;
 
@@ -1437,19 +1431,17 @@ scan_swspd:
 		reclaim_state->reclaimed_slab = 0;
 		shrink_slab(prev_zone, priority, GFP_KERNEL);
 		sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-		total_reclaimed += sc.nr_reclaimed;
-		total_scanned += sc.nr_scanned;
 
 		/*
 		 * If we've done a decent amount of scanning and
 		 * the reclaim ratio is low, start doing writepage
 		 * even in laptop mode
 		 */
-		if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
-		    total_scanned > total_reclaimed+total_reclaimed/2)
+		if (sc.nr_scanned > SWAP_CLUSTER_MAX * 2 &&
+		    sc.nr_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 			sc.may_writepage = 1;
 
-		if (nr_pages && to_free > total_reclaimed)
+		if (nr_pages && to_free > sc.nr_reclaimed)
 			continue;	/* swsusp: need to do more work */
 		if (all_zones_ok)
 			break;		/* kswapd: all done */
@@ -1457,7 +1449,7 @@ scan_swspd:
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
 		 * another pass across the zones.
 		 */
-		if (total_scanned && priority < DEF_PRIORITY - PRIORITY_STEPS)
+		if (priority < DEF_PRIORITY - PRIORITY_STEPS)
 			blk_congestion_wait(WRITE, HZ/10);
 
 		/*
@@ -1466,7 +1458,7 @@ scan_swspd:
 		 * matches the direct reclaim path behaviour in terms of impact
 		 * on zone->*_priority.
 		 */
-		if ((total_reclaimed >= SWAP_CLUSTER_MAX) && (!nr_pages))
+		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX && !nr_pages)
 			break;
 	}
 	for (i = 0; i < pgdat->nr_zones; i++) {
@@ -1479,7 +1471,7 @@ scan_swspd:
 		goto loop_again;
 	}
 
-	return total_reclaimed;
+	return sc.nr_reclaimed;
 }
 
 /*

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (10 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 11/16] mm: let sc.nr_scanned/sc.nr_reclaimed accumulate Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 10:36   ` Nick Piggin
  2005-12-07 11:15   ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 13/16] mm: fix minor scan count bugs Wu Fengguang
                   ` (3 subsequent siblings)
  15 siblings, 2 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-fold-bool-variables-into-flags-in-scan-control.patch --]
[-- Type: text/plain, Size: 2406 bytes --]

Fold bool values into flags to make struct scan_control more compact.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |   22 ++++++++++------------
 1 files changed, 10 insertions(+), 12 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -72,12 +72,12 @@ struct scan_control {
 	/* This context's GFP mask */
 	gfp_t gfp_mask;
 
-	int may_writepage;
-
-	/* Can pages be swapped as part of reclaim? */
-	int may_swap;
+	unsigned long flags;
 };
 
+#define SC_MAY_WRITEPAGE	0x1
+#define SC_MAY_SWAP		0x2	/* Can pages be swapped as part of reclaim? */
+
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -488,7 +488,7 @@ static int shrink_list(struct list_head 
 		 * Try to allocate it some swap space here.
 		 */
 		if (PageAnon(page) && !PageSwapCache(page)) {
-			if (!sc->may_swap)
+			if (!(sc->flags & SC_MAY_SWAP))
 				goto keep_locked;
 			if (!add_to_swap(page, GFP_ATOMIC))
 				goto activate_locked;
@@ -519,7 +519,7 @@ static int shrink_list(struct list_head 
 				goto keep_locked;
 			if (!may_enter_fs)
 				goto keep_locked;
-			if (laptop_mode && !sc->may_writepage)
+			if (laptop_mode && !(sc->flags & SC_MAY_WRITEPAGE))
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
@@ -1238,8 +1238,7 @@ int try_to_free_pages(struct zone **zone
 	delay_prefetch();
 
 	sc.gfp_mask = gfp_mask;
-	sc.may_writepage = 0;
-	sc.may_swap = 1;
+	sc.flags = SC_MAY_SWAP;
 	sc.nr_scanned = 0;
 	sc.nr_reclaimed = 0;
 
@@ -1287,7 +1286,7 @@ int try_to_free_pages(struct zone **zone
 		 */
 		if (sc.nr_scanned > SWAP_CLUSTER_MAX * 3 / 2) {
 			wakeup_pdflush(laptop_mode ? 0 : sc.nr_scanned);
-			sc.may_writepage = 1;
+			sc.flags |= SC_MAY_WRITEPAGE;
 		}
 
 		/* Take a nap, wait for some writeback to complete */
@@ -1343,8 +1342,7 @@ static int balance_pgdat(pg_data_t *pgda
 
 loop_again:
 	sc.gfp_mask = GFP_KERNEL;
-	sc.may_writepage = 0;
-	sc.may_swap = 1;
+	sc.flags = SC_MAY_SWAP;
 	sc.nr_mapped = read_page_state(nr_mapped);
 	sc.nr_scanned = 0;
 	sc.nr_reclaimed = 0;
@@ -1439,7 +1437,7 @@ scan_swspd:
 		 */
 		if (sc.nr_scanned > SWAP_CLUSTER_MAX * 2 &&
 		    sc.nr_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
-			sc.may_writepage = 1;
+			sc.flags |= SC_MAY_WRITEPAGE;
 
 		if (nr_pages && to_free > sc.nr_reclaimed)
 			continue;	/* swsusp: need to do more work */

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 13/16] mm: fix minor scan count bugs
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (11 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 10:32   ` Nick Piggin
  2005-12-07 11:02   ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 14/16] mm: zone aging rounds accounting Wu Fengguang
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli,
	Wu Fengguang

[-- Attachment #1: mm-scan-accounting-fix.patch --]
[-- Type: text/plain, Size: 1100 bytes --]

- in isolate_lru_pages(): reports one more scan. Fix it.
- in shrink_cache(): 0 pages taken does not mean 0 pages scanned. Fix it.
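
For illustration only, here is a minimal user-space sketch (not kernel code;
the counts are made up) of why the post-increment in the old loop condition
over-reports by one: the final, failing test still bumps the counter.

	#include <stdio.h>

	int main(void)
	{
		int nr_to_scan = 32;
		int scan, examined;

		/* old style: the final, failing test still bumps scan */
		scan = examined = 0;
		while (scan++ < nr_to_scan)
			examined++;
		printf("old: reported %d, examined %d\n", scan, examined);

		/* fixed style: count only iterations that actually ran */
		scan = examined = 0;
		while (scan < nr_to_scan) {
			scan++;
			examined++;
		}
		printf("new: reported %d, examined %d\n", scan, examined);
		return 0;
	}

This prints "old: reported 33, examined 32" versus "new: reported 32,
examined 32", which is the one-extra-scan that the first hunk below removes.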

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -864,7 +864,8 @@ static int isolate_lru_pages(int nr_to_s
 	struct page *page;
 	int scan = 0;
 
-	while (scan++ < nr_to_scan && !list_empty(src)) {
+	while (scan < nr_to_scan && !list_empty(src)) {
+		scan++;
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
 
@@ -911,14 +912,15 @@ static void shrink_cache(struct zone *zo
 	update_zone_age(zone, nr_scan);
 	spin_unlock_irq(&zone->lru_lock);
 
-	if (nr_taken == 0)
-		return;
-
 	sc->nr_scanned += nr_scan;
 	if (current_is_kswapd())
 		mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
 	else
 		mod_page_state_zone(zone, pgscan_direct, nr_scan);
+
+	if (nr_taken == 0)
+		return;
+
 	nr_freed = shrink_list(&page_list, sc);
 	if (current_is_kswapd())
 		mod_page_state(kswapd_steal, nr_freed);

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 14/16] mm: zone aging rounds accounting
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (12 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 13/16] mm: fix minor scan count bugs Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 15/16] mm: add page reclaim debug traces Wu Fengguang
  2005-12-07 10:48 ` [PATCH 16/16] mm: kswapd reclaim debug trace Wu Fengguang
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Wu Fengguang

[-- Attachment #1: mm-account-zone-aging-rounds.patch --]
[-- Type: text/plain, Size: 2270 bytes --]

Add zone->aging_rounds to help evaluate the balancing work.
It records how many times the zone's inactive list has been fully scanned.
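
As a rough illustration with made-up numbers: with a stable inactive list of
100000 pages, aging_rounds reaches 1 after roughly the first 100000 pages of
that zone have been scanned, 2 after 200000, and so on; that is, aging_rounds
is approximately aging_total / nr_inactive as long as the list size does not
change much (see the update_zone_age() hunk below).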

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mmzone.h |    1 +
 mm/page_alloc.c        |    5 +++++
 mm/vmscan.c            |    4 +++-
 3 files changed, 9 insertions(+), 1 deletion(-)

--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -160,6 +160,7 @@ struct zone {
 	 */
 	unsigned long		aging_total;
 	unsigned long		aging_milestone;
+	unsigned long		aging_rounds;
 	unsigned long		page_age;
 
 	/*
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1523,6 +1523,7 @@ void show_free_areas(void)
 			" inactive:%lukB"
 			" present:%lukB"
 			" aging:%lukB"
+			" aging_rounds:%lukB"
 			" age:%lu"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1536,6 +1537,7 @@ void show_free_areas(void)
 			K(zone->nr_inactive),
 			K(zone->present_pages),
 			K(zone->aging_total),
+			zone->aging_rounds,
 			zone->page_age,
 			zone->pages_scanned,
 			(zone->all_unreclaimable ? "yes" : "no")
@@ -2149,6 +2151,7 @@ static void __init free_area_init_core(s
 		zone->nr_inactive = 0;
 		zone->aging_total = 0;
 		zone->aging_milestone = 0;
+		zone->aging_rounds = 0;
 		zone->page_age = 0;
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)
@@ -2299,6 +2302,7 @@ static int zoneinfo_show(struct seq_file
 			   "\n        active   %lu"
 			   "\n        inactive %lu"
 			   "\n        aging    %lu"
+			   "\n        rounds   %lu"
 			   "\n        age      %lu"
 			   "\n        scanned  %lu (a: %lu)"
 			   "\n        spanned  %lu"
@@ -2310,6 +2314,7 @@ static int zoneinfo_show(struct seq_file
 			   zone->nr_active,
 			   zone->nr_inactive,
 			   zone->aging_total,
+			   zone->aging_rounds,
 			   zone->page_age,
 			   zone->pages_scanned,
 			   zone->nr_scan_active / 1024,
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -157,8 +157,10 @@ static inline void update_zone_age(struc
 
 	z->aging_total += nr_scan;
 
-	if (z->aging_total - z->aging_milestone > len)
+	if (z->aging_total - z->aging_milestone > len) {
 		z->aging_milestone += len;
+		z->aging_rounds++;
+	}
 
 	z->page_age = ((z->aging_total - z->aging_milestone)
 						<< PAGE_AGE_SHIFT) / len;

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 15/16] mm: add page reclaim debug traces
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (13 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 14/16] mm: zone aging rounds accounting Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  2005-12-07 10:48 ` [PATCH 16/16] mm: kswapd reclaim debug trace Wu Fengguang
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Wu Fengguang

[-- Attachment #1: mm-page-reclaim-debug-traces.patch --]
[-- Type: text/plain, Size: 5360 bytes --]

Show the detailed steps of direct/kswapd page reclaim.

To enable the printk traces:
# echo y > /debug/debug_page_reclaim

Sample lines:

reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 32-32, age 2626, active to scan 6542, hot+cold+free pages 8842+283558+352
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2626, active to scan 8018, hot+cold+free pages 1693+200036+10360
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-64, age 2627, active to scan 7564, hot+cold+free pages 8842+283526+384
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2627, active to scan 8296, hot+cold+free pages 1693+200018+10360
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-63, age 2628, active to scan 8587, hot+cold+free pages 8843+283495+416
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2628, active to scan 8574, hot+cold+free pages 1693+200014+10392
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-63, age 2628, active to scan 9610, hot+cold+free pages 8844+283465+448
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2628, active to scan 8852, hot+cold+free pages 1693+199996+10424
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-64, age 2629, active to scan 10633, hot+cold+free pages 8844+283433+480
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2629, active to scan 9130, hot+cold+free pages 1693+199992+10456
reclaim zone3 from kswapd for watermark, prio 12, scan-reclaimed 64-64, age 2630, active to scan 11656, hot+cold+free pages 8844+283401+512
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 32-32, age 2630, active to scan 9408, hot+cold+free pages 1693+199974+10488

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


 mm/vmscan.c |   72 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 71 insertions(+), 1 deletion(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -38,6 +38,7 @@
 #include <asm/div64.h>
 
 #include <linux/swapops.h>
+#include <linux/debugfs.h>
 
 /* possible outcome of pageout() */
 typedef enum {
@@ -78,6 +79,62 @@ struct scan_control {
 #define SC_MAY_WRITEPAGE	0x1
 #define SC_MAY_SWAP		0x2	/* Can pages be swapped as part of reclaim? */
 
+#define SC_RECLAIM_FROM_KSWAPD		0x10
+#define SC_RECLAIM_FROM_DIRECT		0x20
+#define SC_RECLAIM_FOR_WATERMARK	0x40
+#define SC_RECLAIM_FOR_AGING		0x80
+#define SC_RECLAIM_MASK			0xF0
+
+#ifdef CONFIG_DEBUG_FS
+static u32 debug_page_reclaim;
+
+static inline void debug_reclaim(struct scan_control *sc, unsigned long flags)
+{
+	sc->flags = (sc->flags & ~SC_RECLAIM_MASK) | flags;
+}
+
+static inline void debug_reclaim_report(struct scan_control *sc, struct zone *z)
+{
+	if (!debug_page_reclaim)
+		return;
+
+	printk(KERN_DEBUG "reclaim zone%d from %s for %s, "
+			"prio %d, scan-reclaimed %lu-%lu, age %lu, "
+			"active to scan %lu, "
+			"hot+cold+free pages %lu+%lu+%lu\n",
+			zone_idx(z),
+			(sc->flags & SC_RECLAIM_FROM_KSWAPD) ? "kswapd" :
+			((sc->flags & SC_RECLAIM_FROM_DIRECT) ? "direct" :
+								"early"),
+			(sc->flags & SC_RECLAIM_FOR_AGING) ?
+							"aging" : "watermark",
+			sc->priority, sc->nr_scanned, sc->nr_reclaimed,
+			z->page_age,
+			z->nr_scan_active,
+			z->nr_active, z->nr_inactive, z->free_pages);
+
+	if (atomic_read(&z->reclaim_in_progress))
+		printk(KERN_WARNING "reclaim_in_progress=%d\n",
+					atomic_read(&z->reclaim_in_progress));
+}
+
+static inline void debug_reclaim_init(void)
+{
+	debugfs_create_bool("debug_page_reclaim", 0644, NULL,
+							&debug_page_reclaim);
+}
+#else
+static inline void debug_reclaim(struct scan_control *sc, int flags)
+{
+}
+static inline void debug_reclaim_report(struct scan_control *sc, struct zone *z)
+{
+}
+static inline void debug_reclaim_init(void)
+{
+}
+#endif
+
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -1147,6 +1204,7 @@ shrink_zone(struct zone *zone, struct sc
 
 	atomic_dec(&zone->reclaim_in_progress);
 
+	debug_reclaim_report(sc, zone);
 	throttle_vm_writeout();
 }
 
@@ -1211,11 +1269,14 @@ shrink_caches(struct zone **zones, struc
 			continue;
 		}
 
+		debug_reclaim(sc, SC_RECLAIM_FROM_DIRECT);
 		shrink_zone(zone, sc);
 	}
 
-	if (z)
+	if (z) {
+		debug_reclaim(sc, SC_RECLAIM_FROM_DIRECT|SC_RECLAIM_FOR_AGING);
 		shrink_zone(z, sc);
+	}
 }
  
 /*
@@ -1397,14 +1458,22 @@ loop_again:
 				if (!zone_watermark_ok(zone, order,
 						zone->pages_high, 0, 0)) {
 					/* have to scan for free pages */
+					debug_reclaim(&sc,
+							SC_RECLAIM_FROM_KSWAPD |
+							SC_RECLAIM_FOR_WATERMARK);
 					goto scan;
 				}
 				if (age_ge(prev_zone, zone)) {
 					/* catch up if falls behind */
+					debug_reclaim(&sc,
+							SC_RECLAIM_FROM_KSWAPD |
+							SC_RECLAIM_FOR_AGING);
 					goto scan;
 				}
 			} else if (!age_gt(zone, prev_zone)) {
 				/* catch up if falls behind or out of sync */
+				debug_reclaim(&sc, SC_RECLAIM_FROM_KSWAPD |
+						   SC_RECLAIM_FOR_AGING);
 				goto scan;
 			}
 
@@ -1631,6 +1700,7 @@ static int __init kswapd_init(void)
 		= find_task_by_pid(kernel_thread(kswapd, pgdat, CLONE_KERNEL));
 	total_memory = nr_free_pagecache_pages();
 	hotcpu_notifier(cpu_callback, 0);
+	debug_reclaim_init();
 	return 0;
 }
 

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 16/16] mm: kswapd reclaim debug trace
  2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
                   ` (14 preceding siblings ...)
  2005-12-07 10:48 ` [PATCH 15/16] mm: add page reclaim debug traces Wu Fengguang
@ 2005-12-07 10:48 ` Wu Fengguang
  15 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Wu Fengguang

[-- Attachment #1: mm-kswapd-reclaim-debug-trace.patch --]
[-- Type: text/plain, Size: 752 bytes --]

Debug trace for kswapd reclaim.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |   11 +++++++++++
 1 files changed, 11 insertions(+)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1450,6 +1450,17 @@ loop_again:
 			if (nr_pages) 	/* software suspend */
 				goto scan_swspd;
 
+			if (debug_page_reclaim)
+			printk(KERN_DEBUG "zone %d%d watermark %d%d age %lu prio %d\n",
+					zone_idx(prev_zone),
+					zone_idx(zone),
+					zone_watermark_ok(zone, order,
+						zone->pages_high, 0, 0),
+					zone_watermark_ok(zone, order,
+						zone->pages_high,
+						pgdat->nr_zones - 1, 0),
+					zone->page_age, priority);
+
 			if (zone_watermark_ok(zone, order,
 						zone->pages_high,
 						pgdat->nr_zones - 1, 0)) {

--

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/16] mm: balance zone aging in kswapd reclaim path
  2005-12-07 10:48 ` [PATCH 05/16] mm: balance zone aging in kswapd " Wu Fengguang
@ 2005-12-07 10:58   ` Wu Fengguang
  2005-12-07 13:32   ` Wu Fengguang
  1 sibling, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 10:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli

Here are the new testing reports. The intermittent and concurrent copying of two
files is expected to generate a large range of unreclaimable pages.

The balance seems to have improved a lot since the last version, and the number
of direct reclaims is reduced to a minimum.

IN QEMU
=======
root ~# grep -E '(age |rounds)' /proc/zoneinfo
        rounds   142
        age      3621
        rounds   142
        age      3499
        rounds   142
        age      3502

root ~# ./show-aging-rate.sh
Linux (none) 2.6.15-rc5-mm1 #8 SMP Wed Dec 7 16:06:47 CST 2005 i686 GNU/Linux
             total       used       free     shared    buffers     cached
Mem:          1138       1119         18          0          0       1105
-/+ buffers/cache:         14       1124
Swap:            0          0          0

---------------------------------------------------------------
active/inactive size ratios:
    DMA0:  469 / 1000 =       621 /      1323
 Normal0:  374 / 1000 =     58588 /    156523
HighMem0:  397 / 1000 =     18880 /     47498

active/inactive scan rates:
     DMA:  273 / 1000 =       58528 / (     210464 +        3296)
  Normal:  342 / 1000 =     7851552 / (   22838944 +       94080)
 HighMem:  393 / 1000 =     2680480 / (    6774304 +       31040)

---------------------------------------------------------------
inactive size ratios:
    DMA0 /  Normal0:   85 / 10000 =      1334 /    156630
 Normal0 / HighMem0: 32946 / 10000 =    156630 /     47540

inactive scan rates:
     DMA /   Normal:   93 / 10000 = (     210464 +        3296) / (   22838944 +       94080)
  Normal /  HighMem: 33698 / 10000 = (   22838944 +       94080) / (    6774304 +       31040)

root ~# grep -E '(low|high|free|protection:) ' /proc/zoneinfo
  pages free     1161
        low      21
        high     25
        protection: (0, 0, 880, 1140)
  pages free     3505
        low      1173
        high     1408
        protection: (0, 0, 0, 2080)
  pages free     189
        low      134
        high     203
        protection: (0, 0, 0, 0)

root ~# vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0      0  18556    344 1131720    0    0    16     5 1042    37  4 66 30  0
 2  0      0  18616    352 1132324    0    0     0    16 1035    48  6 94  0  0
 2  0      0  19060    348 1131512    0    0     0     3  974    52  6 94  0  0
 2  0      0  18256    268 1132272    0    0     0     8 1018    50  6 94  0  0
 2  0      0  19096    248 1132020    0    0     0     3 1009    49  6 94  0  0
 2  0      0  19520    248 1130524    0    0     0     8  989    50  6 94  0  0
 2  0      0  18916    248 1131680    0    0     0     3 1008    49  6 94  0  0
 1  0      0  18436    208 1132740    0    0     0     7 1009    64  4 96  0  0
 2  0      0  18976    200 1132272    0    0     0    14 1029    64  5 95  0  0
 2  0      0  19156    200 1131932    0    0     0     8  992    48  6 94  0  0

root ~# cat /proc/vmstat
nr_dirty 9
nr_writeback 0
nr_unstable 0
nr_page_table_pages 22
nr_mapped 971
nr_slab 1405
pgpgin 68177
pgpgout 21820
pswpin 0
pswpout 0
pgalloc_high 4338439
pgalloc_normal 15416448
pgalloc_dma32 0
pgalloc_dma 157690
pgfree 19917495
pgactivate 10660320
pgdeactivate 10582874
pgkeephot 53405
pgkeepcold 17
pgfault 145079
pgmajfault 116
pgrefill_high 2707872
pgrefill_normal 7936896
pgrefill_dma32 0
pgrefill_dma 59392
pgsteal_high 4264660
pgsteal_normal 15171368
pgsteal_dma32 0
pgsteal_dma 155635
pgscan_kswapd_high 6843616
pgscan_kswapd_normal 23067616
pgscan_kswapd_dma32 0
pgscan_kswapd_dma 212352
pgscan_direct_high 31040
pgscan_direct_normal 94080
pgscan_direct_dma32 0
pgscan_direct_dma 3296
pginodesteal 0
slabs_scanned 128
kswapd_steal 19582040
kswapd_inodesteal 0
pageoutrun 547184
allocstall 274
pgrotated 8
nr_bounce 0


ON A REAL BOX
=============
root@Server ~# grep -E '(age |rounds)' /proc/zoneinfo
        rounds   164
        age      410
        rounds   150
        age      396
        rounds   150
        age      396

root@Server ~# ./show-aging-rate.sh
Linux Server 2.6.15-rc5-mm1 #9 SMP Wed Dec 7 16:47:56 CST 2005 i686 GNU/Linux
             total       used       free     shared    buffers     cached
Mem:          2020       1970         50          0          5       1916
-/+ buffers/cache:         48       1972
Swap:            0          0          0

---------------------------------------------------------------
active/inactive size ratios:
    DMA0:  132 / 1000 =       123 /       930
 Normal0:  161 / 1000 =     28022 /    173838
HighMem0:  177 / 1000 =     43935 /    247952

active/inactive scan rates:
     DMA:  170 / 1000 =       23889 / (     118528 +       21216)
  Normal:  210 / 1000 =     5296960 / (   24645696 +      484160)
 HighMem:  239 / 1000 =     8501024 / (   34741600 +      752000)

---------------------------------------------------------------
inactive size ratios:
    DMA0 /  Normal0:   53 / 10000 =       930 /    173838
 Normal0 / HighMem0: 7010 / 10000 =    173838 /    247952

inactive scan rates:
     DMA /   Normal:   55 / 10000 = (     118528 +       21216) / (   24645696 +      484160)
  Normal /  HighMem: 7080 / 10000 = (   24645696 +      484160) / (   34741600 +      752000)

pageoutrun / allocstall = 73374 / 100 = 1072730 / 1461

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 13/16] mm: fix minor scan count bugs
  2005-12-07 10:48 ` [PATCH 13/16] mm: fix minor scan count bugs Wu Fengguang
  2005-12-07 10:32   ` Nick Piggin
@ 2005-12-07 11:02   ` Wu Fengguang
  1 sibling, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 11:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton

Hi Andrew,

Here is the standalone version for -mm inclusion.


Subject: mm: fix minor scan count bugs

- in isolate_lru_pages(): reports one more scan. Fix it.
- in shrink_cache(): 0 pages taken does not mean 0 pages scanned. Fix it.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -837,7 +837,8 @@ static int isolate_lru_pages(int nr_to_s
 	struct page *page;
 	int scan = 0;
 
-	while (scan++ < nr_to_scan && !list_empty(src)) {
+	while (scan < nr_to_scan && !list_empty(src)) {
+		scan++;
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
 
@@ -886,14 +887,15 @@ static void shrink_cache(struct zone *zo
 		zone->pages_scanned += nr_scan;
 		spin_unlock_irq(&zone->lru_lock);
 
-		if (nr_taken == 0)
-			goto done;
-
 		max_scan -= nr_scan;
 		if (current_is_kswapd())
 			mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
 		else
 			mod_page_state_zone(zone, pgscan_direct, nr_scan);
+
+		if (nr_taken == 0)
+			goto done;
+
 		nr_freed = shrink_list(&page_list, sc);
 		if (current_is_kswapd())
 			mod_page_state(kswapd_steal, nr_freed);

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 06/16] mm: balance slab aging
  2005-12-07 10:48 ` [PATCH 06/16] mm: balance slab aging Wu Fengguang
@ 2005-12-07 11:08   ` Wu Fengguang
  2005-12-07 11:34     ` Nick Piggin
  0 siblings, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 11:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli

A question about the current one:

For a NUMA system with N nodes, the way kswapd calculates lru_pages - summing
up only the local zones - may cause N times more shrinking than on a 1-CPU system.

Is this a feature or a bug?

Thanks,
Wu

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags
  2005-12-07 10:36   ` Nick Piggin
@ 2005-12-07 11:11     ` Wu Fengguang
  2005-12-07 11:12       ` Nick Piggin
  0 siblings, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 11:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Marcelo Tosatti, Magnus Damm, Nick Piggin,
	Andrea Arcangeli

On Wed, Dec 07, 2005 at 09:36:23PM +1100, Nick Piggin wrote:
> Wu Fengguang wrote:
> >Fold bool values into flags to make struct scan_control more compact.
> >
> 
> Probably not a bad idea (although you haven't done anything for 64-bit
> archs, yet)... do we wait until one more flag wants to be added?

I did this to hold some more debug flags :)
I'll make it a standalone patch, too.

Wu

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags
  2005-12-07 11:11     ` Wu Fengguang
@ 2005-12-07 11:12       ` Nick Piggin
  2005-12-07 13:01         ` Wu Fengguang
  0 siblings, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2005-12-07 11:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Marcelo Tosatti, Magnus Damm, Nick Piggin,
	Andrea Arcangeli

Wu Fengguang wrote:
> On Wed, Dec 07, 2005 at 09:36:23PM +1100, Nick Piggin wrote:
> 
>>Wu Fengguang wrote:
>>
>>>Fold bool values into flags to make struct scan_control more compact.
>>>
>>
>>Probably not a bad idea (although you haven't done anything for 64-bit
>>archs, yet)... do we wait until one more flag wants to be added?
> 
> 
> I did this to hold some more debug flags :)

Yes, but if they make sense for the current kernel too, it takes some
of the peripheral noise out of your patchset... which helps everyone :)

> I'll make it a standalone patch, too.
> 

Thanks. I don't have strong feelings either way, but I had always
been meaning to do something like this if we picked up another flag.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* mm: fold sc.may_writepage and sc.may_swap into sc.flags
  2005-12-07 10:48 ` [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags Wu Fengguang
  2005-12-07 10:36   ` Nick Piggin
@ 2005-12-07 11:15   ` Wu Fengguang
  2005-12-07 17:02     ` Martin Hicks
  1 sibling, 1 reply; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 11:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton

Fold bool values into flags to make struct scan_control more compact.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/vmscan.c |   22 ++++++++++------------
 1 files changed, 10 insertions(+), 12 deletions(-)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -72,12 +72,12 @@ struct scan_control {
 	/* This context's GFP mask */
 	gfp_t gfp_mask;
 
-	int may_writepage;
-
-	/* Can pages be swapped as part of reclaim? */
-	int may_swap;
+	unsigned long flags;
 };
 
+#define SC_MAY_WRITEPAGE	0x1
+#define SC_MAY_SWAP		0x2	/* Can pages be swapped as part of reclaim? */
+
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -488,7 +488,7 @@ static int shrink_list(struct list_head 
 		 * Try to allocate it some swap space here.
 		 */
 		if (PageAnon(page) && !PageSwapCache(page)) {
-			if (!sc->may_swap)
+			if (!(sc->flags & SC_MAY_SWAP))
 				goto keep_locked;
 			if (!add_to_swap(page, GFP_ATOMIC))
 				goto activate_locked;
@@ -519,7 +519,7 @@ static int shrink_list(struct list_head 
 				goto keep_locked;
 			if (!may_enter_fs)
 				goto keep_locked;
-			if (laptop_mode && !sc->may_writepage)
+			if (laptop_mode && !(sc->flags & SC_MAY_WRITEPAGE))
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
@@ -1238,8 +1238,7 @@ int try_to_free_pages(struct zone **zone
 	delay_prefetch();
 
 	sc.gfp_mask = gfp_mask;
-	sc.may_writepage = 0;
-	sc.may_swap = 1;
+	sc.flags = SC_MAY_SWAP;
 	sc.nr_scanned = 0;
 	sc.nr_reclaimed = 0;
 
@@ -1287,7 +1286,7 @@ int try_to_free_pages(struct zone **zone
 		 */
 		if (sc.nr_scanned > SWAP_CLUSTER_MAX * 3 / 2) {
 			wakeup_pdflush(laptop_mode ? 0 : sc.nr_scanned);
-			sc.may_writepage = 1;
+			sc.flags |= SC_MAY_WRITEPAGE;
 		}
 
 		/* Take a nap, wait for some writeback to complete */
@@ -1343,8 +1342,7 @@ static int balance_pgdat(pg_data_t *pgda
 
 loop_again:
 	sc.gfp_mask = GFP_KERNEL;
-	sc.may_writepage = 0;
-	sc.may_swap = 1;
+	sc.flags = SC_MAY_SWAP;
 	sc.nr_mapped = read_page_state(nr_mapped);
 	sc.nr_scanned = 0;
 	sc.nr_reclaimed = 0;
@@ -1439,7 +1437,7 @@ scan_swspd:
 		 */
 		if (sc.nr_scanned > SWAP_CLUSTER_MAX * 2 &&
 		    sc.nr_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
-			sc.may_writepage = 1;
+			sc.flags |= SC_MAY_WRITEPAGE;
 
 		if (nr_pages && to_free > sc.nr_reclaimed)
 			continue;	/* swsusp: need to do more work */

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 06/16] mm: balance slab aging
  2005-12-07 11:08   ` Wu Fengguang
@ 2005-12-07 11:34     ` Nick Piggin
  2005-12-07 12:59       ` Wu Fengguang
  0 siblings, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2005-12-07 11:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Marcelo Tosatti, Magnus Damm, Nick Piggin,
	Andrea Arcangeli

Wu Fengguang wrote:
> A question about the current one:
> 
> For a NUMA system with N nodes, the way kswapd calculates lru_pages - only sum
> up local zones - may cause N times more shrinking than a 1-CPU system.
> 

But it is equal pressure for all pools involved in being scanned; the
simplifying assumption is that slab is equally distributed among
nodes. And yeah, scanning would load up when more than 1 kswapd is
running.

I had patches to do per-zone inode and dentry slab shrinking ages
ago, but nobody was interested... so I'm guessing it is a feature :)

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 06/16] mm: balance slab aging
  2005-12-07 11:34     ` Nick Piggin
@ 2005-12-07 12:59       ` Wu Fengguang
  0 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 12:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Marcelo Tosatti, Magnus Damm, Nick Piggin,
	Andrea Arcangeli

On Wed, Dec 07, 2005 at 10:34:11PM +1100, Nick Piggin wrote:
> Wu Fengguang wrote:
> >A question about the current one:
> >
> >For a NUMA system with N nodes, the way kswapd calculates lru_pages - only 
> >sum
> >up local zones - may cause N times more shrinking than a 1-CPU system.
> >
> 
> But it is equal pressure for all pools involved in being scaned the
> simplifying assumption is that slab is equally distributed among
> nodes. And yeah, scanning would load up when more than 1 kswapd is
> running.
> 
> I had patches to do per-zone inode and dentry slab shrinking ages
> ago, but nobody was interested... so I'm guessing it is a feature :)

I rechecked shrink_dcache_memory()/prune_dcache(); they seem to operate in
a global manner, which means (conceptually) that if 10 nodes each scan 10%, the
global dcache is scanned 100%. Isn't that crazy? ;)

Thanks,
Wu

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags
  2005-12-07 11:12       ` Nick Piggin
@ 2005-12-07 13:01         ` Wu Fengguang
  0 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 13:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Marcelo Tosatti, Magnus Damm, Nick Piggin,
	Andrea Arcangeli

On Wed, Dec 07, 2005 at 10:12:44PM +1100, Nick Piggin wrote:
> >I did this to hold some more debug flags :)
> 
> Yes, but if they make sense for the current kernel too, it reduces
> the peripheral noise out of your patchset... which helps everyone :)

Thanks. I had not been quite aware of this, sorry.

Wu

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/16] mm: balance zone aging in kswapd reclaim path
  2005-12-07 10:48 ` [PATCH 05/16] mm: balance zone aging in kswapd " Wu Fengguang
  2005-12-07 10:58   ` Wu Fengguang
@ 2005-12-07 13:32   ` Wu Fengguang
  1 sibling, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-07 13:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Christoph Lameter, Rik van Riel, Peter Zijlstra,
	Marcelo Tosatti, Magnus Damm, Nick Piggin, Andrea Arcangeli

Here is another test on a 512M desktop, this time with only a big sparse file copy.

- The inactive_list balance is perfectly maintained
- The active_list is scanned a bit more, because the calculation is performed after
  the scan, when nr_inactive has become a little smaller
- direct reclaims are near zero; good or evil?

wfg ~% grep -E '(age |rounds)' /proc/zoneinfo
        rounds   100
        age      659
        rounds   100
        age      621

wfg ~% show-aging-rate.sh
Linux lark 2.6.15-rc5-mm1 #9 SMP Wed Dec 7 16:47:56 CST 2005 i686 GNU/Linux
             total       used       free     shared    buffers     cached
Mem:           501        475         26          0          6        278
-/+ buffers/cache:        189        311
Swap:          127          2        125

---------------------------------------------------------------
active/inactive size ratios:
    DMA0:   75 / 1000 =       161 /      2135
 Normal0: 1074 / 1000 =     59046 /     54936

active/inactive scan rates:
     DMA:   31 / 1000 =        7867 / (     246784 +           0)
  Normal:  974 / 1000 =     5847744 / (    6001216 +         128)

---------------------------------------------------------------
inactive size ratios:
    DMA0 /  Normal0:  388 / 10000 =      2135 /     54936

inactive scan rates:
     DMA /   Normal:  411 / 10000 = (     246784 +           0) / (    6001216 +         128)

pageoutrun / allocstall = 4140780 / 100 = 207039 / 4

wfg ~% vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 3  0   2612   6408   7280 293960    0    1   122    13 1178  1415 14  9 75  2
 1  0   2612   6404   2144 298144    0    0   190    35 1159  1675 11 89  0  0
 1  0   2612   6216   2052 299596    0    0   442     8 1189  1612 10 90  0  0
 1  0   2612   5916   2032 299888    0    0   326     0 1182  1713 10 90  0  0
 1  3   2612   6528   3252 297240    0    0   795     6 1275  1464 10 53  0 37
 0  0   2612   5648   3644 298480    0    0   739    14 1261  1203 14 23 39 24
[the big cp stops about here]
 0  0   2612   5784   3660 298532    0    0     0    17 1130  1322 10  3 87  0
 0  0   2612   5952   3692 298500    0    0     6     4 1137  1343  9  2 87  2
 0  0   2612   5976   3700 298492    0    0     2     3 1143  1327  7  1 91  0
 0  0   2612   6000   3700 298492    0    0     0     0 1138  1315  7  2 91  0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: fold sc.may_writepage and sc.may_swap into sc.flags
  2005-12-07 11:15   ` Wu Fengguang
@ 2005-12-07 17:02     ` Martin Hicks
  2005-12-07 23:15       ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Martin Hicks @ 2005-12-07 17:02 UTC (permalink / raw)
  To: Wu Fengguang, linux-kernel, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 332 bytes --]


On Wed, Dec 07, 2005 at 07:15:01PM +0800, Wu Fengguang wrote:
> Fold bool values into flags to make struct scan_control more compact.
> 

I suspect that the may_swap flag is still a leftover from my failed
attempt at zone_reclaim.  It should be removed.

mh

-- 
Martin Hicks || mort@bork.org || PGP/GnuPG: 0x4C7F2BEE

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: mm: fold sc.may_writepage and sc.may_swap into sc.flags
  2005-12-07 17:02     ` Martin Hicks
@ 2005-12-07 23:15       ` Andrew Morton
  0 siblings, 0 replies; 35+ messages in thread
From: Andrew Morton @ 2005-12-07 23:15 UTC (permalink / raw)
  To: Martin Hicks; +Cc: wfg, linux-kernel

Martin Hicks <mort@bork.org> wrote:
>
> On Wed, Dec 07, 2005 at 07:15:01PM +0800, Wu Fengguang wrote:
> > Fold bool values into flags to make struct scan_control more compact.
> > 
> 
> I suspect that the may_swap flag is still a left over from my failed
> attempt at zone_reclaim.  It should be removed.

Yes, it can be removed, thanks.  I missed that.  (Patch
`kill-last-zone_reclaim-bits.patch' in -mm updated).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 03/16] mm: supporting variables and functions for balanced zone aging
  2005-12-07 10:47 ` [PATCH 03/16] mm: supporting variables and functions for balanced zone aging Wu Fengguang
@ 2005-12-11 22:36   ` Marcelo Tosatti
  2005-12-12  2:53     ` Wu Fengguang
  0 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2005-12-11 22:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Magnus Damm, Nick Piggin, Andrea Arcangeli

On Wed, Dec 07, 2005 at 06:47:58PM +0800, Wu Fengguang wrote:
> The zone aging rates are currently imbalanced, the gap can be as large as 3
> times, which can severely damage read-ahead requests and shorten their
> effective life time.
> 
> This patch adds three variables in struct zone
> 	- aging_total
> 	- aging_milestone
> 	- page_age
> to keep track of page aging rate, and keep it in sync on page reclaim time.
> 
> The aging_total is just a per-zone counter-part to the per-cpu
> pgscan_{kswapd,direct}_{zone name}. But it is not direct comparable between
> zones, so the aging_milestone/page_age are maintained based on aging_total.
> 
> The page_age is a normalized value that can be direct compared between zones
> with the helper macro age_ge/age_gt. The goal of balancing logics are to keep
> this normalized value in sync between zones.
> 
> One can check the balanced aging progress by running:
>                         tar c / | cat > /dev/null &
>                         watch -n1 'grep "age " /proc/zoneinfo'
> 
> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> ---
> 
>  include/linux/mmzone.h |   14 ++++++++++++++
>  mm/page_alloc.c        |   11 +++++++++++
>  mm/vmscan.c            |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 73 insertions(+)
> 
> --- linux.orig/include/linux/mmzone.h
> +++ linux/include/linux/mmzone.h
> @@ -149,6 +149,20 @@ struct zone {
>  	unsigned long		pages_scanned;	   /* since last reclaim */
>  	int			all_unreclaimable; /* All pages pinned */
>  
> +	/* Fields for balanced page aging:
> +	 * aging_total     - The accumulated number of activities that may
> +	 *                   cause page aging, that is, make some pages closer
> +	 *                   to the tail of inactive_list.
> +	 * aging_milestone - A snapshot of total_scan every time a full
> +	 *                   inactive_list of pages become aged.
> +	 * page_age        - A normalized value showing the percent of pages
> +	 *                   have been aged.  It is compared between zones to
> +	 *                   balance the rate of page aging.
> +	 */
> +	unsigned long		aging_total;
> +	unsigned long		aging_milestone;
> +	unsigned long		page_age;
> +
>  	/*
>  	 * Does the allocator try to reclaim pages from the zone as soon
>  	 * as it fails a watermark_ok() in __alloc_pages?
> --- linux.orig/mm/vmscan.c
> +++ linux/mm/vmscan.c
> @@ -123,6 +123,53 @@ static long total_memory;
>  static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
> +#ifdef CONFIG_HIGHMEM64G
> +#define		PAGE_AGE_SHIFT  8
> +#elif BITS_PER_LONG == 32
> +#define		PAGE_AGE_SHIFT  12
> +#elif BITS_PER_LONG == 64
> +#define		PAGE_AGE_SHIFT  20
> +#else
> +#error unknown BITS_PER_LONG
> +#endif
> +#define		PAGE_AGE_SIZE   (1 << PAGE_AGE_SHIFT)
> +#define		PAGE_AGE_MASK   (PAGE_AGE_SIZE - 1)
> +
> +/*
> + * The simplified code is:
> + * 	age_ge: (@a->page_age >= @b->page_age)
> + * 	age_gt: (@a->page_age > @b->page_age)
> + * The complexity deals with the wrap-around problem.
> + * Two page ages not close enough(gap >= 1/8) should also be ignored:
> + * they are out of sync and the comparison may be nonsense.
> + *
> + * Return value depends on the position of @a relative to @b:
> + * -1/8       b      +1/8
> + *   |--------|--------|-----------------------------------------------|
> + *       0        1                           0
> + */
> +#define age_ge(a, b) \
> +	(((a->page_age - b->page_age) & PAGE_AGE_MASK) < PAGE_AGE_SIZE / 8)
> +#define age_gt(a, b) \
> +	(((b->page_age - a->page_age) & PAGE_AGE_MASK) > PAGE_AGE_SIZE * 7 / 8)
> +
> +/*
> + * Keep track of the percent of cold pages that have been scanned / aged.
> + * It's not really ##%, but a high resolution normalized value.
> + */
> +static inline void update_zone_age(struct zone *z, int nr_scan)
> +{
> +	unsigned long len = z->nr_inactive | 1;
> +
> +	z->aging_total += nr_scan;
> +
> +	if (z->aging_total - z->aging_milestone > len)
> +		z->aging_milestone += len;
> +
> +	z->page_age = ((z->aging_total - z->aging_milestone)
> +						<< PAGE_AGE_SHIFT) / len;
> +}
> +

Hi Wu,

It is not very clear to me what is the meaning of these numbers and what
you're trying to deduce from them. Please correct me if I'm wrong.

z->aging_total is the sum of all scanned inactive pages for the zone
(sum of scanned pages by shrink_cache).

z->aging_milestone is the number of pages scanned with "nr_inactive"
precision (it is updated in response to a full inactive list scan).

z->page_age is the difference between aging_milestone and aging_total.

The name sounds a bit misleading since "page age" intuitively means
"what is the age of a page".

Anyway, z->page_age is the number of scanned pages since the last full  
scan, shifted left by PAGE_AGE_SHIFT and divided by the number of       
inactive pages.                                                         

IOW, it still means "number of scanned pages since last full scan".

How can that be meaningful? No other code uses this in the patchset
AFAICS.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 03/16] mm: supporting variables and functions for balanced zone aging
  2005-12-11 22:36   ` Marcelo Tosatti
@ 2005-12-12  2:53     ` Wu Fengguang
  0 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2005-12-12  2:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Magnus Damm, Nick Piggin, Andrea Arcangeli

On Sun, Dec 11, 2005 at 08:36:46PM -0200, Marcelo Tosatti wrote:
> Hi Wu,

Hi Marcelo,

> It is not very clear to me what is the meaning of these numbers and what
> you're trying to deduce from them. Please correct me if I'm wrong.
> 
> z->aging_total is the sum of all scanned inactive pages for the zone
> (sum of scanned pages by shrink_cache).
> 
> z->aging_milestone is the number of pages scanned with "nr_inactive"
> precision (it is updated in response to a full inactive list scan).
> 
> z->page_age is the difference between aging_milestone and aging_total.
> 
> The name sounds a bit misleading since "page age" intuitively means
> "what is the age of a page".
> 
> Anyway, z->page_age is the number of scanned pages since the last full  
> scan, shifted left by PAGE_AGE_SHIFT and divided by the number of       
> inactive pages.                                                         
> 
> IOW, it still means "number of scanned pages since last full scan".
 
Thanks for the review and comments. Your interpretations are pretty accurate.

> How can that be meaningful? No other code uses this in the patchset
> AFAICS.

1) It is updated _solely_ by update_age() and used _only_ through age_ge/age_gt
   macros.
2) Yes, there are some duplications. The normalized one can be removed, at the
   cost of possibly more runtime computation.

Let me address the rest of your recommendations with the following new patch :)

The new patch also fixes a bug where fluctuations of nr_inactive could lead to
big jumps of the zone age. Now the balancing logic works as expected.
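
For illustration with made-up numbers: suppose aging_total - aging_milestone
is 50000 and the shift is 12. The old code computed page_age =
(50000 << 12) / nr_inactive, so a drop of nr_inactive from 100000 to 50000
between two scans jumped page_age from 2048 to 4096 without a single extra
page being scanned. The new update_age() divides by the life_span snapshot
taken at the last milestone, so between milestones std_age can only grow in
proportion to the pages actually scanned.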

Regards,
Wu
---

Subject: mm: supporting variables and functions for balanced zone aging

An imbalance of zone aging rates can severely damage read-ahead requests,
shorten their effective lifetime, increase unexpected I/O latency and waste
memory.

This patch introduces struct aging with members
	- life_span
	- raw_age
	- std_age
to keep track of the page aging rate. It is updated _solely_ by update_age() and
used _only_ through the age_ge/age_gt macros.

The aging.std_age is a normalized value of (aging.raw_age/aging.life_span)
that can be compared between zones/slabs with the helper macros age_ge/age_gt.
The goal of the balancing logic is to keep this normalized value in sync.

One can check the balanced aging progress by running:
                        tar c / | cat > /dev/null &
                        watch -n1 'grep "age " /proc/zoneinfo'
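
Since the wrap-around comparison is the subtle part, here is a minimal
user-space sketch of how age_ge() behaves (simplified: it works on bare
std_age values rather than on struct aging, and assumes the 32-bit
AGING_SHIFT of 12; not kernel code):

	#include <stdio.h>

	#define AGING_SHIFT	12
	#define AGING_SIZE	(1UL << AGING_SHIFT)
	#define AGING_MASK	(AGING_SIZE - 1)

	/* the same test as the kernel macro, applied to bare std_age values */
	#define age_ge(a, b)	((((a) - (b)) & AGING_MASK) < AGING_SIZE / 8)

	int main(void)
	{
		unsigned long a = 10;			/* just wrapped around */
		unsigned long b = AGING_SIZE - 20;	/* about to wrap around */

		/* a is slightly ahead of b despite the smaller raw value */
		printf("age_ge(a, b) = %d\n", age_ge(a, b));		/* 1 */
		printf("age_ge(b, a) = %d\n", age_ge(b, a));		/* 0 */

		/* ages more than 1/8 apart are out of sync and compare false */
		printf("age_ge(far, a) = %d\n", age_ge(AGING_SIZE / 2, a));	/* 0 */
		return 0;
	}

The three cases match the diagram in the age_ge/age_gt comment below.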

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
 include/linux/mmzone.h |   14 ++++++++++++++
 mm/page_alloc.c        |   11 +++++++++++
 mm/vmscan.c            |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+)

--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -106,6 +106,12 @@ struct per_cpu_pageset {
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */
 
+struct aging {
+	unsigned long	life_span;
+	unsigned long	raw_age;
+	unsigned long	std_age;
+};
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
@@ -149,6 +155,8 @@ struct zone {
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
+	struct aging		aging;
+
 	/*
 	 * Does the allocator try to reclaim pages from the zone as soon
 	 * as it fails a watermark_ok() in __alloc_pages?
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -123,6 +123,53 @@ static long total_memory;
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
+#ifdef CONFIG_HIGHMEM64G
+#define		AGING_SHIFT  8
+#elif BITS_PER_LONG == 32
+#define		AGING_SHIFT  12
+#elif BITS_PER_LONG == 64
+#define		AGING_SHIFT  20
+#else
+#error unknown BITS_PER_LONG
+#endif
+#define		AGING_SIZE   (1 << AGING_SHIFT)
+#define		AGING_MASK   (AGING_SIZE - 1)
+
+/*
+ * The simplified code is:
+ * 	age_ge: (@a->aging.std_age >= @b->aging.std_age)
+ * 	age_gt: (@a->aging.std_age > @b->aging.std_age)
+ * The complexity deals with the wrap-around problem.
+ * Two page ages not close enough (gap >= 1/8) should also be ignored:
+ * they are out of sync and the comparison may be nonsense.
+ *
+ * Return value depends on the position of @a relative to @b:
+ * -1/8       b      +1/8
+ *   |--------|--------|-----------------------------------------------|
+ *       0        1                           0
+ */
+#define age_ge(a, b) \
+	(((a->aging.std_age - b->aging.std_age) & AGING_MASK) < AGING_SIZE / 8)
+#define age_gt(a, b) \
+	(((b->aging.std_age - a->aging.std_age) & AGING_MASK) > AGING_SIZE * 7 / 8)
+
+/*
+ * Keep track of the percent of cold pages that have been scanned / aged.
+ * It's not really ##%, but a high resolution normalized value.
+ */
+static inline void update_age(struct aging *a, unsigned long age,
+						unsigned long current_life_span)
+{
+	a->raw_age += age;
+
+	if (a->raw_age > a->life_span) {
+		a->raw_age -= a->life_span;
+		a->life_span = (current_life_span | 1);
+	}
+
+	a->std_age = (a->raw_age << AGING_SHIFT) / a->life_span;
+}
+
 /*
  * Add a shrinker callback to be called from the vm
  */
@@ -887,6 +934,7 @@ static void shrink_cache(struct zone *zo
 					     &page_list, &nr_scan);
 		zone->nr_inactive -= nr_taken;
 		zone->pages_scanned += nr_scan;
+		update_age(&zone->aging, nr_scan, zone->nr_inactive);
 		spin_unlock_irq(&zone->lru_lock);
 
 		if (nr_taken == 0)
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1522,6 +1522,7 @@ void show_free_areas(void)
 			" active:%lukB"
 			" inactive:%lukB"
 			" present:%lukB"
+			" age:%lu"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
 			"\n",
@@ -1533,6 +1534,7 @@ void show_free_areas(void)
 			K(zone->nr_active),
 			K(zone->nr_inactive),
 			K(zone->present_pages),
+			zone->aging.std_age,
 			zone->pages_scanned,
 			(zone->all_unreclaimable ? "yes" : "no")
 			);
@@ -2144,6 +2146,7 @@ static void __init free_area_init_core(s
 		zone->nr_scan_inactive = 0;
 		zone->nr_active = 0;
 		zone->nr_inactive = 0;
+		memset(&zone->aging, 0, sizeof(struct aging));
 		atomic_set(&zone->reclaim_in_progress, 0);
 		if (!size)
 			continue;
@@ -2292,6 +2295,7 @@ static int zoneinfo_show(struct seq_file
 			   "\n        high     %lu"
 			   "\n        active   %lu"
 			   "\n        inactive %lu"
+			   "\n        age      %lu"
 			   "\n        scanned  %lu (a: %lu i: %lu)"
 			   "\n        spanned  %lu"
 			   "\n        present  %lu",
@@ -2301,6 +2305,7 @@ static int zoneinfo_show(struct seq_file
 			   zone->pages_high,
 			   zone->nr_active,
 			   zone->nr_inactive,
+			   zone->aging.std_age,
 			   zone->pages_scanned,
 			   zone->nr_scan_active, zone->nr_scan_inactive,
 			   zone->spanned_pages,

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 09/16] mm: remove unnecessary variable and loop
  2005-12-07 10:48 ` [PATCH 09/16] mm: remove unnecessary variable and loop Wu Fengguang
@ 2006-01-05 19:21   ` Marcelo Tosatti
  2006-01-06  8:58     ` Wu Fengguang
  0 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2006-01-05 19:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Magnus Damm, Nick Piggin, Andrea Arcangeli

On Wed, Dec 07, 2005 at 06:48:04PM +0800, Wu Fengguang wrote:
> shrink_cache() and refill_inactive_zone() do not need loops.
> 
> Simplify them to scan one chunk at a time.

Hi Wu,

What is the purpose of scanning large chunks at a time?

Some drawbacks that I can think of by doing that:

- zone->lru_lock will be held for much longer periods, resulting in
decreased responsiveness and possibly slowdowns.

- if the task doing the scan is incapable of certain operations, for
instance I/O, dirty pages will be moved back to the head of the inactive
list in much larger batches than they were before. This could hurt
reclaim in general.

What were the results of this change? In particular, how does it affect
contention on the lru_lock on medium-to-large SMP systems?

Thanks!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 09/16] mm: remove unnecessary variable and loop
  2006-01-05 19:21   ` Marcelo Tosatti
@ 2006-01-06  8:58     ` Wu Fengguang
  0 siblings, 0 replies; 35+ messages in thread
From: Wu Fengguang @ 2006-01-06  8:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, Andrew Morton, Christoph Lameter, Rik van Riel,
	Peter Zijlstra, Magnus Damm, Nick Piggin, Andrea Arcangeli

On Thu, Jan 05, 2006 at 05:21:56PM -0200, Marcelo Tosatti wrote:
> On Wed, Dec 07, 2005 at 06:48:04PM +0800, Wu Fengguang wrote:
> > shrink_cache() and refill_inactive_zone() do not need loops.
> > 
> > Simplify them to scan one chunk at a time.
> 
> Hi Wu,

Hi Marcelo,

> What is the purpose of scanning large chunks at a time?

But I did not say or mean 'large' chunks :)
With the patch the chunk size is _always_ set to SWAP_CLUSTER_MAX=32 - the good
old default value.
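
To illustrate with a stand-alone sketch (this models the shape described in
the changelog, not the actual diff; every name except SWAP_CLUSTER_MAX and
shrink_cache is made up):

	#include <stdio.h>

	#define SWAP_CLUSTER_MAX 32

	/* stands in for one lru_lock hold: isolate and shrink `chunk' pages */
	static void scan_one_chunk(int chunk)
	{
		printf("scan %d pages under zone->lru_lock\n", chunk);
	}

	/* old shape: shrink_cache() loops over max_scan itself */
	static void shrink_cache_old(int max_scan)
	{
		while (max_scan > 0) {
			int chunk = max_scan < SWAP_CLUSTER_MAX ?
						max_scan : SWAP_CLUSTER_MAX;
			scan_one_chunk(chunk);
			max_scan -= chunk;
		}
	}

	/* new shape: exactly one SWAP_CLUSTER_MAX chunk per call */
	static void shrink_cache_new(void)
	{
		scan_one_chunk(SWAP_CLUSTER_MAX);
	}

	int main(void)
	{
		shrink_cache_old(100);	/* 32 + 32 + 32 + 4 pages */
		shrink_cache_new();	/* always exactly 32 pages */
		return 0;
	}

Under that reading, each lru_lock acquisition still covers at most 32 pages;
the outer iteration simply moves up to the caller.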

Thanks.
Wu

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread

Thread overview: 35+ messages
2005-12-07 10:47 [PATCH 00/16] Balancing the scan rate of major caches V3 Wu Fengguang
2005-12-07 10:47 ` [PATCH 01/16] mm: restore sc.nr_to_reclaim Wu Fengguang
2005-12-07 10:47 ` [PATCH 02/16] mm: simplify kswapd reclaim code Wu Fengguang
2005-12-07 10:47 ` [PATCH 03/16] mm: supporting variables and functions for balanced zone aging Wu Fengguang
2005-12-11 22:36   ` Marcelo Tosatti
2005-12-12  2:53     ` Wu Fengguang
2005-12-07 10:47 ` [PATCH 04/16] mm: balance zone aging in direct reclaim path Wu Fengguang
2005-12-07 10:48 ` [PATCH 05/16] mm: balance zone aging in kswapd " Wu Fengguang
2005-12-07 10:58   ` Wu Fengguang
2005-12-07 13:32   ` Wu Fengguang
2005-12-07 10:48 ` [PATCH 06/16] mm: balance slab aging Wu Fengguang
2005-12-07 11:08   ` Wu Fengguang
2005-12-07 11:34     ` Nick Piggin
2005-12-07 12:59       ` Wu Fengguang
2005-12-07 10:48 ` [PATCH 07/16] mm: balance active/inactive list scan rates Wu Fengguang
2005-12-07 10:48 ` [PATCH 08/16] mm: fine grained scan priority Wu Fengguang
2005-12-07 10:48 ` [PATCH 09/16] mm: remove unnecessary variable and loop Wu Fengguang
2006-01-05 19:21   ` Marcelo Tosatti
2006-01-06  8:58     ` Wu Fengguang
2005-12-07 10:48 ` [PATCH 10/16] mm: remove swap_cluster_max from scan_control Wu Fengguang
2005-12-07 10:48 ` [PATCH 11/16] mm: let sc.nr_scanned/sc.nr_reclaimed accumulate Wu Fengguang
2005-12-07 10:48 ` [PATCH 12/16] mm: fold sc.may_writepage and sc.may_swap into sc.flags Wu Fengguang
2005-12-07 10:36   ` Nick Piggin
2005-12-07 11:11     ` Wu Fengguang
2005-12-07 11:12       ` Nick Piggin
2005-12-07 13:01         ` Wu Fengguang
2005-12-07 11:15   ` Wu Fengguang
2005-12-07 17:02     ` Martin Hicks
2005-12-07 23:15       ` Andrew Morton
2005-12-07 10:48 ` [PATCH 13/16] mm: fix minor scan count bugs Wu Fengguang
2005-12-07 10:32   ` Nick Piggin
2005-12-07 11:02   ` Wu Fengguang
2005-12-07 10:48 ` [PATCH 14/16] mm: zone aging rounds accounting Wu Fengguang
2005-12-07 10:48 ` [PATCH 15/16] mm: add page reclaim debug traces Wu Fengguang
2005-12-07 10:48 ` [PATCH 16/16] mm: kswapd reclaim debug trace Wu Fengguang
