* [PATCH 0/2] Reduce the amount of time spent in watermark-related functions
@ 2010-10-27 8:47 Mel Gorman
  2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman
  2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman
  0 siblings, 2 replies; 40+ messages in thread
From: Mel Gorman @ 2010-10-27 8:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
	KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman

The following two patches are in response to a bug report by Shaohua Li
where the amount of time spent in zone_nr_free_pages() is unacceptable
for large machines. All the background is in the first patch's leader.
The second patch replaces two setter functions with a single function
that takes a callback as a parameter.

Mel Gorman (2):
  mm: page allocator: Adjust the per-cpu counter threshold when memory
    is low
  mm: vmstat: Use a single setter function and callback for adjusting
    percpu thresholds

 include/linux/mmzone.h |   10 +++-------
 include/linux/vmstat.h |    7 +++++++
 mm/mmzone.c            |   21 ---------------------
 mm/page_alloc.c        |   35 +++++++++++++++++++++++++++--------
 mm/vmscan.c            |   25 +++++++++++++++----------
 mm/vmstat.c            |   32 +++++++++++++++++++++++++++---
 6 files changed, 81 insertions(+), 49 deletions(-)

^ permalink raw reply	[flat|nested] 40+ messages in thread
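[Patch 2/2 itself does not appear in this part of the thread. As a rough
illustration of the refactoring the cover letter describes - one setter
that walks the pgdat's zones and applies whatever threshold a supplied
callback computes - the sketch below shows the general shape such a
function could take. The typedef name is invented for illustration;
set_pgdat_percpu_threshold(), calculate_pressure_threshold() and
calculate_normal_threshold() are the names that show up later in the
thread, and the choice of CPU iterator is debated there too, so treat
this as a sketch rather than the merged implementation.]

/*
 * Illustrative sketch only: a single setter parameterised by a
 * callback that computes the per-cpu stat threshold for a zone.
 * The typedef name is made up for this sketch.
 */
typedef int (*calculate_thresholds_fn)(struct zone *zone);

void set_pgdat_percpu_threshold(pg_data_t *pgdat,
				calculate_thresholds_fn calculate)
{
	struct zone *zone;
	int cpu, threshold, i;

	for (i = 0; i < pgdat->nr_zones; i++) {
		zone = &pgdat->node_zones[i];

		/* Zones without a drift mark keep their normal threshold */
		if (!zone->percpu_drift_mark)
			continue;

		threshold = calculate(zone);
		/* The thread later debates online vs possible iteration */
		for_each_online_cpu(cpu)
			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
							= threshold;
	}
}

[kswapd would then pick the policy by passing a different callback:
calculate_pressure_threshold() before it starts reclaiming and
calculate_normal_threshold() before it goes back to sleep.]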
* [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-27 8:47 [PATCH 0/2] Reduce the amount of time spent in watermark-related functions Mel Gorman
@ 2010-10-27 8:47 ` Mel Gorman
  2010-10-27 20:16 ` Christoph Lameter
                    ` (3 more replies)
  2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman
  1 sibling, 4 replies; 40+ messages in thread
From: Mel Gorman @ 2010-10-27 8:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
	KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman

Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
memory is low] noted that watermarks were based on the vmstat
NR_FREE_PAGES. To avoid synchronization overhead, these counters are
maintained on a per-cpu basis and drained both periodically and when
the per-cpu delta rises above a threshold. On systems with many CPUs,
the difference between the estimated and real value of NR_FREE_PAGES
can be very high. The system can get into a state where pages are
allocated far below the min watermark, potentially causing livelock
issues. The commit solved the problem by taking a better reading of
NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li, this accurate reading can
consume a large amount of CPU time on systems with many sockets due to
cache line bouncing. This patch takes a different approach. For large
machines where counter drift might be unsafe, the per-cpu thresholds
for the target pgdat are reduced while kswapd is awake to limit the
level of drift to what should be a safe level. This incurs a performance
penalty under heavy memory pressure by a factor that depends on the
workload and the machine, but the machine should function correctly
without accidentally exhausting all memory on a node. There is an
additional cost when kswapd wakes and sleeps, but the event is not
expected to be frequent - in Shaohua's test case at least, only one
sleep and wake event was recorded.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok()
is introduced that takes a more accurate reading of NR_FREE_PAGES when
called from wakeup_kswapd, when deciding whether it is really safe to go
back to sleep in sleeping_prematurely() and when deciding if a zone is
really balanced or not in balance_pgdat(). We are still using an expensive
function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark functions
is reduced. The following report shows the percentage of time cumulatively
spent in the functions zone_nr_free_pages(), zone_watermark_ok(),
__zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot()
and zone_page_state().
vanilla			11.6615%
disable-threshold	 0.2584%

Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |   10 +++-------
 include/linux/vmstat.h |    5 +++++
 mm/mmzone.c            |   21 ---------------------
 mm/page_alloc.c        |   35 +++++++++++++++++++++++++++--------
 mm/vmscan.c            |   23 +++++++++++++----------
 mm/vmstat.c            |   46 +++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 93 insertions(+), 47 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3984c4e..8d789d7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
-#ifdef CONFIG_SMP
-unsigned long zone_nr_free_pages(struct zone *zone);
-#else
-#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
-#endif /* CONFIG_SMP */
-
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
@@ -651,7 +645,9 @@ typedef struct pglist_data {
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(void *data);
 void wakeup_kswapd(struct zone *zone, int order);
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		int classzone_idx, int alloc_flags);
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
 	MEMMAP_EARLY,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index eaaea37..e4cc21c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void refresh_cpu_vm_stats(int);
+void reduce_pgdat_percpu_threshold(pg_data_t *pgdat);
+void restore_pgdat_percpu_threshold(pg_data_t *pgdat);
 #else /* CONFIG_SMP */
 
 /*
@@ -298,6 +300,9 @@ static inline void __dec_zone_page_state(struct page *page,
 #define dec_zone_page_state __dec_zone_page_state
 #define mod_zone_page_state __mod_zone_page_state
 
+static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { }
+static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { }
+
 static inline void refresh_cpu_vm_stats(int cpu) { }
 #endif
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index e35bfb8..f5b7d17 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
-
-#ifdef CONFIG_SMP
-/* Called when a more accurate view of NR_FREE_PAGES is needed */
-unsigned long zone_nr_free_pages(struct zone *zone)
-{
-	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
-
-	/*
-	 * While kswapd is awake, it is considered the zone is under some
-	 * memory pressure. Under pressure, there is a risk that
-	 * per-cpu-counter-drift will allow the min watermark to be breached
-	 * potentially causing a live-lock. While kswapd is awake and
-	 * free pages are low, get a better estimate for free pages
-	 */
-	if (nr_free_pages < zone->percpu_drift_mark &&
-			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
-		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
-
-	return nr_free_pages;
-}
-#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f12ad18..0286150 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1454,24 +1454,24 @@ static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
 /*
- * Return 1 if free pages are above 'mark'. This takes into account the order
+ * Return true if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags)
+static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags, long free_pages)
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
+	free_pages -= (1 << order) + 1;
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
-		return 0;
+		return false;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
 		free_pages -= z->free_area[o].nr_free << o;
@@ -1480,9 +1480,28 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		min >>= 1;
 
 		if (free_pages <= min)
-			return 0;
+			return false;
 	}
-	return 1;
+	return true;
+}
+
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+					zone_page_state(z, NR_FREE_PAGES));
+}
+
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	long free_pages = zone_page_state(z, NR_FREE_PAGES);
+
+	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
+		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
+
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+								free_pages);
 }
 
 #ifdef CONFIG_NUMA
@@ -2436,7 +2455,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_nr_free_pages(zone)),
+			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..3e71cb1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 		if (zone->all_unreclaimable)
 			continue;
 
-		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
+		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
 								0, 0))
 			return 1;
 	}
@@ -2169,7 +2169,7 @@ loop_again:
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
 				break;
@@ -2215,7 +2215,7 @@ loop_again:
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.
 			 */
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
 					8*high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
@@ -2236,7 +2236,7 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
 				/*
@@ -2244,7 +2244,7 @@ loop_again:
 				 * means that we have a GFP_ATOMIC allocation
 				 * failure risk. Hurry up!
 				 */
-				if (!zone_watermark_ok(zone, order,
+				if (!zone_watermark_ok_safe(zone, order,
 					    min_wmark_pages(zone), end_zone, 0))
 					has_under_min_watermark_zone = 1;
 			}
@@ -2378,7 +2378,9 @@ static int kswapd(void *p)
 				 */
 				if (!sleeping_prematurely(pgdat, order, remaining)) {
 					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+					restore_pgdat_percpu_threshold(pgdat);
 					schedule();
+					reduce_pgdat_percpu_threshold(pgdat);
 				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2417,16 +2419,17 @@ void wakeup_kswapd(struct zone *zone, int order)
 	if (!populated_zone(zone))
 		return;
 
-	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
+	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
+	pgdat = zone->zone_pgdat;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
-	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
+	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+		return;
+
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 355a9e6..cafcc2d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
 
+static int calculate_pressure_threshold(struct zone *zone)
+{
+	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
+				num_online_cpus())));
+}
+
 static int calculate_threshold(struct zone *zone)
 {
 	int threshold;
@@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
 	}
 }
 
+void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_pressure_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
+void restore_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
 /*
  * For use when we know that interrupts are disabled.
  */
@@ -826,7 +870,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_nr_free_pages(zone),
+		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman @ 2010-10-27 20:16 ` Christoph Lameter 2010-10-28 1:09 ` KAMEZAWA Hiroyuki ` (2 subsequent siblings) 3 siblings, 0 replies; 40+ messages in thread From: Christoph Lameter @ 2010-10-27 20:16 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM Reviewed-by: Christoph Lameter <cl@linux.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman 2010-10-27 20:16 ` Christoph Lameter @ 2010-10-28 1:09 ` KAMEZAWA Hiroyuki 2010-10-28 9:49 ` Mel Gorman 2010-11-14 8:53 ` [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu KOSAKI Motohiro 2010-11-01 7:06 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low KOSAKI Motohiro 2010-11-26 16:06 ` Kyle McMartin 3 siblings, 2 replies; 40+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-10-28 1:09 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, LKML, Linux-MM On Wed, 27 Oct 2010 09:47:35 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when > memory is low] noted that watermarks were based on the vmstat > NR_FREE_PAGES. To avoid synchronization overhead, these counters are > maintained on a per-cpu basis and drained both periodically and when a > threshold is above a threshold. On large CPU systems, the difference > between the estimate and real value of NR_FREE_PAGES can be very high. > The system can get into a case where pages are allocated far below the > min watermark potentially causing livelock issues. The commit solved the > problem by taking a better reading of NR_FREE_PAGES when memory was low. > > Unfortunately, as reported by Shaohua Li this accurate reading can consume > a large amount of CPU time on systems with many sockets due to cache > line bouncing. This patch takes a different approach. For large machines > where counter drift might be unsafe and while kswapd is awake, the per-cpu > thresholds for the target pgdat are reduced to limit the level of drift > to what should be a safe level. This incurs a performance penalty in heavy > memory pressure by a factor that depends on the workload and the machine but > the machine should function correctly without accidentally exhausting all > memory on a node. There is an additional cost when kswapd wakes and sleeps > but the event is not expected to be frequent - in Shaohua's test case, > there was one recorded sleep and wake event at least. > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > is introduced that takes a more accurate reading of NR_FREE_PAGES when > called from wakeup_kswapd, when deciding whether it is really safe to go > back to sleep in sleeping_prematurely() and when deciding if a zone is > really balanced or not in balance_pgdat(). We are still using an expensive > function but limiting how often it is called. > > When the test case is reproduced, the time spent in the watermark functions > is reduced. The following report is on the percentage of time spent > cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), > __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), > zone_page_state(). 
> > vanilla 11.6615% > disable-threshold 0.2584% > > Reported-by: Shaohua Li <shaohua.li@intel.com> > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > --- > include/linux/mmzone.h | 10 +++------- > include/linux/vmstat.h | 5 +++++ > mm/mmzone.c | 21 --------------------- > mm/page_alloc.c | 35 +++++++++++++++++++++++++++-------- > mm/vmscan.c | 23 +++++++++++++---------- > mm/vmstat.c | 46 +++++++++++++++++++++++++++++++++++++++++++++- > 6 files changed, 93 insertions(+), 47 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 3984c4e..8d789d7 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone) > return test_bit(ZONE_OOM_LOCKED, &zone->flags); > } > > -#ifdef CONFIG_SMP > -unsigned long zone_nr_free_pages(struct zone *zone); > -#else > -#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES) > -#endif /* CONFIG_SMP */ > - > /* > * The "priority" of VM scanning is how much of the queues we will scan in one > * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the > @@ -651,7 +645,9 @@ typedef struct pglist_data { > extern struct mutex zonelists_mutex; > void build_all_zonelists(void *data); > void wakeup_kswapd(struct zone *zone, int order); > -int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > +bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > + int classzone_idx, int alloc_flags); > +bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > int classzone_idx, int alloc_flags); > enum memmap_context { > MEMMAP_EARLY, > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h > index eaaea37..e4cc21c 100644 > --- a/include/linux/vmstat.h > +++ b/include/linux/vmstat.h > @@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item); > extern void __dec_zone_state(struct zone *, enum zone_stat_item); > > void refresh_cpu_vm_stats(int); > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat); > +void restore_pgdat_percpu_threshold(pg_data_t *pgdat); > #else /* CONFIG_SMP */ > > /* > @@ -298,6 +300,9 @@ static inline void __dec_zone_page_state(struct page *page, > #define dec_zone_page_state __dec_zone_page_state > #define mod_zone_page_state __mod_zone_page_state > > +static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { } > +static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { } > + > static inline void refresh_cpu_vm_stats(int cpu) { } > #endif > > diff --git a/mm/mmzone.c b/mm/mmzone.c > index e35bfb8..f5b7d17 100644 > --- a/mm/mmzone.c > +++ b/mm/mmzone.c > @@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn, > return 1; > } > #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */ > - > -#ifdef CONFIG_SMP > -/* Called when a more accurate view of NR_FREE_PAGES is needed */ > -unsigned long zone_nr_free_pages(struct zone *zone) > -{ > - unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES); > - > - /* > - * While kswapd is awake, it is considered the zone is under some > - * memory pressure. Under pressure, there is a risk that > - * per-cpu-counter-drift will allow the min watermark to be breached > - * potentially causing a live-lock. 
While kswapd is awake and > - * free pages are low, get a better estimate for free pages > - */ > - if (nr_free_pages < zone->percpu_drift_mark && > - !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) > - return zone_page_state_snapshot(zone, NR_FREE_PAGES); > - > - return nr_free_pages; > -} > -#endif /* CONFIG_SMP */ > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index f12ad18..0286150 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1454,24 +1454,24 @@ static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) > #endif /* CONFIG_FAIL_PAGE_ALLOC */ > > /* > - * Return 1 if free pages are above 'mark'. This takes into account the order > + * Return true if free pages are above 'mark'. This takes into account the order > * of the allocation. > */ > -int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > - int classzone_idx, int alloc_flags) > +static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, > + int classzone_idx, int alloc_flags, long free_pages) > { > /* free_pages my go negative - that's OK */ > long min = mark; > - long free_pages = zone_nr_free_pages(z) - (1 << order) + 1; > int o; > > + free_pages -= (1 << order) + 1; > if (alloc_flags & ALLOC_HIGH) > min -= min / 2; > if (alloc_flags & ALLOC_HARDER) > min -= min / 4; > > if (free_pages <= min + z->lowmem_reserve[classzone_idx]) > - return 0; > + return false; > for (o = 0; o < order; o++) { > /* At the next order, this order's pages become unavailable */ > free_pages -= z->free_area[o].nr_free << o; > @@ -1480,9 +1480,28 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > min >>= 1; > > if (free_pages <= min) > - return 0; > + return false; > } > - return 1; > + return true; > +} > + > +bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > + int classzone_idx, int alloc_flags) > +{ > + return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, > + zone_page_state(z, NR_FREE_PAGES)); > +} > + > +bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > + int classzone_idx, int alloc_flags) > +{ > + long free_pages = zone_page_state(z, NR_FREE_PAGES); > + > + if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) > + free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); > + > + return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, > + free_pages); > } > > #ifdef CONFIG_NUMA > @@ -2436,7 +2455,7 @@ void show_free_areas(void) > " all_unreclaimable? %s" > "\n", > zone->name, > - K(zone_nr_free_pages(zone)), > + K(zone_page_state(zone, NR_FREE_PAGES)), > K(min_wmark_pages(zone)), > K(low_wmark_pages(zone)), > K(high_wmark_pages(zone)), > diff --git a/mm/vmscan.c b/mm/vmscan.c > index c5dfabf..3e71cb1 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining) > if (zone->all_unreclaimable) > continue; > > - if (!zone_watermark_ok(zone, order, high_wmark_pages(zone), > + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), > 0, 0)) > return 1; > } > @@ -2169,7 +2169,7 @@ loop_again: > shrink_active_list(SWAP_CLUSTER_MAX, zone, > &sc, priority, 0); > > - if (!zone_watermark_ok(zone, order, > + if (!zone_watermark_ok_safe(zone, order, > high_wmark_pages(zone), 0, 0)) { > end_zone = i; > break; > @@ -2215,7 +2215,7 @@ loop_again: > * We put equal pressure on every zone, unless one > * zone has way too many pages free already. 
> */ > - if (!zone_watermark_ok(zone, order, > + if (!zone_watermark_ok_safe(zone, order, > 8*high_wmark_pages(zone), end_zone, 0)) > shrink_zone(priority, zone, &sc); > reclaim_state->reclaimed_slab = 0; > @@ -2236,7 +2236,7 @@ loop_again: > total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) > sc.may_writepage = 1; > > - if (!zone_watermark_ok(zone, order, > + if (!zone_watermark_ok_safe(zone, order, > high_wmark_pages(zone), end_zone, 0)) { > all_zones_ok = 0; > /* > @@ -2244,7 +2244,7 @@ loop_again: > * means that we have a GFP_ATOMIC allocation > * failure risk. Hurry up! > */ > - if (!zone_watermark_ok(zone, order, > + if (!zone_watermark_ok_safe(zone, order, > min_wmark_pages(zone), end_zone, 0)) > has_under_min_watermark_zone = 1; > } > @@ -2378,7 +2378,9 @@ static int kswapd(void *p) > */ > if (!sleeping_prematurely(pgdat, order, remaining)) { > trace_mm_vmscan_kswapd_sleep(pgdat->node_id); > + restore_pgdat_percpu_threshold(pgdat); > schedule(); > + reduce_pgdat_percpu_threshold(pgdat); > } else { > if (remaining) > count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); > @@ -2417,16 +2419,17 @@ void wakeup_kswapd(struct zone *zone, int order) > if (!populated_zone(zone)) > return; > > - pgdat = zone->zone_pgdat; > - if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) > + if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) > return; > + pgdat = zone->zone_pgdat; > if (pgdat->kswapd_max_order < order) > pgdat->kswapd_max_order = order; > - trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); > - if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) > - return; > if (!waitqueue_active(&pgdat->kswapd_wait)) > return; > + if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0)) > + return; > + > + trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); > wake_up_interruptible(&pgdat->kswapd_wait); > } > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 355a9e6..cafcc2d 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat); > > #ifdef CONFIG_SMP > > +static int calculate_pressure_threshold(struct zone *zone) > +{ > + return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) / > + num_online_cpus()))); > +} > + Could you add background theory of this calculation as a comment to show the difference with calculate_threshold() ? And don't we need to have "max=125" thresh here ? > static int calculate_threshold(struct zone *zone) > { > int threshold; > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void) > } > } > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + int i; > + get_online_cpus(); > + for (i = 0; i < pgdat->nr_zones; i++) { > + zone = &pgdat->node_zones[i]; > + if (!zone->percpu_drift_mark) > + continue; > + > + threshold = calculate_pressure_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > + } put_online_cpus(); > +} > + > +void restore_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + int i; > + get_online_cpus(); > + for (i = 0; i < pgdat->nr_zones; i++) { > + zone = &pgdat->node_zones[i]; > + if (!zone->percpu_drift_mark) > + continue; > + > + threshold = calculate_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > + } put_online_cpus(); Thanks, -Kame ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-28 1:09 ` KAMEZAWA Hiroyuki
@ 2010-10-28 9:49 ` Mel Gorman
  2010-10-28 9:58 ` KAMEZAWA Hiroyuki
  2010-11-14 8:53 ` [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu KOSAKI Motohiro
  1 sibling, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2010-10-28 9:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter,
	David Rientjes, LKML, Linux-MM

On Thu, Oct 28, 2010 at 10:09:20AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 27 Oct 2010 09:47:35 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
> > memory is low] noted that watermarks were based on the vmstat
> > NR_FREE_PAGES. To avoid synchronization overhead, these counters are
> > maintained on a per-cpu basis and drained both periodically and when
> > the per-cpu delta rises above a threshold. On systems with many CPUs,
> > the difference between the estimated and real value of NR_FREE_PAGES
> > can be very high. The system can get into a state where pages are
> > allocated far below the min watermark, potentially causing livelock
> > issues. The commit solved the problem by taking a better reading of
> > NR_FREE_PAGES when memory was low.
> > 
> > <SNIP>
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 355a9e6..cafcc2d 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
> >  
> >  #ifdef CONFIG_SMP
> >  
> > +static int calculate_pressure_threshold(struct zone *zone)
> > +{
> > +	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
> > +				num_online_cpus())));
> > +}
> > +
> 
> Could you add background theory of this calculation as a comment to
> show the difference with calculate_threshold() ?
> 

Sure. When writing it, I realised that the calculations here differ from
what percpu_drift_mark does. This is what I currently have

int calculate_pressure_threshold(struct zone *zone)
{
	int threshold;
	int watermark_distance;

	/*
	 * As vmstats are not up to date, there is drift between the estimated
	 * and real values. For high thresholds and a high number of CPUs, it
	 * is possible for the min watermark to be breached while the estimated
	 * value looks fine. The pressure threshold is a reduced value such
	 * that even the maximum amount of drift will not accidentally breach
	 * the min watermark
	 */
	watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
	threshold = max(1, watermark_distance / num_online_cpus());

	/*
	 * Maximum threshold is 125
	 */
	threshold = min(125, threshold);

	return threshold;
}

Is this better?

> And don't we need to have "max=125" thresh here ?
> 

Yes.

> > static int calculate_threshold(struct zone *zone)
> > {
> > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
> >  	}
> >  }
> >  
> > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +	int i;
> > +
> 
> get_online_cpus();
> 

Also correct. Thanks very much. I'm revising the series.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-28 9:49 ` Mel Gorman @ 2010-10-28 9:58 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 40+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-10-28 9:58 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, LKML, Linux-MM On Thu, 28 Oct 2010 10:49:03 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > On Thu, Oct 28, 2010 at 10:09:20AM +0900, KAMEZAWA Hiroyuki wrote: > > On Wed, 27 Oct 2010 09:47:35 +0100 > > Mel Gorman <mel@csn.ul.ie> wrote: > > > > > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when > > > memory is low] noted that watermarks were based on the vmstat > > > NR_FREE_PAGES. To avoid synchronization overhead, these counters are > > > maintained on a per-cpu basis and drained both periodically and when a > > > threshold is above a threshold. On large CPU systems, the difference > > > between the estimate and real value of NR_FREE_PAGES can be very high. > > > The system can get into a case where pages are allocated far below the > > > min watermark potentially causing livelock issues. The commit solved the > > > problem by taking a better reading of NR_FREE_PAGES when memory was low. > > > > > > <SNIP> > > > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > index 355a9e6..cafcc2d 100644 > > > --- a/mm/vmstat.c > > > +++ b/mm/vmstat.c > > > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat); > > > > > > #ifdef CONFIG_SMP > > > > > > +static int calculate_pressure_threshold(struct zone *zone) > > > +{ > > > + return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) / > > > + num_online_cpus()))); > > > +} > > > + > > > > Could you add background theory of this calculation as a comment to > > show the difference with calculate_threshold() ? > > > > Sure. When writing it, I realised that the calculations here differ from > what percpu_drift_mark does. This is what I currently have > > int calculate_pressure_threshold(struct zone *zone) > { > int threshold; > int watermark_distance; > > /* > * As vmstats are not up to date, there is drift between the estimated > * and real values. For high thresholds and a high number of CPUs, it > * is possible for the min watermark to be breached while the estimated > * value looks fine. The pressure threshold is a reduced value such > * that even the maximum amount of drift will not accidentally breach > * the min watermark > */ > watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone); > threshold = max(1, watermark_distance / num_online_cpus()); > > /* > * Maximum threshold is 125 > */ > threshold = min(125, threshold); > > return threshold; > } > > Is this better? > sounds nice. Regards, -Kame ^ permalink raw reply [flat|nested] 40+ messages in thread
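[To make the revised calculation concrete with made-up numbers: for a
zone with min_wmark_pages() = 1024 and low_wmark_pages() = 1280 on a
machine with 64 online CPUs, watermark_distance = 1280 - 1024 = 256
pages and threshold = max(1, 256 / 64) = 4, comfortably under the 125
cap. Each CPU may then accumulate at most a 4-page delta before folding
it into the global counter, so the worst-case drift across all 64 CPUs
is 4 * 64 = 256 pages - exactly the low-to-min distance - and an
estimate that still looks acceptable at the low watermark cannot hide a
real value below the min watermark. Note that the expression as first
posted, max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
num_online_cpus()))), divides only low_wmark_pages() by the CPU count
because of the parenthesis placement; the revised version computes the
low-to-min distance first and divides that.]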
* [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu
  2010-10-28 1:09 ` KAMEZAWA Hiroyuki
  2010-10-28 9:49 ` Mel Gorman
@ 2010-11-14 8:53 ` KOSAKI Motohiro
  2010-11-15 10:26 ` Mel Gorman
  2010-11-17 0:07 ` Andrew Morton
  1 sibling, 2 replies; 40+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14 8:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: kosaki.motohiro, Mel Gorman, Andrew Morton, Shaohua Li,
	Christoph Lameter, David Rientjes, LKML, Linux-MM

> > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
> >  	}
> >  }
> > 
> > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +	int i;
> > +
> 
> get_online_cpus();

This caused the following runtime warning, but I don't think there is a
real lock inversion here.

=================================
[ INFO: inconsistent lock state ]
2.6.37-rc1-mm1+ #150
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60
{RECLAIM_FS-ON-W} state was registered at:
  [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0
  [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100
  [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0
  [<ffffffff812eea10>] idr_pre_get+0x60/0x90
  [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0
  [<ffffffff8106ebf5>] create_worker+0x55/0x190
  [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235
  [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0
  [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10
  [<ffffffff81051f30>] __cpu_notify+0x20/0x40
  [<ffffffff8150bff7>] _cpu_up+0x73/0x113
  [<ffffffff8150c175>] cpu_up+0xde/0xf1
  [<ffffffff81dcc81d>] kernel_init+0x21b/0x342
  [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
irq event stamp: 27
hardirqs last enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80
hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0
softirqs last enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0
softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0

other info that might help us debug this:
no locks held by kswapd0/419.

stack backtrace:
Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150
Call Trace:
 [<ffffffff810890b1>] print_usage_bug+0x171/0x180
 [<ffffffff8108a057>] mark_lock+0x377/0x450
 [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0
 [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
 [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
 [<ffffffff8108bf94>] lock_acquire+0xb4/0x150
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90
 [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
 [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60
 [<ffffffff810520d1>] get_online_cpus+0x41/0x60
 [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0
 [<ffffffff81113970>] ? calculate_normal_threshold+0x0/0x60
 [<ffffffff8110b552>] kswapd+0x1f2/0x360
 [<ffffffff81074180>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8110b360>] ? kswapd+0x0/0x360
 [<ffffffff81073ae6>] kthread+0xa6/0xb0
 [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
 [<ffffffff81515710>] ? restore_args+0x0/0x30
 [<ffffffff81073a40>] ? kthread+0x0/0xb0
 [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10

I think we have two options: 1) call lockdep_clear_current_reclaim_state()
every time, or 2) use for_each_possible_cpu instead of for_each_online_cpu.

The following patch uses (2) because removing get_online_cpus() has a
good side effect: it reduces the potential cpu-hotplug vs memory-shortage
deadlock risk.

-------------------------------------------------------------------------
>From 74b809353c42a440d0bac6b83ac84281299bb09e Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Fri, 3 Dec 2010 20:21:40 +0900
Subject: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu

This patch fixes the following lockdep warning.

=================================
[ INFO: inconsistent lock state ]
2.6.37-rc1-mm1+ #150
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60
{RECLAIM_FS-ON-W} state was registered at:
  [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0
  [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100
  [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0
  [<ffffffff812eea10>] idr_pre_get+0x60/0x90
  [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0
  [<ffffffff8106ebf5>] create_worker+0x55/0x190
  [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235
  [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0
  [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10
  [<ffffffff81051f30>] __cpu_notify+0x20/0x40
  [<ffffffff8150bff7>] _cpu_up+0x73/0x113
  [<ffffffff8150c175>] cpu_up+0xde/0xf1
  [<ffffffff81dcc81d>] kernel_init+0x21b/0x342
  [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
irq event stamp: 27
hardirqs last enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80
hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0
softirqs last enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0
softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0

other info that might help us debug this:
no locks held by kswapd0/419.

stack backtrace:
Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150
Call Trace:
 [<ffffffff810890b1>] print_usage_bug+0x171/0x180
 [<ffffffff8108a057>] mark_lock+0x377/0x450
 [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0
 [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
 [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
 [<ffffffff8108bf94>] lock_acquire+0xb4/0x150
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90
 [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
 [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60
 [<ffffffff810520d1>] get_online_cpus+0x41/0x60
 [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0
 [<ffffffff81113970>] ? calculate_normal_threshold+0x0/0x60
 [<ffffffff8110b552>] kswapd+0x1f2/0x360
 [<ffffffff81074180>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8110b360>] ? kswapd+0x0/0x360
 [<ffffffff81073ae6>] kthread+0xa6/0xb0
 [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
 [<ffffffff81515710>] ? restore_args+0x0/0x30
 [<ffffffff81073a40>] ? kthread+0x0/0xb0
 [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmstat.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2ab01f2..ca2d3be 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -193,18 +193,16 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
 	int threshold;
 	int i;
 
-	get_online_cpus();
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		zone = &pgdat->node_zones[i];
 		if (!zone->percpu_drift_mark)
 			continue;
 
 		threshold = (*calculate_pressure)(zone);
-		for_each_online_cpu(cpu)
+		for_each_possible_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
 	}
-	put_online_cpus();
 }
 
 /*
-- 
1.6.5.2

^ permalink raw reply related	[flat|nested] 40+ messages in thread
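[For comparison, option (1) above - keeping get_online_cpus() but telling
lockdep that the threshold update is not part of reclaim - was never
posted as a patch. The sketch below shows roughly what it might have
looked like, assuming the function is only ever called from kswapd,
which, as noted later in the thread, marks itself with
lockdep_set_current_reclaim_state(GFP_KERNEL) at startup. This is the
rejected alternative, not what was merged.]

/*
 * Sketch of rejected option (1): drop the caller's lockdep reclaim
 * marking around the cpu-hotplug lock so that taking cpu_hotplug.lock
 * here is not flagged as happening in reclaim context, then restore
 * the marking afterwards. Assumes the caller is kswapd and therefore
 * marked with GFP_KERNEL.
 */
void set_pgdat_percpu_threshold(pg_data_t *pgdat,
				int (*calculate_pressure)(struct zone *))
{
	struct zone *zone;
	int cpu;
	int threshold;
	int i;

	lockdep_clear_current_reclaim_state();
	get_online_cpus();
	for (i = 0; i < pgdat->nr_zones; i++) {
		zone = &pgdat->node_zones[i];
		if (!zone->percpu_drift_mark)
			continue;

		threshold = (*calculate_pressure)(zone);
		for_each_online_cpu(cpu)
			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
							= threshold;
	}
	put_online_cpus();
	lockdep_set_current_reclaim_state(GFP_KERNEL);
}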
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu 2010-11-14 8:53 ` [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu KOSAKI Motohiro @ 2010-11-15 10:26 ` Mel Gorman 2010-11-15 14:04 ` Christoph Lameter 2010-11-17 0:07 ` Andrew Morton 1 sibling, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-11-15 10:26 UTC (permalink / raw) To: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki, Andrew Morton, Shaohua Li, Christoph Lameter, David Rientjes, LKML, Linux-MM On Sun, Nov 14, 2010 at 05:53:03PM +0900, KOSAKI Motohiro wrote: > > > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void) > > > } > > > } > > > > > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) > > > +{ > > > + struct zone *zone; > > > + int cpu; > > > + int threshold; > > > + int i; > > > + > > > > get_online_cpus(); > > > This caused following runtime warnings. but I don't think here is > real lock inversion. > > ================================= > [ INFO: inconsistent lock state ] > 2.6.37-rc1-mm1+ #150 > --------------------------------- > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes: > (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60 > {RECLAIM_FS-ON-W} state was registered at: > [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0 > [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100 > [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0 > [<ffffffff812eea10>] idr_pre_get+0x60/0x90 > [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0 > [<ffffffff8106ebf5>] create_worker+0x55/0x190 > [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235 > [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0 > [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10 > [<ffffffff81051f30>] __cpu_notify+0x20/0x40 > [<ffffffff8150bff7>] _cpu_up+0x73/0x113 > [<ffffffff8150c175>] cpu_up+0xde/0xf1 > [<ffffffff81dcc81d>] kernel_init+0x21b/0x342 > [<ffffffff81003724>] kernel_thread_helper+0x4/0x10 > irq event stamp: 27 > hardirqs last enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80 > hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0 > softirqs last enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0 > softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0 > > other info that might help us debug this: > no locks held by kswapd0/419. > > stack backtrace: > Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150 > Call Trace: > [<ffffffff810890b1>] print_usage_bug+0x171/0x180 > [<ffffffff8108a057>] mark_lock+0x377/0x450 > [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0 > [<ffffffff8107af0f>] ? local_clock+0x6f/0x80 > [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150 > [<ffffffff8108bf94>] lock_acquire+0xb4/0x150 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90 > [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10 > [<ffffffff8107af0f>] ? local_clock+0x6f/0x80 > [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60 > [<ffffffff810520d1>] get_online_cpus+0x41/0x60 > [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0 > [<ffffffff81113970>] ? calculate_normal_threshold+0x0/0x60 > [<ffffffff8110b552>] kswapd+0x1f2/0x360 > [<ffffffff81074180>] ? 
autoremove_wake_function+0x0/0x40
> [<ffffffff8110b360>] ? kswapd+0x0/0x360
> [<ffffffff81073ae6>] kthread+0xa6/0xb0
> [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
> [<ffffffff81515710>] ? restore_args+0x0/0x30
> [<ffffffff81073a40>] ? kthread+0x0/0xb0
> [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10
> 
> I think we have two options: 1) call lockdep_clear_current_reclaim_state()
> every time, or 2) use for_each_possible_cpu instead of for_each_online_cpu.
> 
> The following patch uses (2) because removing get_online_cpus() has a
> good side effect: it reduces the potential cpu-hotplug vs memory-shortage
> deadlock risk.
> 

With recent per-cpu allocator changes, are we guaranteed that the per-cpu
structures exist and are valid?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu 2010-11-15 10:26 ` Mel Gorman @ 2010-11-15 14:04 ` Christoph Lameter 2010-11-16 9:58 ` Mel Gorman 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2010-11-15 14:04 UTC (permalink / raw) To: Mel Gorman Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Shaohua Li, David Rientjes, LKML, Linux-MM, Tejun Heo On Mon, 15 Nov 2010, Mel Gorman wrote: > With recent per-cpu allocator changes, are we guaranteed that the per-cpu > structures exist and are valid? We always guarantee that all per cpu areas for all possible cpus exist. That has always been the case. There was a discussion about changing that though. Could be difficult given the need for additional locking. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu
  2010-11-15 14:04 ` Christoph Lameter
@ 2010-11-16 9:58 ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2010-11-16 9:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Shaohua Li,
	David Rientjes, LKML, Linux-MM, Tejun Heo

On Mon, Nov 15, 2010 at 08:04:23AM -0600, Christoph Lameter wrote:
> On Mon, 15 Nov 2010, Mel Gorman wrote:
> 
> > With recent per-cpu allocator changes, are we guaranteed that the per-cpu
> > structures exist and are valid?
> 
> We always guarantee that all per cpu areas for all possible cpus exist.
> That has always been the case. There was a discussion about changing
> that though. Could be difficult given the need for additional locking.
> 

In that case, I do not have any more concerns about the patch. It's
unfortunate that more per-cpu structures will have to be updated but I
doubt it'll be noticeable.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu 2010-11-14 8:53 ` [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu KOSAKI Motohiro 2010-11-15 10:26 ` Mel Gorman @ 2010-11-17 0:07 ` Andrew Morton 2010-11-19 15:29 ` Christoph Lameter 2010-11-23 8:32 ` KOSAKI Motohiro 1 sibling, 2 replies; 40+ messages in thread From: Andrew Morton @ 2010-11-17 0:07 UTC (permalink / raw) To: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki, Mel Gorman, Shaohua Li, Christoph Lameter, David Rientjes, LKML, Linux-MM On Sun, 14 Nov 2010 17:53:03 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote: > > > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void) > > > } > > > } > > > > > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) > > > +{ > > > + struct zone *zone; > > > + int cpu; > > > + int threshold; > > > + int i; > > > + > > > > get_online_cpus(); > > > This caused following runtime warnings. but I don't think here is > real lock inversion. > > ================================= > [ INFO: inconsistent lock state ] > 2.6.37-rc1-mm1+ #150 > --------------------------------- > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes: > (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60 > {RECLAIM_FS-ON-W} state was registered at: > [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0 > [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100 > [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0 > [<ffffffff812eea10>] idr_pre_get+0x60/0x90 > [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0 > [<ffffffff8106ebf5>] create_worker+0x55/0x190 > [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235 > [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0 > [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10 > [<ffffffff81051f30>] __cpu_notify+0x20/0x40 > [<ffffffff8150bff7>] _cpu_up+0x73/0x113 > [<ffffffff8150c175>] cpu_up+0xde/0xf1 > [<ffffffff81dcc81d>] kernel_init+0x21b/0x342 > [<ffffffff81003724>] kernel_thread_helper+0x4/0x10 > irq event stamp: 27 > hardirqs last enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80 > hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0 > softirqs last enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0 > softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0 > > other info that might help us debug this: > no locks held by kswapd0/419. > > stack backtrace: > Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150 > Call Trace: > [<ffffffff810890b1>] print_usage_bug+0x171/0x180 > [<ffffffff8108a057>] mark_lock+0x377/0x450 > [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0 > [<ffffffff8107af0f>] ? local_clock+0x6f/0x80 > [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150 > [<ffffffff8108bf94>] lock_acquire+0xb4/0x150 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90 > [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10 > [<ffffffff8107af0f>] ? local_clock+0x6f/0x80 > [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60 > [<ffffffff810520d1>] get_online_cpus+0x41/0x60 > [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0 > [<ffffffff81113970>] ? 
calculate_normal_threshold+0x0/0x60
> [<ffffffff8110b552>] kswapd+0x1f2/0x360
> [<ffffffff81074180>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff8110b360>] ? kswapd+0x0/0x360
> [<ffffffff81073ae6>] kthread+0xa6/0xb0
> [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
> [<ffffffff81515710>] ? restore_args+0x0/0x30
> [<ffffffff81073a40>] ? kthread+0x0/0xb0
> [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10

Well what's actually happening here?  Where is the alleged deadlock?

In the kernel_init() case we have a GFP_KERNEL allocation inside
get_online_cpus().  In the other case we simply have kswapd calling
get_online_cpus(), yes?

Does lockdep consider all kswapd actions to be "in reclaim context"?
If so, why?

> > I think we have two options: 1) call lockdep_clear_current_reclaim_state()
> > every time, or 2) use for_each_possible_cpu instead of for_each_online_cpu.
> > 
> > The following patch uses (2) because removing get_online_cpus() has a
> > good side effect: it reduces the potential cpu-hotplug vs memory-shortage
> > deadlock risk.

Well.  Being able to run for_each_online_cpu() is a pretty low-level
and fundamental thing.  It's something we're likely to want to do more
and more of as time passes.  It seems a bad thing to tell ourselves
that we cannot use it in reclaim context.  That blots out large chunks
of filesystem and IO-layer code as well!

> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -193,18 +193,16 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
>  	int threshold;
>  	int i;
>  
> -	get_online_cpus();
>  	for (i = 0; i < pgdat->nr_zones; i++) {
>  		zone = &pgdat->node_zones[i];
>  		if (!zone->percpu_drift_mark)
>  			continue;
>  
>  		threshold = (*calculate_pressure)(zone);
> -		for_each_online_cpu(cpu)
> +		for_each_possible_cpu(cpu)
>  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>  							= threshold;
>  	}
> -	put_online_cpus();
>  }

That's a pretty sad change IMO, especially if num_possible_cpus is much
larger than num_online_cpus.

What do we need to do to make get_online_cpus() safe to use in reclaim
context?  (And in kswapd context, if that's really equivalent to
"reclaim context").

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu 2010-11-17 0:07 ` Andrew Morton @ 2010-11-19 15:29 ` Christoph Lameter 2010-11-23 8:32 ` KOSAKI Motohiro 1 sibling, 0 replies; 40+ messages in thread From: Christoph Lameter @ 2010-11-19 15:29 UTC (permalink / raw) To: Andrew Morton Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mel Gorman, Shaohua Li, David Rientjes, LKML, Linux-MM On Tue, 16 Nov 2010, Andrew Morton wrote: > > Following patch use (2) beucase removing get_online_cpus() makes good > > side effect. It reduce potentially cpu-hotplug vs memory-shortage deadlock > > risk. > > Well. Being able to run for_each_online_cpu() is a pretty low-level > and fundamental thing. It's something we're likely to want to do more > and more of as time passes. It seems a bad thing to tell ourselves > that we cannot use it in reclaim context. That blots out large chunks > of filesystem and IO-layer code as well! The online map can change if no locks were taken. Thus it becomes something difficult to do in some code paths and overhead increases significantly. > > threshold = (*calculate_pressure)(zone); > > - for_each_online_cpu(cpu) > > + for_each_possible_cpu(cpu) > > per_cpu_ptr(zone->pageset, cpu)->stat_threshold > > = threshold; > > } > > - put_online_cpus(); > > } > > That's a pretty sad change IMO, especially of num_possible_cpus is much > larger than num_online_cpus. num_possible_cpus should only be higher if the arch code has detected that the system has the ability to physically online and offline cpus. I have never actually seen such a system. Heard rumors from Fujitsu that they have something. Maybe the virtualization people also need this? Otherwise cpu online/offline is useful mainly to debug the cpu offline/online handling in various subsystems which is unsurprisingly often buggy given the rarity of encountering such hardware. > What do we need to do to make get_online_cpus() safe to use in reclaim > context? (And in kswapd context, if that's really equivalent to > "reclaim context"). I think its not worth the effort. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu
  2010-11-17 0:07 ` Andrew Morton
  2010-11-19 15:29 ` Christoph Lameter
@ 2010-11-23 8:32 ` KOSAKI Motohiro
  1 sibling, 0 replies; 40+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23 8:32 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Mel Gorman, Shaohua Li,
	Christoph Lameter, David Rientjes, LKML, Linux-MM

Sorry for the delay.

> Well what's actually happening here?  Where is the alleged deadlock?
> 
> In the kernel_init() case we have a GFP_KERNEL allocation inside
> get_online_cpus().  In the other case we simply have kswapd calling
> get_online_cpus(), yes?

Yes.

> Does lockdep consider all kswapd actions to be "in reclaim context"?
> If so, why?

kswapd calls lockdep_set_current_reclaim_state() when the thread starts.
See below.

----------------------------------------------------------------------
static int kswapd(void *p)
{
	unsigned long order;
	pg_data_t *pgdat = (pg_data_t*)p;
	struct task_struct *tsk = current;
	struct reclaim_state reclaim_state = {
		.reclaimed_slab = 0,
	};
	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

	lockdep_set_current_reclaim_state(GFP_KERNEL);
	......
----------------------------------------------------------------------

> > I think we have two options: 1) call lockdep_clear_current_reclaim_state()
> > every time, or 2) use for_each_possible_cpu instead of for_each_online_cpu.
> > 
> > The following patch uses (2) because removing get_online_cpus() has a
> > good side effect: it reduces the potential cpu-hotplug vs memory-shortage
> > deadlock risk.
> 
> Well.  Being able to run for_each_online_cpu() is a pretty low-level
> and fundamental thing.  It's something we're likely to want to do more
> and more of as time passes.  It seems a bad thing to tell ourselves
> that we cannot use it in reclaim context.  That blots out large chunks
> of filesystem and IO-layer code as well!
> 
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -193,18 +193,16 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
> >  	int threshold;
> >  	int i;
> >  
> > -	get_online_cpus();
> >  	for (i = 0; i < pgdat->nr_zones; i++) {
> >  		zone = &pgdat->node_zones[i];
> >  		if (!zone->percpu_drift_mark)
> >  			continue;
> >  
> >  		threshold = (*calculate_pressure)(zone);
> > -		for_each_online_cpu(cpu)
> > +		for_each_possible_cpu(cpu)
> >  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> >  							= threshold;
> >  	}
> > -	put_online_cpus();
> >  }
> 
> That's a pretty sad change IMO, especially if num_possible_cpus is much
> larger than num_online_cpus.

As far as I know, CPU hotplug is used in the server area, and almost all
servers have ACPI or a similar flexible firmware interface, so
num_possible_cpus is not much bigger than the actual number of sockets.
IOW, I haven't heard of embedded people using cpu hotplug. If you have,
please let me know.

> What do we need to do to make get_online_cpus() safe to use in reclaim
> context?  (And in kswapd context, if that's really equivalent to
> "reclaim context").

Hmm... It's too hard. kmalloc() is called from everywhere and cpu hotplug
can happen at any time, so any lock design breaks your requested rule. ;)

And again, _now_ I don't think for_each_possible_cpu() is very costly.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman 2010-10-27 20:16 ` Christoph Lameter 2010-10-28 1:09 ` KAMEZAWA Hiroyuki @ 2010-11-01 7:06 ` KOSAKI Motohiro 2010-11-26 16:06 ` Kyle McMartin 3 siblings, 0 replies; 40+ messages in thread From: KOSAKI Motohiro @ 2010-11-01 7:06 UTC (permalink / raw) To: Mel Gorman Cc: kosaki.motohiro, Andrew Morton, Shaohua Li, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > is introduced that takes a more accurate reading of NR_FREE_PAGES when > called from wakeup_kswapd, when deciding whether it is really safe to go > back to sleep in sleeping_prematurely() and when deciding if a zone is > really balanced or not in balance_pgdat(). We are still using an expensive > function but limiting how often it is called. > > When the test case is reproduced, the time spent in the watermark functions > is reduced. The following report is on the percentage of time > cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), > __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), > zone_page_state(). > > vanilla 11.6615% > disable-threshold 0.2584% > > Reported-by: Shaohua Li <shaohua.li@intel.com> > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Except for the piece kamezawa-san pointed out, Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman ` (2 preceding siblings ...) 2010-11-01 7:06 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low KOSAKI Motohiro @ 2010-11-26 16:06 ` Kyle McMartin 2010-11-29 9:56 ` Mel Gorman 3 siblings, 1 reply; 40+ messages in thread From: Kyle McMartin @ 2010-11-26 16:06 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Wed, Oct 27, 2010 at 09:47:35AM +0100, Mel Gorman wrote: ><snip> > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > is introduced that takes a more accurate reading of NR_FREE_PAGES when > called from wakeup_kswapd, when deciding whether it is really safe to go > back to sleep in sleeping_prematurely() and when deciding if a zone is > really balanced or not in balance_pgdat(). We are still using an expensive > function but limiting how often it is called. ><snip> > Reported-by: Shaohua Li <shaohua.li@intel.com> > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Hi Mel, I notice these aren't flagged for stable; should they be? (They fairly trivially apply and compile on 2.6.36, barring the trace_ points which changed.) I've got a few bug reports against .36/.37 where kswapd has been sleeping for 60s+. I built them some kernels with these patches, but haven't heard back yet as to whether they fix things for them. Thanks for any insight, Kyle ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-26 16:06 ` Kyle McMartin @ 2010-11-29 9:56 ` Mel Gorman 2010-11-29 13:16 ` Kyle McMartin 0 siblings, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-11-29 9:56 UTC (permalink / raw) To: Kyle McMartin Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Fri, Nov 26, 2010 at 11:06:19AM -0500, Kyle McMartin wrote: > On Wed, Oct 27, 2010 at 09:47:35AM +0100, Mel Gorman wrote: > ><snip> > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > > is introduced that takes a more accurate reading of NR_FREE_PAGES when > > called from wakeup_kswapd, when deciding whether it is really safe to go > > back to sleep in sleeping_prematurely() and when deciding if a zone is > > really balanced or not in balance_pgdat(). We are still using an expensive > > function but limiting how often it is called. > ><snip> > > Reported-by: Shaohua Li <shaohua.li@intel.com> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > Hi Mel, > > I notice these aren't flagged for stable; should they be? (They fairly > trivially apply and compile on 2.6.36, barring the trace_ points which > changed.) They were not flagged for stable because they fixed a performance problem rather than a functional bug, and they affected a limited number of machines. Should that decision be revisited? > I've got a few bug reports against .36/.37 where kswapd has > been sleeping for 60s+. > I do not believe these patches would affect kswapd sleeping for 60s. > I built them some kernels with these patches, but haven't heard back yet > as to whether they fix things for them. > > Thanks for any insight, Can you point me at a relevant bugzilla entry or forward me the bug report and I'll take a look? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 9:56 ` Mel Gorman @ 2010-11-29 13:16 ` Kyle McMartin 2010-11-29 15:08 ` Mel Gorman 0 siblings, 1 reply; 40+ messages in thread From: Kyle McMartin @ 2010-11-29 13:16 UTC (permalink / raw) To: Mel Gorman Cc: Kyle McMartin, Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 09:56:19AM +0000, Mel Gorman wrote: > Can you point me at a relevant bugzilla entry or forward me the bug report > and I'll take a look? > https://bugzilla.redhat.com/show_bug.cgi?id=649694 Thanks, Kyle ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 13:16 ` Kyle McMartin @ 2010-11-29 15:08 ` Mel Gorman 2010-11-29 15:22 ` Kyle McMartin 0 siblings, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-11-29 15:08 UTC (permalink / raw) To: Kyle McMartin Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 08:16:26AM -0500, Kyle McMartin wrote: > On Mon, Nov 29, 2010 at 09:56:19AM +0000, Mel Gorman wrote: > > Can you point me at a relevant bugzilla entry or forward me the bug report > > and I'll take a look? > > > > https://bugzilla.redhat.com/show_bug.cgi?id=649694 > Ouch! I have been unable to create an exact copy of your kernel source as I'm not running Fedora. From a partial conversion of a source RPM, I saw no changes related to mm/vmscan.c. Is this accurate? I'm trying to establish if this is a mainline bug as well. Second, I see all the stack traces are marked with "?", making them unreliable. Is that anything to be concerned about? I see that one user has reported that the patches fixed the problem for him but I fear that this might be a coincidence or that the patches close a race of some description. Specifically, I'm trying to identify if there is a situation where kswapd() constantly loops checking watermarks and never calling cond_resched(). This could conceivably happen if kswapd() is always checking sleeping_prematurely() at a higher order whereas balance_pgdat() always checks the watermarks at a lower order. I'm not seeing how this could happen in 2.6.35.6 though. If Fedora doesn't have special changes, it might mean that these patches do need to go into -stable as the cost of zone_page_state_snapshot() is far higher on larger machines than previously reported. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 40+ messages in thread
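Mel's hypothesis is easier to follow as pseudocode. The sketch below is purely illustrative of the suspected failure shape; the function names are real, but the loop body is invented and is not the 2.6.35 source:

----------------------------------------------------------------------
/*
 * Illustrative pseudocode only: if the sleep-side check and the
 * balancing work disagree about which order's watermarks matter,
 * kswapd neither sleeps nor reaches a cond_resched().
 */
for ( ; ; ) {
	/* a high-order wakeup was requested */
	order = pgdat->kswapd_max_order;

	/* balancing succeeds, but only for lower orders */
	balance_pgdat(pgdat, order);

	/* the high-order watermark check still fails ... */
	if (sleeping_prematurely(pgdat, order, remaining))
		continue;	/* ... so loop again without yielding */

	schedule();		/* never reached in this scenario */
}
----------------------------------------------------------------------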
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 15:08 ` Mel Gorman @ 2010-11-29 15:22 ` Kyle McMartin 2010-11-29 15:26 ` Kyle McMartin 2010-11-29 15:58 ` Mel Gorman 0 siblings, 2 replies; 40+ messages in thread From: Kyle McMartin @ 2010-11-29 15:22 UTC (permalink / raw) To: Mel Gorman Cc: Kyle McMartin, Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 03:08:24PM +0000, Mel Gorman wrote: > Ouch! I have been unable to create an exact copy of your kernel source as > I'm not running Fedora. From a partial conversion of a source RPM, I saw no > changes related to mm/vmscan.c. Is this accurate? I'm trying to establish > if this is a mainline bug as well. > Sorry, if you extract the source rpm you should get the patched sources... Aside from a few patches to mm/mmap for execshield, mm/* is otherwise untouched from the latest stable 2.6.35 kernels. If you git clone git://pkgs.fedoraproject.org/kernel and check out the origin/f14/master branch, it has all the patches we apply (based on the 'ApplyPatch' lines in kernel.spec) > Second, I see all the stack traces are marked with "?", making them > unreliable. Is that anything to be concerned about? > Hrm, I don't think it is; I think the ones with '?' are just artifacts because we don't have a proper unwinder. Oh! Thanks! I just found a bug in our configs... We don't have CONFIG_FRAME_POINTER set because CONFIG_DEBUG_KERNEL got unset in the 'production' configs... I'll fix that up. > I see that one user has reported that the patches fixed the problem for him > but I fear that this might be a coincidence or that the patches close a > race of some description. Specifically, I'm trying to identify if there is > a situation where kswapd() constantly loops checking watermarks and never > calling cond_resched(). This could conceivably happen if kswapd() is always > checking sleeping_prematurely() at a higher order whereas balance_pgdat() > always checks the watermarks at a lower order. I'm not seeing how this > could happen in 2.6.35.6 though. If Fedora doesn't have special changes, > it might mean that these patches do need to go into -stable as the > cost of zone_page_state_snapshot() is far higher on larger machines than > previously reported. > Yeah, I am a bit surprised as well. Luke seems to have quite a large machine... I haven't seen any kswapd lockups there on my 18G machine using the same kernel. :< (Possibly it's just not stressed enough though.) Thanks for looking into this! Kyle ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 15:22 ` Kyle McMartin @ 2010-11-29 15:26 ` Kyle McMartin 2010-11-29 15:58 ` Mel Gorman 1 sibling, 0 replies; 40+ messages in thread From: Kyle McMartin @ 2010-11-29 15:26 UTC (permalink / raw) To: Kyle McMartin Cc: Mel Gorman, Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 10:22:30AM -0500, Kyle McMartin wrote: > Hrm, I don't think it is, I think the ones with '?' are just artifacts > because we don't have a proper unwinder. Oh! Thanks! I just found a bug > in our configs... We don't have CONFIG_FRAME_POINTER set because > CONFIG_DEBUG_KERNEL got unset in the 'production' configs... I'll fix > that up. > Oops, no, misdiagnosed that by accidentally grepping for FRAME_POINTERS instead of FRAME_POINTER... it's set in our configs. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 15:22 ` Kyle McMartin 2010-11-29 15:26 ` Kyle McMartin @ 2010-11-29 15:58 ` Mel Gorman 2010-12-23 22:18 ` David Rientjes 1 sibling, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-11-29 15:58 UTC (permalink / raw) To: Kyle McMartin Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 10:22:30AM -0500, Kyle McMartin wrote: > On Mon, Nov 29, 2010 at 03:08:24PM +0000, Mel Gorman wrote: > > Ouch! I have been unable to create an exact copy of your kernel source as > > I'm not running Fedora. From a partial conversion of a source RPM, I saw no > > changes related to mm/vmscan.c. Is this accurate? I'm trying to establish > > if this is a mainline bug as well. > > > > Sorry, if you extract the source rpm you should get the patched > sources... Aside from a few patches to mm/mmap for execshield, mm/* is > otherwise untouched from the latest stable 2.6.35 kernels. > Perfect, that correlates with what I saw, so this is probably a mainline issue. > If you git clone git://pkgs.fedoraproject.org/kernel and check out the > origin/f14/master branch, it has all the patches we apply (based on the > 'ApplyPatch' lines in kernel.spec) > > > Second, I see all the stack traces are marked with "?", making them > > unreliable. Is that anything to be concerned about? > > > > Hrm, I don't think it is; I think the ones with '?' are just artifacts > because we don't have a proper unwinder. Oh! Thanks! I just found a bug > in our configs... We don't have CONFIG_FRAME_POINTER set because > CONFIG_DEBUG_KERNEL got unset in the 'production' configs... I'll fix > that up. > Ordinarily I'd expect it to be from the lack of an unwinder, but if FRAME_POINTER is there (which you say in a follow-up mail it is), it can be a bit of a concern. There is some real weirdness as it is. Take one of Luke's examples, where it appears to be locked up in [ 5015.448127] Pid: 185, comm: kswapd1 Tainted: P 2.6.35.6-48.fc14.x86_64 #1 X8DA3/X8DA3 [ 5015.448127] RIP: 0010:[<ffffffff81469130>] [<ffffffff81469130>] _raw_spin_unlock_irqrestore+0x18/0x19 I am at a loss to explain under what circumstances that can even happen! Is there any possibility RIP is being translated to the wrong symbol, possibly via a userspace decoder of the oops or similar? Is there any possibility the stack is being corrupted if the swap subsystem is on a complicated software stack? > > I see that one user has reported that the patches fixed the problem for him > > but I fear that this might be a coincidence or that the patches close a > > race of some description. Specifically, I'm trying to identify if there is > > a situation where kswapd() constantly loops checking watermarks and never > > calling cond_resched(). This could conceivably happen if kswapd() is always > > checking sleeping_prematurely() at a higher order whereas balance_pgdat() > > always checks the watermarks at a lower order. I'm not seeing how this > > could happen in 2.6.35.6 though. If Fedora doesn't have special changes, > > it might mean that these patches do need to go into -stable as the > > cost of zone_page_state_snapshot() is far higher on larger machines than > > previously reported. > > > > Yeah, I am a bit surprised as well. Luke seems to have quite a large > machine... I haven't seen any kswapd lockups there on my 18G machine > using the same kernel. 
:< (Possibly it's just not stressed enough > though.) > He reports his machine as 24-way, but with his model of CPU it could still be 2-socket, which I ordinarily would not have expected to suffer so badly from zone_page_state_snapshot(). I would have expected the patches being thrown about in the thread "Free memory never fully used, swapping" to be more relevant to kswapd failing to go to sleep :/ Andrew, this patch was a performance fix, but is a report saying that it fixes a functional regression in Fedora enough to push a patch towards stable even though an explanation as to *why* it fixes the problem is missing? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 40+ messages in thread
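For readers following the cost argument: zone_page_state_snapshot() in kernels of this era is roughly the following (paraphrased from memory rather than copied from a specific tree). The for_each_online_cpu() walk touches one per-cpu cache line per CPU on every call, which is what bounces so badly on many-socket machines:

----------------------------------------------------------------------
/*
 * Determine an accurate value of a counter by folding in every
 * CPU's pending delta: O(nr_cpus) cache-line reads per call.
 */
static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
	return x;
}
----------------------------------------------------------------------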
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 15:58 ` Mel Gorman @ 2010-12-23 22:18 ` David Rientjes 2010-12-23 22:35 ` Andrew Morton 0 siblings, 1 reply; 40+ messages in thread From: David Rientjes @ 2010-12-23 22:18 UTC (permalink / raw) To: Mel Gorman, Greg Kroah-Hartman, Andrew Morton Cc: Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Mon, 29 Nov 2010, Mel Gorman wrote: > Andrew, this patch was a performance fix but is a report saying that it > fixes a functional regression in Fedora enough to push a patch torwards > stable even though an explanation as to *why* it fixes the problem is missing? > We had to pull aa454840 "mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 internally because tests showed that it would cause the machine to stall as the result of heavy kswapd activity. I merged it back with this fix as it is pending in the -mm tree and it solves the issue we were seeing, so I definitely think this should be pushed to -stable (and I would seriously consider it for 2.6.37 inclusion even at this late date). ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-12-23 22:18 ` David Rientjes @ 2010-12-23 22:35 ` Andrew Morton 2010-12-23 23:00 ` Kyle McMartin 2010-12-23 23:07 ` David Rientjes 0 siblings, 2 replies; 40+ messages in thread From: Andrew Morton @ 2010-12-23 22:35 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Greg Kroah-Hartman, Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Thu, 23 Dec 2010 14:18:38 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Mon, 29 Nov 2010, Mel Gorman wrote: > > > Andrew, this patch was a performance fix but is a report saying that it > > fixes a functional regression in Fedora enough to push a patch torwards > > stable even though an explanation as to *why* it fixes the problem is missing? > > > > We had to pull aa454840 "mm: page allocator: calculate a better estimate > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 > internally because tests showed that it would cause the machine to stall > as the result of heavy kswapd activity. I merged it back with this fix as > it is pending in the -mm tree and it solves the issue we were seeing, so I > definitely think this should be pushed to -stable (and I would seriously > consider it for 2.6.37 inclusion even at this late date). How's about I send mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x? That way it'll get a bit of 2.6.38-rc testing before being merged into 2.6.37.x. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-12-23 22:35 ` Andrew Morton @ 2010-12-23 23:00 ` Kyle McMartin 2010-12-23 23:07 ` David Rientjes 1 sibling, 0 replies; 40+ messages in thread From: Kyle McMartin @ 2010-12-23 23:00 UTC (permalink / raw) To: Andrew Morton Cc: David Rientjes, Mel Gorman, Greg Kroah-Hartman, Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Thu, Dec 23, 2010 at 02:35:21PM -0800, Andrew Morton wrote: > > > Andrew, this patch was a performance fix but is a report saying that it > > > fixes a functional regression in Fedora enough to push a patch torwards > > > stable even though an explanation as to *why* it fixes the problem is missing? > > We had to pull aa454840 "mm: page allocator: calculate a better estimate > > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 > > internally because tests showed that it would cause the machine to stall > > as the result of heavy kswapd activity. I merged it back with this fix as > > it is pending in the -mm tree and it solves the issue we were seeing, so I > > definitely think this should be pushed to -stable (and I would seriously > > consider it for 2.6.37 inclusion even at this late date). > > How's about I send > mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch > in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x? > That way it'll get a bit of 2.6.38-rc testing before being merged into > 2.6.37.x. > That sounds fine to me. (Thanks very much for the update, David!) I don't mind carrying a few extra patches here and there in Fedora to get them some exposure if they're low risk... I've been carrying Mel's patches for a month or so now and it hasn't turned up any obvious problems in testing. regards, Kyle ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-12-23 22:35 ` Andrew Morton 2010-12-23 23:00 ` Kyle McMartin @ 2010-12-23 23:07 ` David Rientjes 2010-12-23 23:17 ` Andrew Morton 1 sibling, 1 reply; 40+ messages in thread From: David Rientjes @ 2010-12-23 23:07 UTC (permalink / raw) To: Andrew Morton Cc: Mel Gorman, Greg Kroah-Hartman, Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Thu, 23 Dec 2010, Andrew Morton wrote: > > We had to pull aa454840 "mm: page allocator: calculate a better estimate > > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 > > internally because tests showed that it would cause the machine to stall > > as the result of heavy kswapd activity. I merged it back with this fix as > > it is pending in the -mm tree and it solves the issue we were seeing, so I > > definitely think this should be pushed to -stable (and I would seriously > > consider it for 2.6.37 inclusion even at this late date). > > How's about I send > mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch > in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x? > That way it'll get a bit of 2.6.38-rc testing before being merged into > 2.6.37.x. > I don't think anyone would be able to answer that judgment call other than you or Linus; it's a trade-off on whether 2.6.37 should be released with the knowledge that it regresses just like 2.6.36 does (rendering both unusable on some of our machines out of the box) because we're late in the cycle. I personally think the testing is already sufficient since it's been sitting in -mm for two months, it's been suggested as stable material by a couple different parties, it was a prerequisite for the transparent hugepage series, and we've tested and merged it as fixing the regression in 2.6.36 (as Fedora has, as far as I know). We've already merged the fix internally, though, so it's not for selfish reasons :) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-12-23 23:07 ` David Rientjes @ 2010-12-23 23:17 ` Andrew Morton 0 siblings, 0 replies; 40+ messages in thread From: Andrew Morton @ 2010-12-23 23:17 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Greg Kroah-Hartman, Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Thu, 23 Dec 2010 15:07:02 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Thu, 23 Dec 2010, Andrew Morton wrote: > > > > We had to pull aa454840 "mm: page allocator: calculate a better estimate > > > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 > > > internally because tests showed that it would cause the machine to stall > > > as the result of heavy kswapd activity. I merged it back with this fix as > > > it is pending in the -mm tree and it solves the issue we were seeing, so I > > > definitely think this should be pushed to -stable (and I would seriously > > > consider it for 2.6.37 inclusion even at this late date). > > > > How's about I send > > mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch > > in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x? > > That way it'll get a bit of 2.6.38-rc testing before being merged into > > 2.6.37.x. > > > > I don't think anyone would be able to answer that judgment call other than > you or Linus, it's a trade-off on whether 2.6.37 should be released with > the knowledge that it regresses just like 2.6.36 does (rendering both > unusable on some of our machines out of the box) because we're late in the > cycle. > > I personally think the testing is already sufficient since it's been > sitting in -mm for two months, it's been suggested as stable material by a > couple different parties, it was a prerequisite for the transparent > hugepage series, and we've tested and merged it as fixing the regression > in 2.6.36 (as Fedora has, as far as I know). We've already merged the fix > internally, though, so it's not for selfish reasons :) Wibble, wobble. It's good that the patch has been used in RH kernels. otoh, the patch is really quite big and the problem was present in 2.6.36 without a lot of complaints and we're very late in -rc and not many people will be testing over xmas/newyear, and it would be most sad to put badness into mainline at this time. So I'm still inclined to go with (discretion > valour). ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds 2010-10-27 8:47 [PATCH 0/2] Reduce the amount of time spent in watermark-related functions Mel Gorman 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman @ 2010-10-27 8:47 ` Mel Gorman 2010-10-27 20:13 ` Christoph Lameter ` (2 more replies) 1 sibling, 3 replies; 40+ messages in thread From: Mel Gorman @ 2010-10-27 8:47 UTC (permalink / raw) To: Andrew Morton Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid errors due to counter drift. The functions duplicate some code so this patch replaces them with a single set_pgdat_percpu_threshold() that takes a callback function to calculate the desired threshold as a parameter. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/linux/vmstat.h | 10 ++++++---- mm/vmscan.c | 6 ++++-- mm/vmstat.c | 30 ++++++------------------------ 3 files changed, 16 insertions(+), 30 deletions(-) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index e4cc21c..833e676 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -254,8 +254,11 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); void refresh_cpu_vm_stats(int); -void reduce_pgdat_percpu_threshold(pg_data_t *pgdat); -void restore_pgdat_percpu_threshold(pg_data_t *pgdat); + +int calculate_pressure_threshold(struct zone *zone); +int calculate_normal_threshold(struct zone *zone); +void set_pgdat_percpu_threshold(pg_data_t *pgdat, + int (*calculate_pressure)(struct zone *)); #else /* CONFIG_SMP */ /* @@ -300,8 +303,7 @@ static inline void __dec_zone_page_state(struct page *page, #define dec_zone_page_state __dec_zone_page_state #define mod_zone_page_state __mod_zone_page_state -static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { } -static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { } +#define set_pgdat_percpu_threshold(pgdat, callback) { } static inline void refresh_cpu_vm_stats(int cpu) { } #endif diff --git a/mm/vmscan.c b/mm/vmscan.c index 3e71cb1..7966110 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2378,9 +2378,11 @@ static int kswapd(void *p) */ if (!sleeping_prematurely(pgdat, order, remaining)) { trace_mm_vmscan_kswapd_sleep(pgdat->node_id); - restore_pgdat_percpu_threshold(pgdat); + set_pgdat_percpu_threshold(pgdat, + calculate_normal_threshold); schedule(); - reduce_pgdat_percpu_threshold(pgdat); + set_pgdat_percpu_threshold(pgdat, + calculate_pressure_threshold); } else { if (remaining) count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); diff --git a/mm/vmstat.c b/mm/vmstat.c index cafcc2d..73661e8 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -81,13 +81,13 @@ EXPORT_SYMBOL(vm_stat); #ifdef CONFIG_SMP -static int calculate_pressure_threshold(struct zone *zone) +int calculate_pressure_threshold(struct zone *zone) { return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) / num_online_cpus()))); } -static int calculate_threshold(struct zone *zone) +int calculate_normal_threshold(struct zone *zone) { int threshold; int mem; /* memory in 128 MB units */ @@ -146,7 +146,7 @@ static void refresh_zone_stat_thresholds(void) for_each_populated_zone(zone) { unsigned long max_drift, tolerate_drift; 
- threshold = calculate_threshold(zone); + threshold = calculate_normal_threshold(zone); for_each_online_cpu(cpu) per_cpu_ptr(zone->pageset, cpu)->stat_threshold @@ -165,7 +165,8 @@ static void refresh_zone_stat_thresholds(void) } } -void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) +void set_pgdat_percpu_threshold(pg_data_t *pgdat, + int (*calculate_pressure)(struct zone *)) { struct zone *zone; int cpu; @@ -177,26 +178,7 @@ void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) if (!zone->percpu_drift_mark) continue; - threshold = calculate_pressure_threshold(zone); - for_each_online_cpu(cpu) - per_cpu_ptr(zone->pageset, cpu)->stat_threshold - = threshold; - } -} - -void restore_pgdat_percpu_threshold(pg_data_t *pgdat) -{ - struct zone *zone; - int cpu; - int threshold; - int i; - - for (i = 0; i < pgdat->nr_zones; i++) { - zone = &pgdat->node_zones[i]; - if (!zone->percpu_drift_mark) - continue; - - threshold = calculate_threshold(zone); + threshold = calculate_pressure(zone); for_each_online_cpu(cpu) per_cpu_ptr(zone->pageset, cpu)->stat_threshold = threshold; -- 1.7.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds 2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman @ 2010-10-27 20:13 ` Christoph Lameter 2010-10-28 1:10 ` KAMEZAWA Hiroyuki 2010-11-01 7:06 ` KOSAKI Motohiro 2 siblings, 0 replies; 40+ messages in thread From: Christoph Lameter @ 2010-10-27 20:13 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM Reviewed-by: Christoph Lameter <cl@linux.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds 2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman 2010-10-27 20:13 ` Christoph Lameter @ 2010-10-28 1:10 ` KAMEZAWA Hiroyuki 2010-11-01 7:06 ` KOSAKI Motohiro 2 siblings, 0 replies; 40+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-10-28 1:10 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, LKML, Linux-MM On Wed, 27 Oct 2010 09:47:36 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() > exist to adjust the per-cpu vmstat thresholds while kswapd is awake to > avoid errors due to counter drift. The functions duplicate some code so > this patch replaces them with a single set_pgdat_percpu_threshold() that > takes a callback function to calculate the desired threshold as a > parameter. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds 2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman 2010-10-27 20:13 ` Christoph Lameter 2010-10-28 1:10 ` KAMEZAWA Hiroyuki @ 2010-11-01 7:06 ` KOSAKI Motohiro 2 siblings, 0 replies; 40+ messages in thread From: KOSAKI Motohiro @ 2010-11-01 7:06 UTC (permalink / raw) To: Mel Gorman Cc: kosaki.motohiro, Andrew Morton, Shaohua Li, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM > reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() > exist to adjust the per-cpu vmstat thresholds while kswapd is awake to > avoid errors due to counter drift. The functions duplicate some code so > this patch replaces them with a single set_pgdat_percpu_threshold() that > takes a callback function to calculate the desired threshold as a > parameter. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 0/2] Reduce the amount of time spent in watermark-related functions V4 @ 2010-10-28 15:13 Mel Gorman 2010-10-28 15:13 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman 0 siblings, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-10-28 15:13 UTC (permalink / raw) To: Andrew Morton Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman Changelog since V3 o Added Reviewed-bys o Added comment on why pressure_threshold is what it is o Make sure pressure threshold does not get over 125 o Call get_online_cpus and put_online_cpus as appropriate The following two patches are in response to a bug report by Shaohua Li where the amount of time spent in zone_nr_free_pages() is unacceptable for large machines. All the background is in the first patch's leader. The second patch replaces two setter functions with one function that takes a callback function as a parameter. include/linux/mmzone.h | 10 ++------ include/linux/vmstat.h | 7 ++++++ mm/mmzone.c | 21 ------------------ mm/page_alloc.c | 35 ++++++++++++++++++++++++------- mm/vmscan.c | 25 +++++++++++++-------- mm/vmstat.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-- 6 files changed, 103 insertions(+), 49 deletions(-) Mel Gorman (2): mm: page allocator: Adjust the per-cpu counter threshold when memory is low mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds include/linux/mmzone.h | 10 ++------ include/linux/vmstat.h | 7 ++++++ mm/mmzone.c | 21 ------------------- mm/page_alloc.c | 35 ++++++++++++++++++++++++------- mm/vmscan.c | 25 +++++++++++++--------- mm/vmstat.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-- 6 files changed, 101 insertions(+), 49 deletions(-) ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-28 15:13 [PATCH 0/2] Reduce the amount of time spent in watermark-related functions V4 Mel Gorman @ 2010-10-28 15:13 ` Mel Gorman 2010-10-28 22:04 ` Andrew Morton 0 siblings, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-10-28 15:13 UTC (permalink / raw) To: Andrew Morton Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when memory is low] noted that watermarks were based on the vmstat NR_FREE_PAGES. To avoid synchronization overhead, these counters are maintained on a per-cpu basis and drained both periodically and when the per-cpu delta is above a threshold. On large CPU systems, the difference between the estimate and real value of NR_FREE_PAGES can be very high. The system can get into a case where pages are allocated far below the min watermark potentially causing livelock issues. The commit solved the problem by taking a better reading of NR_FREE_PAGES when memory was low. Unfortunately, as reported by Shaohua Li, this accurate reading can consume a large amount of CPU time on systems with many sockets due to cache line bouncing. This patch takes a different approach. For large machines where counter drift might be unsafe and while kswapd is awake, the per-cpu thresholds for the target pgdat are reduced to limit the level of drift to what should be a safe level. This incurs a performance penalty under heavy memory pressure by a factor that depends on the workload and the machine but the machine should function correctly without accidentally exhausting all memory on a node. There is an additional cost when kswapd wakes and sleeps but the event is not expected to be frequent - in Shaohua's test case, there was at least one recorded sleep and wake event. To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is introduced that takes a more accurate reading of NR_FREE_PAGES when called from wakeup_kswapd, when deciding whether it is really safe to go back to sleep in sleeping_prematurely() and when deciding if a zone is really balanced or not in balance_pgdat(). We are still using an expensive function but limiting how often it is called. When the test case is reproduced, the time spent in the watermark functions is reduced. The following report is on the percentage of time cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), zone_page_state(). 
vanilla 11.6615% disable-threshold 0.2584% Reported-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> --- include/linux/mmzone.h | 10 ++----- include/linux/vmstat.h | 5 +++ mm/mmzone.c | 21 --------------- mm/page_alloc.c | 35 +++++++++++++++++++----- mm/vmscan.c | 23 +++++++++------- mm/vmstat.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++- 6 files changed, 115 insertions(+), 47 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 3984c4e..8d789d7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone) return test_bit(ZONE_OOM_LOCKED, &zone->flags); } -#ifdef CONFIG_SMP -unsigned long zone_nr_free_pages(struct zone *zone); -#else -#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES) -#endif /* CONFIG_SMP */ - /* * The "priority" of VM scanning is how much of the queues we will scan in one * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the @@ -651,7 +645,9 @@ typedef struct pglist_data { extern struct mutex zonelists_mutex; void build_all_zonelists(void *data); void wakeup_kswapd(struct zone *zone, int order); -int zone_watermark_ok(struct zone *z, int order, unsigned long mark, +bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, + int classzone_idx, int alloc_flags); +bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, int classzone_idx, int alloc_flags); enum memmap_context { MEMMAP_EARLY, diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index eaaea37..e4cc21c 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); void refresh_cpu_vm_stats(int); +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat); +void restore_pgdat_percpu_threshold(pg_data_t *pgdat); #else /* CONFIG_SMP */ /* @@ -298,6 +300,9 @@ static inline void __dec_zone_page_state(struct page *page, #define dec_zone_page_state __dec_zone_page_state #define mod_zone_page_state __mod_zone_page_state +static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { } +static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { } + static inline void refresh_cpu_vm_stats(int cpu) { } #endif diff --git a/mm/mmzone.c b/mm/mmzone.c index e35bfb8..f5b7d17 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn, return 1; } #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */ - -#ifdef CONFIG_SMP -/* Called when a more accurate view of NR_FREE_PAGES is needed */ -unsigned long zone_nr_free_pages(struct zone *zone) -{ - unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES); - - /* - * While kswapd is awake, it is considered the zone is under some - * memory pressure. Under pressure, there is a risk that - * per-cpu-counter-drift will allow the min watermark to be breached - * potentially causing a live-lock. 
While kswapd is awake and - * free pages are low, get a better estimate for free pages - */ - if (nr_free_pages < zone->percpu_drift_mark && - !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) - return zone_page_state_snapshot(zone, NR_FREE_PAGES); - - return nr_free_pages; -} -#endif /* CONFIG_SMP */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f12ad18..0286150 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1454,24 +1454,24 @@ static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) #endif /* CONFIG_FAIL_PAGE_ALLOC */ /* - * Return 1 if free pages are above 'mark'. This takes into account the order + * Return true if free pages are above 'mark'. This takes into account the order * of the allocation. */ -int zone_watermark_ok(struct zone *z, int order, unsigned long mark, - int classzone_idx, int alloc_flags) +static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, + int classzone_idx, int alloc_flags, long free_pages) { /* free_pages may go negative - that's OK */ long min = mark; - long free_pages = zone_nr_free_pages(z) - (1 << order) + 1; int o; + free_pages -= (1 << order) - 1; if (alloc_flags & ALLOC_HIGH) min -= min / 2; if (alloc_flags & ALLOC_HARDER) min -= min / 4; if (free_pages <= min + z->lowmem_reserve[classzone_idx]) - return 0; + return false; for (o = 0; o < order; o++) { /* At the next order, this order's pages become unavailable */ free_pages -= z->free_area[o].nr_free << o; @@ -1480,9 +1480,28 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, min >>= 1; if (free_pages <= min) - return 0; + return false; } - return 1; + return true; +} + +bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, + int classzone_idx, int alloc_flags) +{ + return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, + zone_page_state(z, NR_FREE_PAGES)); +} + +bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, + int classzone_idx, int alloc_flags) +{ + long free_pages = zone_page_state(z, NR_FREE_PAGES); + + if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) + free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); + + return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, + free_pages); } #ifdef CONFIG_NUMA @@ -2436,7 +2455,7 @@ void show_free_areas(void) " all_unreclaimable? %s" "\n", zone->name, - K(zone_nr_free_pages(zone)), + K(zone_page_state(zone, NR_FREE_PAGES)), K(min_wmark_pages(zone)), K(low_wmark_pages(zone)), K(high_wmark_pages(zone)), diff --git a/mm/vmscan.c b/mm/vmscan.c index c5dfabf..3e71cb1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining) if (zone->all_unreclaimable) continue; - if (!zone_watermark_ok(zone, order, high_wmark_pages(zone), + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) return 1; } @@ -2169,7 +2169,7 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, &sc, priority, 0); - if (!zone_watermark_ok(zone, order, + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) { end_zone = i; break; @@ -2215,7 +2215,7 @@ loop_again: * We put equal pressure on every zone, unless one * zone has way too many pages free already. 
*/ - if (!zone_watermark_ok(zone, order, + if (!zone_watermark_ok_safe(zone, order, 8*high_wmark_pages(zone), end_zone, 0)) shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; @@ -2236,7 +2236,7 @@ loop_again: total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; - if (!zone_watermark_ok(zone, order, + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), end_zone, 0)) { all_zones_ok = 0; /* @@ -2244,7 +2244,7 @@ loop_again: * means that we have a GFP_ATOMIC allocation * failure risk. Hurry up! */ - if (!zone_watermark_ok(zone, order, + if (!zone_watermark_ok_safe(zone, order, min_wmark_pages(zone), end_zone, 0)) has_under_min_watermark_zone = 1; } @@ -2378,7 +2378,9 @@ static int kswapd(void *p) */ if (!sleeping_prematurely(pgdat, order, remaining)) { trace_mm_vmscan_kswapd_sleep(pgdat->node_id); + restore_pgdat_percpu_threshold(pgdat); schedule(); + reduce_pgdat_percpu_threshold(pgdat); } else { if (remaining) count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); @@ -2417,16 +2419,17 @@ void wakeup_kswapd(struct zone *zone, int order) if (!populated_zone(zone)) return; - pgdat = zone->zone_pgdat; - if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) + if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) return; + pgdat = zone->zone_pgdat; if (pgdat->kswapd_max_order < order) pgdat->kswapd_max_order = order; - trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); - if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) - return; if (!waitqueue_active(&pgdat->kswapd_wait)) return; + if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0)) + return; + + trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); wake_up_interruptible(&pgdat->kswapd_wait); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 355a9e6..4d7faeb 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -81,6 +81,30 @@ EXPORT_SYMBOL(vm_stat); #ifdef CONFIG_SMP +static int calculate_pressure_threshold(struct zone *zone) +{ + int threshold; + int watermark_distance; + + /* + * As vmstats are not up to date, there is drift between the estimated + * and real values. For high thresholds and a high number of CPUs, it + * is possible for the min watermark to be breached while the estimated + * value looks fine. 
The pressure threshold is a reduced value such + * that even the maximum amount of drift will not accidentally breach + * the min watermark + */ + watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone); + threshold = max(1, (int)(watermark_distance / num_online_cpus())); + + /* + * Maximum threshold is 125 + */ + threshold = min(125, threshold); + + return threshold; +} + static int calculate_threshold(struct zone *zone) { int threshold; @@ -159,6 +183,48 @@ static void refresh_zone_stat_thresholds(void) } } +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) +{ + struct zone *zone; + int cpu; + int threshold; + int i; + + get_online_cpus(); + for (i = 0; i < pgdat->nr_zones; i++) { + zone = &pgdat->node_zones[i]; + if (!zone->percpu_drift_mark) + continue; + + threshold = calculate_pressure_threshold(zone); + for_each_online_cpu(cpu) + per_cpu_ptr(zone->pageset, cpu)->stat_threshold + = threshold; + } + put_online_cpus(); +} + +void restore_pgdat_percpu_threshold(pg_data_t *pgdat) +{ + struct zone *zone; + int cpu; + int threshold; + int i; + + get_online_cpus(); + for (i = 0; i < pgdat->nr_zones; i++) { + zone = &pgdat->node_zones[i]; + if (!zone->percpu_drift_mark) + continue; + + threshold = calculate_threshold(zone); + for_each_online_cpu(cpu) + per_cpu_ptr(zone->pageset, cpu)->stat_threshold + = threshold; + } + put_online_cpus(); +} + /* * For use when we know that interrupts are disabled. */ @@ -826,7 +892,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, "\n scanned %lu" "\n spanned %lu" "\n present %lu", - zone_nr_free_pages(zone), + zone_page_state(zone, NR_FREE_PAGES), min_wmark_pages(zone), low_wmark_pages(zone), high_wmark_pages(zone), -- 1.7.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
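To make the pressure-threshold arithmetic above concrete (illustrative numbers, not taken from the thread): on a zone where low_wmark_pages(zone) - min_wmark_pages(zone) is 4096 pages with 64 CPUs online, calculate_pressure_threshold() returns max(1, 4096/64) = 64, so even if every online CPU holds its maximum pending delta, the total drift cannot eat through the low-to-min watermark gap. With the same gap and only 8 CPUs online, the division would give 512, which the min(125, ...) clamp caps at 125.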
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-28 15:13 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman @ 2010-10-28 22:04 ` Andrew Morton 2010-10-29 10:12 ` Mel Gorman 2010-10-29 14:58 ` Christoph Lameter 0 siblings, 2 replies; 40+ messages in thread From: Andrew Morton @ 2010-10-28 22:04 UTC (permalink / raw) To: Mel Gorman Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Thu, 28 Oct 2010 16:13:35 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when > memory is low] noted that watermarks were based on the vmstat > NR_FREE_PAGES. To avoid synchronization overhead, these counters are > maintained on a per-cpu basis and drained both periodically and when the > per-cpu delta is above a threshold. On large CPU systems, the difference > between the estimate and real value of NR_FREE_PAGES can be very high. > The system can get into a case where pages are allocated far below the > min watermark potentially causing livelock issues. The commit solved the > problem by taking a better reading of NR_FREE_PAGES when memory was low. > > Unfortunately, as reported by Shaohua Li, this accurate reading can consume > a large amount of CPU time on systems with many sockets due to cache > line bouncing. This patch takes a different approach. For large machines > where counter drift might be unsafe and while kswapd is awake, the per-cpu > thresholds for the target pgdat are reduced to limit the level of drift > to what should be a safe level. This incurs a performance penalty under heavy > memory pressure by a factor that depends on the workload and the machine but > the machine should function correctly without accidentally exhausting all > memory on a node. There is an additional cost when kswapd wakes and sleeps > but the event is not expected to be frequent - in Shaohua's test case, > there was at least one recorded sleep and wake event. > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > is introduced that takes a more accurate reading of NR_FREE_PAGES when > called from wakeup_kswapd, when deciding whether it is really safe to go > back to sleep in sleeping_prematurely() and when deciding if a zone is > really balanced or not in balance_pgdat(). We are still using an expensive > function but limiting how often it is called. Here I go again. I have a feeling that I already said this, but I can't find versions 2 or 3 in the archives.. Did you evaluate using plain old percpu_counters for this? They won't solve the performance problem as they're basically the same thing as these open-coded counters. But they'd reduce the amount of noise and custom-coded boilerplate in mm/. > When the test case is reproduced, the time spent in the watermark functions > is reduced. The following report is on the percentage of time > cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), > __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), > zone_page_state(). So how did you decide which callsites needed to use the fast-but-inaccurate zone_watermark_ok() and which needed to use the slow-but-more-accurate zone_watermark_ok_safe()? (Those functions need comments explaining the difference btw) I have a feeling this problem will bite us again perhaps due to those other callsites, but we haven't found the workload yet. 
I don't understand why restore/reduce_pgdat_percpu_threshold() were called around that particular sleep in kswapd and nowhere else. > vanilla 11.6615% > disable-threshold 0.2584% Wow. That's 12% of all CPUs? How many CPUs and what workload? > > ... > > if (!sleeping_prematurely(pgdat, order, remaining)) { > trace_mm_vmscan_kswapd_sleep(pgdat->node_id); > + restore_pgdat_percpu_threshold(pgdat); > schedule(); > + reduce_pgdat_percpu_threshold(pgdat); We could do with some code comments here explaining what's going on. > } else { > if (remaining) > count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); > > ... > > +static int calculate_pressure_threshold(struct zone *zone) > +{ > + int threshold; > + int watermark_distance; > + > + /* > + * As vmstats are not up to date, there is drift between the estimated > + * and real values. For high thresholds and a high number of CPUs, it > + * is possible for the min watermark to be breached while the estimated > + * value looks fine. The pressure threshold is a reduced value such > + * that even the maximum amount of drift will not accidentally breach > + * the min watermark > + */ > + watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone); > + threshold = max(1, (int)(watermark_distance / num_online_cpus())); > + > + /* > + * Maximum threshold is 125 Reasoning? > + */ > + threshold = min(125, threshold); > + > + return threshold; > +} > + > static int calculate_threshold(struct zone *zone) > { > int threshold; > > ... > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + int i; > + > + get_online_cpus(); > + for (i = 0; i < pgdat->nr_zones; i++) { > + zone = &pgdat->node_zones[i]; > + if (!zone->percpu_drift_mark) > + continue; > + > + threshold = calculate_pressure_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > + } > + put_online_cpus(); > +} > + > +void restore_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + int i; > + > + get_online_cpus(); > + for (i = 0; i < pgdat->nr_zones; i++) { > + zone = &pgdat->node_zones[i]; > + if (!zone->percpu_drift_mark) > + continue; > + > + threshold = calculate_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > + } > + put_online_cpus(); > +} Given that ->stat_threshold is the same for each CPU, why store it for each CPU at all? Why not put it in the zone and eliminate the inner loop? ^ permalink raw reply [flat|nested] 40+ messages in thread
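For context on Andrew's percpu_counter question above: lib/percpu_counter.c already provides the estimate-versus-accurate split that the vmstat counters implement by hand. Below is a minimal usage sketch of the API of this era (illustration only; none of this code is in the patches, and NR_FREE_PAGES does not actually use percpu_counter):

----------------------------------------------------------------------
#include <linux/percpu_counter.h>

static struct percpu_counter nr_free;

static int __init example_init(void)
{
	/* second argument is the initial value in this era's API */
	return percpu_counter_init(&nr_free, 0);
}

static s64 example_use(int order)
{
	s64 approx, exact;

	percpu_counter_add(&nr_free, 1 << order);

	/* cheap read: global value only, ignores per-cpu deltas */
	approx = percpu_counter_read(&nr_free);

	/* accurate read: folds in every CPU's delta (cache bouncing) */
	exact = percpu_counter_sum(&nr_free);

	return exact - approx;	/* drift currently pending per-cpu */
}
----------------------------------------------------------------------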
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-28 22:04 ` Andrew Morton @ 2010-10-29 10:12 ` Mel Gorman 2010-10-29 19:40 ` Andrew Morton 1 sibling, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-10-29 10:12 UTC (permalink / raw) To: Andrew Morton Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Thu, Oct 28, 2010 at 03:04:33PM -0700, Andrew Morton wrote: > On Thu, 28 Oct 2010 16:13:35 +0100 > Mel Gorman <mel@csn.ul.ie> wrote: > > > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when > > memory is low] noted that watermarks were based on the vmstat > > NR_FREE_PAGES. To avoid synchronization overhead, these counters are > > maintained on a per-cpu basis and drained both periodically and when the > > per-cpu delta is above a threshold. On large CPU systems, the difference > > between the estimate and real value of NR_FREE_PAGES can be very high. > > The system can get into a case where pages are allocated far below the > > min watermark potentially causing livelock issues. The commit solved the > > problem by taking a better reading of NR_FREE_PAGES when memory was low. > > > > Unfortunately, as reported by Shaohua Li, this accurate reading can consume > > a large amount of CPU time on systems with many sockets due to cache > > line bouncing. This patch takes a different approach. For large machines > > where counter drift might be unsafe and while kswapd is awake, the per-cpu > > thresholds for the target pgdat are reduced to limit the level of drift > > to what should be a safe level. This incurs a performance penalty under heavy > > memory pressure by a factor that depends on the workload and the machine but > > the machine should function correctly without accidentally exhausting all > > memory on a node. There is an additional cost when kswapd wakes and sleeps > > but the event is not expected to be frequent - in Shaohua's test case, > > there was at least one recorded sleep and wake event. > > > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > > is introduced that takes a more accurate reading of NR_FREE_PAGES when > > called from wakeup_kswapd, when deciding whether it is really safe to go > > back to sleep in sleeping_prematurely() and when deciding if a zone is > > really balanced or not in balance_pgdat(). We are still using an expensive > > function but limiting how often it is called. > > Here I go again. I have a feeling that I already said this, but I > can't find versions 2 or 3 in the archives.. > > Did you evaluate using plain old percpu_counters for this? They won't > solve the performance problem as they're basically the same thing as > these open-coded counters. But they'd reduce the amount of noise and > custom-coded boilerplate in mm/. > You did bring this up before. Here is the reference to the answer you got from Christoph at the time: http://lkml.org/lkml/2010/9/3/453 . It seemed like a reasonable answer. > > When the test case is reproduced, the time spent in the watermark functions > > is reduced. The following report is on the percentage of time > > cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), > > __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), > > zone_page_state(). 
>
> So how did you decide which callsites needed to use the
> fast-but-inaccurate zone_watermark_ok() and which needed to use the
> slow-but-more-accurate zone_watermark_ok_safe()? (Those functions need
> comments explaining the difference btw)
>

Selection was based on kswapd being woken up and staying awake:

1. When deciding if kswapd should wake (wakeup_kswapd()), we have failed
   the initial allocation attempt and we should be sure of the watermarks
   to decide if kswapd really should wake or not

2. Once kswapd is awake, it shouldn't go to sleep prematurely

While kswapd is awake, drift is less of a problem because thresholds are
reduced.

>
> I have a feeling this problem will bite us again perhaps due to those
> other callsites, but we haven't found the workload yet.
>
> I don't understand why restore/reduce_pgdat_percpu_threshold() were
> called around that particular sleep in kswapd and nowhere else.
>
> > vanilla 11.6615%
> > disable-threshold 0.2584%
>
> Wow. That's 12% of all CPUs? How many CPUs and what workload?
>

112 CPU threads, 14 sockets. Workload initialisation creates NR_CPU sparse
files that are 10*TOTAL_MEMORY/NR_CPU in size. Workload itself is NR_CPU
processes just reading their own file.

The critical thing is the number of sockets. For single-socket-8-thread
for example, vanilla was just 0.66% of time (although the patches did
bring it down to 0.11%).

> >
> > ...
> >
> > 	if (!sleeping_prematurely(pgdat, order, remaining)) {
> > 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +		restore_pgdat_percpu_threshold(pgdat);
> > 		schedule();
> > +		reduce_pgdat_percpu_threshold(pgdat);
>
> We could do with some code comments here explaining what's going on.
>

Follow-on patch?

> > 	} else {
> > 		if (remaining)
> > 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> >
> > ...
> >
> > +static int calculate_pressure_threshold(struct zone *zone)
> > +{
> > +	int threshold;
> > +	int watermark_distance;
> > +
> > +	/*
> > +	 * As vmstats are not up to date, there is drift between the estimated
> > +	 * and real values. For high thresholds and a high number of CPUs, it
> > +	 * is possible for the min watermark to be breached while the estimated
> > +	 * value looks fine. The pressure threshold is a reduced value such
> > +	 * that even the maximum amount of drift will not accidentally breach
> > +	 * the min watermark
> > +	 */
> > +	watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
> > +	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> > +
> > +	/*
> > +	 * Maximum threshold is 125
>
> Reasoning?
>

To match the existing maximum, which I assume is due to the deltas being
stored in an s8.

> > +	 */
> > +	threshold = min(125, threshold);
> > +
> > +	return threshold;
> > +}
> > +
> > static int calculate_threshold(struct zone *zone)
> > {
> > 	int threshold;
> >
> > ...
> >
> > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +	int i;
> > +
> > +	get_online_cpus();
> > +	for (i = 0; i < pgdat->nr_zones; i++) {
> > +		zone = &pgdat->node_zones[i];
> > +		if (!zone->percpu_drift_mark)
> > +			continue;
> > +
> > +		threshold = calculate_pressure_threshold(zone);
> > +		for_each_online_cpu(cpu)
> > +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > +							= threshold;
> > +	}
> > +	put_online_cpus();
> > +}
> > +
> > +void restore_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +	int i;
> > +
> > +	get_online_cpus();
> > +	for (i = 0; i < pgdat->nr_zones; i++) {
> > +		zone = &pgdat->node_zones[i];
> > +		if (!zone->percpu_drift_mark)
> > +			continue;
> > +
> > +		threshold = calculate_threshold(zone);
> > +		for_each_online_cpu(cpu)
> > +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > +							= threshold;
> > +	}
> > +	put_online_cpus();
> > +}
>
> Given that ->stat_threshold is the same for each CPU, why store it for
> each CPU at all? Why not put it in the zone and eliminate the inner
> loop?
>

I asked why we couldn't move the threshold to struct zone and Christoph
responded:

"If you move it then the cache footprint of the vm stat functions (which
need to access the threshold for each access!) will increase and the
performance will sink dramatically. I tried to avoid placing the threshold
there when I developed that approach but it always caused a dramatic
regression under heavy load."

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread
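To make the safety argument behind calculate_pressure_threshold() concrete,
here is a worked example. The watermark values are invented for illustration;
only the CPU count comes from the report above:

/*
 * Hypothetical zone on a 112-CPU machine:
 *
 *   low watermark = 10000 pages
 *   min watermark =  8000 pages
 *
 *   watermark_distance = 10000 - 8000 = 2000
 *   threshold = max(1, 2000 / 112) = 17    (integer division)
 *
 * Each CPU can then hide at most 17 uncommitted NR_FREE_PAGES updates
 * in its differential, so the worst-case drift is 112 * 17 = 1904
 * pages, below the 2000-page gap. An estimate that still reads as
 * above the low watermark therefore guarantees the true value is above
 * min, and the min watermark cannot be silently breached while the
 * pressure threshold is in force.
 */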
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 10:12 ` Mel Gorman
@ 2010-10-29 19:40   ` Andrew Morton
  2010-11-02 0:53      ` Shaohua Li
                       ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Andrew Morton @ 2010-10-29 19:40 UTC (permalink / raw)
To: Mel Gorman
Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 29 Oct 2010 11:12:11 +0100 Mel Gorman <mel@csn.ul.ie> wrote:

> On Thu, Oct 28, 2010 at 03:04:33PM -0700, Andrew Morton wrote:
> > On Thu, 28 Oct 2010 16:13:35 +0100
> >
> > I have a feeling this problem will bite us again perhaps due to those
> > other callsites, but we haven't found the workload yet.
> >
> > I don't understand why restore/reduce_pgdat_percpu_threshold() were
> > called around that particular sleep in kswapd and nowhere else.
> >
> > > vanilla 11.6615%
> > > disable-threshold 0.2584%
> >
> > Wow. That's 12% of all CPUs? How many CPUs and what workload?
> >
>
> 112 CPU threads, 14 sockets. Workload initialisation creates NR_CPU sparse
> files that are 10*TOTAL_MEMORY/NR_CPU in size. Workload itself is NR_CPU
> processes just reading their own file.
>
> The critical thing is the number of sockets. For single-socket-8-thread
> for example, vanilla was just 0.66% of time (although the patches did
> bring it down to 0.11%).

I'm surprised. I thought the inefficiency here was caused by CPUs
tromping through percpu data, adding things up. But the above info
would indicate that the problem was caused by lots of cross-socket
traffic? If so, where did that come from?

> ...
>
> Follow-on patch?

Sometime, please.

> > > ...
> > >
> > > 	if (!sleeping_prematurely(pgdat, order, remaining)) {
> > > 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > > +		restore_pgdat_percpu_threshold(pgdat);
> > > 		schedule();
> > > +		reduce_pgdat_percpu_threshold(pgdat);
> >
> > We could do with some code comments here explaining what's going on.
> >
>
> Follow-on patch?
>
> > > 	} else {
> > > 		if (remaining)
> > > 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> > >
> > > ...
> > >
> > > +static int calculate_pressure_threshold(struct zone *zone)
> > > +{
> > > +	int threshold;
> > > +	int watermark_distance;
> > > +
> > > +	/*
> > > +	 * As vmstats are not up to date, there is drift between the estimated
> > > +	 * and real values. For high thresholds and a high number of CPUs, it
> > > +	 * is possible for the min watermark to be breached while the estimated
> > > +	 * value looks fine. The pressure threshold is a reduced value such
> > > +	 * that even the maximum amount of drift will not accidentally breach
> > > +	 * the min watermark
> > > +	 */
> > > +	watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
> > > +	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> > > +
> > > +	/*
> > > +	 * Maximum threshold is 125
> >
> > Reasoning?
> >
>
> To match the existing maximum, which I assume is due to the deltas being
> stored in an s8.

hm, OK. So (CHAR_MAX-2) would be a tad clearer, only there's no
CHAR_MAX and "2" remains mysterious ;)

I do go on about code comments a lot lately. Eric D's kernel just
crashed because we didn't adequately comment first_zones_zonelist()
so I'm feeling all vindicated!

> > Given that ->stat_threshold is the same for each CPU, why store it for
> > each CPU at all? Why not put it in the zone and eliminate the inner
> > loop?
>
> I asked why we couldn't move the threshold to struct zone and Christoph
> responded:
>
> "If you move it then the cache footprint of the vm stat functions (which
> need to access the threshold for each access!) will increase and the
> performance will sink dramatically. I tried to avoid placing the threshold
> there when I developed that approach but it always caused a dramatic
> regression under heavy load."

I don't really buy that. The cache footprint will be increased by a
max of one cacheline (for zone->stat_threshold) and the cache footprint
will actually be reduced in the much larger percpu area (depending on
alignment and padding and stuff).

I'm suspecting something went wrong here, perhaps zone->stat_threshold
shared a cacheline with something unfortunate.

^ permalink raw reply	[flat|nested] 40+ messages in thread
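For reference, the structure being argued about: in kernels of that era the
per-cpu pageset looked roughly like the sketch below (field names taken from
the quoted code; the exact layout varies by config, so treat this as
illustrative). Christoph's point is that stat_threshold sits directly next to
the vm_stat_diff[] array it guards, so the counter fast path reads both from
one per-cpu cacheline:

struct per_cpu_pageset {
	struct per_cpu_pages pcp;	/* hot per-cpu free lists */
#ifdef CONFIG_NUMA
	s8 expire;
#endif
#ifdef CONFIG_SMP
	s8 stat_threshold;		/* read on every counter update */
	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];	/* per-cpu deltas */
#endif
};

Moving stat_threshold into struct zone would save the per-cpu copies but add
a read of a second, shared structure to every vmstat update, which is the
cost Christoph says he measured.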
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 19:40 ` Andrew Morton
@ 2010-11-02 0:53    ` Shaohua Li
  2010-11-09 11:33   ` Mel Gorman
  2010-11-09 16:48   ` Christoph Lameter
  2 siblings, 0 replies; 40+ messages in thread
From: Shaohua Li @ 2010-11-02 0:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Sat, 2010-10-30 at 03:40 +0800, Andrew Morton wrote:
> On Fri, 29 Oct 2010 11:12:11 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > On Thu, Oct 28, 2010 at 03:04:33PM -0700, Andrew Morton wrote:
> > > On Thu, 28 Oct 2010 16:13:35 +0100
> > >
> > > I have a feeling this problem will bite us again perhaps due to those
> > > other callsites, but we haven't found the workload yet.
> > >
> > > I don't understand why restore/reduce_pgdat_percpu_threshold() were
> > > called around that particular sleep in kswapd and nowhere else.
> > >
> > > > vanilla 11.6615%
> > > > disable-threshold 0.2584%
> > >
> > > Wow. That's 12% of all CPUs? How many CPUs and what workload?
> > >
> >
> > 112 CPU threads, 14 sockets. Workload initialisation creates NR_CPU sparse
> > files that are 10*TOTAL_MEMORY/NR_CPU in size. Workload itself is NR_CPU
> > processes just reading their own file.
> >
> > The critical thing is the number of sockets. For single-socket-8-thread
> > for example, vanilla was just 0.66% of time (although the patches did
> > bring it down to 0.11%).
>
> I'm surprised. I thought the inefficiency here was caused by CPUs
> tromping through percpu data, adding things up. But the above info
> would indicate that the problem was caused by lots of cross-socket
> traffic? If so, where did that come from?

From my understanding, the problem is that zone_nr_free_pages() will try
to read each CPU's ->vm_stat_diff while other CPUs are changing their
vm_stat_diff. This will cause a lot of cache bouncing.

^ permalink raw reply	[flat|nested] 40+ messages in thread
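The loop Shaohua is describing is the accurate snapshot read used by the safe
watermark checks. A simplified sketch of the kind of loop involved (not the
exact tree code) shows where the bouncing comes from:

static unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
	int cpu;

	/*
	 * Each iteration pulls in a per-cpu cacheline that its owning
	 * CPU may be dirtying at the same time; on a many-socket
	 * machine this is the cross-socket traffic described above.
	 */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
	return x;
}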
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 19:40 ` Andrew Morton
  2010-11-02 0:53   ` Shaohua Li
@ 2010-11-09 11:33  ` Mel Gorman
  2010-11-09 16:48  ` Christoph Lameter
  2 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2010-11-09 11:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, Oct 29, 2010 at 12:40:02PM -0700, Andrew Morton wrote:
> > ...
> >
> > Follow-on patch?
>
> Sometime, please.
>

How does this look?

==== CUT HERE ====
mm: vmscan: Comment on why kswapd reduces the per-cpu vmstat threshold

While kswapd is awake, the per-cpu vmstat threshold is reduced to keep
per-cpu drift to acceptable levels. Add a comment explaining why.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7966110..ba39948 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2378,6 +2378,19 @@ static int kswapd(void *p)
 		 */
 		if (!sleeping_prematurely(pgdat, order, remaining)) {
 			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+
+			/*
+			 * vmstat counters are not perfectly
+			 * accurate and the estimated value
+			 * for counters such as NR_FREE_PAGES
+			 * can deviate from the true value by
+			 * nr_online_cpus * threshold. To
+			 * avoid the zone watermarks being
+			 * breached while under pressure, we
+			 * reduce the per-cpu vmstat threshold
+			 * while kswapd is awake and restore
+			 * them before going back to sleep.
+			 */
 			set_pgdat_percpu_threshold(pgdat,
 					calculate_normal_threshold);
 			schedule();

^ permalink raw reply related	[flat|nested] 40+ messages in thread
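The set_pgdat_percpu_threshold() in the context lines above is the patch 2
consolidation of reduce/restore_pgdat_percpu_threshold(): a single setter that
takes the threshold formula as a callback. A sketch of the idea, assuming the
same walk as the two functions quoted earlier in the thread (the committed
version may differ in detail):

void set_pgdat_percpu_threshold(pg_data_t *pgdat,
				int (*calculate_pressure)(struct zone *))
{
	struct zone *zone;
	int cpu;
	int threshold;
	int i;

	get_online_cpus();
	for (i = 0; i < pgdat->nr_zones; i++) {
		zone = &pgdat->node_zones[i];
		if (!zone->percpu_drift_mark)
			continue;

		/* The callback picks the pressure or the normal formula */
		threshold = (*calculate_pressure)(zone);
		for_each_online_cpu(cpu)
			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
							= threshold;
	}
	put_online_cpus();
}

kswapd would call this with calculate_pressure_threshold when it wakes and
with calculate_normal_threshold before it sleeps, as in the hunk above.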
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 19:40 ` Andrew Morton
  2010-11-02 0:53   ` Shaohua Li
  2010-11-09 11:33  ` Mel Gorman
@ 2010-11-09 16:48  ` Christoph Lameter
  2 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2010-11-09 16:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Shaohua Li, KOSAKI Motohiro, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 29 Oct 2010, Andrew Morton wrote:

> > To match the existing maximum, which I assume is due to the deltas being
> > stored in an s8.
>
> hm, OK. So (CHAR_MAX-2) would be a tad clearer, only there's no
> CHAR_MAX and "2" remains mysterious ;)

inc/dec_zone_page_state first increments (allows the compiler to generate
an RMW instruction; hmmm... there we could use this_cpu....) and then
checks if we are beyond the threshold. So the maximum value has to be
below CHAR_MAX. 125 is the next value that looks somewhat sane as a
limit.

We could use CHAR_MAX if we change the comparison logic for the
threshold.

> I do go on about code comments a lot lately. Eric D's kernel just
> crashed because we didn't adequately comment first_zones_zonelist()
> so I'm feeling all vindicated!

We should not have done the change to add the nodemask to
first_zones_zonelist but instead added a new function. Now the function
has two different ways of behaving.

> I don't really buy that. The cache footprint will be increased by a
> max of one cacheline (for zone->stat_threshold) and the cache footprint
> will actually be reduced in the much larger percpu area (depending on
> alignment and padding and stuff).

It will increase by one cacheline per zone in use. Large systems have
lots of zones, which increases the effect. The vmstat functions are often
used as primitives in various critical code paths. The cache footprint
needs to be kept as low as possible.

^ permalink raw reply	[flat|nested] 40+ messages in thread
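Christoph's increment-before-check point is easiest to see in a sketch of the
counter fast path, simplified from the mm/vmstat.c of that era (treat the
details as illustrative rather than the exact tree code):

void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
	s8 *p = pcp->vm_stat_diff + item;

	(*p)++;				/* increment first... */

	if (unlikely(*p > pcp->stat_threshold)) {	/* ...check after */
		int overstep = pcp->stat_threshold / 2;

		/* Fold the delta plus half a threshold into the zone */
		zone_page_state_add(*p + overstep, zone, item);
		*p = -overstep;
	}
}

With a threshold of 125 the delta peaks at 126, which still fits in the s8;
a threshold of 127 could wrap the s8 to -128 before the comparison fires.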
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-28 22:04 ` Andrew Morton
  2010-10-29 10:12  ` Mel Gorman
@ 2010-10-29 14:58 ` Christoph Lameter
  2010-10-29 18:25   ` Andrew Morton
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2010-10-29 14:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Shaohua Li, KOSAKI Motohiro, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Thu, 28 Oct 2010, Andrew Morton wrote:

> > To ensure that kswapd wakes up, a safe version of zone_watermark_ok()
> > is introduced that takes a more accurate reading of NR_FREE_PAGES when
> > called from wakeup_kswapd, when deciding whether it is really safe to go
> > back to sleep in sleeping_prematurely() and when deciding if a zone is
> > really balanced or not in balance_pgdat(). We are still using an expensive
> > function but limiting how often it is called.
>
> Here I go again. I have a feeling that I already said this, but I
> can't find versions 2 or 3 in the archives...
>
> Did you evaluate using plain old percpu_counters for this? They won't
> solve the performance problem as they're basically the same thing as
> these open-coded counters. But they'd reduce the amount of noise and
> custom-coded boilerplate in mm/.

The zone counters are done using the ZVCs in vmstat.c to save space and
to be in the same cacheline as other hot data necessary for allocation
and free.

> > +	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> > +
> > +	/*
> > +	 * Maximum threshold is 125
>
> Reasoning?

Differentials are stored in 8-bit signed ints.

> > +	put_online_cpus();
> > +}
>
> Given that ->stat_threshold is the same for each CPU, why store it for
> each CPU at all? Why not put it in the zone and eliminate the inner
> loop?

Doing that caused cache misses in the past and reduced the performance of
the ZVCs. This way the threshold is in the same cacheline as the
differentials.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 14:58 ` Christoph Lameter
@ 2010-10-29 18:25   ` Andrew Morton
  2010-10-29 19:33     ` Christoph Lameter
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2010-10-29 18:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, Shaohua Li, KOSAKI Motohiro, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 29 Oct 2010 09:58:25 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:

> On Thu, 28 Oct 2010, Andrew Morton wrote:
>
> > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok()
> > > is introduced that takes a more accurate reading of NR_FREE_PAGES when
> > > called from wakeup_kswapd, when deciding whether it is really safe to go
> > > back to sleep in sleeping_prematurely() and when deciding if a zone is
> > > really balanced or not in balance_pgdat(). We are still using an expensive
> > > function but limiting how often it is called.
> >
> > Here I go again. I have a feeling that I already said this, but I
> > can't find versions 2 or 3 in the archives...
> >
> > Did you evaluate using plain old percpu_counters for this? They won't
> > solve the performance problem as they're basically the same thing as
> > these open-coded counters. But they'd reduce the amount of noise and
> > custom-coded boilerplate in mm/.
>
> The zone counters are done using the ZVCs in vmstat.c to save space

well, they actually waste space because of that threshold thing.

> and to
> be in the same cacheline as other hot data necessary for allocation and
> free.

Yes, that'll save some misses.

> > > +	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> > > +
> > > +	/*
> > > +	 * Maximum threshold is 125
> >
> > Reasoning?
>
> Differentials are stored in 8-bit signed ints.
>
> > > +	put_online_cpus();
> > > +}
> >
> > Given that ->stat_threshold is the same for each CPU, why store it for
> > each CPU at all? Why not put it in the zone and eliminate the inner
> > loop?
>
> Doing that caused cache misses in the past and reduced the performance of
> the ZVCs. This way the threshold is in the same cacheline as the
> differentials.

This sounds wrong. As long as that threshold isn't stored in a
cacheline which other CPUs are modifying, all CPUs should be able to
happily cache it. Maybe it needed a bit of padding inside the zone
struct.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 18:25 ` Andrew Morton
@ 2010-10-29 19:33   ` Christoph Lameter
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2010-10-29 19:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Shaohua Li, KOSAKI Motohiro, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 29 Oct 2010, Andrew Morton wrote:

> > Doing that caused cache misses in the past and reduced the performance of
> > the ZVCs. This way the threshold is in the same cacheline as the
> > differentials.
>
> This sounds wrong. As long as that threshold isn't stored in a
> cacheline which other CPUs are modifying, all CPUs should be able to
> happily cache it. Maybe it needed a bit of padding inside the zone
> struct.

High-speed CPU caches are a very scarce resource. The differentials are
not in the zone struct. I tried to put the threshold onto a single
cacheline, but even that did not do the trick for the large
configurations. The same optimizations are done in the slab allocators by
the way.

Use of percpu_counter would at least quadruple the cache footprint in
use for the counters. percpu_counter is using s32 and not s8.

^ permalink raw reply	[flat|nested] 40+ messages in thread
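Christoph's quadrupling estimate follows from the element sizes; schematically
(illustrative arithmetic, not code from the tree):

/*
 * Per-CPU, per-zone storage for the counter deltas:
 *
 *   ZVC differentials:  s8  vm_stat_diff[NR_VM_ZONE_STAT_ITEMS]
 *                       -> NR_VM_ZONE_STAT_ITEMS bytes per CPU
 *   percpu_counter:     s32 per counter
 *                       -> 4 * NR_VM_ZONE_STAT_ITEMS bytes per CPU
 *
 * Hence "at least quadruple": each per-cpu delta grows from one byte
 * to four, before counting the struct percpu_counter bookkeeping
 * (a lock and an s64 total) that exists once per counter.
 */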