* [PATCH 0/2] Reduce the amount of time spent in watermark-related functions
@ 2010-10-27 8:47 Mel Gorman
  2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman
  2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman
  0 siblings, 2 replies; 40+ messages in thread
From: Mel Gorman @ 2010-10-27 8:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
	KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman

The following two patches are in response to a bug report by Shaohua Li
where the amount of time spent in zone_nr_free_pages() is unacceptable
for large machines. All the background is in the first patch's leader.
The second patch replaces two setter functions with a single function
that takes a callback as a parameter.

Mel Gorman (2):
  mm: page allocator: Adjust the per-cpu counter threshold when memory
    is low
  mm: vmstat: Use a single setter function and callback for adjusting
    percpu thresholds

 include/linux/mmzone.h |   10 +++-------
 include/linux/vmstat.h |    7 +++++++
 mm/mmzone.c            |   21 ---------------------
 mm/page_alloc.c        |   35 +++++++++++++++++++++++++++--------
 mm/vmscan.c            |   25 +++++++++++++++----------
 mm/vmstat.c            |   32 +++++++++++++++++++++++++++---
 6 files changed, 81 insertions(+), 49 deletions(-)

^ permalink raw reply	[flat|nested] 40+ messages in thread
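[Patch 2/2 itself does not appear in this part of the thread. As a rough
illustration of the refactoring the cover letter describes - one setter
that walks the pgdat's zones and applies whatever threshold a supplied
callback computes - the sketch below shows the general shape such a
function could take. The typedef name is invented for illustration;
set_pgdat_percpu_threshold(), calculate_pressure_threshold() and
calculate_normal_threshold() are the names that show up later in the
thread, and the choice of CPU iterator is debated there too, so treat
this as a sketch rather than the merged implementation.]

/*
 * Illustrative sketch only: a single setter parameterised by a
 * callback that computes the per-cpu stat threshold for a zone.
 * The typedef name is made up for this sketch.
 */
typedef int (*calculate_thresholds_fn)(struct zone *zone);

void set_pgdat_percpu_threshold(pg_data_t *pgdat,
				calculate_thresholds_fn calculate)
{
	struct zone *zone;
	int cpu, threshold, i;

	for (i = 0; i < pgdat->nr_zones; i++) {
		zone = &pgdat->node_zones[i];

		/* Zones without a drift mark keep their normal threshold */
		if (!zone->percpu_drift_mark)
			continue;

		threshold = calculate(zone);
		/* The thread later debates online vs possible iteration */
		for_each_online_cpu(cpu)
			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
							= threshold;
	}
}

[kswapd would then pick the policy by passing a different callback:
calculate_pressure_threshold() before it starts reclaiming and
calculate_normal_threshold() before it goes back to sleep.]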
* [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-27 8:47 [PATCH 0/2] Reduce the amount of time spent in watermark-related functions Mel Gorman
@ 2010-10-27 8:47 ` Mel Gorman
  2010-10-27 20:16 ` Christoph Lameter
                    ` (3 more replies)
  2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman
  1 sibling, 4 replies; 40+ messages in thread
From: Mel Gorman @ 2010-10-27 8:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
	KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman

Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
memory is low] noted that watermarks were based on the vmstat
NR_FREE_PAGES. To avoid synchronization overhead, these counters are
maintained on a per-cpu basis and drained both periodically and when
the per-cpu delta rises above a threshold. On systems with many CPUs,
the difference between the estimated and real value of NR_FREE_PAGES
can be very high. The system can get into a state where pages are
allocated far below the min watermark, potentially causing livelock
issues. The commit solved the problem by taking a better reading of
NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li, this accurate reading can
consume a large amount of CPU time on systems with many sockets due to
cache line bouncing. This patch takes a different approach. For large
machines where counter drift might be unsafe, the per-cpu thresholds
for the target pgdat are reduced while kswapd is awake to limit the
level of drift to what should be a safe level. This incurs a performance
penalty under heavy memory pressure by a factor that depends on the
workload and the machine, but the machine should function correctly
without accidentally exhausting all memory on a node. There is an
additional cost when kswapd wakes and sleeps, but the event is not
expected to be frequent - in Shaohua's test case at least, only one
sleep and wake event was recorded.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok()
is introduced that takes a more accurate reading of NR_FREE_PAGES when
called from wakeup_kswapd, when deciding whether it is really safe to go
back to sleep in sleeping_prematurely() and when deciding if a zone is
really balanced or not in balance_pgdat(). We are still using an expensive
function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark functions
is reduced. The following report shows the percentage of time cumulatively
spent in the functions zone_nr_free_pages(), zone_watermark_ok(),
__zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot()
and zone_page_state().
vanilla			11.6615%
disable-threshold	 0.2584%

Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |   10 +++-------
 include/linux/vmstat.h |    5 +++++
 mm/mmzone.c            |   21 ---------------------
 mm/page_alloc.c        |   35 +++++++++++++++++++++++++++--------
 mm/vmscan.c            |   23 +++++++++++++----------
 mm/vmstat.c            |   46 +++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 93 insertions(+), 47 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3984c4e..8d789d7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
-#ifdef CONFIG_SMP
-unsigned long zone_nr_free_pages(struct zone *zone);
-#else
-#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
-#endif /* CONFIG_SMP */
-
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
@@ -651,7 +645,9 @@ typedef struct pglist_data {
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(void *data);
 void wakeup_kswapd(struct zone *zone, int order);
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		int classzone_idx, int alloc_flags);
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
 	MEMMAP_EARLY,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index eaaea37..e4cc21c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void refresh_cpu_vm_stats(int);
+void reduce_pgdat_percpu_threshold(pg_data_t *pgdat);
+void restore_pgdat_percpu_threshold(pg_data_t *pgdat);
 #else /* CONFIG_SMP */
 
 /*
@@ -298,6 +300,9 @@ static inline void __dec_zone_page_state(struct page *page,
 #define dec_zone_page_state __dec_zone_page_state
 #define mod_zone_page_state __mod_zone_page_state
 
+static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { }
+static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { }
+
 static inline void refresh_cpu_vm_stats(int cpu) { }
 #endif
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index e35bfb8..f5b7d17 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
-
-#ifdef CONFIG_SMP
-/* Called when a more accurate view of NR_FREE_PAGES is needed */
-unsigned long zone_nr_free_pages(struct zone *zone)
-{
-	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
-
-	/*
-	 * While kswapd is awake, it is considered the zone is under some
-	 * memory pressure. Under pressure, there is a risk that
-	 * per-cpu-counter-drift will allow the min watermark to be breached
-	 * potentially causing a live-lock. While kswapd is awake and
-	 * free pages are low, get a better estimate for free pages
-	 */
-	if (nr_free_pages < zone->percpu_drift_mark &&
-			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
-		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
-
-	return nr_free_pages;
-}
-#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f12ad18..0286150 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1454,24 +1454,24 @@ static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
 /*
- * Return 1 if free pages are above 'mark'. This takes into account the order
+ * Return true if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags)
+static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags, long free_pages)
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
+	free_pages -= (1 << order) + 1;
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
-		return 0;
+		return false;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
 		free_pages -= z->free_area[o].nr_free << o;
@@ -1480,9 +1480,28 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		min >>= 1;
 
 		if (free_pages <= min)
-			return 0;
+			return false;
 	}
-	return 1;
+	return true;
+}
+
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+					zone_page_state(z, NR_FREE_PAGES));
+}
+
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	long free_pages = zone_page_state(z, NR_FREE_PAGES);
+
+	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
+		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
+
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+								free_pages);
 }
 
 #ifdef CONFIG_NUMA
@@ -2436,7 +2455,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_nr_free_pages(zone)),
+			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..3e71cb1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 		if (zone->all_unreclaimable)
 			continue;
 
-		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
+		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
 								0, 0))
 			return 1;
 	}
@@ -2169,7 +2169,7 @@ loop_again:
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);
 
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
 				break;
@@ -2215,7 +2215,7 @@ loop_again:
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.
 			 */
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
 					8*high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
@@ -2236,7 +2236,7 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
 				/*
@@ -2244,7 +2244,7 @@ loop_again:
 				 * means that we have a GFP_ATOMIC allocation
 				 * failure risk. Hurry up!
 				 */
-				if (!zone_watermark_ok(zone, order,
+				if (!zone_watermark_ok_safe(zone, order,
 					    min_wmark_pages(zone), end_zone, 0))
 					has_under_min_watermark_zone = 1;
 			}
@@ -2378,7 +2378,9 @@ static int kswapd(void *p)
 				 */
 				if (!sleeping_prematurely(pgdat, order, remaining)) {
 					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+					restore_pgdat_percpu_threshold(pgdat);
 					schedule();
+					reduce_pgdat_percpu_threshold(pgdat);
 				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2417,16 +2419,17 @@ void wakeup_kswapd(struct zone *zone, int order)
 	if (!populated_zone(zone))
 		return;
 
-	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
+	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
+	pgdat = zone->zone_pgdat;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
-	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
+	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+		return;
+
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 355a9e6..cafcc2d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
 
+static int calculate_pressure_threshold(struct zone *zone)
+{
+	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
+				num_online_cpus())));
+}
+
 static int calculate_threshold(struct zone *zone)
 {
 	int threshold;
@@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
 	}
 }
 
+void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_pressure_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
+void restore_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+}
+
 /*
  * For use when we know that interrupts are disabled.
  */
@@ -826,7 +870,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_nr_free_pages(zone),
+		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman @ 2010-10-27 20:16 ` Christoph Lameter 2010-10-28 1:09 ` KAMEZAWA Hiroyuki ` (2 subsequent siblings) 3 siblings, 0 replies; 40+ messages in thread From: Christoph Lameter @ 2010-10-27 20:16 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM Reviewed-by: Christoph Lameter <cl@linux.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman 2010-10-27 20:16 ` Christoph Lameter @ 2010-10-28 1:09 ` KAMEZAWA Hiroyuki 2010-10-28 9:49 ` Mel Gorman 2010-11-14 8:53 ` [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu KOSAKI Motohiro 2010-11-01 7:06 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low KOSAKI Motohiro 2010-11-26 16:06 ` Kyle McMartin 3 siblings, 2 replies; 40+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-10-28 1:09 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, LKML, Linux-MM On Wed, 27 Oct 2010 09:47:35 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when > memory is low] noted that watermarks were based on the vmstat > NR_FREE_PAGES. To avoid synchronization overhead, these counters are > maintained on a per-cpu basis and drained both periodically and when a > threshold is above a threshold. On large CPU systems, the difference > between the estimate and real value of NR_FREE_PAGES can be very high. > The system can get into a case where pages are allocated far below the > min watermark potentially causing livelock issues. The commit solved the > problem by taking a better reading of NR_FREE_PAGES when memory was low. > > Unfortunately, as reported by Shaohua Li this accurate reading can consume > a large amount of CPU time on systems with many sockets due to cache > line bouncing. This patch takes a different approach. For large machines > where counter drift might be unsafe and while kswapd is awake, the per-cpu > thresholds for the target pgdat are reduced to limit the level of drift > to what should be a safe level. This incurs a performance penalty in heavy > memory pressure by a factor that depends on the workload and the machine but > the machine should function correctly without accidentally exhausting all > memory on a node. There is an additional cost when kswapd wakes and sleeps > but the event is not expected to be frequent - in Shaohua's test case, > there was one recorded sleep and wake event at least. > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > is introduced that takes a more accurate reading of NR_FREE_PAGES when > called from wakeup_kswapd, when deciding whether it is really safe to go > back to sleep in sleeping_prematurely() and when deciding if a zone is > really balanced or not in balance_pgdat(). We are still using an expensive > function but limiting how often it is called. > > When the test case is reproduced, the time spent in the watermark functions > is reduced. The following report is on the percentage of time spent > cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), > __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), > zone_page_state(). 
> > vanilla 11.6615% > disable-threshold 0.2584% > > Reported-by: Shaohua Li <shaohua.li@intel.com> > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > --- > include/linux/mmzone.h | 10 +++------- > include/linux/vmstat.h | 5 +++++ > mm/mmzone.c | 21 --------------------- > mm/page_alloc.c | 35 +++++++++++++++++++++++++++-------- > mm/vmscan.c | 23 +++++++++++++---------- > mm/vmstat.c | 46 +++++++++++++++++++++++++++++++++++++++++++++- > 6 files changed, 93 insertions(+), 47 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 3984c4e..8d789d7 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone) > return test_bit(ZONE_OOM_LOCKED, &zone->flags); > } > > -#ifdef CONFIG_SMP > -unsigned long zone_nr_free_pages(struct zone *zone); > -#else > -#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES) > -#endif /* CONFIG_SMP */ > - > /* > * The "priority" of VM scanning is how much of the queues we will scan in one > * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the > @@ -651,7 +645,9 @@ typedef struct pglist_data { > extern struct mutex zonelists_mutex; > void build_all_zonelists(void *data); > void wakeup_kswapd(struct zone *zone, int order); > -int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > +bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > + int classzone_idx, int alloc_flags); > +bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > int classzone_idx, int alloc_flags); > enum memmap_context { > MEMMAP_EARLY, > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h > index eaaea37..e4cc21c 100644 > --- a/include/linux/vmstat.h > +++ b/include/linux/vmstat.h > @@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item); > extern void __dec_zone_state(struct zone *, enum zone_stat_item); > > void refresh_cpu_vm_stats(int); > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat); > +void restore_pgdat_percpu_threshold(pg_data_t *pgdat); > #else /* CONFIG_SMP */ > > /* > @@ -298,6 +300,9 @@ static inline void __dec_zone_page_state(struct page *page, > #define dec_zone_page_state __dec_zone_page_state > #define mod_zone_page_state __mod_zone_page_state > > +static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { } > +static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { } > + > static inline void refresh_cpu_vm_stats(int cpu) { } > #endif > > diff --git a/mm/mmzone.c b/mm/mmzone.c > index e35bfb8..f5b7d17 100644 > --- a/mm/mmzone.c > +++ b/mm/mmzone.c > @@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn, > return 1; > } > #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */ > - > -#ifdef CONFIG_SMP > -/* Called when a more accurate view of NR_FREE_PAGES is needed */ > -unsigned long zone_nr_free_pages(struct zone *zone) > -{ > - unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES); > - > - /* > - * While kswapd is awake, it is considered the zone is under some > - * memory pressure. Under pressure, there is a risk that > - * per-cpu-counter-drift will allow the min watermark to be breached > - * potentially causing a live-lock. 
While kswapd is awake and > - * free pages are low, get a better estimate for free pages > - */ > - if (nr_free_pages < zone->percpu_drift_mark && > - !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) > - return zone_page_state_snapshot(zone, NR_FREE_PAGES); > - > - return nr_free_pages; > -} > -#endif /* CONFIG_SMP */ > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index f12ad18..0286150 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1454,24 +1454,24 @@ static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) > #endif /* CONFIG_FAIL_PAGE_ALLOC */ > > /* > - * Return 1 if free pages are above 'mark'. This takes into account the order > + * Return true if free pages are above 'mark'. This takes into account the order > * of the allocation. > */ > -int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > - int classzone_idx, int alloc_flags) > +static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, > + int classzone_idx, int alloc_flags, long free_pages) > { > /* free_pages my go negative - that's OK */ > long min = mark; > - long free_pages = zone_nr_free_pages(z) - (1 << order) + 1; > int o; > > + free_pages -= (1 << order) + 1; > if (alloc_flags & ALLOC_HIGH) > min -= min / 2; > if (alloc_flags & ALLOC_HARDER) > min -= min / 4; > > if (free_pages <= min + z->lowmem_reserve[classzone_idx]) > - return 0; > + return false; > for (o = 0; o < order; o++) { > /* At the next order, this order's pages become unavailable */ > free_pages -= z->free_area[o].nr_free << o; > @@ -1480,9 +1480,28 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > min >>= 1; > > if (free_pages <= min) > - return 0; > + return false; > } > - return 1; > + return true; > +} > + > +bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, > + int classzone_idx, int alloc_flags) > +{ > + return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, > + zone_page_state(z, NR_FREE_PAGES)); > +} > + > +bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, > + int classzone_idx, int alloc_flags) > +{ > + long free_pages = zone_page_state(z, NR_FREE_PAGES); > + > + if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) > + free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); > + > + return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, > + free_pages); > } > > #ifdef CONFIG_NUMA > @@ -2436,7 +2455,7 @@ void show_free_areas(void) > " all_unreclaimable? %s" > "\n", > zone->name, > - K(zone_nr_free_pages(zone)), > + K(zone_page_state(zone, NR_FREE_PAGES)), > K(min_wmark_pages(zone)), > K(low_wmark_pages(zone)), > K(high_wmark_pages(zone)), > diff --git a/mm/vmscan.c b/mm/vmscan.c > index c5dfabf..3e71cb1 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining) > if (zone->all_unreclaimable) > continue; > > - if (!zone_watermark_ok(zone, order, high_wmark_pages(zone), > + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), > 0, 0)) > return 1; > } > @@ -2169,7 +2169,7 @@ loop_again: > shrink_active_list(SWAP_CLUSTER_MAX, zone, > &sc, priority, 0); > > - if (!zone_watermark_ok(zone, order, > + if (!zone_watermark_ok_safe(zone, order, > high_wmark_pages(zone), 0, 0)) { > end_zone = i; > break; > @@ -2215,7 +2215,7 @@ loop_again: > * We put equal pressure on every zone, unless one > * zone has way too many pages free already. 
> */ > - if (!zone_watermark_ok(zone, order, > + if (!zone_watermark_ok_safe(zone, order, > 8*high_wmark_pages(zone), end_zone, 0)) > shrink_zone(priority, zone, &sc); > reclaim_state->reclaimed_slab = 0; > @@ -2236,7 +2236,7 @@ loop_again: > total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) > sc.may_writepage = 1; > > - if (!zone_watermark_ok(zone, order, > + if (!zone_watermark_ok_safe(zone, order, > high_wmark_pages(zone), end_zone, 0)) { > all_zones_ok = 0; > /* > @@ -2244,7 +2244,7 @@ loop_again: > * means that we have a GFP_ATOMIC allocation > * failure risk. Hurry up! > */ > - if (!zone_watermark_ok(zone, order, > + if (!zone_watermark_ok_safe(zone, order, > min_wmark_pages(zone), end_zone, 0)) > has_under_min_watermark_zone = 1; > } > @@ -2378,7 +2378,9 @@ static int kswapd(void *p) > */ > if (!sleeping_prematurely(pgdat, order, remaining)) { > trace_mm_vmscan_kswapd_sleep(pgdat->node_id); > + restore_pgdat_percpu_threshold(pgdat); > schedule(); > + reduce_pgdat_percpu_threshold(pgdat); > } else { > if (remaining) > count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); > @@ -2417,16 +2419,17 @@ void wakeup_kswapd(struct zone *zone, int order) > if (!populated_zone(zone)) > return; > > - pgdat = zone->zone_pgdat; > - if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) > + if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) > return; > + pgdat = zone->zone_pgdat; > if (pgdat->kswapd_max_order < order) > pgdat->kswapd_max_order = order; > - trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); > - if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) > - return; > if (!waitqueue_active(&pgdat->kswapd_wait)) > return; > + if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0)) > + return; > + > + trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); > wake_up_interruptible(&pgdat->kswapd_wait); > } > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 355a9e6..cafcc2d 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat); > > #ifdef CONFIG_SMP > > +static int calculate_pressure_threshold(struct zone *zone) > +{ > + return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) / > + num_online_cpus()))); > +} > + Could you add background theory of this calculation as a comment to show the difference with calculate_threshold() ? And don't we need to have "max=125" thresh here ? > static int calculate_threshold(struct zone *zone) > { > int threshold; > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void) > } > } > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + int i; > + get_online_cpus(); > + for (i = 0; i < pgdat->nr_zones; i++) { > + zone = &pgdat->node_zones[i]; > + if (!zone->percpu_drift_mark) > + continue; > + > + threshold = calculate_pressure_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > + } put_online_cpus(); > +} > + > +void restore_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + int i; > + get_online_cpus(); > + for (i = 0; i < pgdat->nr_zones; i++) { > + zone = &pgdat->node_zones[i]; > + if (!zone->percpu_drift_mark) > + continue; > + > + threshold = calculate_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > + } put_online_cpus(); Thanks, -Kame ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-28 1:09 ` KAMEZAWA Hiroyuki
@ 2010-10-28 9:49 ` Mel Gorman
  2010-10-28 9:58 ` KAMEZAWA Hiroyuki
  2010-11-14 8:53 ` [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu KOSAKI Motohiro
  1 sibling, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2010-10-28 9:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter,
	David Rientjes, LKML, Linux-MM

On Thu, Oct 28, 2010 at 10:09:20AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 27 Oct 2010 09:47:35 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when
> > memory is low] noted that watermarks were based on the vmstat
> > NR_FREE_PAGES. To avoid synchronization overhead, these counters are
> > maintained on a per-cpu basis and drained both periodically and when
> > the per-cpu delta rises above a threshold. On systems with many CPUs,
> > the difference between the estimated and real value of NR_FREE_PAGES
> > can be very high. The system can get into a state where pages are
> > allocated far below the min watermark, potentially causing livelock
> > issues. The commit solved the problem by taking a better reading of
> > NR_FREE_PAGES when memory was low.
> > 
> > <SNIP>
> > 
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 355a9e6..cafcc2d 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat);
> >  
> >  #ifdef CONFIG_SMP
> >  
> > +static int calculate_pressure_threshold(struct zone *zone)
> > +{
> > +	return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
> > +				num_online_cpus())));
> > +}
> > +
> 
> Could you add background theory of this calculation as a comment to
> show the difference with calculate_threshold() ?
> 

Sure. When writing it, I realised that the calculations here differ from
what percpu_drift_mark does. This is what I currently have

int calculate_pressure_threshold(struct zone *zone)
{
	int threshold;
	int watermark_distance;

	/*
	 * As vmstats are not up to date, there is drift between the estimated
	 * and real values. For high thresholds and a high number of CPUs, it
	 * is possible for the min watermark to be breached while the estimated
	 * value looks fine. The pressure threshold is a reduced value such
	 * that even the maximum amount of drift will not accidentally breach
	 * the min watermark
	 */
	watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
	threshold = max(1, watermark_distance / num_online_cpus());

	/*
	 * Maximum threshold is 125
	 */
	threshold = min(125, threshold);

	return threshold;
}

Is this better?

> And don't we need to have "max=125" thresh here ?
> 

Yes.

> > static int calculate_threshold(struct zone *zone)
> > {
> > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
> >  	}
> >  }
> >  
> > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +	int i;
> > +
> 
> get_online_cpus();
> 

Also correct. Thanks very much. I'm revising the series.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-28 9:49 ` Mel Gorman @ 2010-10-28 9:58 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 40+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-10-28 9:58 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, LKML, Linux-MM On Thu, 28 Oct 2010 10:49:03 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > On Thu, Oct 28, 2010 at 10:09:20AM +0900, KAMEZAWA Hiroyuki wrote: > > On Wed, 27 Oct 2010 09:47:35 +0100 > > Mel Gorman <mel@csn.ul.ie> wrote: > > > > > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when > > > memory is low] noted that watermarks were based on the vmstat > > > NR_FREE_PAGES. To avoid synchronization overhead, these counters are > > > maintained on a per-cpu basis and drained both periodically and when a > > > threshold is above a threshold. On large CPU systems, the difference > > > between the estimate and real value of NR_FREE_PAGES can be very high. > > > The system can get into a case where pages are allocated far below the > > > min watermark potentially causing livelock issues. The commit solved the > > > problem by taking a better reading of NR_FREE_PAGES when memory was low. > > > > > > <SNIP> > > > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > index 355a9e6..cafcc2d 100644 > > > --- a/mm/vmstat.c > > > +++ b/mm/vmstat.c > > > @@ -81,6 +81,12 @@ EXPORT_SYMBOL(vm_stat); > > > > > > #ifdef CONFIG_SMP > > > > > > +static int calculate_pressure_threshold(struct zone *zone) > > > +{ > > > + return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) / > > > + num_online_cpus()))); > > > +} > > > + > > > > Could you add background theory of this calculation as a comment to > > show the difference with calculate_threshold() ? > > > > Sure. When writing it, I realised that the calculations here differ from > what percpu_drift_mark does. This is what I currently have > > int calculate_pressure_threshold(struct zone *zone) > { > int threshold; > int watermark_distance; > > /* > * As vmstats are not up to date, there is drift between the estimated > * and real values. For high thresholds and a high number of CPUs, it > * is possible for the min watermark to be breached while the estimated > * value looks fine. The pressure threshold is a reduced value such > * that even the maximum amount of drift will not accidentally breach > * the min watermark > */ > watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone); > threshold = max(1, watermark_distance / num_online_cpus()); > > /* > * Maximum threshold is 125 > */ > threshold = min(125, threshold); > > return threshold; > } > > Is this better? > sounds nice. Regards, -Kame ^ permalink raw reply [flat|nested] 40+ messages in thread
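[To make the revised calculation concrete with made-up numbers: for a
zone with min_wmark_pages() = 1024 and low_wmark_pages() = 1280 on a
machine with 64 online CPUs, watermark_distance = 1280 - 1024 = 256
pages and threshold = max(1, 256 / 64) = 4, comfortably under the 125
cap. Each CPU may then accumulate at most a 4-page delta before folding
it into the global counter, so the worst-case drift across all 64 CPUs
is 4 * 64 = 256 pages - exactly the low-to-min distance - and an
estimate that still looks acceptable at the low watermark cannot hide a
real value below the min watermark. Note that the expression as first
posted, max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) /
num_online_cpus()))), divides only low_wmark_pages() by the CPU count
because of the parenthesis placement; the revised version computes the
low-to-min distance first and divides that.]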
* [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu
  2010-10-28 1:09 ` KAMEZAWA Hiroyuki
  2010-10-28 9:49 ` Mel Gorman
@ 2010-11-14 8:53 ` KOSAKI Motohiro
  2010-11-15 10:26 ` Mel Gorman
  2010-11-17 0:07 ` Andrew Morton
  1 sibling, 2 replies; 40+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14 8:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: kosaki.motohiro, Mel Gorman, Andrew Morton, Shaohua Li,
	Christoph Lameter, David Rientjes, LKML, Linux-MM

> > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void)
> >  	}
> >  }
> > 
> > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +	int i;
> > +
> 
> get_online_cpus();

This caused the following runtime warning, but I don't think there is a
real lock inversion here.

=================================
[ INFO: inconsistent lock state ]
2.6.37-rc1-mm1+ #150
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60
{RECLAIM_FS-ON-W} state was registered at:
  [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0
  [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100
  [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0
  [<ffffffff812eea10>] idr_pre_get+0x60/0x90
  [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0
  [<ffffffff8106ebf5>] create_worker+0x55/0x190
  [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235
  [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0
  [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10
  [<ffffffff81051f30>] __cpu_notify+0x20/0x40
  [<ffffffff8150bff7>] _cpu_up+0x73/0x113
  [<ffffffff8150c175>] cpu_up+0xde/0xf1
  [<ffffffff81dcc81d>] kernel_init+0x21b/0x342
  [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
irq event stamp: 27
hardirqs last enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80
hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0
softirqs last enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0
softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0

other info that might help us debug this:
no locks held by kswapd0/419.

stack backtrace:
Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150
Call Trace:
 [<ffffffff810890b1>] print_usage_bug+0x171/0x180
 [<ffffffff8108a057>] mark_lock+0x377/0x450
 [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0
 [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
 [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
 [<ffffffff8108bf94>] lock_acquire+0xb4/0x150
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90
 [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
 [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60
 [<ffffffff810520d1>] get_online_cpus+0x41/0x60
 [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0
 [<ffffffff81113970>] ? calculate_normal_threshold+0x0/0x60
 [<ffffffff8110b552>] kswapd+0x1f2/0x360
 [<ffffffff81074180>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8110b360>] ? kswapd+0x0/0x360
 [<ffffffff81073ae6>] kthread+0xa6/0xb0
 [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
 [<ffffffff81515710>] ? restore_args+0x0/0x30
 [<ffffffff81073a40>] ? kthread+0x0/0xb0
 [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10

I think we have two options: 1) call lockdep_clear_current_reclaim_state()
every time, or 2) use for_each_possible_cpu instead of for_each_online_cpu.

The following patch uses (2) because removing get_online_cpus() has a
good side effect: it reduces the potential cpu-hotplug vs memory-shortage
deadlock risk.

-------------------------------------------------------------------------
>From 74b809353c42a440d0bac6b83ac84281299bb09e Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Fri, 3 Dec 2010 20:21:40 +0900
Subject: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu

This patch fixes the following lockdep warning.

=================================
[ INFO: inconsistent lock state ]
2.6.37-rc1-mm1+ #150
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60
{RECLAIM_FS-ON-W} state was registered at:
  [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0
  [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100
  [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0
  [<ffffffff812eea10>] idr_pre_get+0x60/0x90
  [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0
  [<ffffffff8106ebf5>] create_worker+0x55/0x190
  [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235
  [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0
  [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10
  [<ffffffff81051f30>] __cpu_notify+0x20/0x40
  [<ffffffff8150bff7>] _cpu_up+0x73/0x113
  [<ffffffff8150c175>] cpu_up+0xde/0xf1
  [<ffffffff81dcc81d>] kernel_init+0x21b/0x342
  [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
irq event stamp: 27
hardirqs last enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80
hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0
softirqs last enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0
softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0

other info that might help us debug this:
no locks held by kswapd0/419.

stack backtrace:
Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150
Call Trace:
 [<ffffffff810890b1>] print_usage_bug+0x171/0x180
 [<ffffffff8108a057>] mark_lock+0x377/0x450
 [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0
 [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
 [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
 [<ffffffff8108bf94>] lock_acquire+0xb4/0x150
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90
 [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150
 [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60
 [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8107af0f>] ? local_clock+0x6f/0x80
 [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60
 [<ffffffff810520d1>] get_online_cpus+0x41/0x60
 [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0
 [<ffffffff81113970>] ? calculate_normal_threshold+0x0/0x60
 [<ffffffff8110b552>] kswapd+0x1f2/0x360
 [<ffffffff81074180>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8110b360>] ? kswapd+0x0/0x360
 [<ffffffff81073ae6>] kthread+0xa6/0xb0
 [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
 [<ffffffff81515710>] ? restore_args+0x0/0x30
 [<ffffffff81073a40>] ? kthread+0x0/0xb0
 [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmstat.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2ab01f2..ca2d3be 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -193,18 +193,16 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
 	int threshold;
 	int i;
 
-	get_online_cpus();
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		zone = &pgdat->node_zones[i];
 		if (!zone->percpu_drift_mark)
 			continue;
 
 		threshold = (*calculate_pressure)(zone);
-		for_each_online_cpu(cpu)
+		for_each_possible_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
 	}
-	put_online_cpus();
 }
 
 /*
-- 
1.6.5.2

^ permalink raw reply related	[flat|nested] 40+ messages in thread
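[For comparison, option (1) above - keeping get_online_cpus() but telling
lockdep that the threshold update is not part of reclaim - was never
posted as a patch. The sketch below shows roughly what it might have
looked like, assuming the function is only ever called from kswapd,
which, as noted later in the thread, marks itself with
lockdep_set_current_reclaim_state(GFP_KERNEL) at startup. This is the
rejected alternative, not what was merged.]

/*
 * Sketch of rejected option (1): drop the caller's lockdep reclaim
 * marking around the cpu-hotplug lock so that taking cpu_hotplug.lock
 * here is not flagged as happening in reclaim context, then restore
 * the marking afterwards. Assumes the caller is kswapd and therefore
 * marked with GFP_KERNEL.
 */
void set_pgdat_percpu_threshold(pg_data_t *pgdat,
				int (*calculate_pressure)(struct zone *))
{
	struct zone *zone;
	int cpu;
	int threshold;
	int i;

	lockdep_clear_current_reclaim_state();
	get_online_cpus();
	for (i = 0; i < pgdat->nr_zones; i++) {
		zone = &pgdat->node_zones[i];
		if (!zone->percpu_drift_mark)
			continue;

		threshold = (*calculate_pressure)(zone);
		for_each_online_cpu(cpu)
			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
							= threshold;
	}
	put_online_cpus();
	lockdep_set_current_reclaim_state(GFP_KERNEL);
}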
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu 2010-11-14 8:53 ` [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu KOSAKI Motohiro @ 2010-11-15 10:26 ` Mel Gorman 2010-11-15 14:04 ` Christoph Lameter 2010-11-17 0:07 ` Andrew Morton 1 sibling, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-11-15 10:26 UTC (permalink / raw) To: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki, Andrew Morton, Shaohua Li, Christoph Lameter, David Rientjes, LKML, Linux-MM On Sun, Nov 14, 2010 at 05:53:03PM +0900, KOSAKI Motohiro wrote: > > > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void) > > > } > > > } > > > > > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) > > > +{ > > > + struct zone *zone; > > > + int cpu; > > > + int threshold; > > > + int i; > > > + > > > > get_online_cpus(); > > > This caused following runtime warnings. but I don't think here is > real lock inversion. > > ================================= > [ INFO: inconsistent lock state ] > 2.6.37-rc1-mm1+ #150 > --------------------------------- > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes: > (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60 > {RECLAIM_FS-ON-W} state was registered at: > [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0 > [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100 > [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0 > [<ffffffff812eea10>] idr_pre_get+0x60/0x90 > [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0 > [<ffffffff8106ebf5>] create_worker+0x55/0x190 > [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235 > [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0 > [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10 > [<ffffffff81051f30>] __cpu_notify+0x20/0x40 > [<ffffffff8150bff7>] _cpu_up+0x73/0x113 > [<ffffffff8150c175>] cpu_up+0xde/0xf1 > [<ffffffff81dcc81d>] kernel_init+0x21b/0x342 > [<ffffffff81003724>] kernel_thread_helper+0x4/0x10 > irq event stamp: 27 > hardirqs last enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80 > hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0 > softirqs last enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0 > softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0 > > other info that might help us debug this: > no locks held by kswapd0/419. > > stack backtrace: > Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150 > Call Trace: > [<ffffffff810890b1>] print_usage_bug+0x171/0x180 > [<ffffffff8108a057>] mark_lock+0x377/0x450 > [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0 > [<ffffffff8107af0f>] ? local_clock+0x6f/0x80 > [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150 > [<ffffffff8108bf94>] lock_acquire+0xb4/0x150 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90 > [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10 > [<ffffffff8107af0f>] ? local_clock+0x6f/0x80 > [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60 > [<ffffffff810520d1>] get_online_cpus+0x41/0x60 > [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0 > [<ffffffff81113970>] ? calculate_normal_threshold+0x0/0x60 > [<ffffffff8110b552>] kswapd+0x1f2/0x360 > [<ffffffff81074180>] ? 
autoremove_wake_function+0x0/0x40
> [<ffffffff8110b360>] ? kswapd+0x0/0x360
> [<ffffffff81073ae6>] kthread+0xa6/0xb0
> [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
> [<ffffffff81515710>] ? restore_args+0x0/0x30
> [<ffffffff81073a40>] ? kthread+0x0/0xb0
> [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10
> 
> I think we have two options: 1) call lockdep_clear_current_reclaim_state()
> every time, or 2) use for_each_possible_cpu instead of for_each_online_cpu.
> 
> The following patch uses (2) because removing get_online_cpus() has a
> good side effect: it reduces the potential cpu-hotplug vs memory-shortage
> deadlock risk.
> 

With recent per-cpu allocator changes, are we guaranteed that the per-cpu
structures exist and are valid?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu 2010-11-15 10:26 ` Mel Gorman @ 2010-11-15 14:04 ` Christoph Lameter 2010-11-16 9:58 ` Mel Gorman 0 siblings, 1 reply; 40+ messages in thread From: Christoph Lameter @ 2010-11-15 14:04 UTC (permalink / raw) To: Mel Gorman Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Shaohua Li, David Rientjes, LKML, Linux-MM, Tejun Heo On Mon, 15 Nov 2010, Mel Gorman wrote: > With recent per-cpu allocator changes, are we guaranteed that the per-cpu > structures exist and are valid? We always guarantee that all per cpu areas for all possible cpus exist. That has always been the case. There was a discussion about changing that though. Could be difficult given the need for additional locking. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu
  2010-11-15 14:04 ` Christoph Lameter
@ 2010-11-16 9:58 ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2010-11-16 9:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrew Morton, Shaohua Li,
	David Rientjes, LKML, Linux-MM, Tejun Heo

On Mon, Nov 15, 2010 at 08:04:23AM -0600, Christoph Lameter wrote:
> On Mon, 15 Nov 2010, Mel Gorman wrote:
> 
> > With recent per-cpu allocator changes, are we guaranteed that the per-cpu
> > structures exist and are valid?
> 
> We always guarantee that all per cpu areas for all possible cpus exist.
> That has always been the case. There was a discussion about changing
> that though. Could be difficult given the need for additional locking.
> 

In that case, I do not have any more concerns about the patch. It's
unfortunate that more per-cpu structures will have to be updated but I
doubt it'll be noticeable.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu 2010-11-14 8:53 ` [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu KOSAKI Motohiro 2010-11-15 10:26 ` Mel Gorman @ 2010-11-17 0:07 ` Andrew Morton 2010-11-19 15:29 ` Christoph Lameter 2010-11-23 8:32 ` KOSAKI Motohiro 1 sibling, 2 replies; 40+ messages in thread From: Andrew Morton @ 2010-11-17 0:07 UTC (permalink / raw) To: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki, Mel Gorman, Shaohua Li, Christoph Lameter, David Rientjes, LKML, Linux-MM On Sun, 14 Nov 2010 17:53:03 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote: > > > @@ -159,6 +165,44 @@ static void refresh_zone_stat_thresholds(void) > > > } > > > } > > > > > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) > > > +{ > > > + struct zone *zone; > > > + int cpu; > > > + int threshold; > > > + int i; > > > + > > > > get_online_cpus(); > > > This caused following runtime warnings. but I don't think here is > real lock inversion. > > ================================= > [ INFO: inconsistent lock state ] > 2.6.37-rc1-mm1+ #150 > --------------------------------- > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. > kswapd0/419 [HC0[0]:SC0[0]:HE1:SE1] takes: > (cpu_hotplug.lock){+.+.?.}, at: [<ffffffff810520d1>] get_online_cpus+0x41/0x60 > {RECLAIM_FS-ON-W} state was registered at: > [<ffffffff8108a1a3>] mark_held_locks+0x73/0xa0 > [<ffffffff8108a296>] lockdep_trace_alloc+0xc6/0x100 > [<ffffffff8113fba9>] kmem_cache_alloc+0x39/0x2b0 > [<ffffffff812eea10>] idr_pre_get+0x60/0x90 > [<ffffffff812ef5b7>] ida_pre_get+0x27/0xf0 > [<ffffffff8106ebf5>] create_worker+0x55/0x190 > [<ffffffff814fb4f4>] workqueue_cpu_callback+0xbc/0x235 > [<ffffffff8151934c>] notifier_call_chain+0x8c/0xe0 > [<ffffffff8107a34e>] __raw_notifier_call_chain+0xe/0x10 > [<ffffffff81051f30>] __cpu_notify+0x20/0x40 > [<ffffffff8150bff7>] _cpu_up+0x73/0x113 > [<ffffffff8150c175>] cpu_up+0xde/0xf1 > [<ffffffff81dcc81d>] kernel_init+0x21b/0x342 > [<ffffffff81003724>] kernel_thread_helper+0x4/0x10 > irq event stamp: 27 > hardirqs last enabled at (27): [<ffffffff815152c0>] _raw_spin_unlock_irqrestore+0x40/0x80 > hardirqs last disabled at (26): [<ffffffff81514982>] _raw_spin_lock_irqsave+0x32/0xa0 > softirqs last enabled at (20): [<ffffffff810614c4>] del_timer_sync+0x54/0xa0 > softirqs last disabled at (18): [<ffffffff8106148c>] del_timer_sync+0x1c/0xa0 > > other info that might help us debug this: > no locks held by kswapd0/419. > > stack backtrace: > Pid: 419, comm: kswapd0 Not tainted 2.6.37-rc1-mm1+ #150 > Call Trace: > [<ffffffff810890b1>] print_usage_bug+0x171/0x180 > [<ffffffff8108a057>] mark_lock+0x377/0x450 > [<ffffffff8108ab67>] __lock_acquire+0x267/0x15e0 > [<ffffffff8107af0f>] ? local_clock+0x6f/0x80 > [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150 > [<ffffffff8108bf94>] lock_acquire+0xb4/0x150 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff81512cf4>] __mutex_lock_common+0x44/0x3f0 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff810744f0>] ? prepare_to_wait+0x60/0x90 > [<ffffffff81086789>] ? trace_hardirqs_off_caller+0x29/0x150 > [<ffffffff810520d1>] ? get_online_cpus+0x41/0x60 > [<ffffffff810868bd>] ? trace_hardirqs_off+0xd/0x10 > [<ffffffff8107af0f>] ? local_clock+0x6f/0x80 > [<ffffffff815131a8>] mutex_lock_nested+0x48/0x60 > [<ffffffff810520d1>] get_online_cpus+0x41/0x60 > [<ffffffff811138b2>] set_pgdat_percpu_threshold+0x22/0xe0 > [<ffffffff81113970>] ? 
calculate_normal_threshold+0x0/0x60
> [<ffffffff8110b552>] kswapd+0x1f2/0x360
> [<ffffffff81074180>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff8110b360>] ? kswapd+0x0/0x360
> [<ffffffff81073ae6>] kthread+0xa6/0xb0
> [<ffffffff81003724>] kernel_thread_helper+0x4/0x10
> [<ffffffff81515710>] ? restore_args+0x0/0x30
> [<ffffffff81073a40>] ? kthread+0x0/0xb0
> [<ffffffff81003720>] ? kernel_thread_helper+0x0/0x10

Well what's actually happening here?  Where is the alleged deadlock?

In the kernel_init() case we have a GFP_KERNEL allocation inside
get_online_cpus().  In the other case we simply have kswapd calling
get_online_cpus(), yes?

Does lockdep consider all kswapd actions to be "in reclaim context"?
If so, why?

> > I think we have two options: 1) call lockdep_clear_current_reclaim_state()
> > every time, or 2) use for_each_possible_cpu instead of for_each_online_cpu.
> > 
> > The following patch uses (2) because removing get_online_cpus() has a
> > good side effect: it reduces the potential cpu-hotplug vs memory-shortage
> > deadlock risk.

Well.  Being able to run for_each_online_cpu() is a pretty low-level
and fundamental thing.  It's something we're likely to want to do more
and more of as time passes.  It seems a bad thing to tell ourselves
that we cannot use it in reclaim context.  That blots out large chunks
of filesystem and IO-layer code as well!

> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -193,18 +193,16 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
>  	int threshold;
>  	int i;
>  
> -	get_online_cpus();
>  	for (i = 0; i < pgdat->nr_zones; i++) {
>  		zone = &pgdat->node_zones[i];
>  		if (!zone->percpu_drift_mark)
>  			continue;
>  
>  		threshold = (*calculate_pressure)(zone);
> -		for_each_online_cpu(cpu)
> +		for_each_possible_cpu(cpu)
>  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>  							= threshold;
>  	}
> -	put_online_cpus();
>  }

That's a pretty sad change IMO, especially if num_possible_cpus is much
larger than num_online_cpus.

What do we need to do to make get_online_cpus() safe to use in reclaim
context?  (And in kswapd context, if that's really equivalent to
"reclaim context").

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu 2010-11-17 0:07 ` Andrew Morton @ 2010-11-19 15:29 ` Christoph Lameter 2010-11-23 8:32 ` KOSAKI Motohiro 1 sibling, 0 replies; 40+ messages in thread From: Christoph Lameter @ 2010-11-19 15:29 UTC (permalink / raw) To: Andrew Morton Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Mel Gorman, Shaohua Li, David Rientjes, LKML, Linux-MM On Tue, 16 Nov 2010, Andrew Morton wrote: > > Following patch use (2) beucase removing get_online_cpus() makes good > > side effect. It reduce potentially cpu-hotplug vs memory-shortage deadlock > > risk. > > Well. Being able to run for_each_online_cpu() is a pretty low-level > and fundamental thing. It's something we're likely to want to do more > and more of as time passes. It seems a bad thing to tell ourselves > that we cannot use it in reclaim context. That blots out large chunks > of filesystem and IO-layer code as well! The online map can change if no locks were taken. Thus it becomes something difficult to do in some code paths and overhead increases significantly. > > threshold = (*calculate_pressure)(zone); > > - for_each_online_cpu(cpu) > > + for_each_possible_cpu(cpu) > > per_cpu_ptr(zone->pageset, cpu)->stat_threshold > > = threshold; > > } > > - put_online_cpus(); > > } > > That's a pretty sad change IMO, especially of num_possible_cpus is much > larger than num_online_cpus. num_possible_cpus should only be higher if the arch code has detected that the system has the ability to physically online and offline cpus. I have never actually seen such a system. Heard rumors from Fujitsu that they have something. Maybe the virtualization people also need this? Otherwise cpu online/offline is useful mainly to debug the cpu offline/online handling in various subsystems which is unsurprisingly often buggy given the rarity of encountering such hardware. > What do we need to do to make get_online_cpus() safe to use in reclaim > context? (And in kswapd context, if that's really equivalent to > "reclaim context"). I think its not worth the effort. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH] set_pgdat_percpu_threshold() don't use for_each_online_cpu
  2010-11-17 0:07 ` Andrew Morton
  2010-11-19 15:29 ` Christoph Lameter
@ 2010-11-23 8:32 ` KOSAKI Motohiro
  1 sibling, 0 replies; 40+ messages in thread
From: KOSAKI Motohiro @ 2010-11-23 8:32 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Mel Gorman, Shaohua Li,
	Christoph Lameter, David Rientjes, LKML, Linux-MM

Sorry for the delay.

> Well what's actually happening here?  Where is the alleged deadlock?
> 
> In the kernel_init() case we have a GFP_KERNEL allocation inside
> get_online_cpus().  In the other case we simply have kswapd calling
> get_online_cpus(), yes?

Yes.

> Does lockdep consider all kswapd actions to be "in reclaim context"?
> If so, why?

kswapd calls lockdep_set_current_reclaim_state() when the thread starts.
See below.

----------------------------------------------------------------------
static int kswapd(void *p)
{
	unsigned long order;
	pg_data_t *pgdat = (pg_data_t*)p;
	struct task_struct *tsk = current;
	struct reclaim_state reclaim_state = {
		.reclaimed_slab = 0,
	};
	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

	lockdep_set_current_reclaim_state(GFP_KERNEL);
	......
----------------------------------------------------------------------

> > I think we have two options: 1) call lockdep_clear_current_reclaim_state()
> > every time, or 2) use for_each_possible_cpu instead of for_each_online_cpu.
> > 
> > The following patch uses (2) because removing get_online_cpus() has a
> > good side effect: it reduces the potential cpu-hotplug vs memory-shortage
> > deadlock risk.
> 
> Well.  Being able to run for_each_online_cpu() is a pretty low-level
> and fundamental thing.  It's something we're likely to want to do more
> and more of as time passes.  It seems a bad thing to tell ourselves
> that we cannot use it in reclaim context.  That blots out large chunks
> of filesystem and IO-layer code as well!
> 
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -193,18 +193,16 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
> >  	int threshold;
> >  	int i;
> >  
> > -	get_online_cpus();
> >  	for (i = 0; i < pgdat->nr_zones; i++) {
> >  		zone = &pgdat->node_zones[i];
> >  		if (!zone->percpu_drift_mark)
> >  			continue;
> >  
> >  		threshold = (*calculate_pressure)(zone);
> > -		for_each_online_cpu(cpu)
> > +		for_each_possible_cpu(cpu)
> >  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> >  							= threshold;
> >  	}
> > -	put_online_cpus();
> >  }
> 
> That's a pretty sad change IMO, especially if num_possible_cpus is much
> larger than num_online_cpus.

As far as I know, CPU hotplug is used in the server area, and almost all
servers have ACPI or a similar flexible firmware interface, so
num_possible_cpus is not much bigger than the actual number of sockets.
IOW, I haven't heard of embedded people using cpu hotplug. If you have,
please let me know.

> What do we need to do to make get_online_cpus() safe to use in reclaim
> context?  (And in kswapd context, if that's really equivalent to
> "reclaim context").

Hmm... It's too hard. kmalloc() is called from everywhere and cpu hotplug
can happen at any time, so any lock design breaks your requested rule. ;)

And again, _now_ I don't think for_each_possible_cpu() is very costly.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman 2010-10-27 20:16 ` Christoph Lameter 2010-10-28 1:09 ` KAMEZAWA Hiroyuki @ 2010-11-01 7:06 ` KOSAKI Motohiro 2010-11-26 16:06 ` Kyle McMartin 3 siblings, 0 replies; 40+ messages in thread From: KOSAKI Motohiro @ 2010-11-01 7:06 UTC (permalink / raw) To: Mel Gorman Cc: kosaki.motohiro, Andrew Morton, Shaohua Li, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > is introduced that takes a more accurate reading of NR_FREE_PAGES when > called from wakeup_kswapd, when deciding whether it is really safe to go > back to sleep in sleeping_prematurely() and when deciding if a zone is > really balanced or not in balance_pgdat(). We are still using an expensive > function but limiting how often it is called. > > When the test case is reproduced, the time spent in the watermark functions > is reduced. The following report is on the percentage of time > cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), > __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), > zone_page_state(). > > vanilla 11.6615% > disable-threshold 0.2584% > > Reported-by: Shaohua Li <shaohua.li@intel.com> > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Except for the piece kamezawa-san pointed out, Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman ` (2 preceding siblings ...) 2010-11-01 7:06 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low KOSAKI Motohiro @ 2010-11-26 16:06 ` Kyle McMartin 2010-11-29 9:56 ` Mel Gorman 3 siblings, 1 reply; 40+ messages in thread From: Kyle McMartin @ 2010-11-26 16:06 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Wed, Oct 27, 2010 at 09:47:35AM +0100, Mel Gorman wrote: ><snip> > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > is introduced that takes a more accurate reading of NR_FREE_PAGES when > called from wakeup_kswapd, when deciding whether it is really safe to go > back to sleep in sleeping_prematurely() and when deciding if a zone is > really balanced or not in balance_pgdat(). We are still using an expensive > function but limiting how often it is called. ><snip> > Reported-by: Shaohua Li <shaohua.li@intel.com> > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Hi Mel, I notice these aren't flagged for stable; should they be? (They fairly trivially apply and compile on 2.6.36, barring the trace_ points which changed.) I've got a few bug reports against .36/.37 where kswapd has been sleeping for 60s+. I built them some kernels with these patches, but haven't heard back yet as to whether they fix things for them. Thanks for any insight, Kyle ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-26 16:06 ` Kyle McMartin @ 2010-11-29 9:56 ` Mel Gorman 2010-11-29 13:16 ` Kyle McMartin 0 siblings, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-11-29 9:56 UTC (permalink / raw) To: Kyle McMartin Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Fri, Nov 26, 2010 at 11:06:19AM -0500, Kyle McMartin wrote: > On Wed, Oct 27, 2010 at 09:47:35AM +0100, Mel Gorman wrote: > ><snip> > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > > is introduced that takes a more accurate reading of NR_FREE_PAGES when > > called from wakeup_kswapd, when deciding whether it is really safe to go > > back to sleep in sleeping_prematurely() and when deciding if a zone is > > really balanced or not in balance_pgdat(). We are still using an expensive > > function but limiting how often it is called. > ><snip> > > Reported-by: Shaohua Li <shaohua.li@intel.com> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > Hi Mel, > > I notice these aren't flagged for stable; should they be? (They fairly > trivially apply and compile on 2.6.36, barring the trace_ points which > changed.) They were not flagged for stable because they fixed a performance problem rather than a functional bug, and they affected a limited number of machines. Should that decision be revisited? > I've got a few bug reports against .36/.37 where kswapd has > been sleeping for 60s+. > I do not believe these patches would affect kswapd sleeping for 60s. > I built them some kernels with these patches, but haven't heard back yet > as to whether they fix things for them. > > Thanks for any insight, Can you point me at a relevant bugzilla entry or forward me the bug report and I'll take a look? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 9:56 ` Mel Gorman @ 2010-11-29 13:16 ` Kyle McMartin 2010-11-29 15:08 ` Mel Gorman 0 siblings, 1 reply; 40+ messages in thread From: Kyle McMartin @ 2010-11-29 13:16 UTC (permalink / raw) To: Mel Gorman Cc: Kyle McMartin, Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 09:56:19AM +0000, Mel Gorman wrote: > Can you point me at a relevant bugzilla entry or forward me the bug report > and I'll take a look? > https://bugzilla.redhat.com/show_bug.cgi?id=649694 Thanks, Kyle ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 13:16 ` Kyle McMartin @ 2010-11-29 15:08 ` Mel Gorman 2010-11-29 15:22 ` Kyle McMartin 0 siblings, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-11-29 15:08 UTC (permalink / raw) To: Kyle McMartin Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 08:16:26AM -0500, Kyle McMartin wrote: > On Mon, Nov 29, 2010 at 09:56:19AM +0000, Mel Gorman wrote: > > Can you point me at a relevant bugzilla entry or forward me the bug report > > and I'll take a look? > > > > https://bugzilla.redhat.com/show_bug.cgi?id=649694 > Ouch! I have been unable to create an exact copy of your kernel source as I'm not running Fedora. From a partial conversion of a source RPM, I saw no changes related to mm/vmscan.c. Is this accurate? I'm trying to establish if this is a mainline bug as well. Second, I see all the stack traces are marked with "?", making them unreliable. Is that anything to be concerned about? I see that one user has reported that the patches fixed the problem for him but I fear that this might be a coincidence or that the patches close a race of some description. Specifically, I'm trying to identify if there is a situation where kswapd() constantly loops checking watermarks and never calling cond_resched(). This could conceivably happen if kswapd() is always checking sleeping_prematurely() at a higher order whereas balance_pgdat() always checks the watermarks at a lower order. I'm not seeing how this could happen in 2.6.35.6 though. If Fedora doesn't have special changes, it might mean that these patches do need to go into -stable as the cost of zone_page_state_snapshot() is far higher on larger machines than previously reported. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 40+ messages in thread
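Mel's hypothesis is easier to follow as pseudocode. The sketch below is purely illustrative of the suspected failure shape; the function names are real, but the loop body is invented and is not the 2.6.35 source:

----------------------------------------------------------------------
/*
 * Illustrative pseudocode only: if the sleep-side check and the
 * balancing work disagree about which order's watermarks matter,
 * kswapd neither sleeps nor reaches a cond_resched().
 */
for ( ; ; ) {
	/* a high-order wakeup was requested */
	order = pgdat->kswapd_max_order;

	/* balancing succeeds, but only for lower orders */
	balance_pgdat(pgdat, order);

	/* the high-order watermark check still fails ... */
	if (sleeping_prematurely(pgdat, order, remaining))
		continue;	/* ... so loop again without yielding */

	schedule();		/* never reached in this scenario */
}
----------------------------------------------------------------------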
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 15:08 ` Mel Gorman @ 2010-11-29 15:22 ` Kyle McMartin 2010-11-29 15:26 ` Kyle McMartin 2010-11-29 15:58 ` Mel Gorman 0 siblings, 2 replies; 40+ messages in thread From: Kyle McMartin @ 2010-11-29 15:22 UTC (permalink / raw) To: Mel Gorman Cc: Kyle McMartin, Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 03:08:24PM +0000, Mel Gorman wrote: > Ouch! I have been unable to create an exact copy of your kernel source as > I'm not running Fedora. From a partial conversion of a source RPM, I saw no > changes related to mm/vmscan.c. Is this accurate? I'm trying to establish > if this is a mainline bug as well. > Sorry, if you extract the source rpm you should get the patched sources... Aside from a few patches to mm/mmap for execshield, mm/* is otherwise untouched from the latest stable 2.6.35 kernels. If you git clone git://pkgs.fedoraproject.org/kernel and check out the origin/f14/master branch, it has all the patches we apply (based on the 'ApplyPatch' lines in kernel.spec) > Second, I see all the stack traces are marked with "?", making them > unreliable. Is that anything to be concerned about? > Hrm, I don't think it is; I think the ones with '?' are just artifacts because we don't have a proper unwinder. Oh! Thanks! I just found a bug in our configs... We don't have CONFIG_FRAME_POINTER set because CONFIG_DEBUG_KERNEL got unset in the 'production' configs... I'll fix that up. > I see that one user has reported that the patches fixed the problem for him > but I fear that this might be a coincidence or that the patches close a > race of some description. Specifically, I'm trying to identify if there is > a situation where kswapd() constantly loops checking watermarks and never > calling cond_resched(). This could conceivably happen if kswapd() is always > checking sleeping_prematurely() at a higher order whereas balance_pgdat() > always checks the watermarks at a lower order. I'm not seeing how this > could happen in 2.6.35.6 though. If Fedora doesn't have special changes, > it might mean that these patches do need to go into -stable as the > cost of zone_page_state_snapshot() is far higher on larger machines than > previously reported. > Yeah, I am a bit surprised as well. Luke seems to have quite a large machine... I haven't seen any kswapd lockups there on my 18G machine using the same kernel. :< (Possibly it's just not stressed enough though.) Thanks for looking into this! Kyle ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 15:22 ` Kyle McMartin @ 2010-11-29 15:26 ` Kyle McMartin 2010-11-29 15:58 ` Mel Gorman 1 sibling, 0 replies; 40+ messages in thread From: Kyle McMartin @ 2010-11-29 15:26 UTC (permalink / raw) To: Kyle McMartin Cc: Mel Gorman, Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 10:22:30AM -0500, Kyle McMartin wrote: > Hrm, I don't think it is, I think the ones with '?' are just artifacts > because we don't have a proper unwinder. Oh! Thanks! I just found a bug > in our configs... We don't have CONFIG_FRAME_POINTER set because > CONFIG_DEBUG_KERNEL got unset in the 'production' configs... I'll fix > that up. > Oops, no, misdiagnosed that by accidentally grepping for FRAME_POINTERS instead of FRAME_POINTER... it's set in our configs. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 15:22 ` Kyle McMartin 2010-11-29 15:26 ` Kyle McMartin @ 2010-11-29 15:58 ` Mel Gorman 2010-12-23 22:18 ` David Rientjes 1 sibling, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-11-29 15:58 UTC (permalink / raw) To: Kyle McMartin Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Mon, Nov 29, 2010 at 10:22:30AM -0500, Kyle McMartin wrote: > On Mon, Nov 29, 2010 at 03:08:24PM +0000, Mel Gorman wrote: > > Ouch! I have been unable to create an exact copy of your kernel source as > > I'm not running Fedora. From a partial conversion of a source RPM, I saw no > > changes related to mm/vmscan.c. Is this accurate? I'm trying to establish > > if this is a mainline bug as well. > > > > Sorry, if you extract the source rpm you should get the patched > sources... Aside from a few patches to mm/mmap for execshield, mm/* is > otherwise untouched from the latest stable 2.6.35 kernels. > Perfect, that correlates with what I saw, so this is probably a mainline issue. > If you git clone git://pkgs.fedoraproject.org/kernel and check out the > origin/f14/master branch, it has all the patches we apply (based on the > 'ApplyPatch' lines in kernel.spec) > > > Second, I see all the stack traces are marked with "?", making them > > unreliable. Is that anything to be concerned about? > > > > Hrm, I don't think it is; I think the ones with '?' are just artifacts > because we don't have a proper unwinder. Oh! Thanks! I just found a bug > in our configs... We don't have CONFIG_FRAME_POINTER set because > CONFIG_DEBUG_KERNEL got unset in the 'production' configs... I'll fix > that up. > Ordinarily I'd expect it to be from the lack of an unwinder, but if FRAME_POINTER is there (which you say in a follow-up mail it is), it can be a bit of a concern. There is some real weirdness as it is. Take one of Luke's examples, where it appears to be locked up in [ 5015.448127] Pid: 185, comm: kswapd1 Tainted: P 2.6.35.6-48.fc14.x86_64 #1 X8DA3/X8DA3 [ 5015.448127] RIP: 0010:[<ffffffff81469130>] [<ffffffff81469130>] _raw_spin_unlock_irqrestore+0x18/0x19 I am at a loss to explain under what circumstances that can even happen! Is there any possibility RIP is being translated to the wrong symbol, possibly via a userspace decoder of the oops or similar? Is there any possibility the stack is being corrupted if the swap subsystem is on a complicated software stack? > > I see that one user has reported that the patches fixed the problem for him > > but I fear that this might be a coincidence or that the patches close a > > race of some description. Specifically, I'm trying to identify if there is > > a situation where kswapd() constantly loops checking watermarks and never > > calling cond_resched(). This could conceivably happen if kswapd() is always > > checking sleeping_prematurely() at a higher order whereas balance_pgdat() > > always checks the watermarks at a lower order. I'm not seeing how this > > could happen in 2.6.35.6 though. If Fedora doesn't have special changes, > > it might mean that these patches do need to go into -stable as the > > cost of zone_page_state_snapshot() is far higher on larger machines than > > previously reported. > > > > Yeah, I am a bit surprised as well. Luke seems to have quite a large > machine... I haven't seen any kswapd lockups there on my 18G machine > using the same kernel. 
:< (Possibly it's just not stressed enough > though.) > He reports his machine as 24-way, but with his model of CPU it could still be 2-socket, which I ordinarily would not have expected to suffer so badly from zone_page_state_snapshot(). I would have expected the patches being thrown about in the thread "Free memory never fully used, swapping" to be more relevant to kswapd failing to go to sleep :/ Andrew, this patch was a performance fix, but is a report saying that it fixes a functional regression in Fedora enough to push a patch towards stable even though an explanation as to *why* it fixes the problem is missing? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 40+ messages in thread
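For readers following the cost argument: zone_page_state_snapshot() in kernels of this era is roughly the following (paraphrased from memory rather than copied from a specific tree). The for_each_online_cpu() walk touches one per-cpu cache line per CPU on every call, which is what bounces so badly on many-socket machines:

----------------------------------------------------------------------
/*
 * Determine an accurate value of a counter by folding in every
 * CPU's pending delta: O(nr_cpus) cache-line reads per call.
 */
static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
	return x;
}
----------------------------------------------------------------------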
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-11-29 15:58 ` Mel Gorman @ 2010-12-23 22:18 ` David Rientjes 2010-12-23 22:35 ` Andrew Morton 0 siblings, 1 reply; 40+ messages in thread From: David Rientjes @ 2010-12-23 22:18 UTC (permalink / raw) To: Mel Gorman, Greg Kroah-Hartman, Andrew Morton Cc: Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Mon, 29 Nov 2010, Mel Gorman wrote: > Andrew, this patch was a performance fix but is a report saying that it > fixes a functional regression in Fedora enough to push a patch torwards > stable even though an explanation as to *why* it fixes the problem is missing? > We had to pull aa454840 "mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 internally because tests showed that it would cause the machine to stall as the result of heavy kswapd activity. I merged it back with this fix as it is pending in the -mm tree and it solves the issue we were seeing, so I definitely think this should be pushed to -stable (and I would seriously consider it for 2.6.37 inclusion even at this late date). ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-12-23 22:18 ` David Rientjes @ 2010-12-23 22:35 ` Andrew Morton 2010-12-23 23:00 ` Kyle McMartin 2010-12-23 23:07 ` David Rientjes 0 siblings, 2 replies; 40+ messages in thread From: Andrew Morton @ 2010-12-23 22:35 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Greg Kroah-Hartman, Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Thu, 23 Dec 2010 14:18:38 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Mon, 29 Nov 2010, Mel Gorman wrote: > > > Andrew, this patch was a performance fix but is a report saying that it > > fixes a functional regression in Fedora enough to push a patch torwards > > stable even though an explanation as to *why* it fixes the problem is missing? > > > > We had to pull aa454840 "mm: page allocator: calculate a better estimate > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 > internally because tests showed that it would cause the machine to stall > as the result of heavy kswapd activity. I merged it back with this fix as > it is pending in the -mm tree and it solves the issue we were seeing, so I > definitely think this should be pushed to -stable (and I would seriously > consider it for 2.6.37 inclusion even at this late date). How's about I send mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x? That way it'll get a bit of 2.6.38-rc testing before being merged into 2.6.37.x. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-12-23 22:35 ` Andrew Morton @ 2010-12-23 23:00 ` Kyle McMartin 2010-12-23 23:07 ` David Rientjes 1 sibling, 0 replies; 40+ messages in thread From: Kyle McMartin @ 2010-12-23 23:00 UTC (permalink / raw) To: Andrew Morton Cc: David Rientjes, Mel Gorman, Greg Kroah-Hartman, Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Thu, Dec 23, 2010 at 02:35:21PM -0800, Andrew Morton wrote: > > > Andrew, this patch was a performance fix but is a report saying that it > > > fixes a functional regression in Fedora enough to push a patch torwards > > > stable even though an explanation as to *why* it fixes the problem is missing? > > We had to pull aa454840 "mm: page allocator: calculate a better estimate > > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 > > internally because tests showed that it would cause the machine to stall > > as the result of heavy kswapd activity. I merged it back with this fix as > > it is pending in the -mm tree and it solves the issue we were seeing, so I > > definitely think this should be pushed to -stable (and I would seriously > > consider it for 2.6.37 inclusion even at this late date). > > How's about I send > mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch > in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x? > That way it'll get a bit of 2.6.38-rc testing before being merged into > 2.6.37.x. > That sounds fine to me. (Thanks very much for the update, David!) I don't mind carrying a few extra patches here and there in Fedora to get them some exposure if they're low risk... I've been carrying Mel's patches for a month or so now and it hasn't turned up any obvious problems in testing. regards, Kyle ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-12-23 22:35 ` Andrew Morton 2010-12-23 23:00 ` Kyle McMartin @ 2010-12-23 23:07 ` David Rientjes 2010-12-23 23:17 ` Andrew Morton 1 sibling, 1 reply; 40+ messages in thread From: David Rientjes @ 2010-12-23 23:07 UTC (permalink / raw) To: Andrew Morton Cc: Mel Gorman, Greg Kroah-Hartman, Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Thu, 23 Dec 2010, Andrew Morton wrote: > > We had to pull aa454840 "mm: page allocator: calculate a better estimate > > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 > > internally because tests showed that it would cause the machine to stall > > as the result of heavy kswapd activity. I merged it back with this fix as > > it is pending in the -mm tree and it solves the issue we were seeing, so I > > definitely think this should be pushed to -stable (and I would seriously > > consider it for 2.6.37 inclusion even at this late date). > > How's about I send > mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch > in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x? > That way it'll get a bit of 2.6.38-rc testing before being merged into > 2.6.37.x. > I don't think anyone would be able to answer that judgment call other than you or Linus; it's a trade-off on whether 2.6.37 should be released with the knowledge that it regresses just like 2.6.36 does (rendering both unusable on some of our machines out of the box) because we're late in the cycle. I personally think the testing is already sufficient since it's been sitting in -mm for two months, it's been suggested as stable material by a couple different parties, it was a prerequisite for the transparent hugepage series, and we've tested and merged it as fixing the regression in 2.6.36 (as Fedora has, as far as I know). We've already merged the fix internally, though, so it's not for selfish reasons :) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-12-23 23:07 ` David Rientjes @ 2010-12-23 23:17 ` Andrew Morton 0 siblings, 0 replies; 40+ messages in thread From: Andrew Morton @ 2010-12-23 23:17 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Greg Kroah-Hartman, Kyle McMartin, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, KAMEZAWA Hiroyuki, linux-kernel, linux-mm On Thu, 23 Dec 2010 15:07:02 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Thu, 23 Dec 2010, Andrew Morton wrote: > > > > We had to pull aa454840 "mm: page allocator: calculate a better estimate > > > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 > > > internally because tests showed that it would cause the machine to stall > > > as the result of heavy kswapd activity. I merged it back with this fix as > > > it is pending in the -mm tree and it solves the issue we were seeing, so I > > > definitely think this should be pushed to -stable (and I would seriously > > > consider it for 2.6.37 inclusion even at this late date). > > > > How's about I send > > mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch > > in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x? > > That way it'll get a bit of 2.6.38-rc testing before being merged into > > 2.6.37.x. > > > > I don't think anyone would be able to answer that judgment call other than > you or Linus, it's a trade-off on whether 2.6.37 should be released with > the knowledge that it regresses just like 2.6.36 does (rendering both > unusable on some of our machines out of the box) because we're late in the > cycle. > > I personally think the testing is already sufficient since it's been > sitting in -mm for two months, it's been suggested as stable material by a > couple different parties, it was a prerequisite for the transparent > hugepage series, and we've tested and merged it as fixing the regression > in 2.6.36 (as Fedora has, as far as I know). We've already merged the fix > internally, though, so it's not for selfish reasons :) Wibble, wobble. It's good that the patch has been used in RH kernels. otoh, the patch is really quite big and the problem was present in 2.6.36 without a lot of complaints and we're very late in -rc and not many people will be testing over xmas/newyear, and it would be most sad to put badness into mainline at this time. So I'm still inclined to go with (discretion > valour). ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds 2010-10-27 8:47 [PATCH 0/2] Reduce the amount of time spent in watermark-related functions Mel Gorman 2010-10-27 8:47 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman @ 2010-10-27 8:47 ` Mel Gorman 2010-10-27 20:13 ` Christoph Lameter ` (2 more replies) 1 sibling, 3 replies; 40+ messages in thread From: Mel Gorman @ 2010-10-27 8:47 UTC (permalink / raw) To: Andrew Morton Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid errors due to counter drift. The functions duplicate some code so this patch replaces them with a single set_pgdat_percpu_threshold() that takes a callback function to calculate the desired threshold as a parameter. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/linux/vmstat.h | 10 ++++++---- mm/vmscan.c | 6 ++++-- mm/vmstat.c | 30 ++++++------------------------ 3 files changed, 16 insertions(+), 30 deletions(-) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index e4cc21c..833e676 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -254,8 +254,11 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); void refresh_cpu_vm_stats(int); -void reduce_pgdat_percpu_threshold(pg_data_t *pgdat); -void restore_pgdat_percpu_threshold(pg_data_t *pgdat); + +int calculate_pressure_threshold(struct zone *zone); +int calculate_normal_threshold(struct zone *zone); +void set_pgdat_percpu_threshold(pg_data_t *pgdat, + int (*calculate_pressure)(struct zone *)); #else /* CONFIG_SMP */ /* @@ -300,8 +303,7 @@ static inline void __dec_zone_page_state(struct page *page, #define dec_zone_page_state __dec_zone_page_state #define mod_zone_page_state __mod_zone_page_state -static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { } -static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { } +#define set_pgdat_percpu_threshold(pgdat, callback) { } static inline void refresh_cpu_vm_stats(int cpu) { } #endif diff --git a/mm/vmscan.c b/mm/vmscan.c index 3e71cb1..7966110 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2378,9 +2378,11 @@ static int kswapd(void *p) */ if (!sleeping_prematurely(pgdat, order, remaining)) { trace_mm_vmscan_kswapd_sleep(pgdat->node_id); - restore_pgdat_percpu_threshold(pgdat); + set_pgdat_percpu_threshold(pgdat, + calculate_normal_threshold); schedule(); - reduce_pgdat_percpu_threshold(pgdat); + set_pgdat_percpu_threshold(pgdat, + calculate_pressure_threshold); } else { if (remaining) count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); diff --git a/mm/vmstat.c b/mm/vmstat.c index cafcc2d..73661e8 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -81,13 +81,13 @@ EXPORT_SYMBOL(vm_stat); #ifdef CONFIG_SMP -static int calculate_pressure_threshold(struct zone *zone) +int calculate_pressure_threshold(struct zone *zone) { return max(1, (int)((high_wmark_pages(zone) - low_wmark_pages(zone) / num_online_cpus()))); } -static int calculate_threshold(struct zone *zone) +int calculate_normal_threshold(struct zone *zone) { int threshold; int mem; /* memory in 128 MB units */ @@ -146,7 +146,7 @@ static void refresh_zone_stat_thresholds(void) for_each_populated_zone(zone) { unsigned long max_drift, tolerate_drift; 
- threshold = calculate_threshold(zone); + threshold = calculate_normal_threshold(zone); for_each_online_cpu(cpu) per_cpu_ptr(zone->pageset, cpu)->stat_threshold @@ -165,7 +165,8 @@ static void refresh_zone_stat_thresholds(void) } } -void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) +void set_pgdat_percpu_threshold(pg_data_t *pgdat, + int (*calculate_pressure)(struct zone *)) { struct zone *zone; int cpu; @@ -177,26 +178,7 @@ void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) if (!zone->percpu_drift_mark) continue; - threshold = calculate_pressure_threshold(zone); - for_each_online_cpu(cpu) - per_cpu_ptr(zone->pageset, cpu)->stat_threshold - = threshold; - } -} - -void restore_pgdat_percpu_threshold(pg_data_t *pgdat) -{ - struct zone *zone; - int cpu; - int threshold; - int i; - - for (i = 0; i < pgdat->nr_zones; i++) { - zone = &pgdat->node_zones[i]; - if (!zone->percpu_drift_mark) - continue; - - threshold = calculate_threshold(zone); + threshold = calculate_pressure(zone); for_each_online_cpu(cpu) per_cpu_ptr(zone->pageset, cpu)->stat_threshold = threshold; -- 1.7.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds 2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman @ 2010-10-27 20:13 ` Christoph Lameter 2010-10-28 1:10 ` KAMEZAWA Hiroyuki 2010-11-01 7:06 ` KOSAKI Motohiro 2 siblings, 0 replies; 40+ messages in thread From: Christoph Lameter @ 2010-10-27 20:13 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM Reviewed-by: Christoph Lameter <cl@linux.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds 2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman 2010-10-27 20:13 ` Christoph Lameter @ 2010-10-28 1:10 ` KAMEZAWA Hiroyuki 2010-11-01 7:06 ` KOSAKI Motohiro 2 siblings, 0 replies; 40+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-10-28 1:10 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, LKML, Linux-MM On Wed, 27 Oct 2010 09:47:36 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() > exist to adjust the per-cpu vmstat thresholds while kswapd is awake to > avoid errors due to counter drift. The functions duplicate some code so > this patch replaces them with a single set_pgdat_percpu_threshold() that > takes a callback function to calculate the desired threshold as a > parameter. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds 2010-10-27 8:47 ` [PATCH 2/2] mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds Mel Gorman 2010-10-27 20:13 ` Christoph Lameter 2010-10-28 1:10 ` KAMEZAWA Hiroyuki @ 2010-11-01 7:06 ` KOSAKI Motohiro 2 siblings, 0 replies; 40+ messages in thread From: KOSAKI Motohiro @ 2010-11-01 7:06 UTC (permalink / raw) To: Mel Gorman Cc: kosaki.motohiro, Andrew Morton, Shaohua Li, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM > reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() > exist to adjust the per-cpu vmstat thresholds while kswapd is awake to > avoid errors due to counter drift. The functions duplicate some code so > this patch replaces them with a single set_pgdat_percpu_threshold() that > takes a callback function to calculate the desired threshold as a > parameter. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 0/2] Reduce the amount of time spent in watermark-related functions V4 @ 2010-10-28 15:13 Mel Gorman 2010-10-28 15:13 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman 0 siblings, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-10-28 15:13 UTC (permalink / raw) To: Andrew Morton Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman Changelog since V3 o Added Reviewed-bys o Added comment on why pressure_threshold is what it is o Make sure pressure threshold does not get over 125 o Call get_online_cpus and put_online_cpus as appropriate The following two patches are in response to a bug report by Shaohua Li where the amount of time spent in zone_nr_free_pages() is unacceptable for large machines. All the background is in the first patch's leader. The second patch replaces two setter functions with one function that takes a callback function as a parameter. include/linux/mmzone.h | 10 ++------ include/linux/vmstat.h | 7 ++++++ mm/mmzone.c | 21 ------------------ mm/page_alloc.c | 35 ++++++++++++++++++++++++------- mm/vmscan.c | 25 +++++++++++++-------- mm/vmstat.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-- 6 files changed, 103 insertions(+), 49 deletions(-) Mel Gorman (2): mm: page allocator: Adjust the per-cpu counter threshold when memory is low mm: vmstat: Use a single setter function and callback for adjusting percpu thresholds include/linux/mmzone.h | 10 ++------ include/linux/vmstat.h | 7 ++++++ mm/mmzone.c | 21 ------------------- mm/page_alloc.c | 35 ++++++++++++++++++++++++------- mm/vmscan.c | 25 +++++++++++++--------- mm/vmstat.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-- 6 files changed, 101 insertions(+), 49 deletions(-) ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-28 15:13 [PATCH 0/2] Reduce the amount of time spent in watermark-related functions V4 Mel Gorman @ 2010-10-28 15:13 ` Mel Gorman 2010-10-28 22:04 ` Andrew Morton 0 siblings, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-10-28 15:13 UTC (permalink / raw) To: Andrew Morton Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM, Mel Gorman Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when memory is low] noted that watermarks were based on the vmstat NR_FREE_PAGES. To avoid synchronization overhead, these counters are maintained on a per-cpu basis and drained both periodically and when the per-cpu delta is above a threshold. On large CPU systems, the difference between the estimate and real value of NR_FREE_PAGES can be very high. The system can get into a case where pages are allocated far below the min watermark potentially causing livelock issues. The commit solved the problem by taking a better reading of NR_FREE_PAGES when memory was low. Unfortunately, as reported by Shaohua Li, this accurate reading can consume a large amount of CPU time on systems with many sockets due to cache line bouncing. This patch takes a different approach. For large machines where counter drift might be unsafe and while kswapd is awake, the per-cpu thresholds for the target pgdat are reduced to limit the level of drift to what should be a safe level. This incurs a performance penalty under heavy memory pressure by a factor that depends on the workload and the machine but the machine should function correctly without accidentally exhausting all memory on a node. There is an additional cost when kswapd wakes and sleeps but the event is not expected to be frequent - in Shaohua's test case, there was at least one recorded sleep and wake event. To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is introduced that takes a more accurate reading of NR_FREE_PAGES when called from wakeup_kswapd, when deciding whether it is really safe to go back to sleep in sleeping_prematurely() and when deciding if a zone is really balanced or not in balance_pgdat(). We are still using an expensive function but limiting how often it is called. When the test case is reproduced, the time spent in the watermark functions is reduced. The following report is on the percentage of time cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), zone_page_state(). 
vanilla 11.6615% disable-threshold 0.2584% Reported-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> --- include/linux/mmzone.h | 10 ++----- include/linux/vmstat.h | 5 +++ mm/mmzone.c | 21 --------------- mm/page_alloc.c | 35 +++++++++++++++++++----- mm/vmscan.c | 23 +++++++++------- mm/vmstat.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++- 6 files changed, 115 insertions(+), 47 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 3984c4e..8d789d7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -448,12 +448,6 @@ static inline int zone_is_oom_locked(const struct zone *zone) return test_bit(ZONE_OOM_LOCKED, &zone->flags); } -#ifdef CONFIG_SMP -unsigned long zone_nr_free_pages(struct zone *zone); -#else -#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES) -#endif /* CONFIG_SMP */ - /* * The "priority" of VM scanning is how much of the queues we will scan in one * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the @@ -651,7 +645,9 @@ typedef struct pglist_data { extern struct mutex zonelists_mutex; void build_all_zonelists(void *data); void wakeup_kswapd(struct zone *zone, int order); -int zone_watermark_ok(struct zone *z, int order, unsigned long mark, +bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, + int classzone_idx, int alloc_flags); +bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, int classzone_idx, int alloc_flags); enum memmap_context { MEMMAP_EARLY, diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index eaaea37..e4cc21c 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item); void refresh_cpu_vm_stats(int); +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat); +void restore_pgdat_percpu_threshold(pg_data_t *pgdat); #else /* CONFIG_SMP */ /* @@ -298,6 +300,9 @@ static inline void __dec_zone_page_state(struct page *page, #define dec_zone_page_state __dec_zone_page_state #define mod_zone_page_state __mod_zone_page_state +static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { } +static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { } + static inline void refresh_cpu_vm_stats(int cpu) { } #endif diff --git a/mm/mmzone.c b/mm/mmzone.c index e35bfb8..f5b7d17 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn, return 1; } #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */ - -#ifdef CONFIG_SMP -/* Called when a more accurate view of NR_FREE_PAGES is needed */ -unsigned long zone_nr_free_pages(struct zone *zone) -{ - unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES); - - /* - * While kswapd is awake, it is considered the zone is under some - * memory pressure. Under pressure, there is a risk that - * per-cpu-counter-drift will allow the min watermark to be breached - * potentially causing a live-lock. 
While kswapd is awake and - * free pages are low, get a better estimate for free pages - */ - if (nr_free_pages < zone->percpu_drift_mark && - !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) - return zone_page_state_snapshot(zone, NR_FREE_PAGES); - - return nr_free_pages; -} -#endif /* CONFIG_SMP */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f12ad18..0286150 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1454,24 +1454,24 @@ static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) #endif /* CONFIG_FAIL_PAGE_ALLOC */ /* - * Return 1 if free pages are above 'mark'. This takes into account the order + * Return true if free pages are above 'mark'. This takes into account the order * of the allocation. */ -int zone_watermark_ok(struct zone *z, int order, unsigned long mark, - int classzone_idx, int alloc_flags) +static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark, + int classzone_idx, int alloc_flags, long free_pages) { /* free_pages may go negative - that's OK */ long min = mark; - long free_pages = zone_nr_free_pages(z) - (1 << order) + 1; int o; + free_pages -= (1 << order) - 1; if (alloc_flags & ALLOC_HIGH) min -= min / 2; if (alloc_flags & ALLOC_HARDER) min -= min / 4; if (free_pages <= min + z->lowmem_reserve[classzone_idx]) - return 0; + return false; for (o = 0; o < order; o++) { /* At the next order, this order's pages become unavailable */ free_pages -= z->free_area[o].nr_free << o; @@ -1480,9 +1480,28 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark, min >>= 1; if (free_pages <= min) - return 0; + return false; } - return 1; + return true; +} + +bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, + int classzone_idx, int alloc_flags) +{ + return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, + zone_page_state(z, NR_FREE_PAGES)); +} + +bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark, + int classzone_idx, int alloc_flags) +{ + long free_pages = zone_page_state(z, NR_FREE_PAGES); + + if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) + free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); + + return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, + free_pages); } #ifdef CONFIG_NUMA @@ -2436,7 +2455,7 @@ void show_free_areas(void) " all_unreclaimable? %s" "\n", zone->name, - K(zone_nr_free_pages(zone)), + K(zone_page_state(zone, NR_FREE_PAGES)), K(min_wmark_pages(zone)), K(low_wmark_pages(zone)), K(high_wmark_pages(zone)), diff --git a/mm/vmscan.c b/mm/vmscan.c index c5dfabf..3e71cb1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2082,7 +2082,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining) if (zone->all_unreclaimable) continue; - if (!zone_watermark_ok(zone, order, high_wmark_pages(zone), + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) return 1; } @@ -2169,7 +2169,7 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, &sc, priority, 0); - if (!zone_watermark_ok(zone, order, + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) { end_zone = i; break; @@ -2215,7 +2215,7 @@ loop_again: * We put equal pressure on every zone, unless one * zone has way too many pages free already. 
*/ - if (!zone_watermark_ok(zone, order, + if (!zone_watermark_ok_safe(zone, order, 8*high_wmark_pages(zone), end_zone, 0)) shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; @@ -2236,7 +2236,7 @@ loop_again: total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; - if (!zone_watermark_ok(zone, order, + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), end_zone, 0)) { all_zones_ok = 0; /* @@ -2244,7 +2244,7 @@ loop_again: * means that we have a GFP_ATOMIC allocation * failure risk. Hurry up! */ - if (!zone_watermark_ok(zone, order, + if (!zone_watermark_ok_safe(zone, order, min_wmark_pages(zone), end_zone, 0)) has_under_min_watermark_zone = 1; } @@ -2378,7 +2378,9 @@ static int kswapd(void *p) */ if (!sleeping_prematurely(pgdat, order, remaining)) { trace_mm_vmscan_kswapd_sleep(pgdat->node_id); + restore_pgdat_percpu_threshold(pgdat); schedule(); + reduce_pgdat_percpu_threshold(pgdat); } else { if (remaining) count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); @@ -2417,16 +2419,17 @@ void wakeup_kswapd(struct zone *zone, int order) if (!populated_zone(zone)) return; - pgdat = zone->zone_pgdat; - if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) + if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) return; + pgdat = zone->zone_pgdat; if (pgdat->kswapd_max_order < order) pgdat->kswapd_max_order = order; - trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); - if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) - return; if (!waitqueue_active(&pgdat->kswapd_wait)) return; + if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0)) + return; + + trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); wake_up_interruptible(&pgdat->kswapd_wait); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 355a9e6..4d7faeb 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -81,6 +81,30 @@ EXPORT_SYMBOL(vm_stat); #ifdef CONFIG_SMP +static int calculate_pressure_threshold(struct zone *zone) +{ + int threshold; + int watermark_distance; + + /* + * As vmstats are not up to date, there is drift between the estimated + * and real values. For high thresholds and a high number of CPUs, it + * is possible for the min watermark to be breached while the estimated + * value looks fine. 
The pressure threshold is a reduced value such + * that even the maximum amount of drift will not accidentally breach + * the min watermark + */ + watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone); + threshold = max(1, (int)(watermark_distance / num_online_cpus())); + + /* + * Maximum threshold is 125 + */ + threshold = min(125, threshold); + + return threshold; +} + static int calculate_threshold(struct zone *zone) { int threshold; @@ -159,6 +183,48 @@ static void refresh_zone_stat_thresholds(void) } } +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) +{ + struct zone *zone; + int cpu; + int threshold; + int i; + + get_online_cpus(); + for (i = 0; i < pgdat->nr_zones; i++) { + zone = &pgdat->node_zones[i]; + if (!zone->percpu_drift_mark) + continue; + + threshold = calculate_pressure_threshold(zone); + for_each_online_cpu(cpu) + per_cpu_ptr(zone->pageset, cpu)->stat_threshold + = threshold; + } + put_online_cpus(); +} + +void restore_pgdat_percpu_threshold(pg_data_t *pgdat) +{ + struct zone *zone; + int cpu; + int threshold; + int i; + + get_online_cpus(); + for (i = 0; i < pgdat->nr_zones; i++) { + zone = &pgdat->node_zones[i]; + if (!zone->percpu_drift_mark) + continue; + + threshold = calculate_threshold(zone); + for_each_online_cpu(cpu) + per_cpu_ptr(zone->pageset, cpu)->stat_threshold + = threshold; + } + put_online_cpus(); +} + /* * For use when we know that interrupts are disabled. */ @@ -826,7 +892,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, "\n scanned %lu" "\n spanned %lu" "\n present %lu", - zone_nr_free_pages(zone), + zone_page_state(zone, NR_FREE_PAGES), min_wmark_pages(zone), low_wmark_pages(zone), high_wmark_pages(zone), -- 1.7.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
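To make the pressure-threshold arithmetic above concrete (illustrative numbers, not taken from the thread): on a zone where low_wmark_pages(zone) - min_wmark_pages(zone) is 4096 pages with 64 CPUs online, calculate_pressure_threshold() returns max(1, 4096/64) = 64, so even if every online CPU holds its maximum pending delta, the total drift cannot eat through the low-to-min watermark gap. With the same gap and only 8 CPUs online, the division would give 512, which the min(125, ...) clamp caps at 125.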
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-28 15:13 ` [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low Mel Gorman @ 2010-10-28 22:04 ` Andrew Morton 2010-10-29 10:12 ` Mel Gorman 2010-10-29 14:58 ` Christoph Lameter 0 siblings, 2 replies; 40+ messages in thread From: Andrew Morton @ 2010-10-28 22:04 UTC (permalink / raw) To: Mel Gorman Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Thu, 28 Oct 2010 16:13:35 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when > memory is low] noted that watermarks were based on the vmstat > NR_FREE_PAGES. To avoid synchronization overhead, these counters are > maintained on a per-cpu basis and drained both periodically and when the > per-cpu delta is above a threshold. On large CPU systems, the difference > between the estimate and real value of NR_FREE_PAGES can be very high. > The system can get into a case where pages are allocated far below the > min watermark potentially causing livelock issues. The commit solved the > problem by taking a better reading of NR_FREE_PAGES when memory was low. > > Unfortunately, as reported by Shaohua Li, this accurate reading can consume > a large amount of CPU time on systems with many sockets due to cache > line bouncing. This patch takes a different approach. For large machines > where counter drift might be unsafe and while kswapd is awake, the per-cpu > thresholds for the target pgdat are reduced to limit the level of drift > to what should be a safe level. This incurs a performance penalty under heavy > memory pressure by a factor that depends on the workload and the machine but > the machine should function correctly without accidentally exhausting all > memory on a node. There is an additional cost when kswapd wakes and sleeps > but the event is not expected to be frequent - in Shaohua's test case, > there was at least one recorded sleep and wake event. > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > is introduced that takes a more accurate reading of NR_FREE_PAGES when > called from wakeup_kswapd, when deciding whether it is really safe to go > back to sleep in sleeping_prematurely() and when deciding if a zone is > really balanced or not in balance_pgdat(). We are still using an expensive > function but limiting how often it is called. Here I go again. I have a feeling that I already said this, but I can't find versions 2 or 3 in the archives.. Did you evaluate using plain old percpu_counters for this? They won't solve the performance problem as they're basically the same thing as these open-coded counters. But they'd reduce the amount of noise and custom-coded boilerplate in mm/. > When the test case is reproduced, the time spent in the watermark functions > is reduced. The following report is on the percentage of time > cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), > __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), > zone_page_state(). So how did you decide which callsites needed to use the fast-but-inaccurate zone_watermark_ok() and which needed to use the slow-but-more-accurate zone_watermark_ok_safe()? (Those functions need comments explaining the difference btw) I have a feeling this problem will bite us again perhaps due to those other callsites, but we haven't found the workload yet. 
I don't understand why restore/reduce_pgdat_percpu_threshold() were called around that particular sleep in kswapd and nowhere else. > vanilla 11.6615% > disable-threshold 0.2584% Wow. That's 12% of all CPUs? How many CPUs and what workload? > > ... > > if (!sleeping_prematurely(pgdat, order, remaining)) { > trace_mm_vmscan_kswapd_sleep(pgdat->node_id); > + restore_pgdat_percpu_threshold(pgdat); > schedule(); > + reduce_pgdat_percpu_threshold(pgdat); We could do with some code comments here explaining what's going on. > } else { > if (remaining) > count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); > > ... > > +static int calculate_pressure_threshold(struct zone *zone) > +{ > + int threshold; > + int watermark_distance; > + > + /* > + * As vmstats are not up to date, there is drift between the estimated > + * and real values. For high thresholds and a high number of CPUs, it > + * is possible for the min watermark to be breached while the estimated > + * value looks fine. The pressure threshold is a reduced value such > + * that even the maximum amount of drift will not accidentally breach > + * the min watermark > + */ > + watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone); > + threshold = max(1, (int)(watermark_distance / num_online_cpus())); > + > + /* > + * Maximum threshold is 125 Reasoning? > + */ > + threshold = min(125, threshold); > + > + return threshold; > +} > + > static int calculate_threshold(struct zone *zone) > { > int threshold; > > ... > > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + int i; > + > + get_online_cpus(); > + for (i = 0; i < pgdat->nr_zones; i++) { > + zone = &pgdat->node_zones[i]; > + if (!zone->percpu_drift_mark) > + continue; > + > + threshold = calculate_pressure_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > + } > + put_online_cpus(); > +} > + > +void restore_pgdat_percpu_threshold(pg_data_t *pgdat) > +{ > + struct zone *zone; > + int cpu; > + int threshold; > + int i; > + > + get_online_cpus(); > + for (i = 0; i < pgdat->nr_zones; i++) { > + zone = &pgdat->node_zones[i]; > + if (!zone->percpu_drift_mark) > + continue; > + > + threshold = calculate_threshold(zone); > + for_each_online_cpu(cpu) > + per_cpu_ptr(zone->pageset, cpu)->stat_threshold > + = threshold; > + } > + put_online_cpus(); > +} Given that ->stat_threshold is the same for each CPU, why store it for each CPU at all? Why not put it in the zone and eliminate the inner loop? ^ permalink raw reply [flat|nested] 40+ messages in thread
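For context on Andrew's percpu_counter question above: lib/percpu_counter.c already provides the estimate-versus-accurate split that the vmstat counters implement by hand. Below is a minimal usage sketch of the API of this era (illustration only; none of this code is in the patches, and NR_FREE_PAGES does not actually use percpu_counter):

----------------------------------------------------------------------
#include <linux/percpu_counter.h>

static struct percpu_counter nr_free;

static int __init example_init(void)
{
	/* second argument is the initial value in this era's API */
	return percpu_counter_init(&nr_free, 0);
}

static s64 example_use(int order)
{
	s64 approx, exact;

	percpu_counter_add(&nr_free, 1 << order);

	/* cheap read: global value only, ignores per-cpu deltas */
	approx = percpu_counter_read(&nr_free);

	/* accurate read: folds in every CPU's delta (cache bouncing) */
	exact = percpu_counter_sum(&nr_free);

	return exact - approx;	/* drift currently pending per-cpu */
}
----------------------------------------------------------------------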
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low 2010-10-28 22:04 ` Andrew Morton @ 2010-10-29 10:12 ` Mel Gorman 2010-10-29 19:40 ` Andrew Morton 1 sibling, 1 reply; 40+ messages in thread From: Mel Gorman @ 2010-10-29 10:12 UTC (permalink / raw) To: Andrew Morton Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes, KAMEZAWA Hiroyuki, LKML, Linux-MM On Thu, Oct 28, 2010 at 03:04:33PM -0700, Andrew Morton wrote: > On Thu, 28 Oct 2010 16:13:35 +0100 > Mel Gorman <mel@csn.ul.ie> wrote: > > > Commit [aa45484: calculate a better estimate of NR_FREE_PAGES when > > memory is low] noted that watermarks were based on the vmstat > > NR_FREE_PAGES. To avoid synchronization overhead, these counters are > > maintained on a per-cpu basis and drained both periodically and when the > > per-cpu delta is above a threshold. On large CPU systems, the difference > > between the estimate and real value of NR_FREE_PAGES can be very high. > > The system can get into a case where pages are allocated far below the > > min watermark potentially causing livelock issues. The commit solved the > > problem by taking a better reading of NR_FREE_PAGES when memory was low. > > > > Unfortunately, as reported by Shaohua Li, this accurate reading can consume > > a large amount of CPU time on systems with many sockets due to cache > > line bouncing. This patch takes a different approach. For large machines > > where counter drift might be unsafe and while kswapd is awake, the per-cpu > > thresholds for the target pgdat are reduced to limit the level of drift > > to what should be a safe level. This incurs a performance penalty under heavy > > memory pressure by a factor that depends on the workload and the machine but > > the machine should function correctly without accidentally exhausting all > > memory on a node. There is an additional cost when kswapd wakes and sleeps > > but the event is not expected to be frequent - in Shaohua's test case, > > there was at least one recorded sleep and wake event. > > > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok() > > is introduced that takes a more accurate reading of NR_FREE_PAGES when > > called from wakeup_kswapd, when deciding whether it is really safe to go > > back to sleep in sleeping_prematurely() and when deciding if a zone is > > really balanced or not in balance_pgdat(). We are still using an expensive > > function but limiting how often it is called. > > Here I go again. I have a feeling that I already said this, but I > can't find versions 2 or 3 in the archives.. > > Did you evaluate using plain old percpu_counters for this? They won't > solve the performance problem as they're basically the same thing as > these open-coded counters. But they'd reduce the amount of noise and > custom-coded boilerplate in mm/. > You did bring this up before. Here is the reference to the answer you got from Christoph at the time: http://lkml.org/lkml/2010/9/3/453 . It seemed like a reasonable answer. > > When the test case is reproduced, the time spent in the watermark functions > > is reduced. The following report is on the percentage of time > > cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), > > __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), > > zone_page_state(). 
>
> So how did you decide which callsites needed to use the
> fast-but-inaccurate zone_watermark_ok() and which needed to use the
> slow-but-more-accurate zone_watermark_ok_safe()? (Those functions need
> comments explaining the difference btw)
>

Selection was based on kswapd being woken up and staying awake:

1. When deciding if kswapd should wake (wakeup_kswapd()), we have failed
   the initial allocation attempt and we should be sure of the watermarks
   to decide if kswapd really should wake or not

2. Once kswapd is awake, it shouldn't go to sleep prematurely

While kswapd is awake, drift is less of a problem because thresholds are
reduced.

>
> I have a feeling this problem will bite us again perhaps due to those
> other callsites, but we haven't found the workload yet.
>
> I don't understand why restore/reduce_pgdat_percpu_threshold() were
> called around that particular sleep in kswapd and nowhere else.
>
> > vanilla 11.6615%
> > disable-threshold 0.2584%
>
> Wow. That's 12% of all CPUs? How many CPUs and what workload?
>

112 CPU threads, 14 sockets. Workload initialisation creates NR_CPU sparse
files that are 10*TOTAL_MEMORY/NR_CPU in size. Workload itself is NR_CPU
processes just reading their own file.

The critical thing is the number of sockets. For single-socket-8-thread
for example, vanilla was just 0.66% of time (although the patches did
bring it down to 0.11%).

> >
> > ...
> >
> > 	if (!sleeping_prematurely(pgdat, order, remaining)) {
> > 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > +		restore_pgdat_percpu_threshold(pgdat);
> > 		schedule();
> > +		reduce_pgdat_percpu_threshold(pgdat);
>
> We could do with some code comments here explaining what's going on.
>

Follow-on patch?

> > 	} else {
> > 		if (remaining)
> > 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> >
> > ...
> >
> > +static int calculate_pressure_threshold(struct zone *zone)
> > +{
> > +	int threshold;
> > +	int watermark_distance;
> > +
> > +	/*
> > +	 * As vmstats are not up to date, there is drift between the estimated
> > +	 * and real values. For high thresholds and a high number of CPUs, it
> > +	 * is possible for the min watermark to be breached while the estimated
> > +	 * value looks fine. The pressure threshold is a reduced value such
> > +	 * that even the maximum amount of drift will not accidentally breach
> > +	 * the min watermark
> > +	 */
> > +	watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
> > +	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> > +
> > +	/*
> > +	 * Maximum threshold is 125
>
> Reasoning?
>

To match the existing maximum, which I assume is due to the deltas being
stored in an s8.

> > +	 */
> > +	threshold = min(125, threshold);
> > +
> > +	return threshold;
> > +}
> > +
> > static int calculate_threshold(struct zone *zone)
> > {
> > 	int threshold;
> >
> > ...
> >
> > +void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +	int i;
> > +
> > +	get_online_cpus();
> > +	for (i = 0; i < pgdat->nr_zones; i++) {
> > +		zone = &pgdat->node_zones[i];
> > +		if (!zone->percpu_drift_mark)
> > +			continue;
> > +
> > +		threshold = calculate_pressure_threshold(zone);
> > +		for_each_online_cpu(cpu)
> > +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > +							= threshold;
> > +	}
> > +	put_online_cpus();
> > +}
> > +
> > +void restore_pgdat_percpu_threshold(pg_data_t *pgdat)
> > +{
> > +	struct zone *zone;
> > +	int cpu;
> > +	int threshold;
> > +	int i;
> > +
> > +	get_online_cpus();
> > +	for (i = 0; i < pgdat->nr_zones; i++) {
> > +		zone = &pgdat->node_zones[i];
> > +		if (!zone->percpu_drift_mark)
> > +			continue;
> > +
> > +		threshold = calculate_threshold(zone);
> > +		for_each_online_cpu(cpu)
> > +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > +							= threshold;
> > +	}
> > +	put_online_cpus();
> > +}
>
> Given that ->stat_threshold is the same for each CPU, why store it for
> each CPU at all? Why not put it in the zone and eliminate the inner
> loop?
>

I asked why we couldn't move the threshold to struct zone and Christoph
responded:

"If you move it then the cache footprint of the vm stat functions (which
need to access the threshold for each access!) will increase and the
performance will sink dramatically. I tried to avoid placing the threshold
there when I developed that approach but it always caused a dramatic
regression under heavy load."

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread
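To make the safety argument behind calculate_pressure_threshold() concrete,
here is a worked example. The watermark values are invented for illustration;
only the CPU count comes from the report above:

/*
 * Hypothetical zone on a 112-CPU machine:
 *
 *   low watermark = 10000 pages
 *   min watermark =  8000 pages
 *
 *   watermark_distance = 10000 - 8000 = 2000
 *   threshold = max(1, 2000 / 112) = 17    (integer division)
 *
 * Each CPU can then hide at most 17 uncommitted NR_FREE_PAGES updates
 * in its differential, so the worst-case drift is 112 * 17 = 1904
 * pages, below the 2000-page gap. An estimate that still reads as
 * above the low watermark therefore guarantees the true value is above
 * min, and the min watermark cannot be silently breached while the
 * pressure threshold is in force.
 */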
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 10:12 ` Mel Gorman
@ 2010-10-29 19:40   ` Andrew Morton
  2010-11-02 0:53      ` Shaohua Li
                       ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Andrew Morton @ 2010-10-29 19:40 UTC (permalink / raw)
To: Mel Gorman
Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 29 Oct 2010 11:12:11 +0100 Mel Gorman <mel@csn.ul.ie> wrote:

> On Thu, Oct 28, 2010 at 03:04:33PM -0700, Andrew Morton wrote:
> > On Thu, 28 Oct 2010 16:13:35 +0100
> >
> > I have a feeling this problem will bite us again perhaps due to those
> > other callsites, but we haven't found the workload yet.
> >
> > I don't understand why restore/reduce_pgdat_percpu_threshold() were
> > called around that particular sleep in kswapd and nowhere else.
> >
> > > vanilla 11.6615%
> > > disable-threshold 0.2584%
> >
> > Wow. That's 12% of all CPUs? How many CPUs and what workload?
> >
>
> 112 CPU threads, 14 sockets. Workload initialisation creates NR_CPU sparse
> files that are 10*TOTAL_MEMORY/NR_CPU in size. Workload itself is NR_CPU
> processes just reading their own file.
>
> The critical thing is the number of sockets. For single-socket-8-thread
> for example, vanilla was just 0.66% of time (although the patches did
> bring it down to 0.11%).

I'm surprised. I thought the inefficiency here was caused by CPUs
tromping through percpu data, adding things up. But the above info
would indicate that the problem was caused by lots of cross-socket
traffic? If so, where did that come from?

> ...
>
> Follow-on patch?

Sometime, please.

> > > ...
> > >
> > > 	if (!sleeping_prematurely(pgdat, order, remaining)) {
> > > 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> > > +		restore_pgdat_percpu_threshold(pgdat);
> > > 		schedule();
> > > +		reduce_pgdat_percpu_threshold(pgdat);
> >
> > We could do with some code comments here explaining what's going on.
> >
>
> Follow-on patch?
>
> > > 	} else {
> > > 		if (remaining)
> > > 			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> > >
> > > ...
> > >
> > > +static int calculate_pressure_threshold(struct zone *zone)
> > > +{
> > > +	int threshold;
> > > +	int watermark_distance;
> > > +
> > > +	/*
> > > +	 * As vmstats are not up to date, there is drift between the estimated
> > > +	 * and real values. For high thresholds and a high number of CPUs, it
> > > +	 * is possible for the min watermark to be breached while the estimated
> > > +	 * value looks fine. The pressure threshold is a reduced value such
> > > +	 * that even the maximum amount of drift will not accidentally breach
> > > +	 * the min watermark
> > > +	 */
> > > +	watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
> > > +	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> > > +
> > > +	/*
> > > +	 * Maximum threshold is 125
> >
> > Reasoning?
> >
>
> To match the existing maximum, which I assume is due to the deltas being
> stored in an s8.

hm, OK. So (CHAR_MAX-2) would be a tad clearer, only there's no
CHAR_MAX and "2" remains mysterious ;)

I do go on about code comments a lot lately. Eric D's kernel just
crashed because we didn't adequately comment first_zones_zonelist()
so I'm feeling all vindicated!

> > Given that ->stat_threshold is the same for each CPU, why store it for
> > each CPU at all? Why not put it in the zone and eliminate the inner
> > loop?
>
> I asked why we couldn't move the threshold to struct zone and Christoph
> responded:
>
> "If you move it then the cache footprint of the vm stat functions (which
> need to access the threshold for each access!) will increase and the
> performance will sink dramatically. I tried to avoid placing the threshold
> there when I developed that approach but it always caused a dramatic
> regression under heavy load."

I don't really buy that. The cache footprint will be increased by a
max of one cacheline (for zone->stat_threshold) and the cache footprint
will actually be reduced in the much larger percpu area (depending on
alignment and padding and stuff).

I'm suspecting something went wrong here, perhaps zone->stat_threshold
shared a cacheline with something unfortunate.

^ permalink raw reply	[flat|nested] 40+ messages in thread
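For reference, the structure being argued about: in kernels of that era the
per-cpu pageset looked roughly like the sketch below (field names taken from
the quoted code; the exact layout varies by config, so treat this as
illustrative). Christoph's point is that stat_threshold sits directly next to
the vm_stat_diff[] array it guards, so the counter fast path reads both from
one per-cpu cacheline:

struct per_cpu_pageset {
	struct per_cpu_pages pcp;	/* hot per-cpu free lists */
#ifdef CONFIG_NUMA
	s8 expire;
#endif
#ifdef CONFIG_SMP
	s8 stat_threshold;		/* read on every counter update */
	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];	/* per-cpu deltas */
#endif
};

Moving stat_threshold into struct zone would save the per-cpu copies but add
a read of a second, shared structure to every vmstat update, which is the
cost Christoph says he measured.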
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 19:40 ` Andrew Morton
@ 2010-11-02 0:53    ` Shaohua Li
  2010-11-09 11:33   ` Mel Gorman
  2010-11-09 16:48   ` Christoph Lameter
  2 siblings, 0 replies; 40+ messages in thread
From: Shaohua Li @ 2010-11-02 0:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Sat, 2010-10-30 at 03:40 +0800, Andrew Morton wrote:
> On Fri, 29 Oct 2010 11:12:11 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > On Thu, Oct 28, 2010 at 03:04:33PM -0700, Andrew Morton wrote:
> > > On Thu, 28 Oct 2010 16:13:35 +0100
> > >
> > > I have a feeling this problem will bite us again perhaps due to those
> > > other callsites, but we haven't found the workload yet.
> > >
> > > I don't understand why restore/reduce_pgdat_percpu_threshold() were
> > > called around that particular sleep in kswapd and nowhere else.
> > >
> > > > vanilla 11.6615%
> > > > disable-threshold 0.2584%
> > >
> > > Wow. That's 12% of all CPUs? How many CPUs and what workload?
> > >
> >
> > 112 CPU threads, 14 sockets. Workload initialisation creates NR_CPU sparse
> > files that are 10*TOTAL_MEMORY/NR_CPU in size. Workload itself is NR_CPU
> > processes just reading their own file.
> >
> > The critical thing is the number of sockets. For single-socket-8-thread
> > for example, vanilla was just 0.66% of time (although the patches did
> > bring it down to 0.11%).
>
> I'm surprised. I thought the inefficiency here was caused by CPUs
> tromping through percpu data, adding things up. But the above info
> would indicate that the problem was caused by lots of cross-socket
> traffic? If so, where did that come from?

From my understanding, the problem is that zone_nr_free_pages() will try
to read each CPU's ->vm_stat_diff while other CPUs are changing their
vm_stat_diff. This will cause a lot of cache bouncing.

^ permalink raw reply	[flat|nested] 40+ messages in thread
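The loop Shaohua is describing is the accurate snapshot read used by the safe
watermark checks. A simplified sketch of the kind of loop involved (not the
exact tree code) shows where the bouncing comes from:

static unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
	int cpu;

	/*
	 * Each iteration pulls in a per-cpu cacheline that its owning
	 * CPU may be dirtying at the same time; on a many-socket
	 * machine this is the cross-socket traffic described above.
	 */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
	return x;
}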
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 19:40 ` Andrew Morton
  2010-11-02 0:53   ` Shaohua Li
@ 2010-11-09 11:33  ` Mel Gorman
  2010-11-09 16:48  ` Christoph Lameter
  2 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2010-11-09 11:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Shaohua Li, KOSAKI Motohiro, Christoph Lameter, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, Oct 29, 2010 at 12:40:02PM -0700, Andrew Morton wrote:
> > ...
> >
> > Follow-on patch?
>
> Sometime, please.
>

How does this look?

==== CUT HERE ====
mm: vmscan: Comment on why kswapd reduces the per-cpu vmstat threshold

While kswapd is awake, the per-cpu vmstat threshold is reduced to keep
per-cpu drift to acceptable levels. Add a comment explaining why.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7966110..ba39948 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2378,6 +2378,19 @@ static int kswapd(void *p)
 		 */
 		if (!sleeping_prematurely(pgdat, order, remaining)) {
 			trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+
+			/*
+			 * vmstat counters are not perfectly
+			 * accurate and the estimated value
+			 * for counters such as NR_FREE_PAGES
+			 * can deviate from the true value by
+			 * nr_online_cpus * threshold. To
+			 * avoid the zone watermarks being
+			 * breached while under pressure, we
+			 * reduce the per-cpu vmstat threshold
+			 * while kswapd is awake and restore
+			 * them before going back to sleep.
+			 */
 			set_pgdat_percpu_threshold(pgdat,
 					calculate_normal_threshold);
 			schedule();

^ permalink raw reply related	[flat|nested] 40+ messages in thread
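The set_pgdat_percpu_threshold() in the context lines above is the patch 2
consolidation of reduce/restore_pgdat_percpu_threshold(): a single setter that
takes the threshold formula as a callback. A sketch of the idea, assuming the
same walk as the two functions quoted earlier in the thread (the committed
version may differ in detail):

void set_pgdat_percpu_threshold(pg_data_t *pgdat,
				int (*calculate_pressure)(struct zone *))
{
	struct zone *zone;
	int cpu;
	int threshold;
	int i;

	get_online_cpus();
	for (i = 0; i < pgdat->nr_zones; i++) {
		zone = &pgdat->node_zones[i];
		if (!zone->percpu_drift_mark)
			continue;

		/* The callback picks the pressure or the normal formula */
		threshold = (*calculate_pressure)(zone);
		for_each_online_cpu(cpu)
			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
							= threshold;
	}
	put_online_cpus();
}

kswapd would call this with calculate_pressure_threshold when it wakes and
with calculate_normal_threshold before it sleeps, as in the hunk above.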
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 19:40 ` Andrew Morton
  2010-11-02 0:53   ` Shaohua Li
  2010-11-09 11:33  ` Mel Gorman
@ 2010-11-09 16:48  ` Christoph Lameter
  2 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2010-11-09 16:48 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Shaohua Li, KOSAKI Motohiro, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 29 Oct 2010, Andrew Morton wrote:

> > To match the existing maximum, which I assume is due to the deltas being
> > stored in an s8.
>
> hm, OK. So (CHAR_MAX-2) would be a tad clearer, only there's no
> CHAR_MAX and "2" remains mysterious ;)

inc/dec_zone_page_state first increments (allows the compiler to generate
an RMW instruction; hmmm... there we could use this_cpu....) and then
checks if we are beyond the threshold. So the maximum value has to be
below CHAR_MAX. 125 is the next value that looks somewhat sane as a
limit.

We could use CHAR_MAX if we change the comparison logic for the
threshold.

> I do go on about code comments a lot lately. Eric D's kernel just
> crashed because we didn't adequately comment first_zones_zonelist()
> so I'm feeling all vindicated!

We should not have done the change to add the nodemask to
first_zones_zonelist but instead added a new function. Now the function
has two different ways of behaving.

> I don't really buy that. The cache footprint will be increased by a
> max of one cacheline (for zone->stat_threshold) and the cache footprint
> will actually be reduced in the much larger percpu area (depending on
> alignment and padding and stuff).

It will increase by one cacheline per zone in use. Large systems have
lots of zones, which increases the effect. The vmstat functions are often
used as primitives in various critical code paths. The cache footprint
needs to be kept as low as possible.

^ permalink raw reply	[flat|nested] 40+ messages in thread
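Christoph's increment-before-check point is easiest to see in a sketch of the
counter fast path, simplified from the mm/vmstat.c of that era (treat the
details as illustrative rather than the exact tree code):

void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
	s8 *p = pcp->vm_stat_diff + item;

	(*p)++;				/* increment first... */

	if (unlikely(*p > pcp->stat_threshold)) {	/* ...check after */
		int overstep = pcp->stat_threshold / 2;

		/* Fold the delta plus half a threshold into the zone */
		zone_page_state_add(*p + overstep, zone, item);
		*p = -overstep;
	}
}

With a threshold of 125 the delta peaks at 126, which still fits in the s8;
a threshold of 127 could wrap the s8 to -128 before the comparison fires.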
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-28 22:04 ` Andrew Morton
  2010-10-29 10:12  ` Mel Gorman
@ 2010-10-29 14:58 ` Christoph Lameter
  2010-10-29 18:25   ` Andrew Morton
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Lameter @ 2010-10-29 14:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Shaohua Li, KOSAKI Motohiro, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Thu, 28 Oct 2010, Andrew Morton wrote:

> > To ensure that kswapd wakes up, a safe version of zone_watermark_ok()
> > is introduced that takes a more accurate reading of NR_FREE_PAGES when
> > called from wakeup_kswapd, when deciding whether it is really safe to go
> > back to sleep in sleeping_prematurely() and when deciding if a zone is
> > really balanced or not in balance_pgdat(). We are still using an expensive
> > function but limiting how often it is called.
>
> Here I go again. I have a feeling that I already said this, but I
> can't find versions 2 or 3 in the archives...
>
> Did you evaluate using plain old percpu_counters for this? They won't
> solve the performance problem as they're basically the same thing as
> these open-coded counters. But they'd reduce the amount of noise and
> custom-coded boilerplate in mm/.

The zone counters are done using the ZVCs in vmstat.c to save space and
to be in the same cacheline as other hot data necessary for allocation
and free.

> > +	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> > +
> > +	/*
> > +	 * Maximum threshold is 125
>
> Reasoning?

Differentials are stored in 8-bit signed ints.

> > +	put_online_cpus();
> > +}
>
> Given that ->stat_threshold is the same for each CPU, why store it for
> each CPU at all? Why not put it in the zone and eliminate the inner
> loop?

Doing that caused cache misses in the past and reduced the performance of
the ZVCs. This way the threshold is in the same cacheline as the
differentials.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 14:58 ` Christoph Lameter
@ 2010-10-29 18:25   ` Andrew Morton
  2010-10-29 19:33     ` Christoph Lameter
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Morton @ 2010-10-29 18:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, Shaohua Li, KOSAKI Motohiro, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 29 Oct 2010 09:58:25 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:

> On Thu, 28 Oct 2010, Andrew Morton wrote:
>
> > > To ensure that kswapd wakes up, a safe version of zone_watermark_ok()
> > > is introduced that takes a more accurate reading of NR_FREE_PAGES when
> > > called from wakeup_kswapd, when deciding whether it is really safe to go
> > > back to sleep in sleeping_prematurely() and when deciding if a zone is
> > > really balanced or not in balance_pgdat(). We are still using an expensive
> > > function but limiting how often it is called.
> >
> > Here I go again. I have a feeling that I already said this, but I
> > can't find versions 2 or 3 in the archives...
> >
> > Did you evaluate using plain old percpu_counters for this? They won't
> > solve the performance problem as they're basically the same thing as
> > these open-coded counters. But they'd reduce the amount of noise and
> > custom-coded boilerplate in mm/.
>
> The zone counters are done using the ZVCs in vmstat.c to save space

well, they actually waste space because of that threshold thing.

> and to
> be in the same cacheline as other hot data necessary for allocation and
> free.

Yes, that'll save some misses.

> > > +	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
> > > +
> > > +	/*
> > > +	 * Maximum threshold is 125
> >
> > Reasoning?
>
> Differentials are stored in 8-bit signed ints.
>
> > > +	put_online_cpus();
> > > +}
> >
> > Given that ->stat_threshold is the same for each CPU, why store it for
> > each CPU at all? Why not put it in the zone and eliminate the inner
> > loop?
>
> Doing that caused cache misses in the past and reduced the performance of
> the ZVCs. This way the threshold is in the same cacheline as the
> differentials.

This sounds wrong. As long as that threshold isn't stored in a
cacheline which other CPUs are modifying, all CPUs should be able to
happily cache it. Maybe it needed a bit of padding inside the zone
struct.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low
  2010-10-29 18:25 ` Andrew Morton
@ 2010-10-29 19:33   ` Christoph Lameter
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Lameter @ 2010-10-29 19:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Shaohua Li, KOSAKI Motohiro, David Rientjes,
    KAMEZAWA Hiroyuki, LKML, Linux-MM

On Fri, 29 Oct 2010, Andrew Morton wrote:

> > Doing that caused cache misses in the past and reduced the performance of
> > the ZVCs. This way the threshold is in the same cacheline as the
> > differentials.
>
> This sounds wrong. As long as that threshold isn't stored in a
> cacheline which other CPUs are modifying, all CPUs should be able to
> happily cache it. Maybe it needed a bit of padding inside the zone
> struct.

High-speed CPU caches are a very scarce resource. The differentials are
not in the zone struct. I tried to put the threshold onto a single
cacheline, but even that did not do the trick for the large
configurations. The same optimizations are done in the slab allocators by
the way.

Use of percpu_counter would at least quadruple the cache footprint in
use for the counters. percpu_counter is using s32 and not s8.

^ permalink raw reply	[flat|nested] 40+ messages in thread
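Christoph's quadrupling estimate follows from the element sizes; schematically
(illustrative arithmetic, not code from the tree):

/*
 * Per-CPU, per-zone storage for the counter deltas:
 *
 *   ZVC differentials:  s8  vm_stat_diff[NR_VM_ZONE_STAT_ITEMS]
 *                       -> NR_VM_ZONE_STAT_ITEMS bytes per CPU
 *   percpu_counter:     s32 per counter
 *                       -> 4 * NR_VM_ZONE_STAT_ITEMS bytes per CPU
 *
 * Hence "at least quadruple": each per-cpu delta grows from one byte
 * to four, before counting the struct percpu_counter bookkeeping
 * (a lock and an s64 total) that exists once per counter.
 */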