* [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V3
@ 2010-08-31 17:37 ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-31 17:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

Changelog since V2
  o Minor clarifications
  o Rebase to 2.6.36-rc3

Changelog since V1
  o Fix for !CONFIG_SMP
  o Correct spelling mistakes
  o Clarify a ChangeLog
  o Only check for counter drift on machines where the counter drift is
    large enough to breach the min watermark while NR_FREE_PAGES reports
    that the low watermark is fine

Internal IBM test teams beta testing distribution kernels have reported
problems on machines with a large number of CPUs whereby page allocator
failure messages show huge differences between the nr_free_pages vmstat
counter and what is actually available on the buddy lists. In an extreme
example, nr_free_pages was above the min watermark while zero pages were
on the buddy lists, leaving the system to potentially livelock, unable to
make forward progress unless an allocation succeeds. There is no reason
why the problems would not affect mainline, so the following series
mitigates the problems in the page allocator related to per-cpu counter
drift and the per-cpu lists.
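
As a rough illustration of how the drift arises, here is a user-space toy
model (not kernel code; the names and numbers are made up). It batches
counter updates in per-cpu deltas and only folds a delta into the global
counter once it crosses a threshold, the same scheme vmstat uses:

#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS		64
#define THRESHOLD	125	/* max delta held per CPU before folding */

static long nr_free_pages;		/* global NR_FREE_PAGES estimate */
static long vm_stat_diff[NR_CPUS];	/* per-cpu unfolded deltas */

/* A CPU adjusts the free page count; negative pages model allocations */
static void cpu_mod_free_pages(int cpu, long pages)
{
	vm_stat_diff[cpu] += pages;
	if (labs(vm_stat_diff[cpu]) > THRESHOLD) {
		nr_free_pages += vm_stat_diff[cpu];
		vm_stat_diff[cpu] = 0;
	}
}

int main(void)
{
	int cpu;

	/* Every CPU allocates a threshold's worth of pages... */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_mod_free_pages(cpu, -THRESHOLD);

	/* ...yet the global counter never moved, overstating the free
	 * pages by NR_CPUS * THRESHOLD */
	printf("global saw a delta of %ld, reality is %ld\n",
	       nr_free_pages, (long)NR_CPUS * -THRESHOLD);
	return 0;
}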

The first patch ensures that counters are updated after pages are added to
free lists.

The second patch notes that the counter drift between NR_FREE_PAGES and
what is on the per-cpu lists can be very high. When memory is low and
kswapd is awake, the per-cpu deltas are read as well as the value of
NR_FREE_PAGES. This slows the page allocator when memory is low and kswapd
is awake, but it becomes much harder to breach the min watermark and
potentially livelock the system.
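
For scale (a back-of-envelope estimate): patch two only enables the slow
path when the worst-case drift, num_online_cpus() * stat_threshold, exceeds
the gap between the low and min watermarks. On a 64-CPU machine with a
per-cpu threshold of, say, 125, NR_FREE_PAGES can be stale by up to
64 * 125 = 8000 pages, about 31MB with 4K pages, which is often far larger
than that gap.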

The third patch notes that, after direct reclaim, an allocation can
fail because the necessary pages are on the per-cpu lists. After a
direct-reclaim-and-allocation failure, the per-cpu lists are drained and
a second attempt is made.

Performance tests against 2.6.36-rc1 did not show anything interesting. A
version of this series that continually called vmstat_update() when
memory was low was tested internally and found to help the counter-drift
problem. I described this during the LSF/MM Summit and the potential for
IPI storms was frowned upon. An alternative fix is in patch two, which uses
for_each_online_cpu() to read the vmstat deltas while memory is low and
kswapd is awake. This should be functionally similar.

Christoph Lameter made two suggestions that I did not act on. The first
was to make a generic helper that could be used to get a semi-accurate
reading of any vmstat counter. However, there is no evidence this is
necessary, and it would be better to get a clear understanding of which
counters other than NR_FREE_PAGES need special treatment by making it
obvious when such a helper is introduced. The second suggestion was to
shrink the threshold at which vmstat gets updated, affecting all counters.
It was unclear if this was sufficient or necessary; only NR_FREE_PAGES
is the problem counter, so why affect every other counter? Also, shrinking
the threshold just shrinks the window in which the race can occur. Hence,
I'm reposting the series as-is to see if there are any current objections
to deal with or if we can close up this problem now.

This series should be merged after the patch "vmstat : update
zone stat threshold at onlining a cpu" which is in mmotm as
vmstat-update-zone-stat-threshold-when-onlining-a-cpu.patch. If we can
agree on it, it's a stable candidate.

 include/linux/mmzone.h |   13 +++++++++++++
 mm/mmzone.c            |   29 +++++++++++++++++++++++++++++
 mm/page_alloc.c        |   29 +++++++++++++++++++++--------
 mm/vmstat.c            |   15 ++++++++++++++-
 4 files changed, 77 insertions(+), 9 deletions(-)



* [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
  2010-08-31 17:37 ` Mel Gorman
@ 2010-08-31 17:37   ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-31 17:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When allocating a page, the system uses NR_FREE_PAGES counters to determine
if watermarks would remain intact after the allocation was made. This
check is made without interrupts disabled or the zone lock held and so is
race-prone by nature. Unfortunately, when pages are being freed in batch,
the counters are updated before the pages are added on the list. During this
window, the counters are misleading as the pages do not exist yet. When
under significant pressure on systems with large numbers of CPUs, it's
possible for processes to make progress even though they should have been
stalled. This is particularly problematic if a number of the processes are
using GFP_ATOMIC as the min watermark can be accidentally breached and in
extreme cases, the system can livelock.

This patch updates the counters after the pages have been added to the
list. This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.
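
To make the window concrete, here is a sketch of one problematic
interleaving (illustrative only; the function names are taken from the
diff below):

/*
 * CPU A: free_pcppages_bulk()            CPU B: allocation path
 * ---------------------------            ----------------------
 * __mod_zone_page_state(zone,
 *         NR_FREE_PAGES, count);
 *                                        zone_watermark_ok() reads
 *                                        NR_FREE_PAGES and passes,
 *                                        although "count" pages are
 *                                        not on any free list yet
 * pages linked back onto the
 * zone's free lists
 *
 * With the counter update moved after the list manipulation, the
 * watermark check can only observe pages that really exist.
 */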

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9649f4..97d74a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,12 +588,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
+	int freed = count;
 
 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
 	while (count) {
 		struct page *page;
 		struct list_head *list;
@@ -621,6 +621,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			trace_mm_page_pcpu_drain(page, 0, page_private(page));
 		} while (--count && --batch_free && !list_empty(list));
 	}
+	__mod_zone_page_state(zone, NR_FREE_PAGES, freed);
 	spin_unlock(&zone->lock);
 }
 
@@ -631,8 +632,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	__free_one_page(page, zone, order, migratetype);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	spin_unlock(&zone->lock);
 }
 
-- 
1.7.1



* [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-31 17:37 ` Mel Gorman
@ 2010-08-31 17:37   ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-31 17:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can be
very high.  If NR_FREE_PAGES is much higher than the number of pages
actually free in the buddy lists, the VM can allocate pages below the min
watermark, at worst reducing the real number of free pages to zero.  Even
if the OOM killer kills a victim to free memory, no memory may be freed if
the exit path requires a new page, resulting in livelock.

This patch introduces zone_nr_free_pages() to provide a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |   13 +++++++++++++
 mm/mmzone.c            |   29 +++++++++++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |   15 ++++++++++++++-
 4 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..3984c4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -441,6 +448,12 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..69ecbe9 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,32 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, it is considered the zone is under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * per-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset;
+
+			pset = per_cpu_ptr(zone->pageset, cpu);
+			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
+		}
+	}
+
+	return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97d74a0..bbaa959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2424,7 +2424,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..696cab2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -138,11 +138,24 @@ static void refresh_zone_stat_thresholds(void)
 	int threshold;
 
 	for_each_populated_zone(zone) {
+		unsigned long max_drift, tolerate_drift;
+
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		/*
+		 * Only set percpu_drift_mark if there is a danger that
+		 * NR_FREE_PAGES reports the low watermark is ok when in fact
+		 * the min watermark could be breached by an allocation
+		 */
+		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+		max_drift = num_online_cpus() * threshold;
+		if (max_drift > tolerate_drift)
+			zone->percpu_drift_mark = high_wmark_pages(zone) +
+					max_drift;
 	}
 }
 
@@ -813,7 +826,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1



* [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-31 17:37 ` Mel Gorman
@ 2010-08-31 17:37   ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-31 17:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should, increasing the pressure on the system and compounding the problem.

This patch notes that if direct reclaim is making progress but
allocations are still failing, the system is already under heavy
pressure. In this case, it drains the per-cpu lists and tries the
allocation a second time before continuing.
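
As a design note on the hunk below: the old code unconditionally called
drain_all_pages() before the post-reclaim allocation attempt, but only for
order > 0 requests. The new code drains for any order, yet only after a
first allocation attempt has failed, so the relatively expensive drain is
paid only when it might actually help.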

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbaa959..750e1dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	bool drained = false;
 
 	cond_resched();
 
@@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
-	if (order != 0)
-		drain_all_pages();
+	if (unlikely(!(*did_some_progress)))
+		return NULL;
 
-	if (likely(*did_some_progress))
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
+retry:
+	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	/*
+	 * If an allocation failed after direct reclaim, it could be because
+	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 */
+	if (!page && !drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	return page;
 }
 
-- 
1.7.1



* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
  2010-08-31 17:37   ` Mel Gorman
@ 2010-08-31 18:17     ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-08-31 18:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro


I already did a

Reviewed-by: Christoph Lameter <cl@linux.com>

I believe?




* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-31 17:37   ` Mel Gorman
@ 2010-08-31 18:20     ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-08-31 18:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro


Reviewed-by: Christoph Lameter <cl@linux.com>




* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-31 17:37   ` Mel Gorman
@ 2010-08-31 18:26     ` Christoph Lameter
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-08-31 18:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro


Reviewed-by: Christoph Lameter <cl@linux.com>




* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
  2010-08-31 17:37   ` Mel Gorman
@ 2010-08-31 23:27     ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-08-31 23:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki

> When allocating a page, the system uses NR_FREE_PAGES counters to determine
> if watermarks would remain intact after the allocation was made. This
> check is made without interrupts disabled or the zone lock held and so is
> race-prone by nature. Unfortunately, when pages are being freed in batch,
> the counters are updated before the pages are added on the list. During this
> window, the counters are misleading as the pages do not exist yet. When
> under significant pressure on systems with large numbers of CPUs, it's
> possible for processes to make progress even though they should have been
> stalled. This is particularly problematic if a number of the processes are
> using GFP_ATOMIC as the min watermark can be accidentally breached and in
> extreme cases, the system can livelock.
> 
> This patch updates the counters after the pages have been added to the
> list. This makes the allocator more cautious with respect to preserving
> the watermarks and mitigates livelock possibilities.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>






* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-31 17:37   ` Mel Gorman
@ 2010-08-31 23:37     ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-08-31 23:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki

> +#ifdef CONFIG_SMP
> +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> +unsigned long zone_nr_free_pages(struct zone *zone)
> +{
> +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +
> +	/*
> +	 * While kswapd is awake, it is considered the zone is under some
> +	 * memory pressure. Under pressure, there is a risk that
> +	 * per-cpu-counter-drift will allow the min watermark to be breached
> +	 * potentially causing a live-lock. While kswapd is awake and
> +	 * free pages are low, get a better estimate for free pages
> +	 */
> +	if (nr_free_pages < zone->percpu_drift_mark &&
> +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> +		int cpu;
> +
> +		for_each_online_cpu(cpu) {
> +			struct per_cpu_pageset *pset;
> +
> +			pset = per_cpu_ptr(zone->pageset, cpu);
> +			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];

If my understanding is correct, we hold no lock when reading pset->vm_stat_diff.
That means nr_free_pages can reach a negative value in a very rare race. Is a
boundary check necessary?






* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
  2010-08-31 18:17     ` Christoph Lameter
@ 2010-09-01  7:10       ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-09-01  7:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 31, 2010 at 01:17:44PM -0500, Christoph Lameter wrote:
> 
> I already did a
> 
> Reviewed-by: Christoph Lameter <cl@linux.com>
> 
> I believe?
> 

You did and I omitted it. It's included now. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-31 23:37     ` KOSAKI Motohiro
@ 2010-09-01  7:24       ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-09-01  7:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki

On Wed, Sep 01, 2010 at 08:37:41AM +0900, KOSAKI Motohiro wrote:
> > +#ifdef CONFIG_SMP
> > +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> > +unsigned long zone_nr_free_pages(struct zone *zone)
> > +{
> > +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > +	/*
> > +	 * While kswapd is awake, it is considered the zone is under some
> > +	 * memory pressure. Under pressure, there is a risk that
> > +	 * per-cpu-counter-drift will allow the min watermark to be breached
> > +	 * potentially causing a live-lock. While kswapd is awake and
> > +	 * free pages are low, get a better estimate for free pages
> > +	 */
> > +	if (nr_free_pages < zone->percpu_drift_mark &&
> > +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> > +		int cpu;
> > +
> > +		for_each_online_cpu(cpu) {
> > +			struct per_cpu_pageset *pset;
> > +
> > +			pset = per_cpu_ptr(zone->pageset, cpu);
> > +			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> 
> If my understanding is correct, we hold no lock when reading pset->vm_stat_diff.
> That means nr_free_pages can reach a negative value in a very rare race. Is a
> boundary check necessary?
> 

True, well spotted.

How about the following? It records a delta and checks if delta is negative
and would cause underflow.

unsigned long zone_nr_free_pages(struct zone *zone)
{
        unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
        long delta = 0;

        /*
         * While kswapd is awake, it is considered the zone is under some
         * memory pressure. Under pressure, there is a risk that
         * per-cpu-counter-drift will allow the min watermark to be breached
         * potentially causing a live-lock. While kswapd is awake and
         * free pages are low, get a better estimate for free pages
         */
        if (nr_free_pages < zone->percpu_drift_mark &&
                        !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
                int cpu;

                for_each_online_cpu(cpu) {
                        struct per_cpu_pageset *pset;

                        pset = per_cpu_ptr(zone->pageset, cpu);
                        delta += pset->vm_stat_diff[NR_FREE_PAGES];
                }
        }

        /* Watch for underflow */
        if (delta < 0 && abs(delta) > nr_free_pages)
                delta = -nr_free_pages;

        return nr_free_pages + delta;
}
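
As a worked example of the guard: if NR_FREE_PAGES reads 100 while the
racily-read deltas happen to sum to -150, returning nr_free_pages + delta
would wrap the unsigned return value to a huge number; with the clamp,
delta becomes -100 and the function returns 0.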

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-01  7:24       ` Mel Gorman
@ 2010-09-01  7:33         ` KOSAKI Motohiro
  0 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-09-01  7:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki

> On Wed, Sep 01, 2010 at 08:37:41AM +0900, KOSAKI Motohiro wrote:
> > > +#ifdef CONFIG_SMP
> > > +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> > > +unsigned long zone_nr_free_pages(struct zone *zone)
> > > +{
> > > +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > > +
> > > +	/*
> > > +	 * While kswapd is awake, it is considered the zone is under some
> > > +	 * memory pressure. Under pressure, there is a risk that
> > > +	 * per-cpu-counter-drift will allow the min watermark to be breached
> > > +	 * potentially causing a live-lock. While kswapd is awake and
> > > +	 * free pages are low, get a better estimate for free pages
> > > +	 */
> > > +	if (nr_free_pages < zone->percpu_drift_mark &&
> > > +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> > > +		int cpu;
> > > +
> > > +		for_each_online_cpu(cpu) {
> > > +			struct per_cpu_pageset *pset;
> > > +
> > > +			pset = per_cpu_ptr(zone->pageset, cpu);
> > > +			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> > 
> > If my understanding is correct, we hold no lock when reading pset->vm_stat_diff.
> > That means nr_free_pages can reach a negative value in a very rare race. Is a
> > boundary check necessary?
> > 
> 
> True, well spotted.
> 
> How about the following? It records a delta and checks if delta is negative
> and would cause underflow.
> 
> unsigned long zone_nr_free_pages(struct zone *zone)
> {
>         unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
>         long delta = 0;
> 
>         /*
>          * While kswapd is awake, it is considered the zone is under some
>          * memory pressure. Under pressure, there is a risk that
>          * per-cpu-counter-drift will allow the min watermark to be breached
>          * potentially causing a live-lock. While kswapd is awake and
>          * free pages are low, get a better estimate for free pages
>          */
>         if (nr_free_pages < zone->percpu_drift_mark &&
>                         !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
>                 int cpu;
> 
>                 for_each_online_cpu(cpu) {
>                         struct per_cpu_pageset *pset;
> 
>                         pset = per_cpu_ptr(zone->pageset, cpu);
>                         delta += pset->vm_stat_diff[NR_FREE_PAGES];
>                 }
>         }
> 
>         /* Watch for underflow */
>         if (delta < 0 && abs(delta) > nr_free_pages)
>                 delta = -nr_free_pages;
> 
>         return nr_free_pages + delta;
> }

Looks good to me :)
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Thanks.






* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-01  7:33         ` KOSAKI Motohiro
@ 2010-09-01 20:16           ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-09-01 20:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Wed, 1 Sep 2010, KOSAKI Motohiro wrote:

> > How about the following? It records a delta and checks if delta is negative
> > and would cause underflow.
> >
> > unsigned long zone_nr_free_pages(struct zone *zone)
> > {
> >         unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> >         long delta = 0;
> >
> >         /*
> >          * While kswapd is awake, it is considered the zone is under some
> >          * memory pressure. Under pressure, there is a risk that
> >          * per-cpu-counter-drift will allow the min watermark to be breached
> >          * potentially causing a live-lock. While kswapd is awake and
> >          * free pages are low, get a better estimate for free pages
> >          */
> >         if (nr_free_pages < zone->percpu_drift_mark &&
> >                         !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> >                 int cpu;
> >
> >                 for_each_online_cpu(cpu) {
> >                         struct per_cpu_pageset *pset;
> >
> >                         pset = per_cpu_ptr(zone->pageset, cpu);
> >                         delta += pset->vm_stat_diff[NR_FREE_PAGES];
> >                 }
> >         }
> >
> >         /* Watch for underflow */
> >         if (delta < 0 && abs(delta) > nr_free_pages)
> >                 delta = -nr_free_pages;

Not sure what the point here is. If the delta is going below zero then
there was a concurrent operation updating the counters negatively while
we summed up the counters. It is then safe to assume a value of zero. We
cannot really be more accurate than that.

so

	if (delta < 0)
		delta = 0;

would be correct. See also handling of counter underflow in
vmstat.h:zone_page_state(). As I have said before: I would rather have the
counter handling in one place to avoid creating differences in counter
handling.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-01 20:16           ` Christoph Lameter
@ 2010-09-01 20:34             ` Mel Gorman
  -1 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-09-01 20:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Wed, Sep 01, 2010 at 03:16:59PM -0500, Christoph Lameter wrote:
> On Wed, 1 Sep 2010, KOSAKI Motohiro wrote:
> 
> > > How about the following? It records a delta and checks if delta is negative
> > > and would cause underflow.
> > >
> > > unsigned long zone_nr_free_pages(struct zone *zone)
> > > {
> > >         unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > >         long delta = 0;
> > >
> > >         /*
> > >          * While kswapd is awake, it is considered the zone is under some
> > >          * memory pressure. Under pressure, there is a risk that
> > >          * per-cpu-counter-drift will allow the min watermark to be breached
> > >          * potentially causing a live-lock. While kswapd is awake and
> > >          * free pages are low, get a better estimate for free pages
> > >          */
> > >         if (nr_free_pages < zone->percpu_drift_mark &&
> > >                         !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> > >                 int cpu;
> > >
> > >                 for_each_online_cpu(cpu) {
> > >                         struct per_cpu_pageset *pset;
> > >
> > >                         pset = per_cpu_ptr(zone->pageset, cpu);
> > >                         delta += pset->vm_stat_diff[NR_FREE_PAGES];
> > >                 }
> > >         }
> > >
> > >         /* Watch for underflow */
> > >         if (delta < 0 && abs(delta) > nr_free_pages)
> > >                 delta = -nr_free_pages;
> 
> Not sure what the point here is. If the delta is going below zero then
> there was a concurrent operation updating the counters negatively while
> we summed up the counters.

The point is if the negative delta is greater than the current value of
nr_free_pages then nr_free_pages would underflow when delta is applied to it.

> It is then safe to assume a value of zero. We
> cannot really be more accurate than that.
> 
> so
> 
> 	if (delta < 0)
> 		delta = 0;
> 
> would be correct.

Let's say the reading at the start for nr_free_pages is 120 and the delta is
-20; then the estimated true value of nr_free_pages is 100. If we used your
logic, the estimate would be 120. Maybe I'm missing what you're saying.

> See also handling of counter underflow in
> vmstat.h:zone_page_state().

I'm not seeing the relation. zone_nr_free_pages() is trying to
reconcile the reading from zone_page_state() with the contents of
vm_stat_diff[].

> As I have said before: I would rather have the
> counter handling in one place to avoid creating differences in counter
> handling.
> 

And I'd rather not hurt the paths for every counter unnecessarily
without good cause. I can move zone_nr_free_pages() to mm/vmstat.c if
you'd prefer?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-01 20:34             ` Mel Gorman
@ 2010-09-02  0:24               ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-09-02  0:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Wed, 1 Sep 2010, Mel Gorman wrote:

> > > >         if (delta < 0 && abs(delta) > nr_free_pages)
> > > >                 delta = -nr_free_pages;
> >
> > Not sure what the point here is. If the delta is going below zero then
> > there was a concurrent operation updating the counters negatively while
> > we summed up the counters.
>
> The point is if the negative delta is greater than the current value of
> nr_free_pages then nr_free_pages would underflow when delta is applied to it.

Ok. then

	nr_free_pages += delta;
	if (nr_free_pages < 0)
		nr_free_pages = 0;

> > would be correct.
>
> Lets say the reading at the start for nr_free_pages is 120 and the delta is
> -20, then the estimated true value of nr_free_pages is 100. If we used your
> logic, the estimate would be 120. Maybe I'm missing what you're saying.

Well, yes, the sum of the counter needs to be checked, not just the sum of
the deltas. This is the same as the counter determination in vmstat.h.

> > See also handling of counter underflow in
> > vmstat.h:zone_page_state().
>
> I'm not seeing the relation. zone_nr_free_pages() is trying to
> reconcile the reading from zone_page_state() with the contents of
> vm_stat_diff[].

Both are determinations of a counter value. The global or zone counters
can also temporarily go below zero due to deferred updates. If
this happens then 0 will be returned(!). zone_nr_free_pages needs to work
in the same way.

> > As I have said before: I would rather have the
> > counter handling in one place to avoid creating differences in counter
> > handling.
> >
>
> And I'd rather not hurt the paths for every counter unnecessarily
> without good cause. I can move zone_nr_free_pages() to mm/vmstat.c if
> you'd prefer?

Generalize it on the way please to work with any counter?



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-02  0:24               ` Christoph Lameter
@ 2010-09-02  0:26                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-09-02  0:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Mel Gorman, Andrew Morton, Linux Kernel List,
	linux-mm, Rik van Riel, Johannes Weiner, Minchan Kim,
	KAMEZAWA Hiroyuki

> On Wed, 1 Sep 2010, Mel Gorman wrote:
> 
> > > > >         if (delta < 0 && abs(delta) > nr_free_pages)
> > > > >                 delta = -nr_free_pages;
> > >
> > > Not sure what the point here is. If the delta is going below zero then
> > > there was a concurrent operation updating the counters negatively while
> > > we summed up the counters.
> >
> > The point is if the negative delta is greater than the current value of
> > nr_free_pages then nr_free_pages would underflow when delta is applied to it.
> 
> Ok. then
> 
> 	nr_free_pages += delta;
> 	if (nr_free_pages < 0)
> 		nr_free_pages = 0;

nr_free_pages is unsigned. This wouldn't work ;)
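
To illustrate (a minimal userspace sketch, not from the patch; the
values are made up):

	unsigned long nr_free_pages = 10;	/* counter reading */
	long delta = -20;			/* summed per-cpu deltas */

	nr_free_pages += delta;	/* wraps to 2^64 - 10 on 64-bit, not -10 */
	if (nr_free_pages < 0)	/* always false for an unsigned type */
		nr_free_pages = 0;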





^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-02  0:26                 ` KOSAKI Motohiro
@ 2010-09-02  0:39                   ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-09-02  0:39 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Thu, 2 Sep 2010, KOSAKI Motohiro wrote:

> > 	nr_free_pages += delta;
> > 	if (nr_free_pages < 0)
> > 		nr_free_pages = 0;
>
> nr_free_pages is unsined. this wouldn't works ;)

The VM counters are signed and must be signed, otherwise the deferred
update scheme would cause disasters. For treatment in the page allocator
these may be converted to unsigned.

The effect needs to be the same as retrieving a global or
zone ZVC counter. Which is currently implemented in the following way:

static inline unsigned long zone_page_state(struct zone *zone,
                                        enum zone_stat_item item)
{
        long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
        if (x < 0)
                x = 0;
#endif
        return x;
}

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-31 17:37   ` Mel Gorman
@ 2010-09-02  0:43     ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-09-02  0:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, 31 Aug 2010, Mel Gorman wrote:

> +#ifdef CONFIG_SMP
> +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> +unsigned long zone_nr_free_pages(struct zone *zone)
> +{
> +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);

You cannot call zone_page_state here because zone_page_state clips the
counter at zero. The nr_free_pages needs to reflect the unclipped state
and then the deltas need to be added. Then the clipping at zero can be
done.
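
In other words (a rough sketch of the ordering only, anticipating the
snapshot helper that appears below; not the actual patch):

	int cpu;
	/* read the raw, unclipped counter first */
	long x = atomic_long_read(&zone->vm_stat[NR_FREE_PAGES]);

	/* then fold in the pending per-cpu deltas */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[NR_FREE_PAGES];

	/* and clip at zero only at the very end */
	if (x < 0)
		x = 0;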

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-02  0:43     ` Christoph Lameter
@ 2010-09-02  0:49       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 99+ messages in thread
From: KOSAKI Motohiro @ 2010-09-02  0:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, Mel Gorman, Andrew Morton, Linux Kernel List,
	linux-mm, Rik van Riel, Johannes Weiner, Minchan Kim,
	KAMEZAWA Hiroyuki

> On Tue, 31 Aug 2010, Mel Gorman wrote:
> 
> > +#ifdef CONFIG_SMP
> > +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> > +unsigned long zone_nr_free_pages(struct zone *zone)
> > +{
> > +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> 
> You cannot call zone_page_state here because zone_page_state clips the
> counter at zero. The nr_free_pages needs to reflect the unclipped state
> and then the deltas need to be added. Then the clipping at zero can be
> done.

Good spotting. You are right.




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-02  0:39                   ` Christoph Lameter
@ 2010-09-02  0:54                     ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-09-02  0:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Wed, 1 Sep 2010, Christoph Lameter wrote:

> The effect needs to be the same as retrieving a global or
> zone ZVC counter. Which is currently implemented in the following way:
>
> static inline unsigned long zone_page_state(struct zone *zone,
>                                         enum zone_stat_item item)
> {
>         long x = atomic_long_read(&zone->vm_stat[item]);
> #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> #endif
>         return x;
> }
>

Here is a patch that defines a snapshot function that works in the same
way:

Subject: Add a snapshot function for vm statistics

Add a snapshot function that can more accurately determine
the current value of a zone counter.

Signed-off-by: Christoph Lameter <cl@linux.com>


Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2010-09-01 19:45:23.506071189 -0500
+++ linux-2.6/include/linux/vmstat.h	2010-09-01 19:53:02.978979081 -0500
@@ -170,6 +170,28 @@
 	return x;
 }

+/*
+ * More accurate version that also considers the currently pending
+ * deltas. For that we need to loop over all cpus to find the current
+ * deltas. There is no synchronization so the result cannot be
+ * exactly accurate either.
+ */
+static inline unsigned long zone_page_state_snapshot(struct zone *zone,
+					enum zone_stat_item item)
+{
+	int cpu;
+	long x = atomic_long_read(&zone->vm_stat[item]);
+
+#ifdef CONFIG_SMP
+	for_each_online_cpu(cpu)
+		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
+
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
 extern unsigned long global_reclaimable_pages(void);
 extern unsigned long zone_reclaimable_pages(struct zone *zone);


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-02  0:43     ` Christoph Lameter
@ 2010-09-02  8:51       ` Mel Gorman
  -1 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-09-02  8:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed, Sep 01, 2010 at 07:43:41PM -0500, Christoph Lameter wrote:
> On Tue, 31 Aug 2010, Mel Gorman wrote:
> 
> > +#ifdef CONFIG_SMP
> > +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> > +unsigned long zone_nr_free_pages(struct zone *zone)
> > +{
> > +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> 
> You cannot call zone_page_state here because zone_page_state clips the
> counter at zero. The nr_free_pages needs to reflect the unclipped state
> and then the deltas need to be added. Then the clipping at zero can be
> done.
> 

Good point. This justifies a generic helper that is co-located with the
other counter handling in vmstat.h. I've taken your
zone_page_state_snapshot() patch, and I'm using the helper to take a more
accurate reading of NR_FREE_PAGES while preparing for a test. Thanks,
Christoph.
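
With the helper, zone_nr_free_pages() reduces to roughly the following
(a sketch matching the shape the v3 posting quoted later in this thread
ends up with):

#ifdef CONFIG_SMP
/* Called when a more accurate view of NR_FREE_PAGES is needed */
unsigned long zone_nr_free_pages(struct zone *zone)
{
	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);

	/*
	 * While kswapd is awake and free pages are low, take the more
	 * expensive snapshot so that per-cpu counter drift cannot hide
	 * a breach of the min watermark.
	 */
	if (nr_free_pages < zone->percpu_drift_mark &&
			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
		return zone_page_state_snapshot(zone, NR_FREE_PAGES);

	return nr_free_pages;
}
#endif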

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03 22:55     ` Andrew Morton
@ 2010-09-05 18:12       ` Mel Gorman
  -1 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-09-05 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Fri, Sep 03, 2010 at 03:55:37PM -0700, Andrew Morton wrote:
> On Fri,  3 Sep 2010 10:08:45 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > From: Christoph Lameter <cl@linux.com>
> > 
> > Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
> > it is cheaper than scanning a number of lists. To avoid synchronization
> > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > periodically and when the delta is above a threshold. On large CPU systems,
> > the difference between the estimated and real value of NR_FREE_PAGES can
> > be very high. If NR_FREE_PAGES is much higher than the number of real
> > free pages in buddy, the VM can allocate pages below min watermark, at
> > worst reducing
> > the real number of pages to zero. Even if the OOM killer kills some victim
> > for freeing memory, it may not free memory if the exit path requires a new
> > page resulting in livelock.
> > 
> > This patch introduces a zone_page_state_snapshot() function (courtesy of
> > Christoph) that takes a slightly more accurate reading of an arbitrary
> > vmstat counter.
> > It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark
> > being accidentally broken.  The estimate is not perfect and may result
> > in cache line bounces but is expected to be lighter than the IPI calls
> > necessary to continually drain the per-cpu counters while kswapd is awake.
> > 
> 
> The "is kswapd awake" heuristic seems fairly hacky.  Can it be
> improved, made more deterministic? 

It could be removed but the problem is that the snap version of the
function could be continually used on large systems that are using
almost all physical memory but not under any memory pressure. kswapd
being awake seemed a reasonable proxy indicator that the system is under
pressure.

> Exactly what state are we looking
> for here?
> 

We want to know when the system is in a state where it is both under
pressure and in danger of breaching the watermark due to per-cpu counter
drift.

> 
> > +/*
> > + * More accurate version that also considers the currently pending
> > + * deltas. For that we need to loop over all cpus to find the current
> > + * deltas. There is no synchronization so the result cannot be
> > + * exactly accurate either.
> > + */
> > +static inline unsigned long zone_page_state_snapshot(struct zone *zone,
> > +					enum zone_stat_item item)
> > +{
> > +	long x = atomic_long_read(&zone->vm_stat[item]);
> > +
> > +#ifdef CONFIG_SMP
> > +	int cpu;
> > +	for_each_online_cpu(cpu)
> > +		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
> > +
> > +	if (x < 0)
> > +		x = 0;
> > +#endif
> > +	return x;
> > +}
> 
> aka percpu_counter_sum()!
> 
> Can someone remind me why per_cpu_pageset went and reimplemented
> percpu_counters rather than just using them?
> 

It's not an exact fit. Christoph answered this and I do not have
anything additional to say.

> >  extern unsigned long global_reclaimable_pages(void);
> >  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> >  
> > diff --git a/mm/mmzone.c b/mm/mmzone.c
> > index f5b7d17..e35bfb8 100644
> > --- a/mm/mmzone.c
> > +++ b/mm/mmzone.c
> > @@ -87,3 +87,24 @@ int memmap_valid_within(unsigned long pfn,
> >  	return 1;
> >  }
> >  #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> > +
> > +#ifdef CONFIG_SMP
> > +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> > +unsigned long zone_nr_free_pages(struct zone *zone)
> > +{
> > +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > +	/*
> > +	 * While kswapd is awake, it is considered the zone is under some
> > +	 * memory pressure. Under pressure, there is a risk that
> > +	 * per-cpu-counter-drift will allow the min watermark to be breached
> > +	 * potentially causing a live-lock. While kswapd is awake and
> > +	 * free pages are low, get a better estimate for free pages
> > +	 */
> > +	if (nr_free_pages < zone->percpu_drift_mark &&
> > +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> > +		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
> > +
> > +	return nr_free_pages;
> > +}
> 
> Is this really the best way of doing it?  The way we usually solve
> this problem (and boy, was this bug a newbie mistake!) is:
> 
> 	foo = percpu_counter_read(x);
> 
> 	if (foo says something bad) {
> 		/* Bad stuff: let's get a more accurate foo */
> 		foo = percpu_counter_sum(x);
> 	}
> 
> 	if (foo still says something bad)
> 		do_bad_thing();
> 
> In other words, don't do all this stuff with percpu_drift_mark and the
> kswapd heuristic.

The percpu_drift_mark and the kswapd heuristic correspond to your "foo
says something bad" above. The drift mark is detecting we're in
potential danger and the kswapd check is telling us we are both in
danger and there is memory pressure. Even if we were using the percpu
counters, it wouldn't eliminate the need for percpu_drift_mark and the
kswapd heuristic, right?

> Just change zone_watermark_ok() to use the more
> accurate read if it's about to return "no".
> 

It could be too late by then. By the time zone_watermark_ok() is about
to return no, we could have already breached the watermark by a
significant amount due to the per-cpu counter drift.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03 23:28         ` Andrew Morton
@ 2010-09-04  0:54           ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-09-04  0:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, 3 Sep 2010, Andrew Morton wrote:

> > percpu counters must always be added up when their value is determined.
>
> Nope.  That's the difference between percpu_counter_read() and
> percpu_counter_sum().

Hmmm... Okay, then you can fold them. That is analogous to what we do
in the _snapshot function now.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03 23:17       ` Christoph Lameter
@ 2010-09-03 23:28         ` Andrew Morton
  -1 siblings, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2010-09-03 23:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, 3 Sep 2010 18:17:46 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Fri, 3 Sep 2010, Andrew Morton wrote:
> 
> > Can someone remind me why per_cpu_pageset went and reimplemented
> > percpu_counters rather than just using them?
> 
> The vm counters are per zone and per cpu and have a flow from per cpu /
> zone deltas to zone counters and then also into global counters.

hm.  percpu counters would require overflow-time hooks to do that. 
Might be worth looking at.

> > Is this really the best way of doing it?  The way we usually solve
> > this problem (and boy, was this bug a newbie mistake!) is:
> >
> > 	foo = percpu_counter_read(x);
> >
> > 	if (foo says something bad) {
> > 		/* Bad stuff: let's get a more accurate foo */
> > 		foo = percpu_counter_sum(x);
> > 	}
> >
> > 	if (foo still says something bad)
> > 		do_bad_thing();
> >
> > In other words, don't do all this stuff with percpu_drift_mark and the
> > kswapd heuristic.  Just change zone_watermark_ok() to use the more
> > accurate read if it's about to return "no".
> 
> percpu counters must always be added up when their value is determined.

Nope.  That's the difference between percpu_counter_read() and
percpu_counter_sum().

> This seems to be a special case here where Mel does not want to pay the
> cost to bring the counters up to date, nor reduce the delta/time limits to
> get some more accuracy, but wants to take some sort of snapshot of the
> whole situation for this particular case.

My suggestion didn't actually have anything to do with percpu_counters.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03 22:55     ` Andrew Morton
@ 2010-09-03 23:17       ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-09-03 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, 3 Sep 2010, Andrew Morton wrote:

> Can someone remind me why per_cpu_pageset went and reimplemented
> percpu_counters rather than just using them?

The vm counters are per zone and per cpu and have a flow from per cpu /
zone deltas to zone counters and then also into global counters.
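
Roughly, for a given item and delta (a simplified sketch of that flow,
not the actual mm/vmstat.c code):

	struct per_cpu_pageset *pset = this_cpu_ptr(zone->pageset);
	s8 *diff = pset->vm_stat_diff + item;

	*diff += delta;
	if (*diff > pset->stat_threshold || *diff < -pset->stat_threshold) {
		/* fold the per-cpu delta into the zone and global counters */
		atomic_long_add(*diff, &zone->vm_stat[item]);
		atomic_long_add(*diff, &vm_stat[item]);
		*diff = 0;
	}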

> Is this really the best way of doing it?  The way we usually solve
> this problem (and boy, was this bug a newbie mistake!) is:
>
> 	foo = percpu_counter_read(x);
>
> 	if (foo says something bad) {
> 		/* Bad stuff: let's get a more accurate foo */
> 		foo = percpu_counter_sum(x);
> 	}
>
> 	if (foo still says something bad)
> 		do_bad_thing();
>
> In other words, don't do all this stuff with percpu_drift_mark and the
> kswapd heuristic.  Just change zone_watermark_ok() to use the more
> accurate read if it's about to return "no".

percpu counters must always be added up when their value is determined. We
cannot really afford that for the VM. Counters are always available
without looping over all cpus.

vm counters are continually kept up to date (but may have deltas limited by
time and counter values).

This seems to be a special case here where Mel does not want to pay the
cost to bring the counters up to date, nor reduce the delta/time limits to
get some more accuracy, but wants to take some sort of snapshot of the
whole situation for this particular case.




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03  9:08   ` Mel Gorman
@ 2010-09-03 22:55     ` Andrew Morton
  -1 siblings, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2010-09-03 22:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Fri,  3 Sep 2010 10:08:45 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> From: Christoph Lameter <cl@linux.com>
> 
> Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
> it is cheaper than scanning a number of lists. To avoid synchronization
> overhead, counter deltas are maintained on a per-cpu basis and drained both
> periodically and when the delta is above a threshold. On large CPU systems,
> the difference between the estimated and real value of NR_FREE_PAGES can
> be very high. If NR_FREE_PAGES is much higher than the number of pages
> actually free on the buddy lists, the VM can allocate pages below the min
> watermark, at worst reducing
> the real number of pages to zero. Even if the OOM killer kills some victim
> for freeing memory, it may not free memory if the exit path requires a new
> page resulting in livelock.
> 
> This patch introduces a zone_page_state_snapshot() function (courtesy of
> Christoph) that takes a slightly more accurate reading of an arbitrary
> vmstat counter.
> It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark
> being accidentally broken.  The estimate is not perfect and may result
> in cache line bounces but is expected to be lighter than the IPI calls
> necessary to continually drain the per-cpu counters while kswapd is awake.
> 

The "is kswapd awake" heuristic seems fairly hacky.  Can it be
improved, made more deterministic?  Exactly what state are we looking
for here?


> +/*
> + * More accurate version that also considers the currently pending
> + * deltas. For that we need to loop over all cpus to find the current
> + * deltas. There is no synchronization so the result cannot be
> + * exactly accurate either.
> + */
> +static inline unsigned long zone_page_state_snapshot(struct zone *zone,
> +					enum zone_stat_item item)
> +{
> +	long x = atomic_long_read(&zone->vm_stat[item]);
> +
> +#ifdef CONFIG_SMP
> +	int cpu;
> +	for_each_online_cpu(cpu)
> +		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
> +
> +	if (x < 0)
> +		x = 0;
> +#endif
> +	return x;
> +}

aka percpu_counter_sum()!

Can someone remind me why per_cpu_pageset went and reimplemented
percpu_counters rather than just using them?

>  extern unsigned long global_reclaimable_pages(void);
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index f5b7d17..e35bfb8 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -87,3 +87,24 @@ int memmap_valid_within(unsigned long pfn,
>  	return 1;
>  }
>  #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> +
> +#ifdef CONFIG_SMP
> +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> +unsigned long zone_nr_free_pages(struct zone *zone)
> +{
> +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +
> +	/*
> +	 * While kswapd is awake, the zone is considered to be under some
> +	 * memory pressure. Under pressure, there is a risk that
> +	 * per-cpu-counter-drift will allow the min watermark to be breached
> +	 * potentially causing a live-lock. While kswapd is awake and
> +	 * free pages are low, get a better estimate for free pages
> +	 */
> +	if (nr_free_pages < zone->percpu_drift_mark &&
> +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> +		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +
> +	return nr_free_pages;
> +}

Is this really the best way of doing it?  The way we usually solve
this problem (and boy, was this bug a newbie mistake!) is:

	foo = percpu_counter_read(x);

	if (foo says something bad) {
		/* Bad stuff: let's get a more accurate foo */
		foo = percpu_counter_sum(x);
	}

	if (foo still says something bad)
		do_bad_thing();

In other words, don't do all this stuff with percpu_drift_mark and the
kswapd heuristic.  Just change zone_watermark_ok() to use the more
accurate read if it's about to return "no".
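
As a minimal sketch of that pattern applied to the allocator (the
__zone_watermark_ok() helper taking an explicit free_pages argument is an
invention for illustration; only zone_watermark_ok() exists in this tree):

	int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
			      int classzone_idx, int alloc_flags)
	{
		long free_pages = zone_page_state(z, NR_FREE_PAGES);

		if (__zone_watermark_ok(z, order, mark, classzone_idx,
					alloc_flags, free_pages))
			return 1;

		/* cheap read said "no": pay for an accurate one before failing */
		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
		return __zone_watermark_ok(z, order, mark, classzone_idx,
					   alloc_flags, free_pages);
	}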


^ permalink raw reply	[flat|nested] 99+ messages in thread


* [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03  9:08 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4 Mel Gorman
@ 2010-09-03  9:08   ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-09-03  9:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

From: Christoph Lameter <cl@linux.com>

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can
be very high. If NR_FREE_PAGES is much higher than the number of pages
actually free on the buddy lists, the VM can allocate pages below the min
watermark, at worst reducing
the real number of pages to zero. Even if the OOM killer kills some victim
for freeing memory, it may not free memory if the exit path requires a new
page resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate reading of an arbitrary
vmstat counter.
It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark
being accidentally broken.  The estimate is not perfect and may result
in cache line bounces but is expected to be lighter than the IPI calls
necessary to continually drain the per-cpu counters while kswapd is awake.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |   13 +++++++++++++
 include/linux/vmstat.h |   22 ++++++++++++++++++++++
 mm/mmzone.c            |   21 +++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |   15 ++++++++++++++-
 5 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..3984c4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -441,6 +448,12 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 7f43ccd..eaaea37 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -170,6 +170,28 @@ static inline unsigned long zone_page_state(struct zone *zone,
 	return x;
 }
 
+/*
+ * More accurate version that also considers the currently pending
+ * deltas. For that we need to loop over all cpus to find the current
+ * deltas. There is no synchronization so the result cannot be
+ * exactly accurate either.
+ */
+static inline unsigned long zone_page_state_snapshot(struct zone *zone,
+					enum zone_stat_item item)
+{
+	long x = atomic_long_read(&zone->vm_stat[item]);
+
+#ifdef CONFIG_SMP
+	int cpu;
+	for_each_online_cpu(cpu)
+		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
+
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
 extern unsigned long global_reclaimable_pages(void);
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..e35bfb8 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,24 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, the zone is considered to be under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * per-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
+		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+	return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97d74a0..bbaa959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages may go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2424,7 +2424,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..696cab2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -138,11 +138,24 @@ static void refresh_zone_stat_thresholds(void)
 	int threshold;
 
 	for_each_populated_zone(zone) {
+		unsigned long max_drift, tolerate_drift;
+
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		/*
+		 * Only set percpu_drift_mark if there is a danger that
+		 * NR_FREE_PAGES reports the low watermark is ok when in fact
+		 * the min watermark could be breached by an allocation
+		 */
+		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+		max_drift = num_online_cpus() * threshold;
+		if (max_drift > tolerate_drift)
+			zone->percpu_drift_mark = high_wmark_pages(zone) +
+					max_drift;
 	}
 }
 
@@ -813,7 +826,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-23 16:04             ` Christoph Lameter
@ 2010-08-23 16:13               ` Mel Gorman
  -1 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-23 16:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 23, 2010 at 11:04:38AM -0500, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Mel Gorman wrote:
> 
> > > When the vm gets into a state where continual reclaim is necessary then
> > > the counters are not that frequently updated. If the machine is already
> > > slowing down due to reclaim then the vm can likely afford more frequent
> > > counter updates.
> > >
> >
> > Ok, but is that better than this patch? Decreasing the size of the window by
> > reducing the threshold still leaves a window. There is still a small amount
> > of drift by summing up all the deltas but you get a much more accurate count
> > at the point of time it was important to know.
> 
> In order to make that decision we would need to know what deltas make a
> significant difference.

A delta on NR_FREE_PAGES is the obvious problem. The page allocation
failure report I saw clearly stated that free was a value above the min
watermark, whereas the buddy lists just as clearly showed that the number
of pages on the lists was 0.

> It would also be important to know whether any
> other counters have issues.

I am not aware of similar issues with another counter where drift causes
the system to make the wrong decision, are you?

> If so, then reducing the thresholds would
> address the problem in a number of counters.
> 
> I have no objection to this approach here, but it may just be a band-aid
> on a larger issue that could be approached in a cleaner way.
> 

Unfortunately, I do not have access to a machine large enough to investigate
this area. All I have to go on is a few bug reports showing the delta
problem with NR_FREE_PAGES and test results in a patch functionally similar
to this patch showing that the livelock problem went away.

At best all we can do is keep an eye out for problems on large machines
that could be explained by counter drift. If such a bug is found by a
reporter with regular access to the machine for test kernels, we can
investigate whether reducing the thresholds fixes the problem without
affecting general performance.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 99+ messages in thread


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-23 13:55           ` Mel Gorman
@ 2010-08-23 16:04             ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-08-23 16:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, 23 Aug 2010, Mel Gorman wrote:

> > When the vm gets into a state where continual reclaim is necessary then
> > the counters are not that frequently updated. If the machine is already
> > slowing down due to reclaim then the vm can likely afford more frequent
> > counter updates.
> >
>
> Ok, but is that better than this patch? Decreasing the size of the window by
> reducing the threshold still leaves a window. There is still a small amount
> of drift by summing up all the deltas but you get a much more accurate count
> at the point of time it was important to know.

In order to make that decision we would need to know what deltas make a
significant difference. It would also be important to know whether any
other counters have issues. If so, then reducing the thresholds would
address the problem in a number of counters.

I have no objection to this approach here, but it may just be a band-aid
on a larger issue that could be approached in a cleaner way.


^ permalink raw reply	[flat|nested] 99+ messages in thread


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-23 13:41         ` Christoph Lameter
@ 2010-08-23 13:55           ` Mel Gorman
  -1 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-23 13:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 23, 2010 at 08:41:56AM -0500, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Mel Gorman wrote:
> 
> > > The delta of the counters could also be reduced to increase accuracy.
> > > See refresh_zone_stat_thresholds().
> > True, but I thought that would introduce a constant performance penalty
> > for a corner case which I didn't like.
> 
> Sure, an increased frequency of updates would increase the chance of
> bouncing cachelines. But the bouncing cacheline scenario for the vm
> counters was tuned for applications that continually allocate pages in
> parallel.
> 
> When the vm gets into a state where continual reclaim is necessary then
> the counters are not that frequently updated. If the machine is already
> slowing down due to reclaim then the vm can likely afford more frequent
> counter updates.
> 

Ok, but is that better than this patch? Decreasing the size of the window by
reducing the threshold still leaves a window. There is still a small amount
of drift even when summing all the deltas, but you get a much more accurate
count at the point in time when it is important to know.
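
To put rough numbers on that window (an illustration only, assuming the
125-page threshold cap from calculate_threshold() and 4K pages):

	max drift = num_online_cpus * stat_threshold
	          = 64 cpus * 125 pages = 8000 pages  (~31MB)
	at half the threshold: 64 * 62 = 3968 pages   (~15MB)

So shrinking the threshold narrows the window but never closes it.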

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 99+ messages in thread


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-23 13:03       ` Mel Gorman
@ 2010-08-23 13:41         ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-08-23 13:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, 23 Aug 2010, Mel Gorman wrote:

> > The delta of the counters could also be reduced to increase accuracy.
> > See refresh_zone_stat_thresholds().
> True, but I thought that would introduce a constant performance penalty
> for a corner case which I didn't like.

Sure, an increased frequency of updates would increase the chance of
bouncing cachelines. But the bouncing cacheline scenario for the vm
counters was tuned for applications that continually allocate pages in
parallel.

When the vm gets into a state where continual reclaim is necessary then
the counters are not that frequently updated. If the machine is already
slowing down due to reclaim then the vm can likely afford more frequent
counter updates.

^ permalink raw reply	[flat|nested] 99+ messages in thread


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-23 12:56     ` Christoph Lameter
@ 2010-08-23 13:03       ` Mel Gorman
  -1 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-23 13:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 23, 2010 at 07:56:40AM -0500, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Mel Gorman wrote:
> 
> > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
> > and may result in cache line bounces but is expected to be lighter than the
> > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > is awake.
> 
> The delta of the counters could also be reduced to increase accuracy.
> See refresh_zone_stat_thresholds().
> 

True, but I thought that would introduce a constant performance penalty
for a corner case which I didn't like.

> Also would it be possible to add the summation function to vmstat? It may
> be useful elsewhere.
> 
> A new function like
> 
> 	zone_page_state_snapshot()
> 
> or so?
> 

We could if there is another counter that results in bad system
behaviour due to counter drift. As NR_FREE_PAGES seemed to be the only
one, zone_nr_free_pages() seemed adequate. If such a helper did exist,
zone_nr_free_pages() would be a simple wrapper around it. The
indirection didn't seem necessary at this point though.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 99+ messages in thread


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-23  8:00   ` Mel Gorman
@ 2010-08-23 12:56     ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2010-08-23 12:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, 23 Aug 2010, Mel Gorman wrote:

> This patch introduces zone_nr_free_pages() to take a slightly more accurate
> estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
> and may result in cache line bounces but is expected to be lighter than the
> IPI calls necessary to continually drain the per-cpu counters while kswapd
> is awake.

The delta of the counters could also be reduced to increase accuracy.
See refresh_zone_stat_thresholds().

Also would it be possible to add the summation function to vmstat? It may
be useful elsewhere.

A new function like

	zone_page_state_snapshot()

or so?
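
A sketch only of the threshold-reduction idea (the function name and the
/8 factor are invented; the assignment loop mirrors
refresh_zone_stat_thresholds() in mm/vmstat.c):

	/*
	 * Hypothetical: tighten a zone's per-cpu deltas while it is under
	 * pressure so NR_FREE_PAGES drifts less, e.g. called on kswapd
	 * wakeup and undone when kswapd goes back to sleep.
	 */
	static void reduce_zone_stat_threshold(struct zone *zone)
	{
		int cpu;
		int threshold = max(1, calculate_threshold(zone) / 8);

		for_each_online_cpu(cpu)
			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
							= threshold;
	}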



^ permalink raw reply	[flat|nested] 99+ messages in thread


* [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-23  8:00 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2 Mel Gorman
@ 2010-08-23  8:00   ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-23  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can be
very high.  If NR_FREE_PAGES is much higher than the number of pages
actually free on the buddy lists, the VM can allocate pages below the min
watermark, at worst reducing
the real number of pages to zero.  Even if the OOM killer kills some victim
for freeing memory, it may not free memory if the exit path requires a new
page resulting in livelock.

This patch introduces zone_nr_free_pages() to take a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |   13 +++++++++++++
 mm/mmzone.c            |   29 +++++++++++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |   15 ++++++++++++++-
 4 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..3984c4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -441,6 +448,12 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..69ecbe9 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,32 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, the zone is considered to be under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * per-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset;
+
+			pset = per_cpu_ptr(zone->pageset, cpu);
+			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
+		}
+	}
+
+	return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97d74a0..bbaa959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages may go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2424,7 +2424,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..696cab2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -138,11 +138,24 @@ static void refresh_zone_stat_thresholds(void)
 	int threshold;
 
 	for_each_populated_zone(zone) {
+		unsigned long max_drift, tolerate_drift;
+
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		/*
+		 * Only set percpu_drift_mark if there is a danger that
+		 * NR_FREE_PAGES reports the low watermark is ok when in fact
+		 * the min watermark could be breached by an allocation
+		 */
+		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+		max_drift = num_online_cpus() * threshold;
+		if (max_drift > tolerate_drift)
+			zone->percpu_drift_mark = high_wmark_pages(zone) +
+					max_drift;
 	}
 }
 
@@ -813,7 +826,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
@ 2010-08-23  8:00   ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-23  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can be
very high.  If NR_FREE_PAGES is much higher than number of real free page
in buddy, the VM can allocate pages below min watermark, at worst reducing
the real number of pages to zero.  Even if the OOM killer kills some victim
for freeing memory, it may not free memory if the exit path requires a new
page resulting in livelock.

This patch introduces zone_nr_free_pages() to take a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |   13 +++++++++++++
 mm/mmzone.c            |   29 +++++++++++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |   15 ++++++++++++++-
 4 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..3984c4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -441,6 +448,12 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..69ecbe9 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,32 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, the zone is considered to be under memory
+	 * pressure. Under pressure, there is a risk that per-cpu counter
+	 * drift will allow the min watermark to be breached, potentially
+	 * causing a livelock. While kswapd is awake and free pages are low,
+	 * get a better estimate of the number of free pages.
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset;
+
+			pset = per_cpu_ptr(zone->pageset, cpu);
+			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
+		}
+	}
+
+	return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97d74a0..bbaa959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages may go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2424,7 +2424,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..696cab2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -138,11 +138,24 @@ static void refresh_zone_stat_thresholds(void)
 	int threshold;
 
 	for_each_populated_zone(zone) {
+		unsigned long max_drift, tolerate_drift;
+
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		/*
+		 * Only set percpu_drift_mark if there is a danger that
+		 * NR_FREE_PAGES reports the low watermark is ok when in fact
+		 * the min watermark could be breached by an allocation
+		 */
+		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+		max_drift = num_online_cpus() * threshold;
+		if (max_drift > tolerate_drift)
+			zone->percpu_drift_mark = high_wmark_pages(zone) +
+					max_drift;
 	}
 }
 
@@ -813,7 +826,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1
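
For a sense of the scale of the drift (illustrative numbers, not measurements
from the report): with 64 online CPUs and a stat_threshold of 125,
NR_FREE_PAGES can be stale by up to 64 * 125 = 8000 pages. A minimal
user-space sketch (assumed numbers throughout, not kernel code) of how the
per-cpu deltas produce that staleness:

/*
 * Each "CPU" batches a local delta and only folds it into the global
 * counter once the delta exceeds a threshold, mirroring how the vmstat
 * per-cpu deltas behave.
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS		64	/* assumed */
#define THRESHOLD	125	/* assumed stat_threshold */

static long global_nr_free = 8000;	/* vmstat's (stale) view */
static long pcp_delta[NR_CPUS];		/* deltas not yet folded in */

static void mod_free_pages(int cpu, long delta)
{
	pcp_delta[cpu] += delta;
	if (labs(pcp_delta[cpu]) > THRESHOLD) {
		global_nr_free += pcp_delta[cpu];
		pcp_delta[cpu] = 0;
	}
}

int main(void)
{
	long real_free = 8000;
	int cpu;

	/* every CPU allocates exactly THRESHOLD pages: no delta is ever
	 * folded back, so the global counter does not move at all */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		mod_free_pages(cpu, -THRESHOLD);
		real_free -= THRESHOLD;
	}

	/* prints 8000 versus 0 */
	printf("vmstat NR_FREE_PAGES %ld, real free pages %ld\n",
	       global_nr_free, real_free);
	return 0;
}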


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 19:00         ` Christoph Lameter
@ 2010-08-19 23:49           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-19 23:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KOSAKI Motohiro

On Thu, 19 Aug 2010 14:00:44 -0500 (CDT)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Thu, 19 Aug 2010, KAMEZAWA Hiroyuki wrote:
> 
> > > > This function is now called only at CPU_DEAD. IOW, not called at CPU_UP_PREPARE
> > >
> > > calculate_threshold() does its calculation based on the number of online
> > > cpus. Therefore the threshold may change if a cpu is brought down.
> > >
> > Yes, but why not calculate it when bringing a CPU up?
> 
> True. Seems to have gone missing somehow.
> 
ok, thank you for checking. I'll prepare a patch.
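
A minimal sketch of what such a hook could look like (hypothetical; the
eventual patch may well differ), refreshing the thresholds once the new
CPU is included in num_online_cpus():

/* hypothetical addition to mm/vmstat.c, below refresh_zone_stat_thresholds() */
#include <linux/cpu.h>
#include <linux/notifier.h>

static int __cpuinit vmstat_cpuup_callback(struct notifier_block *nfb,
					   unsigned long action, void *hcpu)
{
	switch (action) {
	case CPU_ONLINE:
	case CPU_ONLINE_FROZEN:
		/* num_online_cpus() has grown, so both the per-cpu
		 * stat_threshold and percpu_drift_mark are stale */
		refresh_zone_stat_thresholds();
		break;
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		refresh_zone_stat_thresholds();
		break;
	default:
		break;
	}
	return NOTIFY_OK;
}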

-Kame


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19  0:07       ` KAMEZAWA Hiroyuki
@ 2010-08-19 19:00         ` Christoph Lameter
  2010-08-19 23:49           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2010-08-19 19:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KOSAKI Motohiro

On Thu, 19 Aug 2010, KAMEZAWA Hiroyuki wrote:

> > > This function is now called only at CPU_DEAD. IOW, not called at CPU_UP_PREPARE
> >
> > calculate_threshold() does its calculation based on the number of online
> > cpus. Therefore the threshold may change if a cpu is brought down.
> >
> Yes, but why not calculate it when bringing a CPU up?

True. Seems to have gone missing somehow.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 16:06       ` Mel Gorman
@ 2010-08-19 16:45         ` Minchan Kim
  0 siblings, 0 replies; 99+ messages in thread
From: Minchan Kim @ 2010-08-19 16:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 05:06:12PM +0100, Mel Gorman wrote:
> On Fri, Aug 20, 2010 at 12:46:38AM +0900, Minchan Kim wrote:
> > Mel, could you consider a normal (or small) system that has at least two cores?
> 
> I did consider it but I was not keen on the idea of small systems behaving
> very differently to large systems in this regard. I thought there was a
> danger that a real problem would be hidden by such a move.
> 
> > I mean, we could apply your rule according to the number of CPUs and the
> > RAM size (i.e. the threshold value).
> > Mobile systems are beginning to ship with two cores and more than 1G of RAM.
> > In such a case, the threshold is 8.
> > 
> > Livelock is unlikely to happen there.
> > Is it worth having such overhead on such a system?
> > What do you think?
> > 
> 
> Such overhead could be avoided if we made a check like the following in
> refresh_zone_stat_thresholds()
> 
>                 /*
>                  * Only set percpu_drift_mark if there is a danger that
>                  * NR_FREE_PAGES reports the low watermark is ok when in fact
>                  * the min watermark could be breached by an allocation
>                  */
>                 tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
>                 max_drift = num_online_cpus() * threshold;
>                 if (max_drift > tolerate_drift)
>                         zone->percpu_drift_mark = high_wmark_pages(zone)
> 					+ max_drift;
> 
> Would this be preferable?

Yes. It looks good to me. 

> 
> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 15:46     ` Minchan Kim
@ 2010-08-19 16:06       ` Mel Gorman
  2010-08-19 16:45         ` Minchan Kim
  0 siblings, 1 reply; 99+ messages in thread
From: Mel Gorman @ 2010-08-19 16:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, Aug 20, 2010 at 12:46:38AM +0900, Minchan Kim wrote:
> On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> > On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
> > > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > > it is cheaper than scanning a number of lists. To avoid synchronization
> > > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > > periodically and when the delta is above a threshold. On large CPU systems,
> > > the difference between the estimated and real value of NR_FREE_PAGES can be
> > > very high. If the system is under both load and low memory, it's possible
> > > for watermarks to be breached. In extreme cases, the number of free pages
> > > can drop to 0 leading to the possibility of system livelock.
> > > 
> > > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > > estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> > > and may result in cache line bounces but is expected to be lighter than the
> > > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > > is awake.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > And the second I sent this, I realised I had sent a slightly old version
> > that missed a compile-fix :(
> > 
> > ==== CUT HERE ====
> > mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
> > 
> > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > it is cheaper than scanning a number of lists. To avoid synchronization
> > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > periodically and when the delta is above a threshold. On large CPU systems,
> > the difference between the estimated and real value of NR_FREE_PAGES can be
> > very high. If the system is under both load and low memory, it's possible
> > for watermarks to be breached. In extreme cases, the number of free pages
> > can drop to 0 leading to the possibility of system livelock.
> 
> Mel, could you consider a normal (or small) system that has at least two cores?

I did consider it but I was not keen on the idea of small systems behaving
very differently to large systems in this regard. I thought there was a
danger that a real problem would be hidden by such a move.

> I mean, we could apply your rule according to the number of CPUs and the
> RAM size (i.e. the threshold value).
> Mobile systems are beginning to ship with two cores and more than 1G of RAM.
> In such a case, the threshold is 8.
> 
> Livelock is unlikely to happen there.
> Is it worth having such overhead on such a system?
> What do you think?
> 

Such overhead could be avoided if we made a check like the following in
refresh_zone_stat_thresholds()

                /*
                 * Only set percpu_drift_mark if there is a danger that
                 * NR_FREE_PAGES reports the low watermark is ok when in fact
                 * the min watermark could be breached by an allocation
                 */
                tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
                max_drift = num_online_cpus() * threshold;
                if (max_drift > tolerate_drift)
                        zone->percpu_drift_mark = high_wmark_pages(zone)
					+ max_drift;

Would this be preferable?
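
To see why this skips small systems, a worked example with assumed numbers
(min watermark 1000 pages, low watermark 1250, two CPUs, threshold 8; not
taken from a real machine):

#include <stdio.h>

int main(void)
{
	unsigned long min_wmark = 1000, low_wmark = 1250;	/* assumed */
	unsigned long online_cpus = 2, threshold = 8;		/* assumed */

	unsigned long tolerate_drift = low_wmark - min_wmark;	/* 250 */
	unsigned long max_drift = online_cpus * threshold;	/*  16 */

	/* prints "no": worst-case drift cannot turn a "low watermark ok"
	 * reading into a real breach of the min watermark */
	printf("set percpu_drift_mark? %s\n",
	       max_drift > tolerate_drift ? "yes" : "no");
	return 0;
}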

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-16  9:43   ` Mel Gorman
  2010-08-16 14:47     ` Rik van Riel
  2010-08-16 16:06     ` Johannes Weiner
@ 2010-08-19 15:46     ` Minchan Kim
  2010-08-19 16:06       ` Mel Gorman
  2 siblings, 1 reply; 99+ messages in thread
From: Minchan Kim @ 2010-08-19 15:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
> > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > it is cheaper than scanning a number of lists. To avoid synchronization
> > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > periodically and when the delta is above a threshold. On large CPU systems,
> > the difference between the estimated and real value of NR_FREE_PAGES can be
> > very high. If the system is under both load and low memory, it's possible
> > for watermarks to be breached. In extreme cases, the number of free pages
> > can drop to 0 leading to the possibility of system livelock.
> > 
> > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> > and may result in cache line bounces but is expected to be lighter than the
> > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > is awake.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> And the second I sent this, I realised I had sent a slightly old version
> that missed a compile-fix :(
> 
> ==== CUT HERE ====
> mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
> 
> Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> it is cheaper than scanning a number of lists. To avoid synchronization
> overhead, counter deltas are maintained on a per-cpu basis and drained both
> periodically and when the delta is above a threshold. On large CPU systems,
> the difference between the estimated and real value of NR_FREE_PAGES can be
> very high. If the system is under both load and low memory, it's possible
> for watermarks to be breached. In extreme cases, the number of free pages
> can drop to 0 leading to the possibility of system livelock.

Mel, could you consider a normal (or small) system that has at least two cores?
I mean, we could apply your rule according to the number of CPUs and the
RAM size (i.e. the threshold value).
Mobile systems are beginning to ship with two cores and more than 1G of RAM.
In such a case, the threshold is 8.

Livelock is unlikely to happen there.
Is it worth having such overhead on such a system?
What do you think?

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 15:40                               ` Mel Gorman
@ 2010-08-19 15:44                                 ` Minchan Kim
  0 siblings, 0 replies; 99+ messages in thread
From: Minchan Kim @ 2010-08-19 15:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 04:40:33PM +0100, Mel Gorman wrote:
 
> The patch leader now reads as
> 
> Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
> cheaper than scanning a number of lists. To avoid synchronization overhead,
> counter deltas are maintained on a per-cpu basis and drained both periodically
> and when the delta is above a threshold. On large CPU systems, the difference
> between the estimated and real value of NR_FREE_PAGES can be very high.
> If NR_FREE_PAGES is much higher than the number of real free pages in buddy,
> the VM can allocate pages below the min watermark, at worst reducing the real
> number of free pages to zero. Even if the OOM killer kills some victim to free
> memory, it may not free any if the exit path requires a new page, resulting
> in livelock.
> 
> This patch introduces zone_nr_free_pages() to take a slightly more accurate
> estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
> and may result in cache line bounces but is expected to be lighter than the
> IPI calls necessary to continually drain the per-cpu counters while kswapd
> is awake.
> 
> Is that better?

Good!

> 
> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 15:22                             ` Minchan Kim
@ 2010-08-19 15:40                               ` Mel Gorman
  2010-08-19 15:44                                 ` Minchan Kim
  0 siblings, 1 reply; 99+ messages in thread
From: Mel Gorman @ 2010-08-19 15:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, Aug 20, 2010 at 12:22:33AM +0900, Minchan Kim wrote:
> On Thu, Aug 19, 2010 at 04:07:39PM +0100, Mel Gorman wrote:
> > On Thu, Aug 19, 2010 at 11:34:39PM +0900, Minchan Kim wrote:
> > > On Thu, Aug 19, 2010 at 03:09:46PM +0100, Mel Gorman wrote:
> > > > On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote:
> > > > > On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote:
> > > > > > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote:
> > > > > > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > > > > > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
> > > > > > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > > > > > > >> > > What's a window low and min wmark? Maybe I can miss your point.
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > The window is due to the fact kswapd is not awake yet. The window is because
> > > > > > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> > > > > > > >> > system is really somewhere between the low and min watermark but we are not
> > > > > > > >> > taking the accurate measure until kswapd gets woken up. The first allocation
> > > > > > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or
> > > > > > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
> > > > > > > >> > any drift) wakes kswapd and other callers then take an accurate count hence
> > > > > > > >> > "we could breach the watermark but I'm expecting it can only happen for at
> > > > > > > >> > worst one allocation".
> > > > > > > >>
> > > > > > > >> Right. I misunderstood your word.
> > > > > > > >> One more question.
> > > > > > > >>
> > > > > > > >> Could you explain live lock scenario?
> > > > > > > >>
> > > > > > > >
> > > > > > > > Let's say
> > > > > > > >
> > > > > > > > NR_FREE_PAGES     = 256
> > > > > > > > Actual free pages = 8
> > > > > > > >
> > > > > > > > The PCP lists get refilled in batch, taking all 8 pages. Now there are
> > > > > > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> > > > > > > > clean something but all the pages are on a network-backed filesystem. To
> > > > > > > > clean them, it must transmit on the network so it tries to allocate some
> > > > > > > > buffers.
> > > > > > > >
> > > > > > > > The livelock is that to free some memory, an allocation must succeed but
> > > > > > > > for an allocation to succeed, some memory must be freed. The system
> > > > > > > 
> > > > > > > Yes. I understood this as livelock but at last VM will kill victim
> > > > > > > process then it can allocate free pages.
> > > > > > 
> > > > > > And if the exit path for the OOM kill needs to allocate a page what
> > > > > > should it do?
> > > > > 
> > > > > Yeah. It might be a livelock.
> > > > > Then, let's rethink the problem.
> > > > >
> > > > > The problem is as follows.
> > > > >
> > > > > 1. Process A tries to allocate a page
> > > > > 2. The VM tries to reclaim pages for process A
> > > > > 3. The VM reclaims some pages but they remain on the PCP lists, so they can't be allocated to A
> > > > > 4. The VM tries to kill process B
> > > > > 5. The exit path needs new pages to exit process B
> > > > > 6. Livelock happens (I am not sure, but at least we need a warning if it really happens)
> > > > > 
> > > > 
> > > > The problem this patch is concerned with is about the vmstat counters, not
> > > > the pages on the per-cpu lists. The issue being dealt with is that the page
> > > > allocator can grant a page that takes the zone below the min watermark
> > > > because NR_FREE_PAGES can be inaccurate. The patch aims to fix that by
> > > > taking greater care with NR_FREE_PAGES when memory is low.
> > > 
> > > Your goal is to protect the _min_ pages which are reserved, right?
> > > I thought your final goal was to prevent the livelock problem.
> > > Hmm.. Sorry for the noise. :(
> > > 
> > 
> > Emm, it's the same thing. If the min watermark is not properly
> > preserved, the system is in danger of being live-locked.
> 
> Totally right. 
> Maybe I am sleeping.
> 
> Let's add the following as a comment about livelock.
> 

Sure!

> "If NR_FREE_PAGES is much higher than the number of real free pages in buddy,
> the VM can allocate pages below the min watermark (at worst, buddy is zero).
> Although the VM kills some victim to free memory, it can't do so if the
> exit path requires a new page while buddy has zero pages. It can result in
> livelock."
> 

Thanks

> At least, it will help a fool like me not to bother you again in the future.
> 

The patch leader now reads as

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists. To avoid synchronization overhead,
counter deltas are maintained on a per-cpu basis and drained both periodically
and when the delta is above a threshold. On large CPU systems, the difference
between the estimated and real value of NR_FREE_PAGES can be very high.
If NR_FREE_PAGES is much higher than the number of real free pages in buddy,
the VM can allocate pages below the min watermark, at worst reducing the real
number of free pages to zero. Even if the OOM killer kills some victim to free
memory, it may not free any if the exit path requires a new page, resulting
in livelock.

This patch introduces zone_nr_free_pages() to take a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Is that better?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 15:07                           ` Mel Gorman
@ 2010-08-19 15:22                             ` Minchan Kim
  2010-08-19 15:40                               ` Mel Gorman
  0 siblings, 1 reply; 99+ messages in thread
From: Minchan Kim @ 2010-08-19 15:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 04:07:39PM +0100, Mel Gorman wrote:
> On Thu, Aug 19, 2010 at 11:34:39PM +0900, Minchan Kim wrote:
> > On Thu, Aug 19, 2010 at 03:09:46PM +0100, Mel Gorman wrote:
> > > On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote:
> > > > On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote:
> > > > > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote:
> > > > > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > > > > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
> > > > > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > > > > > >> > > What's a window low and min wmark? Maybe I can miss your point.
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > The window is due to the fact kswapd is not awake yet. The window is because
> > > > > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> > > > > > >> > system is really somewhere between the low and min watermark but we are not
> > > > > > >> > taking the accurate measure until kswapd gets woken up. The first allocation
> > > > > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or
> > > > > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
> > > > > > >> > any drift) wakes kswapd and other callers then take an accurate count hence
> > > > > > >> > "we could breach the watermark but I'm expecting it can only happen for at
> > > > > > >> > worst one allocation".
> > > > > > >>
> > > > > > >> Right. I misunderstood your word.
> > > > > > >> One more question.
> > > > > > >>
> > > > > > >> Could you explain live lock scenario?
> > > > > > >>
> > > > > > >
> > > > > > > Let's say
> > > > > > >
> > > > > > > NR_FREE_PAGES     = 256
> > > > > > > Actual free pages = 8
> > > > > > >
> > > > > > > The PCP lists get refilled in batch, taking all 8 pages. Now there are
> > > > > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> > > > > > > clean something but all the pages are on a network-backed filesystem. To
> > > > > > > clean them, it must transmit on the network so it tries to allocate some
> > > > > > > buffers.
> > > > > > >
> > > > > > > The livelock is that to free some memory, an allocation must succeed but
> > > > > > > for an allocation to succeed, some memory must be freed. The system
> > > > > > 
> > > > > > Yes. I understood this as livelock but at last VM will kill victim
> > > > > > process then it can allocate free pages.
> > > > > 
> > > > > And if the exit path for the OOM kill needs to allocate a page what
> > > > > should it do?
> > > > 
> > > > Yeah. It might be a livelock.
> > > > Then, let's rethink the problem.
> > > >
> > > > The problem is as follows.
> > > >
> > > > 1. Process A tries to allocate a page
> > > > 2. The VM tries to reclaim pages for process A
> > > > 3. The VM reclaims some pages but they remain on the PCP lists, so they can't be allocated to A
> > > > 4. The VM tries to kill process B
> > > > 5. The exit path needs new pages to exit process B
> > > > 6. Livelock happens (I am not sure, but at least we need a warning if it really happens)
> > > > 
> > > 
> > > The problem this patch is concerned with is about the vmstat counters, not
> > > the pages on the per-cpu lists. The issue being dealt with is that the page
> > > allocator can grant a page that takes the zone below the min watermark
> > > because NR_FREE_PAGES can be inaccurate. The patch aims to fix that by
> > > taking greater care with NR_FREE_PAGES when memory is low.
> > 
> > Your goal is to protect the _min_ pages which are reserved, right?
> > I thought your final goal was to prevent the livelock problem.
> > Hmm.. Sorry for the noise. :(
> > 
> 
> Emm, it's the same thing. If the min watermark is not properly
> preserved, the system is in danger of being live-locked.

Totally right. 
Maybe I am sleeping.

Let's add the following as a comment about livelock.

"If NR_FREE_PAGES is much higher than the number of real free pages in buddy,
the VM can allocate pages below the min watermark (at worst, buddy is zero).
Although the VM kills some victim to free memory, it can't do so if the
exit path requires a new page while buddy has zero pages. It can result in
livelock."

At least, it will help a fool like me not to bother you again in the future.

Thanks, Mel. 


> 
> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 14:34                         ` Minchan Kim
@ 2010-08-19 15:07                           ` Mel Gorman
  2010-08-19 15:22                             ` Minchan Kim
  0 siblings, 1 reply; 99+ messages in thread
From: Mel Gorman @ 2010-08-19 15:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 11:34:39PM +0900, Minchan Kim wrote:
> On Thu, Aug 19, 2010 at 03:09:46PM +0100, Mel Gorman wrote:
> > On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote:
> > > On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote:
> > > > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote:
> > > > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > > > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
> > > > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > > > > >> > > What's a window low and min wmark? Maybe I can miss your point.
> > > > > >> > >
> > > > > >> >
> > > > > >> > The window is due to the fact kswapd is not awake yet. The window is because
> > > > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> > > > > >> > system is really somewhere between the low and min watermark but we are not
> > > > > >> > taking the accurate measure until kswapd gets woken up. The first allocation
> > > > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or
> > > > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
> > > > > >> > any drift) wakes kswapd and other callers then take an accurate count hence
> > > > > >> > "we could breach the watermark but I'm expecting it can only happen for at
> > > > > >> > worst one allocation".
> > > > > >>
> > > > > >> Right. I misunderstood your word.
> > > > > >> One more question.
> > > > > >>
> > > > > >> Could you explain live lock scenario?
> > > > > >>
> > > > > >
> > > > > > Let's say
> > > > > >
> > > > > > NR_FREE_PAGES     = 256
> > > > > > Actual free pages = 8
> > > > > >
> > > > > > The PCP lists get refilled in batch, taking all 8 pages. Now there are
> > > > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> > > > > > clean something but all the pages are on a network-backed filesystem. To
> > > > > > clean them, it must transmit on the network so it tries to allocate some
> > > > > > buffers.
> > > > > >
> > > > > > The livelock is that to free some memory, an allocation must succeed but
> > > > > > for an allocation to succeed, some memory must be freed. The system
> > > > > 
> > > > > Yes. I understood this as livelock but at last VM will kill victim
> > > > > process then it can allocate free pages.
> > > > 
> > > > And if the exit path for the OOM kill needs to allocate a page what
> > > > should it do?
> > > 
> > > Yeah. It might be a livelock.
> > > Then, let's rethink the problem.
> > >
> > > The problem is as follows.
> > >
> > > 1. Process A tries to allocate a page
> > > 2. The VM tries to reclaim pages for process A
> > > 3. The VM reclaims some pages but they remain on the PCP lists, so they can't be allocated to A
> > > 4. The VM tries to kill process B
> > > 5. The exit path needs new pages to exit process B
> > > 6. Livelock happens (I am not sure, but at least we need a warning if it really happens)
> > > 
> > 
> > The problem this patch is concerned with is about the vmstat counters, not
> > the pages on the per-cpu lists. The issue being dealt with is that the page
> > allocator can grant a page that takes the zone below the min watermark
> > because NR_FREE_PAGES can be inaccurate. The patch aims to fix that by
> > taking greater care with NR_FREE_PAGES when memory is low.
> 
> Your goal is to protect the _min_ pages which are reserved, right?
> I thought your final goal was to prevent the livelock problem.
> Hmm.. Sorry for the noise. :(
> 

Emm, it's the same thing. If the min watermark is not properly
preserved, the system is in danger of being live-locked.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 14:09                       ` Mel Gorman
@ 2010-08-19 14:34                         ` Minchan Kim
  2010-08-19 15:07                           ` Mel Gorman
  0 siblings, 1 reply; 99+ messages in thread
From: Minchan Kim @ 2010-08-19 14:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 03:09:46PM +0100, Mel Gorman wrote:
> On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote:
> > On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote:
> > > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote:
> > > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
> > > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > > > >> > > What's a window low and min wmark? Maybe I can miss your point.
> > > > >> > >
> > > > >> >
> > > > >> > The window is due to the fact kswapd is not awake yet. The window is because
> > > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> > > > >> > system is really somewhere between the low and min watermark but we are not
> > > > >> > taking the accurate measure until kswapd gets woken up. The first allocation
> > > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or
> > > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
> > > > >> > any drift) wakes kswapd and other callers then take an accurate count hence
> > > > >> > "we could breach the watermark but I'm expecting it can only happen for at
> > > > >> > worst one allocation".
> > > > >>
> > > > >> Right. I misunderstood your word.
> > > > >> One more question.
> > > > >>
> > > > >> Could you explain live lock scenario?
> > > > >>
> > > > >
> > > > > Let's say
> > > > >
> > > > > NR_FREE_PAGES     = 256
> > > > > Actual free pages = 8
> > > > >
> > > > > The PCP lists get refilled in batch, taking all 8 pages. Now there are
> > > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> > > > > clean something but all the pages are on a network-backed filesystem. To
> > > > > clean them, it must transmit on the network so it tries to allocate some
> > > > > buffers.
> > > > >
> > > > > The livelock is that to free some memory, an allocation must succeed but
> > > > > for an allocation to succeed, some memory must be freed. The system
> > > > 
> > > > Yes. I understood this as livelock but at last VM will kill victim
> > > > process then it can allocate free pages.
> > > 
> > > And if the exit path for the OOM kill needs to allocate a page what
> > > should it do?
> > 
> > Yeah. It might be a livelock.
> > Then, let's rethink the problem.
> >
> > The problem is as follows.
> >
> > 1. Process A tries to allocate a page
> > 2. The VM tries to reclaim pages for process A
> > 3. The VM reclaims some pages but they remain on the PCP lists, so they can't be allocated to A
> > 4. The VM tries to kill process B
> > 5. The exit path needs new pages to exit process B
> > 6. Livelock happens (I am not sure, but at least we need a warning if it really happens)
> > 
> 
> The problem this patch is concerned with is about the vmstat counters, not
> the pages on the per-cpu lists. The issue being dealt with is that the page
> allocator can grant a page that takes the zone below the min watermark
> because NR_FREE_PAGES can be inaccurate. The patch aims to fix that by
> taking greater care with NR_FREE_PAGES when memory is low.

Your goal is to protect the _min_ pages which are reserved, right?
I thought your final goal was to prevent the livelock problem.
Hmm.. Sorry for the noise. :(

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 14:01                     ` Minchan Kim
@ 2010-08-19 14:09                       ` Mel Gorman
  2010-08-19 14:34                         ` Minchan Kim
  0 siblings, 1 reply; 99+ messages in thread
From: Mel Gorman @ 2010-08-19 14:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 11:01:50PM +0900, Minchan Kim wrote:
> On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote:
> > On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote:
> > > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
> > > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > > >> > > What's a window low and min wmark? Maybe I can miss your point.
> > > >> > >
> > > >> >
> > > >> > The window is due to the fact kswapd is not awake yet. The window is because
> > > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> > > >> > system is really somewhere between the low and min watermark but we are not
> > > >> > taking the accurate measure until kswapd gets woken up. The first allocation
> > > >> > to notice we are below the low watermark (be it due to vmstat refreshing or
> > > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
> > > >> > any drift) wakes kswapd and other callers then take an accurate count hence
> > > >> > "we could breach the watermark but I'm expecting it can only happen for at
> > > >> > worst one allocation".
> > > >>
> > > >> Right. I misunderstood your word.
> > > >> One more question.
> > > >>
> > > >> Could you explain live lock scenario?
> > > >>
> > > >
> > > > Let's say
> > > >
> > > > NR_FREE_PAGES     = 256
> > > > Actual free pages = 8
> > > >
> > > > The PCP lists get refilled in batch, taking all 8 pages. Now there are
> > > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> > > > clean something but all the pages are on a network-backed filesystem. To
> > > > clean them, it must transmit on the network so it tries to allocate some
> > > > buffers.
> > > >
> > > > The livelock is that to free some memory, an allocation must succeed but
> > > > for an allocation to succeed, some memory must be freed. The system
> > > 
> > > Yes. I understood this as livelock but at last VM will kill victim
> > > process then it can allocate free pages.
> > 
> > And if the exit path for the OOM kill needs to allocate a page what
> > should it do?
> 
> Yeah. It might be a livelock.
> Then, let's rethink the problem.
>
> The problem is as follows.
>
> 1. Process A tries to allocate a page
> 2. The VM tries to reclaim pages for process A
> 3. The VM reclaims some pages but they remain on the PCP lists, so they can't be allocated to A
> 4. The VM tries to kill process B
> 5. The exit path needs new pages to exit process B
> 6. Livelock happens (I am not sure, but at least we need a warning if it really happens)
> 

The problem this patch is concerned with is about the vmstat counters, not
the pages on the per-cpu lists. The issue being dealt with is that the page
allocator can grant a page that takes the zone below the min watermark
because NR_FREE_PAGES can be inaccurate. The patch aims to fix that by
taking greater care with NR_FREE_PAGES when memory is low.

> If OOM kills process B successfully, there isn't a livelock problem.
> So then, how about this?
> 
> We need to retry the allocation of a new page, draining the per-cpu free pages just before OOM.
> It doesn't add any overhead before going OOM, and that path is not frequent.
> 

It's a different problem and it's what patch 3/3 of this series aims to
address.

> Couldn't this patch handle your problem?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1bb327a..113bea9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2045,6 +2045,15 @@ rebalance:
>          * running out of options and have to consider going OOM
>          */
>         if (!did_some_progress) {
> +
> +               /* There are some free pages on the PCP lists */
> +               drain_all_pages();
> +               page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> +                               high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> +                               preferred_zone, migratetype);
> +               if (page)
> +                       goto got_pg;
> +
>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>                         if (oom_killer_disabled)
>                                 goto nopage;
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 10:38                   ` Mel Gorman
@ 2010-08-19 14:01                     ` Minchan Kim
  2010-08-19 14:09                       ` Mel Gorman
  0 siblings, 1 reply; 99+ messages in thread
From: Minchan Kim @ 2010-08-19 14:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 11:38:39AM +0100, Mel Gorman wrote:
> On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote:
> > On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
> > >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > >> > > What's a window low and min wmark? Maybe I can miss your point.
> > >> > >
> > >> >
> > >> > The window is due to the fact kswapd is not awake yet. The window is because
> > >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> > >> > system is really somewhere between the low and min watermark but we are not
> > >> > taking the accurate measure until kswapd gets woken up. The first allocation
> > >> > to notice we are below the low watermark (be it due to vmstat refreshing or
> > >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
> > >> > any drift) wakes kswapd and other callers then take an accurate count hence
> > >> > "we could breach the watermark but I'm expecting it can only happen for at
> > >> > worst one allocation".
> > >>
> > >> Right. I misunderstood your word.
> > >> One more question.
> > >>
> > >> Could you explain live lock scenario?
> > >>
> > >
> > > Let's say
> > >
> > > NR_FREE_PAGES     = 256
> > > Actual free pages = 8
> > >
> > > The PCP lists get refilled in batch, taking all 8 pages. Now there are
> > > zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> > > clean something but all the pages are on a network-backed filesystem. To
> > > clean them, it must transmit on the network so it tries to allocate some
> > > buffers.
> > >
> > > The livelock is that to free some memory, an allocation must succeed but
> > > for an allocation to succeed, some memory must be freed. The system
> > 
> > Yes. I understood this as livelock but at last VM will kill victim
> > process then it can allocate free pages.
> 
> And if the exit path for the OOM kill needs to allocate a page what
> should it do?

Yeah. It might be a livelock.
Then, let's rethink the problem.

The problem is as follows.

1. Process A tries to allocate a page
2. The VM tries to reclaim pages for process A
3. The VM reclaims some pages but they remain on the PCP lists, so they can't be allocated to A
4. The VM tries to kill process B
5. The exit path needs new pages to exit process B
6. Livelock happens (I am not sure, but at least we need a warning if it really happens)

If OOM kills process B successfully, there isn't a livelock problem.
So then, how about this?

We need to retry the allocation of a new page, draining the per-cpu free pages just before OOM.
It doesn't add any overhead before going OOM, and that path is not frequent.

Couldn't this patch handle your problem?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1bb327a..113bea9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2045,6 +2045,15 @@ rebalance:
         * running out of options and have to consider going OOM
         */
        if (!did_some_progress) {
+
+               /* There are some free pages on the PCP lists */
+               drain_all_pages();
+               page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
+                               high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
+                               preferred_zone, migratetype);
+               if (page)
+                       goto got_pg;
+
                if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
                        if (oom_killer_disabled)
                                goto nopage;
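
(One caveat, from my reading of the allocator and worth double-checking:
drain_all_pages() reaches every online CPU's per-cpu lists, so its cost is
only tolerable here because this path runs once, just before declaring OOM.)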



-- 
Kind regards,
Minchan Kim


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19 10:33                 ` Minchan Kim
@ 2010-08-19 10:38                   ` Mel Gorman
  2010-08-19 14:01                     ` Minchan Kim
  0 siblings, 1 reply; 99+ messages in thread
From: Mel Gorman @ 2010-08-19 10:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 07:33:57PM +0900, Minchan Kim wrote:
> On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
> >> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> >> > > What's a window low and min wmark? Maybe I can miss your point.
> >> > >
> >> >
> >> > The window is due to the fact kswapd is not awake yet. The window is because
> >> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> >> > system is really somewhere between the low and min watermark but we are not
> >> > taking the accurate measure until kswapd gets woken up. The first allocation
> >> > to notice we are below the low watermark (be it due to vmstat refreshing or
> >> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
> >> > any drift) wakes kswapd and other callers then take an accurate count hence
> >> > "we could breach the watermark but I'm expecting it can only happen for at
> >> > worst one allocation".
> >>
> >> Right. I misunderstood your word.
> >> One more question.
> >>
> >> Could you explain live lock scenario?
> >>
> >
> > Let's say
> >
> > NR_FREE_PAGES     = 256
> > Actual free pages = 8
> >
> > The PCP lists get refilled in batch, taking all 8 pages. Now there are
> > zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> > clean something but all the pages are on a network-backed filesystem. To
> > clean them, it must transmit on the network so it tries to allocate some
> > buffers.
> >
> > The livelock is that to free some memory, an allocation must succeed but
> > for an allocation to succeed, some memory must be freed. The system
> 
> Yes. I understood this as livelock but at last VM will kill victim
> process then it can allocate free pages.

And if the exit path for the OOM kill needs to allocate a page what
should it do?

> So I think it's not a livelock.
> 
> > might still remain alive if a process exits and does not need to
> > allocate memory while exiting but by and large, the system is in a
> > dangerous state.
> 
> Do you mean the dangerous state of the system is livelock?
> Maybe not.
> I can't understand livelock in this context.
> Anyway, I am okay with this patch except for the livelock phrase. :)
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-19  8:06               ` Mel Gorman
@ 2010-08-19 10:33                 ` Minchan Kim
  2010-08-19 10:38                   ` Mel Gorman
  0 siblings, 1 reply; 99+ messages in thread
From: Minchan Kim @ 2010-08-19 10:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 5:06 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
>> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
>> > > What's a window low and min wmark? Maybe I can miss your point.
>> > >
>> >
>> > The window is due to the fact kswapd is not awake yet. The window is because
>> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
>> > system is really somewhere between the low and min watermark but we are not
>> > taking the accurate measure until kswapd gets woken up. The first allocation
>> > to notice we are below the low watermark (be it due to vmstat refreshing or
>> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
>> > any drift) wakes kswapd and other callers then take an accurate count hence
>> > "we could breach the watermark but I'm expecting it can only happen for at
>> > worst one allocation".
>>
>> Right. I misunderstood your word.
>> One more question.
>>
>> Could you explain live lock scenario?
>>
>
> Let's say
>
> NR_FREE_PAGES     = 256
> Actual free pages = 8
>
> The PCP lists get refilled in batch, taking all 8 pages. Now there are
> zero free pages. Reclaim kicks in but to reclaim any pages it needs to
> clean something but all the pages are on a network-backed filesystem. To
> clean them, it must transmit on the network so it tries to allocate some
> buffers.
>
> The livelock is that to free some memory, an allocation must succeed but
> for an allocation to succeed, some memory must be freed. The system

Yes. I understood this as livelock but at last VM will kill victim
process then it can allocate free pages.
So I think it's not a livelock.

> might still remain alive if a process exits and does not need to
> allocate memory while exiting but by and large, the system is in a
> dangerous state.

Do you mean the dangerous state of the system is livelock?
Maybe not.
I can't understand livelock in this context.
Anyway, I am okay with this patch except for the livelock phrase. :)

Thanks, Mel.
-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-18 14:57             ` Minchan Kim
@ 2010-08-19  8:06               ` Mel Gorman
  2010-08-19 10:33                 ` Minchan Kim
  0 siblings, 1 reply; 99+ messages in thread
From: Mel Gorman @ 2010-08-19  8:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed, Aug 18, 2010 at 11:57:26PM +0900, Minchan Kim wrote:
> On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > > What's a window low and min wmark? Maybe I can miss your point. 
> > > 
> > 
> > The window is due to the fact kswapd is not awake yet. The window is because
> > kswapd might not be awake as NR_FREE_PAGES is higher than it should be. The
> > system is really somewhere between the low and min watermark but we are not
> > taking the accurate measure until kswapd gets woken up. The first allocation
> > to notice we are below the low watermark (be it due to vmstat refreshing or
> > that NR_FREE_PAGES happens to report we are below the watermark regardless of
> > any drift) wakes kswapd and other callers then take an accurate count hence
> > "we could breach the watermark but I'm expecting it can only happen for at
> > worst one allocation".
> 
> Right. I misunderstood your word. 
> One more question. 
> 
> Could you explain live lock scenario?
> 

Let's say

NR_FREE_PAGES     = 256
Actual free pages = 8

The PCP lists get refilled in batch, taking all 8 pages. Now there are
zero free pages. Reclaim kicks in but to reclaim any pages it needs to
clean something but all the pages are on a network-backed filesystem. To
clean them, it must transmit on the network so it tries to allocate some
buffers.

The livelock is that to free some memory, an allocation must succeed but
for an allocation to succeed, some memory must be freed. The system
might still remain alive if a process exits and does not need to
allocate memory while exiting but by and large, the system is in a
dangerous state.
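
In terms of the watermark check, a simplified model using the numbers above
(the min watermark of 32 is assumed; the real zone_watermark_ok() also
accounts for the allocation order and lowmem reserves):

#include <stdio.h>
#include <stdbool.h>

static bool watermark_ok(long reported_free, long min_wmark)
{
	return reported_free > min_wmark;
}

int main(void)
{
	long nr_free_pages = 256;	/* stale vmstat view */
	long actual_free = 8;		/* what buddy really holds */
	long min_wmark = 32;		/* assumed */

	/* passes ("ok") on the stale counter... */
	printf("stale view: %s\n",
	       watermark_ok(nr_free_pages, min_wmark) ? "ok" : "fail");
	/* ...while the PCP refill that follows can empty buddy ("fail") */
	printf("real view:  %s\n",
	       watermark_ok(actual_free, min_wmark) ? "ok" : "fail");
	return 0;
}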

> I looked over the code. Although the VM passes zone_watermark_ok by luck,
> it can't allocate the page from buddy and then might go OOM.
> When do we meet the livelock case?
> 
> I think the description in the changelog would make this patch easier to
> understand in the future.
> 

Is the above description useful? If so, I can put it in the leader.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-18 15:55     ` Christoph Lameter
@ 2010-08-19  0:07       ` KAMEZAWA Hiroyuki
  2010-08-19 19:00         ` Christoph Lameter
  0 siblings, 1 reply; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-19  0:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KOSAKI Motohiro

On Wed, 18 Aug 2010 10:55:53 -0500 (CDT)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Wed, 18 Aug 2010, KAMEZAWA Hiroyuki wrote:
> 
> > BTW, a nitpick.
> >
> > > @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
> > >  		for_each_online_cpu(cpu)
> > >  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > >  							= threshold;
> > > +
> > > +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> > > +					num_online_cpus() * threshold;
> > >  	}
> > >  }
> >
> > This function is now called only at CPU_DEAD; IOW, it is not called at CPU_UP_PREPARE.
> 
> calculate_threshold() does its calculation based on the number of online
> cpus. Therefore the threshold may change if a cpu is brought down.
> 
Yes, but why not calculate it at CPU bring-up as well?

Thanks,
-Kame


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-18  2:59   ` KAMEZAWA Hiroyuki
@ 2010-08-18 15:55     ` Christoph Lameter
  2010-08-19  0:07       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Lameter @ 2010-08-18 15:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KOSAKI Motohiro

On Wed, 18 Aug 2010, KAMEZAWA Hiroyuki wrote:

> BTW, a nitpick.
>
> > @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
> >  		for_each_online_cpu(cpu)
> >  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> >  							= threshold;
> > +
> > +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> > +					num_online_cpus() * threshold;
> >  	}
> >  }
>
> This function is now called only at CPU_DEAD; IOW, it is not called at CPU_UP_PREPARE.

calculate_threshold() does its calculation based on the number of online
cpus. Therefore the threshold may change if a cpu is brought down.


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-18  8:51           ` Mel Gorman
@ 2010-08-18 14:57             ` Minchan Kim
  2010-08-19  8:06               ` Mel Gorman
  0 siblings, 1 reply; 99+ messages in thread
From: Minchan Kim @ 2010-08-18 14:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed, Aug 18, 2010 at 09:51:23AM +0100, Mel Gorman wrote:
> > What's the window between the low and min wmark? Maybe I'm missing your point. 
> > 
> 
> The window is due to the fact that kswapd is not awake yet: kswapd might
> not be awake because NR_FREE_PAGES is higher than it should be. The
> system is really somewhere between the low and min watermarks but we do
> not take an accurate measure until kswapd gets woken up. The first
> allocation to notice we are below the low watermark (be it due to vmstat
> refreshing or to NR_FREE_PAGES happening to report we are below the
> watermark regardless of any drift) wakes kswapd, and other callers then
> take an accurate count, hence "we could breach the watermark but I'm
> expecting it can only happen for at worst one allocation".

Right. I misunderstood your words. 
One more question. 

Could you explain the livelock scenario?

I looked over the code. Although the VM passes zone_watermark_ok by luck,
it can't allocate the page from the buddy lists and then might go OOM. 
When do we meet the livelock case?

I think a description like this in the changelog would make this patch 
easier to understand in the future. 

Thanks. 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-17 14:20         ` Minchan Kim
@ 2010-08-18  8:51           ` Mel Gorman
  2010-08-18 14:57             ` Minchan Kim
  0 siblings, 1 reply; 99+ messages in thread
From: Mel Gorman @ 2010-08-18  8:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 17, 2010 at 11:20:40PM +0900, Minchan Kim wrote:
> On Tue, Aug 17, 2010 at 11:16:55AM +0100, Mel Gorman wrote:
> > Well, the drift can be in either direction because drift can be due to pages
> > being either freed or allocated. e.g. it could be something like
> > 
> > NR_FREE_PAGES		CPU 0			CPU 1		Actual Free
> > 128			-32			 +64		   160
> > 
> > Because CPU 0 was allocating pages while CPU 1 was freeing them but that
> > is not what is important here. At any given time, the NR_FREE_PAGES can be
> > wrong by as much as
> > 
> > num_online_cpus * (threshold - 1)
> 
> That's the answer I expected.
> As I mentioned in a previous mail, we need to consider the allocation path.
> But you have already partially considered it here. 
> Yes. It looks good to me. :)
> 
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 

Thanks.

> > 
> > As kswapd goes back to sleep when the high watermark is reached, it's important
> > that it has actually reached the watermark before sleeping.  Similarly,
> > if an allocator is checking the low watermark, it needs an accurate count.
> > Hence a more careful accounting for NR_FREE_PAGES should happen when the
> > number of free pages is within
> > 
> > high_watermark + (num_online_cpus * (threshold - 1))
> > 
> > Only checking when kswapd is awake still leaves a window between the low
> > and min watermark when we could breach the watermark but I'm expecting it
> > can only happen for at worst one allocation. After that, kswapd wakes
> > and the count becomes accurate again.
> 
> I can't understand the point. 
> Now kswapd starts below the low wmark and stops at the high wmark.

Correct.

> So if the VM has pages below the low wmark, it could always check with zone_nr_free_pages 
> regardless of min. 
> 

The difficulty is that NR_FREE_PAGES is an estimate so for a time the VM may
not know it is below the low watermark. We can get a more accurate view but
it's costly so we want to avoid that cost whenever we can.
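
As a rough model of that trade-off (a hypothetical userspace sketch; the
accurate read mirrors the for_each_online_cpu() loop in
zone_nr_free_pages()):

#include <stdio.h>

#define NCPUS 64			/* assumed large machine */

static long nr_free_pages = 4096;	/* global estimate */
static long vm_stat_diff[NCPUS];	/* per-cpu deltas */

/* Cheap read: a single load, but stale by up to NCPUS * (threshold - 1) */
static long cheap_read(void)
{
	return nr_free_pages;
}

/* Accurate read: O(NCPUS) loads, touching one cache line per CPU */
static long accurate_read(void)
{
	long sum = nr_free_pages;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++)
		sum += vm_stat_diff[cpu];
	return sum;
}

int main(void)
{
	vm_stat_diff[0] = -100;
	vm_stat_diff[1] = 60;
	printf("cheap=%ld accurate=%ld\n", cheap_read(), accurate_read());
	return 0;
}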

> What's the window between the low and min wmark? Maybe I'm missing your point. 
> 

The window is due to the fact that kswapd is not awake yet: kswapd might
not be awake because NR_FREE_PAGES is higher than it should be. The
system is really somewhere between the low and min watermarks but we do
not take an accurate measure until kswapd gets woken up. The first
allocation to notice we are below the low watermark (be it due to vmstat
refreshing or to NR_FREE_PAGES happening to report we are below the
watermark regardless of any drift) wakes kswapd, and other callers then
take an accurate count, hence "we could breach the watermark but I'm
expecting it can only happen for at worst one allocation".

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-16  9:42 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman
  2010-08-16  9:43   ` Mel Gorman
@ 2010-08-18  2:59   ` KAMEZAWA Hiroyuki
  2010-08-18 15:55     ` Christoph Lameter
  1 sibling, 1 reply; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-18  2:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KOSAKI Motohiro, cl

On Mon, 16 Aug 2010 10:42:12 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> it is cheaper than scanning a number of lists. To avoid synchronization
> overhead, counter deltas are maintained on a per-cpu basis and drained both
> periodically and when the delta is above a threshold. On large CPU systems,
> the difference between the estimated and real value of NR_FREE_PAGES can be
> very high. If the system is under both load and low memory, it's possible
> for watermarks to be breached. In extreme cases, the number of free pages
> can drop to 0 leading to the possibility of system livelock.
> 
> This patch introduces zone_nr_free_pages() to take a slightly more accurate
> estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> and may result in cache line bounces but is expected to be lighter than the
> IPI calls necessary to continually drain the per-cpu counters while kswapd
> is awake.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

BTW, a nitpick.

> @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
>  		for_each_online_cpu(cpu)
>  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>  							= threshold;
> +
> +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> +					num_online_cpus() * threshold;
>  	}
>  }

This function is now called only at CPU_DEAD; IOW, it is not called at CPU_UP_PREPARE.

It was done by this patch... but the reason is unclear to me.
==
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d1187ed21026fd512b87851d0ca26d9ae16f9059
==

Christoph?


Thanks,
-Kame


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-17 15:01           ` Minchan Kim
@ 2010-08-17 15:05             ` Mel Gorman
  0 siblings, 0 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-17 15:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed, Aug 18, 2010 at 12:01:44AM +0900, Minchan Kim wrote:
> On Tue, Aug 17, 2010 at 11:42:46AM +0100, Mel Gorman wrote:
> > On Tue, Aug 17, 2010 at 11:26:05AM +0900, Minchan Kim wrote:
> > > On Tue, Aug 17, 2010 at 1:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > [npiggin@suse.de bounces, switched to yahoo address]
> > > >
> > > > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> > > 
> > > <snip>
> > > 
> > > >> +      * potentially causing a live-lock. While kswapd is awake and
> > > >> +      * free pages are low, get a better estimate for free pages
> > > >> +      */
> > > >> +     if (nr_free_pages < zone->percpu_drift_mark &&
> > > >> +                     !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> > > >> +             int cpu;
> > > >> +
> > > >> +             for_each_online_cpu(cpu) {
> > > >> +                     struct per_cpu_pageset *pset;
> > > >> +
> > > >> +                     pset = per_cpu_ptr(zone->pageset, cpu);
> > > >> +                     nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> > > 
> > > We need to consider CONFIG_SMP.
> > > 
> > 
> > We do.
> > 
> > #ifdef CONFIG_SMP
> > unsigned long zone_nr_free_pages(struct zone *zone);
> > #else
> > #define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
> > #endif /* CONFIG_SMP */
> > 
> > and a wrapping of CONFIG_SMP around the function in mmzone.c .
> 
> I can't find it in this patch series. 

My bad. What I meant is "You're right, we do need to consider
CONFIG_SMP, how about something like the following";

I've made such a change to my local tree but it was not part of the
released series.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-17 10:42         ` Mel Gorman
@ 2010-08-17 15:01           ` Minchan Kim
  2010-08-17 15:05             ` Mel Gorman
  0 siblings, 1 reply; 99+ messages in thread
From: Minchan Kim @ 2010-08-17 15:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 17, 2010 at 11:42:46AM +0100, Mel Gorman wrote:
> On Tue, Aug 17, 2010 at 11:26:05AM +0900, Minchan Kim wrote:
> > On Tue, Aug 17, 2010 at 1:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > [npiggin@suse.de bounces, switched to yahoo address]
> > >
> > > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> > 
> > <snip>
> > 
> > >> +      * potentially causing a live-lock. While kswapd is awake and
> > >> +      * free pages are low, get a better estimate for free pages
> > >> +      */
> > >> +     if (nr_free_pages < zone->percpu_drift_mark &&
> > >> +                     !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> > >> +             int cpu;
> > >> +
> > >> +             for_each_online_cpu(cpu) {
> > >> +                     struct per_cpu_pageset *pset;
> > >> +
> > >> +                     pset = per_cpu_ptr(zone->pageset, cpu);
> > >> +                     nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> > 
> > We need to consider CONFIG_SMP.
> > 
> 
> We do.
> 
> #ifdef CONFIG_SMP
> unsigned long zone_nr_free_pages(struct zone *zone);
> #else
> #define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
> #endif /* CONFIG_SMP */
> 
> and a wrapping of CONFIG_SMP around the function in mmzone.c .

I can't find it in this patch series. 
Hmm.. :(

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-17 10:16       ` Mel Gorman
  2010-08-17 11:05         ` Johannes Weiner
@ 2010-08-17 14:20         ` Minchan Kim
  2010-08-18  8:51           ` Mel Gorman
  1 sibling, 1 reply; 99+ messages in thread
From: Minchan Kim @ 2010-08-17 14:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 17, 2010 at 11:16:55AM +0100, Mel Gorman wrote:
> Well, the drift can be in either direction because drift can be due to pages
> being either freed or allocated. e.g. it could be something like
> 
> NR_FREE_PAGES		CPU 0			CPU 1		Actual Free
> 128			-32			 +64		   160
> 
> Because CPU 0 was allocating pages while CPU 1 was freeing them but that
> is not what is important here. At any given time, the NR_FREE_PAGES can be
> wrong by as much as
> 
> num_online_cpus * (threshold - 1)

That's the answer I expected.
As I mentioned in a previous mail, we need to consider the allocation path.
But you have already partially considered it here. 
Yes. It looks good to me. :)

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

> 
> As kswapd goes back to sleep when the high watermark is reached, it's important
> that it has actually reached the watermark before sleeping.  Similarly,
> if an allocator is checking the low watermark, it needs an accurate count.
> Hence a more careful accounting for NR_FREE_PAGES should happen when the
> number of free pages is within
> 
> high_watermark + (num_online_cpus * (threshold - 1))
> 
> Only checking when kswapd is awake still leaves a window between the low
> and min watermark when we could breach the watermark but I'm expecting it
> can only happen for at worst one allocation. After that, kswapd wakes
> and the count becomes accurate again.

I can't understand the point. 
Now kswapd starts below the low wmark and stops at the high wmark.
So if the VM has pages below the low wmark, it could always check with zone_nr_free_pages 
regardless of min. 

What's the window between the low and min wmark? Maybe I'm missing your point. 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-17 10:16       ` Mel Gorman
@ 2010-08-17 11:05         ` Johannes Weiner
  2010-08-17 14:20         ` Minchan Kim
  1 sibling, 0 replies; 99+ messages in thread
From: Johannes Weiner @ 2010-08-17 11:05 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Rik van Riel, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 17, 2010 at 11:16:55AM +0100, Mel Gorman wrote:
> On Mon, Aug 16, 2010 at 06:06:23PM +0200, Johannes Weiner wrote:
> > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 7759941..c95a159 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
> > >  		for_each_online_cpu(cpu)
> > >  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > >  							= threshold;
> > > +
> > > +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> > > +					num_online_cpus() * threshold;
> > >  	}
> > >  }
> > 
> > Hm, this one I don't quite get (might be the jetlag, though): we have
> > _at least_ NR_FREE_PAGES free pages, there may just be more lurking in
> > the pcp counters.
> > 
> 
> Well, the drift can be in either direction because drift can be due to pages
> being either freed or allocated. e.g. it could be something like
> 
> NR_FREE_PAGES		CPU 0			CPU 1		Actual Free
> 128			-32			 +64		   160
> 
> Because CPU 0 was allocating pages while CPU 1 was freeing them but that
> is not what is important here. At any given time, the NR_FREE_PAGES can be
> wrong by as much as
> 
> num_online_cpus * (threshold - 1)

I somehow assumed the pcp cache could only be positive, but the
vm_stat_diff can indeed hold negative values.
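
A simplified model of why the delta can go negative (loosely based on the
threshold-folding logic in mm/vmstat.c; the numbers are made up):

#include <stdio.h>
#include <stdlib.h>

static long global_nr_free = 256;	/* the vmstat counter */
static long diff;			/* one CPU's vm_stat_diff entry */
static const long threshold = 32;

/* Allocations pass a negative delta, frees a positive one; the delta
 * is only folded into the global counter once it exceeds the threshold */
static void mod_free_pages(long delta)
{
	diff += delta;
	if (labs(diff) > threshold) {
		global_nr_free += diff;
		diff = 0;
	}
}

int main(void)
{
	mod_free_pages(-16);	/* allocation: delta goes negative */
	mod_free_pages(-10);	/* still within threshold: no fold */
	printf("global=%ld diff=%ld actual=%ld\n",
	       global_nr_free, diff, global_nr_free + diff);
	return 0;
}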

> > So shouldn't we only collect the pcp deltas in case the high watermark
> > is breached?  Above this point, we should be fine or better, no?
> > 
> 
> Is that not what is happening in zone_nr_free_pages with this check?
> 
>         /*
>          * While kswapd is awake, it is considered the zone is under some
>          * memory pressure. Under pressure, there is a risk that
>          * per-cpu-counter-drift will allow the min watermark to be breached
>          * potentially causing a live-lock. While kswapd is awake and
>          * free pages are low, get a better estimate for free pages
>          */
>         if (nr_free_pages < zone->percpu_drift_mark &&
>                         !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> 
> Maybe I'm misunderstanding your question.

This was just a conclusion based on my wrong assumption: if the pcp
diff could only be positive, it would be enough to go for accurate
counts at the point NR_FREE_PAGES breaches the watermark.

As it is, however, the error margin needs to be taken into account in
both directions, as you said, so your patch makes perfect sense.

Sorry for the noise! And

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-17  2:26       ` Minchan Kim
@ 2010-08-17 10:42         ` Mel Gorman
  2010-08-17 15:01           ` Minchan Kim
  0 siblings, 1 reply; 99+ messages in thread
From: Mel Gorman @ 2010-08-17 10:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 17, 2010 at 11:26:05AM +0900, Minchan Kim wrote:
> On Tue, Aug 17, 2010 at 1:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > [npiggin@suse.de bounces, switched to yahoo address]
> >
> > On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> 
> <snip>
> 
> >> +      * potentially causing a live-lock. While kswapd is awake and
> >> +      * free pages are low, get a better estimate for free pages
> >> +      */
> >> +     if (nr_free_pages < zone->percpu_drift_mark &&
> >> +                     !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> >> +             int cpu;
> >> +
> >> +             for_each_online_cpu(cpu) {
> >> +                     struct per_cpu_pageset *pset;
> >> +
> >> +                     pset = per_cpu_ptr(zone->pageset, cpu);
> >> +                     nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> 
> We need to consider CONFIG_SMP.
> 

We do.

#ifdef CONFIG_SMP
unsigned long zone_nr_free_pages(struct zone *zone);
#else
#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
#endif /* CONFIG_SMP */

and a wrapping of CONFIG_SMP around the function in mmzone.c .

> >> +             }
> >> +     }
> >> +
> >> +     return nr_free_pages;
> >> +}
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index c2407a4..67a2ed0 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> >>  {
> >>       /* free_pages my go negative - that's OK */
> >>       long min = mark;
> >> -     long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
> >> +     long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
> >>       int o;
> >>
> >>       if (alloc_flags & ALLOC_HIGH)
> >> @@ -2413,7 +2413,7 @@ void show_free_areas(void)
> >>                       " all_unreclaimable? %s"
> >>                       "\n",
> >>                       zone->name,
> >> -                     K(zone_page_state(zone, NR_FREE_PAGES)),
> >> +                     K(zone_nr_free_pages(zone)),
> >>                       K(min_wmark_pages(zone)),
> >>                       K(low_wmark_pages(zone)),
> >>                       K(high_wmark_pages(zone)),
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index 7759941..c95a159 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
> >>               for_each_online_cpu(cpu)
> >>                       per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> >>                                                       = threshold;
> >> +
> >> +             zone->percpu_drift_mark = high_wmark_pages(zone) +
> >> +                                     num_online_cpus() * threshold;
> >>       }
> >>  }
> >
> > Hm, this one I don't quite get (might be the jetlag, though): we have
> > _at least_ NR_FREE_PAGES free pages, there may just be more lurking in
> 
> We can't be sure of that.
> As I said in a previous mail, the current allocation path decreases
> NR_FREE_PAGES after it removes pages from the buddy list.
> 
> > the pcp counters.
> >
> > So shouldn't we only collect the pcp deltas in case the high watermark
> > is breached?  Above this point, we should be fine or better, no?
> 
> If we don't consider the allocation path, I agree with Hannes's opinion.
> At least, we need to hear why Mel determined the threshold this way. :)
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-16 16:06     ` Johannes Weiner
  2010-08-17  2:26       ` Minchan Kim
@ 2010-08-17 10:16       ` Mel Gorman
  2010-08-17 11:05         ` Johannes Weiner
  2010-08-17 14:20         ` Minchan Kim
  1 sibling, 2 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-17 10:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 16, 2010 at 06:06:23PM +0200, Johannes Weiner wrote:
> [npiggin@suse.de bounces, switched to yahoo address]
> 
> On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> > On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
> > > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > > it is cheaper than scanning a number of lists. To avoid synchronization
> > > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > > periodically and when the delta is above a threshold. On large CPU systems,
> > > the difference between the estimated and real value of NR_FREE_PAGES can be
> > > very high. If the system is under both load and low memory, it's possible
> > > for watermarks to be breached. In extreme cases, the number of free pages
> > > can drop to 0 leading to the possibility of system livelock.
> > > 
> > > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > > estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> > > and may result in cache line bounces but is expected to be lighter than the
> > > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > > is awake.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > And the second I sent this, I realised I had sent a slightly old version
> > that missed a compile-fix :(
> > 
> > ==== CUT HERE ====
> > mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
> > 
> > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > it is cheaper than scanning a number of lists. To avoid synchronization
> > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > periodically and when the delta is above a threshold. On large CPU systems,
> > the difference between the estimated and real value of NR_FREE_PAGES can be
> > very high. If the system is under both load and low memory, it's possible
> > for watermarks to be breached. In extreme cases, the number of free pages
> > can drop to 0 leading to the possibility of system livelock.
> > 
> > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> > and may result in cache line bounces but is expected to be lighter than the
> > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > is awake.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> [...]
> 
> > --- a/mm/mmzone.c
> > +++ b/mm/mmzone.c
> > @@ -87,3 +87,30 @@ int memmap_valid_within(unsigned long pfn,
> >  	return 1;
> >  }
> >  #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> > +
> > +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> > +unsigned long zone_nr_free_pages(struct zone *zone)
> > +{
> > +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > +	/*
> > +	 * While kswapd is awake, it is considered the zone is under some
> > +	 * memory pressure. Under pressure, there is a risk that
> > +	 * er-cpu-counter-drift will allow the min watermark to be breached
> 
> Missing `p'.
> 

D'oh. Fixed

> > +	 * potentially causing a live-lock. While kswapd is awake and
> > +	 * free pages are low, get a better estimate for free pages
> > +	 */
> > +	if (nr_free_pages < zone->percpu_drift_mark &&
> > +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> > +		int cpu;
> > +
> > +		for_each_online_cpu(cpu) {
> > +			struct per_cpu_pageset *pset;
> > +
> > +			pset = per_cpu_ptr(zone->pageset, cpu);
> > +			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> > +		}
> > +	}
> > +
> > +	return nr_free_pages;
> > +}
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c2407a4..67a2ed0 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> >  {
> >  	/* free_pages my go negative - that's OK */
> >  	long min = mark;
> > -	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
> > +	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
> >  	int o;
> >  
> >  	if (alloc_flags & ALLOC_HIGH)
> > @@ -2413,7 +2413,7 @@ void show_free_areas(void)
> >  			" all_unreclaimable? %s"
> >  			"\n",
> >  			zone->name,
> > -			K(zone_page_state(zone, NR_FREE_PAGES)),
> > +			K(zone_nr_free_pages(zone)),
> >  			K(min_wmark_pages(zone)),
> >  			K(low_wmark_pages(zone)),
> >  			K(high_wmark_pages(zone)),
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 7759941..c95a159 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
> >  		for_each_online_cpu(cpu)
> >  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> >  							= threshold;
> > +
> > +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> > +					num_online_cpus() * threshold;
> >  	}
> >  }
> 
> Hm, this one I don't quite get (might be the jetlag, though): we have
> _at least_ NR_FREE_PAGES free pages, there may just be more lurking in
> the pcp counters.
> 

Well, the drift can be in either direction because drift can be due to pages
being either freed or allocated. e.g. it could be something like

NR_FREE_PAGES		CPU 0			CPU 1		Actual Free
128			-32			 +64		   160

Because CPU 0 was allocating pages while CPU 1 was freeing them but that
is not what is important here. At any given time, the NR_FREE_PAGES can be
wrong by as much as

num_online_cpus * (threshold - 1)

As kswapd goes back to sleep when the high watermark is reached, it's important
that it has actually reached the watermark before sleeping.  Similarly,
if an allocator is checking the low watermark, it needs an accurate count.
Hence a more careful accounting for NR_FREE_PAGES should happen when the
number of free pages is within

high_watermark + (num_online_cpus * (threshold - 1))

Only checking when kswapd is awake still leaves a window between the low
and min watermark when we could breach the watermark but I'm expecting it
can only happen for at worst one allocation. After that, kswapd wakes
and the count becomes accurate again.
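
Worked through with made-up numbers (note the patch itself rounds the bound
up to num_online_cpus() * threshold when setting percpu_drift_mark):

#include <stdio.h>

int main(void)
{
	long high_wmark = 4096;	/* assumed high watermark, in pages */
	long online_cpus = 64;	/* assumed machine size */
	long threshold = 125;	/* assumed per-cpu stat threshold */

	/* Worst-case error of the NR_FREE_PAGES estimate */
	long max_drift = online_cpus * (threshold - 1);
	/* Below this point the cheap estimate cannot be trusted */
	long careful_below = high_wmark + max_drift;

	printf("max drift:     %ld pages\n", max_drift);	/* 7936 */
	printf("careful below: %ld pages\n", careful_below);	/* 12032 */
	return 0;
}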

> So shouldn't we only collect the pcp deltas in case the high watermark
> is breached?  Above this point, we should be fine or better, no?
> 

Is that not what is happening in zone_nr_free_pages with this check?

        /*
         * While kswapd is awake, it is considered the zone is under some
         * memory pressure. Under pressure, there is a risk that
         * per-cpu-counter-drift will allow the min watermark to be breached
         * potentially causing a live-lock. While kswapd is awake and
         * free pages are low, get a better estimate for free pages
         */
        if (nr_free_pages < zone->percpu_drift_mark &&
                        !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {

Maybe I'm misunderstanding your question.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-16 16:06     ` Johannes Weiner
@ 2010-08-17  2:26       ` Minchan Kim
  2010-08-17 10:42         ` Mel Gorman
  2010-08-17 10:16       ` Mel Gorman
  1 sibling, 1 reply; 99+ messages in thread
From: Minchan Kim @ 2010-08-17  2:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, linux-mm, Rik van Riel, Nick Piggin,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Aug 17, 2010 at 1:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> [npiggin@suse.de bounces, switched to yahoo address]
>
> On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:

<snip>

>> +      * potentially causing a live-lock. While kswapd is awake and
>> +      * free pages are low, get a better estimate for free pages
>> +      */
>> +     if (nr_free_pages < zone->percpu_drift_mark &&
>> +                     !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
>> +             int cpu;
>> +
>> +             for_each_online_cpu(cpu) {
>> +                     struct per_cpu_pageset *pset;
>> +
>> +                     pset = per_cpu_ptr(zone->pageset, cpu);
>> +                     nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];

We need to consider CONFIG_SMP.

>> +             }
>> +     }
>> +
>> +     return nr_free_pages;
>> +}
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c2407a4..67a2ed0 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>>  {
>>       /* free_pages my go negative - that's OK */
>>       long min = mark;
>> -     long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
>> +     long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
>>       int o;
>>
>>       if (alloc_flags & ALLOC_HIGH)
>> @@ -2413,7 +2413,7 @@ void show_free_areas(void)
>>                       " all_unreclaimable? %s"
>>                       "\n",
>>                       zone->name,
>> -                     K(zone_page_state(zone, NR_FREE_PAGES)),
>> +                     K(zone_nr_free_pages(zone)),
>>                       K(min_wmark_pages(zone)),
>>                       K(low_wmark_pages(zone)),
>>                       K(high_wmark_pages(zone)),
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 7759941..c95a159 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
>>               for_each_online_cpu(cpu)
>>                       per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>>                                                       = threshold;
>> +
>> +             zone->percpu_drift_mark = high_wmark_pages(zone) +
>> +                                     num_online_cpus() * threshold;
>>       }
>>  }
>
> Hm, this one I don't quite get (might be the jetlag, though): we have
> _at least_ NR_FREE_PAGES free pages, there may just be more lurking in

We can't be sure of that.
As I said in a previous mail, the current allocation path decreases
NR_FREE_PAGES after it removes pages from the buddy list.
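
A simplified userspace model of that ordering (not kernel code):

#include <stdio.h>

static int buddy_list_pages = 8;	/* pages actually on the list */
static long nr_free_pages = 8;		/* the NR_FREE_PAGES estimate */

static void alloc_one_page(void)
{
	buddy_list_pages--;	/* page removed from the buddy list first */
	/*
	 * Window: a concurrent watermark check here still sees
	 * nr_free_pages == 8 although only 7 pages remain
	 */
	nr_free_pages--;	/* counter only decremented afterwards */
}

int main(void)
{
	alloc_one_page();
	printf("list=%d counter=%ld\n", buddy_list_pages, nr_free_pages);
	return 0;
}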

> the pcp counters.
>
> So shouldn't we only collect the pcp deltas in case the high watermark
> is breached?  Above this point, we should be fine or better, no?

If we don't consider the allocation path, I agree with Hannes's opinion.
At least, we need to hear why Mel determined the threshold this way. :)



-- 
Kind regards,
Minchan Kim


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-16  9:43   ` Mel Gorman
  2010-08-16 14:47     ` Rik van Riel
@ 2010-08-16 16:06     ` Johannes Weiner
  2010-08-17  2:26       ` Minchan Kim
  2010-08-17 10:16       ` Mel Gorman
  2010-08-19 15:46     ` Minchan Kim
  2 siblings, 2 replies; 99+ messages in thread
From: Johannes Weiner @ 2010-08-16 16:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, KAMEZAWA Hiroyuki, KOSAKI Motohiro

[npiggin@suse.de bounces, switched to yahoo address]

On Mon, Aug 16, 2010 at 10:43:50AM +0100, Mel Gorman wrote:
> On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
> > Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> > it is cheaper than scanning a number of lists. To avoid synchronization
> > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > periodically and when the delta is above a threshold. On large CPU systems,
> > the difference between the estimated and real value of NR_FREE_PAGES can be
> > very high. If the system is under both load and low memory, it's possible
> > for watermarks to be breached. In extreme cases, the number of free pages
> > can drop to 0 leading to the possibility of system livelock.
> > 
> > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> > and may result in cache line bounces but is expected to be lighter than the
> > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > is awake.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> And the second I sent this, I realised I had sent a slightly old version
> that missed a compile-fix :(
> 
> ==== CUT HERE ====
> mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
> 
> Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> it is cheaper than scanning a number of lists. To avoid synchronization
> overhead, counter deltas are maintained on a per-cpu basis and drained both
> periodically and when the delta is above a threshold. On large CPU systems,
> the difference between the estimated and real value of NR_FREE_PAGES can be
> very high. If the system is under both load and low memory, it's possible
> for watermarks to be breached. In extreme cases, the number of free pages
> can drop to 0 leading to the possibility of system livelock.
> 
> This patch introduces zone_nr_free_pages() to take a slightly more accurate
> estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> and may result in cache line bounces but is expected to be lighter than the
> IPI calls necessary to continually drain the per-cpu counters while kswapd
> is awake.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

[...]

> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -87,3 +87,30 @@ int memmap_valid_within(unsigned long pfn,
>  	return 1;
>  }
>  #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> +
> +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> +unsigned long zone_nr_free_pages(struct zone *zone)
> +{
> +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +
> +	/*
> +	 * While kswapd is awake, it is considered the zone is under some
> +	 * memory pressure. Under pressure, there is a risk that
> +	 * er-cpu-counter-drift will allow the min watermark to be breached

Missing `p'.

> +	 * potentially causing a live-lock. While kswapd is awake and
> +	 * free pages are low, get a better estimate for free pages
> +	 */
> +	if (nr_free_pages < zone->percpu_drift_mark &&
> +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> +		int cpu;
> +
> +		for_each_online_cpu(cpu) {
> +			struct per_cpu_pageset *pset;
> +
> +			pset = per_cpu_ptr(zone->pageset, cpu);
> +			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
> +		}
> +	}
> +
> +	return nr_free_pages;
> +}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c2407a4..67a2ed0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  {
>  	/* free_pages my go negative - that's OK */
>  	long min = mark;
> -	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
> +	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
>  	int o;
>  
>  	if (alloc_flags & ALLOC_HIGH)
> @@ -2413,7 +2413,7 @@ void show_free_areas(void)
>  			" all_unreclaimable? %s"
>  			"\n",
>  			zone->name,
> -			K(zone_page_state(zone, NR_FREE_PAGES)),
> +			K(zone_nr_free_pages(zone)),
>  			K(min_wmark_pages(zone)),
>  			K(low_wmark_pages(zone)),
>  			K(high_wmark_pages(zone)),
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7759941..c95a159 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
>  		for_each_online_cpu(cpu)
>  			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
>  							= threshold;
> +
> +		zone->percpu_drift_mark = high_wmark_pages(zone) +
> +					num_online_cpus() * threshold;
>  	}
>  }

Hm, this one I don't quite get (might be the jetlag, though): we have
_at least_ NR_FREE_PAGES free pages, there may just be more lurking in
the pcp counters.

So shouldn't we only collect the pcp deltas in case the high watermark
is breached?  Above this point, we should be fine or better, no?

	Hannes


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-16  9:43   ` Mel Gorman
@ 2010-08-16 14:47     ` Rik van Riel
  2010-08-16 16:06     ` Johannes Weiner
  2010-08-19 15:46     ` Minchan Kim
  2 siblings, 0 replies; 99+ messages in thread
From: Rik van Riel @ 2010-08-16 14:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On 08/16/2010 05:43 AM, Mel Gorman wrote:
> On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
>> Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
>> it is cheaper than scanning a number of lists. To avoid synchronization
>> overhead, counter deltas are maintained on a per-cpu basis and drained both
>> periodically and when the delta is above a threshold. On large CPU systems,
>> the difference between the estimated and real value of NR_FREE_PAGES can be
>> very high. If the system is under both load and low memory, it's possible
>> for watermarks to be breached. In extreme cases, the number of free pages
>> can drop to 0 leading to the possibility of system livelock.
>>
>> This patch introduces zone_nr_free_pages() to take a slightly more accurate
>> estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
>> and may result in cache line bounces but is expected to be lighter than the
>> IPI calls necessary to continually drain the per-cpu counters while kswapd
>> is awake.
>>
>> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
>
> And the second I sent this, I realised I had sent a slightly old version
> that missed a compile-fix :(

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-16  9:42 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman
@ 2010-08-16  9:43   ` Mel Gorman
  2010-08-16 14:47     ` Rik van Riel
                       ` (2 more replies)
  2010-08-18  2:59   ` KAMEZAWA Hiroyuki
  1 sibling, 3 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-16  9:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Mon, Aug 16, 2010 at 10:42:12AM +0100, Mel Gorman wrote:
> Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
> it is cheaper than scanning a number of lists. To avoid synchronization
> overhead, counter deltas are maintained on a per-cpu basis and drained both
> periodically and when the delta is above a threshold. On large CPU systems,
> the difference between the estimated and real value of NR_FREE_PAGES can be
> very high. If the system is under both load and low memory, it's possible
> for watermarks to be breached. In extreme cases, the number of free pages
> can drop to 0 leading to the possibility of system livelock.
> 
> This patch introduces zone_nr_free_pages() to take a slightly more accurate
> estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
> and may result in cache line bounces but is expected to be lighter than the
> IPI calls necessary to continually drain the per-cpu counters while kswapd
> is awake.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

And the second I sent this, I realised I had sent a slightly old version
that missed a compile-fix :(

==== CUT HERE ====
mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can be
very high. If the system is under both load and low memory, it's possible
for watermarks to be breached. In extreme cases, the number of free pages
can drop to 0 leading to the possibility of system livelock.

This patch introduces zone_nr_free_pages() to take a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    9 +++++++++
 mm/mmzone.c            |   27 +++++++++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |    5 ++++-
 4 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..1df3c43 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -456,6 +463,8 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+unsigned long zone_nr_free_pages(struct zone *zone);
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..056e374 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,30 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, it is considered the zone is under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * er-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset;
+
+			pset = per_cpu_ptr(zone->pageset, cpu);
+			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
+		}
+	}
+
+	return nr_free_pages;
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2407a4..67a2ed0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2413,7 +2413,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..c95a159 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		zone->percpu_drift_mark = high_wmark_pages(zone) +
+					num_online_cpus() * threshold;
 	}
 }
 
@@ -813,7 +816,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1


* [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-16  9:42 [RFC PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator Mel Gorman
@ 2010-08-16  9:42 ` Mel Gorman
  2010-08-16  9:43   ` Mel Gorman
  2010-08-18  2:59   ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 99+ messages in thread
From: Mel Gorman @ 2010-08-16  9:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

Ordinarily watermark checks are made based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can be
very high. If the system is under both load and low memory, it's possible
for watermarks to be breached. In extreme cases, the number of free pages
can drop to 0 leading to the possibility of system livelock.

This patch introduces zone_nr_free_pages() to take a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake.  The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    9 +++++++++
 mm/mmzone.c            |   27 +++++++++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |    5 ++++-
 4 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..1df3c43 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -456,6 +463,8 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+unsigned long zone_nr_free_pages(struct zone *zone);
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..89842ec 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,30 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, it is considered the zone is under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * er-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (free < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset;
+
+			pset = per_cpu_ptr(zone->pageset, cpu);
+			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
+		}
+	}
+
+	return nr_free_pages;
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2407a4..67a2ed0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2413,7 +2413,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..c95a159 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -143,6 +143,9 @@ static void refresh_zone_stat_thresholds(void)
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		zone->percpu_drift_mark = high_wmark_pages(zone) +
+					num_online_cpus() * threshold;
 	}
 }
 
@@ -813,7 +816,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1

Thread overview: 99+ messages
2010-08-31 17:37 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V3 Mel Gorman
2010-08-31 17:37 ` [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list Mel Gorman
2010-08-31 18:17   ` Christoph Lameter
2010-09-01  7:10     ` Mel Gorman
2010-08-31 23:27   ` KOSAKI Motohiro
2010-08-31 17:37 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman
2010-08-31 18:20   ` Christoph Lameter
2010-08-31 23:37   ` KOSAKI Motohiro
2010-09-01  7:24     ` Mel Gorman
2010-09-01  7:33       ` KOSAKI Motohiro
2010-09-01 20:16         ` Christoph Lameter
2010-09-01 20:34           ` Mel Gorman
2010-09-02  0:24             ` Christoph Lameter
2010-09-02  0:26               ` KOSAKI Motohiro
2010-09-02  0:39                 ` Christoph Lameter
2010-09-02  0:54                   ` Christoph Lameter
2010-09-02  0:43   ` Christoph Lameter
2010-09-02  0:49     ` KOSAKI Motohiro
2010-09-02  8:51     ` Mel Gorman
2010-08-31 17:37 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
2010-08-31 18:26   ` Christoph Lameter
  -- strict thread matches above, loose matches on Subject: below --
2010-09-03  9:08 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4 Mel Gorman
2010-09-03  9:08 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman
2010-09-03 22:55   ` Andrew Morton
2010-09-03 23:17     ` Christoph Lameter
2010-09-03 23:28       ` Andrew Morton
2010-09-04  0:54         ` Christoph Lameter
2010-09-05 18:12     ` Mel Gorman
2010-08-23  8:00 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2 Mel Gorman
2010-08-23  8:00 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman
2010-08-23 12:56   ` Christoph Lameter
2010-08-23 13:03     ` Mel Gorman
2010-08-23 13:41       ` Christoph Lameter
2010-08-23 13:55         ` Mel Gorman
2010-08-23 16:04           ` Christoph Lameter
2010-08-23 16:13             ` Mel Gorman
2010-08-16  9:42 [RFC PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator Mel Gorman
2010-08-16  9:42 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman
2010-08-16  9:43   ` Mel Gorman
2010-08-16 14:47     ` Rik van Riel
2010-08-16 16:06     ` Johannes Weiner
2010-08-17  2:26       ` Minchan Kim
2010-08-17 10:42         ` Mel Gorman
2010-08-17 15:01           ` Minchan Kim
2010-08-17 15:05             ` Mel Gorman
2010-08-17 10:16       ` Mel Gorman
2010-08-17 11:05         ` Johannes Weiner
2010-08-17 14:20         ` Minchan Kim
2010-08-18  8:51           ` Mel Gorman
2010-08-18 14:57             ` Minchan Kim
2010-08-19  8:06               ` Mel Gorman
2010-08-19 10:33                 ` Minchan Kim
2010-08-19 10:38                   ` Mel Gorman
2010-08-19 14:01                     ` Minchan Kim
2010-08-19 14:09                       ` Mel Gorman
2010-08-19 14:34                         ` Minchan Kim
2010-08-19 15:07                           ` Mel Gorman
2010-08-19 15:22                             ` Minchan Kim
2010-08-19 15:40                               ` Mel Gorman
2010-08-19 15:44                                 ` Minchan Kim
2010-08-19 15:46     ` Minchan Kim
2010-08-19 16:06       ` Mel Gorman
2010-08-19 16:45         ` Minchan Kim
2010-08-18  2:59   ` KAMEZAWA Hiroyuki
2010-08-18 15:55     ` Christoph Lameter
2010-08-19  0:07       ` KAMEZAWA Hiroyuki
2010-08-19 19:00         ` Christoph Lameter
2010-08-19 23:49           ` KAMEZAWA Hiroyuki
