* [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2
From: Mel Gorman @ 2010-08-23  8:00 UTC
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

Changelog since V1
  o Fix for !CONFIG_SMP
  o Correct spelling mistakes
  o Clarify a ChangeLog
  o Only check for counter drift on machines where the drift is large
    enough to breach the min watermark while NR_FREE_PAGES reports that
    the low watermark is fine

Internal IBM test teams beta testing distribution kernels have reported
problems on machines with a large number of CPUs whereby page allocator
failure messages show huge differences between the nr_free_pages vmstat
counter and what is actually available on the buddy lists. In an extreme
example, nr_free_pages was above the min watermark but zero pages were on
the buddy lists, allowing the system to potentially livelock, unable to
make forward progress unless an allocation succeeds. There is no reason
why the problems would not affect mainline, so the following series
mitigates the problems in the page allocator related to per-cpu counter
drift and per-cpu lists.
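
To give a rough sense of scale (an illustration, not a figure from the
reports themselves): the per-cpu stat threshold computed by
calculate_threshold() in mm/vmstat.c is capped at 125, and each online CPU
may hold an unflushed delta of up to that threshold, so

    worst-case NR_FREE_PAGES drift ~= num_online_cpus() * threshold
    e.g. 128 CPUs * 125 pages = 16000 pages, about 62MB with 4K pages

which can comfortably exceed the gap between the low and min watermarks
of a zone.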

The first patch ensures that counters are updated after pages are added to
free lists.

The second patch notes that the counter drift between nr_free_pages and
what is actually on the per-cpu lists can be very high. When memory is low
and kswapd is awake, the per-cpu deltas are read as well as the value of
NR_FREE_PAGES. This slows the page allocator while memory is low and
kswapd is awake, but it makes it much harder to breach the min watermark
and potentially livelock the system.

The third patch notes that, after direct reclaim, an allocation can
fail because the necessary pages are on the per-cpu lists. After a
direct-reclaim-and-allocation-failure, the per-cpu lists are drained and
a second attempt is made.

Performance tests did not show anything interesting. A version of this
series that continually called vmstat_update() when memory was low was
tested internally and found to help the counter-drift problem. I described
this at the LSF/MM Summit and the potential for IPI storms was frowned
upon. An alternative fix is in patch two, which uses for_each_online_cpu()
to read the vmstat deltas while memory is low and kswapd is awake. This
should be functionally similar.

This series should be merged after the patch "vmstat: update
zone stat threshold at onlining a cpu", which is in mmotm as
vmstat-update-zone-stat-threshold-when-onlining-a-cpu.patch.

Are there any objections to merging?

 include/linux/mmzone.h |   13 +++++++++++++
 mm/mmzone.c            |   29 +++++++++++++++++++++++++++++
 mm/page_alloc.c        |   29 +++++++++++++++++++++--------
 mm/vmstat.c            |   15 ++++++++++++++-
 4 files changed, 77 insertions(+), 9 deletions(-)


* [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
From: Mel Gorman @ 2010-08-23  8:00 UTC
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When allocating a page, the system uses NR_FREE_PAGES counters to determine
if watermarks would remain intact after the allocation was made. This
check is made without interrupts disabled or the zone lock held and so is
race-prone by nature. Unfortunately, when pages are being freed in batch,
the counters are updated before the pages are added to the list. During this
window, the counters are misleading as the pages do not exist yet. When
under significant pressure on systems with large numbers of CPUs, it's
possible for processes to make progress even though they should have been
stalled. This is particularly problematic if a number of the processes are
using GFP_ATOMIC as the min watermark can be accidentally breached and in
extreme cases, the system can livelock.

This patch updates the counters after the pages have been added to the
list. This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.
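
To make the window concrete, a simplified interleaving (an illustration of
the race described above, not a captured trace):

    CPU 0 (free_pcppages_bulk)        CPU 1 (allocation path)
    __mod_zone_page_state(zone,
        NR_FREE_PAGES, count)
                                      zone_watermark_ok() reads the
                                      inflated NR_FREE_PAGES and passes,
                                      but the pages are not on any
                                      free list yet
    pages added to the buddy lists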

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9649f4..97d74a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,12 +588,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
+	int freed = count;
 
 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
 	while (count) {
 		struct page *page;
 		struct list_head *list;
@@ -621,6 +621,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			trace_mm_page_pcpu_drain(page, 0, page_private(page));
 		} while (--count && --batch_free && !list_empty(list));
 	}
+	__mod_zone_page_state(zone, NR_FREE_PAGES, freed);
 	spin_unlock(&zone->lock);
 }
 
@@ -631,8 +632,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	__free_one_page(page, zone, order, migratetype);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	spin_unlock(&zone->lock);
 }
 
-- 
1.7.1



* [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
From: Mel Gorman @ 2010-08-23  8:00 UTC
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can be
very high. If NR_FREE_PAGES is much higher than the number of pages
actually free on the buddy lists, the VM can allocate pages below the min
watermark, at worst reducing the real number of free pages to zero. Even
if the OOM killer kills a victim to free memory, no memory may be freed
if the exit path itself requires a new page, resulting in livelock.

This patch introduces zone_nr_free_pages() to take a slightly more accurate
estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
and may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |   13 +++++++++++++
 mm/mmzone.c            |   29 +++++++++++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |   15 ++++++++++++++-
 4 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..3984c4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -441,6 +448,12 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..69ecbe9 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,32 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, it is considered that the zone is under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * per-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset;
+
+			pset = per_cpu_ptr(zone->pageset, cpu);
+			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
+		}
+	}
+
+	return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97d74a0..bbaa959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages may go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2424,7 +2424,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..696cab2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -138,11 +138,24 @@ static void refresh_zone_stat_thresholds(void)
 	int threshold;
 
 	for_each_populated_zone(zone) {
+		unsigned long max_drift, tolerate_drift;
+
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		/*
+		 * Only set percpu_drift_mark if there is a danger that
+		 * NR_FREE_PAGES reports the low watermark is ok when in fact
+		 * the min watermark could be breached by an allocation
+		 */
+		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+		max_drift = num_online_cpus() * threshold;
+		if (max_drift > tolerate_drift)
+			zone->percpu_drift_mark = high_wmark_pages(zone) +
+					max_drift;
 	}
 }
 
@@ -813,7 +826,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1



* [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
From: Mel Gorman @ 2010-08-23  8:00 UTC
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should, increasing the pressure on the system and compounding the problem.

This patch notes that if direct reclaim is making progress but
allocations are still failing, the system is already under heavy
pressure. In this case, it drains the per-cpu lists and tries the
allocation a second time before continuing.
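
For reference, the drain itself is not free. In kernels of this era,
drain_all_pages() sends an IPI to every CPU to spill its pcp lists back
into the buddy allocator (condensed below), which is why the retry drains
only once and only after reclaim has already made progress:

	/*
	 * Spill all the per-cpu pages from all CPUs back into the
	 * buddy allocator (mm/page_alloc.c, condensed)
	 */
	void drain_all_pages(void)
	{
		on_each_cpu(drain_local_pages, NULL, 1);
	}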

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page_alloc.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbaa959..750e1dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	bool drained = false;
 
 	cond_resched();
 
@@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
-	if (order != 0)
-		drain_all_pages();
+	if (unlikely(!(*did_some_progress)))
+		return NULL;
 
-	if (likely(*did_some_progress))
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
+retry:
+	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	/*
+	 * If an allocation failed after direct reclaim, it could be because
+	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 */
+	if (!page && !drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	return page;
 }
 
-- 
1.7.1



* Re: [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2
From: Christoph Lameter @ 2010-08-23 12:45 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, 23 Aug 2010, Mel Gorman wrote:

> Internal IBM test teams beta testing distribution kernels have reported
> problems on machines with a large number of CPUs whereby page allocator
> failure messages show huge differences between the nr_free_pages vmstat
> counter and what is actually available on the buddy lists. In an extreme
> example, nr_free_pages was above the min watermark but zero pages were on
> the buddy lists, allowing the system to potentially livelock, unable to
> make forward progress unless an allocation succeeds. There is no reason
> why the problems would not affect mainline, so the following series
> mitigates the problems in the page allocator related to per-cpu counter
> drift and per-cpu lists.

The maximum time for which the livelock can exist is the vm stat
interval. By default the counters are brought up to date at least once per
second, or sooner when a certain delta is exceeded. Drifts are controlled
by the delta configuration.
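
[For context, the mechanism referred to is the per-cpu vmstat work item.
Condensed from mm/vmstat.c of this era, roughly:

	static void vmstat_update(struct work_struct *w)
	{
		/* fold this CPU's vm_stat_diff deltas into the zone counters */
		refresh_cpu_vm_stats(smp_processor_id());
		/* re-arm; sysctl_stat_interval defaults to one second */
		schedule_delayed_work(&__get_cpu_var(vmstat_work),
			round_jiffies_relative(sysctl_stat_interval));
	}

so a per-cpu delta can go unflushed for up to one stat interval.]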



* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
From: Christoph Lameter @ 2010-08-23 12:47 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro


Reviewed-by: Christoph Lameter <cl@linux.com>




* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
From: Christoph Lameter @ 2010-08-23 12:56 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, 23 Aug 2010, Mel Gorman wrote:

> This patch introduces zone_nr_free_pages() to take a slightly more accurate
> estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
> and may result in cache line bounces but is expected to be lighter than the
> IPI calls necessary to continually drain the per-cpu counters while kswapd
> is awake.

The delta of the counters could also be reduced to increase accuracy.
See refresh_zone_stat_thresholds().
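
[For context, the threshold under discussion comes from
calculate_threshold() in mm/vmstat.c; condensed from this era's source:

	static int calculate_threshold(struct zone *zone)
	{
		int mem;	/* zone memory in units of 128MB */

		mem = zone->present_pages >> (27 - PAGE_SHIFT);
		return min(125, 2 * fls(num_online_cpus()) * (1 + fls(mem)));
	}

Reducing it shrinks the worst-case drift of num_online_cpus() * threshold,
but makes every CPU flush its deltas, and hence contend on the zone
counter cachelines, more often.]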

Also would it be possible to add the summation function to vmstat? It may
be useful elsewhere.

A new function like

	zone_page_state_snapshot()

or so?
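
[One possible shape for such a helper, generalizing the
for_each_online_cpu() loop from patch 2 over an arbitrary counter; a
sketch of the suggestion, not an agreed interface:

	static inline unsigned long zone_page_state_snapshot(struct zone *zone,
						enum zone_stat_item item)
	{
		long x = zone_page_state(zone, item);
	#ifdef CONFIG_SMP
		int cpu;

		/* add in the unflushed per-cpu deltas for this item */
		for_each_online_cpu(cpu)
			x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

		if (x < 0)
			x = 0;
	#endif
		return x;
	}

The existing zone_nr_free_pages() from patch 2 would then become a user
of it.]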




* Re: [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2
From: Mel Gorman @ 2010-08-23 13:01 UTC
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 23, 2010 at 07:45:25AM -0500, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Mel Gorman wrote:
> 
> > Internal IBM test teams beta testing distribution kernels have reported
> > problems on machines with a large number of CPUs whereby page allocator
> > failure messages show huge differences between the nr_free_pages vmstat
> > counter and what is actually available on the buddy lists. In an extreme
> > example, nr_free_pages was above the min watermark but zero pages were on
> > the buddy lists, allowing the system to potentially livelock, unable to
> > make forward progress unless an allocation succeeds. There is no reason
> > why the problems would not affect mainline, so the following series
> > mitigates the problems in the page allocator related to per-cpu counter
> > drift and per-cpu lists.
> 
> The maximum time for which the livelock can exists is the vm stat
> interval. By default the counters are brought up to date at least once per
> second or if a certain delta was violated. Drifts are controlled by the
> delta configuration.
> 

While there is a maximum time (2 seconds I think) the drift can exist
in, a machine under enough pressure can make a mess of the watermarks
during that time. If it wasn't the case, these livelocks with 0 pages
free wouldn't be happening.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
From: Mel Gorman @ 2010-08-23 13:03 UTC
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 23, 2010 at 07:56:40AM -0500, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Mel Gorman wrote:
> 
> > This patch introduces zone_nr_free_pages() to take a slightly more accurate
> > estimate of NR_FREE_PAGES while kswapd is awake. The estimate is not perfect
> > and may result in cache line bounces but is expected to be lighter than the
> > IPI calls necessary to continually drain the per-cpu counters while kswapd
> > is awake.
> 
> The delta of the counters could also be reduced to increase accuracy.
> See refresh_zone_stat_thresholds().
> 

True, but I thought that would introduce a constant performance penalty
just to handle a corner case, which I didn't like.

> Also would it be possible to add the summation function to vmstat? It may
> be useful elsewhere.
> 
> A new function like
> 
> 	zone_page_state_snapshot()
> 
> or so?
> 

We could if there is another counter that results in bad system
behaviour due to counter drift. As NR_FREE_PAGES seemed to be the only
one, zone_nr_free_pages() seemed adequate. If such a helper did exist,
zone_nr_free_pages() would be a simple wrapper around it. The
indirection didn't seem necessary at this point though.
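
[Concretely, with the zone_page_state_snapshot() helper sketched earlier
in the thread, such a wrapper might look like this (hypothetical,
following the remark above):

	unsigned long zone_nr_free_pages(struct zone *zone)
	{
		unsigned long nr = zone_page_state(zone, NR_FREE_PAGES);

		/*
		 * Only pay the per-cpu summation cost in the low-memory,
		 * kswapd-awake window, as in patch 2
		 */
		if (nr < zone->percpu_drift_mark &&
				!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
			nr = zone_page_state_snapshot(zone, NR_FREE_PAGES);

		return nr;
	}

keeping the fast path identical while sharing the summation logic.]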

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2
From: Christoph Lameter @ 2010-08-23 13:38 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, 23 Aug 2010, Mel Gorman wrote:

> > The maximum time for which the livelock can exist is the vm stat
> > interval. By default the counters are brought up to date at least once per
> > second, or sooner when a certain delta is exceeded. Drifts are controlled
> > by the delta configuration.
> >
>
> While there is a maximum time (2 seconds I think) the drift can exist
> in, a machine under enough pressure can make a mess of the watermarks
> during that time. If it wasn't the case, these livelocks with 0 pages
> free wouldn't be happening.

So because we go way beyond the watermarks we reach a state in which a
livelock exists that does not go away when the counters are finally
updated?




* Re: [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2
From: Mel Gorman @ 2010-08-23 13:39 UTC
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 23, 2010 at 08:38:25AM -0500, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Mel Gorman wrote:
> 
> > > The maximum time for which the livelock can exist is the vm stat
> > > interval. By default the counters are brought up to date at least once per
> > > second, or sooner when a certain delta is exceeded. Drifts are controlled
> > > by the delta configuration.
> > >
> >
> > While there is a maximum time (2 seconds I think) the drift can exist
> > in, a machine under enough pressure can make a mess of the watermarks
> > during that time. If it wasn't the case, these livelocks with 0 pages
> > free wouldn't be happening.
> 
> So because we go way beyond the watermarks we reach a state in which a
> livelock exists that does not go away when the counters are finally
> updated?
> 

That appears to be the case. The system has already gotten into a state
where there are 0 pages free. Just because the NR_FREE_PAGES counter
gets updated to reflect the accurate count of 0 does not mean the system
can recover from it.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
From: Christoph Lameter @ 2010-08-23 13:41 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, 23 Aug 2010, Mel Gorman wrote:

> > The delta of the counters could also be reduced to increase accuracy.
> > See refresh_zone_stat_thresholds().
> True, but I thought that would introduce a constant performance penalty
> just to handle a corner case, which I didn't like.

Sure, an increased frequency of updates would increase the chance of
bouncing cachelines. But the bouncing cacheline scenario for the vm
counters was tuned for applications that continually allocate pages in
parallel.

When the vm gets into a state where continual reclaim is necessary then
the counters are not that frequently updated. If the machine is already
slowing down due to reclaim then the vm can likely afford more frequent
counter updates.


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
From: Mel Gorman @ 2010-08-23 13:55 UTC
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 23, 2010 at 08:41:56AM -0500, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Mel Gorman wrote:
> 
> > > The delta of the counters could also be reduced to increase accuracy.
> > > See refresh_zone_stat_thresholds().
> > True, but I thought that would introduce a constant performance penalty
> > just to handle a corner case, which I didn't like.
> 
> Sure, an increased frequency of updates would increase the chance of
> bouncing cachelines. But the bouncing cacheline scenario for the vm
> counters was tuned for applications that continually allocate pages in
> parallel.
> 
> When the vm gets into a state where continual reclaim is necessary then
> the counters are not that frequently updated. If the machine is already
> slowing down due to reclaim then the vm can likely afford more frequent
> counter updates.
> 

Ok, but is that better than this patch? Decreasing the size of the window by
reducing the threshold still leaves a window. There is still a small amount
of drift even when summing all the deltas, but you get a much more
accurate count at the point in time when it matters.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-23 13:55           ` Mel Gorman
@ 2010-08-23 16:04             ` Christoph Lameter
  -1 siblings, 0 replies; 32+ messages in thread
From: Christoph Lameter @ 2010-08-23 16:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, 23 Aug 2010, Mel Gorman wrote:

> > When the vm gets into a state where continual reclaim is necessary then
> > the counters are not that frequently updated. If the machine is already
> > slowing down due to reclaim then the vm can likely afford more frequent
> > counter updates.
> >
>
> Ok, but is that better than this patch? Decreasing the size of the window by
> reducing the threshold still leaves a window. There is still a small amount
> of drift from summing up all the deltas, but you get a much more accurate count
> at the point in time when it is important to know.

In order to make that decision we would need to know what deltas make a
significant difference. It would also be important to know if there are any
other counters that have issues. If so, then reducing the
thresholds would address these problems in a number of counters.

I have no objection to this approach here, but it may just be a band-aid
on a larger issue that could be approached in a cleaner way.
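
For scale, the per-cpu threshold under discussion grows with both the
number of CPUs and the zone size, roughly as follows (a sketch patterned
on mm/vmstat.c's calculate_threshold() of this era; details may differ):

/*
 * Sketch: the threshold scales with CPU count and zone memory and is
 * capped so the per-cpu delta still fits in an s8. Worst-case drift
 * on a zone counter is therefore about threshold * num_online_cpus().
 */
static int calculate_threshold(struct zone *zone)
{
	int threshold;
	int mem;	/* memory in 128 MB units */

	mem = zone->present_pages >> (27 - PAGE_SHIFT);

	threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));

	/* Maximum threshold is 125, keeping the s8 delta in range */
	return min(125, threshold);
}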


^ permalink raw reply	[flat|nested] 32+ messages in thread


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-08-23 16:04             ` Christoph Lameter
@ 2010-08-23 16:13               ` Mel Gorman
  -1 siblings, 0 replies; 32+ messages in thread
From: Mel Gorman @ 2010-08-23 16:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 23, 2010 at 11:04:38AM -0500, Christoph Lameter wrote:
> On Mon, 23 Aug 2010, Mel Gorman wrote:
> 
> > > When the vm gets into a state where continual reclaim is necessary then
> > > the counters are not that frequently updated. If the machine is already
> > > slowing down due to reclaim then the vm can likely afford more frequent
> > > counter updates.
> > >
> >
> > Ok, but is that better than this patch? Decreasing the size of the window by
> > reducing the threshold still leaves a window. There is still a small amount
> > of drift from summing up all the deltas, but you get a much more accurate count
> > at the point in time when it is important to know.
> 
> In order to make that decision we would need to know what deltas make a
> significant difference.

A delta on NR_FREE_PAGES is the obvious problem. The page allocation
failure report I saw clearly stated that free was a value above the min
watermark, whereas the buddy lists just as clearly showed that the number
of pages on the lists was 0.
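
The approach in this patch is to pay for the accurate count only when a
wrong answer could matter, along these lines (a sketch based on patch 2;
percpu_drift_mark and the kswapd check are as proposed there):

/*
 * Guarded read of NR_FREE_PAGES: use the cheap counter unless it has
 * dropped below the potential-drift mark while kswapd is awake (an
 * empty kswapd_wait queue means kswapd is not sleeping), in which
 * case fall back to the expensive per-cpu summation.
 */
unsigned long zone_nr_free_pages(struct zone *zone)
{
	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);

	if (nr_free_pages < zone->percpu_drift_mark &&
			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
		nr_free_pages = zone_page_state_snapshot(zone, NR_FREE_PAGES);

	return nr_free_pages;
}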

> It would also be important to know if there are any
> other counters that have issues.

I am not aware of similar issues with another counter where drift causes
the system to make the wrong decision, are you?

> If so, then reducing the
> thresholds would address these problems in a number of counters.
> 
> I have no objection to this approach here, but it may just be a band-aid
> on a larger issue that could be approached in a cleaner way.
> 

Unfortunately, I do not have access to a machine large enough to investigate
this area. All I have to go on is a few bug reports showing the delta
problem with NR_FREE_PAGES, and test results from a patch functionally
similar to this one showing that the livelock problem went away.

At best, all we can do is keep an eye out for problems on large machines
that could be explained by counter drift. If such a bug is found by a
reporter with regular access to the machine for test kernels, we can
investigate whether reducing the thresholds fixes the problem without
affecting general performance.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 32+ messages in thread


* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-23  8:00   ` Mel Gorman
@ 2010-08-23 23:17     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 32+ messages in thread
From: KOSAKI Motohiro @ 2010-08-23 23:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki

> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should, increasing the pressure on the system and compounding the problem.
> 
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/page_alloc.c |   20 ++++++++++++++++----
>  1 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bbaa959..750e1dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	struct page *page = NULL;
>  	struct reclaim_state reclaim_state;
>  	struct task_struct *p = current;
> +	bool drained = false;
>  
>  	cond_resched();
>  
> @@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  
>  	cond_resched();
>  
> -	if (order != 0)
> -		drain_all_pages();
> +	if (unlikely(!(*did_some_progress)))
> +		return NULL;
>  
> -	if (likely(*did_some_progress))
> -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> +retry:
> +	page = get_page_from_freelist(gfp_mask, nodemask, order,
>  					zonelist, high_zoneidx,
>  					alloc_flags, preferred_zone,
>  					migratetype);
> +
> +	/*
> +	 * If an allocation failed after direct reclaim, it could be because
> +	 * pages are pinned on the per-cpu lists. Drain them and try again
> +	 */
> +	if (!page && !drained) {
> +		drain_all_pages();
> +		drained = true;
> +		goto retry;
> +	}
> +
>  	return page;

I haven't read all of this patch series (iow, this mail is luckily on top
of my mailbox now), but at least I think this one is correct and good.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread


end of thread, other threads:[~2010-08-23 23:17 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-23  8:00 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2 Mel Gorman
2010-08-23  8:00 ` Mel Gorman
2010-08-23  8:00 ` [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list Mel Gorman
2010-08-23  8:00   ` Mel Gorman
2010-08-23 12:47   ` Christoph Lameter
2010-08-23 12:47     ` Christoph Lameter
2010-08-23  8:00 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman
2010-08-23  8:00   ` Mel Gorman
2010-08-23 12:56   ` Christoph Lameter
2010-08-23 12:56     ` Christoph Lameter
2010-08-23 13:03     ` Mel Gorman
2010-08-23 13:03       ` Mel Gorman
2010-08-23 13:41       ` Christoph Lameter
2010-08-23 13:41         ` Christoph Lameter
2010-08-23 13:55         ` Mel Gorman
2010-08-23 13:55           ` Mel Gorman
2010-08-23 16:04           ` Christoph Lameter
2010-08-23 16:04             ` Christoph Lameter
2010-08-23 16:13             ` Mel Gorman
2010-08-23 16:13               ` Mel Gorman
2010-08-23  8:00 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
2010-08-23  8:00   ` Mel Gorman
2010-08-23 23:17   ` KOSAKI Motohiro
2010-08-23 23:17     ` KOSAKI Motohiro
2010-08-23 12:45 ` [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2 Christoph Lameter
2010-08-23 12:45   ` Christoph Lameter
2010-08-23 13:01   ` Mel Gorman
2010-08-23 13:01     ` Mel Gorman
2010-08-23 13:38     ` Christoph Lameter
2010-08-23 13:38       ` Christoph Lameter
2010-08-23 13:39       ` Mel Gorman
2010-08-23 13:39         ` Mel Gorman
