* [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4
@ 2010-09-03  9:08 ` Mel Gorman
  0 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-03  9:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

The noteworthy change is to patch 2, which now uses the generic
zone_page_state_snapshot() in zone_nr_free_pages(). Similar logic still
applies for *when* zone_page_state_snapshot() is used, to avoid overhead.

Changelog since V3
  o Use generic helper for NR_FREE_PAGES estimate when necessary

Changelog since V2
  o Minor clarifications
  o Rebase to 2.6.36-rc3

Changelog since V1
  o Fix for !CONFIG_SMP
  o Correct spelling mistakes
  o Clarify a ChangeLog
  o Only check for counter drift on machines large enough for the counter
    drift to breach the min watermark when NR_FREE_PAGES reports that the low
    watermark is fine

Internal IBM test teams beta testing distribution kernels have reported
problems on machines with a large number of CPUs whereby page allocator
failure messages show huge differences between the nr_free_pages vmstat
counter and what is available on the buddy lists. In an extreme example,
nr_free_pages was above the min watermark but zero pages were on the buddy
lists, allowing the system to potentially livelock, unable to make forward
progress unless an allocation succeeds. There is no reason why the problems
would not affect mainline, so the following series mitigates the problems
in the page allocator related to per-cpu counter drift and lists.

The first patch ensures that counters are updated after pages are added to
free lists.

The second patch notes that the counter drift between nr_free_pages and what
is on the per-cpu lists can be very high. When memory is low and kswapd
is awake, the per-cpu counters are checked as well as reading the value
of NR_FREE_PAGES. This will slow the page allocator when memory is low and
kswapd is awake but it will be much harder to breach the min watermark and
potentially livelock the system.
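
As a rough illustration of the scale involved (assumed figures for
illustration, not taken from the bug reports): the per-cpu stat threshold is
clamped at 125 pages, so on a 64-CPU machine the worst-case drift guarded
against by patch 2 is

	max_drift = num_online_cpus() * threshold
	          = 64 * 125
	          = 8000 pages (roughly 31MB with 4K pages)

which can easily exceed the gap between the low and min watermarks of a
smaller zone.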

The third patch notes that after direct-reclaim an allocation can
fail because the necessary pages are on the per-cpu lists. After a
direct-reclaim-and-allocation-failure, the per-cpu lists are drained and
a second attempt is made.

Performance tests against 2.6.36-rc3 did not show anything interesting. A
version of this series that continually called vmstat_update() when
memory was low was tested internally and found to help the counter drift
problem. I described this during LSF/MM Summit and the potential for IPI
storms was frowned upon. An alternative fix is in patch two which uses
for_each_online_cpu() to read the vmstat deltas while memory is low and
kswapd is awake. This should be functionally similar.

This patch should be merged after the patch "vmstat : update
zone stat threshold at onlining a cpu" which is in mmotm as
vmstat-update-zone-stat-threshold-when-onlining-a-cpu.patch .

If we can agree on it, this series is a stable candidate.

 include/linux/mmzone.h |   13 +++++++++++++
 include/linux/vmstat.h |   22 ++++++++++++++++++++++
 mm/mmzone.c            |   21 +++++++++++++++++++++
 mm/page_alloc.c        |   29 +++++++++++++++++++++--------
 mm/vmstat.c            |   15 ++++++++++++++-
 5 files changed, 91 insertions(+), 9 deletions(-)


* [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
  2010-09-03  9:08 ` Mel Gorman
@ 2010-09-03  9:08   ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-03  9:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When allocating a page, the system uses NR_FREE_PAGES counters to determine
if watermarks would remain intact after the allocation was made. This
check is made without interrupts disabled or the zone lock held and so is
race-prone by nature. Unfortunately, when pages are being freed in batch,
the counters are updated before the pages are added to the list. During this
window, the counters are misleading because the pages do not yet exist. When
under significant pressure on systems with large numbers of CPUs, it's
possible for processes to make progress even though they should have been
stalled. This is particularly problematic if a number of the processes are
using GFP_ATOMIC as the min watermark can be accidentally breached and in
extreme cases, the system can livelock.

This patch updates the counters after the pages have been added to the
list. This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9649f4..97d74a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,12 +588,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
+	int freed = count;
 
 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
 	while (count) {
 		struct page *page;
 		struct list_head *list;
@@ -621,6 +621,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			trace_mm_page_pcpu_drain(page, 0, page_private(page));
 		} while (--count && --batch_free && !list_empty(list));
 	}
+	__mod_zone_page_state(zone, NR_FREE_PAGES, freed);
 	spin_unlock(&zone->lock);
 }
 
@@ -631,8 +632,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
-	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	__free_one_page(page, zone, order, migratetype);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	spin_unlock(&zone->lock);
 }
 
-- 
1.7.1


* [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03  9:08 ` Mel Gorman
@ 2010-09-03  9:08   ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-03  9:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

From: Christoph Lameter <cl@linux.com>

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can
be very high. If NR_FREE_PAGES is much higher than the number of pages actually
free on the buddy lists, the VM can allocate pages below the min watermark, at
worst reducing the real number of free pages to zero. Even if the OOM killer
kills a victim to free memory, no memory may be freed if the exit path itself
requires a new page, resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate reading of an arbitrary vmstat
counter.
It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark
being accidentally broken.  The estimate is not perfect and may result
in cache line bounces but is expected to be lighter than the IPI calls
necessary to continually drain the per-cpu counters while kswapd is awake.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |   13 +++++++++++++
 include/linux/vmstat.h |   22 ++++++++++++++++++++++
 mm/mmzone.c            |   21 +++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |   15 ++++++++++++++-
 5 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..3984c4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -441,6 +448,12 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 7f43ccd..eaaea37 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -170,6 +170,28 @@ static inline unsigned long zone_page_state(struct zone *zone,
 	return x;
 }
 
+/*
+ * More accurate version that also considers the currently pending
+ * deltas. For that we need to loop over all cpus to find the current
+ * deltas. There is no synchronization so the result cannot be
+ * exactly accurate either.
+ */
+static inline unsigned long zone_page_state_snapshot(struct zone *zone,
+					enum zone_stat_item item)
+{
+	long x = atomic_long_read(&zone->vm_stat[item]);
+
+#ifdef CONFIG_SMP
+	int cpu;
+	for_each_online_cpu(cpu)
+		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
+
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
 extern unsigned long global_reclaimable_pages(void);
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..e35bfb8 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,24 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, it is considered the zone is under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * per-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
+		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+	return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97d74a0..bbaa959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2424,7 +2424,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..696cab2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -138,11 +138,24 @@ static void refresh_zone_stat_thresholds(void)
 	int threshold;
 
 	for_each_populated_zone(zone) {
+		unsigned long max_drift, tolerate_drift;
+
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		/*
+		 * Only set percpu_drift_mark if there is a danger that
+		 * NR_FREE_PAGES reports the low watermark is ok when in fact
+		 * the min watermark could be breached by an allocation
+		 */
+		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+		max_drift = num_online_cpus() * threshold;
+		if (max_drift > tolerate_drift)
+			zone->percpu_drift_mark = high_wmark_pages(zone) +
+					max_drift;
 	}
 }
 
@@ -813,7 +826,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.1


* [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-03  9:08 ` Mel Gorman
@ 2010-09-03  9:08   ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-03  9:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should, increasing the pressure on the system and compounding the problem.

This patch notes that if direct reclaim is making progress but allocations
are still failing, the system is already under heavy pressure. In this case,
it drains the per-cpu lists and tries the allocation a second time before
continuing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
---
 mm/page_alloc.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbaa959..750e1dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	bool drained = false;
 
 	cond_resched();
 
@@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
-	if (order != 0)
-		drain_all_pages();
+	if (unlikely(!(*did_some_progress)))
+		return NULL;
 
-	if (likely(*did_some_progress))
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
+retry:
+	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	/*
+	 * If an allocation failed after direct reclaim, it could be because
+	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 */
+	if (!page && !drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	return page;
 }
 
-- 
1.7.1


* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
  2010-09-03  9:08   ` Mel Gorman
@ 2010-09-03 22:38     ` Andrew Morton
  -1 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-09-03 22:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Fri,  3 Sep 2010 10:08:44 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -588,12 +588,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  {
>  	int migratetype = 0;
>  	int batch_free = 0;
> +	int freed = count;
>  
>  	spin_lock(&zone->lock);
>  	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
>  
> -	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
>  	while (count) {
>  		struct page *page;
>  		struct list_head *list;
> @@ -621,6 +621,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  			trace_mm_page_pcpu_drain(page, 0, page_private(page));
>  		} while (--count && --batch_free && !list_empty(list));
>  	}
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, freed);
>  	spin_unlock(&zone->lock);
>  }
>  

nit: this is why it's evil to modify the value of incoming args.  It's
nicer to leave them alone and treat them as const across the lifetime
of the callee.

Can I do this?

--- a/mm/page_alloc.c~mm-page-allocator-update-free-page-counters-after-pages-are-placed-on-the-free-list-fix
+++ a/mm/page_alloc.c
@@ -588,13 +588,13 @@ static void free_pcppages_bulk(struct zo
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	int freed = count;
+	int to_free = count;
 
 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
-	while (count) {
+	while (to_free) {
 		struct page *page;
 		struct list_head *list;
 
@@ -619,9 +619,9 @@ static void free_pcppages_bulk(struct zo
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, zone, 0, page_private(page));
 			trace_mm_page_pcpu_drain(page, 0, page_private(page));
-		} while (--count && --batch_free && !list_empty(list));
+		} while (--to_free && --batch_free && !list_empty(list));
 	}
-	__mod_zone_page_state(zone, NR_FREE_PAGES, freed);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
 	spin_unlock(&zone->lock);
 }
 
_


* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03  9:08   ` Mel Gorman
@ 2010-09-03 22:55     ` Andrew Morton
  -1 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-09-03 22:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Fri,  3 Sep 2010 10:08:45 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> From: Christoph Lameter <cl@linux.com>
> 
> Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
> it is cheaper than scanning a number of lists. To avoid synchronization
> overhead, counter deltas are maintained on a per-cpu basis and drained both
> periodically and when the delta is above a threshold. On large CPU systems,
> the difference between the estimated and real value of NR_FREE_PAGES can
> be very high. If NR_FREE_PAGES is much higher than the number of pages
> actually free on the buddy lists, the VM can allocate pages below the min
> watermark, at worst reducing the real number of free pages to zero. Even if
> the OOM killer kills a victim to free memory, no memory may be freed if the
> exit path itself requires a new page, resulting in livelock.
> 
> This patch introduces a zone_page_state_snapshot() function (courtesy of
> Christoph) that takes a slightly more accurate reading of an arbitrary vmstat counter.
> It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark
> being accidentally broken.  The estimate is not perfect and may result
> in cache line bounces but is expected to be lighter than the IPI calls
> necessary to continually drain the per-cpu counters while kswapd is awake.
> 

The "is kswapd awake" heuristic seems fairly hacky.  Can it be
improved, made more deterministic?  Exactly what state are we looking
for here?


> +/*
> + * More accurate version that also considers the currently pending
> + * deltas. For that we need to loop over all cpus to find the current
> + * deltas. There is no synchronization so the result cannot be
> + * exactly accurate either.
> + */
> +static inline unsigned long zone_page_state_snapshot(struct zone *zone,
> +					enum zone_stat_item item)
> +{
> +	long x = atomic_long_read(&zone->vm_stat[item]);
> +
> +#ifdef CONFIG_SMP
> +	int cpu;
> +	for_each_online_cpu(cpu)
> +		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
> +
> +	if (x < 0)
> +		x = 0;
> +#endif
> +	return x;
> +}

aka percpu_counter_sum()!

Can someone remind me why per_cpu_pageset went and reimplemented
percpu_counters rather than just using them?
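
For reference, a minimal sketch of the lib/percpu_counter interface being
alluded to above (signatures recalled from the 2.6.36 era and not checked
against this tree, so treat it as illustrative only):

	#include <linux/percpu_counter.h>

	static struct percpu_counter nr_free;
	s64 approx, exact;

	/* one global count plus per-cpu deltas; init signature may differ */
	percpu_counter_init(&nr_free, 0);

	/* cheap update: usually only touches this cpu's delta */
	percpu_counter_add(&nr_free, 1);

	/* fast read; may drift by roughly nr_cpus * batch, like NR_FREE_PAGES */
	approx = percpu_counter_read_positive(&nr_free);

	/* slow read; folds the per-cpu deltas back in, like zone_page_state_snapshot() */
	exact = percpu_counter_sum_positive(&nr_free);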

>  extern unsigned long global_reclaimable_pages(void);
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index f5b7d17..e35bfb8 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -87,3 +87,24 @@ int memmap_valid_within(unsigned long pfn,
>  	return 1;
>  }
>  #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> +
> +#ifdef CONFIG_SMP
> +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> +unsigned long zone_nr_free_pages(struct zone *zone)
> +{
> +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +
> +	/*
> +	 * While kswapd is awake, it is considered the zone is under some
> +	 * memory pressure. Under pressure, there is a risk that
> +	 * per-cpu-counter-drift will allow the min watermark to be breached
> +	 * potentially causing a live-lock. While kswapd is awake and
> +	 * free pages are low, get a better estimate for free pages
> +	 */
> +	if (nr_free_pages < zone->percpu_drift_mark &&
> +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> +		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +
> +	return nr_free_pages;
> +}

Is this really the best way of doing it?  The way we usually solve
this problem (and boy, was this bug a newbie mistake!) is:

	foo = percpu_counter_read(x);

	if (foo says something bad) {
		/* Bad stuff: let's get a more accurate foo */
		foo = percpu_counter_sum(x);
	}

	if (foo still says something bad)
		do_bad_thing();

In other words, don't do all this stuff with percpu_drift_mark and the
kswapd heuristic.  Just change zone_watermark_ok() to use the more
accurate read if it's about to return "no".
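
A rough sketch in C of what that would look like (illustrative only: the
__zone_watermark_ok() helper taking free_pages as an explicit argument is
assumed for the sketch and is not in the tree being patched):

	static bool zone_watermark_ok_accurate(struct zone *z, int order,
				unsigned long mark, int classzone_idx,
				int alloc_flags)
	{
		/* cheap, possibly drifted read first, as on the current fast path */
		long free_pages = zone_page_state(z, NR_FREE_PAGES);

		if (__zone_watermark_ok(z, order, mark, classzone_idx,
					alloc_flags, free_pages))
			return true;

		/*
		 * About to return "no": redo the check with the per-cpu
		 * deltas folded in before giving up.
		 * (__zone_watermark_ok() is the assumed helper noted above.)
		 */
		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
		return __zone_watermark_ok(z, order, mark, classzone_idx,
					   alloc_flags, free_pages);
	}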


* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-03  9:08   ` Mel Gorman
@ 2010-09-03 23:00     ` Andrew Morton
  -1 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-09-03 23:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Wu Fengguang, David Rientjes

On Fri,  3 Sep 2010 10:08:46 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
> 
> This patch notes that if direct reclaim is making progress but
> allocations are still failing, the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Christoph Lameter <cl@linux.com>
> ---
>  mm/page_alloc.c |   20 ++++++++++++++++----
>  1 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bbaa959..750e1dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	struct page *page = NULL;
>  	struct reclaim_state reclaim_state;
>  	struct task_struct *p = current;
> +	bool drained = false;
>  
>  	cond_resched();
>  
> @@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  
>  	cond_resched();
>  
> -	if (order != 0)
> -		drain_all_pages();
> +	if (unlikely(!(*did_some_progress)))
> +		return NULL;
>  
> -	if (likely(*did_some_progress))
> -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> +retry:
> +	page = get_page_from_freelist(gfp_mask, nodemask, order,
>  					zonelist, high_zoneidx,
>  					alloc_flags, preferred_zone,
>  					migratetype);
> +
> +	/*
> +	 * If an allocation failed after direct reclaim, it could be because
> +	 * pages are pinned on the per-cpu lists. Drain them and try again
> +	 */
> +	if (!page && !drained) {
> +		drain_all_pages();
> +		drained = true;
> +		goto retry;
> +	}
> +
>  	return page;
>  }

The patch looks reasonable.

But please take a look at the recent thread "mm: minute-long livelocks
in memory reclaim".  There, people are pointing fingers at that
drain_all_pages() call, suspecting that it's causing huge IPI storms.

Dave was going to test this theory but afaik hasn't yet done so.  It
would be nice to tie these threads together if poss?


* Re: [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4
  2010-09-03  9:08 ` Mel Gorman
@ 2010-09-03 23:05   ` Andrew Morton
  -1 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-09-03 23:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, stable, Dave Chinner

On Fri,  3 Sep 2010 10:08:43 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> The noteworthy change is to patch 2 which now uses the generic
> zone_page_state_snapshot() in zone_nr_free_pages(). Similar logic still
> applies for *when* zone_page_state_snapshot() is used, to avoid overhead.
> 
> Changelog since V3
>   o Use generic helper for NR_FREE_PAGES estimate when necessary
> 
> Changelog since V2
>   o Minor clarifications
>   o Rebase to 2.6.36-rc3
> 
> Changelog since V1
>   o Fix for !CONFIG_SMP
>   o Correct spelling mistakes
>   o Clarify a ChangeLog
>   o Only check for counter drift on machines large enough for the counter
>     drift to breach the min watermark when NR_FREE_PAGES report the low
>     watermark is fine
> 
> Internal IBM test teams beta testing distribution kernels have reported
> problems on machines with a large number of CPUs whereby page allocator
> failure messages show huge differences between the nr_free_pages vmstat
> counter and what is available on the buddy lists. In an extreme example,
> nr_free_pages was above the min watermark but zero pages were on the buddy
> lists allowing the system to potentially livelock unable to make forward
> progress unless an allocation succeeds. There is no reason why the problems
> would not affect mainline so the following series mitigates the problems
> in the page allocator related to per-cpu counter drift and lists.
> 
> The first patch ensures that counters are updated after pages are added to
> free lists.
> 
> The second patch notes that the counter drift between nr_free_pages and what
> is on the per-cpu lists can be very high. When memory is low and kswapd
> is awake, the per-cpu counters are checked as well as reading the value
> of NR_FREE_PAGES. This will slow the page allocator when memory is low and
> kswapd is awake but it will be much harder to breach the min watermark and
> potentially livelock the system.
> 
> The third patch notes that after direct-reclaim an allocation can
> fail because the necessary pages are on the per-cpu lists. After a
> direct-reclaim-and-allocation-failure, the per-cpu lists are drained and
> a second attempt is made.
> 
> Performance tests against 2.6.36-rc3 did not show up anything interesting. A
> version of this series that continually called vmstat_update() when
> memory was low was tested internally and found to help the counter drift
> problem. I described this during LSF/MM Summit and the potential for IPI
> storms was frowned upon. An alternative fix is in patch two which uses
> for_each_online_cpu() to read the vmstat deltas while memory is low and
> kswapd is awake. This should be functionally similar.
> 
> This patch should be merged after the patch "vmstat : update
> zone stat threshold at onlining a cpu" which is in mmotm as
> vmstat-update-zone-stat-threshold-when-onlining-a-cpu.patch .
> 
> If we can agree on it, this series is a stable candidate.

(cc stable@kernel.org)

>  include/linux/mmzone.h |   13 +++++++++++++
>  include/linux/vmstat.h |   22 ++++++++++++++++++++++
>  mm/mmzone.c            |   21 +++++++++++++++++++++
>  mm/page_alloc.c        |   29 +++++++++++++++++++++--------
>  mm/vmstat.c            |   15 ++++++++++++++-
>  5 files changed, 91 insertions(+), 9 deletions(-)

For the entire patch series I get

 include/linux/mmzone.h |   13 +++++++++++++
 include/linux/vmstat.h |   22 ++++++++++++++++++++++
 mm/mmzone.c            |   21 +++++++++++++++++++++
 mm/page_alloc.c        |   33 +++++++++++++++++++++++----------
 mm/vmstat.c            |   16 +++++++++++++++-
 5 files changed, 94 insertions(+), 11 deletions(-)

The patches do apply OK to 2.6.35.

Given the extent and the coreness of it all, it's a bit more than I'd
usually push at the -stable guys.  But I guess that if the patches fix
all the issues you've noted, as well as David's "minute-long livelocks
in memory reclaim" then yup, it's worth backporting it all.




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03 22:55     ` Andrew Morton
@ 2010-09-03 23:17       ` Christoph Lameter
  -1 siblings, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2010-09-03 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, 3 Sep 2010, Andrew Morton wrote:

> Can someone remind me why per_cpu_pageset went and reimplemented
> percpu_counters rather than just using them?

The vm counters are per zone and per cpu and have a flow from per cpu /
zone deltas to zone counters and then also into global counters.

> Is this really the best way of doing it?  The way we usually solve
> this problem (and boy, was this bug a newbie mistake!) is:
>
> 	foo = percpu_counter_read(x);
>
> 	if (foo says something bad) {
> 		/* Bad stuff: let's get a more accurate foo */
> 		foo = percpu_counter_sum(x);
> 	}
>
> 	if (foo still says something bad)
> 		do_bad_thing();
>
> In other words, don't do all this stuff with percpu_drift_mark and the
> kswapd heuristic.  Just change zone_watermark_ok() to use the more
> accurate read if it's about to return "no".

percpu counters must always be added up when their value is determined. We
cannot really afford that for the VM. Counters are always available
without looping over all cpus.

vm counters are continually kept up to date (but may have delta limited by
time and counter values).

This seems to be a special case here where Mel does not want to pay the
cost of bringing the counters up to date, nor reduce the delta/time limits
to get some more accuracy, but wants to take some sort of snapshot of the
whole situation for this particular case.
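
(For reference, the _snapshot helper referred to here folds the per-cpu
deltas into the zone counter on demand. A rough sketch of the shape it
takes in this series, not a verbatim copy of the patch:

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	/* add the per-cpu deltas not yet folded into the zone counter */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
#endif
	return x;
}

The normal zone_page_state() read stays cheap; only this slow path walks
the online CPUs.)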




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03 23:17       ` Christoph Lameter
@ 2010-09-03 23:28         ` Andrew Morton
  -1 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-09-03 23:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, 3 Sep 2010 18:17:46 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Fri, 3 Sep 2010, Andrew Morton wrote:
> 
> > Can someone remind me why per_cpu_pageset went and reimplemented
> > percpu_counters rather than just using them?
> 
> The vm counters are per zone and per cpu and have a flow from per cpu /
> zone deltas to zone counters and then also into global counters.

hm.  percpu counters would require overflow-time hooks to do that. 
Might be worth looking at.

> > Is this really the best way of doing it?  The way we usually solve
> > this problem (and boy, was this bug a newbie mistake!) is:
> >
> > 	foo = percpu_counter_read(x);
> >
> > 	if (foo says something bad) {
> > 		/* Bad stuff: let's get a more accurate foo */
> > 		foo = percpu_counter_sum(x);
> > 	}
> >
> > 	if (foo still says something bad)
> > 		do_bad_thing();
> >
> > In other words, don't do all this stuff with percpu_drift_mark and the
> > kswapd heuristic.  Just change zone_watermark_ok() to use the more
> > accurate read if it's about to return "no".
> 
> percpu counters must always be added up when their value is determined.

Nope.  That's the difference between percpu_counter_read() and
percpu_counter_sum().

> This seems to be a special case here where Mel does not want to pay the
> cost of bringing the counters up to date, nor reduce the delta/time limits
> to get some more accuracy, but wants to take some sort of snapshot of the
> whole situation for this particular case.

My suggestion didn't actually have anything to do with percpu_counters.
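
(For comparison, the series' percpu_drift_mark approach is the same "cheap
read first, expensive fold only when it matters" idea, hand-rolled on the
vmstat deltas. A rough sketch, not a verbatim copy of patch 2:

/* mm/mmzone.c */
unsigned long zone_nr_free_pages(struct zone *zone)
{
	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);

	/*
	 * Only pay for the per-cpu walk when the cheap counter is within
	 * drift range of the watermarks and kswapd is awake, i.e. memory
	 * is low enough for the drift to matter.
	 */
	if (nr_free_pages < zone->percpu_drift_mark &&
			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
		nr_free_pages = zone_page_state_snapshot(zone, NR_FREE_PAGES);

	return nr_free_pages;
}

with the watermark checks reading zone_nr_free_pages() instead of the raw
NR_FREE_PAGES counter.)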



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03 23:28         ` Andrew Morton
@ 2010-09-04  0:54           ` Christoph Lameter
  -1 siblings, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2010-09-04  0:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, 3 Sep 2010, Andrew Morton wrote:

> > percpu counters must always be added up when their value is determined.
>
> Nope.  That's the difference between percpu_counter_read() and
> percpu_counter_sum().

Hmmm... Okay, so you can fold them on demand. That is analogous to what
we do in the _snapshot function now.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-03 23:00     ` Andrew Morton
@ 2010-09-04  2:25       ` Dave Chinner
  -1 siblings, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2010-09-04  2:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Wu Fengguang, David Rientjes

On Fri, Sep 03, 2010 at 04:00:26PM -0700, Andrew Morton wrote:
> On Fri,  3 Sep 2010 10:08:46 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > When under significant memory pressure, a process enters direct reclaim
> > and immediately afterwards tries to allocate a page. If it fails and no
> > further progress is made, it's possible the system will go OOM. However,
> > on systems with large amounts of memory, it's possible that a significant
> > number of pages are on per-cpu lists and inaccessible to the calling
> > process. This leads to a process entering direct reclaim more often than
> > it should increasing the pressure on the system and compounding the problem.
> > 
> > This patch notes that if direct reclaim is making progress but
> > allocations are still failing that the system is already under heavy
> > pressure. In this case, it drains the per-cpu lists and tries the
> > allocation a second time before continuing.
....
> The patch looks reasonable.
> 
> But please take a look at the recent thread "mm: minute-long livelocks
> in memory reclaim".  There, people are pointing fingers at that
> drain_all_pages() call, suspecting that it's causing huge IPI storms.
> 
> Dave was going to test this theory but afaik hasn't yet done so.  It
> would be nice to tie these threads together if poss?

It's been my "next-thing-to-do" since David suggested I try it -
tracking down other problems has got in the way, though. I
just ran my test a couple of times through:

$ ./fs_mark -D 10000 -L 63 -S0 -n 100000 -s 0 \
	-d /mnt/scratch/0 -d /mnt/scratch/1 \
	-d /mnt/scratch/3 -d /mnt/scratch/2 \
	-d /mnt/scratch/4 -d /mnt/scratch/5 \
	-d /mnt/scratch/6 -d /mnt/scratch/7

To create millions of inodes in parallel on an 8p/4G RAM VM.
The filesystem is ~1.1TB XFS:

# mkfs.xfs -f -d agcount=16 /dev/vdb
meta-data=/dev/vdb               isize=256    agcount=16, agsize=16777216 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=268435456, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch

Performance prior to this patch was that each iteration resulted in
~65k files/s, with occasional peaks to 90k files/s, but frequent drops
to 45k files/s when reclaim ran to reclaim the inode caches. This load
ran permanently at 800% CPU usage.

Every so often (maybe once or twice per 50M inode create run) all 8
CPUs would remain pegged but the create rate would drop to zero for a
few seconds to a couple of minutes. That was the livelock issue I
reported.

With this patchset, I'm seeing a per-iteration average of ~77k
files/s, with only a couple of iterations dropping down to ~55k
files/s and a significant number above 90k/s. The runtime to 50M
inodes is down by ~30% and the average CPU usage across the run is
around 700%. IOWs, there is a significant gain in performance and a
significant drop in CPU usage. I've done two runs to 50m inodes,
and not seen any sign of a livelock, even for short periods of time.

Ah, spoke too soon - I let the second run keep going, and at ~68M
inodes it's just pegged all the CPUs and is pretty much completely
wedged. Serial console is not responding, I can't get a new login,
and the only thing responding that tells me the machine is alive is
the remote PCP monitoring. It's been stuck for 5 minutes .... and
now it is back. Here's what I saw:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-wedge-1.png

The livelock is at the right of the charts, where the top chart is
all red (system CPU time), and the other charts flat line to zero.

And according to fsmark:

     1     66400000            0      64554.2          7705926
     1     67200000            0      64836.1          7573013
<hang happened here>
     2     68000000            0      69472.8          7941399
     2     68800000            0      85017.5          7585203

it didn't record any change in performance, which means the livelock
probably occurred between iterations.  I couldn't get any info on
what caused the livelock this time so I can only assume it has the
same cause....

Still, given the improvements in performance from this patchset,
I'd say inclusion is a no-brainer....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-04  2:25       ` Dave Chinner
@ 2010-09-04  3:21         ` Andrew Morton
  -1 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-09-04  3:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Wu Fengguang, David Rientjes

On Sat, 4 Sep 2010 12:25:45 +1000 Dave Chinner <david@fromorbit.com> wrote:

> Still, given the improvements in performance from this patchset,
> I'd say inclusion is a no-brainer....

OK, thanks.

It'd be interesting to check the IPI frequency with and without -
/proc/interrupts "CAL" field.  Presumably it went down a lot.

I wouldn't bust a gut over it though :)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-04  2:25       ` Dave Chinner
@ 2010-09-04  3:23         ` Wu Fengguang
  -1 siblings, 0 replies; 104+ messages in thread
From: Wu Fengguang @ 2010-09-04  3:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Mel Gorman, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Sat, Sep 04, 2010 at 10:25:45AM +0800, Dave Chinner wrote:
> On Fri, Sep 03, 2010 at 04:00:26PM -0700, Andrew Morton wrote:
> > On Fri,  3 Sep 2010 10:08:46 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > When under significant memory pressure, a process enters direct reclaim
> > > and immediately afterwards tries to allocate a page. If it fails and no
> > > further progress is made, it's possible the system will go OOM. However,
> > > on systems with large amounts of memory, it's possible that a significant
> > > number of pages are on per-cpu lists and inaccessible to the calling
> > > process. This leads to a process entering direct reclaim more often than
> > > it should increasing the pressure on the system and compounding the problem.
> > > 
> > > This patch notes that if direct reclaim is making progress but
> > > allocations are still failing that the system is already under heavy
> > > pressure. In this case, it drains the per-cpu lists and tries the
> > > allocation a second time before continuing.
> ....
> > The patch looks reasonable.
> > 
> > But please take a look at the recent thread "mm: minute-long livelocks
> > in memory reclaim".  There, people are pointing fingers at that
> > drain_all_pages() call, suspecting that it's causing huge IPI storms.
> > 
> > Dave was going to test this theory but afaik hasn't yet done so.  It
> > would be nice to tie these threads together if poss?
> 
> It's been my "next-thing-to-do" since David suggested I try it -
> tracking down other problems has got in the way, though. I
> just ran my test a couple of times through:
> 
> $ ./fs_mark -D 10000 -L 63 -S0 -n 100000 -s 0 \
> 	-d /mnt/scratch/0 -d /mnt/scratch/1 \
> 	-d /mnt/scratch/3 -d /mnt/scratch/2 \
> 	-d /mnt/scratch/4 -d /mnt/scratch/5 \
> 	-d /mnt/scratch/6 -d /mnt/scratch/7
> 
> To create millions of inodes in parallel on an 8p/4G RAM VM.
> The filesystem is ~1.1TB XFS:
> 
> # mkfs.xfs -f -d agcount=16 /dev/vdb
> meta-data=/dev/vdb               isize=256    agcount=16, agsize=16777216 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=268435456, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> # mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch
> 
> Performance prior to this patch was that each iteration resulted in
> ~65k files/s, with occasional peaks to 90k files/s, but frequent drops
> to 45k files/s when reclaim ran to reclaim the inode caches. This load
> ran permanently at 800% CPU usage.
> 
> Every so often (maybe once or twice per 50M inode create run) all 8
> CPUs would remain pegged but the create rate would drop to zero for a
> few seconds to a couple of minutes. That was the livelock issue I
> reported.
> 
> With this patchset, I'm seeing a per-iteration average of ~77k
> files/s, with only a couple of iterations dropping down to ~55k
> files/s and a significant number above 90k/s. The runtime to 50M
> inodes is down by ~30% and the average CPU usage across the run is
> around 700%. IOWs, there is a significant gain in performance and a
> significant drop in CPU usage. I've done two runs to 50m inodes,
> and not seen any sign of a livelock, even for short periods of time.
> 
> Ah, spoke too soon - I let the second run keep going, and at ~68M
> inodes it's just pegged all the CPUs and is pretty much completely
> wedged. Serial console is not responding, I can't get a new login,
> and the only thing responding that tells me the machine is alive is
> the remote PCP monitoring. It's been stuck for 5 minutes .... and
> now it is back. Here's what I saw:
> 
> http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-wedge-1.png
> 
> The livelock is at the right of the charts, where the top chart is
> all red (system CPU time), and the other charts flat line to zero.
> 
> And according to fsmark:
> 
>      1     66400000            0      64554.2          7705926
>      1     67200000            0      64836.1          7573013
> <hang happened here>
>      2     68000000            0      69472.8          7941399
>      2     68800000            0      85017.5          7585203
> 
> it didn't record any change in performance, which means the livelock
> probably occurred between iterations.  I couldn't get any info on
> what caused the livelock this time so I can only assume it has the
> same cause....
> 
> Still, given the improvements in performance from this patchset,
> I'd say inclusion is a no-brainer....

In your case it's not really high memory pressure, but maybe too many
concurrent direct reclaimers, so that when one reclaimed some free
pages, others kick in and "steal" the free pages. So we need to kill
the second cond_resched() call (which effectively gives other tasks a
good chance to steal this task's vmscan fruits), and only do
drain_all_pages() when nothing was reclaimed (instead of allocated).

Dave, will you give a try of this patch? It's based on Mel's.

Thanks,
Fengguang
---

--- linux-next.orig/mm/page_alloc.c	2010-09-04 11:08:03.000000000 +0800
+++ linux-next/mm/page_alloc.c	2010-09-04 11:16:33.000000000 +0800
@@ -1850,6 +1850,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
 
 	cond_resched();
 
+retry:
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	p->flags |= PF_MEMALLOC;
@@ -1863,26 +1864,23 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
 	lockdep_clear_current_reclaim_state();
 	p->flags &= ~PF_MEMALLOC;
 
-	cond_resched();
-
-	if (unlikely(!(*did_some_progress)))
+	if (unlikely(!(*did_some_progress))) {
+		if (!drained) {
+			drain_all_pages();
+			drained = true;
+			goto retry;
+		}
 		return NULL;
+	}
 
-retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
 
-	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
-	 */
-	if (!page && !drained) {
-		drain_all_pages();
-		drained = true;
+	/* someone steal our vmscan fruits? */
+	if (!page && *did_some_progress)
 		goto retry;
-	}
 
 	return page;
 }

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-04  3:23         ` Wu Fengguang
@ 2010-09-04  3:59           ` Andrew Morton
  -1 siblings, 0 replies; 104+ messages in thread
From: Andrew Morton @ 2010-09-04  3:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Mel Gorman, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Sat, 4 Sep 2010 11:23:11 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> > Still, given the improvements in performance from this patchset,
> > I'd say inclusion is a no-brainer....
> 
> In your case it's not really high memory pressure, but maybe too many
> concurrent direct reclaimers, so that when one reclaimed some free
> pages, others kick in and "steal" the free pages. So we need to kill
> the second cond_resched() call (which effectively gives other tasks a
> good chance to steal this task's vmscan fruits), and only do
> drain_all_pages() when nothing was reclaimed (instead of allocated).

Well...  cond_resched() will only resched when this task has been
marked for preemption.  If that's happening at such a high frequency
then Something Is Up with the scheduler, and the reported context
switch rate will be high.
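
(A rough sketch of what cond_resched() amounts to in this era - the exact
helpers here are from memory, not quoted from the tree - the point being
that it is a no-op unless the task has already been flagged to reschedule:

/* kernel/sched.c, roughly */
int __sched _cond_resched(void)
{
	if (should_resched()) {		/* need_resched() set, preemption allowed */
		__cond_resched();	/* schedule() with PREEMPT_ACTIVE */
		return 1;
	}
	return 0;
}

so hitting this path often enough to matter would mean need_resched is
being set on these tasks at a very high rate.)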

> Dave, will you give a try of this patch? It's based on Mel's.
> 
> 
> --- linux-next.orig/mm/page_alloc.c	2010-09-04 11:08:03.000000000 +0800
> +++ linux-next/mm/page_alloc.c	2010-09-04 11:16:33.000000000 +0800
> @@ -1850,6 +1850,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>  
>  	cond_resched();
>  
> +retry:
>  	/* We now go into synchronous reclaim */
>  	cpuset_memory_pressure_bump();
>  	p->flags |= PF_MEMALLOC;
> @@ -1863,26 +1864,23 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>  	lockdep_clear_current_reclaim_state();
>  	p->flags &= ~PF_MEMALLOC;
>  
> -	cond_resched();
> -
> -	if (unlikely(!(*did_some_progress)))
> +	if (unlikely(!(*did_some_progress))) {
> +		if (!drained) {
> +			drain_all_pages();
> +			drained = true;
> +			goto retry;
> +		}
>  		return NULL;
> +	}
>  
> -retry:
>  	page = get_page_from_freelist(gfp_mask, nodemask, order,
>  					zonelist, high_zoneidx,
>  					alloc_flags, preferred_zone,
>  					migratetype);
>  
> -	/*
> -	 * If an allocation failed after direct reclaim, it could be because
> -	 * pages are pinned on the per-cpu lists. Drain them and try again
> -	 */
> -	if (!page && !drained) {
> -		drain_all_pages();
> -		drained = true;
> +	/* someone steal our vmscan fruits? */
> +	if (!page && *did_some_progress)
>  		goto retry;
> -	}

Perhaps the fruit-stealing event is worth adding to the
userspace-exposed vm stats somewhere.  But not in /proc - somewhere
more temporary, in debugfs.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-04  3:59           ` Andrew Morton
@ 2010-09-04  4:37             ` Wu Fengguang
  -1 siblings, 0 replies; 104+ messages in thread
From: Wu Fengguang @ 2010-09-04  4:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Mel Gorman, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Sat, Sep 04, 2010 at 11:59:45AM +0800, Andrew Morton wrote:
> On Sat, 4 Sep 2010 11:23:11 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > Still, given the improvements in performance from this patchset,
> > > I'd say inclusion is a no-brainer....
> > 
> > In your case it's not really high memory pressure, but maybe too many
> > concurrent direct reclaimers, so that when one reclaimed some free
> > pages, others kick in and "steal" the free pages. So we need to kill
> > the second cond_resched() call (which effectively gives other tasks a
> > good chance to steal this task's vmscan fruits), and only do
> > drain_all_pages() when nothing was reclaimed (instead of allocated).
> 
> Well...  cond_resched() will only resched when this task has been
> marked for preemption.  If that's happening at such a high frequency
> then Something Is Up with the scheduler, and the reported context
> switch rate will be high.

Yes, it may not necessarily schedule away. But if that ever happens,
the task will likely run into drain_all_pages() when it regains the
CPU. Because the drain_all_pages() cost is very high, it doesn't take
too many reschedules to create an IPI storm.

> > Dave, will you give a try of this patch? It's based on Mel's.
> > 
> > 
> > --- linux-next.orig/mm/page_alloc.c	2010-09-04 11:08:03.000000000 +0800
> > +++ linux-next/mm/page_alloc.c	2010-09-04 11:16:33.000000000 +0800
> > @@ -1850,6 +1850,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
> >  
> >  	cond_resched();
> >  
> > +retry:
> >  	/* We now go into synchronous reclaim */
> >  	cpuset_memory_pressure_bump();
> >  	p->flags |= PF_MEMALLOC;
> > @@ -1863,26 +1864,23 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
> >  	lockdep_clear_current_reclaim_state();
> >  	p->flags &= ~PF_MEMALLOC;
> >  
> > -	cond_resched();
> > -
> > -	if (unlikely(!(*did_some_progress)))
> > +	if (unlikely(!(*did_some_progress))) {
> > +		if (!drained) {
> > +			drain_all_pages();
> > +			drained = true;
> > +			goto retry;
> > +		}
> >  		return NULL;
> > +	}
> >  
> > -retry:
> >  	page = get_page_from_freelist(gfp_mask, nodemask, order,
> >  					zonelist, high_zoneidx,
> >  					alloc_flags, preferred_zone,
> >  					migratetype);
> >  
> > -	/*
> > -	 * If an allocation failed after direct reclaim, it could be because
> > -	 * pages are pinned on the per-cpu lists. Drain them and try again
> > -	 */
> > -	if (!page && !drained) {
> > -		drain_all_pages();
> > -		drained = true;
> > +	/* someone steal our vmscan fruits? */
> > +	if (!page && *did_some_progress)
> >  		goto retry;
> > -	}
> 
> Perhaps the fruit-stealing event is worth adding to the
> userspace-exposed vm stats somewhere.  But not in /proc - somewhere
> more temporary, in debugfs.

There are no existing debugfs interfaces for vm stats, and I need to
go out right now, so I did the following quick (and temporary) hack
to allow Dave to collect the information. I'll revisit the proper
interface to use later :)

Thanks,
Fengguang
---
 include/linux/mmzone.h |    1 +
 mm/page_alloc.c        |    4 +++-
 mm/vmstat.c            |    1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

--- linux-next.orig/include/linux/mmzone.h	2010-09-04 12:30:26.000000000 +0800
+++ linux-next/include/linux/mmzone.h	2010-09-04 12:30:36.000000000 +0800
@@ -104,6 +104,7 @@ enum zone_stat_item {
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
+	NR_RECLAIM_STEAL,
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
--- linux-next.orig/mm/page_alloc.c	2010-09-04 12:28:09.000000000 +0800
+++ linux-next/mm/page_alloc.c	2010-09-04 12:33:39.000000000 +0800
@@ -1879,8 +1879,10 @@ retry:
 					migratetype);
 
 	/* someone steal our vmscan fruits? */
-	if (!page && *did_some_progress)
+	if (!page && *did_some_progress) {
+		inc_zone_state(preferred_zone, NR_RECLAIM_STEAL);
 		goto retry;
+	}
 
 	return page;
 }
--- linux-next.orig/mm/vmstat.c	2010-09-04 12:31:30.000000000 +0800
+++ linux-next/mm/vmstat.c	2010-09-04 12:31:42.000000000 +0800
@@ -732,6 +732,7 @@ static const char * const vmstat_text[] 
 	"nr_isolated_anon",
 	"nr_isolated_file",
 	"nr_shmem",
+	"nr_reclaim_steal",
 #ifdef CONFIG_NUMA
 	"numa_hit",
 	"numa_miss",

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-04  3:21         ` Andrew Morton
@ 2010-09-04  7:58           ` Dave Chinner
  -1 siblings, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2010-09-04  7:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Wu Fengguang, David Rientjes

On Fri, Sep 03, 2010 at 08:21:01PM -0700, Andrew Morton wrote:
> On Sat, 4 Sep 2010 12:25:45 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > Still, given the improvements in performance from this patchset,
> > I'd say inclusion is a no-brainer....
> 
> OK, thanks.
> 
> It'd be interesting to check the IPI frequency with and without -
> /proc/interrupts "CAL" field.  Presumably it went down a lot.

Maybe I suspected you would ask for this. I happened to dump
/proc/interrupts after the livelock run finished, so you're in
luck :)

The lines below are:

before: before running the single 50M inode create workload
after: the numbers after the run completes
livelock: the numbers after two runs with a livelock in the second

Vanilla 2.6.36-rc3:

before:      561   350   614   282   559   335   365   363
after:	   10472 10473 10544 10681  9818 10837 10187  9923

.36-rc3 With patchset:

before:      452   426   441   337   748   321   498   357
after:      9463  9112  8671  8830  9391  8684  9768  8971

The numbers aren't that different - roughly 10% lower on average
with the patchset. I will state that the vanilla kernel runs I just did
had noticeably more consistent performance than the previous results
I had achieved, so perhaps it wasn't triggering the livelock
conditions as effectively this time through.

And finally:

livelock:  59458 58367 58559 59493 59614 57970 59060 58207

So the livelock case tends to indicate roughly 40,000 more IPI
interrupts per CPU occurred.  The livelock occurred for close to 5
minutes, so that's roughly 130 IPIs per second per CPU....
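
A rough cross-check of that arithmetic, as a standalone sketch only: it
assumes the livelock window was ~300 seconds and that a second clean run
would roughly double the single-run CAL counts. The figures are copied
from the /proc/interrupts dumps quoted earlier in this mail.

#include <stdio.h>

int main(void)
{
	/* per-CPU CAL counts after two runs with a livelock in the second */
	long livelock[] = { 59458, 58367, 58559, 59493, 59614, 57970, 59060, 58207 };
	/* per-CPU CAL counts after one clean run with the patchset applied */
	long one_run[]  = {  9463,  9112,  8671,  8830,  9391,  8684,  9768,  8971 };
	double livelock_avg = 0, one_run_avg = 0;
	int i;

	for (i = 0; i < 8; i++) {
		livelock_avg += livelock[i];
		one_run_avg  += one_run[i];
	}
	livelock_avg /= 8;
	one_run_avg  /= 8;

	/* extra IPIs per CPU attributable to the livelocked second run */
	printf("extra IPIs/CPU: ~%.0f, rate over ~300s: ~%.0f/s\n",
	       livelock_avg - 2 * one_run_avg,
	       (livelock_avg - 2 * one_run_avg) / 300.0);
	return 0;
}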

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-04  7:58           ` Dave Chinner
@ 2010-09-04  8:14             ` Dave Chinner
  -1 siblings, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2010-09-04  8:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Wu Fengguang, David Rientjes

On Sat, Sep 04, 2010 at 05:58:40PM +1000, Dave Chinner wrote:
> On Fri, Sep 03, 2010 at 08:21:01PM -0700, Andrew Morton wrote:
> > On Sat, 4 Sep 2010 12:25:45 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > 
> > > Still, given the improvements in performance from this patchset,
> > > I'd say inclusion is a no-brainer....
> > 
> > OK, thanks.
> > 
> > It'd be interesting to check the IPI frequency with and without -
> > /proc/interrupts "CAL" field.  Presumably it went down a lot.
> 
> Maybe I suspected you would ask for this. I happened to dump
> /proc/interrupts after the livelock run finished, so you're in
> luck :)
....
> 
> livelock:  59458 58367 58559 59493 59614 57970 59060 58207
> 
> So the livelock case tends to indicate roughly 40,000 more IPI
> interrupts per CPU occurred.  The livelock occurred for close to 5
> minutes, so that's roughly 130 IPIs per second per CPU....

And just to confuse the issue further, I just had a livelock on a
vanilla kernel that did *not* cause the CAL counts to increase.
Hence it appears that the IPI storms are not the cause of the
livelocks I'm triggering....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
       [not found]                   ` <20100905131447.GJ705@dastard>
@ 2010-09-05 13:45                       ` Wu Fengguang
  0 siblings, 0 replies; 104+ messages in thread
From: Wu Fengguang @ 2010-09-05 13:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Mel Gorman, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Wu, Fengguang,
	David Rientjes

[restoring CC list]

On Sun, Sep 05, 2010 at 09:14:47PM +0800, Dave Chinner wrote:
> On Sun, Sep 05, 2010 at 02:05:39PM +0800, Wu Fengguang wrote:
> > On Sun, Sep 05, 2010 at 10:15:55AM +0800, Dave Chinner wrote:
> > > On Sun, Sep 05, 2010 at 09:54:00AM +0800, Wu Fengguang wrote:
> > > > Dave, could you post (publicly) the kconfig and /proc/vmstat?
> > > > 
> > > > I'd like to check if you have swap or memory compaction enabled..
> > > 
> > > Swap is enabled - it has 512MB of swap space:
> > > 
> > > $ free
> > >              total       used       free     shared    buffers     cached
> > > Mem:       4054304     100928    3953376          0       4096      43108
> > > -/+ buffers/cache:      53724    4000580
> > > Swap:       497976          0     497976
> > 
> > It looks swap is not used at all.
> 
> It isn't 30s after boot, but I haven't checked after a livelock.

That's fine. I see in your fs_mark-wedge-1.png that there is no
read/write IO at all when the CPUs are 100% busy. So there should be no
swap IO at "livelock" time.

> > > And memory compaction is not enabled:
> > > 
> > > $ grep COMPACT .config
> > > # CONFIG_COMPACTION is not set

Memory compaction is not likely the cause either. It will only kick in for
order > 3 allocations.

> > > 
> > > The .config is pretty much a 'make defconfig' and then enabling XFS and
> > > whatever debug I need (e.g. locking, memleak, etc).
> > 
> > Thanks! The problem seems hard to debug -- you cannot login at all
> > when it is doing lock contentions, so cannot get sysrq call traces.
> 
> Well, I don't know whether it is lock contention at all. The sets of
> traces I have got previously have shown backtraces on all CPUs in
> direct reclaim with several in draining queues, but no apparent lock
> contention.

That's interesting. Do you still have the full backtraces?

Maybe your system eats too much slab cache (icache/dcache) by creating
so many zero-sized files. The system may run into problems reclaiming
so many (dirty) slab pages.

> > How about enabling CONFIG_LOCK_STAT? Then you can check
> > /proc/lock_stat when the contentions are over.
> 
> Enabling the locking debug/stats gathering slows the workload
> by a factor of 3 and doesn't produce the livelock....

Oh sorry.. but it would still be interesting to check the top
contended locks for this workload without any livelocks :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list
  2010-09-03 22:38     ` Andrew Morton
@ 2010-09-05 18:06       ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-05 18:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Fri, Sep 03, 2010 at 03:38:59PM -0700, Andrew Morton wrote:
> On Fri,  3 Sep 2010 10:08:44 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -588,12 +588,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  {
> >  	int migratetype = 0;
> >  	int batch_free = 0;
> > +	int freed = count;
> >  
> >  	spin_lock(&zone->lock);
> >  	zone->all_unreclaimable = 0;
> >  	zone->pages_scanned = 0;
> >  
> > -	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
> >  	while (count) {
> >  		struct page *page;
> >  		struct list_head *list;
> > @@ -621,6 +621,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  			trace_mm_page_pcpu_drain(page, 0, page_private(page));
> >  		} while (--count && --batch_free && !list_empty(list));
> >  	}
> > +	__mod_zone_page_state(zone, NR_FREE_PAGES, freed);
> >  	spin_unlock(&zone->lock);
> >  }
> >  
> 
> nit: this is why it's evil to modify the value of incoming args.  It's
> nicer to leave them alone and treat them as const across the lifetime
> of the callee.
> 

Ok, I can see the logic of that.

> Can I do this?
> 
> --- a/mm/page_alloc.c~mm-page-allocator-update-free-page-counters-after-pages-are-placed-on-the-free-list-fix
> +++ a/mm/page_alloc.c
> @@ -588,13 +588,13 @@ static void free_pcppages_bulk(struct zo
>  {
>  	int migratetype = 0;
>  	int batch_free = 0;
> -	int freed = count;
> +	int to_free = count;
>  
>  	spin_lock(&zone->lock);
>  	zone->all_unreclaimable = 0;
>  	zone->pages_scanned = 0;
>  
> -	while (count) {
> +	while (to_free) {
>  		struct page *page;
>  		struct list_head *list;
>  
> @@ -619,9 +619,9 @@ static void free_pcppages_bulk(struct zo
>  			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
>  			__free_one_page(page, zone, 0, page_private(page));
>  			trace_mm_page_pcpu_drain(page, 0, page_private(page));
> -		} while (--count && --batch_free && !list_empty(list));
> +		} while (--to_free && --batch_free && !list_empty(list));
>  	}
> -	__mod_zone_page_state(zone, NR_FREE_PAGES, freed);
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
>  	spin_unlock(&zone->lock);
>  }

Yes you can. I see no problem with this alteration.
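
For what it's worth, a minimal userspace sketch of the pattern settled on
above: the incoming count is left untouched, a local copy drives the loop,
and the original value is still available for the final counter update.
free_one() and free_counter are made-up stand-ins, not kernel interfaces.

#include <stdio.h>

static long free_counter;		/* plays the role of NR_FREE_PAGES */

static void free_one(int page)		/* plays the role of __free_one_page() */
{
	(void)page;
}

static void free_bulk(int count)
{
	int to_free = count;		/* work on a copy, treat "count" as const */

	while (to_free--)
		free_one(to_free);

	/* counter is updated only after all pages are back on the list */
	free_counter += count;
}

int main(void)
{
	free_bulk(32);
	printf("free_counter = %ld\n", free_counter);
	return 0;
}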

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
  2010-09-03 22:55     ` Andrew Morton
@ 2010-09-05 18:12       ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-05 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Fri, Sep 03, 2010 at 03:55:37PM -0700, Andrew Morton wrote:
> On Fri,  3 Sep 2010 10:08:45 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > From: Christoph Lameter <cl@linux.com>
> > 
> > Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
> > it is cheaper than scanning a number of lists. To avoid synchronization
> > overhead, counter deltas are maintained on a per-cpu basis and drained both
> > periodically and when the delta is above a threshold. On large CPU systems,
> > the difference between the estimated and real value of NR_FREE_PAGES can
> > be very high. If NR_FREE_PAGES is much higher than number of real free page
> > in buddy, the VM can allocate pages below min watermark, at worst reducing
> > the real number of pages to zero. Even if the OOM killer kills some victim
> > for freeing memory, it may not free memory if the exit path requires a new
> > page resulting in livelock.
> > 
> > This patch introduces a zone_page_state_snapshot() function (courtesy of
> > Christoph) that takes a slightly more accurate view of an arbitrary vmstat counter.
> > It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark
> > being accidentally broken.  The estimate is not perfect and may result
> > in cache line bounces but is expected to be lighter than the IPI calls
> > necessary to continually drain the per-cpu counters while kswapd is awake.
> > 
> 
> The "is kswapd awake" heuristic seems fairly hacky.  Can it be
> improved, made more deterministic? 

It could be removed but the problem is that the snap version of the
function could be continually used on large systems that are using
almost all physical memory but not under any memory pressure. kswapd
being awake seemed a reasonable proxy indicator that the system is under
pressure.

> Exactly what state are we looking
> for here?
> 

We want to know when the system is in a state where it is both under
pressure and in danger of breaching the watermark due to per-cpu counter
drift.
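
To illustrate the drift itself, here is a small userspace sketch (not
kernel code) of a global counter plus per-CPU deltas that have not been
folded back yet. The cheap read looks only at the global value; the
snapshot also walks the per-CPU deltas, in the spirit of
zone_page_state_snapshot(). The CPU count, delta and watermark values
below are made up for illustration.

#include <stdio.h>

#define NCPUS		64
#define MIN_WATERMARK	2000			/* pages */

static long global_free = 2500;			/* what the cheap read reports */
static long cpu_delta[NCPUS];			/* pending, not yet folded back */

static long cheap_read(void)
{
	return global_free;
}

static long snapshot_read(void)
{
	long x = global_free;
	int cpu;

	for (cpu = 0; cpu < NCPUS; cpu++)
		x += cpu_delta[cpu];
	return x > 0 ? x : 0;
}

int main(void)
{
	int cpu;

	/* each CPU allocated 25 pages it has not reported globally yet */
	for (cpu = 0; cpu < NCPUS; cpu++)
		cpu_delta[cpu] = -25;

	printf("cheap read:    %ld (looks above the min watermark of %d)\n",
	       cheap_read(), MIN_WATERMARK);
	printf("snapshot read: %ld (actually well below it)\n",
	       snapshot_read());
	return 0;
}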

> 
> > +/*
> > + * More accurate version that also considers the currently pending
> > + * deltas. For that we need to loop over all cpus to find the current
> > + * deltas. There is no synchronization so the result cannot be
> > + * exactly accurate either.
> > + */
> > +static inline unsigned long zone_page_state_snapshot(struct zone *zone,
> > +					enum zone_stat_item item)
> > +{
> > +	long x = atomic_long_read(&zone->vm_stat[item]);
> > +
> > +#ifdef CONFIG_SMP
> > +	int cpu;
> > +	for_each_online_cpu(cpu)
> > +		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
> > +
> > +	if (x < 0)
> > +		x = 0;
> > +#endif
> > +	return x;
> > +}
> 
> aka percpu_counter_sum()!
> 
> Can someone remind me why per_cpu_pageset went and reimplemented
> percpu_counters rather than just using them?
> 

It's not an exact fit. Christoph answered this and I do not have
anything additional to say.

> >  extern unsigned long global_reclaimable_pages(void);
> >  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> >  
> > diff --git a/mm/mmzone.c b/mm/mmzone.c
> > index f5b7d17..e35bfb8 100644
> > --- a/mm/mmzone.c
> > +++ b/mm/mmzone.c
> > @@ -87,3 +87,24 @@ int memmap_valid_within(unsigned long pfn,
> >  	return 1;
> >  }
> >  #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
> > +
> > +#ifdef CONFIG_SMP
> > +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> > +unsigned long zone_nr_free_pages(struct zone *zone)
> > +{
> > +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> > +
> > +	/*
> > +	 * While kswapd is awake, it is considered the zone is under some
> > +	 * memory pressure. Under pressure, there is a risk that
> > +	 * per-cpu-counter-drift will allow the min watermark to be breached
> > +	 * potentially causing a live-lock. While kswapd is awake and
> > +	 * free pages are low, get a better estimate for free pages
> > +	 */
> > +	if (nr_free_pages < zone->percpu_drift_mark &&
> > +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> > +		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
> > +
> > +	return nr_free_pages;
> > +}
> 
> Is this really the best way of doing it?  The way we usually solve
> this problem (and boy, was this bug a newbie mistake!) is:
> 
> 	foo = percpu_counter_read(x);
> 
> 	if (foo says something bad) {
> 		/* Bad stuff: let's get a more accurate foo */
> 		foo = percpu_counter_sum(x);
> 	}
> 
> 	if (foo still says something bad)
> 		do_bad_thing();
> 
> In other words, don't do all this stuff with percpu_drift_mark and the
> kswapd heuristic.

The percpu_drift_mark and the kswapd heuristic correspond to your "foo
says something bad" above. The drift mark is detecting we're in
potential danger and the kswapd check is telling us we are both in
danger and there is memory pressure. Even if we were using the percpu
counters, it wouldn't eliminate the need for percpu_drift_mark and the
kswapd heuristic, right?

> Just change zone_watermark_ok() to use the more
> accurate read if it's about to return "no".
> 

It could be too late by then. By the time zone_watermark_ok() is about
to return no, we could have already breached the watermark by a
significant amount due to the per-cpu counter drift.
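
As a back-of-the-envelope for "significant amount": before any delta is
folded back, NR_FREE_PAGES can be off by up to roughly one full per-cpu
threshold on every CPU. The threshold and CPU count below are illustrative
numbers, not taken from a real machine.

#include <stdio.h>

int main(void)
{
	long ncpus = 64;
	long per_cpu_threshold = 125;	/* pages per CPU before a delta is flushed */
	long page_size = 4096;
	long drift = ncpus * per_cpu_threshold;

	printf("worst-case drift: %ld pages (~%ld MiB)\n",
	       drift, drift * page_size >> 20);
	return 0;
}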

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-03 23:00     ` Andrew Morton
@ 2010-09-05 18:14       ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-05 18:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Wu Fengguang, David Rientjes

On Fri, Sep 03, 2010 at 04:00:26PM -0700, Andrew Morton wrote:
> On Fri,  3 Sep 2010 10:08:46 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > When under significant memory pressure, a process enters direct reclaim
> > and immediately afterwards tries to allocate a page. If it fails and no
> > further progress is made, it's possible the system will go OOM. However,
> > on systems with large amounts of memory, it's possible that a significant
> > number of pages are on per-cpu lists and inaccessible to the calling
> > process. This leads to a process entering direct reclaim more often than
> > it should increasing the pressure on the system and compounding the problem.
> > 
> > This patch notes that if direct reclaim is making progress but
> > allocations are still failing that the system is already under heavy
> > pressure. In this case, it drains the per-cpu lists and tries the
> > allocation a second time before continuing.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Reviewed-by: Christoph Lameter <cl@linux.com>
> > ---
> >  mm/page_alloc.c |   20 ++++++++++++++++----
> >  1 files changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index bbaa959..750e1dc 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  	struct page *page = NULL;
> >  	struct reclaim_state reclaim_state;
> >  	struct task_struct *p = current;
> > +	bool drained = false;
> >  
> >  	cond_resched();
> >  
> > @@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  
> >  	cond_resched();
> >  
> > -	if (order != 0)
> > -		drain_all_pages();
> > +	if (unlikely(!(*did_some_progress)))
> > +		return NULL;
> >  
> > -	if (likely(*did_some_progress))
> > -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> > +retry:
> > +	page = get_page_from_freelist(gfp_mask, nodemask, order,
> >  					zonelist, high_zoneidx,
> >  					alloc_flags, preferred_zone,
> >  					migratetype);
> > +
> > +	/*
> > +	 * If an allocation failed after direct reclaim, it could be because
> > +	 * pages are pinned on the per-cpu lists. Drain them and try again
> > +	 */
> > +	if (!page && !drained) {
> > +		drain_all_pages();
> > +		drained = true;
> > +		goto retry;
> > +	}
> > +
> >  	return page;
> >  }
> 
> The patch looks reasonable.
> 
> But please take a look at the recent thread "mm: minute-long livelocks
> in memory reclaim".  There, people are pointing fingers at that
> drain_all_pages() call, suspecting that it's causing huge IPI storms.
> 

I'm aware of it.

> Dave was going to test this theory but afaik hasn't yet done so.  It
> would be nice to tie these threads together if poss?
> 

I was waiting to hear the results of the test. Certainly it seemed very
plausible that this patch would help it. I also have a hunch that the
congestion_wait() problems are cropping up. I have a revised patch
series that might close the rest of the problem.
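
For reference, a plain userspace sketch of the control flow the quoted hunk
introduces. try_alloc(), reclaim() and drain_cpu_caches() are hypothetical
stand-ins for get_page_from_freelist(), direct reclaim and
drain_all_pages(); the point is only the "drain once, then retry" shape.

#include <stdbool.h>
#include <stdio.h>

static bool try_alloc(void)	{ return false; }	/* pretend the freelist looks empty */
static long reclaim(void)	{ return 32; }		/* pretend we reclaimed 32 pages */

static void drain_cpu_caches(void)
{
	printf("draining per-cpu lists (one round of IPIs)\n");
}

static bool alloc_after_reclaim(void)
{
	bool drained = false;

	if (!reclaim())
		return false;		/* no progress at all: give up here */

	for (;;) {
		if (try_alloc())
			return true;
		if (drained)
			return false;	/* already drained once, don't loop on IPIs */
		drain_cpu_caches();
		drained = true;
	}
}

int main(void)
{
	printf("allocation %s\n", alloc_after_reclaim() ? "succeeded" : "failed");
	return 0;
}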

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-04  2:25       ` Dave Chinner
@ 2010-09-05 18:22         ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-05 18:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Wu Fengguang, David Rientjes

On Sat, Sep 04, 2010 at 12:25:45PM +1000, Dave Chinner wrote:
> On Fri, Sep 03, 2010 at 04:00:26PM -0700, Andrew Morton wrote:
> > On Fri,  3 Sep 2010 10:08:46 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > When under significant memory pressure, a process enters direct reclaim
> > > and immediately afterwards tries to allocate a page. If it fails and no
> > > further progress is made, it's possible the system will go OOM. However,
> > > on systems with large amounts of memory, it's possible that a significant
> > > number of pages are on per-cpu lists and inaccessible to the calling
> > > process. This leads to a process entering direct reclaim more often than
> > > it should increasing the pressure on the system and compounding the problem.
> > > 
> > > This patch notes that if direct reclaim is making progress but
> > > allocations are still failing that the system is already under heavy
> > > pressure. In this case, it drains the per-cpu lists and tries the
> > > allocation a second time before continuing.
> ....
> > The patch looks reasonable.
> > 
> > But please take a look at the recent thread "mm: minute-long livelocks
> > in memory reclaim".  There, people are pointing fingers at that
> > drain_all_pages() call, suspecting that it's causing huge IPI storms.
> > 
> > Dave was going to test this theory but afaik hasn't yet done so.  It
> > would be nice to tie these threads together if poss?
> 
> It's been my "next-thing-to-do" since David suggested I try it -
> tracking down other problems has got in the way, though. I
> just ran my test a couple of times through:
> 
> $ ./fs_mark -D 10000 -L 63 -S0 -n 100000 -s 0 \
> 	-d /mnt/scratch/0 -d /mnt/scratch/1 \
> 	-d /mnt/scratch/3 -d /mnt/scratch/2 \
> 	-d /mnt/scratch/4 -d /mnt/scratch/5 \
> 	-d /mnt/scratch/6 -d /mnt/scratch/7
> 
> To create millions of inodes in parallel on an 8p/4G RAM VM.
> The filesystem is ~1.1TB XFS:
> 
> # mkfs.xfs -f -d agcount=16 /dev/vdb
> meta-data=/dev/vdb               isize=256    agcount=16, agsize=16777216 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=268435456, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> # mount -o inode64,delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch
> 

Unfortunately, I doubt I'll be able to reproduce this test. I don't have
access to a machine with enough processors or disk. I will try on 4p/4G
and 500M and see how that pans out.

> Performance prior to this patch was that each iteration resulted in
> ~65k files/s, with occassionaly peaks to 90k files/s, but drops to
> frequently 45k files/s when reclaim ran to reclaim the inode
> caches. This load ran permanently at 800% CPU usage.
> 
> Every so often (maybe once or twice per 50M inode create run) all 8 CPUs
> would remain pegged but the create rate would drop to zero for a few
> seconds to a couple of minutes. that was the livelock issues I
> reported.
> 

Should be easy to spot at least.

> With this patchset, I'm seeing a per-iteration average of ~77k
> files/s, with only a couple of iterations dropping down to ~55k
> files/s and a significant number above 90k/s. The runtime to 50M
> inodes is down by ~30% and the average CPU usage across the run is
> around 700%. IOWs, there is a significant gain in performance and a
> significant drop in CPU usage. I've done two runs to 50m inodes,
> and not seen any sign of a livelock, even for short periods of time.
> 

Very cool.

> Ah, spoke too soon - I let the second run keep going, and at ~68M
> inodes it's just pegged all the CPUs and is pretty much completely
> wedged. Serial console is not responding, I can't get a new login,
> and the only thing responding that tells me the machine is alive is
> the remote PCP monitoring. It's been stuck for 5 minutes .... and
> now it is back. Here's what I saw:
> 
> http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-wedge-1.png
> 
> The livelock is at the right of the charts, where the top chart is
> all red (system CPU time), and the other charts flat line to zero.
> 
> And according to fsmark:
> 
>      1     66400000            0      64554.2          7705926
>      1     67200000            0      64836.1          7573013
> <hang happened here>
>      2     68000000            0      69472.8          7941399
>      2     68800000            0      85017.5          7585203
> 
> it didn't record any change in performance, which means the livelock
> probably occurred between iterations.  I couldn't get any info on
> what caused the livelock this time so I can only assume it has the
> same cause....
> 

Not sure where you could have gotten stuck. I thought it might have
locked up in congestion_wait() but it wouldn't have locked up this badly
if that was the case. Sluggish, sure, but not that dead.

I'll see about reproducing with your test tomorrow and see what I find.
Thanks.

> Still, given the improvements in performance from this patchset,
> I'd say inclusion is a no-brainer....
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-05 13:45                       ` Wu Fengguang
@ 2010-09-05 23:33                         ` Dave Chinner
  -1 siblings, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2010-09-05 23:33 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Mel Gorman, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Sun, Sep 05, 2010 at 09:45:54PM +0800, Wu Fengguang wrote:
> [restoring CC list]
> 
> On Sun, Sep 05, 2010 at 09:14:47PM +0800, Dave Chinner wrote:
> > On Sun, Sep 05, 2010 at 02:05:39PM +0800, Wu Fengguang wrote:
> > > On Sun, Sep 05, 2010 at 10:15:55AM +0800, Dave Chinner wrote:
> > > > On Sun, Sep 05, 2010 at 09:54:00AM +0800, Wu Fengguang wrote:
> > > > > Dave, could you post (publicly) the kconfig and /proc/vmstat?
> > > > > 
> > > > > I'd like to check if you have swap or memory compaction enabled..
> > > > 
> > > > Swap is enabled - it has 512MB of swap space:
> > > > 
> > > > $ free
> > > >              total       used       free     shared    buffers     cached
> > > > Mem:       4054304     100928    3953376          0       4096      43108
> > > > -/+ buffers/cache:      53724    4000580
> > > > Swap:       497976          0     497976
> > > 
> > > It looks swap is not used at all.
> > 
> > It isn't 30s after boot, but I haven't checked after a livelock.
> 
> That's fine. I see in your fs_mark-wedge-1.png that there are no
> read/write IO at all when CPUs are 100% busy. So there should be no
> swap IO at "livelock" time.
> 
> > > > And memory compaction is not enabled:
> > > > 
> > > > $ grep COMPACT .config
> > > > # CONFIG_COMPACTION is not set
> 
> Memory compaction is not likely the cause either. It will only kick in for
> order > 3 allocations.
> 
> > > > 
> > > > The .config is pretty much a 'make defconfig' and then enabling XFS and
> > > > whatever debug I need (e.g. locking, memleak, etc).
> > > 
> > > Thanks! The problem seems hard to debug -- you cannot login at all
> > > when it is doing lock contentions, so cannot get sysrq call traces.
> > 
> > Well, I don't know whether it is lock contention at all. The sets of
> > traces I have got previously have shown backtraces on all CPUs in
> > direct reclaim with several in draining queues, but no apparent lock
> > contention.
> 
> That's interesting. Do you still have the full backtraces?
> 
> Maybe your system eats too much slab cache (icache/dcache) by creating
> so many zero-sized files. The system may run into problems reclaiming
> so many (dirty) slab pages.

Yes, that's where most of the memory pressure is coming from.
However, it's not stuck reclaiming slab - it's pretty clear from
another chart that I run that the slab cache contents are not
changing across the livelock. IOWs, it appears to get stuck before it
gets to shrink_slab().

Worth noting, though, is that XFS metadata workloads do create page
cache pressure as well - all the metadata pages are cached on a
separate address space, so perhaps it is getting stuck there...
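
(Not how the chart above is generated, but a minimal sketch of watching
the same thing from the shell is to sample the slab counters in
/proc/meminfo once a second:

  $ while true; do grep -E 'SReclaimable|SUnreclaim' /proc/meminfo; sleep 1; done

If those numbers sit still across the hang, slab reclaim isn't where the
time is going.)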

> > > How about enabling CONFIG_LOCK_STAT? Then you can check
> > > /proc/lock_stat when the contentions are over.
> > 
> > Enabling the locking debug/stats gathering slows the workload
> > by a factor of 3 and doesn't produce the livelock....
> 
> Oh sorry.. but it would still be interesting to check the top
> contended locks for this workload without any livelocks :)

I'll see what I can do.
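
(For reference, a minimal lock_stat run would look something like the
sketch below -- assuming a kernel built with CONFIG_LOCK_STAT=y, with
the fields described in Documentation/lockstat.txt; "<run fs_mark>" is
just a placeholder for the workload:

  $ echo 0 > /proc/lock_stat      # reset the counters first
  $ <run fs_mark>
  $ less /proc/lock_stat          # contentions/waittime columns show the hot locks

so the stats only cover the run itself.)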

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-05 13:45                       ` Wu Fengguang
@ 2010-09-06  4:02                         ` Dave Chinner
  -1 siblings, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2010-09-06  4:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Mel Gorman, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Sun, Sep 05, 2010 at 09:45:54PM +0800, Wu Fengguang wrote:
> [restoring CC list]
> 
> On Sun, Sep 05, 2010 at 09:14:47PM +0800, Dave Chinner wrote:
> > On Sun, Sep 05, 2010 at 02:05:39PM +0800, Wu Fengguang wrote:
> > > On Sun, Sep 05, 2010 at 10:15:55AM +0800, Dave Chinner wrote:
> > > > On Sun, Sep 05, 2010 at 09:54:00AM +0800, Wu Fengguang wrote:
> > > > > Dave, could you post (publicly) the kconfig and /proc/vmstat?
> > > > > 
> > > > > I'd like to check if you have swap or memory compaction enabled..
> > > > 
> > > > Swap is enabled - it has 512MB of swap space:
> > > > 
> > > > $ free
> > > >              total       used       free     shared    buffers     cached
> > > > Mem:       4054304     100928    3953376          0       4096      43108
> > > > -/+ buffers/cache:      53724    4000580
> > > > Swap:       497976          0     497976
> > > 
> > > It looks like swap is not used at all.
> > 
> > It isn't 30s after boot, but I haven't checked after a livelock.
> 
> That's fine. I see in your fs_mark-wedge-1.png that there are no
> read/write IO at all when CPUs are 100% busy. So there should be no
> swap IO at "livelock" time.
> 
> > > > And memory compaction is not enabled:
> > > > 
> > > > $ grep COMPACT .config
> > > > # CONFIG_COMPACTION is not set
> 
> Memory compaction is not likely the cause either. It will only kick in for
> order > 3 allocations.
> 
> > > > 
> > > > The .config is pretty much a 'make defconfig' and then enabling XFS and
> > > > whatever debug I need (e.g. locking, memleak, etc).
> > > 
> > > Thanks! The problem seems hard to debug -- you cannot login at all
> > > when it is doing lock contentions, so cannot get sysrq call traces.
> > 
> > Well, I don't know whether it is lock contention at all. The sets of
> > traces I have got previously have shown backtraces on all CPUs in
> > direct reclaim with several in draining queues, but no apparent lock
> > contention.
> 
> That's interesting. Do you still have the full backtraces?

Just saw one when testing some new code with CONFIG_XFS_DEBUG
enabled. The act of running 'echo t > /proc/sysrq-trigger' seems to
have got the machine unstuck, so I'm not sure if the traces are
completely representative of the livelock state. However, here are
the fs_mark processes:

[  596.628086] fs_mark       R  running task        0  2373   2163 0x00000008
[  596.628086]  0000000000000000 ffffffff81bb8610 00000000000008fc 0000000000000002
[  596.628086]  0000000000000000 0000000000000296 0000000000000297 ffffffffffffff10
[  596.628086]  ffffffff810b48c2 0000000000000010 0000000000000202 ffff880116b61798
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2374   2163 0x00000000
[  596.628086]  0000000000000000 0000000000000002 ffff88011ad619b0 ffff88011fc050c0
[  596.628086]  ffff88011ad619e8 ffffffff8113dce6 ffff88011fc028c0 ffff88011fc02900
[  596.628086]  ffff88011ad619e8 0000025000000000 ffff880100001c08 0000001000000000
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2375   2163 0x00000000
[  596.628086]  ffff8801198f96f8 ffff880100000000 0000000000000001 0000000000000002
[  596.628086]  ffff8801198f9708 ffffffff8110758a 0000000001320122 0000000000000002
[  596.628086]  ffff8801198f8000 ffff880100000000 0000000000000007 0000000000000250
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff810b48c6>] ? smp_call_function_many+0x1a6/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2376   2163 0x00000000
[  596.628086]  ffff88011d303708 ffffffff8110758a 0000000001320122 0000000000000002
[  596.628086]  ffff88011d302000 ffff880100000000 0000000000000007 0000000000000250
[  596.628086]  ffffffff8103694e ffff88011d3037d8 ffff88011c9808f8 0000000000000001
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2377   2163 0x00000008
[  596.628086]  0000000000000000 ffff880103dd9528 ffffffff813f2deb 000000000000003f
[  596.628086]  ffff88011c9806e0 ffff880103dd95a4 ffff880103dd9518 ffffffff813fd98e
[  596.628086]  ffff880103dd95a4 ffff88011ce51800 ffff880103dd9528 ffffffff8180722e
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff813f2deb>] ? radix_tree_gang_lookup_tag+0x8b/0x100
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
[  596.628086]  [<ffffffff8131f318>] ? xfs_inode_ag_iter_next_pag+0x108/0x110
[  596.628086]  [<ffffffff8132018c>] ? xfs_reclaim_inode_shrink+0x4c/0x90
[  596.628086]  [<ffffffff81119d02>] ? zone_nr_free_pages+0xa2/0xc0
[  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130b2f9>] ? xfs_dir_ialloc+0x139/0x340
[  596.628086]  [<ffffffff812f9cc7>] ? xfs_log_reserve+0x167/0x1e0
[  596.628086]  [<ffffffff8130d35c>] ? xfs_create+0x3dc/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2378   2163 0x00000000
[  596.628086]  ffff880103d53a78 0000000000000086 ffff8800a3fc6cc0 0000000000000caf
[  596.628086]  ffff880103d53a18 00000000000135c0 ffff88011f119040 00000000000135c0
[  596.628086]  ffff88011f1193a8 ffff880103d53fd8 ffff88011f1193b0 ffff880103d53fd8
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
[  596.628086]  [<ffffffff8113e4f8>] ? __kmalloc+0x48/0x230
[  596.628086]  [<ffffffff81804d90>] _cond_resched+0x30/0x40
[  596.628086]  [<ffffffff8113e5e1>] __kmalloc+0x131/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff81306b75>] xfs_trans_alloc_log_vecs+0xa5/0xe0
[  596.628086]  [<ffffffff813082b8>] _xfs_trans_commit+0x138/0x2f0
[  596.628086]  [<ffffffff8130d50e>] xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2379   2163 0x00000000
[  596.628086]  ffff88011f0ddd80 ffff880103da5eb8 000001b600008243 ffff88008652ca80
[  596.628086]  ffffffff8115f8fa 00007fff71ba9370 ffff880000000005 ffff88011e35cb80
[  596.628086]  ffff880076835ed0 ffff880103da5f18 0000000000000005 ffff88006d30b000
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2380   2163 0x00000000
[  596.628086]  00000000000008fc 0000000000000001 0000000000000000 0000000000000296
[  596.628086]  0000000000000293 ffffffffffffff10 ffffffff810b48c2 0000000000000010
[  596.628086]  0000000000000202 ffff880103c05798 0000000000000018 ffffffff810b48a5
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b

and the kswapd thread:

[  596.628086] kswapd0       R  running task        0   547      2 0x00000000
[  596.628086]  ffff88011e78fbc0 0000000000000046 0000000000000000 ffffffff8103694e
[  596.628086]  ffff88011e78fbf0 00000000000135c0 ffff88011f17c040 00000000000135c0
[  596.628086]  ffff88011f17c3a8 ffff88011e78ffd8 ffff88011f17c3b0 ffff88011e78ffd8
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
[  596.628086]  [<ffffffff81077422>] __cond_resched_lock+0x42/0x60
[  596.628086]  [<ffffffff811593a0>] __shrink_dcache_sb+0xf0/0x380
[  596.628086]  [<ffffffff811597c6>] shrink_dcache_memory+0x176/0x200
[  596.628086]  [<ffffffff81110bf4>] shrink_slab+0x124/0x180
[  596.628086]  [<ffffffff811125d2>] balance_pgdat+0x2e2/0x540
[  596.628086]  [<ffffffff8111295d>] kswapd+0x12d/0x390
[  596.628086]  [<ffffffff8109e8c0>] ? autoremove_wake_function+0x0/0x40
[  596.628086]  [<ffffffff81112830>] ? kswapd+0x0/0x390
[  596.628086]  [<ffffffff8109e396>] kthread+0x96/0xa0
[  596.628086]  [<ffffffff81036da4>] kernel_thread_helper+0x4/0x10
[  596.628086]  [<ffffffff8109e300>] ? kthread+0x0/0xa0
[  596.628086]  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10

I just went to grab the CAL counters, and found the system in
another livelock.  This time I managed to start the sysrq-trigger
dump while the livelock was in progress - I basically got one shot
at a command before everything stopped responding. Now I'm waiting
for the livelock to pass.... 5min.... the fs_mark workload
has stopped (ctrl-c finally responded), still livelocked....
10min.... 15min.... 20min.... OK, back now.

Interesting - all the fs_mark processes are in D state waiting on IO
completion processing. And the only running processes are the
kworker threads, which are all processing either vmstat updates
(3 CPUs):

 kworker/6:1   R  running task        0   376      2 0x00000000
  ffff88011f255cf0 0000000000000046 ffff88011f255c90 ffffffff813fda34
  ffff88003c969588 00000000000135c0 ffff88011f27c7f0 00000000000135c0
  ffff88011f27cb58 ffff88011f255fd8 ffff88011f27cb60 ffff88011f255fd8
 Call Trace:
  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
  [<ffffffff81804d90>] _cond_resched+0x30/0x40
  [<ffffffff8111ab12>] refresh_cpu_vm_stats+0xc2/0x160
  [<ffffffff8111abb0>] ? vmstat_update+0x0/0x40
  [<ffffffff8111abc6>] vmstat_update+0x16/0x40
  [<ffffffff810978a0>] process_one_work+0x130/0x470
  [<ffffffff81099e72>] worker_thread+0x172/0x3f0
  [<ffffffff81099d00>] ? worker_thread+0x0/0x3f0
  [<ffffffff8109e396>] kthread+0x96/0xa0
  [<ffffffff81036da4>] kernel_thread_helper+0x4/0x10
  [<ffffffff8109e300>] ? kthread+0x0/0xa0
  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10

Or doing inode IO completion processing:

 kworker/7:1   R  running task        0   377      2 0x00000000
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  ffffffffffffff10 0000000000000001 ffffffffffffff10 ffffffff813f8062
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  0000000000000010 0000000000000202 ffff88011f0ffc80 ffffffff813f8067
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  00000000000ac613 ffff88011ba08280 00000000871803d0 0000000000000001
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017] Call Trace:
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f8062>] ? delay_tsc+0x22/0x80
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f808a>] ? delay_tsc+0x4a/0x80
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f7fdf>] ? __delay+0xf/0x20
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813fdb03>] ? do_raw_spin_lock+0x123/0x160
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff818072be>] ? _raw_spin_lock+0xe/0x10
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812f0244>] ? xfs_iflush_done+0x84/0xb0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812cee54>] ? xfs_buf_do_callbacks+0x54/0x70
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812cf0c0>] ? xfs_buf_iodone_callbacks+0x1a0/0x2a0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81315805>] ? xfs_buf_iodone_work+0x45/0x100
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff810978a0>] ? process_one_work+0x130/0x470
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81099e72>] ? worker_thread+0x172/0x3f0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81099d00>] ? worker_thread+0x0/0x3f0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff8109e396>] ? kthread+0x96/0xa0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81036da4>] ? kernel_thread_helper+0x4/0x10
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff8109e300>] ? kthread+0x0/0xa0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10

It looks like there is spinlock contention occurring here on the xfs
AIL lock, so I'll need to look into this further. A second set of
traces I got during the livelock also showed this:

fs_mark       R  running task        0  2713      1 0x00000004
 ffff88011851b518 ffffffff81804669 ffff88011851b4d8 ffff880100000700
 0000000000000000 00000000000135c0 ffff88011f05b7f0 00000000000135c0
 ffff88011f05bb58 ffff88011851bfd8 ffff88011f05bb60 ffff88011851bfd8
Call Trace:
 [<ffffffff81804669>] ? schedule+0x3c9/0x9f0
 [<ffffffff81805235>] schedule_timeout+0x1d5/0x2a0
 [<ffffffff81119d02>] ? zone_nr_free_pages+0xa2/0xc0
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff81807275>] ? _raw_spin_lock_irq+0x15/0x20
 [<ffffffff81806638>] __down+0x78/0xb0
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff8113d8e6>] cache_alloc_refill+0x1c6/0x2b0
 [<ffffffff813fda34>] do_raw_spin_lock+0x54/0x160
 [<ffffffff812e9672>] ? xfs_iext_bno_to_irec+0xb2/0x100
 [<ffffffff811022ce>] ? find_get_page+0x1e/0xa0
 [<ffffffff81103dd7>] ? find_lock_page+0x37/0x80
 [<ffffffff8110438f>] ? find_or_create_page+0x3f/0xb0
 [<ffffffff811025e7>] ? unlock_page+0x27/0x30
 [<ffffffff81315167>] ? _xfs_buf_lookup_pages+0x297/0x370
 [<ffffffff813f808a>] ? delay_tsc+0x4a/0x80
 [<ffffffff813f7fdf>] ? __delay+0xf/0x20
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff81109e69>] ? free_pcppages_bulk+0x369/0x400
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff81109e69>] ? free_pcppages_bulk+0x369/0x400
 [<ffffffff8110a508>] ? __pagevec_free+0x58/0xb0
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff810af53c>] ? debug_mutex_add_waiter+0x2c/0x70
 [<ffffffff81805d70>] ? __mutex_lock_slowpath+0x1e0/0x280
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff810fdac2>] ? perf_event_exit_task+0x32/0x160
 [<ffffffff813fd772>] ? do_raw_write_lock+0x42/0xa0
 [<ffffffff81807015>] ? _raw_write_lock_irq+0x15/0x20
 [<ffffffff81082315>] ? do_exit+0x195/0x7c0
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff81082991>] ? do_group_exit+0x51/0xc0
 [<ffffffff81092d8c>] ? get_signal_to_deliver+0x27c/0x430
 [<ffffffff810352b5>] ? do_signal+0x75/0x7c0
 [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff81035a65>] ? do_notify_resume+0x65/0x90
 [<ffffffff81036283>] ? int_signal+0x12/0x17

Because I tried to ctrl-c the fs_mark workload. All those lock
traces on the stack aren't related to XFS, so I'm wondering exactly
where they have come from....

Finally, /proc/interrupts shows:

CAL:      12156      12039      12676      12478      12919    12177      12767      12460   Function call interrupts

Which shows that this wasn't an IPI storm that caused this
particular livelock.
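
(A quick way to rule an IPI storm in or out on a future run is to
sample that line before and after the hang -- just a sketch, with
"<reproduce the livelock>" as a placeholder:

  $ grep CAL /proc/interrupts > cal.before
  $ <reproduce the livelock>
  $ grep CAL /proc/interrupts > cal.after
  $ diff cal.before cal.after

A storm of drain_all_pages() IPIs would show the per-CPU counts jumping
by a very large amount rather than the modest numbers above.)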

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
@ 2010-09-06  4:02                         ` Dave Chinner
  0 siblings, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2010-09-06  4:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Mel Gorman, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Sun, Sep 05, 2010 at 09:45:54PM +0800, Wu Fengguang wrote:
> [restoring CC list]
> 
> On Sun, Sep 05, 2010 at 09:14:47PM +0800, Dave Chinner wrote:
> > On Sun, Sep 05, 2010 at 02:05:39PM +0800, Wu Fengguang wrote:
> > > On Sun, Sep 05, 2010 at 10:15:55AM +0800, Dave Chinner wrote:
> > > > On Sun, Sep 05, 2010 at 09:54:00AM +0800, Wu Fengguang wrote:
> > > > > Dave, could you post (publicly) the kconfig and /proc/vmstat?
> > > > > 
> > > > > I'd like to check if you have swap or memory compaction enabled..
> > > > 
> > > > Swap is enabled - it has 512MB of swap space:
> > > > 
> > > > $ free
> > > >              total       used       free     shared    buffers     cached
> > > > Mem:       4054304     100928    3953376          0       4096      43108
> > > > -/+ buffers/cache:      53724    4000580
> > > > Swap:       497976          0     497976
> > > 
> > > It looks like swap is not used at all.
> > 
> > It isn't 30s after boot, but I haven't checked after a livelock.
> 
> That's fine. I see in your fs_mark-wedge-1.png that there are no
> read/write IO at all when CPUs are 100% busy. So there should be no
> swap IO at "livelock" time.
> 
> > > > And memory compaction is not enabled:
> > > > 
> > > > $ grep COMPACT .config
> > > > # CONFIG_COMPACTION is not set
> 
> Memory compaction is not likely the cause either. It will only kick in for
> order > 3 allocations.
> 
> > > > 
> > > > The .config is pretty much a 'make defconfig' and then enabling XFS and
> > > > whatever debug I need (e.g. locking, memleak, etc).
> > > 
> > > Thanks! The problem seems hard to debug -- you cannot login at all
> > > when it is doing lock contentions, so cannot get sysrq call traces.
> > 
> > Well, I don't know whether it is lock contention at all. The sets of
> > traces I have got previously have shown backtraces on all CPUs in
> > direct reclaim with several in draining queues, but no apparent lock
> > contention.
> 
> That's interesting. Do you still have the full backtraces?

Just saw one when testing some new code with CONFIG_XFS_DEBUG
enabled. The act of running 'echo t > /proc/sysrq-trigger' seems to
have got the machine unstuck, so I'm not sure if the traces are
completely representative of the livelock state. However, here are
the fs_mark processes:

[  596.628086] fs_mark       R  running task        0  2373   2163 0x00000008
[  596.628086]  0000000000000000 ffffffff81bb8610 00000000000008fc 0000000000000002
[  596.628086]  0000000000000000 0000000000000296 0000000000000297 ffffffffffffff10
[  596.628086]  ffffffff810b48c2 0000000000000010 0000000000000202 ffff880116b61798
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2374   2163 0x00000000
[  596.628086]  0000000000000000 0000000000000002 ffff88011ad619b0 ffff88011fc050c0
[  596.628086]  ffff88011ad619e8 ffffffff8113dce6 ffff88011fc028c0 ffff88011fc02900
[  596.628086]  ffff88011ad619e8 0000025000000000 ffff880100001c08 0000001000000000
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2375   2163 0x00000000
[  596.628086]  ffff8801198f96f8 ffff880100000000 0000000000000001 0000000000000002
[  596.628086]  ffff8801198f9708 ffffffff8110758a 0000000001320122 0000000000000002
[  596.628086]  ffff8801198f8000 ffff880100000000 0000000000000007 0000000000000250
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff810b48c6>] ? smp_call_function_many+0x1a6/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2376   2163 0x00000000
[  596.628086]  ffff88011d303708 ffffffff8110758a 0000000001320122 0000000000000002
[  596.628086]  ffff88011d302000 ffff880100000000 0000000000000007 0000000000000250
[  596.628086]  ffffffff8103694e ffff88011d3037d8 ffff88011c9808f8 0000000000000001
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2377   2163 0x00000008
[  596.628086]  0000000000000000 ffff880103dd9528 ffffffff813f2deb 000000000000003f
[  596.628086]  ffff88011c9806e0 ffff880103dd95a4 ffff880103dd9518 ffffffff813fd98e
[  596.628086]  ffff880103dd95a4 ffff88011ce51800 ffff880103dd9528 ffffffff8180722e
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff813f2deb>] ? radix_tree_gang_lookup_tag+0x8b/0x100
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
[  596.628086]  [<ffffffff8131f318>] ? xfs_inode_ag_iter_next_pag+0x108/0x110
[  596.628086]  [<ffffffff8132018c>] ? xfs_reclaim_inode_shrink+0x4c/0x90
[  596.628086]  [<ffffffff81119d02>] ? zone_nr_free_pages+0xa2/0xc0
[  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130b2f9>] ? xfs_dir_ialloc+0x139/0x340
[  596.628086]  [<ffffffff812f9cc7>] ? xfs_log_reserve+0x167/0x1e0
[  596.628086]  [<ffffffff8130d35c>] ? xfs_create+0x3dc/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2378   2163 0x00000000
[  596.628086]  ffff880103d53a78 0000000000000086 ffff8800a3fc6cc0 0000000000000caf
[  596.628086]  ffff880103d53a18 00000000000135c0 ffff88011f119040 00000000000135c0
[  596.628086]  ffff88011f1193a8 ffff880103d53fd8 ffff88011f1193b0 ffff880103d53fd8
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
[  596.628086]  [<ffffffff8113e4f8>] ? __kmalloc+0x48/0x230
[  596.628086]  [<ffffffff81804d90>] _cond_resched+0x30/0x40
[  596.628086]  [<ffffffff8113e5e1>] __kmalloc+0x131/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff81306b75>] xfs_trans_alloc_log_vecs+0xa5/0xe0
[  596.628086]  [<ffffffff813082b8>] _xfs_trans_commit+0x138/0x2f0
[  596.628086]  [<ffffffff8130d50e>] xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2379   2163 0x00000000
[  596.628086]  ffff88011f0ddd80 ffff880103da5eb8 000001b600008243 ffff88008652ca80
[  596.628086]  ffffffff8115f8fa 00007fff71ba9370 ffff880000000005 ffff88011e35cb80
[  596.628086]  ffff880076835ed0 ffff880103da5f18 0000000000000005 ffff88006d30b000
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
[  596.628086] fs_mark       R  running task        0  2380   2163 0x00000000
[  596.628086]  00000000000008fc 0000000000000001 0000000000000000 0000000000000296
[  596.628086]  0000000000000293 ffffffffffffff10 ffffffff810b48c2 0000000000000010
[  596.628086]  0000000000000202 ffff880103c05798 0000000000000018 ffffffff810b48a5
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
[  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
[  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
[  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
[  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
[  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
[  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
[  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
[  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
[  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
[  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
[  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
[  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
[  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
[  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
[  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
[  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
[  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
[  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
[  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
[  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
[  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
[  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
[  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
[  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
[  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
[  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
[  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b

and the kswapd thread:

[  596.628086] kswapd0       R  running task        0   547      2 0x00000000
[  596.628086]  ffff88011e78fbc0 0000000000000046 0000000000000000 ffffffff8103694e
[  596.628086]  ffff88011e78fbf0 00000000000135c0 ffff88011f17c040 00000000000135c0
[  596.628086]  ffff88011f17c3a8 ffff88011e78ffd8 ffff88011f17c3b0 ffff88011e78ffd8
[  596.628086] Call Trace:
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
[  596.628086]  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
[  596.628086]  [<ffffffff81077422>] __cond_resched_lock+0x42/0x60
[  596.628086]  [<ffffffff811593a0>] __shrink_dcache_sb+0xf0/0x380
[  596.628086]  [<ffffffff811597c6>] shrink_dcache_memory+0x176/0x200
[  596.628086]  [<ffffffff81110bf4>] shrink_slab+0x124/0x180
[  596.628086]  [<ffffffff811125d2>] balance_pgdat+0x2e2/0x540
[  596.628086]  [<ffffffff8111295d>] kswapd+0x12d/0x390
[  596.628086]  [<ffffffff8109e8c0>] ? autoremove_wake_function+0x0/0x40
[  596.628086]  [<ffffffff81112830>] ? kswapd+0x0/0x390
[  596.628086]  [<ffffffff8109e396>] kthread+0x96/0xa0
[  596.628086]  [<ffffffff81036da4>] kernel_thread_helper+0x4/0x10
[  596.628086]  [<ffffffff8109e300>] ? kthread+0x0/0xa0
[  596.628086]  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10

I just went to grab the CAL counters, and found the system in
another livelock.  This time I managed to start the sysrq-trigger
dump while the livelock was in progress - I basically got one shot
at a command before everything stopped responding. Now I'm waiting
for the livelock to pass.... 5min.... the fs_mark workload
has stopped (ctrl-c finally responded), still livelocked....
10min.... 15min.... 20min.... OK, back now.

Interesting - all the fs_mark processes are in D state waiting on IO
completion processing. And the only running processes are the
kworker threads, which are all processing either vmstat updates
(3 CPUs):

 kworker/6:1   R  running task        0   376      2 0x00000000
  ffff88011f255cf0 0000000000000046 ffff88011f255c90 ffffffff813fda34
  ffff88003c969588 00000000000135c0 ffff88011f27c7f0 00000000000135c0
  ffff88011f27cb58 ffff88011f255fd8 ffff88011f27cb60 ffff88011f255fd8
 Call Trace:
  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
  [<ffffffff81804d90>] _cond_resched+0x30/0x40
  [<ffffffff8111ab12>] refresh_cpu_vm_stats+0xc2/0x160
  [<ffffffff8111abb0>] ? vmstat_update+0x0/0x40
  [<ffffffff8111abc6>] vmstat_update+0x16/0x40
  [<ffffffff810978a0>] process_one_work+0x130/0x470
  [<ffffffff81099e72>] worker_thread+0x172/0x3f0
  [<ffffffff81099d00>] ? worker_thread+0x0/0x3f0
  [<ffffffff8109e396>] kthread+0x96/0xa0
  [<ffffffff81036da4>] kernel_thread_helper+0x4/0x10
  [<ffffffff8109e300>] ? kthread+0x0/0xa0
  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10

Or doing inode IO completion processing:

 kworker/7:1   R  running task        0   377      2 0x00000000
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  ffffffffffffff10 0000000000000001 ffffffffffffff10 ffffffff813f8062
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  0000000000000010 0000000000000202 ffff88011f0ffc80 ffffffff813f8067
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  00000000000ac613 ffff88011ba08280 00000000871803d0 0000000000000001
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017] Call Trace:
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f8062>] ? delay_tsc+0x22/0x80
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f808a>] ? delay_tsc+0x4a/0x80
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f7fdf>] ? __delay+0xf/0x20
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813fdb03>] ? do_raw_spin_lock+0x123/0x160
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff818072be>] ? _raw_spin_lock+0xe/0x10
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812f0244>] ? xfs_iflush_done+0x84/0xb0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812cee54>] ? xfs_buf_do_callbacks+0x54/0x70
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812cf0c0>] ? xfs_buf_iodone_callbacks+0x1a0/0x2a0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81315805>] ? xfs_buf_iodone_work+0x45/0x100
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff810978a0>] ? process_one_work+0x130/0x470
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81099e72>] ? worker_thread+0x172/0x3f0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81099d00>] ? worker_thread+0x0/0x3f0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff8109e396>] ? kthread+0x96/0xa0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81036da4>] ? kernel_thread_helper+0x4/0x10
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff8109e300>] ? kthread+0x0/0xa0
 Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10

It looks like there is spinlock contention occurring here on the xfs
AIL lock, so I'll need to look into this further. A second set of
traces I got during the livelock also showed this:

fs_mark       R  running task        0  2713      1 0x00000004
 ffff88011851b518 ffffffff81804669 ffff88011851b4d8 ffff880100000700
 0000000000000000 00000000000135c0 ffff88011f05b7f0 00000000000135c0
 ffff88011f05bb58 ffff88011851bfd8 ffff88011f05bb60 ffff88011851bfd8
Call Trace:
 [<ffffffff81804669>] ? schedule+0x3c9/0x9f0
 [<ffffffff81805235>] schedule_timeout+0x1d5/0x2a0
 [<ffffffff81119d02>] ? zone_nr_free_pages+0xa2/0xc0
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff81807275>] ? _raw_spin_lock_irq+0x15/0x20
 [<ffffffff81806638>] __down+0x78/0xb0
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff8113d8e6>] cache_alloc_refill+0x1c6/0x2b0
 [<ffffffff813fda34>] do_raw_spin_lock+0x54/0x160
 [<ffffffff812e9672>] ? xfs_iext_bno_to_irec+0xb2/0x100
 [<ffffffff811022ce>] ? find_get_page+0x1e/0xa0
 [<ffffffff81103dd7>] ? find_lock_page+0x37/0x80
 [<ffffffff8110438f>] ? find_or_create_page+0x3f/0xb0
 [<ffffffff811025e7>] ? unlock_page+0x27/0x30
 [<ffffffff81315167>] ? _xfs_buf_lookup_pages+0x297/0x370
 [<ffffffff813f808a>] ? delay_tsc+0x4a/0x80
 [<ffffffff813f7fdf>] ? __delay+0xf/0x20
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff81109e69>] ? free_pcppages_bulk+0x369/0x400
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff81109e69>] ? free_pcppages_bulk+0x369/0x400
 [<ffffffff8110a508>] ? __pagevec_free+0x58/0xb0
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff810af53c>] ? debug_mutex_add_waiter+0x2c/0x70
 [<ffffffff81805d70>] ? __mutex_lock_slowpath+0x1e0/0x280
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff810fdac2>] ? perf_event_exit_task+0x32/0x160
 [<ffffffff813fd772>] ? do_raw_write_lock+0x42/0xa0
 [<ffffffff81807015>] ? _raw_write_lock_irq+0x15/0x20
 [<ffffffff81082315>] ? do_exit+0x195/0x7c0
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff81082991>] ? do_group_exit+0x51/0xc0
 [<ffffffff81092d8c>] ? get_signal_to_deliver+0x27c/0x430
 [<ffffffff810352b5>] ? do_signal+0x75/0x7c0
 [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
 [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
 [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
 [<ffffffff81035a65>] ? do_notify_resume+0x65/0x90
 [<ffffffff81036283>] ? int_signal+0x12/0x17

Because I tried to ctrl-c the fs_mark workload. All those lock
traces on the stack aren't related to XFS, so I'm wondering exactly
where they have come from....

Finally, /proc/interrupts shows:

CAL:      12156      12039      12676      12478      12919    12177      12767      12460   Function call interrupts

Which shows that this wasn't an IPI storm that caused this
particular livelock.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-06  4:02                         ` Dave Chinner
@ 2010-09-06  8:40                           ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-06  8:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Mon, Sep 06, 2010 at 02:02:43PM +1000, Dave Chinner wrote:
> On Sun, Sep 05, 2010 at 09:45:54PM +0800, Wu Fengguang wrote:
> > [restoring CC list]
> > 
> > On Sun, Sep 05, 2010 at 09:14:47PM +0800, Dave Chinner wrote:
> > > On Sun, Sep 05, 2010 at 02:05:39PM +0800, Wu Fengguang wrote:
> > > > On Sun, Sep 05, 2010 at 10:15:55AM +0800, Dave Chinner wrote:
> > > > > On Sun, Sep 05, 2010 at 09:54:00AM +0800, Wu Fengguang wrote:
> > > > > > Dave, could you post (publicly) the kconfig and /proc/vmstat?
> > > > > > 
> > > > > > I'd like to check if you have swap or memory compaction enabled..
> > > > > 
> > > > > Swap is enabled - it has 512MB of swap space:
> > > > > 
> > > > > $ free
> > > > >              total       used       free     shared    buffers     cached
> > > > > Mem:       4054304     100928    3953376          0       4096      43108
> > > > > -/+ buffers/cache:      53724    4000580
> > > > > Swap:       497976          0     497976
> > > > 
> > > > It looks swap is not used at all.
> > > 
> > > It isn't 30s after boot, but I haven't checked after a livelock.
> > 
> > That's fine. I see in your fs_mark-wedge-1.png that there are no
> > read/write IO at all when CPUs are 100% busy. So there should be no
> > swap IO at "livelock" time.
> > 
> > > > > And memory compaction is not enabled:
> > > > > 
> > > > > $ grep COMPACT .config
> > > > > # CONFIG_COMPACTION is not set
> > 
> > Memory compaction is not likely the cause too. It will only kick in for
> > order > 3 allocations.
> > 
> > > > > 
> > > > > The .config is pretty much a 'make defconfig' and then enabling XFS and
> > > > > whatever debug I need (e.g. locking, memleak, etc).
> > > > 
> > > > Thanks! The problem seems hard to debug -- you cannot login at all
> > > > when it is doing lock contentions, so cannot get sysrq call traces.
> > > 
> > > Well, I don't know whether it is lock contention at all. The sets of
> > > traces I have got previously have shown backtraces on all CPUs in
> > > direct reclaim with several in draining queues, but no apparent lock
> > > contention.
> > 
> > That's interesting. Do you still have the full backtraces?
> 
> Just saw one when testing some new code with CONFIG_XFS_DEBUG
> enabled. The act of running 'echo t > /proc/sysrq-trigger' seems to
> have got the machine unstuck, so I'm not sure if the traces are
> completely representative of the livelock state.

sysrq-trigger would alter the timing of at least one of the direct
reclaimers. It's vaguely possible this was enough of a backoff to allow
further progress.

> however, here are the fs_mark processes:
> 
> [  596.628086] fs_mark       R  running task        0  2373   2163 0x00000008
> [  596.628086]  0000000000000000 ffffffff81bb8610 00000000000008fc 0000000000000002
> [  596.628086]  0000000000000000 0000000000000296 0000000000000297 ffffffffffffff10
> [  596.628086]  ffffffff810b48c2 0000000000000010 0000000000000202 ffff880116b61798
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50

This looks like an order-0 allocation. The "Drain per-cpu lists after
direct reclaim allocation fails" patch avoids calling drain_all_pages()
in a number of cases but introduces a case where it is called for
order-0 allocations. The intention was to avoid allocations failing
just because the necessary pages are pinned on the per-cpu lists, but
maybe it's happening too often.

I include a patch at the very end of this mail that might relieve this.
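
For orientation, the logic that patch changes is the retry after direct
reclaim in the allocator slow path, which currently looks roughly like
this (a simplified sketch drawn from the hunk below, not the literal
mm/page_alloc.c source):

	page = __alloc_pages_direct_reclaim(...);

	/*
	 * If the allocation failed, the pages we need may be sitting on
	 * the per-cpu free lists rather than on the buddy lists. Drain
	 * them back to the buddy lists - an IPI to every online CPU via
	 * drain_all_pages() - and retry once.
	 */
	if (!page && !drained) {
		drain_all_pages();
		drained = true;
		goto retry;
	}

With a lot of order-0 allocators hitting that path at the same time,
even a single drain per attempt adds up to a lot of cross-CPU traffic.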

> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2374   2163 0x00000000
> [  596.628086]  0000000000000000 0000000000000002 ffff88011ad619b0 ffff88011fc050c0
> [  596.628086]  ffff88011ad619e8 ffffffff8113dce6 ffff88011fc028c0 ffff88011fc02900
> [  596.628086]  ffff88011ad619e8 0000025000000000 ffff880100001c08 0000001000000000
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2375   2163 0x00000000
> [  596.628086]  ffff8801198f96f8 ffff880100000000 0000000000000001 0000000000000002
> [  596.628086]  ffff8801198f9708 ffffffff8110758a 0000000001320122 0000000000000002
> [  596.628086]  ffff8801198f8000 ffff880100000000 0000000000000007 0000000000000250
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff810b48c6>] ? smp_call_function_many+0x1a6/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2376   2163 0x00000000
> [  596.628086]  ffff88011d303708 ffffffff8110758a 0000000001320122 0000000000000002
> [  596.628086]  ffff88011d302000 ffff880100000000 0000000000000007 0000000000000250
> [  596.628086]  ffffffff8103694e ffff88011d3037d8 ffff88011c9808f8 0000000000000001
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2377   2163 0x00000008
> [  596.628086]  0000000000000000 ffff880103dd9528 ffffffff813f2deb 000000000000003f
> [  596.628086]  ffff88011c9806e0 ffff880103dd95a4 ffff880103dd9518 ffffffff813fd98e
> [  596.628086]  ffff880103dd95a4 ffff88011ce51800 ffff880103dd9528 ffffffff8180722e
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff813f2deb>] ? radix_tree_gang_lookup_tag+0x8b/0x100
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
> [  596.628086]  [<ffffffff8131f318>] ? xfs_inode_ag_iter_next_pag+0x108/0x110
> [  596.628086]  [<ffffffff8132018c>] ? xfs_reclaim_inode_shrink+0x4c/0x90
> [  596.628086]  [<ffffffff81119d02>] ? zone_nr_free_pages+0xa2/0xc0
> [  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130b2f9>] ? xfs_dir_ialloc+0x139/0x340
> [  596.628086]  [<ffffffff812f9cc7>] ? xfs_log_reserve+0x167/0x1e0
> [  596.628086]  [<ffffffff8130d35c>] ? xfs_create+0x3dc/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2378   2163 0x00000000
> [  596.628086]  ffff880103d53a78 0000000000000086 ffff8800a3fc6cc0 0000000000000caf
> [  596.628086]  ffff880103d53a18 00000000000135c0 ffff88011f119040 00000000000135c0
> [  596.628086]  ffff88011f1193a8 ffff880103d53fd8 ffff88011f1193b0 ffff880103d53fd8
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
> [  596.628086]  [<ffffffff8113e4f8>] ? __kmalloc+0x48/0x230
> [  596.628086]  [<ffffffff81804d90>] _cond_resched+0x30/0x40
> [  596.628086]  [<ffffffff8113e5e1>] __kmalloc+0x131/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff81306b75>] xfs_trans_alloc_log_vecs+0xa5/0xe0
> [  596.628086]  [<ffffffff813082b8>] _xfs_trans_commit+0x138/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2379   2163 0x00000000
> [  596.628086]  ffff88011f0ddd80 ffff880103da5eb8 000001b600008243 ffff88008652ca80
> [  596.628086]  ffffffff8115f8fa 00007fff71ba9370 ffff880000000005 ffff88011e35cb80
> [  596.628086]  ffff880076835ed0 ffff880103da5f18 0000000000000005 ffff88006d30b000
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2380   2163 0x00000000
> [  596.628086]  00000000000008fc 0000000000000001 0000000000000000 0000000000000296
> [  596.628086]  0000000000000293 ffffffffffffff10 ffffffff810b48c2 0000000000000010
> [  596.628086]  0000000000000202 ffff880103c05798 0000000000000018 ffffffff810b48a5
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> 

Many do seem to either be calling drain_all_pages() or have done so
recently enough to see shrapnel from it in the backtrace.
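
The chain those backtraces show - drain_all_pages() -> on_each_cpu() ->
smp_call_function() -> drain_local_pages() on each CPU - boils down to
roughly the following (a simplified sketch based on the symbols in the
traces, not the literal source):

	void drain_local_pages(void *arg)
	{
		/* return this CPU's per-cpu pages to the buddy lists */
	}

	void drain_all_pages(void)
	{
		/*
		 * on_each_cpu() sends a function-call IPI to every other
		 * online CPU and waits for each of them to run
		 * drain_local_pages(). With several allocators doing this
		 * concurrently, the CPUs end up spending their time in
		 * smp_call_function_many() servicing each other's drains.
		 */
		on_each_cpu(drain_local_pages, NULL, 1);
	}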

> and the kswapd thread:
> 
> [  596.628086] kswapd0       R  running task        0   547      2 0x00000000
> [  596.628086]  ffff88011e78fbc0 0000000000000046 0000000000000000 ffffffff8103694e
> [  596.628086]  ffff88011e78fbf0 00000000000135c0 ffff88011f17c040 00000000000135c0
> [  596.628086]  ffff88011f17c3a8 ffff88011e78ffd8 ffff88011f17c3b0 ffff88011e78ffd8
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
> [  596.628086]  [<ffffffff81077422>] __cond_resched_lock+0x42/0x60
> [  596.628086]  [<ffffffff811593a0>] __shrink_dcache_sb+0xf0/0x380
> [  596.628086]  [<ffffffff811597c6>] shrink_dcache_memory+0x176/0x200
> [  596.628086]  [<ffffffff81110bf4>] shrink_slab+0x124/0x180
> [  596.628086]  [<ffffffff811125d2>] balance_pgdat+0x2e2/0x540
> [  596.628086]  [<ffffffff8111295d>] kswapd+0x12d/0x390
> [  596.628086]  [<ffffffff8109e8c0>] ? autoremove_wake_function+0x0/0x40
> [  596.628086]  [<ffffffff81112830>] ? kswapd+0x0/0x390
> [  596.628086]  [<ffffffff8109e396>] kthread+0x96/0xa0
> [  596.628086]  [<ffffffff81036da4>] kernel_thread_helper+0x4/0x10
> [  596.628086]  [<ffffffff8109e300>] ? kthread+0x0/0xa0
> [  596.628086]  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10
> 
> I just went to grab the CAL counters, and found the system in
> another livelock.  This time I managed to start the sysrq-trigger
> dump while the livelock was in progress - I basically got one shot
> at a command before everything stopped responding. Now I'm waiting
> for the livelock to pass.... 5min.... the fs_mark workload
> has stopped (ctrl-c finally responded), still livelocked....
> 10min.... 15min.... 20min.... OK, back now.
> 
> Interesting - all the fs_mark processes are in D state waiting on IO
> completion processing.

Very interesting, maybe they are all stuck in congestion_wait() this
time? There are a few sources where that is possible.
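
One such source is the isolation throttling in shrink_inactive_list().
In kernels of this vintage it looks roughly like the following (quoted
from memory and simplified - only meant to show where direct reclaimers
can sit in uninterruptible sleep without holding any locks):

	while (unlikely(too_many_isolated(zone, file, sc))) {
		/* back off so parallel reclaimers can make progress */
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* about to die and free our memory, return now */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

If all the direct reclaimers were parked there, it would look much like
this: tasks in D state, no forward progress and no obvious lock
contention.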

> And the only running processes are the
> kworker threads, which are all processing either vmstat updates
> (3 CPUs):
> 
>  kworker/6:1   R  running task        0   376      2 0x00000000
>   ffff88011f255cf0 0000000000000046 ffff88011f255c90 ffffffff813fda34
>   ffff88003c969588 00000000000135c0 ffff88011f27c7f0 00000000000135c0
>   ffff88011f27cb58 ffff88011f255fd8 ffff88011f27cb60 ffff88011f255fd8
>  Call Trace:
>   [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>   [<ffffffff810773ca>] __cond_resched+0x2a/0x40
>   [<ffffffff81804d90>] _cond_resched+0x30/0x40
>   [<ffffffff8111ab12>] refresh_cpu_vm_stats+0xc2/0x160
>   [<ffffffff8111abb0>] ? vmstat_update+0x0/0x40
>   [<ffffffff8111abc6>] vmstat_update+0x16/0x40
>   [<ffffffff810978a0>] process_one_work+0x130/0x470
>   [<ffffffff81099e72>] worker_thread+0x172/0x3f0
>   [<ffffffff81099d00>] ? worker_thread+0x0/0x3f0
>   [<ffffffff8109e396>] kthread+0x96/0xa0
>   [<ffffffff81036da4>] kernel_thread_helper+0x4/0x10
>   [<ffffffff8109e300>] ? kthread+0x0/0xa0
>   [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10
> 
> Or doing inode IO completion processing:
> 
>  kworker/7:1   R  running task        0   377      2 0x00000000
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  ffffffffffffff10 0000000000000001 ffffffffffffff10 ffffffff813f8062
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  0000000000000010 0000000000000202 ffff88011f0ffc80 ffffffff813f8067
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  00000000000ac613 ffff88011ba08280 00000000871803d0 0000000000000001
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017] Call Trace:
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f8062>] ? delay_tsc+0x22/0x80
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f808a>] ? delay_tsc+0x4a/0x80
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f7fdf>] ? __delay+0xf/0x20
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813fdb03>] ? do_raw_spin_lock+0x123/0x160
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff818072be>] ? _raw_spin_lock+0xe/0x10
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812f0244>] ? xfs_iflush_done+0x84/0xb0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812cee54>] ? xfs_buf_do_callbacks+0x54/0x70
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812cf0c0>] ? xfs_buf_iodone_callbacks+0x1a0/0x2a0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81315805>] ? xfs_buf_iodone_work+0x45/0x100
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff810978a0>] ? process_one_work+0x130/0x470
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81099e72>] ? worker_thread+0x172/0x3f0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81099d00>] ? worker_thread+0x0/0x3f0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff8109e396>] ? kthread+0x96/0xa0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81036da4>] ? kernel_thread_helper+0x4/0x10
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff8109e300>] ? kthread+0x0/0xa0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10
> 
> It looks like there is spinlock contention occurring here on the xfs
> AIL lock, so I'll need to look into this further. A second set of
> traces I got during the livelock also showed this:
> 
> fs_mark       R  running task        0  2713      1 0x00000004
>  ffff88011851b518 ffffffff81804669 ffff88011851b4d8 ffff880100000700
>  0000000000000000 00000000000135c0 ffff88011f05b7f0 00000000000135c0
>  ffff88011f05bb58 ffff88011851bfd8 ffff88011f05bb60 ffff88011851bfd8
> Call Trace:
>  [<ffffffff81804669>] ? schedule+0x3c9/0x9f0
>  [<ffffffff81805235>] schedule_timeout+0x1d5/0x2a0
>  [<ffffffff81119d02>] ? zone_nr_free_pages+0xa2/0xc0
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff81807275>] ? _raw_spin_lock_irq+0x15/0x20
>  [<ffffffff81806638>] __down+0x78/0xb0
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff8113d8e6>] cache_alloc_refill+0x1c6/0x2b0
>  [<ffffffff813fda34>] do_raw_spin_lock+0x54/0x160
>  [<ffffffff812e9672>] ? xfs_iext_bno_to_irec+0xb2/0x100
>  [<ffffffff811022ce>] ? find_get_page+0x1e/0xa0
>  [<ffffffff81103dd7>] ? find_lock_page+0x37/0x80
>  [<ffffffff8110438f>] ? find_or_create_page+0x3f/0xb0
>  [<ffffffff811025e7>] ? unlock_page+0x27/0x30
>  [<ffffffff81315167>] ? _xfs_buf_lookup_pages+0x297/0x370
>  [<ffffffff813f808a>] ? delay_tsc+0x4a/0x80
>  [<ffffffff813f7fdf>] ? __delay+0xf/0x20
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff81109e69>] ? free_pcppages_bulk+0x369/0x400
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff81109e69>] ? free_pcppages_bulk+0x369/0x400
>  [<ffffffff8110a508>] ? __pagevec_free+0x58/0xb0
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff810af53c>] ? debug_mutex_add_waiter+0x2c/0x70
>  [<ffffffff81805d70>] ? __mutex_lock_slowpath+0x1e0/0x280
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff810fdac2>] ? perf_event_exit_task+0x32/0x160
>  [<ffffffff813fd772>] ? do_raw_write_lock+0x42/0xa0
>  [<ffffffff81807015>] ? _raw_write_lock_irq+0x15/0x20
>  [<ffffffff81082315>] ? do_exit+0x195/0x7c0
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff81082991>] ? do_group_exit+0x51/0xc0
>  [<ffffffff81092d8c>] ? get_signal_to_deliver+0x27c/0x430
>  [<ffffffff810352b5>] ? do_signal+0x75/0x7c0
>  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff81035a65>] ? do_notify_resume+0x65/0x90
>  [<ffffffff81036283>] ? int_signal+0x12/0x17
> 
> Because I tried to ctrl-c the fs_mark workload. All those lock
> traces on the stack aren't related to XFS, so I'm wondering exactly
> where they have come from....
> 
> Finally, /proc/interrupts shows:
> 
> CAL:      12156      12039      12676      12478      12919    12177      12767      12460   Function call interrupts
> 
> Which shows that this wasn't an IPI storm that caused this
> particular livelock.
> 

No, but it's possible we got stuck somewhere like too_many_isolated() or
in congestion_wait(). One thing at a time though - would you mind testing
the following patch? I haven't tested it *at all*, but it should further
reduce the number of times drain_all_pages() is called without
eliminating the calls entirely.

I expect to make a start soon on trying to reproduce the problem with fs_mark
but I'm waiting on a machine to free up so it could be a while.

==== CUT HERE ====
mm: page allocator: Reduce the instances where drain_all_pages() is called

When a page allocation fails after direct reclaim, the per-cpu lists are
drained and another attempt is made to allocate. On very large systems,
this can cause IPI storms. This patch restores the old behaviour of
calling drain_all_pages() after direct reclaim fails only for high-order
allocations. It is still the case that an allocation can fail because
the necessary pages are pinned on the per-cpu lists; for order-0
allocations, the lists are now only drained as a last resort before
calling the OOM killer.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   21 ++++++++++++++++++---
 1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 750e1dc..0599cf7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1737,6 +1737,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	int migratetype)
 {
 	struct page *page;
+	bool drained = false;
 
 	/* Acquire the OOM killer lock for the zones in zonelist */
 	if (!try_set_zonelist_oom(zonelist, gfp_mask)) {
@@ -1744,6 +1745,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 	}
 
+retry:
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
@@ -1773,6 +1775,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
+
+	/*
+	 * If an allocation failed, it could be because pages are pinned on
+	 * the per-cpu lists. Before resorting to the OOM killer, try
+	 * draining them first.
+	 */
+	if (!drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	/* Exhausted what can be done so it's blamo time */
 	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
@@ -1876,10 +1890,11 @@ retry:
 					migratetype);
 
 	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 * If a high-order allocation failed after direct reclaim, it could
+	 * be because pages are pinned on the per-cpu lists. Drain them once
+	 * and try again
 	 */
-	if (!page && !drained) {
+	if (!page && !drained && order != 0) {
 		drain_all_pages();
 		drained = true;
 		goto retry;

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
@ 2010-09-06  8:40                           ` Mel Gorman
  0 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-06  8:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Mon, Sep 06, 2010 at 02:02:43PM +1000, Dave Chinner wrote:
> On Sun, Sep 05, 2010 at 09:45:54PM +0800, Wu Fengguang wrote:
> > [restoring CC list]
> > 
> > On Sun, Sep 05, 2010 at 09:14:47PM +0800, Dave Chinner wrote:
> > > On Sun, Sep 05, 2010 at 02:05:39PM +0800, Wu Fengguang wrote:
> > > > On Sun, Sep 05, 2010 at 10:15:55AM +0800, Dave Chinner wrote:
> > > > > On Sun, Sep 05, 2010 at 09:54:00AM +0800, Wu Fengguang wrote:
> > > > > > Dave, could you post (publicly) the kconfig and /proc/vmstat?
> > > > > > 
> > > > > > I'd like to check if you have swap or memory compaction enabled..
> > > > > 
> > > > > Swap is enabled - it has 512MB of swap space:
> > > > > 
> > > > > $ free
> > > > >              total       used       free     shared    buffers     cached
> > > > > Mem:       4054304     100928    3953376          0       4096      43108
> > > > > -/+ buffers/cache:      53724    4000580
> > > > > Swap:       497976          0     497976
> > > > 
> > > > It looks swap is not used at all.
> > > 
> > > It isn't 30s after boot, abut I haven't checked after a livelock.
> > 
> > That's fine. I see in your fs_mark-wedge-1.png that there are no
> > read/write IO at all when CPUs are 100% busy. So there should be no
> > swap IO at "livelock" time.
> > 
> > > > > And memory compaction is not enabled:
> > > > > 
> > > > > $ grep COMPACT .config
> > > > > # CONFIG_COMPACTION is not set
> > 
> > Memory compaction is not likely the cause too. It will only kick in for
> > order > 3 allocations.
> > 
> > > > > 
> > > > > The .config is pretty much a 'make defconfig' and then enabling XFS and
> > > > > whatever debug I need (e.g. locking, memleak, etc).
> > > > 
> > > > Thanks! The problem seems hard to debug -- you cannot login at all
> > > > when it is doing lock contentions, so cannot get sysrq call traces.
> > > 
> > > Well, I don't know whether it is lock contention at all. The sets of
> > > traces I have got previously have shown backtraces on all CPUs in
> > > direct reclaim with several in draining queues, but no apparent lock
> > > contention.
> > 
> > That's interesting. Do you still have the full backtraces?
> 
> Just saw one when testing some new code with CONFIG_XFS_DEBUG
> enabled. The act of running 'echo t > /proc/sysrq-trigger' seems to
> have got the machine unstuck, so I'm not sure if the traces are
> completely representative of the livelock state.

sysrq-trigger would alter the timing of at least one of the direct
reclaimers. It's vaguely possible this was enough of a backoff to allow
further progress.

> however, here are the fs_mark processes:
> 
> [  596.628086] fs_mark       R  running task        0  2373   2163 0x00000008
> [  596.628086]  0000000000000000 ffffffff81bb8610 00000000000008fc 0000000000000002
> [  596.628086]  0000000000000000 0000000000000296 0000000000000297 ffffffffffffff10
> [  596.628086]  ffffffff810b48c2 0000000000000010 0000000000000202 ffff880116b61798
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50

This looks like an order-0 allocation. The "Drain per-cpu lists after
direct reclaim allocation fails" avoids calling drain_all_pages() for a
number of cases but introduces a case where it's called for order-0
pages. The intention was to avoid allocations failing just because of
the lists but maybe it's happening too often.

I include a patch at the very end of this mail that might relieve this.

> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2374   2163 0x00000000
> [  596.628086]  0000000000000000 0000000000000002 ffff88011ad619b0 ffff88011fc050c0
> [  596.628086]  ffff88011ad619e8 ffffffff8113dce6 ffff88011fc028c0 ffff88011fc02900
> [  596.628086]  ffff88011ad619e8 0000025000000000 ffff880100001c08 0000001000000000
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2375   2163 0x00000000
> [  596.628086]  ffff8801198f96f8 ffff880100000000 0000000000000001 0000000000000002
> [  596.628086]  ffff8801198f9708 ffffffff8110758a 0000000001320122 0000000000000002
> [  596.628086]  ffff8801198f8000 ffff880100000000 0000000000000007 0000000000000250
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff810b48c6>] ? smp_call_function_many+0x1a6/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2376   2163 0x00000000
> [  596.628086]  ffff88011d303708 ffffffff8110758a 0000000001320122 0000000000000002
> [  596.628086]  ffff88011d302000 ffff880100000000 0000000000000007 0000000000000250
> [  596.628086]  ffffffff8103694e ffff88011d3037d8 ffff88011c9808f8 0000000000000001
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2377   2163 0x00000008
> [  596.628086]  0000000000000000 ffff880103dd9528 ffffffff813f2deb 000000000000003f
> [  596.628086]  ffff88011c9806e0 ffff880103dd95a4 ffff880103dd9518 ffffffff813fd98e
> [  596.628086]  ffff880103dd95a4 ffff88011ce51800 ffff880103dd9528 ffffffff8180722e
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff813f2deb>] ? radix_tree_gang_lookup_tag+0x8b/0x100
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
> [  596.628086]  [<ffffffff8131f318>] ? xfs_inode_ag_iter_next_pag+0x108/0x110
> [  596.628086]  [<ffffffff8132018c>] ? xfs_reclaim_inode_shrink+0x4c/0x90
> [  596.628086]  [<ffffffff81119d02>] ? zone_nr_free_pages+0xa2/0xc0
> [  596.628086]  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130b2f9>] ? xfs_dir_ialloc+0x139/0x340
> [  596.628086]  [<ffffffff812f9cc7>] ? xfs_log_reserve+0x167/0x1e0
> [  596.628086]  [<ffffffff8130d35c>] ? xfs_create+0x3dc/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2378   2163 0x00000000
> [  596.628086]  ffff880103d53a78 0000000000000086 ffff8800a3fc6cc0 0000000000000caf
> [  596.628086]  ffff880103d53a18 00000000000135c0 ffff88011f119040 00000000000135c0
> [  596.628086]  ffff88011f1193a8 ffff880103d53fd8 ffff88011f1193b0 ffff880103d53fd8
> [  596.628086] Call Trace8f/0xe0
> [  596.628086]  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
> [  596.628086]  [<ffffffff8113e4f8>] ? __kmalloc+0x48/0x230
> [  596.628086]  [<ffffffff81804d90>] _cond_resched+0x30/0x40
> [  596.628086]  [<ffffffff8113e5e1>] __kmalloc+0x131/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff81306b75>] xfs_trans_alloc_log_vecs+0xa5/0xe0
> [  596.628086]  [<ffffffff813082b8>] _xfs_trans_commit+0x138/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2379   2163 0x00000000
> [  596.628086]  ffff88011f0ddd80 ffff880103da5eb8 000001b600008243 ffff88008652ca80
> [  596.628086]  ffffffff8115f8fa 00007fff71ba9370 ffff880000000005 ffff88011e35cb80
> [  596.628086]  ffff880076835ed0 ffff880103da5f18 0000000000000005 ffff88006d30b000
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> [  596.628086] fs_mark       R  running task        0  2380   2163 0x00000000
> [  596.628086]  00000000000008fc 0000000000000001 0000000000000000 0000000000000296
> [  596.628086]  0000000000000293 ffffffffffffff10 ffffffff810b48c2 0000000000000010
> [  596.628086]  0000000000000202 ffff880103c05798 0000000000000018 ffffffff810b48a5
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
> [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> [  596.628086]  [<ffffffff813082d6>] ? _xfs_trans_commit+0x156/0x2f0
> [  596.628086]  [<ffffffff8130d50e>] ? xfs_create+0x58e/0x700
> [  596.628086]  [<ffffffff8131c587>] ? xfs_vn_mknod+0xa7/0x1c0
> [  596.628086]  [<ffffffff8131c6d0>] ? xfs_vn_create+0x10/0x20
> [  596.628086]  [<ffffffff81151f48>] ? vfs_create+0xb8/0xf0
> [  596.628086]  [<ffffffff8115273c>] ? do_last+0x4dc/0x5d0
> [  596.628086]  [<ffffffff81154937>] ? do_filp_open+0x207/0x5e0
> [  596.628086]  [<ffffffff8115b7bc>] ? d_lookup+0x3c/0x60
> [  596.628086]  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
> [  596.628086]  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
> [  596.628086]  [<ffffffff8115f8fa>] ? alloc_fd+0xfa/0x140
> [  596.628086]  [<ffffffff811448a5>] ? do_sys_open+0x65/0x130
> [  596.628086]  [<ffffffff811449b0>] ? sys_open+0x20/0x30
> [  596.628086]  [<ffffffff81036032>] ? system_call_fastpath+0x16/0x1b
> 

Many do seem to either be calling drain_all_pages() or have done it
recently enough to see shrapnel from it in the backtrace.

> and the kswapd thread:
> 
> [  596.628086] kswapd0       R  running task        0   547      2 0x00000000
> [  596.628086]  ffff88011e78fbc0 0000000000000046 0000000000000000 ffffffff8103694e
> [  596.628086]  ffff88011e78fbf0 00000000000135c0 ffff88011f17c040 00000000000135c0
> [  596.628086]  ffff88011f17c3a8 ffff88011e78ffd8 ffff88011f17c3b0 ffff88011e78ffd8
> [  596.628086] Call Trace:
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff8103694e>] ? apic_timer_interrupt+0xe/0x20
> [  596.628086]  [<ffffffff810773ca>] __cond_resched+0x2a/0x40
> [  596.628086]  [<ffffffff81077422>] __cond_resched_lock+0x42/0x60
> [  596.628086]  [<ffffffff811593a0>] __shrink_dcache_sb+0xf0/0x380
> [  596.628086]  [<ffffffff811597c6>] shrink_dcache_memory+0x176/0x200
> [  596.628086]  [<ffffffff81110bf4>] shrink_slab+0x124/0x180
> [  596.628086]  [<ffffffff811125d2>] balance_pgdat+0x2e2/0x540
> [  596.628086]  [<ffffffff8111295d>] kswapd+0x12d/0x390
> [  596.628086]  [<ffffffff8109e8c0>] ? autoremove_wake_function+0x0/0x40
> [  596.628086]  [<ffffffff81112830>] ? kswapd+0x0/0x390
> [  596.628086]  [<ffffffff8109e396>] kthread+0x96/0xa0
> [  596.628086]  [<ffffffff81036da4>] kernel_thread_helper+0x4/0x10
> [  596.628086]  [<ffffffff8109e300>] ? kthread+0x0/0xa0
> [  596.628086]  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10
> 
> I just went to grab the CAL counters, and found the system in
> another livelock.  This time I managed to start the sysrq-trigger
> dump while the livelock was in progress - I basically got one shot
> at a command before everything stopped responding. Now I'm waiting
> for the livelock to pass.... 5min.... the fs_mark workload
> has stopped (ctrl-c finally responded), still livelocked....
> 10min.... 15min.... 20min.... OK, back now.
> 
> Interesting - all the fs_mark processes are in D state waiting on IO
> completion processing.

Very interesting, maybe they are all stuck in congestion_wait() this
time? There are a few sources where that is possible.

> And the only running processes are the
> kworker threads, which are all processing either vmstat updates
> (3 CPUs):
> 
>  kworker/6:1   R  running task        0   376      2 0x00000000
>   ffff88011f255cf0 0000000000000046 ffff88011f255c90 ffffffff813fda34
>   ffff88003c969588 00000000000135c0 ffff88011f27c7f0 00000000000135c0
>   ffff88011f27cb58 ffff88011f255fd8 ffff88011f27cb60 ffff88011f255fd8
>  Call Trace:
>   [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>   [<ffffffff810773ca>] __cond_resched+0x2a/0x40
>   [<ffffffff81804d90>] _cond_resched+0x30/0x40
>   [<ffffffff8111ab12>] refresh_cpu_vm_stats+0xc2/0x160
>   [<ffffffff8111abb0>] ? vmstat_update+0x0/0x40
>   [<ffffffff8111abc6>] vmstat_update+0x16/0x40
>   [<ffffffff810978a0>] process_one_work+0x130/0x470
>   [<ffffffff81099e72>] worker_thread+0x172/0x3f0
>   [<ffffffff81099d00>] ? worker_thread+0x0/0x3f0
>   [<ffffffff8109e396>] kthread+0x96/0xa0
>   [<ffffffff81036da4>] kernel_thread_helper+0x4/0x10
>   [<ffffffff8109e300>] ? kthread+0x0/0xa0
>   [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10
> 
> Or doing inode IO completion processing:
> 
>  kworker/7:1   R  running task        0   377      2 0x00000000
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  ffffffffffffff10 0000000000000001 ffffffffffffff10 ffffffff813f8062
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  0000000000000010 0000000000000202 ffff88011f0ffc80 ffffffff813f8067
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  00000000000ac613 ffff88011ba08280 00000000871803d0 0000000000000001
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017] Call Trace:
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f8062>] ? delay_tsc+0x22/0x80
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f808a>] ? delay_tsc+0x4a/0x80
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813f7fdf>] ? __delay+0xf/0x20
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813fdb03>] ? do_raw_spin_lock+0x123/0x160
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff818072be>] ? _raw_spin_lock+0xe/0x10
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812f0244>] ? xfs_iflush_done+0x84/0xb0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812cee54>] ? xfs_buf_do_callbacks+0x54/0x70
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff812cf0c0>] ? xfs_buf_iodone_callbacks+0x1a0/0x2a0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81315805>] ? xfs_buf_iodone_work+0x45/0x100
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff813157c0>] ? xfs_buf_iodone_work+0x0/0x100
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff810978a0>] ? process_one_work+0x130/0x470
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81099e72>] ? worker_thread+0x172/0x3f0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81099d00>] ? worker_thread+0x0/0x3f0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff8109e396>] ? kthread+0x96/0xa0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81036da4>] ? kernel_thread_helper+0x4/0x10
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff8109e300>] ? kthread+0x0/0xa0
>  Sep  6 13:20:47 test-4 kernel: [ 2114.056017]  [<ffffffff81036da0>] ? kernel_thread_helper+0x0/0x10
> 
> It looks like there is spinlock contention occurring here on the xfs
> AIL lock, so I'll need to look into this further. A second set of
> traces I got during the livelock also showed this:
> 
> fs_mark       R  running task        0  2713      1 0x00000004
>  ffff88011851b518 ffffffff81804669 ffff88011851b4d8 ffff880100000700
>  0000000000000000 00000000000135c0 ffff88011f05b7f0 00000000000135c0
>  ffff88011f05bb58 ffff88011851bfd8 ffff88011f05bb60 ffff88011851bfd8
> Call Trace:
>  [<ffffffff81804669>] ? schedule+0x3c9/0x9f0
>  [<ffffffff81805235>] schedule_timeout+0x1d5/0x2a0
>  [<ffffffff81119d02>] ? zone_nr_free_pages+0xa2/0xc0
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff8110758a>] ? zone_watermark_ok+0x2a/0xf0
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff81807275>] ? _raw_spin_lock_irq+0x15/0x20
>  [<ffffffff81806638>] __down+0x78/0xb0
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff8113d8e6>] cache_alloc_refill+0x1c6/0x2b0
>  [<ffffffff813fda34>] do_raw_spin_lock+0x54/0x160
>  [<ffffffff812e9672>] ? xfs_iext_bno_to_irec+0xb2/0x100
>  [<ffffffff811022ce>] ? find_get_page+0x1e/0xa0
>  [<ffffffff81103dd7>] ? find_lock_page+0x37/0x80
>  [<ffffffff8110438f>] ? find_or_create_page+0x3f/0xb0
>  [<ffffffff811025e7>] ? unlock_page+0x27/0x30
>  [<ffffffff81315167>] ? _xfs_buf_lookup_pages+0x297/0x370
>  [<ffffffff813f808a>] ? delay_tsc+0x4a/0x80
>  [<ffffffff813f7fdf>] ? __delay+0xf/0x20
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff81109e69>] ? free_pcppages_bulk+0x369/0x400
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff81109e69>] ? free_pcppages_bulk+0x369/0x400
>  [<ffffffff8110a508>] ? __pagevec_free+0x58/0xb0
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff810af53c>] ? debug_mutex_add_waiter+0x2c/0x70
>  [<ffffffff81805d70>] ? __mutex_lock_slowpath+0x1e0/0x280
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff810fdac2>] ? perf_event_exit_task+0x32/0x160
>  [<ffffffff813fd772>] ? do_raw_write_lock+0x42/0xa0
>  [<ffffffff81807015>] ? _raw_write_lock_irq+0x15/0x20
>  [<ffffffff81082315>] ? do_exit+0x195/0x7c0
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff81082991>] ? do_group_exit+0x51/0xc0
>  [<ffffffff81092d8c>] ? get_signal_to_deliver+0x27c/0x430
>  [<ffffffff810352b5>] ? do_signal+0x75/0x7c0
>  [<ffffffff8105fc58>] ? pvclock_clocksource_read+0x58/0xd0
>  [<ffffffff813fda34>] ? do_raw_spin_lock+0x54/0x160
>  [<ffffffff813fd98e>] ? do_raw_spin_unlock+0x5e/0xb0
>  [<ffffffff8180722e>] ? _raw_spin_unlock+0xe/0x10
>  [<ffffffff81035a65>] ? do_notify_resume+0x65/0x90
>  [<ffffffff81036283>] ? int_signal+0x12/0x17
> 
> Because I tried to ctrl-c the fs_mark workload. All those lock
> traces on the stack aren't related to XFS, so I'm wondering exactly
> where they have come from....
> 
> Finally, /proc/interrupts shows:
> 
> CAL:      12156      12039      12676      12478      12919    12177      12767      12460   Function call interrupts
> 
> Which shows that this wasn't an IPI storm that caused this
> particular livelock.
> 

No, but it's possible we got stuck somewhere like too_many_isolated() or
in congestion_wait. One thing at a time though, would you mind testing
the following patch? I haven't tested this *at all* but it should reduce
the number of times drain_all_pages() is called further while not
eliminating the calls entirely.

I expect to make a start soon on trying to reproduce the problem with fs_mark
but I'm waiting on a machine to free up so it could be a while.

==== CUT HERE ====
mm: page allocator: Reduce the instances where drain_all_pages() is called

When a page allocation fails after direct reclaim, the per-cpu lists are
drained and another attempt made to allocate. On very large systems,
this can cause IPI storms. This patch restores the old behaviour of calling
drain_all_pages() after direct reclaim fails only for high-order
allocations. It's still the case that an allocation can fail because the
necessary pages are pinned in the per-cpu lists. After this patch, the
lists are only drained as a last resort before calling the OOM killer.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   21 ++++++++++++++++++---
 1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 750e1dc..0599cf7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1737,6 +1737,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	int migratetype)
 {
 	struct page *page;
+	bool drained = false;
 
 	/* Acquire the OOM killer lock for the zones in zonelist */
 	if (!try_set_zonelist_oom(zonelist, gfp_mask)) {
@@ -1744,6 +1745,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 	}
 
+retry:
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
@@ -1773,6 +1775,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
+
+	/*
+	 * If an allocation failed, it could be because pages are pinned on
+	 * the per-cpu lists. Before resorting to the OOM killer, try
+	 * draining them
+	 */
+	if (!drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	/* Exhausted what can be done so it's blamo time */
 	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
@@ -1876,10 +1890,11 @@ retry:
 					migratetype);
 
 	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 * If a high-order allocation failed after direct reclaim, it could
+	 * be because pages are pinned on the per-cpu lists. Drain them once
+	 * and try again
 	 */
-	if (!page && !drained) {
+	if (!page && !drained && order != 0) {
 		drain_all_pages();
 		drained = true;
 		goto retry;

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-06  8:40                           ` Mel Gorman
@ 2010-09-06 21:50                             ` Dave Chinner
  -1 siblings, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2010-09-06 21:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Wu Fengguang, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Mon, Sep 06, 2010 at 09:40:15AM +0100, Mel Gorman wrote:
> On Mon, Sep 06, 2010 at 02:02:43PM +1000, Dave Chinner wrote:
> > however, here are the fs_mark processes:
> > 
> > [  596.628086] fs_mark       R  running task        0  2373   2163 0x00000008
> > [  596.628086]  0000000000000000 ffffffff81bb8610 00000000000008fc 0000000000000002
> > [  596.628086]  0000000000000000 0000000000000296 0000000000000297 ffffffffffffff10
> > [  596.628086]  ffffffff810b48c2 0000000000000010 0000000000000202 ffff880116b61798
> > [  596.628086] Call Trace:
> > [  596.628086]  [<ffffffff810b48c2>] ? smp_call_function_many+0x1a2/0x210
> > [  596.628086]  [<ffffffff810b48a5>] ? smp_call_function_many+0x185/0x210
> > [  596.628086]  [<ffffffff81109ff0>] ? drain_local_pages+0x0/0x20
> > [  596.628086]  [<ffffffff810b4952>] ? smp_call_function+0x22/0x30
> > [  596.628086]  [<ffffffff81084934>] ? on_each_cpu+0x24/0x50
> > [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> > [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> > [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> > [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> > [  596.628086]  [<ffffffff8113da68>] ? ____cache_alloc_node+0x98/0x180
> > [  596.628086]  [<ffffffff8113e643>] ? __kmalloc+0x193/0x230
> > [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> > [  596.628086]  [<ffffffff8131083f>] ? kmem_alloc+0x8f/0xe0
> > [  596.628086]  [<ffffffff8131092e>] ? kmem_zalloc+0x1e/0x50
> > [  596.628086]  [<ffffffff812fac80>] ? xfs_log_commit_cil+0x500/0x590
> > [  596.628086]  [<ffffffff81310943>] ? kmem_zalloc+0x33/0x50
> 
> This looks like an order-0 allocation. The "Drain per-cpu lists after
> direct reclaim allocation fails" avoids calling drain_all_pages() for a
> number of cases but introduces a case where it's called for order-0
> pages. The intention was to avoid allocations failing just because of
> the lists but maybe it's happening too often.

Yes, that should be an order-0 allocation. Possibly an order-1
allocation, but unlikely.

> I include a patch at the very end of this mail that might relieve this.

Ok, I'll try it later today.

> > I just went to grab the CAL counters, and found the system in
> > another livelock.  This time I managed to start the sysrq-trigger
> > dump while the livelock was in progress - I basically got one shot
> > at a command before everything stopped responding. Now I'm waiting
> > for the livelock to pass.... 5min.... the fs_mark workload
> > has stopped (ctrl-c finally responded), still livelocked....
> > 10min.... 15min.... 20min.... OK, back now.
> > 
> > Interesting - all the fs_mark processes are in D state waiting on IO
> > completion processing.
> 
> Very interesting, maybe they are all stuck in congestion_wait() this
> time? There are a few sources where that is possible.

No, they are waiting on log IO completion, not doing allocation or
in the VM at all.  They are stuck in xlog_get_iclog_state() waiting for
all the log IO buffers to be processed, which are stuck behind the
inode buffer IO completions in the kworker threads that I posted.

This potentially is caused by the kworker thread consolidation - log
IO completion processing used to be in a separate workqueue for
processing latency and deadlock prevention reasons - the data and
metadata IO completion can block, whereas we need the log IO
completion to occur as quickly as possible. I've seen one deadlock
that the separate work queues solved w.r.t. loop devices, and I
suspect that part of the problem here is that transaction completion
cannot occur (and free the memory it and the CIL holds) because log IO
completion processing is being delayed significantly by metadata IO
completion...

> > A second set of
> > traces I got during the livelock also showed this:
....
> > 
> > Because I tried to ctrl-c the fs_mark workload. All those lock
> > traces on the stack aren't related to XFS, so I'm wondering exactly
> > where they have come from....
> > 
> > Finally, /proc/interrupts shows:
> > 
> > CAL:      12156      12039      12676      12478      12919    12177      12767      12460   Function call interrupts
> > 
> > Which shows that this wasn't an IPI storm that caused this
> > particular livelock.
> 
> No, but it's possible we got stuck somewhere like too_many_isolated() or
> in congestion_wait. One thing at a time though, would you mind testing
> the following patch? I haven't tested this *at all* but it should reduce
> the number of times drain_all_pages() are called further while not
> eliminating them entirely.

Ok, I'll try it later today, but first I think I need to do some
deeper investigation on the kworker thread behaviour....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-06  4:02                         ` Dave Chinner
@ 2010-09-07 14:23                           ` Christoph Lameter
  -1 siblings, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2010-09-07 14:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Andrew Morton, Mel Gorman, Linux Kernel List,
	linux-mm, Rik van Riel, Johannes Weiner, Minchan Kim,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Mon, 6 Sep 2010, Dave Chinner wrote:

> [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240

fallback_alloc() showing up here means that one page allocator call from
SLAB has already failed. SLAB then did an expensive search through all
object caches on all nodes to find some available object. There were no
objects in queues at all therefore SLAB called the page allocator again
(kmem_getpages()).

As soon as memory is available (on any node or any cpu, they are all
empty) SLAB will repopulate its queues(!).


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-07 14:23                           ` Christoph Lameter
@ 2010-09-08  2:13                             ` Wu Fengguang
  -1 siblings, 0 replies; 104+ messages in thread
From: Wu Fengguang @ 2010-09-08  2:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Chinner, Andrew Morton, Mel Gorman, Linux Kernel List,
	linux-mm, Rik van Riel, Johannes Weiner, Minchan Kim,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Tue, Sep 07, 2010 at 10:23:48PM +0800, Christoph Lameter wrote:
> On Mon, 6 Sep 2010, Dave Chinner wrote:
> 
> > [  596.628086]  [<ffffffff81108a8c>] ? drain_all_pages+0x1c/0x20
> > [  596.628086]  [<ffffffff81108fad>] ? __alloc_pages_nodemask+0x42d/0x700
> > [  596.628086]  [<ffffffff8113d0f2>] ? kmem_getpages+0x62/0x160
> > [  596.628086]  [<ffffffff8113dce6>] ? fallback_alloc+0x196/0x240
> 
> fallback_alloc() showing up here means that one page allocator call from
> SLAB has already failed.

That may be due to the GFP_THISNODE flag, which includes __GFP_NORETRY
and so may fail the allocation simply because there are many concurrent
page-allocating tasks, not because memory is really short.

The concurrent page-allocating tasks may consume all the pages freed
by try_to_free_pages() inside __alloc_pages_direct_reclaim(), before
the direct-reclaim task is able to get its page with
get_page_from_freelist(). Then should_alloc_retry() returns 0 for
__GFP_NORETRY, which stops further retries.

In theory, __GFP_NORETRY might fail even without other tasks
concurrently stealing the current task's direct-reclaimed pages. The pcp
lists might happen to be sparsely populated (pcp.count ranges from 0 to
pcp.batch), and try_to_free_pages() might not free enough pages to fill
them to the pcp.high watermark, hence no pages are released to the buddy
system and NR_FREE_PAGES is not increased. Then zone_watermark_ok() will
remain false and the allocation fails. Mel's patch to increase the
accuracy of zone_watermark_ok() should help this case.
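
For reference, a minimal sketch of that bail-out, assuming the 2.6.36-era
shape of should_alloc_retry() in mm/page_alloc.c (simplified from memory,
not a verbatim copy of the kernel source):

/*
 * Sketch only: __GFP_NORETRY callers (e.g. slab allocations made with
 * GFP_THISNODE) give up after a single reclaim-and-retry pass instead
 * of looping in the allocator slowpath.
 */
static inline int should_alloc_retry(gfp_t gfp_mask, unsigned int order,
				     unsigned long pages_reclaimed)
{
	/* Do not loop if specifically requested */
	if (gfp_mask & __GFP_NORETRY)
		return 0;

	/* Low orders are retried indefinitely */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

	/*
	 * Higher orders are retried only while __GFP_REPEAT is set and
	 * reclaim is still making progress relative to the request size.
	 */
	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))
		return 1;

	return 0;
}

So once the concurrent allocators have consumed the reclaimed pages, the
first check above fires and the GFP_THISNODE allocation fails without any
further retry.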

> SLAB then did an expensive search through all
> object caches on all nodes to find some available object. There were no
> objects in queues at all therefore SLAB called the page allocator again
> (kmem_getpages()).
> 
> As soon as memory is available (on any node or any cpu, they are all
> empty) SLAB will repopulate its queues(!).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-03  9:08   ` Mel Gorman
@ 2010-09-08  7:43     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08  7:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki

> +	/*
> +	 * If an allocation failed after direct reclaim, it could be because
> +	 * pages are pinned on the per-cpu lists. Drain them and try again
> +	 */
> +	if (!page && !drained) {
> +		drain_all_pages();
> +		drained = true;
> +		goto retry;
> +	}

nit: with slub, get_page_from_freelist() failures happen more frequently
than with slab because slub tries to allocate high-order pages first.
So, I guess we have to avoid drain_all_pages() if __GFP_NORETRY is passed.




From 9209ceb1d48446b031576ba9360036ddabc1a0e5 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Fri, 10 Sep 2010 03:29:05 +0900
Subject: [PATCH] mm: don't call drain_all_pages() when __GFP_NORETRY

SLUB tries to allocate high-order pages first, therefore the page allocator
eventually calls drain_all_pages() frequently. We don't want IPI storms.
Thus, don't call drain_all_pages() when the caller passed __GFP_NORETRY.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8587c10..b9eafb1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1878,7 +1878,7 @@ retry:
 	 * If an allocation failed after direct reclaim, it could be because
 	 * pages are pinned on the per-cpu lists. Drain them and try again
 	 */
-	if (!page && !drained) {
+	if (!page && !drained && !(gfp_mask & __GFP_NORETRY)) {
 		drain_all_pages();
 		drained = true;
 		goto retry;
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-06 21:50                             ` Dave Chinner
@ 2010-09-08  8:49                               ` Dave Chinner
  -1 siblings, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2010-09-08  8:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Wu Fengguang, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Tue, Sep 07, 2010 at 07:50:23AM +1000, Dave Chinner wrote:
> On Mon, Sep 06, 2010 at 09:40:15AM +0100, Mel Gorman wrote:
> > On Mon, Sep 06, 2010 at 02:02:43PM +1000, Dave Chinner wrote:
> > > I just went to grab the CAL counters, and found the system in
> > > another livelock.  This time I managed to start the sysrq-trigger
> > > dump while the livelock was in progress - I basically got one shot
> > > at a command before everything stopped responding. Now I'm waiting
> > > for the livelock to pass.... 5min.... the fs_mark workload
> > > has stopped (ctrl-c finally responded), still livelocked....
> > > 10min.... 15min.... 20min.... OK, back now.
> > > 
> > > Interesting - all the fs_mark processes are in D state waiting on IO
> > > completion processing.
> > 
> > Very interesting, maybe they are all stuck in congestion_wait() this
> > time? There are a few sources where that is possible.
> 
> No, they are waiting on log IO completion, not doing allocation or
> in the VM at all.  They stuck in xlog_get_iclog_state() waiting for
> all the log IO buffers to be processed which are stuck behind the
> inode buffer IO completions in th kworker threads that I posted. 
> 
> This potentially is caused by the kworker thread consolidation - log
> IO completion processing used to be in a separate workqueue for
> processing latency and deadlock prevention reasons - the data and
> metadata IO completion can block, whereas we need the log IO
> completion to occur as quickly as possible. I've seen one deadlock
> that the separate work queues solved w.r.t. loop devices, and I
> suspect that part of the problem here is that transaction completion
> cannot occur (and free the memory it and the CIL holds) because log IO
> completion processing is being delayed significantly by metadata IO
> completion...
.....
> > > Which shows that this wasn't an IPI storm that caused this
> > > particular livelock.
> > 
> > No, but it's possible we got stuck somewhere like too_many_isolated() or
> > in congestion_wait. One thing at a time though, would you mind testing
> > the following patch? I haven't tested this *at all* but it should reduce
> > the number of times drain_all_pages() are called further while not
> > eliminating them entirely.
> 
> Ok, I'll try it later today, but first I think I need to do some
> deeper investigation on the kworker thread behaviour....

Ok, so an update is needed here. I have confirmed that the above
livelock was caused by the kworker thread consolidation, and I have
a fix for it (make the log IO completion processing queue WQ_HIGHPRI
so it gets queued ahead of the data/metadata IO completions), and
I've been able to create over a billion inodes now without a
livelock occurring. See the thread titled "[2.6.36-rc3] Workqueues,
XFS, dependencies and deadlock" if you want more details.
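
For anyone who doesn't want to chase that thread, the shape of the fix is
roughly the sketch below; the function and variable names are illustrative
assumptions rather than the actual XFS patch. The point is simply to give
log IO completion its own WQ_HIGHPRI workqueue so its work items are queued
ahead of the blocking data/metadata IO completion work:

#include <linux/workqueue.h>

/* Illustrative only: names are placeholders, not the real XFS change */
static struct workqueue_struct *xfs_log_wq;

static int xfs_log_wq_init(void)
{
	/*
	 * WQ_HIGHPRI work items are placed at the head of the per-cpu
	 * worklist, so log IO completion no longer queues behind the
	 * data/metadata IO completion handlers that can block.
	 */
	xfs_log_wq = alloc_workqueue("xfslogd", WQ_HIGHPRI, 1);
	return xfs_log_wq ? 0 : -ENOMEM;
}

static void xfs_log_wq_destroy(void)
{
	if (xfs_log_wq)
		destroy_workqueue(xfs_log_wq);
}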

To make sure I've been seeing two different livelocks, I removed
Mel's series from my tree (which still contained the above workqueue
fix), and I started seeing short memory allocation livelocks (10-15s
at most) with abnormal increases in CAL counts indicating an
increase in IPIs during the short livelocks.  IOWs, the livelock
wasn't as severe as before the workqueue fix, but still present.
Hence the workqueue issue was definitely a contributing factor to
the severity of the memory allocation triggered issue.

It is clear that there have been two different livelocks with
different causes triggered by the same test, which has led to a lot of
confusion in this thread. It appears that Mel's patch series as
originally posted in this thread is all that is necessary to avoid
the memory allocation livelock issue I was seeing. The workqueue
fix solves the other livelock I was seeing once Mel's patches were
in place.

Thanks to everyone for helping me track these livelocks down and
providing lots of suggestions for things to try. I'll keep testing
and looking for livelocks, but my confidence is increasing that
we've got to the root of them now. 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-08  7:43     ` KOSAKI Motohiro
@ 2010-09-08 20:05       ` Christoph Lameter
  -1 siblings, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2010-09-08 20:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Wed, 8 Sep 2010, KOSAKI Motohiro wrote:

> nit: when slub, get_page_from_freelist() failure is frequently happen
> than slab because slub try to allocate high order page at first.
> So, I guess we have to avoid drain_all_pages() if __GFP_NORETRY is passed.

SLAB also tries to allocate higher order pages for many slabs but not as
high as SLUB (SLAB does not support fallback to order 0). SLAB also always
uses GFP_THISNODE (which includes GFP_NORETRY).
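
For reference, on NUMA builds of that era GFP_THISNODE is the following
composite (quoting include/linux/gfp.h from memory, so worth double-checking
against the exact tree; on !NUMA builds it collapses to 0):

#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

which is where the implicit __GFP_NORETRY in the SLAB path comes from.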

Your patch will make SLAB's initial call to the page allocator fail more
frequently and therefore will increase the use of fallback_alloc().

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-08  8:49                               ` Dave Chinner
@ 2010-09-09 12:39                                 ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-09 12:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Wed, Sep 08, 2010 at 06:49:23PM +1000, Dave Chinner wrote:
> On Tue, Sep 07, 2010 at 07:50:23AM +1000, Dave Chinner wrote:
> > On Mon, Sep 06, 2010 at 09:40:15AM +0100, Mel Gorman wrote:
> > > On Mon, Sep 06, 2010 at 02:02:43PM +1000, Dave Chinner wrote:
> > > > I just went to grab the CAL counters, and found the system in
> > > > another livelock.  This time I managed to start the sysrq-trigger
> > > > dump while the livelock was in progress - I basically got one shot
> > > > at a command before everything stopped responding. Now I'm waiting
> > > > for the livelock to pass.... 5min.... the fs_mark workload
> > > > has stopped (ctrl-c finally responded), still livelocked....
> > > > 10min.... 15min.... 20min.... OK, back now.
> > > > 
> > > > Interesting - all the fs_mark processes are in D state waiting on IO
> > > > completion processing.
> > > 
> > > Very interesting, maybe they are all stuck in congestion_wait() this
> > > time? There are a few sources where that is possible.
> > 
> > No, they are waiting on log IO completion, not doing allocation or
> > in the VM at all.  They stuck in xlog_get_iclog_state() waiting for
> > all the log IO buffers to be processed which are stuck behind the
> > inode buffer IO completions in th kworker threads that I posted. 
> > 
> > This potentially is caused by the kworker thread consolidation - log
> > IO completion processing used to be in a separate workqueue for
> > processing latency and deadlock prevention reasons - the data and
> > metadata IO completion can block, whereas we need the log IO
> > completion to occur as quickly as possible. I've seen one deadlock
> > that the separate work queues solved w.r.t. loop devices, and I
> > suspect that part of the problem here is that transaction completion
> > cannot occur (and free the memory it and the CIL holds) because log IO
> > completion processing is being delayed significantly by metadata IO
> > completion...
> .....
> > > > Which shows that this wasn't an IPI storm that caused this
> > > > particular livelock.
> > > 
> > > No, but it's possible we got stuck somewhere like too_many_isolated() or
> > > in congestion_wait. One thing at a time though, would you mind testing
> > > the following patch? I haven't tested this *at all* but it should reduce
> > > the number of times drain_all_pages() are called further while not
> > > eliminating them entirely.
> > 
> > Ok, I'll try it later today, but first I think I need to do some
> > deeper investigation on the kworker thread behaviour....
> 
> Ok, so an update is needed here. I have confirmed that the above
> livelock was caused by the kworker thread consolidation, and I have
> a fix for it (make the log IO completion processing queue WQ_HIGHPRI
> so it gets queued ahead of the data/metadata IO completions), and
> I've been able to create over a billion inodes now without a
> livelock occurring. See the thread titled "[2.6.36-rc3] Workqueues,
> XFS, dependencies and deadlock" if you want more details.
> 

Good stuff. I read through the thread and it seemed reasonable.

> To make sure I've been seeing two different livelocks, I removed
> Mel's series from my tree (which still contained the above workqueue
> fix), and I started seeing short memory allocation livelocks (10-15s
> at most) with abnormal increases in CAL counts indication an
> increase in IPIs during the short livelocks.  IOWs, the livelock
> was't as severe as before the workqueue fix, but still present.
> Hence the workqueue issue was definitely a contributing factor to
> the severity of the memory allocation triggered issue.
> 

Good. Considering that this class of bugs in either the page allocator
or page reclaim can be down to timing, it makes sense that a big change
in ordering of events could compound problems in the VM.

> It is clear that there have been two different livelocks with
> different caused by the same test, which has led to a lot of
> confusion in this thread. It appears that Mel's patch series as
> originally posted in this thread is all that is necessary to avoid
> the memory allocation livelock issue I was seeing. The workqueue
> fix solves the other livelock I was seeing once Mel's patches were
> in place.
> 
> Thanks to everyone for helping me track these livelocks down and
> providing lots of suggestions for things to try. I'll keep testing
> and looking for livelocks, but my confidence is increasing that
> we've got to the root of them now. 
> 

It has been pointed out that the fix potentially increases the number of
IPIs sent. On larger machines, I worry that these delays could be severe
and we'll see other problems down the line. Hence, I'd like to reduce
the number of calls to drain_all_pages() without eliminating them
entirely. I'm currently in the process of testing the following patch
but can you try it as well please?

In particular, I am curious to see if the performance of fs_mark
improves any and if the interrupt counts drop as a result of the patch.

Thanks

==== CUT HERE ====
mm: page allocator: Reduce the instances where drain_all_pages() is called

When a page allocation fails after direct reclaim, the per-cpu lists are
drained and another attempt made to allocate. On larger systems,
this can cause IPI storms in low-memory situations with latencies
increasing the more CPUs there are on the system. In extreme situations,
it is suspected it could cause livelock-like situations.

This patch restores the older behaviour of calling drain_all_pages() after
direct reclaim fails only for high-order allocations. As there is an
expectation that lower orders will free naturally, the drain only occurs for
order > PAGE_ALLOC_COSTLY_ORDER. The reasoning is that the allocation is
already expected to be very expensive and rare, so there will not be a
resulting IPI storm. Calls to drain_all_pages() are not eliminated, as it is
still the case that an allocation can fail because the necessary pages are
pinned in the per-cpu lists. After this patch, the lists are only drained as
a last resort before calling the OOM killer.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   23 ++++++++++++++++++++---
 1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 750e1dc..16f516c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1737,6 +1737,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	int migratetype)
 {
 	struct page *page;
+	bool drained = false;
 
 	/* Acquire the OOM killer lock for the zones in zonelist */
 	if (!try_set_zonelist_oom(zonelist, gfp_mask)) {
@@ -1744,6 +1745,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 	}
 
+retry:
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
@@ -1773,6 +1775,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
+
+	/*
+	 * If an allocation failed, it could be because pages are pinned on
+	 * the per-cpu lists. Before resorting to the OOM killer, try
+	 * draining them first.
+	 */
+	if (!drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	/* Exhausted what can be done so it's blamo time */
 	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
@@ -1876,10 +1890,13 @@ retry:
 					migratetype);
 
 	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 * If a high-order allocation failed after direct reclaim, it could
+	 * be because pages are pinned on the per-cpu lists. However, only
+	 * do it for PAGE_ALLOC_COSTLY_ORDER as the cost of the IPI needed
+	 * to drain the pages is itself high. Assume that lower orders
+	 * will naturally free without draining.
 	 */
-	if (!page && !drained) {
+	if (!page && !drained && order > PAGE_ALLOC_COSTLY_ORDER) {
 		drain_all_pages();
 		drained = true;
 		goto retry;

^ permalink raw reply related	[flat|nested] 104+ messages in thread
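For readers who do not want to apply the diff mentally, the following is a
minimal standalone sketch of the two retry paths the patch above produces.
It is illustrative only: the *_stub() names and the simplified control flow
are placeholders for the real allocator internals, and only the drain/retry
decisions mirror the diff (PAGE_ALLOC_COSTLY_ORDER is 3 on 2.6.36).

#include <stdbool.h>
#include <stddef.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* value used by the 2.6.36 kernel */

struct page;
static struct page *get_page_from_freelist_stub(unsigned int order)
{
	(void)order;
	return NULL;			/* stand-in for the real freelist walk */
}
static void drain_all_pages_stub(void) { }	/* IPIs every CPU to flush its pcp lists */
static void out_of_memory_stub(void) { }	/* stand-in for the OOM killer */

/* OOM path: drain once as a last resort before killing anything. */
static struct page *may_oom_sketch(unsigned int order)
{
	bool drained = false;
	struct page *page;

retry:
	page = get_page_from_freelist_stub(order);
	if (page)
		return page;
	if (!drained) {
		drain_all_pages_stub();
		drained = true;
		goto retry;
	}
	out_of_memory_stub();
	return NULL;
}

/* Direct-reclaim path: only costly (order > 3) requests pay for the IPI. */
static struct page *direct_reclaim_sketch(unsigned int order)
{
	bool drained = false;
	struct page *page;

retry:
	page = get_page_from_freelist_stub(order);
	if (!page && !drained && order > PAGE_ALLOC_COSTLY_ORDER) {
		drain_all_pages_stub();
		drained = true;
		goto retry;
	}
	return page;
}

Nothing else about __alloc_pages_may_oom() or __alloc_pages_direct_reclaim()
is represented here.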

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-08  7:43     ` KOSAKI Motohiro
@ 2010-09-09 12:41       ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-09 12:41 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki

On Wed, Sep 08, 2010 at 04:43:03PM +0900, KOSAKI Motohiro wrote:
> > +	/*
> > +	 * If an allocation failed after direct reclaim, it could be because
> > +	 * pages are pinned on the per-cpu lists. Drain them and try again
> > +	 */
> > +	if (!page && !drained) {
> > +		drain_all_pages();
> > +		drained = true;
> > +		goto retry;
> > +	}
> 
> nit: with slub, get_page_from_freelist() failures happen more frequently
> than with slab because slub tries to allocate a high-order page first.
> So, I guess we have to avoid drain_all_pages() if __GFP_NORETRY is passed.
> 

The old behaviour applied only to high-order allocations, which one would
assume did not have __GFP_NORETRY specified except in very rare cases. Still,
calling drain_all_pages() raises interrupt counts and I worried that large
machines might exhibit some livelock-like problem. I'm considering the
following patch; what do you think?

==== CUT HERE ====
mm: page allocator: Reduce the instances where drain_all_pages() is called

When a page allocation fails after direct reclaim, the per-cpu lists are
drained and another attempt is made to allocate. On larger systems,
this can cause IPI storms in low-memory situations, with latencies
increasing the more CPUs there are on the system. In extreme situations,
it is suspected it could cause livelock-like situations.

This patch restores the older behaviour of calling drain_all_pages() after
direct reclaim fails only for high-order allocations. As there is an
expectation that lower-order pages will free naturally, the drain only
occurs for order > PAGE_ALLOC_COSTLY_ORDER. The reasoning is that the
allocation is already expected to be very expensive and rare so there will
not be a resulting IPI storm. Calls to drain_all_pages() are not eliminated
entirely as it is still the case that an allocation can fail because the
necessary pages are pinned in the per-cpu list. After this patch, the lists
are only drained as a last resort before calling the OOM killer.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   23 ++++++++++++++++++++---
 1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 750e1dc..16f516c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1737,6 +1737,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	int migratetype)
 {
 	struct page *page;
+	bool drained = false;
 
 	/* Acquire the OOM killer lock for the zones in zonelist */
 	if (!try_set_zonelist_oom(zonelist, gfp_mask)) {
@@ -1744,6 +1745,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 	}
 
+retry:
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
@@ -1773,6 +1775,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
+
+	/*
+	 * If an allocation failed, it could be because pages are pinned on
+	 * the per-cpu lists. Before resorting to the OOM killer, try
+	 * draining them first.
+	 */
+	if (!drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	/* Exhausted what can be done so it's blamo time */
 	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
@@ -1876,10 +1890,13 @@ retry:
 					migratetype);
 
 	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 * If a high-order allocation failed after direct reclaim, it could
+	 * be because pages are pinned on the per-cpu lists. However, only
+	 * do it for PAGE_ALLOC_COSTLY_ORDER as the cost of the IPI needed
+	 * to drain the pages is itself high. Assume that lower orders
+	 * will naturally free without draining.
 	 */
-	if (!page && !drained) {
+	if (!page && !drained && order > PAGE_ALLOC_COSTLY_ORDER) {
 		drain_all_pages();
 		drained = true;
 		goto retry;

^ permalink raw reply related	[flat|nested] 104+ messages in thread
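KOSAKI's nit above suggests keying the drain on the caller's gfp flags rather
than on the order. No such patch was posted in the thread; the following is
only a sketch of what that variant might look like. GFP_NORETRY_STUB and the
function names are placeholders, not real kernel identifiers, and slub is
understood to pass __GFP_NORETRY for its opportunistic high-order attempts.

#include <stdbool.h>
#include <stddef.h>

#define GFP_NORETRY_STUB 0x1000u	/* placeholder for __GFP_NORETRY */

struct page;
static struct page *freelist_stub(unsigned int order)
{
	(void)order;
	return NULL;
}
static void drain_all_pages_stub(void) { }

/*
 * Variant of the direct-reclaim retry: skip the IPI-heavy drain when the
 * caller asked not to retry, instead of keying the decision on the order.
 */
static struct page *direct_reclaim_noretry_sketch(unsigned int order,
						  unsigned int gfp_mask)
{
	bool drained = false;
	struct page *page;

retry:
	page = freelist_stub(order);
	if (!page && !drained && !(gfp_mask & GFP_NORETRY_STUB)) {
		drain_all_pages_stub();
		drained = true;
		goto retry;
	}
	return page;
}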

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-09 12:41       ` Mel Gorman
@ 2010-09-09 13:45         ` Christoph Lameter
  -1 siblings, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2010-09-09 13:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Thu, 9 Sep 2010, Mel Gorman wrote:

> @@ -1876,10 +1890,13 @@ retry:
>  					migratetype);
>
>  	/*
> -	 * If an allocation failed after direct reclaim, it could be because
> -	 * pages are pinned on the per-cpu lists. Drain them and try again
> +	 * If a high-order allocation failed after direct reclaim, it could
> +	 * be because pages are pinned on the per-cpu lists. However, only
> +	 * do it for PAGE_ALLOC_COSTLY_ORDER as the cost of the IPI needed
> +	 * to drain the pages is itself high. Assume that lower orders
> +	 * will naturally free without draining.
>  	 */
> -	if (!page && !drained) {
> +	if (!page && !drained && order > PAGE_ALLOC_COSTLY_ORDER) {
>  		drain_all_pages();
>  		drained = true;
>  		goto retry;
>

This will have the effect of never sending IPIs for slab allocations since
they do not do allocations for orders > PAGE_ALLOC_COSTLY_ORDER.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-09 13:45         ` Christoph Lameter
@ 2010-09-09 13:55           ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-09 13:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Thu, Sep 09, 2010 at 08:45:16AM -0500, Christoph Lameter wrote:
> On Thu, 9 Sep 2010, Mel Gorman wrote:
> 
> > @@ -1876,10 +1890,13 @@ retry:
> >  					migratetype);
> >
> >  	/*
> > -	 * If an allocation failed after direct reclaim, it could be because
> > -	 * pages are pinned on the per-cpu lists. Drain them and try again
> > +	 * If a high-order allocation failed after direct reclaim, it could
> > +	 * be because pages are pinned on the per-cpu lists. However, only
> > +	 * do it for PAGE_ALLOC_COSTLY_ORDER as the cost of the IPI needed
> > +	 * to drain the pages is itself high. Assume that lower orders
> > +	 * will naturally free without draining.
> >  	 */
> > -	if (!page && !drained) {
> > +	if (!page && !drained && order > PAGE_ALLOC_COSTLY_ORDER) {
> >  		drain_all_pages();
> >  		drained = true;
> >  		goto retry;
> >
> 
> This will have the effect of never sending IPIs for slab allocations since
> they do not do allocations for orders > PAGE_ALLOC_COSTLY_ORDER.
>  

The question is how severe is that? There is somewhat of an expectation
that the lower orders free naturally, so is the IPI justified? That said,
our historical behaviour would have looked like

if (!page && !drained && order) {
	drain_all_pages();
	drained = true;
	goto retry;
}

Play it safe for now and go with that?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread
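To make the trade-off being discussed concrete, this throwaway program (not
from the thread) prints which orders would trigger a drain under the
historical guard versus the guard in the posted patch, assuming
PAGE_ALLOC_COSTLY_ORDER is 3 as on 2.6.36. Orders 1-3, which cover slab and
slub, drain under the historical rule but not under the proposed one.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* 2.6.36 value */

/* historical: drain for any order > 0 */
static bool drain_historical(unsigned int order)
{
	return order > 0;
}

/* proposed: drain only above the costly order */
static bool drain_proposed(unsigned int order)
{
	return order > PAGE_ALLOC_COSTLY_ORDER;
}

int main(void)
{
	unsigned int order;

	for (order = 0; order <= 5; order++)
		printf("order %u: historical=%d proposed=%d\n",
		       order, drain_historical(order), drain_proposed(order));
	return 0;
}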

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-09 13:55           ` Mel Gorman
@ 2010-09-09 14:32             ` Christoph Lameter
  -1 siblings, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2010-09-09 14:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Thu, 9 Sep 2010, Mel Gorman wrote:

> > This will have the effect of never sending IPIs for slab allocations since
> > they do not do allocations for orders > PAGE_ALLOC_COSTLY_ORDER.
> >
>
> The question is how severe is that? There is somewhat of an expectation
> that the lower orders free naturally, so is the IPI justified? That said,
> our historical behaviour would have looked like
>
> if (!page && !drained && order) {
> 	drain_all_pages();
> 	drained = true;
> 	goto retry;
> }
>
> Play it safe for now and go with that?

I am fine with no IPIs for order <= COSTLY. Just be aware that this is
a change that may have some side effects. Let's run some tests and see
how it affects the issues that we are seeing.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-09 14:32             ` Christoph Lameter
@ 2010-09-09 15:05               ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-09 15:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki

On Thu, Sep 09, 2010 at 09:32:52AM -0500, Christoph Lameter wrote:
> On Thu, 9 Sep 2010, Mel Gorman wrote:
> 
> > > This will have the effect of never sending IPIs for slab allocations since
> > > they do not do allocations for orders > PAGE_ALLOC_COSTLY_ORDER.
> > >
> >
> > The question is how severe is that? There is somewhat of an expectation
> > that the lower orders free naturally, so is the IPI justified? That said,
> > our historical behaviour would have looked like
> >
> > if (!page && !drained && order) {
> > 	drain_all_pages();
> > 	drained = true;
> > 	goto retry;
> > }
> >
> > Play it safe for now and go with that?
> 
> I am fine with no IPIs for order <= COSTLY. Just be aware that this is
> a change that may have some side effects.

I made the choice consciously. I felt that if slab or slub were depending on
IPIs to make successful allocations in low-memory conditions, they would
experience varying stalls on bigger machines due to increased interrupts that
might be difficult to diagnose while not necessarily improving allocation
success rates. I also considered that if the machine is under pressure then
slab and slub may also be releasing pages of the same order and effectively
recycling their pages without depending on IPIs.

> Let's run some tests and see
> how it affects the issues that we are seeing.
> 

Perfect, thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-09 15:05               ` Mel Gorman
@ 2010-09-10  2:56                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10  2:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Christoph Lameter, Andrew Morton,
	Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, KAMEZAWA Hiroyuki

> On Thu, Sep 09, 2010 at 09:32:52AM -0500, Christoph Lameter wrote:
> > On Thu, 9 Sep 2010, Mel Gorman wrote:
> > 
> > > > This will have the effect of never sending IPIs for slab allocations since
> > > > they do not do allocations for orders > PAGE_ALLOC_COSTLY_ORDER.
> > > >
> > >
> > > The question is how severe is that? There is somewhat of an expectation
> > > that the lower orders free naturally, so is the IPI justified? That said,
> > > our historical behaviour would have looked like
> > >
> > > if (!page && !drained && order) {
> > > 	drain_all_pages();
> > > 	drained = true;
> > > 	goto retry;
> > > }
> > >
> > > Play it safe for now and go with that?
> > 
> > I am fine with no IPIs for order <= COSTLY. Just be aware that this is
> > a change that may have some side effects.
> 
> I made the choice consciously. I felt that if slab or slub were depending on
> IPIs to make successful allocations in low-memory conditions that it would
> experience varying stalls on bigger machines due to increased interrupts that
> might be difficult to diagnose while not necessarily improving allocation
> success rates. I also considered that if the machine is under pressure then
> slab and slub may also be releasing pages of the same order and effectively
> recycling their pages without depending on IPIs.

+1.

These days, the average number of CPUs is increasing, so we need to be
more wary of IPI storms than in the past.




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-09-09 12:39                                 ` Mel Gorman
@ 2010-09-10  6:17                                   ` Dave Chinner
  -1 siblings, 0 replies; 104+ messages in thread
From: Dave Chinner @ 2010-09-10  6:17 UTC (permalink / raw)
  To: Wu Fengguang, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, David Rientjes

On Thu, Sep 09, 2010 at 01:39:10PM +0100, Mel Gorman wrote:
> It has been pointed out that the fix potentially increases the number of
> IPIs sent. On larger machines, I worry that these delays could be severe
> and we'll see other problems down the line. Hence, I'd like to reduce
> the number of calls to drain_all_pages() without eliminating them
> entirely. I'm currently in the process of testing the following patch
> but can you try it as well please?
> 
> In particular, I am curious to see if the performance of fs_mark
> improves any and if the interrupt counts drop as a result of the patch.

The interrupt counts have definitely dropped - this is after
creating 200M inodes and then removing them all:

CAL:      11154 10596 11804 15366 10048 12916 13049 9864

That's in the same ballpark as a single 50M inode create run without
the patch.

Performance seems a bit lower, though (2-3% maybe less), and CPU
usage seems a bit higher (stays much closer to 800% than 700-750%
without the patch). Those are subjective observations from watching
graphs and counters, so take them with a grain of salt.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 104+ messages in thread
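The CAL row Dave quotes is the per-CPU function-call-interrupt line from
/proc/interrupts. A throwaway reader like the one below (illustrative only,
not part of any patch in this thread) is enough to snapshot it before and
after a run:

#include <stdio.h>
#include <string.h>

/* Print the per-CPU function call interrupt (CAL) counts. */
int main(void)
{
	char line[4096];
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f) {
		perror("/proc/interrupts");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		const char *p = line;

		while (*p == ' ')
			p++;
		if (strncmp(p, "CAL:", 4) == 0) {
			fputs(line, stdout);	/* same format as the numbers above */
			break;
		}
	}
	fclose(f);
	return 0;
}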

* Re: [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4
  2010-09-03 23:05   ` Andrew Morton
@ 2010-09-21 11:17     ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-21 11:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, stable, Dave Chinner

On Fri, Sep 03, 2010 at 04:05:51PM -0700, Andrew Morton wrote:
> On Fri,  3 Sep 2010 10:08:43 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The noteworthy change is to patch 2 which now uses the generic
> > zone_page_state_snapshot() in zone_nr_free_pages(). Similar logic still
> > applies for *when* zone_page_state_snapshot() to avoid ovedhead.
> > 
> > Changelog since V3
> >   o Use generic helper for NR_FREE_PAGES estimate when necessary
> > 
> > Changelog since V2
> >   o Minor clarifications
> >   o Rebase to 2.6.36-rc3
> > 
> > Changelog since V1
> >   o Fix for !CONFIG_SMP
> >   o Correct spelling mistakes
> >   o Clarify a ChangeLog
> >   o Only check for counter drift on machines large enough for the counter
> >     drift to breach the min watermark when NR_FREE_PAGES report the low
> >     watermark is fine
> > 
> > Internal IBM test teams beta testing distribution kernels have reported
> > problems on machines with a large number of CPUs whereby page allocator
> > failure messages show huge differences between the nr_free_pages vmstat
> > counter and what is available on the buddy lists. In an extreme example,
> > nr_free_pages was above the min watermark but zero pages were on the buddy
> > lists allowing the system to potentially livelock unable to make forward
> > progress unless an allocation succeeds. There is no reason why the problems
> > would not affect mainline so the following series mitigates the problems
> > in the page allocator related to to per-cpu counter drift and lists.
> > 
> > The first patch ensures that counters are updated after pages are added to
> > free lists.
> > 
> > The second patch notes that the counter drift between nr_free_pages and what
> > is on the per-cpu lists can be very high. When memory is low and kswapd
> > is awake, the per-cpu counters are checked as well as reading the value
> > of NR_FREE_PAGES. This will slow the page allocator when memory is low and
> > kswapd is awake but it will be much harder to breach the min watermark and
> > potentially livelock the system.
> > 
> > The third patch notes that after direct-reclaim an allocation can
> > fail because the necessary pages are on the per-cpu lists. After a
> > direct-reclaim-and-allocation-failure, the per-cpu lists are drained and
> > a second attempt is made.
> > 
> > Performance tests against 2.6.36-rc3 did not show up anything interesting. A
> > version of this series that continually called vmstat_update() when
> > memory was low was tested internally and found to help the counter drift
> > problem. I described this during LSF/MM Summit and the potential for IPI
> > storms was frowned upon. An alternative fix is in patch two which uses
> > for_each_online_cpu() to read the vmstat deltas while memory is low and
> > kswapd is awake. This should be functionally similar.
> > 
> > This patch should be merged after the patch "vmstat : update
> > zone stat threshold at onlining a cpu" which is in mmotm as
> > vmstat-update-zone-stat-threshold-when-onlining-a-cpu.patch .
> > 
> > If we can agree on it, this series is a stable candidate.
> 
> (cc stable@kernel.org)
> 
> >  include/linux/mmzone.h |   13 +++++++++++++
> >  include/linux/vmstat.h |   22 ++++++++++++++++++++++
> >  mm/mmzone.c            |   21 +++++++++++++++++++++
> >  mm/page_alloc.c        |   29 +++++++++++++++++++++--------
> >  mm/vmstat.c            |   15 ++++++++++++++-
> >  5 files changed, 91 insertions(+), 9 deletions(-)
> 
> For the entire patch series I get
> 
>  include/linux/mmzone.h |   13 +++++++++++++
>  include/linux/vmstat.h |   22 ++++++++++++++++++++++
>  mm/mmzone.c            |   21 +++++++++++++++++++++
>  mm/page_alloc.c        |   33 +++++++++++++++++++++++----------
>  mm/vmstat.c            |   16 +++++++++++++++-
>  5 files changed, 94 insertions(+), 11 deletions(-)
> 
> The patches do apply OK to 2.6.35.
> 
> Given the extent and the coreness of it all, it's a bit more than I'd
> usually push at the -stable guys.  But I guess that if the patches fix
> all the issues you've noted, as well as David's "minute-long livelocks
> in memory reclaim" then yup, it's worth backporting it all.
> 

These patches have made it to mainline as the following commits.

9ee493c mm: page allocator: drain per-cpu lists after direct reclaim allocation fails
aa45484 mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
72853e2 mm: page allocator: update free page counters after pages are placed on the free list

I have not heard from the -stable guys; is there a reasonable
expectation that they'll be picked up?

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [stable] [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4
  2010-09-21 12:58       ` Greg KH
@ 2010-09-21 14:23         ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-21 14:23 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrew Morton, Rik van Riel, Christoph Lameter, Dave Chinner,
	Linux Kernel List, linux-mm, Minchan Kim, KOSAKI Motohiro,
	Johannes Weiner, stable, KAMEZAWA Hiroyuki

On Tue, Sep 21, 2010 at 05:58:14AM -0700, Greg KH wrote:
> On Tue, Sep 21, 2010 at 12:17:41PM +0100, Mel Gorman wrote:
> > On Fri, Sep 03, 2010 at 04:05:51PM -0700, Andrew Morton wrote:
> > > On Fri,  3 Sep 2010 10:08:43 +0100
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > > > The noteworthy change is to patch 2 which now uses the generic
> > > > zone_page_state_snapshot() in zone_nr_free_pages(). Similar logic still
> > > > applies for *when* zone_page_state_snapshot() is used, to avoid overhead.
> > > > 
> > > > Changelog since V3
> > > >   o Use generic helper for NR_FREE_PAGES estimate when necessary
> > > > 
> > > > Changelog since V2
> > > >   o Minor clarifications
> > > >   o Rebase to 2.6.36-rc3
> > > > 
> > > > Changelog since V1
> > > >   o Fix for !CONFIG_SMP
> > > >   o Correct spelling mistakes
> > > >   o Clarify a ChangeLog
> > > >   o Only check for counter drift on machines large enough for the counter
> > > >     drift to breach the min watermark when NR_FREE_PAGES reports the low
> > > >     watermark is fine
> > > > 
> > > > Internal IBM test teams beta testing distribution kernels have reported
> > > > problems on machines with a large number of CPUs whereby page allocator
> > > > failure messages show huge differences between the nr_free_pages vmstat
> > > > counter and what is available on the buddy lists. In an extreme example,
> > > > nr_free_pages was above the min watermark but zero pages were on the buddy
> > > > lists allowing the system to potentially livelock unable to make forward
> > > > progress unless an allocation succeeds. There is no reason why the problems
> > > > would not affect mainline so the following series mitigates the problems
> > > > in the page allocator related to per-cpu counter drift and lists.
> > > > 
> > > > The first patch ensures that counters are updated after pages are added to
> > > > free lists.
> > > > 
> > > > The second patch notes that the counter drift between nr_free_pages and what
> > > > is on the per-cpu lists can be very high. When memory is low and kswapd
> > > > is awake, the per-cpu counters are checked as well as reading the value
> > > > of NR_FREE_PAGES. This will slow the page allocator when memory is low and
> > > > kswapd is awake but it will be much harder to breach the min watermark and
> > > > potentially livelock the system.
> > > > 
> > > > The third patch notes that after direct-reclaim an allocation can
> > > > fail because the necessary pages are on the per-cpu lists. After a
> > > > direct-reclaim-and-allocation-failure, the per-cpu lists are drained and
> > > > a second attempt is made.
> > > > 
> > > > Performance tests against 2.6.36-rc3 did not show up anything interesting. A
> > > > version of this series that continually called vmstat_update() when
> > > > memory was low was tested internally and found to help the counter drift
> > > > problem. I described this during LSF/MM Summit and the potential for IPI
> > > > storms was frowned upon. An alternative fix is in patch two which uses
> > > > for_each_online_cpu() to read the vmstat deltas while memory is low and
> > > > kswapd is awake. This should be functionally similar.
> > > > 
> > > > This patch should be merged after the patch "vmstat : update
> > > > zone stat threshold at onlining a cpu" which is in mmotm as
> > > > vmstat-update-zone-stat-threshold-when-onlining-a-cpu.patch .
> > > > 
> > > > If we can agree on it, this series is a stable candidate.
> > > 
> > > (cc stable@kernel.org)
> > > 
> > > >  include/linux/mmzone.h |   13 +++++++++++++
> > > >  include/linux/vmstat.h |   22 ++++++++++++++++++++++
> > > >  mm/mmzone.c            |   21 +++++++++++++++++++++
> > > >  mm/page_alloc.c        |   29 +++++++++++++++++++++--------
> > > >  mm/vmstat.c            |   15 ++++++++++++++-
> > > >  5 files changed, 91 insertions(+), 9 deletions(-)
> > > 
> > > For the entire patch series I get
> > > 
> > >  include/linux/mmzone.h |   13 +++++++++++++
> > >  include/linux/vmstat.h |   22 ++++++++++++++++++++++
> > >  mm/mmzone.c            |   21 +++++++++++++++++++++
> > >  mm/page_alloc.c        |   33 +++++++++++++++++++++++----------
> > >  mm/vmstat.c            |   16 +++++++++++++++-
> > >  5 files changed, 94 insertions(+), 11 deletions(-)
> > > 
> > > The patches do apply OK to 2.6.35.
> > > 
> > > Given the extent and the coreness of it all, it's a bit more than I'd
> > > usually push at the -stable guys.  But I guess that if the patches fix
> > > all the issues you've noted, as well as David's "minute-long livelocks
> > > in memory reclaim" then yup, it's worth backporting it all.
> > > 
> > 
> > These patches have made it to mainline as the following commits.
> > 
> > 9ee493c mm: page allocator: drain per-cpu lists after direct reclaim allocation fails
> > aa45484 mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
> > 72853e2 mm: page allocator: update free page counters after pages are placed on the free list
> > 
> > I have not heard from the -stable guys, is there a reasonable
> > expectation that they'll be picked up?
> 
> If you ask me, then I'll know to give a response :)
> 

Hi Greg,

I would ask you directly but I didn't want anyone else on stable@ to
feel left out :)

> None of these were tagged as going to the stable tree, should I include
> them? 

Yes please unless there is a late objection. The patches were first developed
as a result of a distro bug whose kernel was based on 2.6.32.  There was
every indication this affected mainline as well. The details of the testing
are above.
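
For reference, the stable "tag" in question is just a Cc: line in the
sign-off area of each patch; had the patches carried it, the footer would
have looked something like this (illustrative only):

	Cc: stable@kernel.org
	Signed-off-by: Mel Gorman <mel@csn.ul.ie>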

Dave Chinner had also reported problems with livelocks in reclaim that
looked like IPI storms. There were two major factors at play and these
patches addressed one of them. It works out as both a bug and a
performance fix.

> If so, for which -stable tree?  .27, .32, and .35 are all
> currently active.
> 

2.6.35 for certain.

I would have a strong preference for 2.6.32 as well, as it's a baseline for
a number of distros. The second commit will conflict with per-cpu changes
but the resolution is straight-forward.

In mm/vmstat.c, the per-cpu line that conflicts should, once resolved,
look like:

	zone_pcp(zone, cpu)->stat_threshold = threshold;
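
(For comparison, the corresponding mainline line after the per-cpu pageset
conversion is, if memory serves:

	per_cpu_ptr(zone->pageset, cpu)->stat_threshold = threshold;

so the conflict is only in how the per-cpu structure is reached.)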

vmstat.h will also fail to build due to per-cpu changes. A rebased
version looks like:

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                        enum zone_stat_item item)
{
        long x = atomic_long_read(&zone->vm_stat[item]);

#ifdef CONFIG_SMP
        int cpu;
        for_each_online_cpu(cpu)
                x += zone_pcp(zone, cpu)->vm_stat_diff[item];

        if (x < 0)
                x = 0;
#endif
        return x;
}
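
For reference, the mainline (2.6.36-era) version of the helper is the same
apart from that per-cpu accessor; roughly, from memory of the upstream
patch, so treat it as a sketch rather than a literal quote:

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                        enum zone_stat_item item)
{
        long x = atomic_long_read(&zone->vm_stat[item]);

#ifdef CONFIG_SMP
        int cpu;

        /* fold in the not-yet-flushed per-cpu deltas for a closer estimate */
        for_each_online_cpu(cpu)
                x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

        if (x < 0)
                x = 0;
#endif
        return x;
}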

A rebased version of patch 2 against 2.6.32.21 is below.

I do not know who the users of 2.6.27.x are so I have no strong opinions
on whether they need these patches or not.

Thanks Greg.

==== CUT HERE ====
mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists.  To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold.  On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is much higher than the
number of pages actually free on the buddy lists, the VM can allocate pages
below the min watermark, at worst reducing the real number of free pages to
zero.  Even if the OOM killer kills some victim to free memory, it may not
free memory if the exit path requires a new page, resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken.  The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:

	mm/vmstat.c
---
 include/linux/mmzone.h |   13 +++++++++++++
 include/linux/vmstat.h |   22 ++++++++++++++++++++++
 mm/mmzone.c            |   21 +++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |   15 ++++++++++++++-
 5 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6f75617..6c31a2a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -290,6 +290,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];
 
 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -460,6 +467,12 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2d0f222..13070d6 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -166,6 +166,28 @@ static inline unsigned long zone_page_state(struct zone *zone,
 	return x;
 }
 
+/*
+ * More accurate version that also considers the currently pending
+ * deltas. For that we need to loop over all cpus to find the current
+ * deltas. There is no synchronization so the result cannot be
+ * exactly accurate either.
+ */
+static inline unsigned long zone_page_state_snapshot(struct zone *zone,
+					enum zone_stat_item item)
+{
+	long x = atomic_long_read(&zone->vm_stat[item]);
+
+#ifdef CONFIG_SMP
+	int cpu;
+	for_each_online_cpu(cpu)
+		x += zone_pcp(zone, cpu)->vm_stat_diff[item];
+
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
 extern unsigned long global_reclaimable_pages(void);
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..e35bfb8 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,24 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, it is considered the zone is under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * per-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
+		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+	return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 542fc4d..ed53cfd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1366,7 +1366,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
 	if (alloc_flags & ALLOC_HIGH)
@@ -2239,7 +2239,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c81321f..42d76c6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -136,10 +136,23 @@ static void refresh_zone_stat_thresholds(void)
 	int threshold;
 
 	for_each_populated_zone(zone) {
+		unsigned long max_drift, tolerate_drift;
+
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
 			zone_pcp(zone, cpu)->stat_threshold = threshold;
+
+		/*
+		 * Only set percpu_drift_mark if there is a danger that
+		 * NR_FREE_PAGES reports the low watermark is ok when in fact
+		 * the min watermark could be breached by an allocation
+		 */
+		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+		max_drift = num_online_cpus() * threshold;
+		if (max_drift > tolerate_drift)
+			zone->percpu_drift_mark = high_wmark_pages(zone) +
+					max_drift;
 	}
 }
 
@@ -715,7 +728,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
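
As a back-of-the-envelope illustration of why only large machines need the
snapshot (the CPU count, threshold and watermark gap below are invented for
illustration; calculate_threshold() picks the real per-zone threshold):

#include <stdio.h>

int main(void)
{
	unsigned long cpus = 64;		/* assumed large machine */
	unsigned long threshold = 125;		/* assumed per-cpu stat threshold */
	unsigned long wmark_gap = 1024;		/* assumed low - min gap, in pages */
	unsigned long max_drift = cpus * threshold;

	printf("worst-case NR_FREE_PAGES drift: %lu pages (~%lu MB at 4K pages)\n",
	       max_drift, max_drift >> 8);
	printf("percpu_drift_mark needed: %s\n",
	       max_drift > wmark_gap ? "yes" : "no");
	return 0;
}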

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [stable] [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4
  2010-09-21 14:23         ` Mel Gorman
@ 2010-09-23 18:49           ` Greg KH
  -1 siblings, 0 replies; 104+ messages in thread
From: Greg KH @ 2010-09-23 18:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Christoph Lameter, Dave Chinner, Linux Kernel List,
	linux-mm, Minchan Kim, KOSAKI Motohiro, Johannes Weiner,
	Andrew Morton, stable, KAMEZAWA Hiroyuki

On Tue, Sep 21, 2010 at 03:23:09PM +0100, Mel Gorman wrote:
> On Tue, Sep 21, 2010 at 05:58:14AM -0700, Greg KH wrote:
> > On Tue, Sep 21, 2010 at 12:17:41PM +0100, Mel Gorman wrote:
> > > On Fri, Sep 03, 2010 at 04:05:51PM -0700, Andrew Morton wrote:
> > > > On Fri,  3 Sep 2010 10:08:43 +0100
> > > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > > 
> > > > > The noteworthy change is to patch 2 which now uses the generic
> > > > > zone_page_state_snapshot() in zone_nr_free_pages(). Similar logic still
> > > > > applies for *when* zone_page_state_snapshot() is used, to avoid overhead.
> > > > > 
> > > > > Changelog since V3
> > > > >   o Use generic helper for NR_FREE_PAGES estimate when necessary
> > > > > 
> > > > > Changelog since V2
> > > > >   o Minor clarifications
> > > > >   o Rebase to 2.6.36-rc3
> > > > > 
> > > > > Changelog since V1
> > > > >   o Fix for !CONFIG_SMP
> > > > >   o Correct spelling mistakes
> > > > >   o Clarify a ChangeLog
> > > > >   o Only check for counter drift on machines large enough for the counter
> > > > >     drift to breach the min watermark when NR_FREE_PAGES reports the low
> > > > >     watermark is fine
> > > > > 
> > > > > Internal IBM test teams beta testing distribution kernels have reported
> > > > > problems on machines with a large number of CPUs whereby page allocator
> > > > > failure messages show huge differences between the nr_free_pages vmstat
> > > > > counter and what is available on the buddy lists. In an extreme example,
> > > > > nr_free_pages was above the min watermark but zero pages were on the buddy
> > > > > lists allowing the system to potentially livelock unable to make forward
> > > > > progress unless an allocation succeeds. There is no reason why the problems
> > > > > would not affect mainline so the following series mitigates the problems
> > > > > in the page allocator related to per-cpu counter drift and lists.
> > > > > 
> > > > > The first patch ensures that counters are updated after pages are added to
> > > > > free lists.
> > > > > 
> > > > > The second patch notes that the counter drift between nr_free_pages and what
> > > > > is on the per-cpu lists can be very high. When memory is low and kswapd
> > > > > is awake, the per-cpu counters are checked as well as reading the value
> > > > > of NR_FREE_PAGES. This will slow the page allocator when memory is low and
> > > > > kswapd is awake but it will be much harder to breach the min watermark and
> > > > > potentially livelock the system.
> > > > > 
> > > > > The third patch notes that after direct-reclaim an allocation can
> > > > > fail because the necessary pages are on the per-cpu lists. After a
> > > > > direct-reclaim-and-allocation-failure, the per-cpu lists are drained and
> > > > > a second attempt is made.
> > > > > 
> > > > > Performance tests against 2.6.36-rc3 did not show up anything interesting. A
> > > > > version of this series that continually called vmstat_update() when
> > > > > memory was low was tested internally and found to help the counter drift
> > > > > problem. I described this during LSF/MM Summit and the potential for IPI
> > > > > storms was frowned upon. An alternative fix is in patch two which uses
> > > > > for_each_online_cpu() to read the vmstat deltas while memory is low and
> > > > > kswapd is awake. This should be functionally similar.
> > > > > 
> > > > > This patch should be merged after the patch "vmstat : update
> > > > > zone stat threshold at onlining a cpu" which is in mmotm as
> > > > > vmstat-update-zone-stat-threshold-when-onlining-a-cpu.patch .
> > > > > 
> > > > > If we can agree on it, this series is a stable candidate.
> > > > 
> > > > (cc stable@kernel.org)
> > > > 
> > > > >  include/linux/mmzone.h |   13 +++++++++++++
> > > > >  include/linux/vmstat.h |   22 ++++++++++++++++++++++
> > > > >  mm/mmzone.c            |   21 +++++++++++++++++++++
> > > > >  mm/page_alloc.c        |   29 +++++++++++++++++++++--------
> > > > >  mm/vmstat.c            |   15 ++++++++++++++-
> > > > >  5 files changed, 91 insertions(+), 9 deletions(-)
> > > > 
> > > > For the entire patch series I get
> > > > 
> > > >  include/linux/mmzone.h |   13 +++++++++++++
> > > >  include/linux/vmstat.h |   22 ++++++++++++++++++++++
> > > >  mm/mmzone.c            |   21 +++++++++++++++++++++
> > > >  mm/page_alloc.c        |   33 +++++++++++++++++++++++----------
> > > >  mm/vmstat.c            |   16 +++++++++++++++-
> > > >  5 files changed, 94 insertions(+), 11 deletions(-)
> > > > 
> > > > The patches do apply OK to 2.6.35.
> > > > 
> > > > Given the extent and the coreness of it all, it's a bit more than I'd
> > > > usually push at the -stable guys.  But I guess that if the patches fix
> > > > all the issues you've noted, as well as David's "minute-long livelocks
> > > > in memory reclaim" then yup, it's worth backporting it all.
> > > > 
> > > 
> > > These patches have made it to mainline as the following commits.
> > > 
> > > 9ee493c mm: page allocator: drain per-cpu lists after direct reclaim allocation fails
> > > aa45484 mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
> > > 72853e2 mm: page allocator: update free page counters after pages are placed on the free list
> > > 
> > > I have not heard from the -stable guys, is there a reasonable
> > > expectation that they'll be picked up?
> > 
> > If you ask me, then I'll know to give a response :)
> > 
> 
> Hi Greg,
> 
> I would ask you directly but I didn't want anyone else on stable@ to
> feel left out :)
> 
> > None of these were tagged as going to the stable tree, should I include
> > them? 
> 
> Yes please unless there is a late objection. The patches were first developed
> as a result of a distro bug whose kernel was based on 2.6.32.  There was
> every indication this affected mainline as well. The details of the testing
> are above.
> 
> Dave Chinner had also reported problems with livelocks in reclaim that
> looked like IPI storms. There were two major factors at play and these
> patches addressed one of them. It works out as both a bug and a
> performance fix.
> 
> > If so, for which -stable tree?  .27, .32, and .35 are all
> > currently active.
> > 
> 
> 2.6.35 for certain.
> 
> I would have a strong preference for 2.6.32 as well, as it's a baseline for
> a number of distros. The second commit will conflict with per-cpu changes
> but the resolution is straight-forward.

Thanks for the backport, I've queued these up for .32 and .35 now.

greg k-h

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [stable] [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4
  2010-09-23 18:49           ` Greg KH
@ 2010-09-24  9:14             ` Mel Gorman
  -1 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-09-24  9:14 UTC (permalink / raw)
  To: Greg KH
  Cc: Rik van Riel, Christoph Lameter, Dave Chinner, Linux Kernel List,
	linux-mm, Minchan Kim, KOSAKI Motohiro, Johannes Weiner,
	Andrew Morton, stable, KAMEZAWA Hiroyuki

> > > <SNIP>
> > > If so, for which -stable tree?  .27, .32, and .35 are all
> > > currently active.
> > > 
> > 
> > 2.6.35 for certain.
> > 
> > I would have a strong preference for 2.6.32 as well as it's a baseline for
> > a number of distros. The second commit will conflict with per-cpu changes
> > but the resolution is straight-forward.
> 
> Thanks for the backport, I've queued these up for .32 and .35 now.
> 

Thanks Greg.

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-31 17:37   ` Mel Gorman
@ 2010-08-31 18:26     ` Christoph Lameter
  -1 siblings, 0 replies; 104+ messages in thread
From: Christoph Lameter @ 2010-08-31 18:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux Kernel List, linux-mm, Rik van Riel,
	Johannes Weiner, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro


Reviewed-by: Christoph Lameter <cl@linux.com>



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-31 17:37 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V3 Mel Gorman
@ 2010-08-31 17:37   ` Mel Gorman
  0 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-08-31 17:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should, increasing the pressure on the system and compounding the problem.

This patch notes that if direct reclaim is making progress but
allocations are still failing, the system is already under heavy
pressure. In this case, it drains the per-cpu lists and tries the
allocation a second time before continuing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbaa959..750e1dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	bool drained = false;
 
 	cond_resched();
 
@@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
-	if (order != 0)
-		drain_all_pages();
+	if (unlikely(!(*did_some_progress)))
+		return NULL;
 
-	if (likely(*did_some_progress))
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
+retry:
+	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	/*
+	 * If an allocation failed after direct reclaim, it could be because
+	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 */
+	if (!page && !drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	return page;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-23  8:00   ` Mel Gorman
@ 2010-08-23 23:17     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 104+ messages in thread
From: KOSAKI Motohiro @ 2010-08-23 23:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Andrew Morton, Linux Kernel List, linux-mm,
	Rik van Riel, Johannes Weiner, Minchan Kim, Christoph Lameter,
	KAMEZAWA Hiroyuki

> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
> 
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/page_alloc.c |   20 ++++++++++++++++----
>  1 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bbaa959..750e1dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	struct page *page = NULL;
>  	struct reclaim_state reclaim_state;
>  	struct task_struct *p = current;
> +	bool drained = false;
>  
>  	cond_resched();
>  
> @@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  
>  	cond_resched();
>  
> -	if (order != 0)
> -		drain_all_pages();
> +	if (unlikely(!(*did_some_progress)))
> +		return NULL;
>  
> -	if (likely(*did_some_progress))
> -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> +retry:
> +	page = get_page_from_freelist(gfp_mask, nodemask, order,
>  					zonelist, high_zoneidx,
>  					alloc_flags, preferred_zone,
>  					migratetype);
> +
> +	/*
> +	 * If an allocation failed after direct reclaim, it could be because
> +	 * pages are pinned on the per-cpu lists. Drain them and try again
> +	 */
> +	if (!page && !drained) {
> +		drain_all_pages();
> +		drained = true;
> +		goto retry;
> +	}
> +
>  	return page;

I haven't read all of this patch series (IOW, this mail just happens to be on
top of my mailbox now), but at least I think this one is correct and good.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>








^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-23  8:00 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2 Mel Gorman
@ 2010-08-23  8:00   ` Mel Gorman
  0 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-08-23  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel List, linux-mm, Rik van Riel, Johannes Weiner,
	Minchan Kim, Christoph Lameter, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should, increasing the pressure on the system and compounding the problem.

This patch notes that if direct reclaim is making progress but
allocations are still failing, the system is already under heavy
pressure. In this case, it drains the per-cpu lists and tries the
allocation a second time before continuing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page_alloc.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbaa959..750e1dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	bool drained = false;
 
 	cond_resched();
 
@@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
-	if (order != 0)
-		drain_all_pages();
+	if (unlikely(!(*did_some_progress)))
+		return NULL;
 
-	if (likely(*did_some_progress))
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
+retry:
+	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	/*
+	 * If an allocation failed after direct reclaim, it could be because
+	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 */
+	if (!page && !drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	return page;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-19 14:47   ` Minchan Kim
@ 2010-08-19 15:10     ` Mel Gorman
  0 siblings, 0 replies; 104+ messages in thread
From: Mel Gorman @ 2010-08-19 15:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Aug 19, 2010 at 11:47:03PM +0900, Minchan Kim wrote:
> On Mon, Aug 16, 2010 at 10:42:13AM +0100, Mel Gorman wrote:
> > When under significant memory pressure, a process enters direct reclaim
> > and immediately afterwards tries to allocate a page. If it fails and no
> > further progress is made, it's possible the system will go OOM. However,
> > on systems with large amounts of memory, it's possible that a significant
> > number of pages are on per-cpu lists and inaccessible to the calling
> > process. This leads to a process entering direct reclaim more often than
> > it should increasing the pressure on the system and compounding the problem.
> > 
> > This patch notes that if direct reclaim is making progress but
> > allocations are still failing that the system is already under heavy
> > pressure. In this case, it drains the per-cpu lists and tries the
> > allocation a second time before continuing.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |   19 +++++++++++++++++--
> >  1 files changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 67a2ed0..a8651a4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1844,6 +1844,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  	struct page *page = NULL;
> >  	struct reclaim_state reclaim_state;
> >  	struct task_struct *p = current;
> > +	bool drained = false;
> >  
> >  	cond_resched();
> >  
> > @@ -1865,11 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >  	if (order != 0)
> >  		drain_all_pages();
> >  
> 
> Nitpick: 
> 
> How about removing the above condition and its drain_all_pages() call?
> If get_page_from_freelist() fails, we call drain_all_pages() at the end anyway,
> so this would remove the double call to drain_all_pages() in the order > 0 case.
> In addition, if the VM can't reclaim anything, we don't need to drain at all
> for order > 0.
> 

That sounds reasonable. V2 of this series will delete the lines

if (order != 0)
	drain_all_pages()

> 
> > -	if (likely(*did_some_progress))
> > -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> > +	if (unlikely(!(*did_some_progress)))
> > +		return NULL;
> > +
> > +retry:
> > +	page = get_page_from_freelist(gfp_mask, nodemask, order,
> >  					zonelist, high_zoneidx,
> >  					alloc_flags, preferred_zone,
> >  					migratetype);
> > +
> > +	/*
> > +	 * If an allocation failed after direct reclaim, it could be because
> > +	 * pages are pinned on the per-cpu lists. Drain them and try again
> > +	 */
> > +	if (!page && !drained) {
> > +		drain_all_pages();
> > +		drained = true;
> > +		goto retry;
> > +	}
> > +
> >  	return page;
> >  }
> >  
> > -- 
> > 1.7.1
> > 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
                     ` (2 preceding siblings ...)
  2010-08-18  3:02   ` KAMEZAWA Hiroyuki
@ 2010-08-19 14:47   ` Minchan Kim
  2010-08-19 15:10     ` Mel Gorman
  3 siblings, 1 reply; 104+ messages in thread
From: Minchan Kim @ 2010-08-19 14:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 16, 2010 at 10:42:13AM +0100, Mel Gorman wrote:
> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
> 
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |   19 +++++++++++++++++--
>  1 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 67a2ed0..a8651a4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1844,6 +1844,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	struct page *page = NULL;
>  	struct reclaim_state reclaim_state;
>  	struct task_struct *p = current;
> +	bool drained = false;
>  
>  	cond_resched();
>  
> @@ -1865,11 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	if (order != 0)
>  		drain_all_pages();
>  

Nitpick: 

How about removing the above condition and its drain_all_pages() call?
If get_page_from_freelist() fails, we call drain_all_pages() at the end anyway,
so this would remove the double call to drain_all_pages() in the order > 0 case.
In addition, if the VM can't reclaim anything, we don't need to drain at all
for order > 0.


> -	if (likely(*did_some_progress))
> -		page = get_page_from_freelist(gfp_mask, nodemask, order,
> +	if (unlikely(!(*did_some_progress)))
> +		return NULL;
> +
> +retry:
> +	page = get_page_from_freelist(gfp_mask, nodemask, order,
>  					zonelist, high_zoneidx,
>  					alloc_flags, preferred_zone,
>  					migratetype);
> +
> +	/*
> +	 * If an allocation failed after direct reclaim, it could be because
> +	 * pages are pinned on the per-cpu lists. Drain them and try again
> +	 */
> +	if (!page && !drained) {
> +		drain_all_pages();
> +		drained = true;
> +		goto retry;
> +	}
> +
>  	return page;
>  }
>  
> -- 
> 1.7.1
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
  2010-08-16 14:50   ` Rik van Riel
  2010-08-17  2:57   ` Minchan Kim
@ 2010-08-18  3:02   ` KAMEZAWA Hiroyuki
  2010-08-19 14:47   ` Minchan Kim
  3 siblings, 0 replies; 104+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-18  3:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner, KOSAKI Motohiro

On Mon, 16 Aug 2010 10:42:13 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
> 
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
  2010-08-16 14:50   ` Rik van Riel
@ 2010-08-17  2:57   ` Minchan Kim
  2010-08-18  3:02   ` KAMEZAWA Hiroyuki
  2010-08-19 14:47   ` Minchan Kim
  3 siblings, 0 replies; 104+ messages in thread
From: Minchan Kim @ 2010-08-17  2:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, Aug 16, 2010 at 6:42 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
>
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

The IPI overhead is acceptable compared to going OOM or failing the allocation.
In addition, this isn't a hot path or a frequent case.
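
For a feel for what that one-off cost amounts to, a rough, self-contained
userspace model is below: draining is a single round of work broadcast to
every CPU and nothing ongoing. Threads stand in for CPUs and none of the
names are kernel API; the kernel side does the broadcast with IPIs, which is
the overhead being traded against an OOM kill.

/*
 * Rough userspace model of the cost being discussed: draining the per-cpu
 * lists is one broadcast of work to every CPU, done at most once per failed
 * post-reclaim allocation. Threads stand in for CPUs; none of this is
 * kernel API.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_CPUS 8

static int pcp_count[NR_CPUS];		/* pages cached on each "cpu" */

static void *drain_local(void *arg)
{
	int cpu = *(int *)arg;

	/* give this cpu's cached pages back to the shared pool */
	printf("cpu %d drained %d pages\n", cpu, pcp_count[cpu]);
	pcp_count[cpu] = 0;
	return NULL;
}

/* one broadcast: ask every cpu to drain, then wait for all of them */
static void drain_all(void)
{
	pthread_t tid[NR_CPUS];
	int ids[NR_CPUS];

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		ids[cpu] = cpu;
		pthread_create(&tid[cpu], NULL, drain_local, &ids[cpu]);
	}
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		pthread_join(tid[cpu], NULL);
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		pcp_count[cpu] = 3;	/* pretend each cpu holds a few pages */

	drain_all();			/* O(NR_CPUS) work, done once */
	return 0;
}

Built with -pthread, this performs exactly NR_CPUS units of drain work and
exits; in the patch that happens at most once per failed post-reclaim
allocation because of the 'drained' flag.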
-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
@ 2010-08-16 14:50   ` Rik van Riel
  2010-08-17  2:57   ` Minchan Kim
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 104+ messages in thread
From: Rik van Riel @ 2010-08-16 14:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On 08/16/2010 05:42 AM, Mel Gorman wrote:
> When under significant memory pressure, a process enters direct reclaim
> and immediately afterwards tries to allocate a page. If it fails and no
> further progress is made, it's possible the system will go OOM. However,
> on systems with large amounts of memory, it's possible that a significant
> number of pages are on per-cpu lists and inaccessible to the calling
> process. This leads to a process entering direct reclaim more often than
> it should increasing the pressure on the system and compounding the problem.
>
> This patch notes that if direct reclaim is making progress but
> allocations are still failing that the system is already under heavy
> pressure. In this case, it drains the per-cpu lists and tries the
> allocation a second time before continuing.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails
  2010-08-16  9:42 [RFC PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator Mel Gorman
@ 2010-08-16  9:42 ` Mel Gorman
  2010-08-16 14:50   ` Rik van Riel
                     ` (3 more replies)
  0 siblings, 4 replies; 104+ messages in thread
From: Mel Gorman @ 2010-08-16  9:42 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Mel Gorman

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should, increasing the pressure on the system and compounding the problem.

This patch notes that if direct reclaim is making progress but
allocations are still failing, the system is already under heavy
pressure. In this case, it drains the per-cpu lists and tries the
allocation a second time before continuing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   19 +++++++++++++++++--
 1 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 67a2ed0..a8651a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1844,6 +1844,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	bool drained = false;
 
 	cond_resched();
 
@@ -1865,11 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (order != 0)
 		drain_all_pages();
 
-	if (likely(*did_some_progress))
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
+	if (unlikely(!(*did_some_progress)))
+		return NULL;
+
+retry:
+	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	/*
+	 * If an allocation failed after direct reclaim, it could be because
+	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 */
+	if (!page && !drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	return page;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 104+ messages in thread

end of thread, other threads:[~2010-09-24  9:14 UTC | newest]

Thread overview: 104+ messages
2010-09-03  9:08 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4 Mel Gorman
2010-09-03  9:08 ` Mel Gorman
2010-09-03  9:08 ` [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list Mel Gorman
2010-09-03  9:08   ` Mel Gorman
2010-09-03 22:38   ` Andrew Morton
2010-09-03 22:38     ` Andrew Morton
2010-09-05 18:06     ` Mel Gorman
2010-09-05 18:06       ` Mel Gorman
2010-09-03  9:08 ` [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake Mel Gorman
2010-09-03  9:08   ` Mel Gorman
2010-09-03 22:55   ` Andrew Morton
2010-09-03 22:55     ` Andrew Morton
2010-09-03 23:17     ` Christoph Lameter
2010-09-03 23:17       ` Christoph Lameter
2010-09-03 23:28       ` Andrew Morton
2010-09-03 23:28         ` Andrew Morton
2010-09-04  0:54         ` Christoph Lameter
2010-09-04  0:54           ` Christoph Lameter
2010-09-05 18:12     ` Mel Gorman
2010-09-05 18:12       ` Mel Gorman
2010-09-03  9:08 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
2010-09-03  9:08   ` Mel Gorman
2010-09-03 23:00   ` Andrew Morton
2010-09-03 23:00     ` Andrew Morton
2010-09-04  2:25     ` Dave Chinner
2010-09-04  2:25       ` Dave Chinner
2010-09-04  3:21       ` Andrew Morton
2010-09-04  3:21         ` Andrew Morton
2010-09-04  7:58         ` Dave Chinner
2010-09-04  7:58           ` Dave Chinner
2010-09-04  8:14           ` Dave Chinner
2010-09-04  8:14             ` Dave Chinner
     [not found]             ` <20100905015400.GA10714@localhost>
     [not found]               ` <20100905021555.GG705@dastard>
     [not found]                 ` <20100905060539.GA17450@localhost>
     [not found]                   ` <20100905131447.GJ705@dastard>
2010-09-05 13:45                     ` Wu Fengguang
2010-09-05 13:45                       ` Wu Fengguang
2010-09-05 23:33                       ` Dave Chinner
2010-09-05 23:33                         ` Dave Chinner
2010-09-06  4:02                       ` Dave Chinner
2010-09-06  4:02                         ` Dave Chinner
2010-09-06  8:40                         ` Mel Gorman
2010-09-06  8:40                           ` Mel Gorman
2010-09-06 21:50                           ` Dave Chinner
2010-09-06 21:50                             ` Dave Chinner
2010-09-08  8:49                             ` Dave Chinner
2010-09-08  8:49                               ` Dave Chinner
2010-09-09 12:39                               ` Mel Gorman
2010-09-09 12:39                                 ` Mel Gorman
2010-09-10  6:17                                 ` Dave Chinner
2010-09-10  6:17                                   ` Dave Chinner
2010-09-07 14:23                         ` Christoph Lameter
2010-09-07 14:23                           ` Christoph Lameter
2010-09-08  2:13                           ` Wu Fengguang
2010-09-08  2:13                             ` Wu Fengguang
2010-09-04  3:23       ` Wu Fengguang
2010-09-04  3:23         ` Wu Fengguang
2010-09-04  3:59         ` Andrew Morton
2010-09-04  3:59           ` Andrew Morton
2010-09-04  4:37           ` Wu Fengguang
2010-09-04  4:37             ` Wu Fengguang
2010-09-05 18:22       ` Mel Gorman
2010-09-05 18:22         ` Mel Gorman
2010-09-05 18:14     ` Mel Gorman
2010-09-05 18:14       ` Mel Gorman
2010-09-08  7:43   ` KOSAKI Motohiro
2010-09-08  7:43     ` KOSAKI Motohiro
2010-09-08 20:05     ` Christoph Lameter
2010-09-08 20:05       ` Christoph Lameter
2010-09-09 12:41     ` Mel Gorman
2010-09-09 12:41       ` Mel Gorman
2010-09-09 13:45       ` Christoph Lameter
2010-09-09 13:45         ` Christoph Lameter
2010-09-09 13:55         ` Mel Gorman
2010-09-09 13:55           ` Mel Gorman
2010-09-09 14:32           ` Christoph Lameter
2010-09-09 14:32             ` Christoph Lameter
2010-09-09 15:05             ` Mel Gorman
2010-09-09 15:05               ` Mel Gorman
2010-09-10  2:56               ` KOSAKI Motohiro
2010-09-10  2:56                 ` KOSAKI Motohiro
2010-09-03 23:05 ` [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V4 Andrew Morton
2010-09-03 23:05   ` Andrew Morton
2010-09-21 11:17   ` Mel Gorman
2010-09-21 11:17     ` Mel Gorman
2010-09-21 12:58     ` [stable] " Greg KH
2010-09-21 12:58       ` Greg KH
2010-09-21 14:23       ` Mel Gorman
2010-09-21 14:23         ` Mel Gorman
2010-09-23 18:49         ` Greg KH
2010-09-23 18:49           ` Greg KH
2010-09-24  9:14           ` Mel Gorman
2010-09-24  9:14             ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2010-08-31 17:37 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V3 Mel Gorman
2010-08-31 17:37 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
2010-08-31 17:37   ` Mel Gorman
2010-08-31 18:26   ` Christoph Lameter
2010-08-31 18:26     ` Christoph Lameter
2010-08-23  8:00 [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V2 Mel Gorman
2010-08-23  8:00 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
2010-08-23  8:00   ` Mel Gorman
2010-08-23 23:17   ` KOSAKI Motohiro
2010-08-23 23:17     ` KOSAKI Motohiro
2010-08-16  9:42 [RFC PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator Mel Gorman
2010-08-16  9:42 ` [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails Mel Gorman
2010-08-16 14:50   ` Rik van Riel
2010-08-17  2:57   ` Minchan Kim
2010-08-18  3:02   ` KAMEZAWA Hiroyuki
2010-08-19 14:47   ` Minchan Kim
2010-08-19 15:10     ` Mel Gorman
