All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-08 11:48 ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-08 11:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Nick Piggin, Christian Ehrhardt, Chris Mason, Jens Axboe,
	linux-kernel, Mel Gorman

(CC'ing some people who were involved in the last discussion on the use of
congestion_wait() in the VM)

Under memory pressure, the page allocator and kswapd can go to sleep using
congestion_wait(). In two of these cases, it may not be the appropriate
action as congestion may not be the problem. This patchset replaces two
sets of instances of congestion_wait() usage with a waitqueue sleep with
the view of having the VM behaviour depend on the relevant state of the
zone instead of on congestion which may or may not be a factor. A third
patch updates the frequency zone pressure is checked.

The first patch addresses the page allocator which calls congestion_wait()
to back off. The patch adds a zone->pressure_wq to sleep on instead of the
congestion queues.. If a direct reclaimer or kswapd brings the zone over
the min watermark, processes on the waitqueue are woken up.

The second patch checks zone pressure when a batch of pages from the PCP lists
are freed. The assumption is that there is a reasonable change if processes
are sleeping that a batch of frees can push a zone above its watermark.

The third patch address kswapd going to sleep when it is raising priority. As
vmscan makes more appropriate checks on congestion elsewhere, this patch
puts kswapd back on its own waitqueue to wait for either a timeout or
another process to call wakeup_kswapd.

The tricky problem is determining if this patch is really doing the right
thing. I took three machines X86, X86-64 and PPC64 booted with 1GB of RAM and
ran a series of tests including sysbench, iozone and a desktop latency test.
The performance results that did not involve memory pressure were fine -
no major impact due to the zone_pressure check in the free path. However,
the sysbench and iozone results varied wildly and depended somewhat on the
starting state of the machine. The objective was to see if the tests completed
faster because less time was needlessly spent waiting on congestion but the
fact is the benchmarks were really IO-bound meant there was little difference
with the patch applied. For the record, there were both massive gains and
losses with the patch applied but it was not consistently reproducible.

I'm somewhat at an impasse to identify a reasonable scenario the patch
can make a real difference to. It might depend on a setup like Christian's
with many disks which I cannot reproduce unfortunately. Otherwise, it's a
case of eyeballing the patch and stating whether it makes sense or not.

Nick, I haven't implemented the
queueing-if-a-process-is-already-waiting-for-fairness yet largely because
a proper way has to be devised to measure how "good" or "bad" this patch is.

Any comments on whether this patch is really doing the right thing or
suggestions on how it should be properly tested? Christian, minimally it
would be nice if you could retest your iozone tests to confirm the symptoms
of your problem are still being dealt with.

 include/linux/mmzone.h |    3 ++
 mm/internal.h          |    4 +++
 mm/mmzone.c            |   47 ++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |   53 +++++++++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c            |   13 ++++++++---
 5 files changed, 111 insertions(+), 9 deletions(-)


^ permalink raw reply	[flat|nested] 136+ messages in thread

* [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-08 11:48 ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-08 11:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Nick Piggin, Christian Ehrhardt, Chris Mason, Jens Axboe,
	linux-kernel, Mel Gorman

(CC'ing some people who were involved in the last discussion on the use of
congestion_wait() in the VM)

Under memory pressure, the page allocator and kswapd can go to sleep using
congestion_wait(). In two of these cases, it may not be the appropriate
action as congestion may not be the problem. This patchset replaces two
sets of instances of congestion_wait() usage with a waitqueue sleep with
the view of having the VM behaviour depend on the relevant state of the
zone instead of on congestion which may or may not be a factor. A third
patch updates the frequency zone pressure is checked.

The first patch addresses the page allocator which calls congestion_wait()
to back off. The patch adds a zone->pressure_wq to sleep on instead of the
congestion queues.. If a direct reclaimer or kswapd brings the zone over
the min watermark, processes on the waitqueue are woken up.

The second patch checks zone pressure when a batch of pages from the PCP lists
are freed. The assumption is that there is a reasonable change if processes
are sleeping that a batch of frees can push a zone above its watermark.

The third patch address kswapd going to sleep when it is raising priority. As
vmscan makes more appropriate checks on congestion elsewhere, this patch
puts kswapd back on its own waitqueue to wait for either a timeout or
another process to call wakeup_kswapd.

The tricky problem is determining if this patch is really doing the right
thing. I took three machines X86, X86-64 and PPC64 booted with 1GB of RAM and
ran a series of tests including sysbench, iozone and a desktop latency test.
The performance results that did not involve memory pressure were fine -
no major impact due to the zone_pressure check in the free path. However,
the sysbench and iozone results varied wildly and depended somewhat on the
starting state of the machine. The objective was to see if the tests completed
faster because less time was needlessly spent waiting on congestion but the
fact is the benchmarks were really IO-bound meant there was little difference
with the patch applied. For the record, there were both massive gains and
losses with the patch applied but it was not consistently reproducible.

I'm somewhat at an impasse to identify a reasonable scenario the patch
can make a real difference to. It might depend on a setup like Christian's
with many disks which I cannot reproduce unfortunately. Otherwise, it's a
case of eyeballing the patch and stating whether it makes sense or not.

Nick, I haven't implemented the
queueing-if-a-process-is-already-waiting-for-fairness yet largely because
a proper way has to be devised to measure how "good" or "bad" this patch is.

Any comments on whether this patch is really doing the right thing or
suggestions on how it should be properly tested? Christian, minimally it
would be nice if you could retest your iozone tests to confirm the symptoms
of your problem are still being dealt with.

 include/linux/mmzone.h |    3 ++
 mm/internal.h          |    4 +++
 mm/mmzone.c            |   47 ++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |   53 +++++++++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c            |   13 ++++++++---
 5 files changed, 111 insertions(+), 9 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-08 11:48 ` Mel Gorman
@ 2010-03-08 11:48   ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-08 11:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Nick Piggin, Christian Ehrhardt, Chris Mason, Jens Axboe,
	linux-kernel, Mel Gorman

Under heavy memory pressure, the page allocator may call congestion_wait()
to wait for IO congestion to clear or a timeout. This is not as sensible
a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
even congested as the pressure could have been due to a large number of
SYNC reads and the allocator waits for the entire timeout, possibly uselessly.

At the point of congestion_wait(), the allocator is struggling to get the
pages it needs and it should back off. This patch puts the allocator to sleep
on a zone->pressure_wq for either a timeout or until a direct reclaimer or
kswapd brings the zone over the low watermark, whichever happens first.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    3 ++
 mm/internal.h          |    4 +++
 mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c            |    2 +
 5 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..72465c1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -398,6 +398,9 @@ struct zone {
 	unsigned long		wait_table_hash_nr_entries;
 	unsigned long		wait_table_bits;
 
+	/* queue for processes waiting for pressure to relieve */
+	wait_queue_head_t	*pressure_wq;
+
 	/*
 	 * Discontig memory support fields.
 	 */
diff --git a/mm/internal.h b/mm/internal.h
index 6a697bb..caa5bc8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -251,6 +251,10 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 #define ZONE_RECLAIM_SUCCESS	1
 #endif
 
+extern void check_zone_pressure(struct zone *zone);
+extern long zonepressure_wait(struct zone *zone, unsigned int order,
+			long timeout);
+
 extern int hwpoison_filter(struct page *p);
 
 extern u32 hwpoison_filter_dev_major;
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..e80b89f 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/module.h>
+#include <linux/sched.h>
 
 struct pglist_data *first_online_pgdat(void)
 {
@@ -87,3 +88,49 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+void check_zone_pressure(struct zone *zone)
+{
+	/* If no process is waiting, nothing to do */
+	if (!waitqueue_active(zone->pressure_wq))
+		return;
+
+	/* Check if the high watermark is ok for order 0 */
+	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
+		wake_up_interruptible(zone->pressure_wq);
+}
+
+/**
+ * zonepressure_wait - Wait for pressure on a zone to ease off
+ * @zone: The zone that is expected to be under pressure
+ * @order: The order the caller is waiting on pages for
+ * @timeout: Wait until pressure is relieved or this timeout is reached
+ *
+ * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
+ * It's considered to be relieved if any direct reclaimer or kswapd brings
+ * the zone above the high watermark
+ */
+long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+
+wait_again:
+	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
+
+	/*
+	 * The use of io_schedule_timeout() here means that it gets
+	 * accounted for as IO waiting. This may or may not be the case
+	 * but at least this way it gets picked up by vmstat
+	 */
+	ret = io_schedule_timeout(timeout);
+	finish_wait(zone->pressure_wq, &wait);
+
+	/* If woken early, check watermarks before continuing */
+	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
+		timeout = ret;
+		goto wait_again;
+	}
+
+	return ret;
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..1383ff9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1734,8 +1734,10 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
 			preferred_zone, migratetype);
 
-		if (!page && gfp_mask & __GFP_NOFAIL)
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+		if (!page && gfp_mask & __GFP_NOFAIL) {
+			/* If still failing, wait for pressure on zone to relieve */
+			zonepressure_wait(preferred_zone, order, HZ/50);
+		}
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -1905,8 +1907,8 @@ rebalance:
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
-		/* Wait for some write requests to complete then retry */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		/* Too much pressure, back off a bit at let reclaimers do work */
+		zonepressure_wait(preferred_zone, order, HZ/50);
 		goto rebalance;
 	}
 
@@ -3254,6 +3256,38 @@ int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 	return 0;
 }
 
+static noinline __init_refok
+void zone_pressure_wq_cleanup(struct zone *zone)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	size_t free_size = sizeof(wait_queue_head_t);
+
+	if (!slab_is_available())
+		free_bootmem_node(pgdat, __pa(zone->pressure_wq), free_size);
+	else
+		kfree(zone->pressure_wq);
+}
+
+static noinline __init_refok
+int zone_pressure_wq_init(struct zone *zone)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	size_t alloc_size = sizeof(wait_queue_head_t);
+
+	if (!slab_is_available())
+		zone->pressure_wq = (wait_queue_head_t *)
+			alloc_bootmem_node(pgdat, alloc_size);
+	else
+		zone->pressure_wq = kmalloc(alloc_size, GFP_KERNEL);
+
+	if (!zone->pressure_wq)
+		return -ENOMEM;
+
+	init_waitqueue_head(zone->pressure_wq);
+
+	return 0;
+}
+
 static int __zone_pcp_update(void *data)
 {
 	struct zone *zone = data;
@@ -3306,9 +3340,15 @@ __meminit int init_currently_empty_zone(struct zone *zone,
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int ret;
-	ret = zone_wait_table_init(zone, size);
+
+	ret = zone_pressure_wq_init(zone);
 	if (ret)
 		return ret;
+	ret = zone_wait_table_init(zone, size);
+	if (ret) {
+		zone_pressure_wq_cleanup(zone);
+		return ret;
+	}
 	pgdat->nr_zones = zone_idx(zone) + 1;
 
 	zone->zone_start_pfn = zone_start_pfn;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..4f92a48 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1709,6 +1709,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		}
 
 		shrink_zone(priority, zone, sc);
+		check_zone_pressure(zone);
 	}
 }
 
@@ -2082,6 +2083,7 @@ loop_again:
 			if (!zone_watermark_ok(zone, order,
 					8*high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
+			check_zone_pressure(zone);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
 						lru_pages);
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-08 11:48   ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-08 11:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Nick Piggin, Christian Ehrhardt, Chris Mason, Jens Axboe,
	linux-kernel, Mel Gorman

Under heavy memory pressure, the page allocator may call congestion_wait()
to wait for IO congestion to clear or a timeout. This is not as sensible
a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
even congested as the pressure could have been due to a large number of
SYNC reads and the allocator waits for the entire timeout, possibly uselessly.

At the point of congestion_wait(), the allocator is struggling to get the
pages it needs and it should back off. This patch puts the allocator to sleep
on a zone->pressure_wq for either a timeout or until a direct reclaimer or
kswapd brings the zone over the low watermark, whichever happens first.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    3 ++
 mm/internal.h          |    4 +++
 mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c            |    2 +
 5 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..72465c1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -398,6 +398,9 @@ struct zone {
 	unsigned long		wait_table_hash_nr_entries;
 	unsigned long		wait_table_bits;
 
+	/* queue for processes waiting for pressure to relieve */
+	wait_queue_head_t	*pressure_wq;
+
 	/*
 	 * Discontig memory support fields.
 	 */
diff --git a/mm/internal.h b/mm/internal.h
index 6a697bb..caa5bc8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -251,6 +251,10 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 #define ZONE_RECLAIM_SUCCESS	1
 #endif
 
+extern void check_zone_pressure(struct zone *zone);
+extern long zonepressure_wait(struct zone *zone, unsigned int order,
+			long timeout);
+
 extern int hwpoison_filter(struct page *p);
 
 extern u32 hwpoison_filter_dev_major;
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..e80b89f 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/mmzone.h>
 #include <linux/module.h>
+#include <linux/sched.h>
 
 struct pglist_data *first_online_pgdat(void)
 {
@@ -87,3 +88,49 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+void check_zone_pressure(struct zone *zone)
+{
+	/* If no process is waiting, nothing to do */
+	if (!waitqueue_active(zone->pressure_wq))
+		return;
+
+	/* Check if the high watermark is ok for order 0 */
+	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
+		wake_up_interruptible(zone->pressure_wq);
+}
+
+/**
+ * zonepressure_wait - Wait for pressure on a zone to ease off
+ * @zone: The zone that is expected to be under pressure
+ * @order: The order the caller is waiting on pages for
+ * @timeout: Wait until pressure is relieved or this timeout is reached
+ *
+ * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
+ * It's considered to be relieved if any direct reclaimer or kswapd brings
+ * the zone above the high watermark
+ */
+long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+
+wait_again:
+	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
+
+	/*
+	 * The use of io_schedule_timeout() here means that it gets
+	 * accounted for as IO waiting. This may or may not be the case
+	 * but at least this way it gets picked up by vmstat
+	 */
+	ret = io_schedule_timeout(timeout);
+	finish_wait(zone->pressure_wq, &wait);
+
+	/* If woken early, check watermarks before continuing */
+	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
+		timeout = ret;
+		goto wait_again;
+	}
+
+	return ret;
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..1383ff9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1734,8 +1734,10 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
 			preferred_zone, migratetype);
 
-		if (!page && gfp_mask & __GFP_NOFAIL)
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+		if (!page && gfp_mask & __GFP_NOFAIL) {
+			/* If still failing, wait for pressure on zone to relieve */
+			zonepressure_wait(preferred_zone, order, HZ/50);
+		}
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -1905,8 +1907,8 @@ rebalance:
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
-		/* Wait for some write requests to complete then retry */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		/* Too much pressure, back off a bit at let reclaimers do work */
+		zonepressure_wait(preferred_zone, order, HZ/50);
 		goto rebalance;
 	}
 
@@ -3254,6 +3256,38 @@ int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 	return 0;
 }
 
+static noinline __init_refok
+void zone_pressure_wq_cleanup(struct zone *zone)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	size_t free_size = sizeof(wait_queue_head_t);
+
+	if (!slab_is_available())
+		free_bootmem_node(pgdat, __pa(zone->pressure_wq), free_size);
+	else
+		kfree(zone->pressure_wq);
+}
+
+static noinline __init_refok
+int zone_pressure_wq_init(struct zone *zone)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	size_t alloc_size = sizeof(wait_queue_head_t);
+
+	if (!slab_is_available())
+		zone->pressure_wq = (wait_queue_head_t *)
+			alloc_bootmem_node(pgdat, alloc_size);
+	else
+		zone->pressure_wq = kmalloc(alloc_size, GFP_KERNEL);
+
+	if (!zone->pressure_wq)
+		return -ENOMEM;
+
+	init_waitqueue_head(zone->pressure_wq);
+
+	return 0;
+}
+
 static int __zone_pcp_update(void *data)
 {
 	struct zone *zone = data;
@@ -3306,9 +3340,15 @@ __meminit int init_currently_empty_zone(struct zone *zone,
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int ret;
-	ret = zone_wait_table_init(zone, size);
+
+	ret = zone_pressure_wq_init(zone);
 	if (ret)
 		return ret;
+	ret = zone_wait_table_init(zone, size);
+	if (ret) {
+		zone_pressure_wq_cleanup(zone);
+		return ret;
+	}
 	pgdat->nr_zones = zone_idx(zone) + 1;
 
 	zone->zone_start_pfn = zone_start_pfn;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..4f92a48 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1709,6 +1709,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		}
 
 		shrink_zone(priority, zone, sc);
+		check_zone_pressure(zone);
 	}
 }
 
@@ -2082,6 +2083,7 @@ loop_again:
 			if (!zone_watermark_ok(zone, order,
 					8*high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
+			check_zone_pressure(zone);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
 						lru_pages);
-- 
1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
  2010-03-08 11:48 ` Mel Gorman
@ 2010-03-08 11:48   ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-08 11:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Nick Piggin, Christian Ehrhardt, Chris Mason, Jens Axboe,
	linux-kernel, Mel Gorman

When a batch of pages have been freed to the buddy allocator, it is possible
that it is enough to push a zone above its watermarks. This patch puts a
check in the free path for zone pressure. It's in a common path but for
the most part, it should only be checking if a linked list is empty and
have minimal performance impact.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1383ff9..3c8e8b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -562,6 +562,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (--count && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);
+
+	/* A batch of pages have been freed so check zone pressure */
+	check_zone_pressure(zone);
 }
 
 static void free_one_page(struct zone *zone, struct page *page, int order,
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
@ 2010-03-08 11:48   ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-08 11:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Nick Piggin, Christian Ehrhardt, Chris Mason, Jens Axboe,
	linux-kernel, Mel Gorman

When a batch of pages have been freed to the buddy allocator, it is possible
that it is enough to push a zone above its watermarks. This patch puts a
check in the free path for zone pressure. It's in a common path but for
the most part, it should only be checking if a linked list is empty and
have minimal performance impact.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1383ff9..3c8e8b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -562,6 +562,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (--count && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);
+
+	/* A batch of pages have been freed so check zone pressure */
+	check_zone_pressure(zone);
 }
 
 static void free_one_page(struct zone *zone, struct page *page, int order,
-- 
1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion
  2010-03-08 11:48 ` Mel Gorman
@ 2010-03-08 11:48   ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-08 11:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Nick Piggin, Christian Ehrhardt, Chris Mason, Jens Axboe,
	linux-kernel, Mel Gorman

If kswapd is raising its priority to get the zone over the high
watermark, it may call congestion_wait() ostensibly to allow congestion
to clear. However, there is no guarantee that the queue is congested at
this point because it depends on kswapds previous actions as well as the
rest of the system. Kswapd could simply be working hard because there is
a lot of SYNC traffic in which case it shouldn't be sleeping.

Rather than waiting on congestion and potentially sleeping for longer
than it should, this patch puts kswapd back to sleep on the kswapd_wait
queue for the timeout. If direct reclaimers are in trouble, kswapd will
be rewoken as it should instead of sleeping when there is work to be
done.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f92a48..894d366 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1955,7 +1955,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
  * interoperates with the page allocator fallback scheme to ensure that aging
  * of pages is balanced across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, wait_queue_t *wait, int order)
 {
 	int all_zones_ok;
 	int priority;
@@ -2122,8 +2122,11 @@ loop_again:
 		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
 			if (has_under_min_watermark_zone)
 				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
-			else
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			else {
+				prepare_to_wait(&pgdat->kswapd_wait, wait, TASK_INTERRUPTIBLE);
+				schedule_timeout(HZ/10);
+				finish_wait(&pgdat->kswapd_wait, wait);
+			}
 		}
 
 		/*
@@ -2272,7 +2275,7 @@ static int kswapd(void *p)
 		 * after returning from the refrigerator
 		 */
 		if (!ret)
-			balance_pgdat(pgdat, order);
+			balance_pgdat(pgdat, &wait, order);
 	}
 	return 0;
 }
-- 
1.6.5


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion
@ 2010-03-08 11:48   ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-08 11:48 UTC (permalink / raw)
  To: linux-mm
  Cc: Nick Piggin, Christian Ehrhardt, Chris Mason, Jens Axboe,
	linux-kernel, Mel Gorman

If kswapd is raising its priority to get the zone over the high
watermark, it may call congestion_wait() ostensibly to allow congestion
to clear. However, there is no guarantee that the queue is congested at
this point because it depends on kswapds previous actions as well as the
rest of the system. Kswapd could simply be working hard because there is
a lot of SYNC traffic in which case it shouldn't be sleeping.

Rather than waiting on congestion and potentially sleeping for longer
than it should, this patch puts kswapd back to sleep on the kswapd_wait
queue for the timeout. If direct reclaimers are in trouble, kswapd will
be rewoken as it should instead of sleeping when there is work to be
done.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f92a48..894d366 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1955,7 +1955,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
  * interoperates with the page allocator fallback scheme to ensure that aging
  * of pages is balanced across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, wait_queue_t *wait, int order)
 {
 	int all_zones_ok;
 	int priority;
@@ -2122,8 +2122,11 @@ loop_again:
 		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
 			if (has_under_min_watermark_zone)
 				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
-			else
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			else {
+				prepare_to_wait(&pgdat->kswapd_wait, wait, TASK_INTERRUPTIBLE);
+				schedule_timeout(HZ/10);
+				finish_wait(&pgdat->kswapd_wait, wait);
+			}
 		}
 
 		/*
@@ -2272,7 +2275,7 @@ static int kswapd(void *p)
 		 * after returning from the refrigerator
 		 */
 		if (!ret)
-			balance_pgdat(pgdat, order);
+			balance_pgdat(pgdat, &wait, order);
 	}
 	return 0;
 }
-- 
1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
  2010-03-08 11:48   ` Mel Gorman
@ 2010-03-09  9:53     ` Nick Piggin
  -1 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09  9:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

Cool, you found this doesn't hurt performance too much?

Can't you remove the check from the reclaim code now? (The check
here should give a more timely wait anyway)

This is good because it should eliminate most all cases of extra
waiting. I wonder if you've also thought of doing the check in the
allocation path too as we were discussing? (this would give a better
FIFO behaviour under memory pressure but I could easily agree it is not
worth the cost)

On Mon, Mar 08, 2010 at 11:48:22AM +0000, Mel Gorman wrote:
> When a batch of pages have been freed to the buddy allocator, it is possible
> that it is enough to push a zone above its watermarks. This patch puts a
> check in the free path for zone pressure. It's in a common path but for
> the most part, it should only be checking if a linked list is empty and
> have minimal performance impact.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1383ff9..3c8e8b7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -562,6 +562,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		} while (--count && --batch_free && !list_empty(list));
>  	}
>  	spin_unlock(&zone->lock);
> +
> +	/* A batch of pages have been freed so check zone pressure */
> +	check_zone_pressure(zone);
>  }
>  
>  static void free_one_page(struct zone *zone, struct page *page, int order,
> -- 
> 1.6.5

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
@ 2010-03-09  9:53     ` Nick Piggin
  0 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09  9:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

Cool, you found this doesn't hurt performance too much?

Can't you remove the check from the reclaim code now? (The check
here should give a more timely wait anyway)

This is good because it should eliminate most all cases of extra
waiting. I wonder if you've also thought of doing the check in the
allocation path too as we were discussing? (this would give a better
FIFO behaviour under memory pressure but I could easily agree it is not
worth the cost)

On Mon, Mar 08, 2010 at 11:48:22AM +0000, Mel Gorman wrote:
> When a batch of pages have been freed to the buddy allocator, it is possible
> that it is enough to push a zone above its watermarks. This patch puts a
> check in the free path for zone pressure. It's in a common path but for
> the most part, it should only be checking if a linked list is empty and
> have minimal performance impact.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/page_alloc.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1383ff9..3c8e8b7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -562,6 +562,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		} while (--count && --batch_free && !list_empty(list));
>  	}
>  	spin_unlock(&zone->lock);
> +
> +	/* A batch of pages have been freed so check zone pressure */
> +	check_zone_pressure(zone);
>  }
>  
>  static void free_one_page(struct zone *zone, struct page *page, int order,
> -- 
> 1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion
  2010-03-08 11:48   ` Mel Gorman
@ 2010-03-09 10:00     ` Nick Piggin
  -1 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 10:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Mon, Mar 08, 2010 at 11:48:23AM +0000, Mel Gorman wrote:
> If kswapd is raising its priority to get the zone over the high
> watermark, it may call congestion_wait() ostensibly to allow congestion
> to clear. However, there is no guarantee that the queue is congested at
> this point because it depends on kswapds previous actions as well as the
> rest of the system. Kswapd could simply be working hard because there is
> a lot of SYNC traffic in which case it shouldn't be sleeping.
> 
> Rather than waiting on congestion and potentially sleeping for longer
> than it should, this patch puts kswapd back to sleep on the kswapd_wait
> queue for the timeout. If direct reclaimers are in trouble, kswapd will
> be rewoken as it should instead of sleeping when there is work to be
> done.

Well but it is quite possible that many allocators are coming in to
wake it up. So with your patch, I think we'd need to consider the case
where the timeout approaches 0 here (if it's always being woken).

Direct reclaimers need not be involved because the pages might be
hovering around the asynchronous reclaim watermarks (which would be
the ideal case of system operation).

In which case, can you explain how this change makes sense? Why is
it a good thing not to wait when we previously did wait?



> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/vmscan.c |   11 +++++++----
>  1 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f92a48..894d366 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1955,7 +1955,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>   * interoperates with the page allocator fallback scheme to ensure that aging
>   * of pages is balanced across the zones.
>   */
> -static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> +static unsigned long balance_pgdat(pg_data_t *pgdat, wait_queue_t *wait, int order)
>  {
>  	int all_zones_ok;
>  	int priority;
> @@ -2122,8 +2122,11 @@ loop_again:
>  		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
>  			if (has_under_min_watermark_zone)
>  				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
> -			else
> -				congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			else {
> +				prepare_to_wait(&pgdat->kswapd_wait, wait, TASK_INTERRUPTIBLE);
> +				schedule_timeout(HZ/10);
> +				finish_wait(&pgdat->kswapd_wait, wait);
> +			}
>  		}
>  
>  		/*
> @@ -2272,7 +2275,7 @@ static int kswapd(void *p)
>  		 * after returning from the refrigerator
>  		 */
>  		if (!ret)
> -			balance_pgdat(pgdat, order);
> +			balance_pgdat(pgdat, &wait, order);
>  	}
>  	return 0;
>  }
> -- 
> 1.6.5

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion
@ 2010-03-09 10:00     ` Nick Piggin
  0 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 10:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Mon, Mar 08, 2010 at 11:48:23AM +0000, Mel Gorman wrote:
> If kswapd is raising its priority to get the zone over the high
> watermark, it may call congestion_wait() ostensibly to allow congestion
> to clear. However, there is no guarantee that the queue is congested at
> this point because it depends on kswapds previous actions as well as the
> rest of the system. Kswapd could simply be working hard because there is
> a lot of SYNC traffic in which case it shouldn't be sleeping.
> 
> Rather than waiting on congestion and potentially sleeping for longer
> than it should, this patch puts kswapd back to sleep on the kswapd_wait
> queue for the timeout. If direct reclaimers are in trouble, kswapd will
> be rewoken as it should instead of sleeping when there is work to be
> done.

Well but it is quite possible that many allocators are coming in to
wake it up. So with your patch, I think we'd need to consider the case
where the timeout approaches 0 here (if it's always being woken).

Direct reclaimers need not be involved because the pages might be
hovering around the asynchronous reclaim watermarks (which would be
the ideal case of system operation).

In which case, can you explain how this change makes sense? Why is
it a good thing not to wait when we previously did wait?



> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/vmscan.c |   11 +++++++----
>  1 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f92a48..894d366 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1955,7 +1955,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>   * interoperates with the page allocator fallback scheme to ensure that aging
>   * of pages is balanced across the zones.
>   */
> -static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> +static unsigned long balance_pgdat(pg_data_t *pgdat, wait_queue_t *wait, int order)
>  {
>  	int all_zones_ok;
>  	int priority;
> @@ -2122,8 +2122,11 @@ loop_again:
>  		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
>  			if (has_under_min_watermark_zone)
>  				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
> -			else
> -				congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			else {
> +				prepare_to_wait(&pgdat->kswapd_wait, wait, TASK_INTERRUPTIBLE);
> +				schedule_timeout(HZ/10);
> +				finish_wait(&pgdat->kswapd_wait, wait);
> +			}
>  		}
>  
>  		/*
> @@ -2272,7 +2275,7 @@ static int kswapd(void *p)
>  		 * after returning from the refrigerator
>  		 */
>  		if (!ret)
> -			balance_pgdat(pgdat, order);
> +			balance_pgdat(pgdat, &wait, order);
>  	}
>  	return 0;
>  }
> -- 
> 1.6.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
  2010-03-09  9:53     ` Nick Piggin
@ 2010-03-09 10:08       ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 10:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> Cool, you found this doesn't hurt performance too much?
> 

Nothing outside the noise was measured. I didn't profile it to be
absolutly sure but I expect it's ok.

> Can't you remove the check from the reclaim code now? (The check
> here should give a more timely wait anyway)
> 

I'll try and see what the timing and total IO figures look like.

> This is good because it should eliminate most all cases of extra
> waiting. I wonder if you've also thought of doing the check in the
> allocation path too as we were discussing? (this would give a better
> FIFO behaviour under memory pressure but I could easily agree it is not
> worth the cost)
> 

I *could* make the check but as I noted in the leader, there isn't
really a good test case that determines if these changes are "good" or
"bad". Removing congestion_wait() seems like an obvious win but other
modifications that alter how and when processes wait are less obvious.

> On Mon, Mar 08, 2010 at 11:48:22AM +0000, Mel Gorman wrote:
> > When a batch of pages have been freed to the buddy allocator, it is possible
> > that it is enough to push a zone above its watermarks. This patch puts a
> > check in the free path for zone pressure. It's in a common path but for
> > the most part, it should only be checking if a linked list is empty and
> > have minimal performance impact.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1383ff9..3c8e8b7 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -562,6 +562,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  		} while (--count && --batch_free && !list_empty(list));
> >  	}
> >  	spin_unlock(&zone->lock);
> > +
> > +	/* A batch of pages have been freed so check zone pressure */
> > +	check_zone_pressure(zone);
> >  }
> >  
> >  static void free_one_page(struct zone *zone, struct page *page, int order,
> > -- 
> > 1.6.5
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
@ 2010-03-09 10:08       ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 10:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> Cool, you found this doesn't hurt performance too much?
> 

Nothing outside the noise was measured. I didn't profile it to be
absolutly sure but I expect it's ok.

> Can't you remove the check from the reclaim code now? (The check
> here should give a more timely wait anyway)
> 

I'll try and see what the timing and total IO figures look like.

> This is good because it should eliminate most all cases of extra
> waiting. I wonder if you've also thought of doing the check in the
> allocation path too as we were discussing? (this would give a better
> FIFO behaviour under memory pressure but I could easily agree it is not
> worth the cost)
> 

I *could* make the check but as I noted in the leader, there isn't
really a good test case that determines if these changes are "good" or
"bad". Removing congestion_wait() seems like an obvious win but other
modifications that alter how and when processes wait are less obvious.

> On Mon, Mar 08, 2010 at 11:48:22AM +0000, Mel Gorman wrote:
> > When a batch of pages have been freed to the buddy allocator, it is possible
> > that it is enough to push a zone above its watermarks. This patch puts a
> > check in the free path for zone pressure. It's in a common path but for
> > the most part, it should only be checking if a linked list is empty and
> > have minimal performance impact.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/page_alloc.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1383ff9..3c8e8b7 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -562,6 +562,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  		} while (--count && --batch_free && !list_empty(list));
> >  	}
> >  	spin_unlock(&zone->lock);
> > +
> > +	/* A batch of pages have been freed so check zone pressure */
> > +	check_zone_pressure(zone);
> >  }
> >  
> >  static void free_one_page(struct zone *zone, struct page *page, int order,
> > -- 
> > 1.6.5
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion
  2010-03-09 10:00     ` Nick Piggin
@ 2010-03-09 10:21       ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 10:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 09:00:44PM +1100, Nick Piggin wrote:
> On Mon, Mar 08, 2010 at 11:48:23AM +0000, Mel Gorman wrote:
> > If kswapd is raising its priority to get the zone over the high
> > watermark, it may call congestion_wait() ostensibly to allow congestion
> > to clear. However, there is no guarantee that the queue is congested at
> > this point because it depends on kswapds previous actions as well as the
> > rest of the system. Kswapd could simply be working hard because there is
> > a lot of SYNC traffic in which case it shouldn't be sleeping.
> > 
> > Rather than waiting on congestion and potentially sleeping for longer
> > than it should, this patch puts kswapd back to sleep on the kswapd_wait
> > queue for the timeout. If direct reclaimers are in trouble, kswapd will
> > be rewoken as it should instead of sleeping when there is work to be
> > done.
> 
> Well but it is quite possible that many allocators are coming in to
> wake it up. So with your patch, I think we'd need to consider the case
> where the timeout approaches 0 here (if it's always being woken).
> 

True, similar to how zonepressure_wait() rechecks the watermarks if
there is still a timeout left and deciding whether to sleep again or
not.

> Direct reclaimers need not be involved because the pages might be
> hovering around the asynchronous reclaim watermarks (which would be
> the ideal case of system operation).
> 
> In which case, can you explain how this change makes sense? Why is
> it a good thing not to wait when we previously did wait?
> 

Well, it makes sense from the perspective it's better for kswapd to be doing
work than direct reclaim. If processes are hitting the watermarks then
why should kswapd be asleep?

That said, if the timeout was non-zero it should be able to make some decision
on whether it should be really awake. Putting the page allocator and kswapd
patches into the same series was a mistake because it's conflating two
different problems as one. I'm going to drop this one for the moment and
treat the page allocator patch in isolation.

> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/vmscan.c |   11 +++++++----
> >  1 files changed, 7 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4f92a48..894d366 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1955,7 +1955,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >   * interoperates with the page allocator fallback scheme to ensure that aging
> >   * of pages is balanced across the zones.
> >   */
> > -static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > +static unsigned long balance_pgdat(pg_data_t *pgdat, wait_queue_t *wait, int order)
> >  {
> >  	int all_zones_ok;
> >  	int priority;
> > @@ -2122,8 +2122,11 @@ loop_again:
> >  		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
> >  			if (has_under_min_watermark_zone)
> >  				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
> > -			else
> > -				congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +			else {
> > +				prepare_to_wait(&pgdat->kswapd_wait, wait, TASK_INTERRUPTIBLE);
> > +				schedule_timeout(HZ/10);
> > +				finish_wait(&pgdat->kswapd_wait, wait);
> > +			}
> >  		}
> >  
> >  		/*
> > @@ -2272,7 +2275,7 @@ static int kswapd(void *p)
> >  		 * after returning from the refrigerator
> >  		 */
> >  		if (!ret)
> > -			balance_pgdat(pgdat, order);
> > +			balance_pgdat(pgdat, &wait, order);
> >  	}
> >  	return 0;
> >  }
> > -- 
> > 1.6.5
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion
@ 2010-03-09 10:21       ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 10:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 09:00:44PM +1100, Nick Piggin wrote:
> On Mon, Mar 08, 2010 at 11:48:23AM +0000, Mel Gorman wrote:
> > If kswapd is raising its priority to get the zone over the high
> > watermark, it may call congestion_wait() ostensibly to allow congestion
> > to clear. However, there is no guarantee that the queue is congested at
> > this point because it depends on kswapds previous actions as well as the
> > rest of the system. Kswapd could simply be working hard because there is
> > a lot of SYNC traffic in which case it shouldn't be sleeping.
> > 
> > Rather than waiting on congestion and potentially sleeping for longer
> > than it should, this patch puts kswapd back to sleep on the kswapd_wait
> > queue for the timeout. If direct reclaimers are in trouble, kswapd will
> > be rewoken as it should instead of sleeping when there is work to be
> > done.
> 
> Well but it is quite possible that many allocators are coming in to
> wake it up. So with your patch, I think we'd need to consider the case
> where the timeout approaches 0 here (if it's always being woken).
> 

True, similar to how zonepressure_wait() rechecks the watermarks if
there is still a timeout left and deciding whether to sleep again or
not.

> Direct reclaimers need not be involved because the pages might be
> hovering around the asynchronous reclaim watermarks (which would be
> the ideal case of system operation).
> 
> In which case, can you explain how this change makes sense? Why is
> it a good thing not to wait when we previously did wait?
> 

Well, it makes sense from the perspective it's better for kswapd to be doing
work than direct reclaim. If processes are hitting the watermarks then
why should kswapd be asleep?

That said, if the timeout was non-zero it should be able to make some decision
on whether it should be really awake. Putting the page allocator and kswapd
patches into the same series was a mistake because it's conflating two
different problems as one. I'm going to drop this one for the moment and
treat the page allocator patch in isolation.

> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/vmscan.c |   11 +++++++----
> >  1 files changed, 7 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4f92a48..894d366 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1955,7 +1955,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >   * interoperates with the page allocator fallback scheme to ensure that aging
> >   * of pages is balanced across the zones.
> >   */
> > -static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > +static unsigned long balance_pgdat(pg_data_t *pgdat, wait_queue_t *wait, int order)
> >  {
> >  	int all_zones_ok;
> >  	int priority;
> > @@ -2122,8 +2122,11 @@ loop_again:
> >  		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
> >  			if (has_under_min_watermark_zone)
> >  				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
> > -			else
> > -				congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +			else {
> > +				prepare_to_wait(&pgdat->kswapd_wait, wait, TASK_INTERRUPTIBLE);
> > +				schedule_timeout(HZ/10);
> > +				finish_wait(&pgdat->kswapd_wait, wait);
> > +			}
> >  		}
> >  
> >  		/*
> > @@ -2272,7 +2275,7 @@ static int kswapd(void *p)
> >  		 * after returning from the refrigerator
> >  		 */
> >  		if (!ret)
> > -			balance_pgdat(pgdat, order);
> > +			balance_pgdat(pgdat, &wait, order);
> >  	}
> >  	return 0;
> >  }
> > -- 
> > 1.6.5
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
  2010-03-09 10:08       ` Mel Gorman
@ 2010-03-09 10:23         ` Nick Piggin
  -1 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 10:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:08:35AM +0000, Mel Gorman wrote:
> On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> > Cool, you found this doesn't hurt performance too much?
> > 
> 
> Nothing outside the noise was measured. I didn't profile it to be
> absolutly sure but I expect it's ok.

OK. Moving the waitqueue cacheline out of the fastpath footprint
and doing the flag thing might be a good idea?

 
> > Can't you remove the check from the reclaim code now? (The check
> > here should give a more timely wait anyway)
> > 
> 
> I'll try and see what the timing and total IO figures look like.

Well reclaim goes through free_pages_bulk anyway, doesn't it? So
I don't see why you would have to run any test.

 
> > This is good because it should eliminate most all cases of extra
> > waiting. I wonder if you've also thought of doing the check in the
> > allocation path too as we were discussing? (this would give a better
> > FIFO behaviour under memory pressure but I could easily agree it is not
> > worth the cost)
> > 
> 
> I *could* make the check but as I noted in the leader, there isn't
> really a good test case that determines if these changes are "good" or
> "bad". Removing congestion_wait() seems like an obvious win but other
> modifications that alter how and when processes wait are less obvious.

Fair enough. But we could be sure it increases fairness, which is a
good thing. So then we'd just have to check it against performance.

Your patches seem like a good idea regardless of this issue, don't get
me wrong.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
@ 2010-03-09 10:23         ` Nick Piggin
  0 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 10:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:08:35AM +0000, Mel Gorman wrote:
> On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> > Cool, you found this doesn't hurt performance too much?
> > 
> 
> Nothing outside the noise was measured. I didn't profile it to be
> absolutly sure but I expect it's ok.

OK. Moving the waitqueue cacheline out of the fastpath footprint
and doing the flag thing might be a good idea?

 
> > Can't you remove the check from the reclaim code now? (The check
> > here should give a more timely wait anyway)
> > 
> 
> I'll try and see what the timing and total IO figures look like.

Well reclaim goes through free_pages_bulk anyway, doesn't it? So
I don't see why you would have to run any test.

 
> > This is good because it should eliminate most all cases of extra
> > waiting. I wonder if you've also thought of doing the check in the
> > allocation path too as we were discussing? (this would give a better
> > FIFO behaviour under memory pressure but I could easily agree it is not
> > worth the cost)
> > 
> 
> I *could* make the check but as I noted in the leader, there isn't
> really a good test case that determines if these changes are "good" or
> "bad". Removing congestion_wait() seems like an obvious win but other
> modifications that alter how and when processes wait are less obvious.

Fair enough. But we could be sure it increases fairness, which is a
good thing. So then we'd just have to check it against performance.

Your patches seem like a good idea regardless of this issue, don't get
me wrong.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion
  2010-03-09 10:21       ` Mel Gorman
@ 2010-03-09 10:32         ` Nick Piggin
  -1 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 10:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:21:46AM +0000, Mel Gorman wrote:
> On Tue, Mar 09, 2010 at 09:00:44PM +1100, Nick Piggin wrote:
> > On Mon, Mar 08, 2010 at 11:48:23AM +0000, Mel Gorman wrote:
> > > If kswapd is raising its priority to get the zone over the high
> > > watermark, it may call congestion_wait() ostensibly to allow congestion
> > > to clear. However, there is no guarantee that the queue is congested at
> > > this point because it depends on kswapds previous actions as well as the
> > > rest of the system. Kswapd could simply be working hard because there is
> > > a lot of SYNC traffic in which case it shouldn't be sleeping.
> > > 
> > > Rather than waiting on congestion and potentially sleeping for longer
> > > than it should, this patch puts kswapd back to sleep on the kswapd_wait
> > > queue for the timeout. If direct reclaimers are in trouble, kswapd will
> > > be rewoken as it should instead of sleeping when there is work to be
> > > done.
> > 
> > Well but it is quite possible that many allocators are coming in to
> > wake it up. So with your patch, I think we'd need to consider the case
> > where the timeout approaches 0 here (if it's always being woken).
> > 
> 
> True, similar to how zonepressure_wait() rechecks the watermarks if
> there is still a timeout left and deciding whether to sleep again or
> not.
> 
> > Direct reclaimers need not be involved because the pages might be
> > hovering around the asynchronous reclaim watermarks (which would be
> > the ideal case of system operation).
> > 
> > In which case, can you explain how this change makes sense? Why is
> > it a good thing not to wait when we previously did wait?
> > 
> 
> Well, it makes sense from the perspective it's better for kswapd to be doing
> work than direct reclaim. If processes are hitting the watermarks then
> why should kswapd be asleep?

Well I said we should consider the case where level is remaining within
the asynch watermarks.

The kswapd waitqueue does not correlate so well to direct reclaimers.

And I don't know if I agree with your assertion really. Once we _know_
we have to do direct reclaim anyway, it is too late to wait for kswapd
(unless we change that aspect of reclaim so it actually does wait for
kswapd or limit direct reclaimers).

And also, once we know we have to do direct reclaim (eg. under serious
memory pressure), then maybe it actually would be better to account
CPU time to the processes doing the allocations rather than kswapd.

kswapd seems a good thing for when we *are* able to keep up with asynch
watermarks, but after that it isn't quite so clear (FS|IO reclaim
context is obviously also a good thing too).

 
> That said, if the timeout was non-zero it should be able to make some decision
> on whether it should be really awake. Putting the page allocator and kswapd
> patches into the same series was a mistake because it's conflating two
> different problems as one. I'm going to drop this one for the moment and
> treat the page allocator patch in isolation.

Probably a good idea.


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion
@ 2010-03-09 10:32         ` Nick Piggin
  0 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 10:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:21:46AM +0000, Mel Gorman wrote:
> On Tue, Mar 09, 2010 at 09:00:44PM +1100, Nick Piggin wrote:
> > On Mon, Mar 08, 2010 at 11:48:23AM +0000, Mel Gorman wrote:
> > > If kswapd is raising its priority to get the zone over the high
> > > watermark, it may call congestion_wait() ostensibly to allow congestion
> > > to clear. However, there is no guarantee that the queue is congested at
> > > this point because it depends on kswapds previous actions as well as the
> > > rest of the system. Kswapd could simply be working hard because there is
> > > a lot of SYNC traffic in which case it shouldn't be sleeping.
> > > 
> > > Rather than waiting on congestion and potentially sleeping for longer
> > > than it should, this patch puts kswapd back to sleep on the kswapd_wait
> > > queue for the timeout. If direct reclaimers are in trouble, kswapd will
> > > be rewoken as it should instead of sleeping when there is work to be
> > > done.
> > 
> > Well but it is quite possible that many allocators are coming in to
> > wake it up. So with your patch, I think we'd need to consider the case
> > where the timeout approaches 0 here (if it's always being woken).
> > 
> 
> True, similar to how zonepressure_wait() rechecks the watermarks if
> there is still a timeout left and deciding whether to sleep again or
> not.
> 
> > Direct reclaimers need not be involved because the pages might be
> > hovering around the asynchronous reclaim watermarks (which would be
> > the ideal case of system operation).
> > 
> > In which case, can you explain how this change makes sense? Why is
> > it a good thing not to wait when we previously did wait?
> > 
> 
> Well, it makes sense from the perspective it's better for kswapd to be doing
> work than direct reclaim. If processes are hitting the watermarks then
> why should kswapd be asleep?

Well I said we should consider the case where level is remaining within
the asynch watermarks.

The kswapd waitqueue does not correlate so well to direct reclaimers.

And I don't know if I agree with your assertion really. Once we _know_
we have to do direct reclaim anyway, it is too late to wait for kswapd
(unless we change that aspect of reclaim so it actually does wait for
kswapd or limit direct reclaimers).

And also, once we know we have to do direct reclaim (eg. under serious
memory pressure), then maybe it actually would be better to account
CPU time to the processes doing the allocations rather than kswapd.

kswapd seems a good thing for when we *are* able to keep up with asynch
watermarks, but after that it isn't quite so clear (FS|IO reclaim
context is obviously also a good thing too).

 
> That said, if the timeout was non-zero it should be able to make some decision
> on whether it should be really awake. Putting the page allocator and kswapd
> patches into the same series was a mistake because it's conflating two
> different problems as one. I'm going to drop this one for the moment and
> treat the page allocator patch in isolation.

Probably a good idea.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
  2010-03-09 10:23         ` Nick Piggin
@ 2010-03-09 10:36           ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 10:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 09:23:45PM +1100, Nick Piggin wrote:
> On Tue, Mar 09, 2010 at 10:08:35AM +0000, Mel Gorman wrote:
> > On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> > > Cool, you found this doesn't hurt performance too much?
> > > 
> > 
> > Nothing outside the noise was measured. I didn't profile it to be
> > absolutly sure but I expect it's ok.
> 
> OK. Moving the waitqueue cacheline out of the fastpath footprint
> and doing the flag thing might be a good idea?
> 

Probably, I'll do it as a separate micro-optimisation patch so it's
clear what I'm doing.

> > > Can't you remove the check from the reclaim code now? (The check
> > > here should give a more timely wait anyway)
> > > 
> > 
> > I'll try and see what the timing and total IO figures look like.
> 
> Well reclaim goes through free_pages_bulk anyway, doesn't it? So
> I don't see why you would have to run any test.
>  

It should be fine but no harm in double checking. The tests I'm doing
are not great anyway. I'm somewhat depending on people familar with
IO-related performance testing to give this a whirl or tell me how they
typically benchmark low-memory situations.

> > > This is good because it should eliminate most all cases of extra
> > > waiting. I wonder if you've also thought of doing the check in the
> > > allocation path too as we were discussing? (this would give a better
> > > FIFO behaviour under memory pressure but I could easily agree it is not
> > > worth the cost)
> > > 
> > 
> > I *could* make the check but as I noted in the leader, there isn't
> > really a good test case that determines if these changes are "good" or
> > "bad". Removing congestion_wait() seems like an obvious win but other
> > modifications that alter how and when processes wait are less obvious.
> 
> Fair enough. But we could be sure it increases fairness, which is a
> good thing. So then we'd just have to check it against performance.
> 

Ordinarily, I'd agree but we've seen bug reports before from applications
that depended on unfairness for good performance. dbench figures depended
at one point in unfair behaviour (specifically being allowed to dirty the
whole system). volanomark was one that suffered when the scheduler became
more fair (think sched_yield was also a biggie). The new behaviour was
better and arguably the applications were doing the wrong thing but I'd
still like to treat "increase fairness in the page allocator" as a
separate patch as a result.

> Your patches seem like a good idea regardless of this issue, don't get
> me wrong.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
@ 2010-03-09 10:36           ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 10:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 09:23:45PM +1100, Nick Piggin wrote:
> On Tue, Mar 09, 2010 at 10:08:35AM +0000, Mel Gorman wrote:
> > On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> > > Cool, you found this doesn't hurt performance too much?
> > > 
> > 
> > Nothing outside the noise was measured. I didn't profile it to be
> > absolutly sure but I expect it's ok.
> 
> OK. Moving the waitqueue cacheline out of the fastpath footprint
> and doing the flag thing might be a good idea?
> 

Probably, I'll do it as a separate micro-optimisation patch so it's
clear what I'm doing.

> > > Can't you remove the check from the reclaim code now? (The check
> > > here should give a more timely wait anyway)
> > > 
> > 
> > I'll try and see what the timing and total IO figures look like.
> 
> Well reclaim goes through free_pages_bulk anyway, doesn't it? So
> I don't see why you would have to run any test.
>  

It should be fine but no harm in double checking. The tests I'm doing
are not great anyway. I'm somewhat depending on people familar with
IO-related performance testing to give this a whirl or tell me how they
typically benchmark low-memory situations.

> > > This is good because it should eliminate most all cases of extra
> > > waiting. I wonder if you've also thought of doing the check in the
> > > allocation path too as we were discussing? (this would give a better
> > > FIFO behaviour under memory pressure but I could easily agree it is not
> > > worth the cost)
> > > 
> > 
> > I *could* make the check but as I noted in the leader, there isn't
> > really a good test case that determines if these changes are "good" or
> > "bad". Removing congestion_wait() seems like an obvious win but other
> > modifications that alter how and when processes wait are less obvious.
> 
> Fair enough. But we could be sure it increases fairness, which is a
> good thing. So then we'd just have to check it against performance.
> 

Ordinarily, I'd agree but we've seen bug reports before from applications
that depended on unfairness for good performance. dbench figures depended
at one point in unfair behaviour (specifically being allowed to dirty the
whole system). volanomark was one that suffered when the scheduler became
more fair (think sched_yield was also a biggie). The new behaviour was
better and arguably the applications were doing the wrong thing but I'd
still like to treat "increase fairness in the page allocator" as a
separate patch as a result.

> Your patches seem like a good idea regardless of this issue, don't get
> me wrong.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
  2010-03-09 10:36           ` Mel Gorman
@ 2010-03-09 11:11             ` Nick Piggin
  -1 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 11:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:36:08AM +0000, Mel Gorman wrote:
> On Tue, Mar 09, 2010 at 09:23:45PM +1100, Nick Piggin wrote:
> > On Tue, Mar 09, 2010 at 10:08:35AM +0000, Mel Gorman wrote:
> > > On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> > > > Cool, you found this doesn't hurt performance too much?
> > > > 
> > > 
> > > Nothing outside the noise was measured. I didn't profile it to be
> > > absolutly sure but I expect it's ok.
> > 
> > OK. Moving the waitqueue cacheline out of the fastpath footprint
> > and doing the flag thing might be a good idea?
> > 
> 
> Probably, I'll do it as a separate micro-optimisation patch so it's
> clear what I'm doing.

Fair enough.

 
> > > > Can't you remove the check from the reclaim code now? (The check
> > > > here should give a more timely wait anyway)
> > > > 
> > > 
> > > I'll try and see what the timing and total IO figures look like.
> > 
> > Well reclaim goes through free_pages_bulk anyway, doesn't it? So
> > I don't see why you would have to run any test.
> >  
> 
> It should be fine but no harm in double checking. The tests I'm doing
> are not great anyway. I'm somewhat depending on people familar with
> IO-related performance testing to give this a whirl or tell me how they
> typically benchmark low-memory situations.

I don't really like that logic. It makes things harder to understand
down the road if you have double checks.

 
> > > > This is good because it should eliminate most all cases of extra
> > > > waiting. I wonder if you've also thought of doing the check in the
> > > > allocation path too as we were discussing? (this would give a better
> > > > FIFO behaviour under memory pressure but I could easily agree it is not
> > > > worth the cost)
> > > > 
> > > 
> > > I *could* make the check but as I noted in the leader, there isn't
> > > really a good test case that determines if these changes are "good" or
> > > "bad". Removing congestion_wait() seems like an obvious win but other
> > > modifications that alter how and when processes wait are less obvious.
> > 
> > Fair enough. But we could be sure it increases fairness, which is a
> > good thing. So then we'd just have to check it against performance.
> > 
> 
> Ordinarily, I'd agree but we've seen bug reports before from applications
> that depended on unfairness for good performance. dbench figures depended
> at one point in unfair behaviour (specifically being allowed to dirty the
> whole system). volanomark was one that suffered when the scheduler became
> more fair (think sched_yield was also a biggie). The new behaviour was
> better and arguably the applications were doing the wrong thing but I'd
> still like to treat "increase fairness in the page allocator" as a
> separate patch as a result.

Yeah sure it would be done as another patch. I don't think there is much
question that making things fairer is better. Especially if the
alternative is a theoretical starvation.

That's not to say that batching shouldn't then be used to help improve
performance of fairly scheduled resources. But it should be done in a
carefully designed and controlled way, so that neither the fairness /
starvation, nor the good performance from batching, depend on timing
and behaviours of the hardware interconnect etc.



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
@ 2010-03-09 11:11             ` Nick Piggin
  0 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 11:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:36:08AM +0000, Mel Gorman wrote:
> On Tue, Mar 09, 2010 at 09:23:45PM +1100, Nick Piggin wrote:
> > On Tue, Mar 09, 2010 at 10:08:35AM +0000, Mel Gorman wrote:
> > > On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> > > > Cool, you found this doesn't hurt performance too much?
> > > > 
> > > 
> > > Nothing outside the noise was measured. I didn't profile it to be
> > > absolutly sure but I expect it's ok.
> > 
> > OK. Moving the waitqueue cacheline out of the fastpath footprint
> > and doing the flag thing might be a good idea?
> > 
> 
> Probably, I'll do it as a separate micro-optimisation patch so it's
> clear what I'm doing.

Fair enough.

 
> > > > Can't you remove the check from the reclaim code now? (The check
> > > > here should give a more timely wait anyway)
> > > > 
> > > 
> > > I'll try and see what the timing and total IO figures look like.
> > 
> > Well reclaim goes through free_pages_bulk anyway, doesn't it? So
> > I don't see why you would have to run any test.
> >  
> 
> It should be fine but no harm in double checking. The tests I'm doing
> are not great anyway. I'm somewhat depending on people familar with
> IO-related performance testing to give this a whirl or tell me how they
> typically benchmark low-memory situations.

I don't really like that logic. It makes things harder to understand
down the road if you have double checks.

 
> > > > This is good because it should eliminate most all cases of extra
> > > > waiting. I wonder if you've also thought of doing the check in the
> > > > allocation path too as we were discussing? (this would give a better
> > > > FIFO behaviour under memory pressure but I could easily agree it is not
> > > > worth the cost)
> > > > 
> > > 
> > > I *could* make the check but as I noted in the leader, there isn't
> > > really a good test case that determines if these changes are "good" or
> > > "bad". Removing congestion_wait() seems like an obvious win but other
> > > modifications that alter how and when processes wait are less obvious.
> > 
> > Fair enough. But we could be sure it increases fairness, which is a
> > good thing. So then we'd just have to check it against performance.
> > 
> 
> Ordinarily, I'd agree but we've seen bug reports before from applications
> that depended on unfairness for good performance. dbench figures depended
> at one point in unfair behaviour (specifically being allowed to dirty the
> whole system). volanomark was one that suffered when the scheduler became
> more fair (think sched_yield was also a biggie). The new behaviour was
> better and arguably the applications were doing the wrong thing but I'd
> still like to treat "increase fairness in the page allocator" as a
> separate patch as a result.

Yeah sure it would be done as another patch. I don't think there is much
question that making things fairer is better. Especially if the
alternative is a theoretical starvation.

That's not to say that batching shouldn't then be used to help improve
performance of fairly scheduled resources. But it should be done in a
carefully designed and controlled way, so that neither the fairness /
starvation, nor the good performance from batching, depend on timing
and behaviours of the hardware interconnect etc.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
  2010-03-09 11:11             ` Nick Piggin
@ 2010-03-09 11:29               ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 11:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:11:18PM +1100, Nick Piggin wrote:
> On Tue, Mar 09, 2010 at 10:36:08AM +0000, Mel Gorman wrote:
> > On Tue, Mar 09, 2010 at 09:23:45PM +1100, Nick Piggin wrote:
> > > On Tue, Mar 09, 2010 at 10:08:35AM +0000, Mel Gorman wrote:
> > > > On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> > > > > Cool, you found this doesn't hurt performance too much?
> > > > > 
> > > > 
> > > > Nothing outside the noise was measured. I didn't profile it to be
> > > > absolutly sure but I expect it's ok.
> > > 
> > > OK. Moving the waitqueue cacheline out of the fastpath footprint
> > > and doing the flag thing might be a good idea?
> > > 
> > 
> > Probably, I'll do it as a separate micro-optimisation patch so it's
> > clear what I'm doing.
> 
> Fair enough.
> 
> > > > > Can't you remove the check from the reclaim code now? (The check
> > > > > here should give a more timely wait anyway)
> > > > > 
> > > > 
> > > > I'll try and see what the timing and total IO figures look like.
> > > 
> > > Well reclaim goes through free_pages_bulk anyway, doesn't it? So
> > > I don't see why you would have to run any test.
> > >  
> > 
> > It should be fine but no harm in double checking. The tests I'm doing
> > are not great anyway. I'm somewhat depending on people familar with
> > IO-related performance testing to give this a whirl or tell me how they
> > typically benchmark low-memory situations.
> 
> I don't really like that logic. It makes things harder to understand
> down the road if you have double checks.

There *should* be no difference and that is my expectation. If there is,
it means I'm missing something important. Hence, the double check.

> > > > > This is good because it should eliminate most all cases of extra
> > > > > waiting. I wonder if you've also thought of doing the check in the
> > > > > allocation path too as we were discussing? (this would give a better
> > > > > FIFO behaviour under memory pressure but I could easily agree it is not
> > > > > worth the cost)
> > > > > 
> > > > 
> > > > I *could* make the check but as I noted in the leader, there isn't
> > > > really a good test case that determines if these changes are "good" or
> > > > "bad". Removing congestion_wait() seems like an obvious win but other
> > > > modifications that alter how and when processes wait are less obvious.
> > > 
> > > Fair enough. But we could be sure it increases fairness, which is a
> > > good thing. So then we'd just have to check it against performance.
> > > 
> > 
> > Ordinarily, I'd agree but we've seen bug reports before from applications
> > that depended on unfairness for good performance. dbench figures depended
> > at one point in unfair behaviour (specifically being allowed to dirty the
> > whole system). volanomark was one that suffered when the scheduler became
> > more fair (think sched_yield was also a biggie). The new behaviour was
> > better and arguably the applications were doing the wrong thing but I'd
> > still like to treat "increase fairness in the page allocator" as a
> > separate patch as a result.
> 
> Yeah sure it would be done as another patch. I don't think there is much
> question that making things fairer is better. Especially if the
> alternative is a theoretical starvation.
> 

Agreed.

> That's not to say that batching shouldn't then be used to help improve
> performance of fairly scheduled resources. But it should be done in a
> carefully designed and controlled way, so that neither the fairness /
> starvation, nor the good performance from batching, depend on timing
> and behaviours of the hardware interconnect etc.
> 

Indeed. Batching is less clear-cut in this context. We are already
batching on a per-CPU basis but not on a per-process basis. My feeling
is that the problem to watch out for with queueing in the allocation
path is 2+ processes waiting on the queue and then allocating too much
on the per-cpu lists. Easy enough to handle that one but there are
probably a few more gotchas in there somewhere. Will revisit for sure
though.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed
@ 2010-03-09 11:29               ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 11:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:11:18PM +1100, Nick Piggin wrote:
> On Tue, Mar 09, 2010 at 10:36:08AM +0000, Mel Gorman wrote:
> > On Tue, Mar 09, 2010 at 09:23:45PM +1100, Nick Piggin wrote:
> > > On Tue, Mar 09, 2010 at 10:08:35AM +0000, Mel Gorman wrote:
> > > > On Tue, Mar 09, 2010 at 08:53:42PM +1100, Nick Piggin wrote:
> > > > > Cool, you found this doesn't hurt performance too much?
> > > > > 
> > > > 
> > > > Nothing outside the noise was measured. I didn't profile it to be
> > > > absolutly sure but I expect it's ok.
> > > 
> > > OK. Moving the waitqueue cacheline out of the fastpath footprint
> > > and doing the flag thing might be a good idea?
> > > 
> > 
> > Probably, I'll do it as a separate micro-optimisation patch so it's
> > clear what I'm doing.
> 
> Fair enough.
> 
> > > > > Can't you remove the check from the reclaim code now? (The check
> > > > > here should give a more timely wait anyway)
> > > > > 
> > > > 
> > > > I'll try and see what the timing and total IO figures look like.
> > > 
> > > Well reclaim goes through free_pages_bulk anyway, doesn't it? So
> > > I don't see why you would have to run any test.
> > >  
> > 
> > It should be fine but no harm in double checking. The tests I'm doing
> > are not great anyway. I'm somewhat depending on people familar with
> > IO-related performance testing to give this a whirl or tell me how they
> > typically benchmark low-memory situations.
> 
> I don't really like that logic. It makes things harder to understand
> down the road if you have double checks.

There *should* be no difference and that is my expectation. If there is,
it means I'm missing something important. Hence, the double check.

> > > > > This is good because it should eliminate most all cases of extra
> > > > > waiting. I wonder if you've also thought of doing the check in the
> > > > > allocation path too as we were discussing? (this would give a better
> > > > > FIFO behaviour under memory pressure but I could easily agree it is not
> > > > > worth the cost)
> > > > > 
> > > > 
> > > > I *could* make the check but as I noted in the leader, there isn't
> > > > really a good test case that determines if these changes are "good" or
> > > > "bad". Removing congestion_wait() seems like an obvious win but other
> > > > modifications that alter how and when processes wait are less obvious.
> > > 
> > > Fair enough. But we could be sure it increases fairness, which is a
> > > good thing. So then we'd just have to check it against performance.
> > > 
> > 
> > Ordinarily, I'd agree but we've seen bug reports before from applications
> > that depended on unfairness for good performance. dbench figures depended
> > at one point in unfair behaviour (specifically being allowed to dirty the
> > whole system). volanomark was one that suffered when the scheduler became
> > more fair (think sched_yield was also a biggie). The new behaviour was
> > better and arguably the applications were doing the wrong thing but I'd
> > still like to treat "increase fairness in the page allocator" as a
> > separate patch as a result.
> 
> Yeah sure it would be done as another patch. I don't think there is much
> question that making things fairer is better. Especially if the
> alternative is a theoretical starvation.
> 

Agreed.

> That's not to say that batching shouldn't then be used to help improve
> performance of fairly scheduled resources. But it should be done in a
> carefully designed and controlled way, so that neither the fairness /
> starvation, nor the good performance from batching, depend on timing
> and behaviours of the hardware interconnect etc.
> 

Indeed. Batching is less clear-cut in this context. We are already
batching on a per-CPU basis but not on a per-process basis. My feeling
is that the problem to watch out for with queueing in the allocation
path is 2+ processes waiting on the queue and then allocating too much
on the per-cpu lists. Easy enough to handle that one but there are
probably a few more gotchas in there somewhere. Will revisit for sure
though.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-08 11:48   ` Mel Gorman
@ 2010-03-09 13:35     ` Nick Piggin
  -1 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 13:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> Under heavy memory pressure, the page allocator may call congestion_wait()
> to wait for IO congestion to clear or a timeout. This is not as sensible
> a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> even congested as the pressure could have been due to a large number of
> SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> 
> At the point of congestion_wait(), the allocator is struggling to get the
> pages it needs and it should back off. This patch puts the allocator to sleep
> on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> kswapd brings the zone over the low watermark, whichever happens first.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/mmzone.h |    3 ++
>  mm/internal.h          |    4 +++
>  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
>  mm/vmscan.c            |    2 +
>  5 files changed, 101 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..72465c1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -398,6 +398,9 @@ struct zone {
>  	unsigned long		wait_table_hash_nr_entries;
>  	unsigned long		wait_table_bits;
>  
> +	/* queue for processes waiting for pressure to relieve */
> +	wait_queue_head_t	*pressure_wq;

Hmm, processes may be eligible to allocate from > 1 zone, but you
have them only waiting for one. I wonder if we shouldn't wait for
more zones?

Congestion waiting uses a global waitqueue, which hasn't seemed to
cause a big scalability problem. Would it be better to have a global
waitqueue for this too?


> +void check_zone_pressure(struct zone *zone)

I don't really like the name pressure. We use that term for the reclaim
pressure wheras we're just checking watermarks here (actual pressure
could be anything).


> +{
> +	/* If no process is waiting, nothing to do */
> +	if (!waitqueue_active(zone->pressure_wq))
> +		return;
> +
> +	/* Check if the high watermark is ok for order 0 */
> +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> +		wake_up_interruptible(zone->pressure_wq);
> +}

If you were to do this under the zone lock (in your subsequent patch),
then it could avoid races. I would suggest doing it all as a single
patch and not doing the pressure checks in reclaim at all.

If you are missing anything, then that needs to be explained and fixed
rather than just adding extra checks.

> +
> +/**
> + * zonepressure_wait - Wait for pressure on a zone to ease off
> + * @zone: The zone that is expected to be under pressure
> + * @order: The order the caller is waiting on pages for
> + * @timeout: Wait until pressure is relieved or this timeout is reached
> + *
> + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> + * It's considered to be relieved if any direct reclaimer or kswapd brings
> + * the zone above the high watermark
> + */
> +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> +{
> +	long ret;
> +	DEFINE_WAIT(wait);
> +
> +wait_again:
> +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);

I guess to do it without races you need to check watermark here.
And possibly some barriers if it is done without zone->lock.

> +
> +	/*
> +	 * The use of io_schedule_timeout() here means that it gets
> +	 * accounted for as IO waiting. This may or may not be the case
> +	 * but at least this way it gets picked up by vmstat
> +	 */
> +	ret = io_schedule_timeout(timeout);
> +	finish_wait(zone->pressure_wq, &wait);
> +
> +	/* If woken early, check watermarks before continuing */
> +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> +		timeout = ret;
> +		goto wait_again;
> +	}

And then I don't know if we'd really need the extra check here. Might as
well just let the allocator try again and avoid the code?


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 13:35     ` Nick Piggin
  0 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 13:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> Under heavy memory pressure, the page allocator may call congestion_wait()
> to wait for IO congestion to clear or a timeout. This is not as sensible
> a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> even congested as the pressure could have been due to a large number of
> SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> 
> At the point of congestion_wait(), the allocator is struggling to get the
> pages it needs and it should back off. This patch puts the allocator to sleep
> on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> kswapd brings the zone over the low watermark, whichever happens first.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/mmzone.h |    3 ++
>  mm/internal.h          |    4 +++
>  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
>  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
>  mm/vmscan.c            |    2 +
>  5 files changed, 101 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..72465c1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -398,6 +398,9 @@ struct zone {
>  	unsigned long		wait_table_hash_nr_entries;
>  	unsigned long		wait_table_bits;
>  
> +	/* queue for processes waiting for pressure to relieve */
> +	wait_queue_head_t	*pressure_wq;

Hmm, processes may be eligible to allocate from > 1 zone, but you
have them only waiting for one. I wonder if we shouldn't wait for
more zones?

Congestion waiting uses a global waitqueue, which hasn't seemed to
cause a big scalability problem. Would it be better to have a global
waitqueue for this too?


> +void check_zone_pressure(struct zone *zone)

I don't really like the name pressure. We use that term for the reclaim
pressure wheras we're just checking watermarks here (actual pressure
could be anything).


> +{
> +	/* If no process is waiting, nothing to do */
> +	if (!waitqueue_active(zone->pressure_wq))
> +		return;
> +
> +	/* Check if the high watermark is ok for order 0 */
> +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> +		wake_up_interruptible(zone->pressure_wq);
> +}

If you were to do this under the zone lock (in your subsequent patch),
then it could avoid races. I would suggest doing it all as a single
patch and not doing the pressure checks in reclaim at all.

If you are missing anything, then that needs to be explained and fixed
rather than just adding extra checks.

> +
> +/**
> + * zonepressure_wait - Wait for pressure on a zone to ease off
> + * @zone: The zone that is expected to be under pressure
> + * @order: The order the caller is waiting on pages for
> + * @timeout: Wait until pressure is relieved or this timeout is reached
> + *
> + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> + * It's considered to be relieved if any direct reclaimer or kswapd brings
> + * the zone above the high watermark
> + */
> +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> +{
> +	long ret;
> +	DEFINE_WAIT(wait);
> +
> +wait_again:
> +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);

I guess to do it without races you need to check watermark here.
And possibly some barriers if it is done without zone->lock.

> +
> +	/*
> +	 * The use of io_schedule_timeout() here means that it gets
> +	 * accounted for as IO waiting. This may or may not be the case
> +	 * but at least this way it gets picked up by vmstat
> +	 */
> +	ret = io_schedule_timeout(timeout);
> +	finish_wait(zone->pressure_wq, &wait);
> +
> +	/* If woken early, check watermarks before continuing */
> +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> +		timeout = ret;
> +		goto wait_again;
> +	}

And then I don't know if we'd really need the extra check here. Might as
well just let the allocator try again and avoid the code?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 13:35     ` Nick Piggin
@ 2010-03-09 14:17       ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 14:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > Under heavy memory pressure, the page allocator may call congestion_wait()
> > to wait for IO congestion to clear or a timeout. This is not as sensible
> > a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> > even congested as the pressure could have been due to a large number of
> > SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> > 
> > At the point of congestion_wait(), the allocator is struggling to get the
> > pages it needs and it should back off. This patch puts the allocator to sleep
> > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > kswapd brings the zone over the low watermark, whichever happens first.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/mmzone.h |    3 ++
> >  mm/internal.h          |    4 +++
> >  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> >  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> >  mm/vmscan.c            |    2 +
> >  5 files changed, 101 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 30fe668..72465c1 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -398,6 +398,9 @@ struct zone {
> >  	unsigned long		wait_table_hash_nr_entries;
> >  	unsigned long		wait_table_bits;
> >  
> > +	/* queue for processes waiting for pressure to relieve */
> > +	wait_queue_head_t	*pressure_wq;
> 
> Hmm, processes may be eligible to allocate from > 1 zone, but you
> have them only waiting for one. I wonder if we shouldn't wait for
> more zones?
> 

It's waiting for the zone that is most desirable. If that zones watermarks
are met, why would it wait on any other zone? If you mean that it would
wait for any of the eligible zones to meet their watermark, it might have
an impact on NUMA locality but it could be managed. It might make sense to
wait on a node-based queue rather than a zone if this behaviour was desirable.

> Congestion waiting uses a global waitqueue, which hasn't seemed to
> cause a big scalability problem. Would it be better to have a global
> waitqueue for this too?
> 

Considering that the congestion wait queue is for a relatively slow operation,
it would be surprising if lock scalability was noticeable.  Potentially the
pressure_wq involves no IO so scalability may be noticeable there.

What would the advantages of a global waitqueue be? Obviously, a smaller
memory footprint. A second potential advantage is that on wakeup, it
could check the watermarks on multiple zones which might reduce
latencies in some cases. Can you think of more compelling reasons?

> 
> > +void check_zone_pressure(struct zone *zone)
> 
> I don't really like the name pressure. We use that term for the reclaim
> pressure wheras we're just checking watermarks here (actual pressure
> could be anything).
> 

pressure_wq => watermark_wq
check_zone_pressure => check_watermark_wq

?

> 
> > +{
> > +	/* If no process is waiting, nothing to do */
> > +	if (!waitqueue_active(zone->pressure_wq))
> > +		return;
> > +
> > +	/* Check if the high watermark is ok for order 0 */
> > +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > +		wake_up_interruptible(zone->pressure_wq);
> > +}
> 
> If you were to do this under the zone lock (in your subsequent patch),
> then it could avoid races. I would suggest doing it all as a single
> patch and not doing the pressure checks in reclaim at all.
> 

That is reasonable. I've already dropped the checks in reclaim because as you
say, if the free path check is cheap enough, it's also sufficient. Checking
in the reclaim paths as well is redundant.

I'll move the call to check_zone_pressure() within the zone lock to avoid
races.

> If you are missing anything, then that needs to be explained and fixed
> rather than just adding extra checks.
> 
> > +
> > +/**
> > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > + * @zone: The zone that is expected to be under pressure
> > + * @order: The order the caller is waiting on pages for
> > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > + *
> > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > + * the zone above the high watermark
> > + */
> > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > +{
> > +	long ret;
> > +	DEFINE_WAIT(wait);
> > +
> > +wait_again:
> > +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> 
> I guess to do it without races you need to check watermark here.
> And possibly some barriers if it is done without zone->lock.
> 

As watermark checks are already done without the zone->lock and without
barriers, why are they needed here? Yes, there are small races. For
example, it's possible to hit a window where pages were freed between
watermarks were checked and we went to sleep here but that is similar to
current behaviour.


> > +
> > +	/*
> > +	 * The use of io_schedule_timeout() here means that it gets
> > +	 * accounted for as IO waiting. This may or may not be the case
> > +	 * but at least this way it gets picked up by vmstat
> > +	 */
> > +	ret = io_schedule_timeout(timeout);
> > +	finish_wait(zone->pressure_wq, &wait);
> > +
> > +	/* If woken early, check watermarks before continuing */
> > +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > +		timeout = ret;
> > +		goto wait_again;
> > +	}
> 
> And then I don't know if we'd really need the extra check here. Might as
> well just let the allocator try again and avoid the code?
> 

I was considering multiple processes been woken up and racing with each
other. I can drop this check though. The worst that happens is multiple
processes wake and walk the full zonelist. Some will succeed and others
will go back to sleep.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 14:17       ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 14:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > Under heavy memory pressure, the page allocator may call congestion_wait()
> > to wait for IO congestion to clear or a timeout. This is not as sensible
> > a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> > even congested as the pressure could have been due to a large number of
> > SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> > 
> > At the point of congestion_wait(), the allocator is struggling to get the
> > pages it needs and it should back off. This patch puts the allocator to sleep
> > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > kswapd brings the zone over the low watermark, whichever happens first.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/mmzone.h |    3 ++
> >  mm/internal.h          |    4 +++
> >  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> >  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> >  mm/vmscan.c            |    2 +
> >  5 files changed, 101 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 30fe668..72465c1 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -398,6 +398,9 @@ struct zone {
> >  	unsigned long		wait_table_hash_nr_entries;
> >  	unsigned long		wait_table_bits;
> >  
> > +	/* queue for processes waiting for pressure to relieve */
> > +	wait_queue_head_t	*pressure_wq;
> 
> Hmm, processes may be eligible to allocate from > 1 zone, but you
> have them only waiting for one. I wonder if we shouldn't wait for
> more zones?
> 

It's waiting for the zone that is most desirable. If that zones watermarks
are met, why would it wait on any other zone? If you mean that it would
wait for any of the eligible zones to meet their watermark, it might have
an impact on NUMA locality but it could be managed. It might make sense to
wait on a node-based queue rather than a zone if this behaviour was desirable.

> Congestion waiting uses a global waitqueue, which hasn't seemed to
> cause a big scalability problem. Would it be better to have a global
> waitqueue for this too?
> 

Considering that the congestion wait queue is for a relatively slow operation,
it would be surprising if lock scalability was noticeable.  Potentially the
pressure_wq involves no IO so scalability may be noticeable there.

What would the advantages of a global waitqueue be? Obviously, a smaller
memory footprint. A second potential advantage is that on wakeup, it
could check the watermarks on multiple zones which might reduce
latencies in some cases. Can you think of more compelling reasons?

> 
> > +void check_zone_pressure(struct zone *zone)
> 
> I don't really like the name pressure. We use that term for the reclaim
> pressure wheras we're just checking watermarks here (actual pressure
> could be anything).
> 

pressure_wq => watermark_wq
check_zone_pressure => check_watermark_wq

?

> 
> > +{
> > +	/* If no process is waiting, nothing to do */
> > +	if (!waitqueue_active(zone->pressure_wq))
> > +		return;
> > +
> > +	/* Check if the high watermark is ok for order 0 */
> > +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > +		wake_up_interruptible(zone->pressure_wq);
> > +}
> 
> If you were to do this under the zone lock (in your subsequent patch),
> then it could avoid races. I would suggest doing it all as a single
> patch and not doing the pressure checks in reclaim at all.
> 

That is reasonable. I've already dropped the checks in reclaim because as you
say, if the free path check is cheap enough, it's also sufficient. Checking
in the reclaim paths as well is redundant.

I'll move the call to check_zone_pressure() within the zone lock to avoid
races.

> If you are missing anything, then that needs to be explained and fixed
> rather than just adding extra checks.
> 
> > +
> > +/**
> > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > + * @zone: The zone that is expected to be under pressure
> > + * @order: The order the caller is waiting on pages for
> > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > + *
> > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > + * the zone above the high watermark
> > + */
> > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > +{
> > +	long ret;
> > +	DEFINE_WAIT(wait);
> > +
> > +wait_again:
> > +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> 
> I guess to do it without races you need to check watermark here.
> And possibly some barriers if it is done without zone->lock.
> 

As watermark checks are already done without the zone->lock and without
barriers, why are they needed here? Yes, there are small races. For
example, it's possible to hit a window where pages were freed between
watermarks were checked and we went to sleep here but that is similar to
current behaviour.


> > +
> > +	/*
> > +	 * The use of io_schedule_timeout() here means that it gets
> > +	 * accounted for as IO waiting. This may or may not be the case
> > +	 * but at least this way it gets picked up by vmstat
> > +	 */
> > +	ret = io_schedule_timeout(timeout);
> > +	finish_wait(zone->pressure_wq, &wait);
> > +
> > +	/* If woken early, check watermarks before continuing */
> > +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > +		timeout = ret;
> > +		goto wait_again;
> > +	}
> 
> And then I don't know if we'd really need the extra check here. Might as
> well just let the allocator try again and avoid the code?
> 

I was considering multiple processes been woken up and racing with each
other. I can drop this check though. The worst that happens is multiple
processes wake and walk the full zonelist. Some will succeed and others
will go back to sleep.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 14:17       ` Mel Gorman
@ 2010-03-09 15:03         ` Nick Piggin
  -1 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 15:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> > On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > > Under heavy memory pressure, the page allocator may call congestion_wait()
> > > to wait for IO congestion to clear or a timeout. This is not as sensible
> > > a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> > > even congested as the pressure could have been due to a large number of
> > > SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> > > 
> > > At the point of congestion_wait(), the allocator is struggling to get the
> > > pages it needs and it should back off. This patch puts the allocator to sleep
> > > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > > kswapd brings the zone over the low watermark, whichever happens first.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  include/linux/mmzone.h |    3 ++
> > >  mm/internal.h          |    4 +++
> > >  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> > >  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> > >  mm/vmscan.c            |    2 +
> > >  5 files changed, 101 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 30fe668..72465c1 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -398,6 +398,9 @@ struct zone {
> > >  	unsigned long		wait_table_hash_nr_entries;
> > >  	unsigned long		wait_table_bits;
> > >  
> > > +	/* queue for processes waiting for pressure to relieve */
> > > +	wait_queue_head_t	*pressure_wq;
> > 
> > Hmm, processes may be eligible to allocate from > 1 zone, but you
> > have them only waiting for one. I wonder if we shouldn't wait for
> > more zones?
> > 
> 
> It's waiting for the zone that is most desirable. If that zones watermarks
> are met, why would it wait on any other zone?

I mean the other way around. If that zone's watermarks are not met, then
why shouldn't it be woken up by other zones reaching their watermarks.


> If you mean that it would
> wait for any of the eligible zones to meet their watermark, it might have
> an impact on NUMA locality but it could be managed. It might make sense to
> wait on a node-based queue rather than a zone if this behaviour was desirable.
> 
> > Congestion waiting uses a global waitqueue, which hasn't seemed to
> > cause a big scalability problem. Would it be better to have a global
> > waitqueue for this too?
> > 
> 
> Considering that the congestion wait queue is for a relatively slow operation,
> it would be surprising if lock scalability was noticeable.  Potentially the
> pressure_wq involves no IO so scalability may be noticeable there.
> 
> What would the advantages of a global waitqueue be? Obviously, a smaller
> memory footprint. A second potential advantage is that on wakeup, it
> could check the watermarks on multiple zones which might reduce
> latencies in some cases. Can you think of more compelling reasons?

Your 2nd advantage is what I mean above.


> > 
> > > +void check_zone_pressure(struct zone *zone)
> > 
> > I don't really like the name pressure. We use that term for the reclaim
> > pressure wheras we're just checking watermarks here (actual pressure
> > could be anything).
> > 
> 
> pressure_wq => watermark_wq
> check_zone_pressure => check_watermark_wq
> 
> ?

Thanks.

> 
> > 
> > > +{
> > > +	/* If no process is waiting, nothing to do */
> > > +	if (!waitqueue_active(zone->pressure_wq))
> > > +		return;
> > > +
> > > +	/* Check if the high watermark is ok for order 0 */
> > > +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > > +		wake_up_interruptible(zone->pressure_wq);
> > > +}
> > 
> > If you were to do this under the zone lock (in your subsequent patch),
> > then it could avoid races. I would suggest doing it all as a single
> > patch and not doing the pressure checks in reclaim at all.
> > 
> 
> That is reasonable. I've already dropped the checks in reclaim because as you
> say, if the free path check is cheap enough, it's also sufficient. Checking
> in the reclaim paths as well is redundant.
> 
> I'll move the call to check_zone_pressure() within the zone lock to avoid
> races.
> 
> > If you are missing anything, then that needs to be explained and fixed
> > rather than just adding extra checks.
> > 
> > > +
> > > +/**
> > > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > > + * @zone: The zone that is expected to be under pressure
> > > + * @order: The order the caller is waiting on pages for
> > > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > > + *
> > > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > > + * the zone above the high watermark
> > > + */
> > > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > > +{
> > > +	long ret;
> > > +	DEFINE_WAIT(wait);
> > > +
> > > +wait_again:
> > > +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> > 
> > I guess to do it without races you need to check watermark here.
> > And possibly some barriers if it is done without zone->lock.
> > 
> 
> As watermark checks are already done without the zone->lock and without
> barriers, why are they needed here? Yes, there are small races. For
> example, it's possible to hit a window where pages were freed between
> watermarks were checked and we went to sleep here but that is similar to
> current behaviour.

Well with the check in free_pages_bulk then doing another check here
before the wait should be able to close all lost-wakeup races. I agree
it is pretty fuzzy heuristics anyway, so these races don't *really*
matter a lot. But it seems easy to close the races, so I don't see
why not.


> > > +
> > > +	/*
> > > +	 * The use of io_schedule_timeout() here means that it gets
> > > +	 * accounted for as IO waiting. This may or may not be the case
> > > +	 * but at least this way it gets picked up by vmstat
> > > +	 */
> > > +	ret = io_schedule_timeout(timeout);
> > > +	finish_wait(zone->pressure_wq, &wait);
> > > +
> > > +	/* If woken early, check watermarks before continuing */
> > > +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > > +		timeout = ret;
> > > +		goto wait_again;
> > > +	}
> > 
> > And then I don't know if we'd really need the extra check here. Might as
> > well just let the allocator try again and avoid the code?
> > 
> 
> I was considering multiple processes been woken up and racing with each
> other. I can drop this check though. The worst that happens is multiple
> processes wake and walk the full zonelist. Some will succeed and others
> will go back to sleep.

Yep. And it doesn't really solve that race either becuase the zone
might subsequently go below the watermark.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 15:03         ` Nick Piggin
  0 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-09 15:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> > On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > > Under heavy memory pressure, the page allocator may call congestion_wait()
> > > to wait for IO congestion to clear or a timeout. This is not as sensible
> > > a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> > > even congested as the pressure could have been due to a large number of
> > > SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> > > 
> > > At the point of congestion_wait(), the allocator is struggling to get the
> > > pages it needs and it should back off. This patch puts the allocator to sleep
> > > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > > kswapd brings the zone over the low watermark, whichever happens first.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  include/linux/mmzone.h |    3 ++
> > >  mm/internal.h          |    4 +++
> > >  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> > >  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> > >  mm/vmscan.c            |    2 +
> > >  5 files changed, 101 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 30fe668..72465c1 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -398,6 +398,9 @@ struct zone {
> > >  	unsigned long		wait_table_hash_nr_entries;
> > >  	unsigned long		wait_table_bits;
> > >  
> > > +	/* queue for processes waiting for pressure to relieve */
> > > +	wait_queue_head_t	*pressure_wq;
> > 
> > Hmm, processes may be eligible to allocate from > 1 zone, but you
> > have them only waiting for one. I wonder if we shouldn't wait for
> > more zones?
> > 
> 
> It's waiting for the zone that is most desirable. If that zones watermarks
> are met, why would it wait on any other zone?

I mean the other way around. If that zone's watermarks are not met, then
why shouldn't it be woken up by other zones reaching their watermarks.


> If you mean that it would
> wait for any of the eligible zones to meet their watermark, it might have
> an impact on NUMA locality but it could be managed. It might make sense to
> wait on a node-based queue rather than a zone if this behaviour was desirable.
> 
> > Congestion waiting uses a global waitqueue, which hasn't seemed to
> > cause a big scalability problem. Would it be better to have a global
> > waitqueue for this too?
> > 
> 
> Considering that the congestion wait queue is for a relatively slow operation,
> it would be surprising if lock scalability was noticeable.  Potentially the
> pressure_wq involves no IO so scalability may be noticeable there.
> 
> What would the advantages of a global waitqueue be? Obviously, a smaller
> memory footprint. A second potential advantage is that on wakeup, it
> could check the watermarks on multiple zones which might reduce
> latencies in some cases. Can you think of more compelling reasons?

Your 2nd advantage is what I mean above.


> > 
> > > +void check_zone_pressure(struct zone *zone)
> > 
> > I don't really like the name pressure. We use that term for the reclaim
> > pressure wheras we're just checking watermarks here (actual pressure
> > could be anything).
> > 
> 
> pressure_wq => watermark_wq
> check_zone_pressure => check_watermark_wq
> 
> ?

Thanks.

> 
> > 
> > > +{
> > > +	/* If no process is waiting, nothing to do */
> > > +	if (!waitqueue_active(zone->pressure_wq))
> > > +		return;
> > > +
> > > +	/* Check if the high watermark is ok for order 0 */
> > > +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > > +		wake_up_interruptible(zone->pressure_wq);
> > > +}
> > 
> > If you were to do this under the zone lock (in your subsequent patch),
> > then it could avoid races. I would suggest doing it all as a single
> > patch and not doing the pressure checks in reclaim at all.
> > 
> 
> That is reasonable. I've already dropped the checks in reclaim because as you
> say, if the free path check is cheap enough, it's also sufficient. Checking
> in the reclaim paths as well is redundant.
> 
> I'll move the call to check_zone_pressure() within the zone lock to avoid
> races.
> 
> > If you are missing anything, then that needs to be explained and fixed
> > rather than just adding extra checks.
> > 
> > > +
> > > +/**
> > > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > > + * @zone: The zone that is expected to be under pressure
> > > + * @order: The order the caller is waiting on pages for
> > > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > > + *
> > > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > > + * the zone above the high watermark
> > > + */
> > > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > > +{
> > > +	long ret;
> > > +	DEFINE_WAIT(wait);
> > > +
> > > +wait_again:
> > > +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> > 
> > I guess to do it without races you need to check watermark here.
> > And possibly some barriers if it is done without zone->lock.
> > 
> 
> As watermark checks are already done without the zone->lock and without
> barriers, why are they needed here? Yes, there are small races. For
> example, it's possible to hit a window where pages were freed between
> watermarks were checked and we went to sleep here but that is similar to
> current behaviour.

Well with the check in free_pages_bulk then doing another check here
before the wait should be able to close all lost-wakeup races. I agree
it is pretty fuzzy heuristics anyway, so these races don't *really*
matter a lot. But it seems easy to close the races, so I don't see
why not.


> > > +
> > > +	/*
> > > +	 * The use of io_schedule_timeout() here means that it gets
> > > +	 * accounted for as IO waiting. This may or may not be the case
> > > +	 * but at least this way it gets picked up by vmstat
> > > +	 */
> > > +	ret = io_schedule_timeout(timeout);
> > > +	finish_wait(zone->pressure_wq, &wait);
> > > +
> > > +	/* If woken early, check watermarks before continuing */
> > > +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > > +		timeout = ret;
> > > +		goto wait_again;
> > > +	}
> > 
> > And then I don't know if we'd really need the extra check here. Might as
> > well just let the allocator try again and avoid the code?
> > 
> 
> I was considering multiple processes been woken up and racing with each
> other. I can drop this check though. The worst that happens is multiple
> processes wake and walk the full zonelist. Some will succeed and others
> will go back to sleep.

Yep. And it doesn't really solve that race either becuase the zone
might subsequently go below the watermark.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 15:03         ` Nick Piggin
@ 2010-03-09 15:42           ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-09 15:42 UTC (permalink / raw)
  To: Nick Piggin, Mel Gorman; +Cc: linux-mm, Chris Mason, Jens Axboe, linux-kernel



Nick Piggin wrote:
> On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
>> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
>>> On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
>>>> Under heavy memory pressure, the page allocator may call congestion_wait()
>>>> to wait for IO congestion to clear or a timeout. This is not as sensible
>>>> a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
>>>> even congested as the pressure could have been due to a large number of
>>>> SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
>>>>
>>>> At the point of congestion_wait(), the allocator is struggling to get the
>>>> pages it needs and it should back off. This patch puts the allocator to sleep
>>>> on a zone->pressure_wq for either a timeout or until a direct reclaimer or
>>>> kswapd brings the zone over the low watermark, whichever happens first.
>>>>
>>>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>>>> ---
>>>>  include/linux/mmzone.h |    3 ++
>>>>  mm/internal.h          |    4 +++
>>>>  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
>>>>  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
>>>>  mm/vmscan.c            |    2 +
>>>>  5 files changed, 101 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index 30fe668..72465c1 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
[...]
>>>> +{
>>>> +	/* If no process is waiting, nothing to do */
>>>> +	if (!waitqueue_active(zone->pressure_wq))
>>>> +		return;
>>>> +
>>>> +	/* Check if the high watermark is ok for order 0 */
>>>> +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
>>>> +		wake_up_interruptible(zone->pressure_wq);
>>>> +}
>>> If you were to do this under the zone lock (in your subsequent patch),
>>> then it could avoid races. I would suggest doing it all as a single
>>> patch and not doing the pressure checks in reclaim at all.
>>>
>> That is reasonable. I've already dropped the checks in reclaim because as you
>> say, if the free path check is cheap enough, it's also sufficient. Checking
>> in the reclaim paths as well is redundant.
>>
>> I'll move the call to check_zone_pressure() within the zone lock to avoid
>> races.
>>

Mel, we talked about a thundering herd issue that might come up here in 
very constraint cases.
So wherever you end up putting that wake_up call, how about being extra 
paranoid about a thundering herd flagging them WQ_FLAG_EXCLUSIVE and 
waking them with something like that:

wake_up_interruptible_nr(zone->pressure_wq, #nrofpagesabovewatermark#);

That should be an easy to calculate sane max of waiters to wake up.
On the other hand it might be over-engineered and it implies the need to 
reconsider when it would be best to wake up the rest.

Get me right - I don't really have a hard requirement or need for that, 
I just wanted to mention it early on to hear your opinions about it.

looking forward to test the v2 patch series, adapted to all the good 
stuff already discussed.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 15:42           ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-09 15:42 UTC (permalink / raw)
  To: Nick Piggin, Mel Gorman; +Cc: linux-mm, Chris Mason, Jens Axboe, linux-kernel



Nick Piggin wrote:
> On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
>> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
>>> On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
>>>> Under heavy memory pressure, the page allocator may call congestion_wait()
>>>> to wait for IO congestion to clear or a timeout. This is not as sensible
>>>> a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
>>>> even congested as the pressure could have been due to a large number of
>>>> SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
>>>>
>>>> At the point of congestion_wait(), the allocator is struggling to get the
>>>> pages it needs and it should back off. This patch puts the allocator to sleep
>>>> on a zone->pressure_wq for either a timeout or until a direct reclaimer or
>>>> kswapd brings the zone over the low watermark, whichever happens first.
>>>>
>>>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>>>> ---
>>>>  include/linux/mmzone.h |    3 ++
>>>>  mm/internal.h          |    4 +++
>>>>  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
>>>>  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
>>>>  mm/vmscan.c            |    2 +
>>>>  5 files changed, 101 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index 30fe668..72465c1 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
[...]
>>>> +{
>>>> +	/* If no process is waiting, nothing to do */
>>>> +	if (!waitqueue_active(zone->pressure_wq))
>>>> +		return;
>>>> +
>>>> +	/* Check if the high watermark is ok for order 0 */
>>>> +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
>>>> +		wake_up_interruptible(zone->pressure_wq);
>>>> +}
>>> If you were to do this under the zone lock (in your subsequent patch),
>>> then it could avoid races. I would suggest doing it all as a single
>>> patch and not doing the pressure checks in reclaim at all.
>>>
>> That is reasonable. I've already dropped the checks in reclaim because as you
>> say, if the free path check is cheap enough, it's also sufficient. Checking
>> in the reclaim paths as well is redundant.
>>
>> I'll move the call to check_zone_pressure() within the zone lock to avoid
>> races.
>>

Mel, we talked about a thundering herd issue that might come up here in 
very constraint cases.
So wherever you end up putting that wake_up call, how about being extra 
paranoid about a thundering herd flagging them WQ_FLAG_EXCLUSIVE and 
waking them with something like that:

wake_up_interruptible_nr(zone->pressure_wq, #nrofpagesabovewatermark#);

That should be an easy to calculate sane max of waiters to wake up.
On the other hand it might be over-engineered and it implies the need to 
reconsider when it would be best to wake up the rest.

Get me right - I don't really have a hard requirement or need for that, 
I just wanted to mention it early on to hear your opinions about it.

looking forward to test the v2 patch series, adapted to all the good 
stuff already discussed.

-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-08 11:48   ` Mel Gorman
@ 2010-03-09 15:50     ` Christoph Lameter
  -1 siblings, 0 replies; 136+ messages in thread
From: Christoph Lameter @ 2010-03-09 15:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Nick Piggin, Christian Ehrhardt, Chris Mason,
	Jens Axboe, linux-kernel

On Mon, 8 Mar 2010, Mel Gorman wrote:

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..72465c1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -398,6 +398,9 @@ struct zone {
>  	unsigned long		wait_table_hash_nr_entries;
>  	unsigned long		wait_table_bits;
>
> +	/* queue for processes waiting for pressure to relieve */
> +	wait_queue_head_t	*pressure_wq;
> +
>  	/*

The waitqueue is in a zone? But allocation occurs by scanning a
list of possible zones.

> +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)

So zone specific.

>
> -		if (!page && gfp_mask & __GFP_NOFAIL)
> -			congestion_wait(BLK_RW_ASYNC, HZ/50);
> +		if (!page && gfp_mask & __GFP_NOFAIL) {
> +			/* If still failing, wait for pressure on zone to relieve */
> +			zonepressure_wait(preferred_zone, order, HZ/50);

The first zone is special therefore...

What happens if memory becomes available in another zone? Lets say we are
waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 15:50     ` Christoph Lameter
  0 siblings, 0 replies; 136+ messages in thread
From: Christoph Lameter @ 2010-03-09 15:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Nick Piggin, Christian Ehrhardt, Chris Mason,
	Jens Axboe, linux-kernel

On Mon, 8 Mar 2010, Mel Gorman wrote:

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..72465c1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -398,6 +398,9 @@ struct zone {
>  	unsigned long		wait_table_hash_nr_entries;
>  	unsigned long		wait_table_bits;
>
> +	/* queue for processes waiting for pressure to relieve */
> +	wait_queue_head_t	*pressure_wq;
> +
>  	/*

The waitqueue is in a zone? But allocation occurs by scanning a
list of possible zones.

> +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)

So zone specific.

>
> -		if (!page && gfp_mask & __GFP_NOFAIL)
> -			congestion_wait(BLK_RW_ASYNC, HZ/50);
> +		if (!page && gfp_mask & __GFP_NOFAIL) {
> +			/* If still failing, wait for pressure on zone to relieve */
> +			zonepressure_wait(preferred_zone, order, HZ/50);

The first zone is special therefore...

What happens if memory becomes available in another zone? Lets say we are
waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 15:50     ` Christoph Lameter
@ 2010-03-09 15:56       ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-09 15:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel



Christoph Lameter wrote:
> On Mon, 8 Mar 2010, Mel Gorman wrote:
> 
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 30fe668..72465c1 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -398,6 +398,9 @@ struct zone {
>>  	unsigned long		wait_table_hash_nr_entries;
>>  	unsigned long		wait_table_bits;
>>
>> +	/* queue for processes waiting for pressure to relieve */
>> +	wait_queue_head_t	*pressure_wq;
>> +
>>  	/*
> 
> The waitqueue is in a zone? But allocation occurs by scanning a
> list of possible zones.
> 
>> +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> 
> So zone specific.
> 
>> -		if (!page && gfp_mask & __GFP_NOFAIL)
>> -			congestion_wait(BLK_RW_ASYNC, HZ/50);
>> +		if (!page && gfp_mask & __GFP_NOFAIL) {
>> +			/* If still failing, wait for pressure on zone to relieve */
>> +			zonepressure_wait(preferred_zone, order, HZ/50);
> 
> The first zone is special therefore...
> 
> What happens if memory becomes available in another zone? Lets say we are
> waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?

Do you mean the same as Nick asked or another aspect of it?
citation:
"I mean the other way around. If that zone's watermarks are not met, 
then why shouldn't it be woken up by other zones reaching their watermarks."


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 15:56       ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-09 15:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel



Christoph Lameter wrote:
> On Mon, 8 Mar 2010, Mel Gorman wrote:
> 
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 30fe668..72465c1 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -398,6 +398,9 @@ struct zone {
>>  	unsigned long		wait_table_hash_nr_entries;
>>  	unsigned long		wait_table_bits;
>>
>> +	/* queue for processes waiting for pressure to relieve */
>> +	wait_queue_head_t	*pressure_wq;
>> +
>>  	/*
> 
> The waitqueue is in a zone? But allocation occurs by scanning a
> list of possible zones.
> 
>> +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> 
> So zone specific.
> 
>> -		if (!page && gfp_mask & __GFP_NOFAIL)
>> -			congestion_wait(BLK_RW_ASYNC, HZ/50);
>> +		if (!page && gfp_mask & __GFP_NOFAIL) {
>> +			/* If still failing, wait for pressure on zone to relieve */
>> +			zonepressure_wait(preferred_zone, order, HZ/50);
> 
> The first zone is special therefore...
> 
> What happens if memory becomes available in another zone? Lets say we are
> waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?

Do you mean the same as Nick asked or another aspect of it?
citation:
"I mean the other way around. If that zone's watermarks are not met, 
then why shouldn't it be woken up by other zones reaching their watermarks."


-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 15:56       ` Christian Ehrhardt
@ 2010-03-09 16:09         ` Christoph Lameter
  -1 siblings, 0 replies; 136+ messages in thread
From: Christoph Lameter @ 2010-03-09 16:09 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel

On Tue, 9 Mar 2010, Christian Ehrhardt wrote:

> > What happens if memory becomes available in another zone? Lets say we are
> > waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?
>
> Do you mean the same as Nick asked or another aspect of it?
> citation:
> "I mean the other way around. If that zone's watermarks are not met, then why
> shouldn't it be woken up by other zones reaching their watermarks."

Just saw that exchange. Yes it is similar. Mel only thought about NUMA
but the situation can also occur in !NUMA because multiple zones do not
require NUMA.

If a process goes to sleep on an allocation that has a preferred zone of
HIGHMEM then other processors may free up memory in ZONE_DMA and
ZONE_NORMAL and therefore memory may become available but the process will
continue to sleep.

The wait structure needs to be placed in the pgdat structure to make it
node specific.

But then an overallocated node may stall processes. If that node is full
of unreclaimable memory then the process may never wake up?




^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 16:09         ` Christoph Lameter
  0 siblings, 0 replies; 136+ messages in thread
From: Christoph Lameter @ 2010-03-09 16:09 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel

On Tue, 9 Mar 2010, Christian Ehrhardt wrote:

> > What happens if memory becomes available in another zone? Lets say we are
> > waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?
>
> Do you mean the same as Nick asked or another aspect of it?
> citation:
> "I mean the other way around. If that zone's watermarks are not met, then why
> shouldn't it be woken up by other zones reaching their watermarks."

Just saw that exchange. Yes it is similar. Mel only thought about NUMA
but the situation can also occur in !NUMA because multiple zones do not
require NUMA.

If a process goes to sleep on an allocation that has a preferred zone of
HIGHMEM then other processors may free up memory in ZONE_DMA and
ZONE_NORMAL and therefore memory may become available but the process will
continue to sleep.

The wait structure needs to be placed in the pgdat structure to make it
node specific.

But then an overallocated node may stall processes. If that node is full
of unreclaimable memory then the process may never wake up?



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 16:09         ` Christoph Lameter
@ 2010-03-09 17:01           ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 17:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:09:11AM -0600, Christoph Lameter wrote:
> On Tue, 9 Mar 2010, Christian Ehrhardt wrote:
> 
> > > What happens if memory becomes available in another zone? Lets say we are
> > > waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?
> >
> > Do you mean the same as Nick asked or another aspect of it?
> > citation:
> > "I mean the other way around. If that zone's watermarks are not met, then why
> > shouldn't it be woken up by other zones reaching their watermarks."
> 
> Just saw that exchange. Yes it is similar. Mel only thought about NUMA
> but the situation can also occur in !NUMA because multiple zones do not
> require NUMA.
> 

True, although rare. Elsewhere I suggested that the wait could be on a
per-node basis instead of per-zone. My main concern there would be
adding a new hot cache line in the page free path or an unfortunate mix
of zone and node logic. I'm not fully convinced it's worth it but will
check it out.

> If a process goes to sleep on an allocation that has a preferred zone of
> HIGHMEM then other processors may free up memory in ZONE_DMA and
> ZONE_NORMAL and therefore memory may become available but the process will
> continue to sleep.
> 

Until it's timeout at least. It's still better than the current
situation of sleeping on congestion.

The ideal would be waiting on a per-node basis. I'm just not liking having
to look up the node structure when freeing a patch of pages and making a
cache line in there unnecessarily hot.

> The wait structure needs to be placed in the pgdat structure to make it
> node specific.
> 
> But then an overallocated node may stall processes. If that node is full
> of unreclaimable memory then the process may never wake up?
> 

Processes wake after a timeout.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 17:01           ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 17:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 10:09:11AM -0600, Christoph Lameter wrote:
> On Tue, 9 Mar 2010, Christian Ehrhardt wrote:
> 
> > > What happens if memory becomes available in another zone? Lets say we are
> > > waiting on HIGHMEM and memory in ZONE_NORMAL becomes available?
> >
> > Do you mean the same as Nick asked or another aspect of it?
> > citation:
> > "I mean the other way around. If that zone's watermarks are not met, then why
> > shouldn't it be woken up by other zones reaching their watermarks."
> 
> Just saw that exchange. Yes it is similar. Mel only thought about NUMA
> but the situation can also occur in !NUMA because multiple zones do not
> require NUMA.
> 

True, although rare. Elsewhere I suggested that the wait could be on a
per-node basis instead of per-zone. My main concern there would be
adding a new hot cache line in the page free path or an unfortunate mix
of zone and node logic. I'm not fully convinced it's worth it but will
check it out.

> If a process goes to sleep on an allocation that has a preferred zone of
> HIGHMEM then other processors may free up memory in ZONE_DMA and
> ZONE_NORMAL and therefore memory may become available but the process will
> continue to sleep.
> 

Until it's timeout at least. It's still better than the current
situation of sleeping on congestion.

The ideal would be waiting on a per-node basis. I'm just not liking having
to look up the node structure when freeing a patch of pages and making a
cache line in there unnecessarily hot.

> The wait structure needs to be placed in the pgdat structure to make it
> node specific.
> 
> But then an overallocated node may stall processes. If that node is full
> of unreclaimable memory then the process may never wake up?
> 

Processes wake after a timeout.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 17:01           ` Mel Gorman
@ 2010-03-09 17:11             ` Christoph Lameter
  -1 siblings, 0 replies; 136+ messages in thread
From: Christoph Lameter @ 2010-03-09 17:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Tue, 9 Mar 2010, Mel Gorman wrote:

> Until it's timeout at least. It's still better than the current
> situation of sleeping on congestion.

Congestion may clear if memory becomes available in other zones.

> The ideal would be waiting on a per-node basis. I'm just not liking having
> to look up the node structure when freeing a patch of pages and making a
> cache line in there unnecessarily hot.

The node structure (pgdat) contains the zone structures. If you know the
type of zone then you can calculate the pgdat address.

> > But then an overallocated node may stall processes. If that node is full
> > of unreclaimable memory then the process may never wake up?
> Processes wake after a timeout.

Ok that limits it but still we may be waiting for no reason.


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 17:11             ` Christoph Lameter
  0 siblings, 0 replies; 136+ messages in thread
From: Christoph Lameter @ 2010-03-09 17:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Tue, 9 Mar 2010, Mel Gorman wrote:

> Until it's timeout at least. It's still better than the current
> situation of sleeping on congestion.

Congestion may clear if memory becomes available in other zones.

> The ideal would be waiting on a per-node basis. I'm just not liking having
> to look up the node structure when freeing a patch of pages and making a
> cache line in there unnecessarily hot.

The node structure (pgdat) contains the zone structures. If you know the
type of zone then you can calculate the pgdat address.

> > But then an overallocated node may stall processes. If that node is full
> > of unreclaimable memory then the process may never wake up?
> Processes wake after a timeout.

Ok that limits it but still we may be waiting for no reason.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 17:11             ` Christoph Lameter
@ 2010-03-09 17:30               ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 17:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 11:11:55AM -0600, Christoph Lameter wrote:
> On Tue, 9 Mar 2010, Mel Gorman wrote:
> 
> > Until it's timeout at least. It's still better than the current
> > situation of sleeping on congestion.
> 
> Congestion may clear if memory becomes available in other zones.
> 

I understand that.

> > The ideal would be waiting on a per-node basis. I'm just not liking having
> > to look up the node structure when freeing a patch of pages and making a
> > cache line in there unnecessarily hot.
> 
> The node structure (pgdat) contains the zone structures. If you know the
> type of zone then you can calculate the pgdat address.
> 

I know you can lookup the pgdat from the zone structure. The concern is that
the suggestion requires adding fields to the node structure that then become
hot in the free_page path when the per-cpu lists are being drained. This patch
also adds a hot cache line to the zone but at least it can be eliminated by
using zone->flags. The same optimisation does not apply to working on a
per-node basis.

Adding such a hot line is a big minus and the gain is that processes may
wake up slightly faster when under memory pressure. It's not a good trade-off.

> > > But then an overallocated node may stall processes. If that node is full
> > > of unreclaimable memory then the process may never wake up?
> >
> > Processes wake after a timeout.
> 
> Ok that limits it but still we may be waiting for no reason.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 17:30               ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 17:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 11:11:55AM -0600, Christoph Lameter wrote:
> On Tue, 9 Mar 2010, Mel Gorman wrote:
> 
> > Until it's timeout at least. It's still better than the current
> > situation of sleeping on congestion.
> 
> Congestion may clear if memory becomes available in other zones.
> 

I understand that.

> > The ideal would be waiting on a per-node basis. I'm just not liking having
> > to look up the node structure when freeing a patch of pages and making a
> > cache line in there unnecessarily hot.
> 
> The node structure (pgdat) contains the zone structures. If you know the
> type of zone then you can calculate the pgdat address.
> 

I know you can lookup the pgdat from the zone structure. The concern is that
the suggestion requires adding fields to the node structure that then become
hot in the free_page path when the per-cpu lists are being drained. This patch
also adds a hot cache line to the zone but at least it can be eliminated by
using zone->flags. The same optimisation does not apply to working on a
per-node basis.

Adding such a hot line is a big minus and the gain is that processes may
wake up slightly faster when under memory pressure. It's not a good trade-off.

> > > But then an overallocated node may stall processes. If that node is full
> > > of unreclaimable memory then the process may never wake up?
> >
> > Processes wake after a timeout.
> 
> Ok that limits it but still we may be waiting for no reason.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 15:03         ` Nick Piggin
@ 2010-03-09 17:35           ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 17:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Wed, Mar 10, 2010 at 02:03:32AM +1100, Nick Piggin wrote:
> On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
> > On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> > > On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > > > Under heavy memory pressure, the page allocator may call congestion_wait()
> > > > to wait for IO congestion to clear or a timeout. This is not as sensible
> > > > a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> > > > even congested as the pressure could have been due to a large number of
> > > > SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> > > > 
> > > > At the point of congestion_wait(), the allocator is struggling to get the
> > > > pages it needs and it should back off. This patch puts the allocator to sleep
> > > > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > > > kswapd brings the zone over the low watermark, whichever happens first.
> > > > 
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > >  include/linux/mmzone.h |    3 ++
> > > >  mm/internal.h          |    4 +++
> > > >  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> > > >  mm/vmscan.c            |    2 +
> > > >  5 files changed, 101 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 30fe668..72465c1 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -398,6 +398,9 @@ struct zone {
> > > >  	unsigned long		wait_table_hash_nr_entries;
> > > >  	unsigned long		wait_table_bits;
> > > >  
> > > > +	/* queue for processes waiting for pressure to relieve */
> > > > +	wait_queue_head_t	*pressure_wq;
> > > 
> > > Hmm, processes may be eligible to allocate from > 1 zone, but you
> > > have them only waiting for one. I wonder if we shouldn't wait for
> > > more zones?
> > > 
> > 
> > It's waiting for the zone that is most desirable. If that zones watermarks
> > are met, why would it wait on any other zone?
> 
> I mean the other way around. If that zone's watermarks are not met, then
> why shouldn't it be woken up by other zones reaching their watermarks.
> 

Doing it requires moving to a per-node structure or a global queue. I'd rather
not add hot lines to the node structure (and the associated lookup cost in
the free path) if I can help it. A global queue would work on smaller machines
but I'd be worried about thundering herd problems on larger machines. I know
congestion_wait is already a global queue but IO is a relatively slow event.
Potentially the wakeups from this queue are a lot faster.

Should I just move to a global queue as a starting point and see what
problems are caused later?

> > If you mean that it would
> > wait for any of the eligible zones to meet their watermark, it might have
> > an impact on NUMA locality but it could be managed. It might make sense to
> > wait on a node-based queue rather than a zone if this behaviour was desirable.
> > 
> > > Congestion waiting uses a global waitqueue, which hasn't seemed to
> > > cause a big scalability problem. Would it be better to have a global
> > > waitqueue for this too?
> > > 
> > 
> > Considering that the congestion wait queue is for a relatively slow operation,
> > it would be surprising if lock scalability was noticeable.  Potentially the
> > pressure_wq involves no IO so scalability may be noticeable there.
> > 
> > What would the advantages of a global waitqueue be? Obviously, a smaller
> > memory footprint. A second potential advantage is that on wakeup, it
> > could check the watermarks on multiple zones which might reduce
> > latencies in some cases. Can you think of more compelling reasons?
> 
> Your 2nd advantage is what I mean above.
> 
> 
> > > 
> > > > +void check_zone_pressure(struct zone *zone)
> > > 
> > > I don't really like the name pressure. We use that term for the reclaim
> > > pressure wheras we're just checking watermarks here (actual pressure
> > > could be anything).
> > > 
> > 
> > pressure_wq => watermark_wq
> > check_zone_pressure => check_watermark_wq
> > 
> > ?
> 
> Thanks.
> 
> > 
> > > 
> > > > +{
> > > > +	/* If no process is waiting, nothing to do */
> > > > +	if (!waitqueue_active(zone->pressure_wq))
> > > > +		return;
> > > > +
> > > > +	/* Check if the high watermark is ok for order 0 */
> > > > +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > > > +		wake_up_interruptible(zone->pressure_wq);
> > > > +}
> > > 
> > > If you were to do this under the zone lock (in your subsequent patch),
> > > then it could avoid races. I would suggest doing it all as a single
> > > patch and not doing the pressure checks in reclaim at all.
> > > 
> > 
> > That is reasonable. I've already dropped the checks in reclaim because as you
> > say, if the free path check is cheap enough, it's also sufficient. Checking
> > in the reclaim paths as well is redundant.
> > 
> > I'll move the call to check_zone_pressure() within the zone lock to avoid
> > races.
> > 
> > > If you are missing anything, then that needs to be explained and fixed
> > > rather than just adding extra checks.
> > > 
> > > > +
> > > > +/**
> > > > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > > > + * @zone: The zone that is expected to be under pressure
> > > > + * @order: The order the caller is waiting on pages for
> > > > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > > > + *
> > > > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > > > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > > > + * the zone above the high watermark
> > > > + */
> > > > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > > > +{
> > > > +	long ret;
> > > > +	DEFINE_WAIT(wait);
> > > > +
> > > > +wait_again:
> > > > +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> > > 
> > > I guess to do it without races you need to check watermark here.
> > > And possibly some barriers if it is done without zone->lock.
> > > 
> > 
> > As watermark checks are already done without the zone->lock and without
> > barriers, why are they needed here? Yes, there are small races. For
> > example, it's possible to hit a window where pages were freed between
> > watermarks were checked and we went to sleep here but that is similar to
> > current behaviour.
> 
> Well with the check in free_pages_bulk then doing another check here
> before the wait should be able to close all lost-wakeup races. I agree
> it is pretty fuzzy heuristics anyway, so these races don't *really*
> matter a lot. But it seems easy to close the races, so I don't see
> why not.
> 

I agree that the window is unnecessarily large. I'll tighten it.,

> 
> > > > +
> > > > +	/*
> > > > +	 * The use of io_schedule_timeout() here means that it gets
> > > > +	 * accounted for as IO waiting. This may or may not be the case
> > > > +	 * but at least this way it gets picked up by vmstat
> > > > +	 */
> > > > +	ret = io_schedule_timeout(timeout);
> > > > +	finish_wait(zone->pressure_wq, &wait);
> > > > +
> > > > +	/* If woken early, check watermarks before continuing */
> > > > +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > > > +		timeout = ret;
> > > > +		goto wait_again;
> > > > +	}
> > > 
> > > And then I don't know if we'd really need the extra check here. Might as
> > > well just let the allocator try again and avoid the code?
> > > 
> > 
> > I was considering multiple processes been woken up and racing with each
> > other. I can drop this check though. The worst that happens is multiple
> > processes wake and walk the full zonelist. Some will succeed and others
> > will go back to sleep.
> 
> Yep. And it doesn't really solve that race either becuase the zone
> might subsequently go below the watermark.
> 

True. In theory, the same sort of races currently apply with
congestion_wait() but that's just an excuse. There is a strong
possibility we could behave better with respect to watermarks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 17:35           ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 17:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Wed, Mar 10, 2010 at 02:03:32AM +1100, Nick Piggin wrote:
> On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
> > On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> > > On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > > > Under heavy memory pressure, the page allocator may call congestion_wait()
> > > > to wait for IO congestion to clear or a timeout. This is not as sensible
> > > > a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> > > > even congested as the pressure could have been due to a large number of
> > > > SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> > > > 
> > > > At the point of congestion_wait(), the allocator is struggling to get the
> > > > pages it needs and it should back off. This patch puts the allocator to sleep
> > > > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > > > kswapd brings the zone over the low watermark, whichever happens first.
> > > > 
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > >  include/linux/mmzone.h |    3 ++
> > > >  mm/internal.h          |    4 +++
> > > >  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> > > >  mm/vmscan.c            |    2 +
> > > >  5 files changed, 101 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 30fe668..72465c1 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -398,6 +398,9 @@ struct zone {
> > > >  	unsigned long		wait_table_hash_nr_entries;
> > > >  	unsigned long		wait_table_bits;
> > > >  
> > > > +	/* queue for processes waiting for pressure to relieve */
> > > > +	wait_queue_head_t	*pressure_wq;
> > > 
> > > Hmm, processes may be eligible to allocate from > 1 zone, but you
> > > have them only waiting for one. I wonder if we shouldn't wait for
> > > more zones?
> > > 
> > 
> > It's waiting for the zone that is most desirable. If that zones watermarks
> > are met, why would it wait on any other zone?
> 
> I mean the other way around. If that zone's watermarks are not met, then
> why shouldn't it be woken up by other zones reaching their watermarks.
> 

Doing it requires moving to a per-node structure or a global queue. I'd rather
not add hot lines to the node structure (and the associated lookup cost in
the free path) if I can help it. A global queue would work on smaller machines
but I'd be worried about thundering herd problems on larger machines. I know
congestion_wait is already a global queue but IO is a relatively slow event.
Potentially the wakeups from this queue are a lot faster.

Should I just move to a global queue as a starting point and see what
problems are caused later?

> > If you mean that it would
> > wait for any of the eligible zones to meet their watermark, it might have
> > an impact on NUMA locality but it could be managed. It might make sense to
> > wait on a node-based queue rather than a zone if this behaviour was desirable.
> > 
> > > Congestion waiting uses a global waitqueue, which hasn't seemed to
> > > cause a big scalability problem. Would it be better to have a global
> > > waitqueue for this too?
> > > 
> > 
> > Considering that the congestion wait queue is for a relatively slow operation,
> > it would be surprising if lock scalability was noticeable.  Potentially the
> > pressure_wq involves no IO so scalability may be noticeable there.
> > 
> > What would the advantages of a global waitqueue be? Obviously, a smaller
> > memory footprint. A second potential advantage is that on wakeup, it
> > could check the watermarks on multiple zones which might reduce
> > latencies in some cases. Can you think of more compelling reasons?
> 
> Your 2nd advantage is what I mean above.
> 
> 
> > > 
> > > > +void check_zone_pressure(struct zone *zone)
> > > 
> > > I don't really like the name pressure. We use that term for the reclaim
> > > pressure wheras we're just checking watermarks here (actual pressure
> > > could be anything).
> > > 
> > 
> > pressure_wq => watermark_wq
> > check_zone_pressure => check_watermark_wq
> > 
> > ?
> 
> Thanks.
> 
> > 
> > > 
> > > > +{
> > > > +	/* If no process is waiting, nothing to do */
> > > > +	if (!waitqueue_active(zone->pressure_wq))
> > > > +		return;
> > > > +
> > > > +	/* Check if the high watermark is ok for order 0 */
> > > > +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > > > +		wake_up_interruptible(zone->pressure_wq);
> > > > +}
> > > 
> > > If you were to do this under the zone lock (in your subsequent patch),
> > > then it could avoid races. I would suggest doing it all as a single
> > > patch and not doing the pressure checks in reclaim at all.
> > > 
> > 
> > That is reasonable. I've already dropped the checks in reclaim because as you
> > say, if the free path check is cheap enough, it's also sufficient. Checking
> > in the reclaim paths as well is redundant.
> > 
> > I'll move the call to check_zone_pressure() within the zone lock to avoid
> > races.
> > 
> > > If you are missing anything, then that needs to be explained and fixed
> > > rather than just adding extra checks.
> > > 
> > > > +
> > > > +/**
> > > > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > > > + * @zone: The zone that is expected to be under pressure
> > > > + * @order: The order the caller is waiting on pages for
> > > > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > > > + *
> > > > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > > > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > > > + * the zone above the high watermark
> > > > + */
> > > > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > > > +{
> > > > +	long ret;
> > > > +	DEFINE_WAIT(wait);
> > > > +
> > > > +wait_again:
> > > > +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> > > 
> > > I guess to do it without races you need to check watermark here.
> > > And possibly some barriers if it is done without zone->lock.
> > > 
> > 
> > As watermark checks are already done without the zone->lock and without
> > barriers, why are they needed here? Yes, there are small races. For
> > example, it's possible to hit a window where pages were freed between
> > watermarks were checked and we went to sleep here but that is similar to
> > current behaviour.
> 
> Well with the check in free_pages_bulk then doing another check here
> before the wait should be able to close all lost-wakeup races. I agree
> it is pretty fuzzy heuristics anyway, so these races don't *really*
> matter a lot. But it seems easy to close the races, so I don't see
> why not.
> 

I agree that the window is unnecessarily large. I'll tighten it.,

> 
> > > > +
> > > > +	/*
> > > > +	 * The use of io_schedule_timeout() here means that it gets
> > > > +	 * accounted for as IO waiting. This may or may not be the case
> > > > +	 * but at least this way it gets picked up by vmstat
> > > > +	 */
> > > > +	ret = io_schedule_timeout(timeout);
> > > > +	finish_wait(zone->pressure_wq, &wait);
> > > > +
> > > > +	/* If woken early, check watermarks before continuing */
> > > > +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > > > +		timeout = ret;
> > > > +		goto wait_again;
> > > > +	}
> > > 
> > > And then I don't know if we'd really need the extra check here. Might as
> > > well just let the allocator try again and avoid the code?
> > > 
> > 
> > I was considering multiple processes been woken up and racing with each
> > other. I can drop this check though. The worst that happens is multiple
> > processes wake and walk the full zonelist. Some will succeed and others
> > will go back to sleep.
> 
> Yep. And it doesn't really solve that race either becuase the zone
> might subsequently go below the watermark.
> 

True. In theory, the same sort of races currently apply with
congestion_wait() but that's just an excuse. There is a strong
possibility we could behave better with respect to watermarks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 15:42           ` Christian Ehrhardt
@ 2010-03-09 18:22             ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 18:22 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Nick Piggin, linux-mm, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 04:42:50PM +0100, Christian Ehrhardt wrote:
>
>
> Nick Piggin wrote:
>> On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
>>> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
>>>> On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
>>>>> Under heavy memory pressure, the page allocator may call congestion_wait()
>>>>> to wait for IO congestion to clear or a timeout. This is not as sensible
>>>>> a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
>>>>> even congested as the pressure could have been due to a large number of
>>>>> SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
>>>>>
>>>>> At the point of congestion_wait(), the allocator is struggling to get the
>>>>> pages it needs and it should back off. This patch puts the allocator to sleep
>>>>> on a zone->pressure_wq for either a timeout or until a direct reclaimer or
>>>>> kswapd brings the zone over the low watermark, whichever happens first.
>>>>>
>>>>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>>>>> ---
>>>>>  include/linux/mmzone.h |    3 ++
>>>>>  mm/internal.h          |    4 +++
>>>>>  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
>>>>>  mm/vmscan.c            |    2 +
>>>>>  5 files changed, 101 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>>> index 30fe668..72465c1 100644
>>>>> --- a/include/linux/mmzone.h
>>>>> +++ b/include/linux/mmzone.h
> [...]
>>>>> +{
>>>>> +	/* If no process is waiting, nothing to do */
>>>>> +	if (!waitqueue_active(zone->pressure_wq))
>>>>> +		return;
>>>>> +
>>>>> +	/* Check if the high watermark is ok for order 0 */
>>>>> +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
>>>>> +		wake_up_interruptible(zone->pressure_wq);
>>>>> +}
>>>> If you were to do this under the zone lock (in your subsequent patch),
>>>> then it could avoid races. I would suggest doing it all as a single
>>>> patch and not doing the pressure checks in reclaim at all.
>>>>
>>> That is reasonable. I've already dropped the checks in reclaim because as you
>>> say, if the free path check is cheap enough, it's also sufficient. Checking
>>> in the reclaim paths as well is redundant.
>>>
>>> I'll move the call to check_zone_pressure() within the zone lock to avoid
>>> races.
>>>
>
> Mel, we talked about a thundering herd issue that might come up here in  
> very constraint cases.
> So wherever you end up putting that wake_up call, how about being extra  
> paranoid about a thundering herd flagging them WQ_FLAG_EXCLUSIVE and  
> waking them with something like that:
>
> wake_up_interruptible_nr(zone->pressure_wq, #nrofpagesabovewatermark#);
>
> That should be an easy to calculate sane max of waiters to wake up.
> On the other hand it might be over-engineered and it implies the need to  
> reconsider when it would be best to wake up the rest.
>

It seems over-engineering considering that they wake up after a timeout
unconditionally. I think it's best for the moment to let them wake up in
a herd and recheck their zonelists as they'll go back to sleep if
necessary.

> Get me right - I don't really have a hard requirement or need for that,  
> I just wanted to mention it early on to hear your opinions about it.
>
> looking forward to test the v2 patch series, adapted to all the good  
> stuff already discussed.
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-09 18:22             ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-09 18:22 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Nick Piggin, linux-mm, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 04:42:50PM +0100, Christian Ehrhardt wrote:
>
>
> Nick Piggin wrote:
>> On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
>>> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
>>>> On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
>>>>> Under heavy memory pressure, the page allocator may call congestion_wait()
>>>>> to wait for IO congestion to clear or a timeout. This is not as sensible
>>>>> a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
>>>>> even congested as the pressure could have been due to a large number of
>>>>> SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
>>>>>
>>>>> At the point of congestion_wait(), the allocator is struggling to get the
>>>>> pages it needs and it should back off. This patch puts the allocator to sleep
>>>>> on a zone->pressure_wq for either a timeout or until a direct reclaimer or
>>>>> kswapd brings the zone over the low watermark, whichever happens first.
>>>>>
>>>>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>>>>> ---
>>>>>  include/linux/mmzone.h |    3 ++
>>>>>  mm/internal.h          |    4 +++
>>>>>  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
>>>>>  mm/vmscan.c            |    2 +
>>>>>  5 files changed, 101 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>>> index 30fe668..72465c1 100644
>>>>> --- a/include/linux/mmzone.h
>>>>> +++ b/include/linux/mmzone.h
> [...]
>>>>> +{
>>>>> +	/* If no process is waiting, nothing to do */
>>>>> +	if (!waitqueue_active(zone->pressure_wq))
>>>>> +		return;
>>>>> +
>>>>> +	/* Check if the high watermark is ok for order 0 */
>>>>> +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
>>>>> +		wake_up_interruptible(zone->pressure_wq);
>>>>> +}
>>>> If you were to do this under the zone lock (in your subsequent patch),
>>>> then it could avoid races. I would suggest doing it all as a single
>>>> patch and not doing the pressure checks in reclaim at all.
>>>>
>>> That is reasonable. I've already dropped the checks in reclaim because as you
>>> say, if the free path check is cheap enough, it's also sufficient. Checking
>>> in the reclaim paths as well is redundant.
>>>
>>> I'll move the call to check_zone_pressure() within the zone lock to avoid
>>> races.
>>>
>
> Mel, we talked about a thundering herd issue that might come up here in  
> very constraint cases.
> So wherever you end up putting that wake_up call, how about being extra  
> paranoid about a thundering herd flagging them WQ_FLAG_EXCLUSIVE and  
> waking them with something like that:
>
> wake_up_interruptible_nr(zone->pressure_wq, #nrofpagesabovewatermark#);
>
> That should be an easy to calculate sane max of waiters to wake up.
> On the other hand it might be over-engineered and it implies the need to  
> reconsider when it would be best to wake up the rest.
>

It seems over-engineering considering that they wake up after a timeout
unconditionally. I think it's best for the moment to let them wake up in
a herd and recheck their zonelists as they'll go back to sleep if
necessary.

> Get me right - I don't really have a hard requirement or need for that,  
> I just wanted to mention it early on to hear your opinions about it.
>
> looking forward to test the v2 patch series, adapted to all the good  
> stuff already discussed.
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 17:35           ` Mel Gorman
@ 2010-03-10  2:35             ` Nick Piggin
  -1 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-10  2:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 05:35:36PM +0000, Mel Gorman wrote:
> On Wed, Mar 10, 2010 at 02:03:32AM +1100, Nick Piggin wrote:
> > I mean the other way around. If that zone's watermarks are not met, then
> > why shouldn't it be woken up by other zones reaching their watermarks.
> > 
> 
> Doing it requires moving to a per-node structure or a global queue. I'd rather
> not add hot lines to the node structure (and the associated lookup cost in
> the free path) if I can help it. A global queue would work on smaller machines
> but I'd be worried about thundering herd problems on larger machines. I know
> congestion_wait is already a global queue but IO is a relatively slow event.
> Potentially the wakeups from this queue are a lot faster.
> 
> Should I just move to a global queue as a starting point and see what
> problems are caused later?

Yes. This should change allocation behaviours less than your patch does
now in the presence of multiple allocatees stuck in the wait with
different preferred zones.

I would worry about thundering herds as a different problem we already
have. And if wakeups are less frequent, then each one is more likely to
cause a thundering herd anyway.


> > Yep. And it doesn't really solve that race either becuase the zone
> > might subsequently go below the watermark.
> > 
> 
> True. In theory, the same sort of races currently apply with
> congestion_wait() but that's just an excuse. There is a strong
> possibility we could behave better with respect to watermarks.

We can probably avoid all races where the process sleeps too long
(ie. misses wakeups). Waking up too early and finding pages already
allocated is harder and probably can't really be solved without all
allocatees checking the waitqueue before taking pages.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-10  2:35             ` Nick Piggin
  0 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-10  2:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Christian Ehrhardt, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 05:35:36PM +0000, Mel Gorman wrote:
> On Wed, Mar 10, 2010 at 02:03:32AM +1100, Nick Piggin wrote:
> > I mean the other way around. If that zone's watermarks are not met, then
> > why shouldn't it be woken up by other zones reaching their watermarks.
> > 
> 
> Doing it requires moving to a per-node structure or a global queue. I'd rather
> not add hot lines to the node structure (and the associated lookup cost in
> the free path) if I can help it. A global queue would work on smaller machines
> but I'd be worried about thundering herd problems on larger machines. I know
> congestion_wait is already a global queue but IO is a relatively slow event.
> Potentially the wakeups from this queue are a lot faster.
> 
> Should I just move to a global queue as a starting point and see what
> problems are caused later?

Yes. This should change allocation behaviours less than your patch does
now in the presence of multiple allocatees stuck in the wait with
different preferred zones.

I would worry about thundering herds as a different problem we already
have. And if wakeups are less frequent, then each one is more likely to
cause a thundering herd anyway.


> > Yep. And it doesn't really solve that race either becuase the zone
> > might subsequently go below the watermark.
> > 
> 
> True. In theory, the same sort of races currently apply with
> congestion_wait() but that's just an excuse. There is a strong
> possibility we could behave better with respect to watermarks.

We can probably avoid all races where the process sleeps too long
(ie. misses wakeups). Waking up too early and finding pages already
allocated is harder and probably can't really be solved without all
allocatees checking the waitqueue before taking pages.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
  2010-03-09 18:22             ` Mel Gorman
@ 2010-03-10  2:38               ` Nick Piggin
  -1 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-10  2:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christian Ehrhardt, linux-mm, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 06:22:28PM +0000, Mel Gorman wrote:
> On Tue, Mar 09, 2010 at 04:42:50PM +0100, Christian Ehrhardt wrote:
> >
> >
> > Nick Piggin wrote:
> >> On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
> >>> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> >>>> On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> >>>>> Under heavy memory pressure, the page allocator may call congestion_wait()
> >>>>> to wait for IO congestion to clear or a timeout. This is not as sensible
> >>>>> a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> >>>>> even congested as the pressure could have been due to a large number of
> >>>>> SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> >>>>>
> >>>>> At the point of congestion_wait(), the allocator is struggling to get the
> >>>>> pages it needs and it should back off. This patch puts the allocator to sleep
> >>>>> on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> >>>>> kswapd brings the zone over the low watermark, whichever happens first.
> >>>>>
> >>>>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >>>>> ---
> >>>>>  include/linux/mmzone.h |    3 ++
> >>>>>  mm/internal.h          |    4 +++
> >>>>>  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> >>>>>  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> >>>>>  mm/vmscan.c            |    2 +
> >>>>>  5 files changed, 101 insertions(+), 5 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >>>>> index 30fe668..72465c1 100644
> >>>>> --- a/include/linux/mmzone.h
> >>>>> +++ b/include/linux/mmzone.h
> > [...]
> >>>>> +{
> >>>>> +	/* If no process is waiting, nothing to do */
> >>>>> +	if (!waitqueue_active(zone->pressure_wq))
> >>>>> +		return;
> >>>>> +
> >>>>> +	/* Check if the high watermark is ok for order 0 */
> >>>>> +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> >>>>> +		wake_up_interruptible(zone->pressure_wq);
> >>>>> +}
> >>>> If you were to do this under the zone lock (in your subsequent patch),
> >>>> then it could avoid races. I would suggest doing it all as a single
> >>>> patch and not doing the pressure checks in reclaim at all.
> >>>>
> >>> That is reasonable. I've already dropped the checks in reclaim because as you
> >>> say, if the free path check is cheap enough, it's also sufficient. Checking
> >>> in the reclaim paths as well is redundant.
> >>>
> >>> I'll move the call to check_zone_pressure() within the zone lock to avoid
> >>> races.
> >>>
> >
> > Mel, we talked about a thundering herd issue that might come up here in  
> > very constraint cases.
> > So wherever you end up putting that wake_up call, how about being extra  
> > paranoid about a thundering herd flagging them WQ_FLAG_EXCLUSIVE and  
> > waking them with something like that:
> >
> > wake_up_interruptible_nr(zone->pressure_wq, #nrofpagesabovewatermark#);
> >
> > That should be an easy to calculate sane max of waiters to wake up.
> > On the other hand it might be over-engineered and it implies the need to  
> > reconsider when it would be best to wake up the rest.
> >
> 
> It seems over-engineering considering that they wake up after a timeout
> unconditionally. I think it's best for the moment to let them wake up in
> a herd and recheck their zonelists as they'll go back to sleep if
> necessary.

I think it isn't a bad idea and I was thinking about that myself, except
that we might do several wakeups before previously woken processes get
the chance to run and take pages from the watermark.

Also, if we have a large number of processes waiting here, the system
isn't in great shape, so a little CPU activity probably isn't going to
be noticable.

So I agree with Mel it probably isn't needed.


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion
@ 2010-03-10  2:38               ` Nick Piggin
  0 siblings, 0 replies; 136+ messages in thread
From: Nick Piggin @ 2010-03-10  2:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christian Ehrhardt, linux-mm, Chris Mason, Jens Axboe, linux-kernel

On Tue, Mar 09, 2010 at 06:22:28PM +0000, Mel Gorman wrote:
> On Tue, Mar 09, 2010 at 04:42:50PM +0100, Christian Ehrhardt wrote:
> >
> >
> > Nick Piggin wrote:
> >> On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
> >>> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> >>>> On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> >>>>> Under heavy memory pressure, the page allocator may call congestion_wait()
> >>>>> to wait for IO congestion to clear or a timeout. This is not as sensible
> >>>>> a choice as it first appears. There is no guarantee that BLK_RW_ASYNC is
> >>>>> even congested as the pressure could have been due to a large number of
> >>>>> SYNC reads and the allocator waits for the entire timeout, possibly uselessly.
> >>>>>
> >>>>> At the point of congestion_wait(), the allocator is struggling to get the
> >>>>> pages it needs and it should back off. This patch puts the allocator to sleep
> >>>>> on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> >>>>> kswapd brings the zone over the low watermark, whichever happens first.
> >>>>>
> >>>>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >>>>> ---
> >>>>>  include/linux/mmzone.h |    3 ++
> >>>>>  mm/internal.h          |    4 +++
> >>>>>  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> >>>>>  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> >>>>>  mm/vmscan.c            |    2 +
> >>>>>  5 files changed, 101 insertions(+), 5 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >>>>> index 30fe668..72465c1 100644
> >>>>> --- a/include/linux/mmzone.h
> >>>>> +++ b/include/linux/mmzone.h
> > [...]
> >>>>> +{
> >>>>> +	/* If no process is waiting, nothing to do */
> >>>>> +	if (!waitqueue_active(zone->pressure_wq))
> >>>>> +		return;
> >>>>> +
> >>>>> +	/* Check if the high watermark is ok for order 0 */
> >>>>> +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> >>>>> +		wake_up_interruptible(zone->pressure_wq);
> >>>>> +}
> >>>> If you were to do this under the zone lock (in your subsequent patch),
> >>>> then it could avoid races. I would suggest doing it all as a single
> >>>> patch and not doing the pressure checks in reclaim at all.
> >>>>
> >>> That is reasonable. I've already dropped the checks in reclaim because as you
> >>> say, if the free path check is cheap enough, it's also sufficient. Checking
> >>> in the reclaim paths as well is redundant.
> >>>
> >>> I'll move the call to check_zone_pressure() within the zone lock to avoid
> >>> races.
> >>>
> >
> > Mel, we talked about a thundering herd issue that might come up here in  
> > very constraint cases.
> > So wherever you end up putting that wake_up call, how about being extra  
> > paranoid about a thundering herd flagging them WQ_FLAG_EXCLUSIVE and  
> > waking them with something like that:
> >
> > wake_up_interruptible_nr(zone->pressure_wq, #nrofpagesabovewatermark#);
> >
> > That should be an easy to calculate sane max of waiters to wake up.
> > On the other hand it might be over-engineered and it implies the need to  
> > reconsider when it would be best to wake up the rest.
> >
> 
> It seems over-engineering considering that they wake up after a timeout
> unconditionally. I think it's best for the moment to let them wake up in
> a herd and recheck their zonelists as they'll go back to sleep if
> necessary.

I think it isn't a bad idea and I was thinking about that myself, except
that we might do several wakeups before previously woken processes get
the chance to run and take pages from the watermark.

Also, if we have a large number of processes waiting here, the system
isn't in great shape, so a little CPU activity probably isn't going to
be noticable.

So I agree with Mel it probably isn't needed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-08 11:48 ` Mel Gorman
@ 2010-03-11 23:41   ` Andrew Morton
  -1 siblings, 0 replies; 136+ messages in thread
From: Andrew Morton @ 2010-03-11 23:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Nick Piggin, Christian Ehrhardt, Chris Mason,
	Jens Axboe, linux-kernel

On Mon,  8 Mar 2010 11:48:20 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> Under memory pressure, the page allocator and kswapd can go to sleep using
> congestion_wait(). In two of these cases, it may not be the appropriate
> action as congestion may not be the problem.

clear_bdi_congested() is called each time a write completes and the
queue is below the congestion threshold.

So if the page allocator or kswapd call congestion_wait() against a
non-congested queue, they'll wake up on the very next write completion.

Hence the above-quoted claim seems to me to be a significant mis-analysis and
perhaps explains why the patchset didn't seem to help anything?

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-11 23:41   ` Andrew Morton
  0 siblings, 0 replies; 136+ messages in thread
From: Andrew Morton @ 2010-03-11 23:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Nick Piggin, Christian Ehrhardt, Chris Mason,
	Jens Axboe, linux-kernel

On Mon,  8 Mar 2010 11:48:20 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> Under memory pressure, the page allocator and kswapd can go to sleep using
> congestion_wait(). In two of these cases, it may not be the appropriate
> action as congestion may not be the problem.

clear_bdi_congested() is called each time a write completes and the
queue is below the congestion threshold.

So if the page allocator or kswapd call congestion_wait() against a
non-congested queue, they'll wake up on the very next write completion.

Hence the above-quoted claim seems to me to be a significant mis-analysis and
perhaps explains why the patchset didn't seem to help anything?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-11 23:41   ` Andrew Morton
@ 2010-03-12  6:39     ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-12  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel



Andrew Morton wrote:
> On Mon,  8 Mar 2010 11:48:20 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
>> Under memory pressure, the page allocator and kswapd can go to sleep using
>> congestion_wait(). In two of these cases, it may not be the appropriate
>> action as congestion may not be the problem.
> 
> clear_bdi_congested() is called each time a write completes and the
> queue is below the congestion threshold.
> 
> So if the page allocator or kswapd call congestion_wait() against a
> non-congested queue, they'll wake up on the very next write completion.

Well the issue came up in all kind of loads where you don't have any 
writes at all that can wake up congestion_wait.
Thats true for several benchmarks, but also real workload as well e.g. A 
backup job reading almost all files sequentially and pumping out stuff 
via network.

> Hence the above-quoted claim seems to me to be a significant mis-analysis and
> perhaps explains why the patchset didn't seem to help anything?

While I might have misunderstood you and it is a mis-analysis in your 
opinion, it fixes a -80% Throughput regression on sequential read 
workloads, thats not nothing - its more like absolutely required :-)

You might check out the discussion with the subject "Performance 
regression in scsi sequential throughput (iozone)	due to "e084b - 
page-allocator: preserve PFN ordering when	__GFP_COLD is set"".
While the original subject is misleading from todays point of view, it 
contains a lengthy discussion about exactly when/why/where time is lost 
due to congestion wait with a lot of traces, counters, data attachments 
and such stuff.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-12  6:39     ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-12  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel



Andrew Morton wrote:
> On Mon,  8 Mar 2010 11:48:20 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
>> Under memory pressure, the page allocator and kswapd can go to sleep using
>> congestion_wait(). In two of these cases, it may not be the appropriate
>> action as congestion may not be the problem.
> 
> clear_bdi_congested() is called each time a write completes and the
> queue is below the congestion threshold.
> 
> So if the page allocator or kswapd call congestion_wait() against a
> non-congested queue, they'll wake up on the very next write completion.

Well the issue came up in all kind of loads where you don't have any 
writes at all that can wake up congestion_wait.
Thats true for several benchmarks, but also real workload as well e.g. A 
backup job reading almost all files sequentially and pumping out stuff 
via network.

> Hence the above-quoted claim seems to me to be a significant mis-analysis and
> perhaps explains why the patchset didn't seem to help anything?

While I might have misunderstood you and it is a mis-analysis in your 
opinion, it fixes a -80% Throughput regression on sequential read 
workloads, thats not nothing - its more like absolutely required :-)

You might check out the discussion with the subject "Performance 
regression in scsi sequential throughput (iozone)	due to "e084b - 
page-allocator: preserve PFN ordering when	__GFP_COLD is set"".
While the original subject is misleading from todays point of view, it 
contains a lengthy discussion about exactly when/why/where time is lost 
due to congestion wait with a lot of traces, counters, data attachments 
and such stuff.

-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-12  6:39     ` Christian Ehrhardt
@ 2010-03-12  7:05       ` Andrew Morton
  -1 siblings, 0 replies; 136+ messages in thread
From: Andrew Morton @ 2010-03-12  7:05 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel

On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:

> 
> 
> Andrew Morton wrote:
> > On Mon,  8 Mar 2010 11:48:20 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> >> Under memory pressure, the page allocator and kswapd can go to sleep using
> >> congestion_wait(). In two of these cases, it may not be the appropriate
> >> action as congestion may not be the problem.
> > 
> > clear_bdi_congested() is called each time a write completes and the
> > queue is below the congestion threshold.
> > 
> > So if the page allocator or kswapd call congestion_wait() against a
> > non-congested queue, they'll wake up on the very next write completion.
> 
> Well the issue came up in all kind of loads where you don't have any 
> writes at all that can wake up congestion_wait.
> Thats true for several benchmarks, but also real workload as well e.g. A 
> backup job reading almost all files sequentially and pumping out stuff 
> via network.

Why is reclaim going into congestion_wait() at all if there's heaps of
clean reclaimable pagecache lying around?

(I don't thing the read side of the congestion_wqh[] has ever been used, btw)

> > Hence the above-quoted claim seems to me to be a significant mis-analysis and
> > perhaps explains why the patchset didn't seem to help anything?
> 
> While I might have misunderstood you and it is a mis-analysis in your 
> opinion, it fixes a -80% Throughput regression on sequential read 
> workloads, thats not nothing - its more like absolutely required :-)
> 
> You might check out the discussion with the subject "Performance 
> regression in scsi sequential throughput (iozone)	due to "e084b - 
> page-allocator: preserve PFN ordering when	__GFP_COLD is set"".
> While the original subject is misleading from todays point of view, it 
> contains a lengthy discussion about exactly when/why/where time is lost 
> due to congestion wait with a lot of traces, counters, data attachments 
> and such stuff.

Well if we're not encountering lots of dirty pages in reclaim then we
shouldn't be waiting for writes to retire, of course.

But if we're not encountering lots of dirty pages in reclaim, we should
be reclaiming pages, normally.

I could understand reclaim accidentally going into congestion_wait() if
it hit a large pile of pages which are unreclaimable for reasons other
than being dirty, but is that happening in this case?

If not, we broke it again.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-12  7:05       ` Andrew Morton
  0 siblings, 0 replies; 136+ messages in thread
From: Andrew Morton @ 2010-03-12  7:05 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel

On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:

> 
> 
> Andrew Morton wrote:
> > On Mon,  8 Mar 2010 11:48:20 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> >> Under memory pressure, the page allocator and kswapd can go to sleep using
> >> congestion_wait(). In two of these cases, it may not be the appropriate
> >> action as congestion may not be the problem.
> > 
> > clear_bdi_congested() is called each time a write completes and the
> > queue is below the congestion threshold.
> > 
> > So if the page allocator or kswapd call congestion_wait() against a
> > non-congested queue, they'll wake up on the very next write completion.
> 
> Well the issue came up in all kind of loads where you don't have any 
> writes at all that can wake up congestion_wait.
> Thats true for several benchmarks, but also real workload as well e.g. A 
> backup job reading almost all files sequentially and pumping out stuff 
> via network.

Why is reclaim going into congestion_wait() at all if there's heaps of
clean reclaimable pagecache lying around?

(I don't thing the read side of the congestion_wqh[] has ever been used, btw)

> > Hence the above-quoted claim seems to me to be a significant mis-analysis and
> > perhaps explains why the patchset didn't seem to help anything?
> 
> While I might have misunderstood you and it is a mis-analysis in your 
> opinion, it fixes a -80% Throughput regression on sequential read 
> workloads, thats not nothing - its more like absolutely required :-)
> 
> You might check out the discussion with the subject "Performance 
> regression in scsi sequential throughput (iozone)	due to "e084b - 
> page-allocator: preserve PFN ordering when	__GFP_COLD is set"".
> While the original subject is misleading from todays point of view, it 
> contains a lengthy discussion about exactly when/why/where time is lost 
> due to congestion wait with a lot of traces, counters, data attachments 
> and such stuff.

Well if we're not encountering lots of dirty pages in reclaim then we
shouldn't be waiting for writes to retire, of course.

But if we're not encountering lots of dirty pages in reclaim, we should
be reclaiming pages, normally.

I could understand reclaim accidentally going into congestion_wait() if
it hit a large pile of pages which are unreclaimable for reasons other
than being dirty, but is that happening in this case?

If not, we broke it again.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-11 23:41   ` Andrew Morton
@ 2010-03-12  9:09     ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-12  9:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Nick Piggin, Christian Ehrhardt, Chris Mason,
	Jens Axboe, linux-kernel

On Thu, Mar 11, 2010 at 03:41:24PM -0800, Andrew Morton wrote:
> On Mon,  8 Mar 2010 11:48:20 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Under memory pressure, the page allocator and kswapd can go to sleep using
> > congestion_wait(). In two of these cases, it may not be the appropriate
> > action as congestion may not be the problem.
> 
> clear_bdi_congested() is called each time a write completes and the
> queue is below the congestion threshold.
> 

Where you appear to get a kicking is if you want on "congestion" but no
writes are involved. In that case you potentially sleep for the whole timeout
waiting on an event that is not going to occur.

> So if the page allocator or kswapd call congestion_wait() against a
> non-congested queue, they'll wake up on the very next write completion.
> 
> Hence the above-quoted claim seems to me to be a significant mis-analysis and
> perhaps explains why the patchset didn't seem to help anything?
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-12  9:09     ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-12  9:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Nick Piggin, Christian Ehrhardt, Chris Mason,
	Jens Axboe, linux-kernel

On Thu, Mar 11, 2010 at 03:41:24PM -0800, Andrew Morton wrote:
> On Mon,  8 Mar 2010 11:48:20 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Under memory pressure, the page allocator and kswapd can go to sleep using
> > congestion_wait(). In two of these cases, it may not be the appropriate
> > action as congestion may not be the problem.
> 
> clear_bdi_congested() is called each time a write completes and the
> queue is below the congestion threshold.
> 

Where you appear to get a kicking is if you want on "congestion" but no
writes are involved. In that case you potentially sleep for the whole timeout
waiting on an event that is not going to occur.

> So if the page allocator or kswapd call congestion_wait() against a
> non-congested queue, they'll wake up on the very next write completion.
> 
> Hence the above-quoted claim seems to me to be a significant mis-analysis and
> perhaps explains why the patchset didn't seem to help anything?
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-12  7:05       ` Andrew Morton
@ 2010-03-12 10:47         ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-12 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Fri, Mar 12, 2010 at 02:05:26AM -0500, Andrew Morton wrote:
> On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > 
> > 
> > Andrew Morton wrote:
> > > On Mon,  8 Mar 2010 11:48:20 +0000
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > >> Under memory pressure, the page allocator and kswapd can go to sleep using
> > >> congestion_wait(). In two of these cases, it may not be the appropriate
> > >> action as congestion may not be the problem.
> > > 
> > > clear_bdi_congested() is called each time a write completes and the
> > > queue is below the congestion threshold.
> > > 
> > > So if the page allocator or kswapd call congestion_wait() against a
> > > non-congested queue, they'll wake up on the very next write completion.
> > 
> > Well the issue came up in all kind of loads where you don't have any 
> > writes at all that can wake up congestion_wait.
> > Thats true for several benchmarks, but also real workload as well e.g. A 
> > backup job reading almost all files sequentially and pumping out stuff 
> > via network.
> 
> Why is reclaim going into congestion_wait() at all if there's heaps of
> clean reclaimable pagecache lying around?
> 
> (I don't thing the read side of the congestion_wqh[] has ever been used, btw)
> 

I believe it's a race albeit one that has been there a long time.

In __alloc_pages_direct_reclaim, a process does approximately the
following

1. Enters direct reclaim
2. Calls cond_reched()
3. Drain pages if necessary
4. Attempt to allocate a page

Between steps 2 and 3, it's possible to have reclaimed the pages but
another process allocate them. It then proceeds and decides try again
but calls congestion_wait() before it loops around.

Plenty of read cache reclaimed but no forward progress.

> > > Hence the above-quoted claim seems to me to be a significant mis-analysis and
> > > perhaps explains why the patchset didn't seem to help anything?
> > 
> > While I might have misunderstood you and it is a mis-analysis in your 
> > opinion, it fixes a -80% Throughput regression on sequential read 
> > workloads, thats not nothing - its more like absolutely required :-)
> > 
> > You might check out the discussion with the subject "Performance 
> > regression in scsi sequential throughput (iozone)	due to "e084b - 
> > page-allocator: preserve PFN ordering when	__GFP_COLD is set"".
> > While the original subject is misleading from todays point of view, it 
> > contains a lengthy discussion about exactly when/why/where time is lost 
> > due to congestion wait with a lot of traces, counters, data attachments 
> > and such stuff.
> 
> Well if we're not encountering lots of dirty pages in reclaim then we
> shouldn't be waiting for writes to retire, of course.
> 
> But if we're not encountering lots of dirty pages in reclaim, we should
> be reclaiming pages, normally.
> 

We probably are.

> I could understand reclaim accidentally going into congestion_wait() if
> it hit a large pile of pages which are unreclaimable for reasons other
> than being dirty, but is that happening in this case?
> 

Probably not. It's almost certainly the race I described above.

> If not, we broke it again.
> 

We were broken with respect to this in the first place. That
cond_reched() is badly placed and waiting on congestion when congestion
might not be involved is also a bit odd.

It's possible that Christian's specific problem would also be addressed
by the following patch. Christian, willing to test?

It still feels a bit unnatural though that the page allocator waits on
congestion when what it really cares about is watermarks. Even if this
patch works for Christian, I think it still has merit so will kick it a
few more times.

==== CUT HERE ====
page-allocator: Attempt page allocation immediately after direct reclaim

After a process completes direct reclaim it calls cond_resched() as
potentially it has been running a long time. When it wakes up, it
attempts to allocate a page. There is a large window during which
another process can allocate the pages reclaimed by direct reclaim. This
patch attempts to allocate a page immediately after direct reclaim but
will still go to sleep afterwards if its quantum has expired.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8182c8..973b7fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1721,8 +1721,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	lockdep_clear_current_reclaim_state();
 	p->flags &= ~PF_MEMALLOC;
 
-	cond_resched();
-
 	if (order != 0)
 		drain_all_pages();
 
@@ -1731,6 +1729,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	cond_resched();
+
 	return page;
 }
 

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-12 10:47         ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-12 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Fri, Mar 12, 2010 at 02:05:26AM -0500, Andrew Morton wrote:
> On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > 
> > 
> > Andrew Morton wrote:
> > > On Mon,  8 Mar 2010 11:48:20 +0000
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > >> Under memory pressure, the page allocator and kswapd can go to sleep using
> > >> congestion_wait(). In two of these cases, it may not be the appropriate
> > >> action as congestion may not be the problem.
> > > 
> > > clear_bdi_congested() is called each time a write completes and the
> > > queue is below the congestion threshold.
> > > 
> > > So if the page allocator or kswapd call congestion_wait() against a
> > > non-congested queue, they'll wake up on the very next write completion.
> > 
> > Well the issue came up in all kind of loads where you don't have any 
> > writes at all that can wake up congestion_wait.
> > Thats true for several benchmarks, but also real workload as well e.g. A 
> > backup job reading almost all files sequentially and pumping out stuff 
> > via network.
> 
> Why is reclaim going into congestion_wait() at all if there's heaps of
> clean reclaimable pagecache lying around?
> 
> (I don't thing the read side of the congestion_wqh[] has ever been used, btw)
> 

I believe it's a race albeit one that has been there a long time.

In __alloc_pages_direct_reclaim, a process does approximately the
following

1. Enters direct reclaim
2. Calls cond_reched()
3. Drain pages if necessary
4. Attempt to allocate a page

Between steps 2 and 3, it's possible to have reclaimed the pages but
another process allocate them. It then proceeds and decides try again
but calls congestion_wait() before it loops around.

Plenty of read cache reclaimed but no forward progress.

> > > Hence the above-quoted claim seems to me to be a significant mis-analysis and
> > > perhaps explains why the patchset didn't seem to help anything?
> > 
> > While I might have misunderstood you and it is a mis-analysis in your 
> > opinion, it fixes a -80% Throughput regression on sequential read 
> > workloads, thats not nothing - its more like absolutely required :-)
> > 
> > You might check out the discussion with the subject "Performance 
> > regression in scsi sequential throughput (iozone)	due to "e084b - 
> > page-allocator: preserve PFN ordering when	__GFP_COLD is set"".
> > While the original subject is misleading from todays point of view, it 
> > contains a lengthy discussion about exactly when/why/where time is lost 
> > due to congestion wait with a lot of traces, counters, data attachments 
> > and such stuff.
> 
> Well if we're not encountering lots of dirty pages in reclaim then we
> shouldn't be waiting for writes to retire, of course.
> 
> But if we're not encountering lots of dirty pages in reclaim, we should
> be reclaiming pages, normally.
> 

We probably are.

> I could understand reclaim accidentally going into congestion_wait() if
> it hit a large pile of pages which are unreclaimable for reasons other
> than being dirty, but is that happening in this case?
> 

Probably not. It's almost certainly the race I described above.

> If not, we broke it again.
> 

We were broken with respect to this in the first place. That
cond_reched() is badly placed and waiting on congestion when congestion
might not be involved is also a bit odd.

It's possible that Christian's specific problem would also be addressed
by the following patch. Christian, willing to test?

It still feels a bit unnatural though that the page allocator waits on
congestion when what it really cares about is watermarks. Even if this
patch works for Christian, I think it still has merit so will kick it a
few more times.

==== CUT HERE ====
page-allocator: Attempt page allocation immediately after direct reclaim

After a process completes direct reclaim it calls cond_resched() as
potentially it has been running a long time. When it wakes up, it
attempts to allocate a page. There is a large window during which
another process can allocate the pages reclaimed by direct reclaim. This
patch attempts to allocate a page immediately after direct reclaim but
will still go to sleep afterwards if its quantum has expired.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8182c8..973b7fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1721,8 +1721,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	lockdep_clear_current_reclaim_state();
 	p->flags &= ~PF_MEMALLOC;
 
-	cond_resched();
-
 	if (order != 0)
 		drain_all_pages();
 
@@ -1731,6 +1729,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	cond_resched();
+
 	return page;
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-12 10:47         ` Mel Gorman
@ 2010-03-12 12:15           ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-12 12:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel

Mel Gorman wrote:
> On Fri, Mar 12, 2010 at 02:05:26AM -0500, Andrew Morton wrote:
>> On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>>
>>>
>>> Andrew Morton wrote:
>>>> On Mon,  8 Mar 2010 11:48:20 +0000
>>>> Mel Gorman <mel@csn.ul.ie> wrote:
[...]

>> If not, we broke it again.
>>
> 
> We were broken with respect to this in the first place. That
> cond_reched() is badly placed and waiting on congestion when congestion
> might not be involved is also a bit odd.
> 
> It's possible that Christian's specific problem would also be addressed
> by the following patch. Christian, willing to test?

Will is here, but no chance before monday/tuesday to get a free machine 
slot - I'll post results as soon as I get them.

> It still feels a bit unnatural though that the page allocator waits on
> congestion when what it really cares about is watermarks. Even if this
> patch works for Christian, I think it still has merit so will kick it a
> few more times.

In whatever way I can look at it watermark_wait should be supperior to 
congestion_wait. Because as Mel points out waiting for watermarks is 
what is semantically correct there.

If there eventually some day comes a solution without any of those waits 
I'm fine too - e.g. by closing whatever races we have and fixing that 
one context can never run into this in direct_reclaim:
1. free pages with try_to_free
2. not getting one in the subsequent get_page call

But as long as we have a wait - watermark waiting > congestion waiting 
(IMHO).

> ==== CUT HERE ====
> page-allocator: Attempt page allocation immediately after direct reclaim
[...]
-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-12 12:15           ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-12 12:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel

Mel Gorman wrote:
> On Fri, Mar 12, 2010 at 02:05:26AM -0500, Andrew Morton wrote:
>> On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>>
>>>
>>> Andrew Morton wrote:
>>>> On Mon,  8 Mar 2010 11:48:20 +0000
>>>> Mel Gorman <mel@csn.ul.ie> wrote:
[...]

>> If not, we broke it again.
>>
> 
> We were broken with respect to this in the first place. That
> cond_reched() is badly placed and waiting on congestion when congestion
> might not be involved is also a bit odd.
> 
> It's possible that Christian's specific problem would also be addressed
> by the following patch. Christian, willing to test?

Will is here, but no chance before monday/tuesday to get a free machine 
slot - I'll post results as soon as I get them.

> It still feels a bit unnatural though that the page allocator waits on
> congestion when what it really cares about is watermarks. Even if this
> patch works for Christian, I think it still has merit so will kick it a
> few more times.

In whatever way I can look at it watermark_wait should be supperior to 
congestion_wait. Because as Mel points out waiting for watermarks is 
what is semantically correct there.

If there eventually some day comes a solution without any of those waits 
I'm fine too - e.g. by closing whatever races we have and fixing that 
one context can never run into this in direct_reclaim:
1. free pages with try_to_free
2. not getting one in the subsequent get_page call

But as long as we have a wait - watermark waiting > congestion waiting 
(IMHO).

> ==== CUT HERE ====
> page-allocator: Attempt page allocation immediately after direct reclaim
[...]
-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-12 12:15           ` Christian Ehrhardt
@ 2010-03-12 14:37             ` Andrew Morton
  -1 siblings, 0 replies; 136+ messages in thread
From: Andrew Morton @ 2010-03-12 14:37 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel

On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:

> > It still feels a bit unnatural though that the page allocator waits on
> > congestion when what it really cares about is watermarks. Even if this
> > patch works for Christian, I think it still has merit so will kick it a
> > few more times.
> 
> In whatever way I can look at it watermark_wait should be supperior to 
> congestion_wait. Because as Mel points out waiting for watermarks is 
> what is semantically correct there.

If a direct-reclaimer waits for some thresholds to be achieved then what
task is doing reclaim?

Ultimately, kswapd.  This will introduce a hard dependency upon kswapd
activity.  This might introduce scalability problems.  And latency
problems if kswapd if off doodling with a slow device (say), or doing a
journal commit.  And perhaps deadlocks if kswapd tries to take a lock
which one of the waiting-for-watermark direct relcaimers holds.

Generally, kswapd is an optional, best-effort latency optimisation
thing and we haven't designed for it to be a critical service. 
Probably stuff would break were we to do so.


This is one of the reasons why we avoided creating such dependencies in
reclaim.  Instead, what we do when a reclaimer is encountering lots of
dirty or in-flight pages is

	msleep(100);

then try again.  We're waiting for the disks, not kswapd.

Only the hard-wired 100 is a bit silly, so we made the "100" variable,
inversely dependent upon the number of disks and their speed.  If you
have more and faster disks then you sleep for less time.

And that's what congestion_wait() does, in a very simplistic fashion. 
It's a facility which direct-reclaimers use to ratelimit themselves in
inverse proportion to the speed with which the system can retire writes.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-12 14:37             ` Andrew Morton
  0 siblings, 0 replies; 136+ messages in thread
From: Andrew Morton @ 2010-03-12 14:37 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe, linux-kernel

On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:

> > It still feels a bit unnatural though that the page allocator waits on
> > congestion when what it really cares about is watermarks. Even if this
> > patch works for Christian, I think it still has merit so will kick it a
> > few more times.
> 
> In whatever way I can look at it watermark_wait should be supperior to 
> congestion_wait. Because as Mel points out waiting for watermarks is 
> what is semantically correct there.

If a direct-reclaimer waits for some thresholds to be achieved then what
task is doing reclaim?

Ultimately, kswapd.  This will introduce a hard dependency upon kswapd
activity.  This might introduce scalability problems.  And latency
problems if kswapd if off doodling with a slow device (say), or doing a
journal commit.  And perhaps deadlocks if kswapd tries to take a lock
which one of the waiting-for-watermark direct relcaimers holds.

Generally, kswapd is an optional, best-effort latency optimisation
thing and we haven't designed for it to be a critical service. 
Probably stuff would break were we to do so.


This is one of the reasons why we avoided creating such dependencies in
reclaim.  Instead, what we do when a reclaimer is encountering lots of
dirty or in-flight pages is

	msleep(100);

then try again.  We're waiting for the disks, not kswapd.

Only the hard-wired 100 is a bit silly, so we made the "100" variable,
inversely dependent upon the number of disks and their speed.  If you
have more and faster disks then you sleep for less time.

And that's what congestion_wait() does, in a very simplistic fashion. 
It's a facility which direct-reclaimers use to ratelimit themselves in
inverse proportion to the speed with which the system can retire writes.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-12 14:37             ` Andrew Morton
@ 2010-03-15 12:29               ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-15 12:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Fri, Mar 12, 2010 at 09:37:55AM -0500, Andrew Morton wrote:
> On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > > It still feels a bit unnatural though that the page allocator waits on
> > > congestion when what it really cares about is watermarks. Even if this
> > > patch works for Christian, I think it still has merit so will kick it a
> > > few more times.
> > 
> > In whatever way I can look at it watermark_wait should be supperior to 
> > congestion_wait. Because as Mel points out waiting for watermarks is 
> > what is semantically correct there.
> 
> If a direct-reclaimer waits for some thresholds to be achieved then what
> task is doing reclaim?
> 
> Ultimately, kswapd. 

Well, not quite. The direct reclaimer will still wake up after a timeout
and try again regardless of whether watermarks have been met or not. The
intention is to back after after direct reclaim has failed. Granted, the
window during which a direct reclaim finishes and an allocation attempt
occurs is unnecessarily large. This may be addressed by the patch that
changes where cond_resched() is called.

> This will introduce a hard dependency upon kswapd
> activity.  This might introduce scalability problems.  And latency
> problems if kswapd if off doodling with a slow device (say), or doing a
> journal commit.  And perhaps deadlocks if kswapd tries to take a lock
> which one of the waiting-for-watermark direct relcaimers holds.
> 

What lock could they be holding? Even if that is the case, the direct
reclaimers do not wait indefinitily.

> Generally, kswapd is an optional, best-effort latency optimisation
> thing and we haven't designed for it to be a critical service. 
> Probably stuff would break were we to do so.
> 

No disagreements there.

> This is one of the reasons why we avoided creating such dependencies in
> reclaim.  Instead, what we do when a reclaimer is encountering lots of
> dirty or in-flight pages is
> 
> 	msleep(100);
> 
> then try again.  We're waiting for the disks, not kswapd.
> 
> Only the hard-wired 100 is a bit silly, so we made the "100" variable,
> inversely dependent upon the number of disks and their speed.  If you
> have more and faster disks then you sleep for less time.
> 
> And that's what congestion_wait() does, in a very simplistic fashion. 
> It's a facility which direct-reclaimers use to ratelimit themselves in
> inverse proportion to the speed with which the system can retire writes.
> 

The problem being hit is when a direct reclaimer goes to sleep waiting
on congestion when in reality there were not lots of dirty or in-flight
pages. It goes to sleep for the wrong reasons and doesn't get woken up
again until the timeout expires.

Bear in mind that even if congestion clears, it just means that dirty
pages are now clean although I admit that the next direct reclaim it
does is going to encounter clean pages and should succeed.

Lets see how the other patch that changes when cond_reched() gets called
gets on. If it also works out, then it's harder to justify this patch.
If it doesn't work out then it'll need to be kicked another few times.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-15 12:29               ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-15 12:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel

On Fri, Mar 12, 2010 at 09:37:55AM -0500, Andrew Morton wrote:
> On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > > It still feels a bit unnatural though that the page allocator waits on
> > > congestion when what it really cares about is watermarks. Even if this
> > > patch works for Christian, I think it still has merit so will kick it a
> > > few more times.
> > 
> > In whatever way I can look at it watermark_wait should be supperior to 
> > congestion_wait. Because as Mel points out waiting for watermarks is 
> > what is semantically correct there.
> 
> If a direct-reclaimer waits for some thresholds to be achieved then what
> task is doing reclaim?
> 
> Ultimately, kswapd. 

Well, not quite. The direct reclaimer will still wake up after a timeout
and try again regardless of whether watermarks have been met or not. The
intention is to back after after direct reclaim has failed. Granted, the
window during which a direct reclaim finishes and an allocation attempt
occurs is unnecessarily large. This may be addressed by the patch that
changes where cond_resched() is called.

> This will introduce a hard dependency upon kswapd
> activity.  This might introduce scalability problems.  And latency
> problems if kswapd if off doodling with a slow device (say), or doing a
> journal commit.  And perhaps deadlocks if kswapd tries to take a lock
> which one of the waiting-for-watermark direct relcaimers holds.
> 

What lock could they be holding? Even if that is the case, the direct
reclaimers do not wait indefinitily.

> Generally, kswapd is an optional, best-effort latency optimisation
> thing and we haven't designed for it to be a critical service. 
> Probably stuff would break were we to do so.
> 

No disagreements there.

> This is one of the reasons why we avoided creating such dependencies in
> reclaim.  Instead, what we do when a reclaimer is encountering lots of
> dirty or in-flight pages is
> 
> 	msleep(100);
> 
> then try again.  We're waiting for the disks, not kswapd.
> 
> Only the hard-wired 100 is a bit silly, so we made the "100" variable,
> inversely dependent upon the number of disks and their speed.  If you
> have more and faster disks then you sleep for less time.
> 
> And that's what congestion_wait() does, in a very simplistic fashion. 
> It's a facility which direct-reclaimers use to ratelimit themselves in
> inverse proportion to the speed with which the system can retire writes.
> 

The problem being hit is when a direct reclaimer goes to sleep waiting
on congestion when in reality there were not lots of dirty or in-flight
pages. It goes to sleep for the wrong reasons and doesn't get woken up
again until the timeout expires.

Bear in mind that even if congestion clears, it just means that dirty
pages are now clean although I admit that the next direct reclaim it
does is going to encounter clean pages and should succeed.

Lets see how the other patch that changes when cond_reched() gets called
gets on. If it also works out, then it's harder to justify this patch.
If it doesn't work out then it'll need to be kicked another few times.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-12 14:37             ` Andrew Morton
@ 2010-03-15 12:34               ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-15 12:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel, gregkh

Andrew Morton wrote:
> On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
>>> It still feels a bit unnatural though that the page allocator waits on
>>> congestion when what it really cares about is watermarks. Even if this
>>> patch works for Christian, I think it still has merit so will kick it a
>>> few more times.
>> In whatever way I can look at it watermark_wait should be supperior to 
>> congestion_wait. Because as Mel points out waiting for watermarks is 
>> what is semantically correct there.
> 
> If a direct-reclaimer waits for some thresholds to be achieved then what
> task is doing reclaim?
> 
> Ultimately, kswapd.  This will introduce a hard dependency upon kswapd
> activity.  This might introduce scalability problems.  And latency
> problems if kswapd if off doodling with a slow device (say), or doing a
> journal commit.  And perhaps deadlocks if kswapd tries to take a lock
> which one of the waiting-for-watermark direct relcaimers holds.

So then why not letting the process do something about it if no writes 
are outstanding instead of going to sleep. It might be able to
take care of its bad situation alone, maybe by calling try_to_free again.

> Generally, kswapd is an optional, best-effort latency optimisation
> thing and we haven't designed for it to be a critical service. 
> Probably stuff would break were we to do so.
> 
> 
> This is one of the reasons why we avoided creating such dependencies in
> reclaim.  Instead, what we do when a reclaimer is encountering lots of
> dirty or in-flight pages is
> 
> 	msleep(100);
> 
> then try again.  We're waiting for the disks, not kswapd.
> 
> Only the hard-wired 100 is a bit silly, so we made the "100" variable,
> inversely dependent upon the number of disks and their speed.  If you
> have more and faster disks then you sleep for less time.
> 
> And that's what congestion_wait() does, in a very simplistic fashion. 
> It's a facility which direct-reclaimers use to ratelimit themselves in
> inverse proportion to the speed with which the system can retire writes.

I would totally agree if I wouldn't have that scenario suffering so much
from that mechanism.

In the scenario Mel, Nick and I discussed for a while are no writes at
all, but a lot of page cache reads.
In this scenario direct_reclaimer runs quite frequently into the case of
"did_some_progress && !page" which leads to congestion_wait calls in the
caller of direct_reclaim - eventually waiting always the full timeout as
there are no writes.

I think reclaim in this case is just done by dropping clean page cache
pages in try_to_free_pages in this case -> so still no writes.
For the solution it is hard to find the right layer, as the race is in 
direct_reclaim but the wait call is outside of it.

The alternatives we have so far are:
a) congestion_wait which works fine with writes in flight in the system,
but with a huge drawback for non writing systems.
b) watermark wait which covers writes like congestion_wait (if they free
up enough) but also any other kind of reclaimers like processes freeing
up stuff, other page cache droppers.

new suggestions:
These ideas came up when trying to view it from your position. I don't 
know exactly if all are doable/feasible, but as we are going to wait 
anyway so we could do complex things in that path.

c) If direct reclaim did reasonable progress in try_to_free but did not
get a page, AND there is no write in flight at all then let it try again
to free up something.
This could be extended by some kind of max retry to avoid some weird
looping cases as well.

d) Another way might be as easy as letting congestion_wait return
immediately if there are no outstanding writes - this would keep the 
behavior for cases with write and avoid the "running always in full 
timeout" issue without writes.

e) like d, but let it go to the watermark wait if no writes exist.

So I don't consider option a) a solution as we have real world scenarios 
with huge impacts, even putting more burden on top of kswapd's shoulders 
b) is still better - remember as long as writes are there its almost the 
same as congestion_wait, but waiting for the right time to wake up 
(awoken allocs will still fail if below watermark).
And c-e) well I'm not sure yet, just things that came to my mind.

For the moment I would suggest going forward with Mels watermark wait
towards the stable tree as it "fixes" a huge issue there (or better its 
symptoms) and the patch is small, neat and matching .32.
We can then separately continue discuss without any pressure how we can 
finally get rid of all that race/latency/kswap issues at all in 2.6.3n+1

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-15 12:34               ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-15 12:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel, gregkh

Andrew Morton wrote:
> On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
>>> It still feels a bit unnatural though that the page allocator waits on
>>> congestion when what it really cares about is watermarks. Even if this
>>> patch works for Christian, I think it still has merit so will kick it a
>>> few more times.
>> In whatever way I can look at it watermark_wait should be supperior to 
>> congestion_wait. Because as Mel points out waiting for watermarks is 
>> what is semantically correct there.
> 
> If a direct-reclaimer waits for some thresholds to be achieved then what
> task is doing reclaim?
> 
> Ultimately, kswapd.  This will introduce a hard dependency upon kswapd
> activity.  This might introduce scalability problems.  And latency
> problems if kswapd if off doodling with a slow device (say), or doing a
> journal commit.  And perhaps deadlocks if kswapd tries to take a lock
> which one of the waiting-for-watermark direct relcaimers holds.

So then why not letting the process do something about it if no writes 
are outstanding instead of going to sleep. It might be able to
take care of its bad situation alone, maybe by calling try_to_free again.

> Generally, kswapd is an optional, best-effort latency optimisation
> thing and we haven't designed for it to be a critical service. 
> Probably stuff would break were we to do so.
> 
> 
> This is one of the reasons why we avoided creating such dependencies in
> reclaim.  Instead, what we do when a reclaimer is encountering lots of
> dirty or in-flight pages is
> 
> 	msleep(100);
> 
> then try again.  We're waiting for the disks, not kswapd.
> 
> Only the hard-wired 100 is a bit silly, so we made the "100" variable,
> inversely dependent upon the number of disks and their speed.  If you
> have more and faster disks then you sleep for less time.
> 
> And that's what congestion_wait() does, in a very simplistic fashion. 
> It's a facility which direct-reclaimers use to ratelimit themselves in
> inverse proportion to the speed with which the system can retire writes.

I would totally agree if I wouldn't have that scenario suffering so much
from that mechanism.

In the scenario Mel, Nick and I discussed for a while are no writes at
all, but a lot of page cache reads.
In this scenario direct_reclaimer runs quite frequently into the case of
"did_some_progress && !page" which leads to congestion_wait calls in the
caller of direct_reclaim - eventually waiting always the full timeout as
there are no writes.

I think reclaim in this case is just done by dropping clean page cache
pages in try_to_free_pages in this case -> so still no writes.
For the solution it is hard to find the right layer, as the race is in 
direct_reclaim but the wait call is outside of it.

The alternatives we have so far are:
a) congestion_wait which works fine with writes in flight in the system,
but with a huge drawback for non writing systems.
b) watermark wait which covers writes like congestion_wait (if they free
up enough) but also any other kind of reclaimers like processes freeing
up stuff, other page cache droppers.

new suggestions:
These ideas came up when trying to view it from your position. I don't 
know exactly if all are doable/feasible, but as we are going to wait 
anyway so we could do complex things in that path.

c) If direct reclaim did reasonable progress in try_to_free but did not
get a page, AND there is no write in flight at all then let it try again
to free up something.
This could be extended by some kind of max retry to avoid some weird
looping cases as well.

d) Another way might be as easy as letting congestion_wait return
immediately if there are no outstanding writes - this would keep the 
behavior for cases with write and avoid the "running always in full 
timeout" issue without writes.

e) like d, but let it go to the watermark wait if no writes exist.

So I don't consider option a) a solution as we have real world scenarios 
with huge impacts, even putting more burden on top of kswapd's shoulders 
b) is still better - remember as long as writes are there its almost the 
same as congestion_wait, but waiting for the right time to wake up 
(awoken allocs will still fail if below watermark).
And c-e) well I'm not sure yet, just things that came to my mind.

For the moment I would suggest going forward with Mels watermark wait
towards the stable tree as it "fixes" a huge issue there (or better its 
symptoms) and the patch is small, neat and matching .32.
We can then separately continue discuss without any pressure how we can 
finally get rid of all that race/latency/kswap issues at all in 2.6.3n+1

-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-15 12:29               ` Mel Gorman
@ 2010-03-15 14:45                 ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-15 14:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel



Mel Gorman wrote:
> On Fri, Mar 12, 2010 at 09:37:55AM -0500, Andrew Morton wrote:
>> On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>>
>>>> It still feels a bit unnatural though that the page allocator waits on
>>>> congestion when what it really cares about is watermarks. Even if this
>>>> patch works for Christian, I think it still has merit so will kick it a
>>>> few more times.
>>> In whatever way I can look at it watermark_wait should be supperior to 
>>> congestion_wait. Because as Mel points out waiting for watermarks is 
>>> what is semantically correct there.
>> If a direct-reclaimer waits for some thresholds to be achieved then what
>> task is doing reclaim?
>>
>> Ultimately, kswapd. 
> 
> Well, not quite. The direct reclaimer will still wake up after a timeout
> and try again regardless of whether watermarks have been met or not. The
> intention is to back after after direct reclaim has failed. Granted, the
> window during which a direct reclaim finishes and an allocation attempt
> occurs is unnecessarily large. This may be addressed by the patch that
> changes where cond_resched() is called.
> 
>> This will introduce a hard dependency upon kswapd
>> activity.  This might introduce scalability problems.  And latency
>> problems if kswapd if off doodling with a slow device (say), or doing a
>> journal commit.  And perhaps deadlocks if kswapd tries to take a lock
>> which one of the waiting-for-watermark direct relcaimers holds.
>>
> 
> What lock could they be holding? Even if that is the case, the direct
> reclaimers do not wait indefinitily.
> 
>> Generally, kswapd is an optional, best-effort latency optimisation
>> thing and we haven't designed for it to be a critical service. 
>> Probably stuff would break were we to do so.
>>
> 
> No disagreements there.
> 
>> This is one of the reasons why we avoided creating such dependencies in
>> reclaim.  Instead, what we do when a reclaimer is encountering lots of
>> dirty or in-flight pages is
>>
>> 	msleep(100);
>>
>> then try again.  We're waiting for the disks, not kswapd.
>>
>> Only the hard-wired 100 is a bit silly, so we made the "100" variable,
>> inversely dependent upon the number of disks and their speed.  If you
>> have more and faster disks then you sleep for less time.
>>
>> And that's what congestion_wait() does, in a very simplistic fashion. 
>> It's a facility which direct-reclaimers use to ratelimit themselves in
>> inverse proportion to the speed with which the system can retire writes.
>>
> 
> The problem being hit is when a direct reclaimer goes to sleep waiting
> on congestion when in reality there were not lots of dirty or in-flight
> pages. It goes to sleep for the wrong reasons and doesn't get woken up
> again until the timeout expires.
> 
> Bear in mind that even if congestion clears, it just means that dirty
> pages are now clean although I admit that the next direct reclaim it
> does is going to encounter clean pages and should succeed.
> 
> Lets see how the other patch that changes when cond_reched() gets called
> gets on. If it also works out, then it's harder to justify this patch.
> If it doesn't work out then it'll need to be kicked another few times.
> 

Unfortunately "page-allocator: Attempt page allocation immediately after 
direct reclaim" don't help. No improvement in the regression we had 
fixed with the watermark wait patch.

-> *kick*^^


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-15 14:45                 ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-15 14:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel



Mel Gorman wrote:
> On Fri, Mar 12, 2010 at 09:37:55AM -0500, Andrew Morton wrote:
>> On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>>
>>>> It still feels a bit unnatural though that the page allocator waits on
>>>> congestion when what it really cares about is watermarks. Even if this
>>>> patch works for Christian, I think it still has merit so will kick it a
>>>> few more times.
>>> In whatever way I can look at it watermark_wait should be supperior to 
>>> congestion_wait. Because as Mel points out waiting for watermarks is 
>>> what is semantically correct there.
>> If a direct-reclaimer waits for some thresholds to be achieved then what
>> task is doing reclaim?
>>
>> Ultimately, kswapd. 
> 
> Well, not quite. The direct reclaimer will still wake up after a timeout
> and try again regardless of whether watermarks have been met or not. The
> intention is to back after after direct reclaim has failed. Granted, the
> window during which a direct reclaim finishes and an allocation attempt
> occurs is unnecessarily large. This may be addressed by the patch that
> changes where cond_resched() is called.
> 
>> This will introduce a hard dependency upon kswapd
>> activity.  This might introduce scalability problems.  And latency
>> problems if kswapd if off doodling with a slow device (say), or doing a
>> journal commit.  And perhaps deadlocks if kswapd tries to take a lock
>> which one of the waiting-for-watermark direct relcaimers holds.
>>
> 
> What lock could they be holding? Even if that is the case, the direct
> reclaimers do not wait indefinitily.
> 
>> Generally, kswapd is an optional, best-effort latency optimisation
>> thing and we haven't designed for it to be a critical service. 
>> Probably stuff would break were we to do so.
>>
> 
> No disagreements there.
> 
>> This is one of the reasons why we avoided creating such dependencies in
>> reclaim.  Instead, what we do when a reclaimer is encountering lots of
>> dirty or in-flight pages is
>>
>> 	msleep(100);
>>
>> then try again.  We're waiting for the disks, not kswapd.
>>
>> Only the hard-wired 100 is a bit silly, so we made the "100" variable,
>> inversely dependent upon the number of disks and their speed.  If you
>> have more and faster disks then you sleep for less time.
>>
>> And that's what congestion_wait() does, in a very simplistic fashion. 
>> It's a facility which direct-reclaimers use to ratelimit themselves in
>> inverse proportion to the speed with which the system can retire writes.
>>
> 
> The problem being hit is when a direct reclaimer goes to sleep waiting
> on congestion when in reality there were not lots of dirty or in-flight
> pages. It goes to sleep for the wrong reasons and doesn't get woken up
> again until the timeout expires.
> 
> Bear in mind that even if congestion clears, it just means that dirty
> pages are now clean although I admit that the next direct reclaim it
> does is going to encounter clean pages and should succeed.
> 
> Lets see how the other patch that changes when cond_reched() gets called
> gets on. If it also works out, then it's harder to justify this patch.
> If it doesn't work out then it'll need to be kicked another few times.
> 

Unfortunately "page-allocator: Attempt page allocation immediately after 
direct reclaim" don't help. No improvement in the regression we had 
fixed with the watermark wait patch.

-> *kick*^^


-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-15 12:34               ` Christian Ehrhardt
@ 2010-03-15 20:09                 ` Andrew Morton
  -1 siblings, 0 replies; 136+ messages in thread
From: Andrew Morton @ 2010-03-15 20:09 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel, gregkh

On Mon, 15 Mar 2010 13:34:50 +0100
Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:

> c) If direct reclaim did reasonable progress in try_to_free but did not
> get a page, AND there is no write in flight at all then let it try again
> to free up something.
> This could be extended by some kind of max retry to avoid some weird
> looping cases as well.
> 
> d) Another way might be as easy as letting congestion_wait return
> immediately if there are no outstanding writes - this would keep the 
> behavior for cases with write and avoid the "running always in full 
> timeout" issue without writes.

They're pretty much equivalent and would work.  But there are two
things I still don't understand:

1: Why is direct reclaim calling congestion_wait() at all?  If no
writes are going on there's lots of clean pagecache around so reclaim
should trivially succeed.  What's preventing it from doing so?

2: This is, I think, new behaviour.  A regression.  What caused it?


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-15 20:09                 ` Andrew Morton
  0 siblings, 0 replies; 136+ messages in thread
From: Andrew Morton @ 2010-03-15 20:09 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel, gregkh

On Mon, 15 Mar 2010 13:34:50 +0100
Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:

> c) If direct reclaim did reasonable progress in try_to_free but did not
> get a page, AND there is no write in flight at all then let it try again
> to free up something.
> This could be extended by some kind of max retry to avoid some weird
> looping cases as well.
> 
> d) Another way might be as easy as letting congestion_wait return
> immediately if there are no outstanding writes - this would keep the 
> behavior for cases with write and avoid the "running always in full 
> timeout" issue without writes.

They're pretty much equivalent and would work.  But there are two
things I still don't understand:

1: Why is direct reclaim calling congestion_wait() at all?  If no
writes are going on there's lots of clean pagecache around so reclaim
should trivially succeed.  What's preventing it from doing so?

2: This is, I think, new behaviour.  A regression.  What caused it?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-15 20:09                 ` Andrew Morton
@ 2010-03-16 10:11                   ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-16 10:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel, gregkh

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> > 
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the 
> > behavior for cases with write and avoid the "running always in full 
> > timeout" issue without writes.
> 
> They're pretty much equivalent and would work.  But there are two
> things I still don't understand:
> 

Unfortunately, this regression is very poorly understood. I haven't been able
to reproduce it locally and while Christian has provided various debugging
information, it still isn't clear why the problem occurs now.

> 1: Why is direct reclaim calling congestion_wait() at all?  If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed.  What's preventing it from doing so?
> 

Memory pressure I think. The workload involves 16 processes (see
http://lkml.org/lkml/2009/12/7/237). I suspect they are all direct reclaimers
and some processes are getting their pages stolen before they have a
chance to allocate them. It's knowing that adding a small amount of
memory "fixes" this problem.

> 2: This is, I think, new behaviour.  A regression.  What caused it?
> 

Short answer, I don't know.

Longer answer. Initially, this was reported as being caused by commit e084b2d:
page-allocator: preserve PFN ordering when __GFP_COLD is set but it was never
established why and reverting it was unpalatable because it fixed another
performance problem. According to Christian, the controller does nothing
with the merging of IO requests and he was very sure about this. As all the
patch does is change the order that pages are returned in and the timing
slightly due to differences in cache hotness, although the fact that such
a small change could make a big difference in reclaim later was surprising.
There were other bugs that might have complicated this such as errors in free
page counters but they were fixed up and the problem still did not go away.

It was after much debugging that it was found that direct reclaim was
returning, the subsequent allocation attempt failed and congestion_wait()
was called but without dirty pages, congestion or writes, it waits for
the full timeout.  congestion_wait() was also being called a lot more
frequently so something was causing reclaim to fail more frequently
(http://lkml.org/lkml/2009/12/18/150). Again, I couldn't figure out why
e084b2d would make a difference.

Later, it got even worse because patches e084b2d and 5f8dcc21 had to be
reverted in 2.6.33 to "resolve" the problem. 5f8dcc21 was more plausible as it
affected how many pages were on the per-cpu lists but making it behave like
2.6.32 did not help the situation. Again, it looked like a very small timing
problem but it could not be isolated exactly why reclaim would fail. Again,
other bugs were found and fixed but made no difference.

What lead to this patch was recognising we could enter congestion_wait()
and wait the entire timeout because no writes were in progress or dirty
pages to be cleaned. As what was really of interest was watermarks in
this path, the patch intended to make the page allocator care about
watermarks instead of congestion. We know it was treating symptoms
rather than understanding the underlying problem but I was somewhat at a
loss to explain why small changes in timing made such a large
difference.

Any new insight is welcome.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-16 10:11                   ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-16 10:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel, gregkh

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> > 
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the 
> > behavior for cases with write and avoid the "running always in full 
> > timeout" issue without writes.
> 
> They're pretty much equivalent and would work.  But there are two
> things I still don't understand:
> 

Unfortunately, this regression is very poorly understood. I haven't been able
to reproduce it locally and while Christian has provided various debugging
information, it still isn't clear why the problem occurs now.

> 1: Why is direct reclaim calling congestion_wait() at all?  If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed.  What's preventing it from doing so?
> 

Memory pressure I think. The workload involves 16 processes (see
http://lkml.org/lkml/2009/12/7/237). I suspect they are all direct reclaimers
and some processes are getting their pages stolen before they have a
chance to allocate them. It's knowing that adding a small amount of
memory "fixes" this problem.

> 2: This is, I think, new behaviour.  A regression.  What caused it?
> 

Short answer, I don't know.

Longer answer. Initially, this was reported as being caused by commit e084b2d:
page-allocator: preserve PFN ordering when __GFP_COLD is set but it was never
established why and reverting it was unpalatable because it fixed another
performance problem. According to Christian, the controller does nothing
with the merging of IO requests and he was very sure about this. As all the
patch does is change the order that pages are returned in and the timing
slightly due to differences in cache hotness, although the fact that such
a small change could make a big difference in reclaim later was surprising.
There were other bugs that might have complicated this such as errors in free
page counters but they were fixed up and the problem still did not go away.

It was after much debugging that it was found that direct reclaim was
returning, the subsequent allocation attempt failed and congestion_wait()
was called but without dirty pages, congestion or writes, it waits for
the full timeout.  congestion_wait() was also being called a lot more
frequently so something was causing reclaim to fail more frequently
(http://lkml.org/lkml/2009/12/18/150). Again, I couldn't figure out why
e084b2d would make a difference.

Later, it got even worse because patches e084b2d and 5f8dcc21 had to be
reverted in 2.6.33 to "resolve" the problem. 5f8dcc21 was more plausible as it
affected how many pages were on the per-cpu lists but making it behave like
2.6.32 did not help the situation. Again, it looked like a very small timing
problem but it could not be isolated exactly why reclaim would fail. Again,
other bugs were found and fixed but made no difference.

What lead to this patch was recognising we could enter congestion_wait()
and wait the entire timeout because no writes were in progress or dirty
pages to be cleaned. As what was really of interest was watermarks in
this path, the patch intended to make the page allocator care about
watermarks instead of congestion. We know it was treating symptoms
rather than understanding the underlying problem but I was somewhat at a
loss to explain why small changes in timing made such a large
difference.

Any new insight is welcome.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-15 20:09                 ` Andrew Morton
@ 2010-03-18 17:42                   ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-18 17:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel, gregkh

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> > 
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the 
> > behavior for cases with write and avoid the "running always in full 
> > timeout" issue without writes.
> 
> They're pretty much equivalent and would work.  But there are two
> things I still don't understand:
> 
> 1: Why is direct reclaim calling congestion_wait() at all?  If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed.  What's preventing it from doing so?
> 
> 2: This is, I think, new behaviour.  A regression.  What caused it?
> 

I looked at this a bit closer using an iozone test very similar to
Christian's. Despite buying a number of disks, I still can't reproduce his
problem but I instrumented congestion_wait counts and times similar to
what he did.

2.6.29-instrument:congestion_waittime 990
2.6.30-instrument:congestion_waittime 2823
2.6.31-instrument:congestion_waittime 193169
2.6.32-instrument:congestion_waittime 228890
2.6.33-instrument:congestion_waittime 785529
2.6.34-rc1-instrument:congestion_waittime 797178

So in the problem window, there was *definite* increases in the time spent
in congestion_wait and the number of times it was called. I'll look
closer at this tomorrow and Monday and see can I pin down what is
happening.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-18 17:42                   ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-18 17:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel, gregkh

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> > 
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the 
> > behavior for cases with write and avoid the "running always in full 
> > timeout" issue without writes.
> 
> They're pretty much equivalent and would work.  But there are two
> things I still don't understand:
> 
> 1: Why is direct reclaim calling congestion_wait() at all?  If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed.  What's preventing it from doing so?
> 
> 2: This is, I think, new behaviour.  A regression.  What caused it?
> 

I looked at this a bit closer using an iozone test very similar to
Christian's. Despite buying a number of disks, I still can't reproduce his
problem but I instrumented congestion_wait counts and times similar to
what he did.

2.6.29-instrument:congestion_waittime 990
2.6.30-instrument:congestion_waittime 2823
2.6.31-instrument:congestion_waittime 193169
2.6.32-instrument:congestion_waittime 228890
2.6.33-instrument:congestion_waittime 785529
2.6.34-rc1-instrument:congestion_waittime 797178

So in the problem window, there was *definite* increases in the time spent
in congestion_wait and the number of times it was called. I'll look
closer at this tomorrow and Monday and see can I pin down what is
happening.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-15 20:09                 ` Andrew Morton
@ 2010-03-22 23:50                   ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-22 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo, Rik van Riel,
	Johannes Weiner

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> > 
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the 
> > behavior for cases with write and avoid the "running always in full 
> > timeout" issue without writes.
> 
> They're pretty much equivalent and would work.  But there are two
> things I still don't understand:
> 
> 1: Why is direct reclaim calling congestion_wait() at all?  If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed.  What's preventing it from doing so?
> 
> 2: This is, I think, new behaviour.  A regression.  What caused it?
> 

120+ kernels and a lot of hurt later;

Short summary - The number of times kswapd and the page allocator have been
	calling congestion_wait and the length of time it spends in there
	has been increasing since 2.6.29. Oddly, it has little to do
	with the page allocator itself.

Test scenario
=============
X86-64 machine 1 socket 4 cores
4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
	on-board and a piece of crap, and a decent RAID card could blow
	the budget.
Booted mem=256 to ensure it is fully IO-bound and match closer to what
	Christian was doing

At each test, the disks are partitioned, the raid arrays created and an
ext2 filesystem created. iozone sequential read/write tests are run with
increasing number of processes up to 64. Each test creates 8G of files. i.e.
1 process = 8G. 2 processes = 2x4G etc

	iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
	iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
	etc.

Metrics
=======

Each kernel was instrumented to collected the following stats

	pg-Stall	Page allocator stalled calling congestion_wait
	pg-Wait		The amount of time spent in congestion_wait
	pg-Rclm		Pages reclaimed by direct reclaim
	ksd-stall	balance_pgdat() (ie kswapd) staled on congestion_wait
	ksd-wait	Time spend by balance_pgdat in congestion_wait

Large differences in this do not necessarily show up in iozone because the
disks are so slow that the stalls are a tiny percentage overall. However, in
the event that there are many disks, it might be a greater problem. I believe
Christian is hitting a corner case where small delays trigger a much larger
stall.

Why The Increases
=================

The big problem here is that there was no one change. Instead, it has been
a steady build-up of a number of problems. The ones I identified are in the
block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
but need backporting and others I expect are a major surprise. Whether they
are worth backporting or not heavily depends on whether Christian's problem
is resolved.

Some of the "fixes" below are obviously not fixes at all. Gathering this data
took a significant amount of time. It'd be nice if people more familiar with
the relevant problem patches could spring a theory or patch.

The Problems
============

1. Block layer congestion queue async/sync difficulty
	fix title: asyncconfusion
	fixed in mainline? yes, in 2.6.31
	affects: 2.6.30

	2.6.30 replaced congestion queues based on read/write with sync/async
	in commit 1faa16d2. Problems were identified with this and fixed in
	2.6.31 but not backported. Backporting 8aa7e847 and 373c0a7e brings
	2.6.30 in line with 2.6.29 performance. It's not an issue for 2.6.31.

2. TTY using high order allocations more frequently
	fix title: ttyfix
	fixed in mainline? yes, in 2.6.34-rc2
	affects: 2.6.31 to 2.6.34-rc1

	2.6.31 made pty's use the same buffering logic as tty.	Unfortunately,
	it was also allowed to make high-order GFP_ATOMIC allocations. This
	triggers some high-order reclaim and introduces some stalls. It's
	fixed in 2.6.34-rc2 but needs back-porting.

3. Page reclaim evict-once logic from 56e49d21 hurts really badly
	fix title: revertevict
	fixed in mainline? no
	affects: 2.6.31 to now

	For reasons that are not immediately obvious, the evict-once patches
	*really* hurt the time spent on congestion and the number of pages
	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
	because clearly you tested this for AIM7 and might have some
	theories. For the purposes of testing, I just reverted the changes.

4. CFQ scheduler fairness commit 718eee057 causes some hurt
	fix title: none available
	fixed in mainline? no
	affects: 2.6.33 to now

	A bisection finger printed this patch as being a problem introduced
	between 2.6.32 and 2.6.33. It increases a small amount the number of
	times the page allocator stalls but drastically increased the number
	of pages reclaimed. It's not clear why the commit is such a problem.

	Unfortunately, I could not test a revert of this patch. The CFQ and
	block IO changes made in this window were extremely convulated and
	overlapped heavily with a large number of patches altering the same
	code as touched by commit 718eee057. I tried reverting everything
	made on and after this commit but the results were unsatisfactory.

	Hence, there is no fix in the results below

Results
=======

Here are the highlights of kernels tested. I'm omitting the bisection
results for obvious reasons. The metrics were gathered at two points;
after filesystem creation and after IOZone completed.

The lower the number for each metric, the better.

                                                     After Filesystem Setup                                       After IOZone
                                         pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait
2.6.29                                          0        0        0          2         1               4        3      183        152         0
2.6.30                                          1        5       34          1        25             783     3752    31939         76         0
2.6.30-asyncconfusion                           0        0        0          3         1              44       60     2656        893         0
2.6.30.10                                       0        0        0          2        43             777     3699    32661         74         0
2.6.30.10-asyncconfusion                        0        0        0          2         1              36       88     1699       1114         0

asyncconfusion can be back-ported easily to 2.6.30.10. Performance is not
perfectly in line with 2.6.29 but it's better.

2.6.31                                          0        0        0          3         1           49175   245727  2730626     176344         0
2.6.31-revertevict                              0        0        0          3         2              31      147     1887        114         0
2.6.31-ttyfix                                   0        0        0          2         2           46238   231000  2549462     170912         0
2.6.31-ttyfix-revertevict                       0        0        0          3         0               7       35      448        121         0
2.6.31.12                                       0        0        0          2         0           68897   344268  4050646     183523         0
2.6.31.12-revertevict                           0        0        0          3         1              18       87     1009        147         0
2.6.31.12-ttyfix                                0        0        0          2         0           62797   313805  3786539     173398         0
2.6.31.12-ttyfix-revertevict                    0        0        0          3         2               7       35      448        199         0

Applying the tty fixes from 2.6.34-rc2 and getting rid of the evict-once
patches bring things back in line with 2.6.29 again.

Rik, any theory on evict-once?

2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0
2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0
2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0
2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0
2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0
2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0
2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0
2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0

Again, fixing tty and reverting evict-once helps bring figures more in line
with 2.6.29.

2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0

At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
queues" has lodged itself deep within CGQ and I couldn't tear it out or
see how to fix it. Fixing tty and reverting evict-once helps but the number
of stalls is significantly increased and a much larger number of pages get
reclaimed overall.

Corrado?

2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0
2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0
2.6.34-rc1-ttyfix                               0        0        0          1         1          159603   791056  5186082     223458         0
2.6.34-rc1-ttyfix-revertevict                   0        0        0          0         0            1549     7641    50484        679         0

Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get
back to 2.6.29 performance.

Next Steps
==========

Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
2.6.30.x (assuming that is still maintained, Greg?)?

Rik, any suggestions on what can be done with evict-once?

Corrado, any suggestions on what can be done with CFQ?

Christian, can you test the following amalgamated patch on 2.6.32.10 and
2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
and revertevict. If your problem goes away, it implies that the stalls I
can measure are roughly correlated to the more significant problem you have.

===== CUT HERE =====

>From d9661adfb8e53a7647360140af3b92284cbe52d4 Mon Sep 17 00:00:00 2001
From: Alan Cox <alan@linux.intel.com>
Date: Thu, 18 Feb 2010 16:43:47 +0000
Subject: [PATCH] tty: Keep the default buffering to sub-page units

We allocate during interrupts so while our buffering is normally diced up
small anyway on some hardware at speed we can pressure the VM excessively
for page pairs. We don't really need big buffers to be linear so don't try
so hard.

In order to make this work well we will tidy up excess callers to request_room,
which cannot itself enforce this break up.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
index 66fa4e1..f27c4d6 100644
--- a/drivers/char/tty_buffer.c
+++ b/drivers/char/tty_buffer.c
@@ -247,7 +247,8 @@ int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars,
 {
 	int copied = 0;
 	do {
-		int space = tty_buffer_request_room(tty, size - copied);
+		int goal = min(size - copied, TTY_BUFFER_PAGE);
+		int space = tty_buffer_request_room(tty, goal);
 		struct tty_buffer *tb = tty->buf.tail;
 		/* If there is no space then tb may be NULL */
 		if (unlikely(space == 0))
@@ -283,7 +284,8 @@ int tty_insert_flip_string_flags(struct tty_struct *tty,
 {
 	int copied = 0;
 	do {
-		int space = tty_buffer_request_room(tty, size - copied);
+		int goal = min(size - copied, TTY_BUFFER_PAGE);
+		int space = tty_buffer_request_room(tty, goal);
 		struct tty_buffer *tb = tty->buf.tail;
 		/* If there is no space then tb may be NULL */
 		if (unlikely(space == 0))
diff --git a/include/linux/tty.h b/include/linux/tty.h
index 6abfcf5..d96e588 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -68,6 +68,16 @@ struct tty_buffer {
 	unsigned long data[0];
 };
 
+/*
+ * We default to dicing tty buffer allocations to this many characters
+ * in order to avoid multiple page allocations. We assume tty_buffer itself
+ * is under 256 bytes. See tty_buffer_find for the allocation logic this
+ * must match
+ */
+
+#define TTY_BUFFER_PAGE		((PAGE_SIZE  - 256) / 2)
+
+
 struct tty_bufhead {
 	struct delayed_work work;
 	spinlock_t lock;
>From 352fa6ad16b89f8ffd1a93b4419b1a8f2259feab Mon Sep 17 00:00:00 2001
From: Mel Gorman <mel@csn.ul.ie>
Date: Tue, 2 Mar 2010 22:24:19 +0000
Subject: [PATCH] tty: Take a 256 byte padding into account when buffering below sub-page units

The TTY layer takes some care to ensure that only sub-page allocations
are made with interrupts disabled. It does this by setting a goal of
"TTY_BUFFER_PAGE" to allocate. Unfortunately, while TTY_BUFFER_PAGE takes the
size of tty_buffer into account, it fails to account that tty_buffer_find()
rounds the buffer size out to the next 256 byte boundary before adding on
the size of the tty_buffer.

This patch adjusts the TTY_BUFFER_PAGE calculation to take into account the
size of the tty_buffer and the padding. Once applied, tty_buffer_alloc()
should not require high-order allocations.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

diff --git a/include/linux/tty.h b/include/linux/tty.h
index 568369a..593228a 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -70,12 +70,13 @@ struct tty_buffer {
 
 /*
  * We default to dicing tty buffer allocations to this many characters
- * in order to avoid multiple page allocations. We assume tty_buffer itself
- * is under 256 bytes. See tty_buffer_find for the allocation logic this
- * must match
+ * in order to avoid multiple page allocations. We know the size of
+ * tty_buffer itself but it must also be taken into account that the
+ * the buffer is 256 byte aligned. See tty_buffer_find for the allocation
+ * logic this must match
  */
 
-#define TTY_BUFFER_PAGE		((PAGE_SIZE  - 256) / 2)
+#define TTY_BUFFER_PAGE	(((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)
 
 
 struct tty_bufhead {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf9213b..5ba0d9a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -94,7 +94,6 @@ extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
 extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
 							int priority);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru);
@@ -243,12 +242,6 @@ mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 	return 1;
 }
 
-static inline int
-mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
-{
-	return 1;
-}
-
 static inline unsigned long
 mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
 			 enum lru_list lru)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 66035bf..bbb0eda 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -843,17 +843,6 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 	return 0;
 }
 
-int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
-{
-	unsigned long active;
-	unsigned long inactive;
-
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
-	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
-
-	return (active > inactive);
-}
-
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 692807f..5512301 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1428,59 +1428,13 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 	return low;
 }
 
-static int inactive_file_is_low_global(struct zone *zone)
-{
-	unsigned long active, inactive;
-
-	active = zone_page_state(zone, NR_ACTIVE_FILE);
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-
-	return (active > inactive);
-}
-
-/**
- * inactive_file_is_low - check if file pages need to be deactivated
- * @zone: zone to check
- * @sc:   scan control of this context
- *
- * When the system is doing streaming IO, memory pressure here
- * ensures that active file pages get deactivated, until more
- * than half of the file pages are on the inactive list.
- *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
- *
- * This uses a different ratio than the anonymous pages, because
- * the page cache uses a use-once replacement algorithm.
- */
-static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
-{
-	int low;
-
-	if (scanning_global_lru(sc))
-		low = inactive_file_is_low_global(zone);
-	else
-		low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
-	return low;
-}
-
-static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
-				int file)
-{
-	if (file)
-		return inactive_file_is_low(zone, sc);
-	else
-		return inactive_anon_is_low(zone, sc);
-}
-
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
 
-	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(zone, sc, file))
-		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
+	if (lru == LRU_ACTIVE_FILE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-22 23:50                   ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-22 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Ehrhardt, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo, Rik van Riel,
	Johannes Weiner

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> 
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> > 
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the 
> > behavior for cases with write and avoid the "running always in full 
> > timeout" issue without writes.
> 
> They're pretty much equivalent and would work.  But there are two
> things I still don't understand:
> 
> 1: Why is direct reclaim calling congestion_wait() at all?  If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed.  What's preventing it from doing so?
> 
> 2: This is, I think, new behaviour.  A regression.  What caused it?
> 

120+ kernels and a lot of hurt later;

Short summary - The number of times kswapd and the page allocator have been
	calling congestion_wait and the length of time it spends in there
	has been increasing since 2.6.29. Oddly, it has little to do
	with the page allocator itself.

Test scenario
=============
X86-64 machine 1 socket 4 cores
4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
	on-board and a piece of crap, and a decent RAID card could blow
	the budget.
Booted mem=256 to ensure it is fully IO-bound and match closer to what
	Christian was doing

At each test, the disks are partitioned, the raid arrays created and an
ext2 filesystem created. iozone sequential read/write tests are run with
increasing number of processes up to 64. Each test creates 8G of files. i.e.
1 process = 8G. 2 processes = 2x4G etc

	iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
	iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
	etc.

Metrics
=======

Each kernel was instrumented to collected the following stats

	pg-Stall	Page allocator stalled calling congestion_wait
	pg-Wait		The amount of time spent in congestion_wait
	pg-Rclm		Pages reclaimed by direct reclaim
	ksd-stall	balance_pgdat() (ie kswapd) staled on congestion_wait
	ksd-wait	Time spend by balance_pgdat in congestion_wait

Large differences in this do not necessarily show up in iozone because the
disks are so slow that the stalls are a tiny percentage overall. However, in
the event that there are many disks, it might be a greater problem. I believe
Christian is hitting a corner case where small delays trigger a much larger
stall.

Why The Increases
=================

The big problem here is that there was no one change. Instead, it has been
a steady build-up of a number of problems. The ones I identified are in the
block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
but need backporting and others I expect are a major surprise. Whether they
are worth backporting or not heavily depends on whether Christian's problem
is resolved.

Some of the "fixes" below are obviously not fixes at all. Gathering this data
took a significant amount of time. It'd be nice if people more familiar with
the relevant problem patches could spring a theory or patch.

The Problems
============

1. Block layer congestion queue async/sync difficulty
	fix title: asyncconfusion
	fixed in mainline? yes, in 2.6.31
	affects: 2.6.30

	2.6.30 replaced congestion queues based on read/write with sync/async
	in commit 1faa16d2. Problems were identified with this and fixed in
	2.6.31 but not backported. Backporting 8aa7e847 and 373c0a7e brings
	2.6.30 in line with 2.6.29 performance. It's not an issue for 2.6.31.

2. TTY using high order allocations more frequently
	fix title: ttyfix
	fixed in mainline? yes, in 2.6.34-rc2
	affects: 2.6.31 to 2.6.34-rc1

	2.6.31 made pty's use the same buffering logic as tty.	Unfortunately,
	it was also allowed to make high-order GFP_ATOMIC allocations. This
	triggers some high-order reclaim and introduces some stalls. It's
	fixed in 2.6.34-rc2 but needs back-porting.

3. Page reclaim evict-once logic from 56e49d21 hurts really badly
	fix title: revertevict
	fixed in mainline? no
	affects: 2.6.31 to now

	For reasons that are not immediately obvious, the evict-once patches
	*really* hurt the time spent on congestion and the number of pages
	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
	because clearly you tested this for AIM7 and might have some
	theories. For the purposes of testing, I just reverted the changes.

4. CFQ scheduler fairness commit 718eee057 causes some hurt
	fix title: none available
	fixed in mainline? no
	affects: 2.6.33 to now

	A bisection finger printed this patch as being a problem introduced
	between 2.6.32 and 2.6.33. It increases a small amount the number of
	times the page allocator stalls but drastically increased the number
	of pages reclaimed. It's not clear why the commit is such a problem.

	Unfortunately, I could not test a revert of this patch. The CFQ and
	block IO changes made in this window were extremely convulated and
	overlapped heavily with a large number of patches altering the same
	code as touched by commit 718eee057. I tried reverting everything
	made on and after this commit but the results were unsatisfactory.

	Hence, there is no fix in the results below

Results
=======

Here are the highlights of kernels tested. I'm omitting the bisection
results for obvious reasons. The metrics were gathered at two points;
after filesystem creation and after IOZone completed.

The lower the number for each metric, the better.

                                                     After Filesystem Setup                                       After IOZone
                                         pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait
2.6.29                                          0        0        0          2         1               4        3      183        152         0
2.6.30                                          1        5       34          1        25             783     3752    31939         76         0
2.6.30-asyncconfusion                           0        0        0          3         1              44       60     2656        893         0
2.6.30.10                                       0        0        0          2        43             777     3699    32661         74         0
2.6.30.10-asyncconfusion                        0        0        0          2         1              36       88     1699       1114         0

asyncconfusion can be back-ported easily to 2.6.30.10. Performance is not
perfectly in line with 2.6.29 but it's better.

2.6.31                                          0        0        0          3         1           49175   245727  2730626     176344         0
2.6.31-revertevict                              0        0        0          3         2              31      147     1887        114         0
2.6.31-ttyfix                                   0        0        0          2         2           46238   231000  2549462     170912         0
2.6.31-ttyfix-revertevict                       0        0        0          3         0               7       35      448        121         0
2.6.31.12                                       0        0        0          2         0           68897   344268  4050646     183523         0
2.6.31.12-revertevict                           0        0        0          3         1              18       87     1009        147         0
2.6.31.12-ttyfix                                0        0        0          2         0           62797   313805  3786539     173398         0
2.6.31.12-ttyfix-revertevict                    0        0        0          3         2               7       35      448        199         0

Applying the tty fixes from 2.6.34-rc2 and getting rid of the evict-once
patches bring things back in line with 2.6.29 again.

Rik, any theory on evict-once?

2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0
2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0
2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0
2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0
2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0
2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0
2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0
2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0

Again, fixing tty and reverting evict-once helps bring figures more in line
with 2.6.29.

2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0

At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
queues" has lodged itself deep within CGQ and I couldn't tear it out or
see how to fix it. Fixing tty and reverting evict-once helps but the number
of stalls is significantly increased and a much larger number of pages get
reclaimed overall.

Corrado?

2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0
2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0
2.6.34-rc1-ttyfix                               0        0        0          1         1          159603   791056  5186082     223458         0
2.6.34-rc1-ttyfix-revertevict                   0        0        0          0         0            1549     7641    50484        679         0

Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get
back to 2.6.29 performance.

Next Steps
==========

Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
2.6.30.x (assuming that is still maintained, Greg?)?

Rik, any suggestions on what can be done with evict-once?

Corrado, any suggestions on what can be done with CFQ?

Christian, can you test the following amalgamated patch on 2.6.32.10 and
2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
and revertevict. If your problem goes away, it implies that the stalls I
can measure are roughly correlated to the more significant problem you have.

===== CUT HERE =====

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-22 23:50                   ` Mel Gorman
@ 2010-03-23 14:35                     ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-23 14:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel, gregkh, Corrado Zoccolo, Rik van Riel,
	Johannes Weiner

Mel Gorman wrote:
> On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
>> On Mon, 15 Mar 2010 13:34:50 +0100
>> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>>
[...]
> 
> 120+ kernels and a lot of hurt later;

Thanks for all your effort in searching the real cause behind 
congestion_wait becoming such a time sink for some benchmarks.

> 
> 2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0
> 2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0
> 2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0
> 2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0
> 2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0
> 2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0
> 2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0
> 2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0
> 
> Again, fixing tty and reverting evict-once helps bring figures more in line
> with 2.6.29.
> 
> 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
> 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
> 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
> 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
> 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
> 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
> 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
> 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0
> 

[...]

> 
> Christian, can you test the following amalgamated patch on 2.6.32.10 and
> 2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
> cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
> and revertevict. If your problem goes away, it implies that the stalls I
> can measure are roughly correlated to the more significant problem you have.

While your tty&evict patch might fix something as seen by your numbers, 
it unfortunately doesn't affect my big throughput loss.

Again the scenario was 4,8 and 16 threads iozone sequential read with 
2Gb files and one disk per process, running on a s390x machine with 4 
cpus and 256m.
My table shows the throughput deviation to plain 2.6.32 git in percent.

percentage                       4thr     8thr    16thr
2.6.32                          0.00%    0.00%    0.00%
2.6.32.10 (stable)              4.44%    7.97%    4.11%
2.6.32.10-ttyfix-revertevict    3.33%    6.64%    5.07%
2.6.33                          5.33%   -2.82%  -10.87%
2.6.33-ttyfix-revertevict       3.33%   -3.32%  -10.51%
2.6.32-watermarkwait           40.00%   58.47%   42.03%

In terms of throughput for my load your patch doesn't change anything 
significantly above the noise level of the test case (which is around 
~1%). The fix probably even has a slight performance decrease in low 
thread cases.

For better comparison I added a 2.6.32 run with your watermark wait 
patch which is still the only one fixing the issue.

That said I'd still love to see watermark wait getting accepted :-)

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-23 14:35                     ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-03-23 14:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, Nick Piggin, Chris Mason, Jens Axboe,
	linux-kernel, gregkh, Corrado Zoccolo, Rik van Riel,
	Johannes Weiner

Mel Gorman wrote:
> On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
>> On Mon, 15 Mar 2010 13:34:50 +0100
>> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>>
[...]
> 
> 120+ kernels and a lot of hurt later;

Thanks for all your effort in searching the real cause behind 
congestion_wait becoming such a time sink for some benchmarks.

> 
> 2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0
> 2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0
> 2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0
> 2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0
> 2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0
> 2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0
> 2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0
> 2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0
> 
> Again, fixing tty and reverting evict-once helps bring figures more in line
> with 2.6.29.
> 
> 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
> 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
> 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
> 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
> 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
> 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
> 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
> 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0
> 

[...]

> 
> Christian, can you test the following amalgamated patch on 2.6.32.10 and
> 2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
> cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
> and revertevict. If your problem goes away, it implies that the stalls I
> can measure are roughly correlated to the more significant problem you have.

While your tty&evict patch might fix something as seen by your numbers, 
it unfortunately doesn't affect my big throughput loss.

Again the scenario was 4,8 and 16 threads iozone sequential read with 
2Gb files and one disk per process, running on a s390x machine with 4 
cpus and 256m.
My table shows the throughput deviation to plain 2.6.32 git in percent.

percentage                       4thr     8thr    16thr
2.6.32                          0.00%    0.00%    0.00%
2.6.32.10 (stable)              4.44%    7.97%    4.11%
2.6.32.10-ttyfix-revertevict    3.33%    6.64%    5.07%
2.6.33                          5.33%   -2.82%  -10.87%
2.6.33-ttyfix-revertevict       3.33%   -3.32%  -10.51%
2.6.32-watermarkwait           40.00%   58.47%   42.03%

In terms of throughput for my load your patch doesn't change anything 
significantly above the noise level of the test case (which is around 
~1%). The fix probably even has a slight performance decrease in low 
thread cases.

For better comparison I added a 2.6.32 run with your watermark wait 
patch which is still the only one fixing the issue.

That said I'd still love to see watermark wait getting accepted :-)

-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone  pressure
  2010-03-22 23:50                   ` Mel Gorman
@ 2010-03-23 21:35                     ` Corrado Zoccolo
  -1 siblings, 0 replies; 136+ messages in thread
From: Corrado Zoccolo @ 2010-03-23 21:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Rik van Riel,
	Johannes Weiner

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 25808 bytes --]

Hi Mel,On Tue, Mar 23, 2010 at 12:50 AM, Mel Gorman <mel@csn.ul.ie> wrote:> On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:>> On Mon, 15 Mar 2010 13:34:50 +0100>> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:>>>> > c) If direct reclaim did reasonable progress in try_to_free but did not>> > get a page, AND there is no write in flight at all then let it try again>> > to free up something.>> > This could be extended by some kind of max retry to avoid some weird>> > looping cases as well.>> >>> > d) Another way might be as easy as letting congestion_wait return>> > immediately if there are no outstanding writes - this would keep the>> > behavior for cases with write and avoid the "running always in full>> > timeout" issue without writes.>>>> They're pretty much equivalent and would work.  But there are two>> things I still don't understand:>>>> 1: Why is direct reclaim calling congestion_wait() at all?  If no>> writes are going on there's lots of clean pagecache around so reclaim>> should trivially succeed.  What's preventing it from doing so?>>>> 2: This is, I think, new behaviour.  A regression.  What caused it?>>>> 120+ kernels and a lot of hurt later;>> Short summary - The number of times kswapd and the page allocator have been>        calling congestion_wait and the length of time it spends in there>        has been increasing since 2.6.29. Oddly, it has little to do>        with the page allocator itself.>> Test scenario> =============> X86-64 machine 1 socket 4 cores> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller>        on-board and a piece of crap, and a decent RAID card could blow>        the budget.> Booted mem=256 to ensure it is fully IO-bound and match closer to what>        Christian was doing>> At each test, the disks are partitioned, the raid arrays created and an> ext2 filesystem created. iozone sequential read/write tests are run with> increasing number of processes up to 64. Each test creates 8G of files. i.e.> 1 process = 8G. 2 processes = 2x4G etc>>        iozone -s 8388608 -t 1 -r 64 -i 0 -i 1>        iozone -s 4194304 -t 2 -r 64 -i 0 -i 1>        etc.>> Metrics> =======>> Each kernel was instrumented to collected the following stats>>        pg-Stall        Page allocator stalled calling congestion_wait>        pg-Wait         The amount of time spent in congestion_wait>        pg-Rclm         Pages reclaimed by direct reclaim>        ksd-stall       balance_pgdat() (ie kswapd) staled on congestion_wait>        ksd-wait        Time spend by balance_pgdat in congestion_wait>> Large differences in this do not necessarily show up in iozone because the> disks are so slow that the stalls are a tiny percentage overall. However, in> the event that there are many disks, it might be a greater problem. I believe> Christian is hitting a corner case where small delays trigger a much larger> stall.>> Why The Increases> =================>> The big problem here is that there was no one change. Instead, it has been> a steady build-up of a number of problems. The ones I identified are in the> block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed> but need backporting and others I expect are a major surprise. Whether they> are worth backporting or not heavily depends on whether Christian's problem> is resolved.>> Some of the "fixes" below are obviously not fixes at all. Gathering this data> took a significant amount of time. It'd be nice if people more familiar with> the relevant problem patches could spring a theory or patch.>> The Problems> ============>> 1. Block layer congestion queue async/sync difficulty>        fix title: asyncconfusion>        fixed in mainline? yes, in 2.6.31>        affects: 2.6.30>>        2.6.30 replaced congestion queues based on read/write with sync/async>        in commit 1faa16d2. Problems were identified with this and fixed in>        2.6.31 but not backported. Backporting 8aa7e847 and 373c0a7e brings>        2.6.30 in line with 2.6.29 performance. It's not an issue for 2.6.31.>> 2. TTY using high order allocations more frequently>        fix title: ttyfix>        fixed in mainline? yes, in 2.6.34-rc2>        affects: 2.6.31 to 2.6.34-rc1>>        2.6.31 made pty's use the same buffering logic as tty.  Unfortunately,>        it was also allowed to make high-order GFP_ATOMIC allocations. This>        triggers some high-order reclaim and introduces some stalls. It's>        fixed in 2.6.34-rc2 but needs back-porting.>> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly>        fix title: revertevict>        fixed in mainline? no>        affects: 2.6.31 to now>>        For reasons that are not immediately obvious, the evict-once patches>        *really* hurt the time spent on congestion and the number of pages>        reclaimed. Rik, I'm afaid I'm punting this to you for explanation>        because clearly you tested this for AIM7 and might have some>        theories. For the purposes of testing, I just reverted the changes.>> 4. CFQ scheduler fairness commit 718eee057 causes some hurt>        fix title: none available>        fixed in mainline? no>        affects: 2.6.33 to now>>        A bisection finger printed this patch as being a problem introduced>        between 2.6.32 and 2.6.33. It increases a small amount the number of>        times the page allocator stalls but drastically increased the number>        of pages reclaimed. It's not clear why the commit is such a problem.>>        Unfortunately, I could not test a revert of this patch. The CFQ and>        block IO changes made in this window were extremely convulated and>        overlapped heavily with a large number of patches altering the same>        code as touched by commit 718eee057. I tried reverting everything>        made on and after this commit but the results were unsatisfactory.>>        Hence, there is no fix in the results below>> Results> =======>> Here are the highlights of kernels tested. I'm omitting the bisection> results for obvious reasons. The metrics were gathered at two points;> after filesystem creation and after IOZone completed.>> The lower the number for each metric, the better.>>                                                     After Filesystem Setup                                       After IOZone>                                         pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait> 2.6.29                                          0        0        0          2         1               4        3      183        152         0> 2.6.30                                          1        5       34          1        25             783     3752    31939         76         0> 2.6.30-asyncconfusion                           0        0        0          3         1              44       60     2656        893         0> 2.6.30.10                                       0        0        0          2        43             777     3699    32661         74         0> 2.6.30.10-asyncconfusion                        0        0        0          2         1              36       88     1699       1114         0>> asyncconfusion can be back-ported easily to 2.6.30.10. Performance is not> perfectly in line with 2.6.29 but it's better.>> 2.6.31                                          0        0        0          3         1           49175   245727  2730626     176344         0> 2.6.31-revertevict                              0        0        0          3         2              31      147     1887        114         0> 2.6.31-ttyfix                                   0        0        0          2         2           46238   231000  2549462     170912         0> 2.6.31-ttyfix-revertevict                       0        0        0          3         0               7       35      448        121         0> 2.6.31.12                                       0        0        0          2         0           68897   344268  4050646     183523         0> 2.6.31.12-revertevict                           0        0        0          3         1              18       87     1009        147         0> 2.6.31.12-ttyfix                                0        0        0          2         0           62797   313805  3786539     173398         0> 2.6.31.12-ttyfix-revertevict                    0        0        0          3         2               7       35      448        199         0>> Applying the tty fixes from 2.6.34-rc2 and getting rid of the evict-once> patches bring things back in line with 2.6.29 again.>> Rik, any theory on evict-once?>> 2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0> 2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0> 2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0> 2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0> 2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0> 2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0> 2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0> 2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0>> Again, fixing tty and reverting evict-once helps bring figures more in line> with 2.6.29.>> 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0> 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0> 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0> 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0> 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0> 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0> 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0> 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0>> At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle> queues" has lodged itself deep within CGQ and I couldn't tear it out or> see how to fix it. Fixing tty and reverting evict-once helps but the number> of stalls is significantly increased and a much larger number of pages get> reclaimed overall.>> Corrado?
The major changes in I/O scheduing behaviour are:* buffered writes:  * before we could schedule few writes, then interrupt them to dosome reads, and then go back to writes; now we guarantee someuninterruptible time slice for writes, but the delay between twoslices is increased. The total write throughput averaged over a timewindow larger than 300ms should be comparable, or even better with2.6.33. Note that the commit you cite has introduced a bug regardingwrite throughput on NCQ disks that was later fixed by 1efe8fe1, mergedbefore 2.6.33 (this may lead to confusing bisection results).* reads (and sync writes):  * before, we serviced a single process for 100ms, then switched toan other, and so on.  * after, we go round robin for random requests (they get a unifiedtime slice, like buffered writes do), and we have consecutive timeslices for sequential requests, but the length of the slice is reducedwhen the number of concurrent processes doing I/O increases.
This means that with 16 processes doing sequential I/O on the samedisk, before you were switching between processes every 100ms, and nowevery 32ms. The old behaviour can be brought back by setting/sys/block/sd*/queue/iosched/low_latency to 0.For random I/O, the situation (going round robin, it will translate toswitching every 8 ms on average) is not revertable via flags.
>> 2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0> 2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0> 2.6.34-rc1-ttyfix                               0        0        0          1         1          159603   791056  5186082     223458         0> 2.6.34-rc1-ttyfix-revertevict                   0        0        0          0         0            1549     7641    50484        679         0>> Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get> back to 2.6.29 performance.>> Next Steps> ==========>> Jens, any problems with me backporting the async/sync fixes from 2.6.31 to> 2.6.30.x (assuming that is still maintained, Greg?)?>> Rik, any suggestions on what can be done with evict-once?>> Corrado, any suggestions on what can be done with CFQ?
If my intuition that switching between processes too often isdetrimental when you have memory pressure (higher probability to needto re-page-in some of the pages that were just discarded), I suggesttrying setting low_latency to 0, and maybe increasing the slice_sync(to get more slice to a single process before switching to an other),slice_async (to give more uninterruptible time to buffered writes) andslice_async_rq (to higher the limit of consecutive write requests canbe sent to disk).While this would normally lead to a bad user experience on a systemwith plenty of memory, it should keep things acceptable when paging in/ swapping / dirty page writeback is overwhelming.
Corrado
>> Christian, can you test the following amalgamated patch on 2.6.32.10 and> 2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply> cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix> and revertevict. If your problem goes away, it implies that the stalls I> can measure are roughly correlated to the more significant problem you have.>> ===== CUT HERE =====>> From d9661adfb8e53a7647360140af3b92284cbe52d4 Mon Sep 17 00:00:00 2001> From: Alan Cox <alan@linux.intel.com>> Date: Thu, 18 Feb 2010 16:43:47 +0000> Subject: [PATCH] tty: Keep the default buffering to sub-page units>> We allocate during interrupts so while our buffering is normally diced up> small anyway on some hardware at speed we can pressure the VM excessively> for page pairs. We don't really need big buffers to be linear so don't try> so hard.>> In order to make this work well we will tidy up excess callers to request_room,> which cannot itself enforce this break up.>> Signed-off-by: Alan Cox <alan@linux.intel.com>> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>>> diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c> index 66fa4e1..f27c4d6 100644> --- a/drivers/char/tty_buffer.c> +++ b/drivers/char/tty_buffer.c> @@ -247,7 +247,8 @@ int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars,>  {>        int copied = 0;>        do {> -               int space = tty_buffer_request_room(tty, size - copied);> +               int goal = min(size - copied, TTY_BUFFER_PAGE);> +               int space = tty_buffer_request_room(tty, goal);>                struct tty_buffer *tb = tty->buf.tail;>                /* If there is no space then tb may be NULL */>                if (unlikely(space == 0))> @@ -283,7 +284,8 @@ int tty_insert_flip_string_flags(struct tty_struct *tty,>  {>        int copied = 0;>        do {> -               int space = tty_buffer_request_room(tty, size - copied);> +               int goal = min(size - copied, TTY_BUFFER_PAGE);> +               int space = tty_buffer_request_room(tty, goal);>                struct tty_buffer *tb = tty->buf.tail;>                /* If there is no space then tb may be NULL */>                if (unlikely(space == 0))> diff --git a/include/linux/tty.h b/include/linux/tty.h> index 6abfcf5..d96e588 100644> --- a/include/linux/tty.h> +++ b/include/linux/tty.h> @@ -68,6 +68,16 @@ struct tty_buffer {>        unsigned long data[0];>  };>> +/*> + * We default to dicing tty buffer allocations to this many characters> + * in order to avoid multiple page allocations. We assume tty_buffer itself> + * is under 256 bytes. See tty_buffer_find for the allocation logic this> + * must match> + */> +> +#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)> +> +>  struct tty_bufhead {>        struct delayed_work work;>        spinlock_t lock;> From 352fa6ad16b89f8ffd1a93b4419b1a8f2259feab Mon Sep 17 00:00:00 2001> From: Mel Gorman <mel@csn.ul.ie>> Date: Tue, 2 Mar 2010 22:24:19 +0000> Subject: [PATCH] tty: Take a 256 byte padding into account when buffering below sub-page units>> The TTY layer takes some care to ensure that only sub-page allocations> are made with interrupts disabled. It does this by setting a goal of> "TTY_BUFFER_PAGE" to allocate. Unfortunately, while TTY_BUFFER_PAGE takes the> size of tty_buffer into account, it fails to account that tty_buffer_find()> rounds the buffer size out to the next 256 byte boundary before adding on> the size of the tty_buffer.>> This patch adjusts the TTY_BUFFER_PAGE calculation to take into account the> size of the tty_buffer and the padding. Once applied, tty_buffer_alloc()> should not require high-order allocations.>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>> Cc: stable <stable@kernel.org>> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>>> diff --git a/include/linux/tty.h b/include/linux/tty.h> index 568369a..593228a 100644> --- a/include/linux/tty.h> +++ b/include/linux/tty.h> @@ -70,12 +70,13 @@ struct tty_buffer {>>  /*>  * We default to dicing tty buffer allocations to this many characters> - * in order to avoid multiple page allocations. We assume tty_buffer itself> - * is under 256 bytes. See tty_buffer_find for the allocation logic this> - * must match> + * in order to avoid multiple page allocations. We know the size of> + * tty_buffer itself but it must also be taken into account that the> + * the buffer is 256 byte aligned. See tty_buffer_find for the allocation> + * logic this must match>  */>> -#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)> +#define TTY_BUFFER_PAGE        (((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)>>>  struct tty_bufhead {> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h> index bf9213b..5ba0d9a 100644> --- a/include/linux/memcontrol.h> +++ b/include/linux/memcontrol.h> @@ -94,7 +94,6 @@ extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,>  extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,>                                                        int priority);>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);> -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,>                                       struct zone *zone,>                                       enum lru_list lru);> @@ -243,12 +242,6 @@ mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)>        return 1;>  }>> -static inline int> -mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)> -{> -       return 1;> -}> ->  static inline unsigned long>  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,>                         enum lru_list lru)> diff --git a/mm/memcontrol.c b/mm/memcontrol.c> index 66035bf..bbb0eda 100644> --- a/mm/memcontrol.c> +++ b/mm/memcontrol.c> @@ -843,17 +843,6 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)>        return 0;>  }>> -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)> -{> -       unsigned long active;> -       unsigned long inactive;> -> -       inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);> -       active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);> -> -       return (active > inactive);> -}> ->  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,>                                       struct zone *zone,>                                       enum lru_list lru)> diff --git a/mm/vmscan.c b/mm/vmscan.c> index 692807f..5512301 100644> --- a/mm/vmscan.c> +++ b/mm/vmscan.c> @@ -1428,59 +1428,13 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)>        return low;>  }>> -static int inactive_file_is_low_global(struct zone *zone)> -{> -       unsigned long active, inactive;> -> -       active = zone_page_state(zone, NR_ACTIVE_FILE);> -       inactive = zone_page_state(zone, NR_INACTIVE_FILE);> -> -       return (active > inactive);> -}> -> -/**> - * inactive_file_is_low - check if file pages need to be deactivated> - * @zone: zone to check> - * @sc:   scan control of this context> - *> - * When the system is doing streaming IO, memory pressure here> - * ensures that active file pages get deactivated, until more> - * than half of the file pages are on the inactive list.> - *> - * Once we get to that situation, protect the system's working> - * set from being evicted by disabling active file page aging.> - *> - * This uses a different ratio than the anonymous pages, because> - * the page cache uses a use-once replacement algorithm.> - */> -static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)> -{> -       int low;> -> -       if (scanning_global_lru(sc))> -               low = inactive_file_is_low_global(zone);> -       else> -               low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);> -       return low;> -}> -> -static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,> -                               int file)> -{> -       if (file)> -               return inactive_file_is_low(zone, sc);> -       else> -               return inactive_anon_is_low(zone, sc);> -}> ->  static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,>        struct zone *zone, struct scan_control *sc, int priority)>  {>        int file = is_file_lru(lru);>> -       if (is_active_lru(lru)) {> -               if (inactive_list_is_low(zone, sc, file))> -                   shrink_active_list(nr_to_scan, zone, sc, priority, file);> +       if (lru == LRU_ACTIVE_FILE) {> +               shrink_active_list(nr_to_scan, zone, sc, priority, file);>                return 0;>        }>>> --> Mel Gorman> Part-time Phd Student                          Linux Technology Center> University of Limerick                         IBM Dublin Software Lab>ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-23 21:35                     ` Corrado Zoccolo
  0 siblings, 0 replies; 136+ messages in thread
From: Corrado Zoccolo @ 2010-03-23 21:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Rik van Riel,
	Johannes Weiner

Hi Mel,
On Tue, Mar 23, 2010 at 12:50 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
>> On Mon, 15 Mar 2010 13:34:50 +0100
>> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>>
>> > c) If direct reclaim did reasonable progress in try_to_free but did not
>> > get a page, AND there is no write in flight at all then let it try again
>> > to free up something.
>> > This could be extended by some kind of max retry to avoid some weird
>> > looping cases as well.
>> >
>> > d) Another way might be as easy as letting congestion_wait return
>> > immediately if there are no outstanding writes - this would keep the
>> > behavior for cases with write and avoid the "running always in full
>> > timeout" issue without writes.
>>
>> They're pretty much equivalent and would work.  But there are two
>> things I still don't understand:
>>
>> 1: Why is direct reclaim calling congestion_wait() at all?  If no
>> writes are going on there's lots of clean pagecache around so reclaim
>> should trivially succeed.  What's preventing it from doing so?
>>
>> 2: This is, I think, new behaviour.  A regression.  What caused it?
>>
>
> 120+ kernels and a lot of hurt later;
>
> Short summary - The number of times kswapd and the page allocator have been
>        calling congestion_wait and the length of time it spends in there
>        has been increasing since 2.6.29. Oddly, it has little to do
>        with the page allocator itself.
>
> Test scenario
> =============
> X86-64 machine 1 socket 4 cores
> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
>        on-board and a piece of crap, and a decent RAID card could blow
>        the budget.
> Booted mem=256 to ensure it is fully IO-bound and match closer to what
>        Christian was doing
>
> At each test, the disks are partitioned, the raid arrays created and an
> ext2 filesystem created. iozone sequential read/write tests are run with
> increasing number of processes up to 64. Each test creates 8G of files. i.e.
> 1 process = 8G. 2 processes = 2x4G etc
>
>        iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
>        iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
>        etc.
>
> Metrics
> =======
>
> Each kernel was instrumented to collected the following stats
>
>        pg-Stall        Page allocator stalled calling congestion_wait
>        pg-Wait         The amount of time spent in congestion_wait
>        pg-Rclm         Pages reclaimed by direct reclaim
>        ksd-stall       balance_pgdat() (ie kswapd) staled on congestion_wait
>        ksd-wait        Time spend by balance_pgdat in congestion_wait
>
> Large differences in this do not necessarily show up in iozone because the
> disks are so slow that the stalls are a tiny percentage overall. However, in
> the event that there are many disks, it might be a greater problem. I believe
> Christian is hitting a corner case where small delays trigger a much larger
> stall.
>
> Why The Increases
> =================
>
> The big problem here is that there was no one change. Instead, it has been
> a steady build-up of a number of problems. The ones I identified are in the
> block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
> but need backporting and others I expect are a major surprise. Whether they
> are worth backporting or not heavily depends on whether Christian's problem
> is resolved.
>
> Some of the "fixes" below are obviously not fixes at all. Gathering this data
> took a significant amount of time. It'd be nice if people more familiar with
> the relevant problem patches could spring a theory or patch.
>
> The Problems
> ============
>
> 1. Block layer congestion queue async/sync difficulty
>        fix title: asyncconfusion
>        fixed in mainline? yes, in 2.6.31
>        affects: 2.6.30
>
>        2.6.30 replaced congestion queues based on read/write with sync/async
>        in commit 1faa16d2. Problems were identified with this and fixed in
>        2.6.31 but not backported. Backporting 8aa7e847 and 373c0a7e brings
>        2.6.30 in line with 2.6.29 performance. It's not an issue for 2.6.31.
>
> 2. TTY using high order allocations more frequently
>        fix title: ttyfix
>        fixed in mainline? yes, in 2.6.34-rc2
>        affects: 2.6.31 to 2.6.34-rc1
>
>        2.6.31 made pty's use the same buffering logic as tty.  Unfortunately,
>        it was also allowed to make high-order GFP_ATOMIC allocations. This
>        triggers some high-order reclaim and introduces some stalls. It's
>        fixed in 2.6.34-rc2 but needs back-porting.
>
> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>        fix title: revertevict
>        fixed in mainline? no
>        affects: 2.6.31 to now
>
>        For reasons that are not immediately obvious, the evict-once patches
>        *really* hurt the time spent on congestion and the number of pages
>        reclaimed. Rik, I'm afaid I'm punting this to you for explanation
>        because clearly you tested this for AIM7 and might have some
>        theories. For the purposes of testing, I just reverted the changes.
>
> 4. CFQ scheduler fairness commit 718eee057 causes some hurt
>        fix title: none available
>        fixed in mainline? no
>        affects: 2.6.33 to now
>
>        A bisection finger printed this patch as being a problem introduced
>        between 2.6.32 and 2.6.33. It increases a small amount the number of
>        times the page allocator stalls but drastically increased the number
>        of pages reclaimed. It's not clear why the commit is such a problem.
>
>        Unfortunately, I could not test a revert of this patch. The CFQ and
>        block IO changes made in this window were extremely convulated and
>        overlapped heavily with a large number of patches altering the same
>        code as touched by commit 718eee057. I tried reverting everything
>        made on and after this commit but the results were unsatisfactory.
>
>        Hence, there is no fix in the results below
>
> Results
> =======
>
> Here are the highlights of kernels tested. I'm omitting the bisection
> results for obvious reasons. The metrics were gathered at two points;
> after filesystem creation and after IOZone completed.
>
> The lower the number for each metric, the better.
>
>                                                     After Filesystem Setup                                       After IOZone
>                                         pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait
> 2.6.29                                          0        0        0          2         1               4        3      183        152         0
> 2.6.30                                          1        5       34          1        25             783     3752    31939         76         0
> 2.6.30-asyncconfusion                           0        0        0          3         1              44       60     2656        893         0
> 2.6.30.10                                       0        0        0          2        43             777     3699    32661         74         0
> 2.6.30.10-asyncconfusion                        0        0        0          2         1              36       88     1699       1114         0
>
> asyncconfusion can be back-ported easily to 2.6.30.10. Performance is not
> perfectly in line with 2.6.29 but it's better.
>
> 2.6.31                                          0        0        0          3         1           49175   245727  2730626     176344         0
> 2.6.31-revertevict                              0        0        0          3         2              31      147     1887        114         0
> 2.6.31-ttyfix                                   0        0        0          2         2           46238   231000  2549462     170912         0
> 2.6.31-ttyfix-revertevict                       0        0        0          3         0               7       35      448        121         0
> 2.6.31.12                                       0        0        0          2         0           68897   344268  4050646     183523         0
> 2.6.31.12-revertevict                           0        0        0          3         1              18       87     1009        147         0
> 2.6.31.12-ttyfix                                0        0        0          2         0           62797   313805  3786539     173398         0
> 2.6.31.12-ttyfix-revertevict                    0        0        0          3         2               7       35      448        199         0
>
> Applying the tty fixes from 2.6.34-rc2 and getting rid of the evict-once
> patches bring things back in line with 2.6.29 again.
>
> Rik, any theory on evict-once?
>
> 2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0
> 2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0
> 2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0
> 2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0
> 2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0
> 2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0
> 2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0
> 2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0
>
> Again, fixing tty and reverting evict-once helps bring figures more in line
> with 2.6.29.
>
> 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
> 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
> 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
> 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
> 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
> 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
> 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
> 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0
>
> At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
> queues" has lodged itself deep within CGQ and I couldn't tear it out or
> see how to fix it. Fixing tty and reverting evict-once helps but the number
> of stalls is significantly increased and a much larger number of pages get
> reclaimed overall.
>
> Corrado?

The major changes in I/O scheduing behaviour are:
* buffered writes:
  * before we could schedule few writes, then interrupt them to do
some reads, and then go back to writes; now we guarantee some
uninterruptible time slice for writes, but the delay between two
slices is increased. The total write throughput averaged over a time
window larger than 300ms should be comparable, or even better with
2.6.33. Note that the commit you cite has introduced a bug regarding
write throughput on NCQ disks that was later fixed by 1efe8fe1, merged
before 2.6.33 (this may lead to confusing bisection results).
* reads (and sync writes):
  * before, we serviced a single process for 100ms, then switched to
an other, and so on.
  * after, we go round robin for random requests (they get a unified
time slice, like buffered writes do), and we have consecutive time
slices for sequential requests, but the length of the slice is reduced
when the number of concurrent processes doing I/O increases.

This means that with 16 processes doing sequential I/O on the same
disk, before you were switching between processes every 100ms, and now
every 32ms. The old behaviour can be brought back by setting
/sys/block/sd*/queue/iosched/low_latency to 0.
For random I/O, the situation (going round robin, it will translate to
switching every 8 ms on average) is not revertable via flags.

>
> 2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0
> 2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0
> 2.6.34-rc1-ttyfix                               0        0        0          1         1          159603   791056  5186082     223458         0
> 2.6.34-rc1-ttyfix-revertevict                   0        0        0          0         0            1549     7641    50484        679         0
>
> Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get
> back to 2.6.29 performance.
>
> Next Steps
> ==========
>
> Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
> 2.6.30.x (assuming that is still maintained, Greg?)?
>
> Rik, any suggestions on what can be done with evict-once?
>
> Corrado, any suggestions on what can be done with CFQ?

If my intuition that switching between processes too often is
detrimental when you have memory pressure (higher probability to need
to re-page-in some of the pages that were just discarded), I suggest
trying setting low_latency to 0, and maybe increasing the slice_sync
(to get more slice to a single process before switching to an other),
slice_async (to give more uninterruptible time to buffered writes) and
slice_async_rq (to higher the limit of consecutive write requests can
be sent to disk).
While this would normally lead to a bad user experience on a system
with plenty of memory, it should keep things acceptable when paging in
/ swapping / dirty page writeback is overwhelming.

Corrado

>
> Christian, can you test the following amalgamated patch on 2.6.32.10 and
> 2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
> cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
> and revertevict. If your problem goes away, it implies that the stalls I
> can measure are roughly correlated to the more significant problem you have.
>
> ===== CUT HERE =====
>
> From d9661adfb8e53a7647360140af3b92284cbe52d4 Mon Sep 17 00:00:00 2001
> From: Alan Cox <alan@linux.intel.com>
> Date: Thu, 18 Feb 2010 16:43:47 +0000
> Subject: [PATCH] tty: Keep the default buffering to sub-page units
>
> We allocate during interrupts so while our buffering is normally diced up
> small anyway on some hardware at speed we can pressure the VM excessively
> for page pairs. We don't really need big buffers to be linear so don't try
> so hard.
>
> In order to make this work well we will tidy up excess callers to request_room,
> which cannot itself enforce this break up.
>
> Signed-off-by: Alan Cox <alan@linux.intel.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
>
> diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
> index 66fa4e1..f27c4d6 100644
> --- a/drivers/char/tty_buffer.c
> +++ b/drivers/char/tty_buffer.c
> @@ -247,7 +247,8 @@ int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars,
>  {
>        int copied = 0;
>        do {
> -               int space = tty_buffer_request_room(tty, size - copied);
> +               int goal = min(size - copied, TTY_BUFFER_PAGE);
> +               int space = tty_buffer_request_room(tty, goal);
>                struct tty_buffer *tb = tty->buf.tail;
>                /* If there is no space then tb may be NULL */
>                if (unlikely(space == 0))
> @@ -283,7 +284,8 @@ int tty_insert_flip_string_flags(struct tty_struct *tty,
>  {
>        int copied = 0;
>        do {
> -               int space = tty_buffer_request_room(tty, size - copied);
> +               int goal = min(size - copied, TTY_BUFFER_PAGE);
> +               int space = tty_buffer_request_room(tty, goal);
>                struct tty_buffer *tb = tty->buf.tail;
>                /* If there is no space then tb may be NULL */
>                if (unlikely(space == 0))
> diff --git a/include/linux/tty.h b/include/linux/tty.h
> index 6abfcf5..d96e588 100644
> --- a/include/linux/tty.h
> +++ b/include/linux/tty.h
> @@ -68,6 +68,16 @@ struct tty_buffer {
>        unsigned long data[0];
>  };
>
> +/*
> + * We default to dicing tty buffer allocations to this many characters
> + * in order to avoid multiple page allocations. We assume tty_buffer itself
> + * is under 256 bytes. See tty_buffer_find for the allocation logic this
> + * must match
> + */
> +
> +#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
> +
> +
>  struct tty_bufhead {
>        struct delayed_work work;
>        spinlock_t lock;
> From 352fa6ad16b89f8ffd1a93b4419b1a8f2259feab Mon Sep 17 00:00:00 2001
> From: Mel Gorman <mel@csn.ul.ie>
> Date: Tue, 2 Mar 2010 22:24:19 +0000
> Subject: [PATCH] tty: Take a 256 byte padding into account when buffering below sub-page units
>
> The TTY layer takes some care to ensure that only sub-page allocations
> are made with interrupts disabled. It does this by setting a goal of
> "TTY_BUFFER_PAGE" to allocate. Unfortunately, while TTY_BUFFER_PAGE takes the
> size of tty_buffer into account, it fails to account that tty_buffer_find()
> rounds the buffer size out to the next 256 byte boundary before adding on
> the size of the tty_buffer.
>
> This patch adjusts the TTY_BUFFER_PAGE calculation to take into account the
> size of the tty_buffer and the padding. Once applied, tty_buffer_alloc()
> should not require high-order allocations.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Cc: stable <stable@kernel.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
>
> diff --git a/include/linux/tty.h b/include/linux/tty.h
> index 568369a..593228a 100644
> --- a/include/linux/tty.h
> +++ b/include/linux/tty.h
> @@ -70,12 +70,13 @@ struct tty_buffer {
>
>  /*
>  * We default to dicing tty buffer allocations to this many characters
> - * in order to avoid multiple page allocations. We assume tty_buffer itself
> - * is under 256 bytes. See tty_buffer_find for the allocation logic this
> - * must match
> + * in order to avoid multiple page allocations. We know the size of
> + * tty_buffer itself but it must also be taken into account that the
> + * the buffer is 256 byte aligned. See tty_buffer_find for the allocation
> + * logic this must match
>  */
>
> -#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
> +#define TTY_BUFFER_PAGE        (((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)
>
>
>  struct tty_bufhead {
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bf9213b..5ba0d9a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -94,7 +94,6 @@ extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
>  extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
>                                                        int priority);
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                       struct zone *zone,
>                                       enum lru_list lru);
> @@ -243,12 +242,6 @@ mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>        return 1;
>  }
>
> -static inline int
> -mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
> -{
> -       return 1;
> -}
> -
>  static inline unsigned long
>  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>                         enum lru_list lru)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 66035bf..bbb0eda 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -843,17 +843,6 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>        return 0;
>  }
>
> -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
> -{
> -       unsigned long active;
> -       unsigned long inactive;
> -
> -       inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
> -       active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
> -
> -       return (active > inactive);
> -}
> -
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                       struct zone *zone,
>                                       enum lru_list lru)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 692807f..5512301 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1428,59 +1428,13 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
>        return low;
>  }
>
> -static int inactive_file_is_low_global(struct zone *zone)
> -{
> -       unsigned long active, inactive;
> -
> -       active = zone_page_state(zone, NR_ACTIVE_FILE);
> -       inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -
> -       return (active > inactive);
> -}
> -
> -/**
> - * inactive_file_is_low - check if file pages need to be deactivated
> - * @zone: zone to check
> - * @sc:   scan control of this context
> - *
> - * When the system is doing streaming IO, memory pressure here
> - * ensures that active file pages get deactivated, until more
> - * than half of the file pages are on the inactive list.
> - *
> - * Once we get to that situation, protect the system's working
> - * set from being evicted by disabling active file page aging.
> - *
> - * This uses a different ratio than the anonymous pages, because
> - * the page cache uses a use-once replacement algorithm.
> - */
> -static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
> -{
> -       int low;
> -
> -       if (scanning_global_lru(sc))
> -               low = inactive_file_is_low_global(zone);
> -       else
> -               low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
> -       return low;
> -}
> -
> -static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
> -                               int file)
> -{
> -       if (file)
> -               return inactive_file_is_low(zone, sc);
> -       else
> -               return inactive_anon_is_low(zone, sc);
> -}
> -
>  static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>        struct zone *zone, struct scan_control *sc, int priority)
>  {
>        int file = is_file_lru(lru);
>
> -       if (is_active_lru(lru)) {
> -               if (inactive_list_is_low(zone, sc, file))
> -                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +       if (lru == LRU_ACTIVE_FILE) {
> +               shrink_active_list(nr_to_scan, zone, sc, priority, file);
>                return 0;
>        }
>
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-22 23:50                   ` Mel Gorman
@ 2010-03-23 22:29                     ` Rik van Riel
  -1 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-03-23 22:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo,
	Johannes Weiner

On 03/22/2010 07:50 PM, Mel Gorman wrote:

> Test scenario
> =============
> X86-64 machine 1 socket 4 cores
> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
> 	on-board and a piece of crap, and a decent RAID card could blow
> 	the budget.
> Booted mem=256 to ensure it is fully IO-bound and match closer to what
> 	Christian was doing

With that many disks, you can easily have dozens of megabytes
of data in flight to the disk at once.  That is a major
fraction of memory.

In fact, you might have all of the inactive file pages under
IO...

> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
> 	fix title: revertevict
> 	fixed in mainline? no
> 	affects: 2.6.31 to now
>
> 	For reasons that are not immediately obvious, the evict-once patches
> 	*really* hurt the time spent on congestion and the number of pages
> 	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
> 	because clearly you tested this for AIM7 and might have some
> 	theories. For the purposes of testing, I just reverted the changes.

The patch helped IO tests with reasonable amounts of memory
available, because the VM can cache frequently used data
much more effectively.

This comes at the cost of caching less recently accessed
use-once data, which should not be an issue since the data
is only used once...

> Rik, any theory on evict-once?

No real theories yet, just the observation that your revert
appears to be buggy (see below) and the possibility that your
test may have all of the inactive file pages under IO...

Can you reproduce the stall if you lower the dirty limits?

>   static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>   	struct zone *zone, struct scan_control *sc, int priority)
>   {
>   	int file = is_file_lru(lru);
>
> -	if (is_active_lru(lru)) {
> -		if (inactive_list_is_low(zone, sc, file))
> -		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +	if (lru == LRU_ACTIVE_FILE) {
> +		shrink_active_list(nr_to_scan, zone, sc, priority, file);
>   		return 0;
>   	}

Your revert is buggy.  With this change, anonymous pages will
never get deactivated via shrink_list.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-23 22:29                     ` Rik van Riel
  0 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-03-23 22:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo,
	Johannes Weiner

On 03/22/2010 07:50 PM, Mel Gorman wrote:

> Test scenario
> =============
> X86-64 machine 1 socket 4 cores
> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
> 	on-board and a piece of crap, and a decent RAID card could blow
> 	the budget.
> Booted mem=256 to ensure it is fully IO-bound and match closer to what
> 	Christian was doing

With that many disks, you can easily have dozens of megabytes
of data in flight to the disk at once.  That is a major
fraction of memory.

In fact, you might have all of the inactive file pages under
IO...

> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
> 	fix title: revertevict
> 	fixed in mainline? no
> 	affects: 2.6.31 to now
>
> 	For reasons that are not immediately obvious, the evict-once patches
> 	*really* hurt the time spent on congestion and the number of pages
> 	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
> 	because clearly you tested this for AIM7 and might have some
> 	theories. For the purposes of testing, I just reverted the changes.

The patch helped IO tests with reasonable amounts of memory
available, because the VM can cache frequently used data
much more effectively.

This comes at the cost of caching less recently accessed
use-once data, which should not be an issue since the data
is only used once...

> Rik, any theory on evict-once?

No real theories yet, just the observation that your revert
appears to be buggy (see below) and the possibility that your
test may have all of the inactive file pages under IO...

Can you reproduce the stall if you lower the dirty limits?

>   static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>   	struct zone *zone, struct scan_control *sc, int priority)
>   {
>   	int file = is_file_lru(lru);
>
> -	if (is_active_lru(lru)) {
> -		if (inactive_list_is_low(zone, sc, file))
> -		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +	if (lru == LRU_ACTIVE_FILE) {
> +		shrink_active_list(nr_to_scan, zone, sc, priority, file);
>   		return 0;
>   	}

Your revert is buggy.  With this change, anonymous pages will
never get deactivated via shrink_list.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-22 23:50                   ` Mel Gorman
@ 2010-03-24  2:38                     ` Greg KH
  -1 siblings, 0 replies; 136+ messages in thread
From: Greg KH @ 2010-03-24  2:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, Corrado Zoccolo,
	Rik van Riel, Johannes Weiner

On Mon, Mar 22, 2010 at 11:50:54PM +0000, Mel Gorman wrote:
> 2. TTY using high order allocations more frequently
> 	fix title: ttyfix
> 	fixed in mainline? yes, in 2.6.34-rc2
> 	affects: 2.6.31 to 2.6.34-rc1
> 
> 	2.6.31 made pty's use the same buffering logic as tty.	Unfortunately,
> 	it was also allowed to make high-order GFP_ATOMIC allocations. This
> 	triggers some high-order reclaim and introduces some stalls. It's
> 	fixed in 2.6.34-rc2 but needs back-porting.

It will go to the other stable kernels for their next round of releases
now that it is in Linus's tree.

> Next Steps
> ==========
> 
> Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
> 2.6.30.x (assuming that is still maintained, Greg?)?

No, .30 is no longer being maintained.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-24  2:38                     ` Greg KH
  0 siblings, 0 replies; 136+ messages in thread
From: Greg KH @ 2010-03-24  2:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, Corrado Zoccolo,
	Rik van Riel, Johannes Weiner

On Mon, Mar 22, 2010 at 11:50:54PM +0000, Mel Gorman wrote:
> 2. TTY using high order allocations more frequently
> 	fix title: ttyfix
> 	fixed in mainline? yes, in 2.6.34-rc2
> 	affects: 2.6.31 to 2.6.34-rc1
> 
> 	2.6.31 made pty's use the same buffering logic as tty.	Unfortunately,
> 	it was also allowed to make high-order GFP_ATOMIC allocations. This
> 	triggers some high-order reclaim and introduces some stalls. It's
> 	fixed in 2.6.34-rc2 but needs back-porting.

It will go to the other stable kernels for their next round of releases
now that it is in Linus's tree.

> Next Steps
> ==========
> 
> Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
> 2.6.30.x (assuming that is still maintained, Greg?)?

No, .30 is no longer being maintained.

thanks,

greg k-h

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-23 21:35                     ` Corrado Zoccolo
@ 2010-03-24 11:48                       ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-24 11:48 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Rik van Riel,
	Johannes Weiner

On Tue, Mar 23, 2010 at 10:35:20PM +0100, Corrado Zoccolo wrote:
> Hi Mel,
> On Tue, Mar 23, 2010 at 12:50 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> >> On Mon, 15 Mar 2010 13:34:50 +0100
> >> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> >>
> >> > c) If direct reclaim did reasonable progress in try_to_free but did not
> >> > get a page, AND there is no write in flight at all then let it try again
> >> > to free up something.
> >> > This could be extended by some kind of max retry to avoid some weird
> >> > looping cases as well.
> >> >
> >> > d) Another way might be as easy as letting congestion_wait return
> >> > immediately if there are no outstanding writes - this would keep the
> >> > behavior for cases with write and avoid the "running always in full
> >> > timeout" issue without writes.
> >>
> >> They're pretty much equivalent and would work.  But there are two
> >> things I still don't understand:
> >>
> >> 1: Why is direct reclaim calling congestion_wait() at all?  If no
> >> writes are going on there's lots of clean pagecache around so reclaim
> >> should trivially succeed.  What's preventing it from doing so?
> >>
> >> 2: This is, I think, new behaviour.  A regression.  What caused it?
> >>
> >
> > 120+ kernels and a lot of hurt later;
> >
> > Short summary - The number of times kswapd and the page allocator have been
> >        calling congestion_wait and the length of time it spends in there
> >        has been increasing since 2.6.29. Oddly, it has little to do
> >        with the page allocator itself.
> >
> > Test scenario
> > =============
> > X86-64 machine 1 socket 4 cores
> > 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
> >        on-board and a piece of crap, and a decent RAID card could blow
> >        the budget.
> > Booted mem=256 to ensure it is fully IO-bound and match closer to what
> >        Christian was doing
> >
> > At each test, the disks are partitioned, the raid arrays created and an
> > ext2 filesystem created. iozone sequential read/write tests are run with
> > increasing number of processes up to 64. Each test creates 8G of files. i.e.
> > 1 process = 8G. 2 processes = 2x4G etc
> >
> >        iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
> >        iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
> >        etc.
> >
> > Metrics
> > =======
> >
> > Each kernel was instrumented to collected the following stats
> >
> >        pg-Stall        Page allocator stalled calling congestion_wait
> >        pg-Wait         The amount of time spent in congestion_wait
> >        pg-Rclm         Pages reclaimed by direct reclaim
> >        ksd-stall       balance_pgdat() (ie kswapd) staled on congestion_wait
> >        ksd-wait        Time spend by balance_pgdat in congestion_wait
> >
> > Large differences in this do not necessarily show up in iozone because the
> > disks are so slow that the stalls are a tiny percentage overall. However, in
> > the event that there are many disks, it might be a greater problem. I believe
> > Christian is hitting a corner case where small delays trigger a much larger
> > stall.
> >
> > Why The Increases
> > =================
> >
> > The big problem here is that there was no one change. Instead, it has been
> > a steady build-up of a number of problems. The ones I identified are in the
> > block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
> > but need backporting and others I expect are a major surprise. Whether they
> > are worth backporting or not heavily depends on whether Christian's problem
> > is resolved.
> >
> > Some of the "fixes" below are obviously not fixes at all. Gathering this data
> > took a significant amount of time. It'd be nice if people more familiar with
> > the relevant problem patches could spring a theory or patch.
> >
> > The Problems
> > ============
> >
> > 1. Block layer congestion queue async/sync difficulty
> >        fix title: asyncconfusion
> >        fixed in mainline? yes, in 2.6.31
> >        affects: 2.6.30
> >
> >        2.6.30 replaced congestion queues based on read/write with sync/async
> >        in commit 1faa16d2. Problems were identified with this and fixed in
> >        2.6.31 but not backported. Backporting 8aa7e847 and 373c0a7e brings
> >        2.6.30 in line with 2.6.29 performance. It's not an issue for 2.6.31.
> >
> > 2. TTY using high order allocations more frequently
> >        fix title: ttyfix
> >        fixed in mainline? yes, in 2.6.34-rc2
> >        affects: 2.6.31 to 2.6.34-rc1
> >
> >        2.6.31 made pty's use the same buffering logic as tty.  Unfortunately,
> >        it was also allowed to make high-order GFP_ATOMIC allocations. This
> >        triggers some high-order reclaim and introduces some stalls. It's
> >        fixed in 2.6.34-rc2 but needs back-porting.
> >
> > 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
> >        fix title: revertevict
> >        fixed in mainline? no
> >        affects: 2.6.31 to now
> >
> >        For reasons that are not immediately obvious, the evict-once patches
> >        *really* hurt the time spent on congestion and the number of pages
> >        reclaimed. Rik, I'm afaid I'm punting this to you for explanation
> >        because clearly you tested this for AIM7 and might have some
> >        theories. For the purposes of testing, I just reverted the changes.
> >
> > 4. CFQ scheduler fairness commit 718eee057 causes some hurt
> >        fix title: none available
> >        fixed in mainline? no
> >        affects: 2.6.33 to now
> >
> >        A bisection finger printed this patch as being a problem introduced
> >        between 2.6.32 and 2.6.33. It increases a small amount the number of
> >        times the page allocator stalls but drastically increased the number
> >        of pages reclaimed. It's not clear why the commit is such a problem.
> >
> >        Unfortunately, I could not test a revert of this patch. The CFQ and
> >        block IO changes made in this window were extremely convulated and
> >        overlapped heavily with a large number of patches altering the same
> >        code as touched by commit 718eee057. I tried reverting everything
> >        made on and after this commit but the results were unsatisfactory.
> >
> >        Hence, there is no fix in the results below
> >
> > Results
> > =======
> >
> > Here are the highlights of kernels tested. I'm omitting the bisection
> > results for obvious reasons. The metrics were gathered at two points;
> > after filesystem creation and after IOZone completed.
> >
> > The lower the number for each metric, the better.
> >
> >                                                     After Filesystem Setup                                       After IOZone
> >                                         pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait
> > 2.6.29                                          0        0        0          2         1               4        3      183        152         0
> > 2.6.30                                          1        5       34          1        25             783     3752    31939         76         0
> > 2.6.30-asyncconfusion                           0        0        0          3         1              44       60     2656        893         0
> > 2.6.30.10                                       0        0        0          2        43             777     3699    32661         74         0
> > 2.6.30.10-asyncconfusion                        0        0        0          2         1              36       88     1699       1114         0
> >
> > asyncconfusion can be back-ported easily to 2.6.30.10. Performance is not
> > perfectly in line with 2.6.29 but it's better.
> >
> > 2.6.31                                          0        0        0          3         1           49175   245727  2730626     176344         0
> > 2.6.31-revertevict                              0        0        0          3         2              31      147     1887        114         0
> > 2.6.31-ttyfix                                   0        0        0          2         2           46238   231000  2549462     170912         0
> > 2.6.31-ttyfix-revertevict                       0        0        0          3         0               7       35      448        121         0
> > 2.6.31.12                                       0        0        0          2         0           68897   344268  4050646     183523         0
> > 2.6.31.12-revertevict                           0        0        0          3         1              18       87     1009        147         0
> > 2.6.31.12-ttyfix                                0        0        0          2         0           62797   313805  3786539     173398         0
> > 2.6.31.12-ttyfix-revertevict                    0        0        0          3         2               7       35      448        199         0
> >
> > Applying the tty fixes from 2.6.34-rc2 and getting rid of the evict-once
> > patches bring things back in line with 2.6.29 again.
> >
> > Rik, any theory on evict-once?
> >
> > 2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0
> > 2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0
> > 2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0
> > 2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0
> > 2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0
> > 2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0
> > 2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0
> > 2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0
> >
> > Again, fixing tty and reverting evict-once helps bring figures more in line
> > with 2.6.29.
> >
> > 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
> > 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
> > 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
> > 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
> > 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
> > 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
> > 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
> > 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0
> >
> > At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
> > queues" has lodged itself deep within CGQ and I couldn't tear it out or
> > see how to fix it. Fixing tty and reverting evict-once helps but the number
> > of stalls is significantly increased and a much larger number of pages get
> > reclaimed overall.
> >
> > Corrado?
> 
> The major changes in I/O scheduing behaviour are:
> * buffered writes:
>    before we could schedule few writes, then interrupt them to do
>   some reads, and then go back to writes; now we guarantee some
>   uninterruptible time slice for writes, but the delay between two
>   slices is increased. The total write throughput averaged over a time
>   window larger than 300ms should be comparable, or even better with
>   2.6.33. Note that the commit you cite has introduced a bug regarding
>   write throughput on NCQ disks that was later fixed by 1efe8fe1, merged
>   before 2.6.33 (this may lead to confusing bisection results).

This is true. The CFQ and block IO changes in that window are almost
impossible to properly bisect and isolate individual changes. There were
multiple dependant patches that modified each others changes. It's unclear
if this modification can even be isolated although your suggestion below
is the best bet.

> * reads (and sync writes):
>   * before, we serviced a single process for 100ms, then switched to
>     an other, and so on.
>   * after, we go round robin for random requests (they get a unified
>     time slice, like buffered writes do), and we have consecutive time
>     slices for sequential requests, but the length of the slice is reduced
>     when the number of concurrent processes doing I/O increases.
> 
> This means that with 16 processes doing sequential I/O on the same
> disk, before you were switching between processes every 100ms, and now
> every 32ms. The old behaviour can be brought back by setting
> /sys/block/sd*/queue/iosched/low_latency to 0.

Will try this and see what happens.

> For random I/O, the situation (going round robin, it will translate to
> switching every 8 ms on average) is not revertable via flags.
> 

At the moment, I'm not testing random IO so it shouldn't be a factor in
the tests.

> >
> > 2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0
> > 2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0
> > 2.6.34-rc1-ttyfix                               0        0        0          1         1          159603   791056  5186082     223458         0
> > 2.6.34-rc1-ttyfix-revertevict                   0        0        0          0         0            1549     7641    50484        679         0
> >
> > Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get
> > back to 2.6.29 performance.
> >
> > Next Steps
> > ==========
> >
> > Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
> > 2.6.30.x (assuming that is still maintained, Greg?)?
> >
> > Rik, any suggestions on what can be done with evict-once?
> >
> > Corrado, any suggestions on what can be done with CFQ?
> 
> If my intuition that switching between processes too often is
> detrimental when you have memory pressure (higher probability to need
> to re-page-in some of the pages that were just discarded), I suggest
> trying setting low_latency to 0, and maybe increasing the slice_sync
> (to get more slice to a single process before switching to an other),
> slice_async (to give more uninterruptible time to buffered writes) and
> slice_async_rq (to higher the limit of consecutive write requests can
> be sent to disk).
> While this would normally lead to a bad user experience on a system
> with plenty of memory, it should keep things acceptable when paging in
> / swapping / dirty page writeback is overwhelming.
> 

Christian, would you be able to follow the same instructions and see can
you make a difference to your test? It is known for your situation that
memory is unusually low for size of your workload so it's a possibility.

Thanks Corrado.

> Corrado
> 
> >
> > Christian, can you test the following amalgamated patch on 2.6.32.10 and
> > 2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
> > cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
> > and revertevict. If your problem goes away, it implies that the stalls I
> > can measure are roughly correlated to the more significant problem you have.
> >
> > ===== CUT HERE =====
> >
> > From d9661adfb8e53a7647360140af3b92284cbe52d4 Mon Sep 17 00:00:00 2001
> > From: Alan Cox <alan@linux.intel.com>
> > Date: Thu, 18 Feb 2010 16:43:47 +0000
> > Subject: [PATCH] tty: Keep the default buffering to sub-page units
> >
> > We allocate during interrupts so while our buffering is normally diced up
> > small anyway on some hardware at speed we can pressure the VM excessively
> > for page pairs. We don't really need big buffers to be linear so don't try
> > so hard.
> >
> > In order to make this work well we will tidy up excess callers to request_room,
> > which cannot itself enforce this break up.
> >
> > Signed-off-by: Alan Cox <alan@linux.intel.com>
> > Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
> >
> > diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
> > index 66fa4e1..f27c4d6 100644
> > --- a/drivers/char/tty_buffer.c
> > +++ b/drivers/char/tty_buffer.c
> > @@ -247,7 +247,8 @@ int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars,
> >  {
> >        int copied = 0;
> >        do {
> > -               int space = tty_buffer_request_room(tty, size - copied);
> > +               int goal = min(size - copied, TTY_BUFFER_PAGE);
> > +               int space = tty_buffer_request_room(tty, goal);
> >                struct tty_buffer *tb = tty->buf.tail;
> >                /* If there is no space then tb may be NULL */
> >                if (unlikely(space == 0))
> > @@ -283,7 +284,8 @@ int tty_insert_flip_string_flags(struct tty_struct *tty,
> >  {
> >        int copied = 0;
> >        do {
> > -               int space = tty_buffer_request_room(tty, size - copied);
> > +               int goal = min(size - copied, TTY_BUFFER_PAGE);
> > +               int space = tty_buffer_request_room(tty, goal);
> >                struct tty_buffer *tb = tty->buf.tail;
> >                /* If there is no space then tb may be NULL */
> >                if (unlikely(space == 0))
> > diff --git a/include/linux/tty.h b/include/linux/tty.h
> > index 6abfcf5..d96e588 100644
> > --- a/include/linux/tty.h
> > +++ b/include/linux/tty.h
> > @@ -68,6 +68,16 @@ struct tty_buffer {
> >        unsigned long data[0];
> >  };
> >
> > +/*
> > + * We default to dicing tty buffer allocations to this many characters
> > + * in order to avoid multiple page allocations. We assume tty_buffer itself
> > + * is under 256 bytes. See tty_buffer_find for the allocation logic this
> > + * must match
> > + */
> > +
> > +#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
> > +
> > +
> >  struct tty_bufhead {
> >        struct delayed_work work;
> >        spinlock_t lock;
> > From 352fa6ad16b89f8ffd1a93b4419b1a8f2259feab Mon Sep 17 00:00:00 2001
> > From: Mel Gorman <mel@csn.ul.ie>
> > Date: Tue, 2 Mar 2010 22:24:19 +0000
> > Subject: [PATCH] tty: Take a 256 byte padding into account when buffering below sub-page units
> >
> > The TTY layer takes some care to ensure that only sub-page allocations
> > are made with interrupts disabled. It does this by setting a goal of
> > "TTY_BUFFER_PAGE" to allocate. Unfortunately, while TTY_BUFFER_PAGE takes the
> > size of tty_buffer into account, it fails to account that tty_buffer_find()
> > rounds the buffer size out to the next 256 byte boundary before adding on
> > the size of the tty_buffer.
> >
> > This patch adjusts the TTY_BUFFER_PAGE calculation to take into account the
> > size of the tty_buffer and the padding. Once applied, tty_buffer_alloc()
> > should not require high-order allocations.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Cc: stable <stable@kernel.org>
> > Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
> >
> > diff --git a/include/linux/tty.h b/include/linux/tty.h
> > index 568369a..593228a 100644
> > --- a/include/linux/tty.h
> > +++ b/include/linux/tty.h
> > @@ -70,12 +70,13 @@ struct tty_buffer {
> >
> >  /*
> >  * We default to dicing tty buffer allocations to this many characters
> > - * in order to avoid multiple page allocations. We assume tty_buffer itself
> > - * is under 256 bytes. See tty_buffer_find for the allocation logic this
> > - * must match
> > + * in order to avoid multiple page allocations. We know the size of
> > + * tty_buffer itself but it must also be taken into account that the
> > + * the buffer is 256 byte aligned. See tty_buffer_find for the allocation
> > + * logic this must match
> >  */
> >
> > -#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
> > +#define TTY_BUFFER_PAGE        (((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)
> >
> >
> >  struct tty_bufhead {
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index bf9213b..5ba0d9a 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -94,7 +94,6 @@ extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
> >  extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
> >                                                        int priority);
> >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> > -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                       struct zone *zone,
> >                                       enum lru_list lru);
> > @@ -243,12 +242,6 @@ mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> >        return 1;
> >  }
> >
> > -static inline int
> > -mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
> > -{
> > -       return 1;
> > -}
> > -
> >  static inline unsigned long
> >  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
> >                         enum lru_list lru)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 66035bf..bbb0eda 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -843,17 +843,6 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> >        return 0;
> >  }
> >
> > -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
> > -{
> > -       unsigned long active;
> > -       unsigned long inactive;
> > -
> > -       inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
> > -       active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
> > -
> > -       return (active > inactive);
> > -}
> > -
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                       struct zone *zone,
> >                                       enum lru_list lru)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 692807f..5512301 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1428,59 +1428,13 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
> >        return low;
> >  }
> >
> > -static int inactive_file_is_low_global(struct zone *zone)
> > -{
> > -       unsigned long active, inactive;
> > -
> > -       active = zone_page_state(zone, NR_ACTIVE_FILE);
> > -       inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> > -
> > -       return (active > inactive);
> > -}
> > -
> > -/**
> > - * inactive_file_is_low - check if file pages need to be deactivated
> > - * @zone: zone to check
> > - * @sc:   scan control of this context
> > - *
> > - * When the system is doing streaming IO, memory pressure here
> > - * ensures that active file pages get deactivated, until more
> > - * than half of the file pages are on the inactive list.
> > - *
> > - * Once we get to that situation, protect the system's working
> > - * set from being evicted by disabling active file page aging.
> > - *
> > - * This uses a different ratio than the anonymous pages, because
> > - * the page cache uses a use-once replacement algorithm.
> > - */
> > -static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
> > -{
> > -       int low;
> > -
> > -       if (scanning_global_lru(sc))
> > -               low = inactive_file_is_low_global(zone);
> > -       else
> > -               low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
> > -       return low;
> > -}
> > -
> > -static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
> > -                               int file)
> > -{
> > -       if (file)
> > -               return inactive_file_is_low(zone, sc);
> > -       else
> > -               return inactive_anon_is_low(zone, sc);
> > -}
> > -
> >  static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> >        struct zone *zone, struct scan_control *sc, int priority)
> >  {
> >        int file = is_file_lru(lru);
> >
> > -       if (is_active_lru(lru)) {
> > -               if (inactive_list_is_low(zone, sc, file))
> > -                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
> > +       if (lru == LRU_ACTIVE_FILE) {
> > +               shrink_active_list(nr_to_scan, zone, sc, priority, file);
> >                return 0;
> >        }
> >

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-24 11:48                       ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-24 11:48 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Rik van Riel,
	Johannes Weiner

On Tue, Mar 23, 2010 at 10:35:20PM +0100, Corrado Zoccolo wrote:
> Hi Mel,
> On Tue, Mar 23, 2010 at 12:50 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> >> On Mon, 15 Mar 2010 13:34:50 +0100
> >> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> >>
> >> > c) If direct reclaim did reasonable progress in try_to_free but did not
> >> > get a page, AND there is no write in flight at all then let it try again
> >> > to free up something.
> >> > This could be extended by some kind of max retry to avoid some weird
> >> > looping cases as well.
> >> >
> >> > d) Another way might be as easy as letting congestion_wait return
> >> > immediately if there are no outstanding writes - this would keep the
> >> > behavior for cases with write and avoid the "running always in full
> >> > timeout" issue without writes.
> >>
> >> They're pretty much equivalent and would work.  But there are two
> >> things I still don't understand:
> >>
> >> 1: Why is direct reclaim calling congestion_wait() at all?  If no
> >> writes are going on there's lots of clean pagecache around so reclaim
> >> should trivially succeed.  What's preventing it from doing so?
> >>
> >> 2: This is, I think, new behaviour.  A regression.  What caused it?
> >>
> >
> > 120+ kernels and a lot of hurt later;
> >
> > Short summary - The number of times kswapd and the page allocator have been
> >        calling congestion_wait and the length of time it spends in there
> >        has been increasing since 2.6.29. Oddly, it has little to do
> >        with the page allocator itself.
> >
> > Test scenario
> > =============
> > X86-64 machine 1 socket 4 cores
> > 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
> >        on-board and a piece of crap, and a decent RAID card could blow
> >        the budget.
> > Booted mem=256 to ensure it is fully IO-bound and match closer to what
> >        Christian was doing
> >
> > At each test, the disks are partitioned, the raid arrays created and an
> > ext2 filesystem created. iozone sequential read/write tests are run with
> > increasing number of processes up to 64. Each test creates 8G of files. i.e.
> > 1 process = 8G. 2 processes = 2x4G etc
> >
> >        iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
> >        iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
> >        etc.
> >
> > Metrics
> > =======
> >
> > Each kernel was instrumented to collected the following stats
> >
> >        pg-Stall        Page allocator stalled calling congestion_wait
> >        pg-Wait         The amount of time spent in congestion_wait
> >        pg-Rclm         Pages reclaimed by direct reclaim
> >        ksd-stall       balance_pgdat() (ie kswapd) staled on congestion_wait
> >        ksd-wait        Time spend by balance_pgdat in congestion_wait
> >
> > Large differences in this do not necessarily show up in iozone because the
> > disks are so slow that the stalls are a tiny percentage overall. However, in
> > the event that there are many disks, it might be a greater problem. I believe
> > Christian is hitting a corner case where small delays trigger a much larger
> > stall.
> >
> > Why The Increases
> > =================
> >
> > The big problem here is that there was no one change. Instead, it has been
> > a steady build-up of a number of problems. The ones I identified are in the
> > block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
> > but need backporting and others I expect are a major surprise. Whether they
> > are worth backporting or not heavily depends on whether Christian's problem
> > is resolved.
> >
> > Some of the "fixes" below are obviously not fixes at all. Gathering this data
> > took a significant amount of time. It'd be nice if people more familiar with
> > the relevant problem patches could spring a theory or patch.
> >
> > The Problems
> > ============
> >
> > 1. Block layer congestion queue async/sync difficulty
> >        fix title: asyncconfusion
> >        fixed in mainline? yes, in 2.6.31
> >        affects: 2.6.30
> >
> >        2.6.30 replaced congestion queues based on read/write with sync/async
> >        in commit 1faa16d2. Problems were identified with this and fixed in
> >        2.6.31 but not backported. Backporting 8aa7e847 and 373c0a7e brings
> >        2.6.30 in line with 2.6.29 performance. It's not an issue for 2.6.31.
> >
> > 2. TTY using high order allocations more frequently
> >        fix title: ttyfix
> >        fixed in mainline? yes, in 2.6.34-rc2
> >        affects: 2.6.31 to 2.6.34-rc1
> >
> >        2.6.31 made pty's use the same buffering logic as tty.  Unfortunately,
> >        it was also allowed to make high-order GFP_ATOMIC allocations. This
> >        triggers some high-order reclaim and introduces some stalls. It's
> >        fixed in 2.6.34-rc2 but needs back-porting.
> >
> > 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
> >        fix title: revertevict
> >        fixed in mainline? no
> >        affects: 2.6.31 to now
> >
> >        For reasons that are not immediately obvious, the evict-once patches
> >        *really* hurt the time spent on congestion and the number of pages
> >        reclaimed. Rik, I'm afaid I'm punting this to you for explanation
> >        because clearly you tested this for AIM7 and might have some
> >        theories. For the purposes of testing, I just reverted the changes.
> >
> > 4. CFQ scheduler fairness commit 718eee057 causes some hurt
> >        fix title: none available
> >        fixed in mainline? no
> >        affects: 2.6.33 to now
> >
> >        A bisection finger printed this patch as being a problem introduced
> >        between 2.6.32 and 2.6.33. It increases a small amount the number of
> >        times the page allocator stalls but drastically increased the number
> >        of pages reclaimed. It's not clear why the commit is such a problem.
> >
> >        Unfortunately, I could not test a revert of this patch. The CFQ and
> >        block IO changes made in this window were extremely convulated and
> >        overlapped heavily with a large number of patches altering the same
> >        code as touched by commit 718eee057. I tried reverting everything
> >        made on and after this commit but the results were unsatisfactory.
> >
> >        Hence, there is no fix in the results below
> >
> > Results
> > =======
> >
> > Here are the highlights of kernels tested. I'm omitting the bisection
> > results for obvious reasons. The metrics were gathered at two points;
> > after filesystem creation and after IOZone completed.
> >
> > The lower the number for each metric, the better.
> >
> >                                                     After Filesystem Setup                                       After IOZone
> >                                         pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait
> > 2.6.29                                          0        0        0          2         1               4        3      183        152         0
> > 2.6.30                                          1        5       34          1        25             783     3752    31939         76         0
> > 2.6.30-asyncconfusion                           0        0        0          3         1              44       60     2656        893         0
> > 2.6.30.10                                       0        0        0          2        43             777     3699    32661         74         0
> > 2.6.30.10-asyncconfusion                        0        0        0          2         1              36       88     1699       1114         0
> >
> > asyncconfusion can be back-ported easily to 2.6.30.10. Performance is not
> > perfectly in line with 2.6.29 but it's better.
> >
> > 2.6.31                                          0        0        0          3         1           49175   245727  2730626     176344         0
> > 2.6.31-revertevict                              0        0        0          3         2              31      147     1887        114         0
> > 2.6.31-ttyfix                                   0        0        0          2         2           46238   231000  2549462     170912         0
> > 2.6.31-ttyfix-revertevict                       0        0        0          3         0               7       35      448        121         0
> > 2.6.31.12                                       0        0        0          2         0           68897   344268  4050646     183523         0
> > 2.6.31.12-revertevict                           0        0        0          3         1              18       87     1009        147         0
> > 2.6.31.12-ttyfix                                0        0        0          2         0           62797   313805  3786539     173398         0
> > 2.6.31.12-ttyfix-revertevict                    0        0        0          3         2               7       35      448        199         0
> >
> > Applying the tty fixes from 2.6.34-rc2 and getting rid of the evict-once
> > patches bring things back in line with 2.6.29 again.
> >
> > Rik, any theory on evict-once?
> >
> > 2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0
> > 2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0
> > 2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0
> > 2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0
> > 2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0
> > 2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0
> > 2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0
> > 2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0
> >
> > Again, fixing tty and reverting evict-once helps bring figures more in line
> > with 2.6.29.
> >
> > 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
> > 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
> > 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
> > 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
> > 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
> > 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
> > 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
> > 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0
> >
> > At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
> > queues" has lodged itself deep within CGQ and I couldn't tear it out or
> > see how to fix it. Fixing tty and reverting evict-once helps but the number
> > of stalls is significantly increased and a much larger number of pages get
> > reclaimed overall.
> >
> > Corrado?
> 
> The major changes in I/O scheduing behaviour are:
> * buffered writes:
>    before we could schedule few writes, then interrupt them to do
>   some reads, and then go back to writes; now we guarantee some
>   uninterruptible time slice for writes, but the delay between two
>   slices is increased. The total write throughput averaged over a time
>   window larger than 300ms should be comparable, or even better with
>   2.6.33. Note that the commit you cite has introduced a bug regarding
>   write throughput on NCQ disks that was later fixed by 1efe8fe1, merged
>   before 2.6.33 (this may lead to confusing bisection results).

This is true. The CFQ and block IO changes in that window are almost
impossible to properly bisect and isolate individual changes. There were
multiple dependant patches that modified each others changes. It's unclear
if this modification can even be isolated although your suggestion below
is the best bet.

> * reads (and sync writes):
>   * before, we serviced a single process for 100ms, then switched to
>     an other, and so on.
>   * after, we go round robin for random requests (they get a unified
>     time slice, like buffered writes do), and we have consecutive time
>     slices for sequential requests, but the length of the slice is reduced
>     when the number of concurrent processes doing I/O increases.
> 
> This means that with 16 processes doing sequential I/O on the same
> disk, before you were switching between processes every 100ms, and now
> every 32ms. The old behaviour can be brought back by setting
> /sys/block/sd*/queue/iosched/low_latency to 0.

Will try this and see what happens.

> For random I/O, the situation (going round robin, it will translate to
> switching every 8 ms on average) is not revertable via flags.
> 

At the moment, I'm not testing random IO so it shouldn't be a factor in
the tests.

> >
> > 2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0
> > 2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0
> > 2.6.34-rc1-ttyfix                               0        0        0          1         1          159603   791056  5186082     223458         0
> > 2.6.34-rc1-ttyfix-revertevict                   0        0        0          0         0            1549     7641    50484        679         0
> >
> > Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get
> > back to 2.6.29 performance.
> >
> > Next Steps
> > ==========
> >
> > Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
> > 2.6.30.x (assuming that is still maintained, Greg?)?
> >
> > Rik, any suggestions on what can be done with evict-once?
> >
> > Corrado, any suggestions on what can be done with CFQ?
> 
> If my intuition that switching between processes too often is
> detrimental when you have memory pressure (higher probability to need
> to re-page-in some of the pages that were just discarded), I suggest
> trying setting low_latency to 0, and maybe increasing the slice_sync
> (to get more slice to a single process before switching to an other),
> slice_async (to give more uninterruptible time to buffered writes) and
> slice_async_rq (to higher the limit of consecutive write requests can
> be sent to disk).
> While this would normally lead to a bad user experience on a system
> with plenty of memory, it should keep things acceptable when paging in
> / swapping / dirty page writeback is overwhelming.
> 

Christian, would you be able to follow the same instructions and see can
you make a difference to your test? It is known for your situation that
memory is unusually low for size of your workload so it's a possibility.

Thanks Corrado.

> Corrado
> 
> >
> > Christian, can you test the following amalgamated patch on 2.6.32.10 and
> > 2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
> > cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
> > and revertevict. If your problem goes away, it implies that the stalls I
> > can measure are roughly correlated to the more significant problem you have.
> >
> > ===== CUT HERE =====
> >
> > From d9661adfb8e53a7647360140af3b92284cbe52d4 Mon Sep 17 00:00:00 2001
> > From: Alan Cox <alan@linux.intel.com>
> > Date: Thu, 18 Feb 2010 16:43:47 +0000
> > Subject: [PATCH] tty: Keep the default buffering to sub-page units
> >
> > We allocate during interrupts so while our buffering is normally diced up
> > small anyway on some hardware at speed we can pressure the VM excessively
> > for page pairs. We don't really need big buffers to be linear so don't try
> > so hard.
> >
> > In order to make this work well we will tidy up excess callers to request_room,
> > which cannot itself enforce this break up.
> >
> > Signed-off-by: Alan Cox <alan@linux.intel.com>
> > Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
> >
> > diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
> > index 66fa4e1..f27c4d6 100644
> > --- a/drivers/char/tty_buffer.c
> > +++ b/drivers/char/tty_buffer.c
> > @@ -247,7 +247,8 @@ int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars,
> >  {
> >        int copied = 0;
> >        do {
> > -               int space = tty_buffer_request_room(tty, size - copied);
> > +               int goal = min(size - copied, TTY_BUFFER_PAGE);
> > +               int space = tty_buffer_request_room(tty, goal);
> >                struct tty_buffer *tb = tty->buf.tail;
> >                /* If there is no space then tb may be NULL */
> >                if (unlikely(space == 0))
> > @@ -283,7 +284,8 @@ int tty_insert_flip_string_flags(struct tty_struct *tty,
> >  {
> >        int copied = 0;
> >        do {
> > -               int space = tty_buffer_request_room(tty, size - copied);
> > +               int goal = min(size - copied, TTY_BUFFER_PAGE);
> > +               int space = tty_buffer_request_room(tty, goal);
> >                struct tty_buffer *tb = tty->buf.tail;
> >                /* If there is no space then tb may be NULL */
> >                if (unlikely(space == 0))
> > diff --git a/include/linux/tty.h b/include/linux/tty.h
> > index 6abfcf5..d96e588 100644
> > --- a/include/linux/tty.h
> > +++ b/include/linux/tty.h
> > @@ -68,6 +68,16 @@ struct tty_buffer {
> >        unsigned long data[0];
> >  };
> >
> > +/*
> > + * We default to dicing tty buffer allocations to this many characters
> > + * in order to avoid multiple page allocations. We assume tty_buffer itself
> > + * is under 256 bytes. See tty_buffer_find for the allocation logic this
> > + * must match
> > + */
> > +
> > +#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
> > +
> > +
> >  struct tty_bufhead {
> >        struct delayed_work work;
> >        spinlock_t lock;
> > From 352fa6ad16b89f8ffd1a93b4419b1a8f2259feab Mon Sep 17 00:00:00 2001
> > From: Mel Gorman <mel@csn.ul.ie>
> > Date: Tue, 2 Mar 2010 22:24:19 +0000
> > Subject: [PATCH] tty: Take a 256 byte padding into account when buffering below sub-page units
> >
> > The TTY layer takes some care to ensure that only sub-page allocations
> > are made with interrupts disabled. It does this by setting a goal of
> > "TTY_BUFFER_PAGE" to allocate. Unfortunately, while TTY_BUFFER_PAGE takes the
> > size of tty_buffer into account, it fails to account that tty_buffer_find()
> > rounds the buffer size out to the next 256 byte boundary before adding on
> > the size of the tty_buffer.
> >
> > This patch adjusts the TTY_BUFFER_PAGE calculation to take into account the
> > size of the tty_buffer and the padding. Once applied, tty_buffer_alloc()
> > should not require high-order allocations.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Cc: stable <stable@kernel.org>
> > Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
> >
> > diff --git a/include/linux/tty.h b/include/linux/tty.h
> > index 568369a..593228a 100644
> > --- a/include/linux/tty.h
> > +++ b/include/linux/tty.h
> > @@ -70,12 +70,13 @@ struct tty_buffer {
> >
> >  /*
> >  * We default to dicing tty buffer allocations to this many characters
> > - * in order to avoid multiple page allocations. We assume tty_buffer itself
> > - * is under 256 bytes. See tty_buffer_find for the allocation logic this
> > - * must match
> > + * in order to avoid multiple page allocations. We know the size of
> > + * tty_buffer itself but it must also be taken into account that the
> > + * the buffer is 256 byte aligned. See tty_buffer_find for the allocation
> > + * logic this must match
> >  */
> >
> > -#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
> > +#define TTY_BUFFER_PAGE        (((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)
> >
> >
> >  struct tty_bufhead {
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index bf9213b..5ba0d9a 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -94,7 +94,6 @@ extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
> >  extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
> >                                                        int priority);
> >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> > -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                       struct zone *zone,
> >                                       enum lru_list lru);
> > @@ -243,12 +242,6 @@ mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> >        return 1;
> >  }
> >
> > -static inline int
> > -mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
> > -{
> > -       return 1;
> > -}
> > -
> >  static inline unsigned long
> >  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
> >                         enum lru_list lru)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 66035bf..bbb0eda 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -843,17 +843,6 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> >        return 0;
> >  }
> >
> > -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
> > -{
> > -       unsigned long active;
> > -       unsigned long inactive;
> > -
> > -       inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
> > -       active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
> > -
> > -       return (active > inactive);
> > -}
> > -
> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> >                                       struct zone *zone,
> >                                       enum lru_list lru)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 692807f..5512301 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1428,59 +1428,13 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
> >        return low;
> >  }
> >
> > -static int inactive_file_is_low_global(struct zone *zone)
> > -{
> > -       unsigned long active, inactive;
> > -
> > -       active = zone_page_state(zone, NR_ACTIVE_FILE);
> > -       inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> > -
> > -       return (active > inactive);
> > -}
> > -
> > -/**
> > - * inactive_file_is_low - check if file pages need to be deactivated
> > - * @zone: zone to check
> > - * @sc:   scan control of this context
> > - *
> > - * When the system is doing streaming IO, memory pressure here
> > - * ensures that active file pages get deactivated, until more
> > - * than half of the file pages are on the inactive list.
> > - *
> > - * Once we get to that situation, protect the system's working
> > - * set from being evicted by disabling active file page aging.
> > - *
> > - * This uses a different ratio than the anonymous pages, because
> > - * the page cache uses a use-once replacement algorithm.
> > - */
> > -static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
> > -{
> > -       int low;
> > -
> > -       if (scanning_global_lru(sc))
> > -               low = inactive_file_is_low_global(zone);
> > -       else
> > -               low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
> > -       return low;
> > -}
> > -
> > -static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
> > -                               int file)
> > -{
> > -       if (file)
> > -               return inactive_file_is_low(zone, sc);
> > -       else
> > -               return inactive_anon_is_low(zone, sc);
> > -}
> > -
> >  static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> >        struct zone *zone, struct scan_control *sc, int priority)
> >  {
> >        int file = is_file_lru(lru);
> >
> > -       if (is_active_lru(lru)) {
> > -               if (inactive_list_is_low(zone, sc, file))
> > -                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
> > +       if (lru == LRU_ACTIVE_FILE) {
> > +               shrink_active_list(nr_to_scan, zone, sc, priority, file);
> >                return 0;
> >        }
> >

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-24  2:38                     ` Greg KH
@ 2010-03-24 11:49                       ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-24 11:49 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, Corrado Zoccolo,
	Rik van Riel, Johannes Weiner

On Tue, Mar 23, 2010 at 07:38:37PM -0700, Greg KH wrote:
> On Mon, Mar 22, 2010 at 11:50:54PM +0000, Mel Gorman wrote:
> > 2. TTY using high order allocations more frequently
> > 	fix title: ttyfix
> > 	fixed in mainline? yes, in 2.6.34-rc2
> > 	affects: 2.6.31 to 2.6.34-rc1
> > 
> > 	2.6.31 made pty's use the same buffering logic as tty.	Unfortunately,
> > 	it was also allowed to make high-order GFP_ATOMIC allocations. This
> > 	triggers some high-order reclaim and introduces some stalls. It's
> > 	fixed in 2.6.34-rc2 but needs back-porting.
> 
> It will go to the other stable kernels for their next round of releases
> now that it is in Linus's tree.
> 

Great.

> > Next Steps
> > ==========
> > 
> > Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
> > 2.6.30.x (assuming that is still maintained, Greg?)?
> 
> No, .30 is no longer being maintained.
> 

Right, I won't lose any sleep over 2.6.30.dodo so :)

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-24 11:49                       ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-24 11:49 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, Corrado Zoccolo,
	Rik van Riel, Johannes Weiner

On Tue, Mar 23, 2010 at 07:38:37PM -0700, Greg KH wrote:
> On Mon, Mar 22, 2010 at 11:50:54PM +0000, Mel Gorman wrote:
> > 2. TTY using high order allocations more frequently
> > 	fix title: ttyfix
> > 	fixed in mainline? yes, in 2.6.34-rc2
> > 	affects: 2.6.31 to 2.6.34-rc1
> > 
> > 	2.6.31 made pty's use the same buffering logic as tty.	Unfortunately,
> > 	it was also allowed to make high-order GFP_ATOMIC allocations. This
> > 	triggers some high-order reclaim and introduces some stalls. It's
> > 	fixed in 2.6.34-rc2 but needs back-porting.
> 
> It will go to the other stable kernels for their next round of releases
> now that it is in Linus's tree.
> 

Great.

> > Next Steps
> > ==========
> > 
> > Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
> > 2.6.30.x (assuming that is still maintained, Greg?)?
> 
> No, .30 is no longer being maintained.
> 

Right, I won't lose any sleep over 2.6.30.dodo so :)

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone  pressure
  2010-03-24 11:48                       ` Mel Gorman
@ 2010-03-24 12:56                         ` Corrado Zoccolo
  -1 siblings, 0 replies; 136+ messages in thread
From: Corrado Zoccolo @ 2010-03-24 12:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Rik van Riel,
	Johannes Weiner

On Wed, Mar 24, 2010 at 12:48 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Tue, Mar 23, 2010 at 10:35:20PM +0100, Corrado Zoccolo wrote:
>> Hi Mel,
>> On Tue, Mar 23, 2010 at 12:50 AM, Mel Gorman <mel@csn.ul.ie> wrote:
>> > On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
>> >> On Mon, 15 Mar 2010 13:34:50 +0100
>> >> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>> >>
>> >> > c) If direct reclaim did reasonable progress in try_to_free but did not
>> >> > get a page, AND there is no write in flight at all then let it try again
>> >> > to free up something.
>> >> > This could be extended by some kind of max retry to avoid some weird
>> >> > looping cases as well.
>> >> >
>> >> > d) Another way might be as easy as letting congestion_wait return
>> >> > immediately if there are no outstanding writes - this would keep the
>> >> > behavior for cases with write and avoid the "running always in full
>> >> > timeout" issue without writes.
>> >>
>> >> They're pretty much equivalent and would work.  But there are two
>> >> things I still don't understand:
>> >>
>> >> 1: Why is direct reclaim calling congestion_wait() at all?  If no
>> >> writes are going on there's lots of clean pagecache around so reclaim
>> >> should trivially succeed.  What's preventing it from doing so?
>> >>
>> >> 2: This is, I think, new behaviour.  A regression.  What caused it?
>> >>
>> >
>> > 120+ kernels and a lot of hurt later;
>> >
>> > Short summary - The number of times kswapd and the page allocator have been
>> >        calling congestion_wait and the length of time it spends in there
>> >        has been increasing since 2.6.29. Oddly, it has little to do
>> >        with the page allocator itself.
>> >
>> > Test scenario
>> > =============
>> > X86-64 machine 1 socket 4 cores
>> > 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
>> >        on-board and a piece of crap, and a decent RAID card could blow
>> >        the budget.
>> > Booted mem=256 to ensure it is fully IO-bound and match closer to what
>> >        Christian was doing
>> >
>> > At each test, the disks are partitioned, the raid arrays created and an
>> > ext2 filesystem created. iozone sequential read/write tests are run with
>> > increasing number of processes up to 64. Each test creates 8G of files. i.e.
>> > 1 process = 8G. 2 processes = 2x4G etc
>> >
>> >        iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
>> >        iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
>> >        etc.
>> >
>> > Metrics
>> > =======
>> >
>> > Each kernel was instrumented to collected the following stats
>> >
>> >        pg-Stall        Page allocator stalled calling congestion_wait
>> >        pg-Wait         The amount of time spent in congestion_wait
>> >        pg-Rclm         Pages reclaimed by direct reclaim
>> >        ksd-stall       balance_pgdat() (ie kswapd) staled on congestion_wait
>> >        ksd-wait        Time spend by balance_pgdat in congestion_wait
>> >
>> > Large differences in this do not necessarily show up in iozone because the
>> > disks are so slow that the stalls are a tiny percentage overall. However, in
>> > the event that there are many disks, it might be a greater problem. I believe
>> > Christian is hitting a corner case where small delays trigger a much larger
>> > stall.
>> >
>> > Why The Increases
>> > =================
>> >
>> > The big problem here is that there was no one change. Instead, it has been
>> > a steady build-up of a number of problems. The ones I identified are in the
>> > block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
>> > but need backporting and others I expect are a major surprise. Whether they
>> > are worth backporting or not heavily depends on whether Christian's problem
>> > is resolved.
>> >
>> > Some of the "fixes" below are obviously not fixes at all. Gathering this data
>> > took a significant amount of time. It'd be nice if people more familiar with
>> > the relevant problem patches could spring a theory or patch.
>> >
>> > The Problems
>> > ============
>> >
>> > 1. Block layer congestion queue async/sync difficulty
>> >        fix title: asyncconfusion
>> >        fixed in mainline? yes, in 2.6.31
>> >        affects: 2.6.30
>> >
>> >        2.6.30 replaced congestion queues based on read/write with sync/async
>> >        in commit 1faa16d2. Problems were identified with this and fixed in
>> >        2.6.31 but not backported. Backporting 8aa7e847 and 373c0a7e brings
>> >        2.6.30 in line with 2.6.29 performance. It's not an issue for 2.6.31.
>> >
>> > 2. TTY using high order allocations more frequently
>> >        fix title: ttyfix
>> >        fixed in mainline? yes, in 2.6.34-rc2
>> >        affects: 2.6.31 to 2.6.34-rc1
>> >
>> >        2.6.31 made pty's use the same buffering logic as tty.  Unfortunately,
>> >        it was also allowed to make high-order GFP_ATOMIC allocations. This
>> >        triggers some high-order reclaim and introduces some stalls. It's
>> >        fixed in 2.6.34-rc2 but needs back-porting.
>> >
>> > 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>> >        fix title: revertevict
>> >        fixed in mainline? no
>> >        affects: 2.6.31 to now
>> >
>> >        For reasons that are not immediately obvious, the evict-once patches
>> >        *really* hurt the time spent on congestion and the number of pages
>> >        reclaimed. Rik, I'm afaid I'm punting this to you for explanation
>> >        because clearly you tested this for AIM7 and might have some
>> >        theories. For the purposes of testing, I just reverted the changes.
>> >
>> > 4. CFQ scheduler fairness commit 718eee057 causes some hurt
>> >        fix title: none available
>> >        fixed in mainline? no
>> >        affects: 2.6.33 to now
>> >
>> >        A bisection finger printed this patch as being a problem introduced
>> >        between 2.6.32 and 2.6.33. It increases a small amount the number of
>> >        times the page allocator stalls but drastically increased the number
>> >        of pages reclaimed. It's not clear why the commit is such a problem.
>> >
>> >        Unfortunately, I could not test a revert of this patch. The CFQ and
>> >        block IO changes made in this window were extremely convulated and
>> >        overlapped heavily with a large number of patches altering the same
>> >        code as touched by commit 718eee057. I tried reverting everything
>> >        made on and after this commit but the results were unsatisfactory.
>> >
>> >        Hence, there is no fix in the results below
>> >
>> > Results
>> > =======
>> >
>> > Here are the highlights of kernels tested. I'm omitting the bisection
>> > results for obvious reasons. The metrics were gathered at two points;
>> > after filesystem creation and after IOZone completed.
>> >
>> > The lower the number for each metric, the better.
>> >
>> >                                                     After Filesystem Setup                                       After IOZone
>> >                                         pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait
>> > 2.6.29                                          0        0        0          2         1               4        3      183        152         0
>> > 2.6.30                                          1        5       34          1        25             783     3752    31939         76         0
>> > 2.6.30-asyncconfusion                           0        0        0          3         1              44       60     2656        893         0
>> > 2.6.30.10                                       0        0        0          2        43             777     3699    32661         74         0
>> > 2.6.30.10-asyncconfusion                        0        0        0          2         1              36       88     1699       1114         0
>> >
>> > asyncconfusion can be back-ported easily to 2.6.30.10. Performance is not
>> > perfectly in line with 2.6.29 but it's better.
>> >
>> > 2.6.31                                          0        0        0          3         1           49175   245727  2730626     176344         0
>> > 2.6.31-revertevict                              0        0        0          3         2              31      147     1887        114         0
>> > 2.6.31-ttyfix                                   0        0        0          2         2           46238   231000  2549462     170912         0
>> > 2.6.31-ttyfix-revertevict                       0        0        0          3         0               7       35      448        121         0
>> > 2.6.31.12                                       0        0        0          2         0           68897   344268  4050646     183523         0
>> > 2.6.31.12-revertevict                           0        0        0          3         1              18       87     1009        147         0
>> > 2.6.31.12-ttyfix                                0        0        0          2         0           62797   313805  3786539     173398         0
>> > 2.6.31.12-ttyfix-revertevict                    0        0        0          3         2               7       35      448        199         0
>> >
>> > Applying the tty fixes from 2.6.34-rc2 and getting rid of the evict-once
>> > patches bring things back in line with 2.6.29 again.
>> >
>> > Rik, any theory on evict-once?
>> >
>> > 2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0
>> > 2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0
>> > 2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0
>> > 2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0
>> > 2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0
>> > 2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0
>> > 2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0
>> > 2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0
>> >
>> > Again, fixing tty and reverting evict-once helps bring figures more in line
>> > with 2.6.29.
>> >
>> > 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
>> > 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
>> > 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
>> > 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
>> > 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
>> > 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
>> > 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
>> > 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0
>> >
>> > At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
>> > queues" has lodged itself deep within CGQ and I couldn't tear it out or
>> > see how to fix it. Fixing tty and reverting evict-once helps but the number
>> > of stalls is significantly increased and a much larger number of pages get
>> > reclaimed overall.
>> >
>> > Corrado?
>>
>> The major changes in I/O scheduing behaviour are:
>> * buffered writes:
>>    before we could schedule few writes, then interrupt them to do
>>   some reads, and then go back to writes; now we guarantee some
>>   uninterruptible time slice for writes, but the delay between two
>>   slices is increased. The total write throughput averaged over a time
>>   window larger than 300ms should be comparable, or even better with
>>   2.6.33. Note that the commit you cite has introduced a bug regarding
>>   write throughput on NCQ disks that was later fixed by 1efe8fe1, merged
>>   before 2.6.33 (this may lead to confusing bisection results).
>
> This is true. The CFQ and block IO changes in that window are almost
> impossible to properly bisect and isolate individual changes. There were
> multiple dependant patches that modified each others changes. It's unclear
> if this modification can even be isolated although your suggestion below
> is the best bet.
>
>> * reads (and sync writes):
>>   * before, we serviced a single process for 100ms, then switched to
>>     an other, and so on.
>>   * after, we go round robin for random requests (they get a unified
>>     time slice, like buffered writes do), and we have consecutive time
>>     slices for sequential requests, but the length of the slice is reduced
>>     when the number of concurrent processes doing I/O increases.
>>
>> This means that with 16 processes doing sequential I/O on the same
>> disk, before you were switching between processes every 100ms, and now
>> every 32ms. The old behaviour can be brought back by setting
>> /sys/block/sd*/queue/iosched/low_latency to 0.
>
> Will try this and see what happens.
>
>> For random I/O, the situation (going round robin, it will translate to
>> switching every 8 ms on average) is not revertable via flags.
>>
>
> At the moment, I'm not testing random IO so it shouldn't be a factor in
> the tests.
>
>> >
>> > 2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0
>> > 2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0
>> > 2.6.34-rc1-ttyfix                               0        0        0          1         1          159603   791056  5186082     223458         0
>> > 2.6.34-rc1-ttyfix-revertevict                   0        0        0          0         0            1549     7641    50484        679         0
>> >
>> > Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get
>> > back to 2.6.29 performance.
>> >
>> > Next Steps
>> > ==========
>> >
>> > Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
>> > 2.6.30.x (assuming that is still maintained, Greg?)?
>> >
>> > Rik, any suggestions on what can be done with evict-once?
>> >
>> > Corrado, any suggestions on what can be done with CFQ?
>>
>> If my intuition that switching between processes too often is
>> detrimental when you have memory pressure (higher probability to need
>> to re-page-in some of the pages that were just discarded), I suggest
>> trying setting low_latency to 0, and maybe increasing the slice_sync
>> (to get more slice to a single process before switching to an other),
>> slice_async (to give more uninterruptible time to buffered writes) and
>> slice_async_rq (to higher the limit of consecutive write requests can
>> be sent to disk).
>> While this would normally lead to a bad user experience on a system
>> with plenty of memory, it should keep things acceptable when paging in
>> / swapping / dirty page writeback is overwhelming.
>>
>
> Christian, would you be able to follow the same instructions and see can
> you make a difference to your test? It is known for your situation that
> memory is unusually low for size of your workload so it's a possibility.

An other parameter that is worth tweaking in this case is the
readahead size. If readahead size is too large for the available
memory, we might be reading, then discarding, and then reading again
the same pages.

I would also like to see some iostats output (iostats -kx 5 >
iostats.log) during the experiment run, to better understand what's
happening.

Thanks,
Corrado

>
> Thanks Corrado.
>
>> Corrado
>>
>> >
>> > Christian, can you test the following amalgamated patch on 2.6.32.10 and
>> > 2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
>> > cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
>> > and revertevict. If your problem goes away, it implies that the stalls I
>> > can measure are roughly correlated to the more significant problem you have.
>> >
>> > ===== CUT HERE =====
>> >
>> > From d9661adfb8e53a7647360140af3b92284cbe52d4 Mon Sep 17 00:00:00 2001
>> > From: Alan Cox <alan@linux.intel.com>
>> > Date: Thu, 18 Feb 2010 16:43:47 +0000
>> > Subject: [PATCH] tty: Keep the default buffering to sub-page units
>> >
>> > We allocate during interrupts so while our buffering is normally diced up
>> > small anyway on some hardware at speed we can pressure the VM excessively
>> > for page pairs. We don't really need big buffers to be linear so don't try
>> > so hard.
>> >
>> > In order to make this work well we will tidy up excess callers to request_room,
>> > which cannot itself enforce this break up.
>> >
>> > Signed-off-by: Alan Cox <alan@linux.intel.com>
>> > Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
>> >
>> > diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
>> > index 66fa4e1..f27c4d6 100644
>> > --- a/drivers/char/tty_buffer.c
>> > +++ b/drivers/char/tty_buffer.c
>> > @@ -247,7 +247,8 @@ int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars,
>> >  {
>> >        int copied = 0;
>> >        do {
>> > -               int space = tty_buffer_request_room(tty, size - copied);
>> > +               int goal = min(size - copied, TTY_BUFFER_PAGE);
>> > +               int space = tty_buffer_request_room(tty, goal);
>> >                struct tty_buffer *tb = tty->buf.tail;
>> >                /* If there is no space then tb may be NULL */
>> >                if (unlikely(space == 0))
>> > @@ -283,7 +284,8 @@ int tty_insert_flip_string_flags(struct tty_struct *tty,
>> >  {
>> >        int copied = 0;
>> >        do {
>> > -               int space = tty_buffer_request_room(tty, size - copied);
>> > +               int goal = min(size - copied, TTY_BUFFER_PAGE);
>> > +               int space = tty_buffer_request_room(tty, goal);
>> >                struct tty_buffer *tb = tty->buf.tail;
>> >                /* If there is no space then tb may be NULL */
>> >                if (unlikely(space == 0))
>> > diff --git a/include/linux/tty.h b/include/linux/tty.h
>> > index 6abfcf5..d96e588 100644
>> > --- a/include/linux/tty.h
>> > +++ b/include/linux/tty.h
>> > @@ -68,6 +68,16 @@ struct tty_buffer {
>> >        unsigned long data[0];
>> >  };
>> >
>> > +/*
>> > + * We default to dicing tty buffer allocations to this many characters
>> > + * in order to avoid multiple page allocations. We assume tty_buffer itself
>> > + * is under 256 bytes. See tty_buffer_find for the allocation logic this
>> > + * must match
>> > + */
>> > +
>> > +#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
>> > +
>> > +
>> >  struct tty_bufhead {
>> >        struct delayed_work work;
>> >        spinlock_t lock;
>> > From 352fa6ad16b89f8ffd1a93b4419b1a8f2259feab Mon Sep 17 00:00:00 2001
>> > From: Mel Gorman <mel@csn.ul.ie>
>> > Date: Tue, 2 Mar 2010 22:24:19 +0000
>> > Subject: [PATCH] tty: Take a 256 byte padding into account when buffering below sub-page units
>> >
>> > The TTY layer takes some care to ensure that only sub-page allocations
>> > are made with interrupts disabled. It does this by setting a goal of
>> > "TTY_BUFFER_PAGE" to allocate. Unfortunately, while TTY_BUFFER_PAGE takes the
>> > size of tty_buffer into account, it fails to account that tty_buffer_find()
>> > rounds the buffer size out to the next 256 byte boundary before adding on
>> > the size of the tty_buffer.
>> >
>> > This patch adjusts the TTY_BUFFER_PAGE calculation to take into account the
>> > size of the tty_buffer and the padding. Once applied, tty_buffer_alloc()
>> > should not require high-order allocations.
>> >
>> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> > Cc: stable <stable@kernel.org>
>> > Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
>> >
>> > diff --git a/include/linux/tty.h b/include/linux/tty.h
>> > index 568369a..593228a 100644
>> > --- a/include/linux/tty.h
>> > +++ b/include/linux/tty.h
>> > @@ -70,12 +70,13 @@ struct tty_buffer {
>> >
>> >  /*
>> >  * We default to dicing tty buffer allocations to this many characters
>> > - * in order to avoid multiple page allocations. We assume tty_buffer itself
>> > - * is under 256 bytes. See tty_buffer_find for the allocation logic this
>> > - * must match
>> > + * in order to avoid multiple page allocations. We know the size of
>> > + * tty_buffer itself but it must also be taken into account that the
>> > + * the buffer is 256 byte aligned. See tty_buffer_find for the allocation
>> > + * logic this must match
>> >  */
>> >
>> > -#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
>> > +#define TTY_BUFFER_PAGE        (((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)
>> >
>> >
>> >  struct tty_bufhead {
>> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> > index bf9213b..5ba0d9a 100644
>> > --- a/include/linux/memcontrol.h
>> > +++ b/include/linux/memcontrol.h
>> > @@ -94,7 +94,6 @@ extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
>> >  extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
>> >                                                        int priority);
>> >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>> > -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
>> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>> >                                       struct zone *zone,
>> >                                       enum lru_list lru);
>> > @@ -243,12 +242,6 @@ mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>> >        return 1;
>> >  }
>> >
>> > -static inline int
>> > -mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>> > -{
>> > -       return 1;
>> > -}
>> > -
>> >  static inline unsigned long
>> >  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>> >                         enum lru_list lru)
>> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> > index 66035bf..bbb0eda 100644
>> > --- a/mm/memcontrol.c
>> > +++ b/mm/memcontrol.c
>> > @@ -843,17 +843,6 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>> >        return 0;
>> >  }
>> >
>> > -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>> > -{
>> > -       unsigned long active;
>> > -       unsigned long inactive;
>> > -
>> > -       inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
>> > -       active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
>> > -
>> > -       return (active > inactive);
>> > -}
>> > -
>> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>> >                                       struct zone *zone,
>> >                                       enum lru_list lru)
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 692807f..5512301 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1428,59 +1428,13 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
>> >        return low;
>> >  }
>> >
>> > -static int inactive_file_is_low_global(struct zone *zone)
>> > -{
>> > -       unsigned long active, inactive;
>> > -
>> > -       active = zone_page_state(zone, NR_ACTIVE_FILE);
>> > -       inactive = zone_page_state(zone, NR_INACTIVE_FILE);
>> > -
>> > -       return (active > inactive);
>> > -}
>> > -
>> > -/**
>> > - * inactive_file_is_low - check if file pages need to be deactivated
>> > - * @zone: zone to check
>> > - * @sc:   scan control of this context
>> > - *
>> > - * When the system is doing streaming IO, memory pressure here
>> > - * ensures that active file pages get deactivated, until more
>> > - * than half of the file pages are on the inactive list.
>> > - *
>> > - * Once we get to that situation, protect the system's working
>> > - * set from being evicted by disabling active file page aging.
>> > - *
>> > - * This uses a different ratio than the anonymous pages, because
>> > - * the page cache uses a use-once replacement algorithm.
>> > - */
>> > -static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
>> > -{
>> > -       int low;
>> > -
>> > -       if (scanning_global_lru(sc))
>> > -               low = inactive_file_is_low_global(zone);
>> > -       else
>> > -               low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
>> > -       return low;
>> > -}
>> > -
>> > -static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
>> > -                               int file)
>> > -{
>> > -       if (file)
>> > -               return inactive_file_is_low(zone, sc);
>> > -       else
>> > -               return inactive_anon_is_low(zone, sc);
>> > -}
>> > -
>> >  static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>> >        struct zone *zone, struct scan_control *sc, int priority)
>> >  {
>> >        int file = is_file_lru(lru);
>> >
>> > -       if (is_active_lru(lru)) {
>> > -               if (inactive_list_is_low(zone, sc, file))
>> > -                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> > +       if (lru == LRU_ACTIVE_FILE) {
>> > +               shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> >                return 0;
>> >        }
>> >
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-24 12:56                         ` Corrado Zoccolo
  0 siblings, 0 replies; 136+ messages in thread
From: Corrado Zoccolo @ 2010-03-24 12:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Rik van Riel,
	Johannes Weiner

On Wed, Mar 24, 2010 at 12:48 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Tue, Mar 23, 2010 at 10:35:20PM +0100, Corrado Zoccolo wrote:
>> Hi Mel,
>> On Tue, Mar 23, 2010 at 12:50 AM, Mel Gorman <mel@csn.ul.ie> wrote:
>> > On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
>> >> On Mon, 15 Mar 2010 13:34:50 +0100
>> >> Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
>> >>
>> >> > c) If direct reclaim did reasonable progress in try_to_free but did not
>> >> > get a page, AND there is no write in flight at all then let it try again
>> >> > to free up something.
>> >> > This could be extended by some kind of max retry to avoid some weird
>> >> > looping cases as well.
>> >> >
>> >> > d) Another way might be as easy as letting congestion_wait return
>> >> > immediately if there are no outstanding writes - this would keep the
>> >> > behavior for cases with write and avoid the "running always in full
>> >> > timeout" issue without writes.
>> >>
>> >> They're pretty much equivalent and would work.  But there are two
>> >> things I still don't understand:
>> >>
>> >> 1: Why is direct reclaim calling congestion_wait() at all?  If no
>> >> writes are going on there's lots of clean pagecache around so reclaim
>> >> should trivially succeed.  What's preventing it from doing so?
>> >>
>> >> 2: This is, I think, new behaviour.  A regression.  What caused it?
>> >>
>> >
>> > 120+ kernels and a lot of hurt later;
>> >
>> > Short summary - The number of times kswapd and the page allocator have been
>> >        calling congestion_wait and the length of time it spends in there
>> >        has been increasing since 2.6.29. Oddly, it has little to do
>> >        with the page allocator itself.
>> >
>> > Test scenario
>> > =============
>> > X86-64 machine 1 socket 4 cores
>> > 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
>> >        on-board and a piece of crap, and a decent RAID card could blow
>> >        the budget.
>> > Booted mem=256 to ensure it is fully IO-bound and match closer to what
>> >        Christian was doing
>> >
>> > At each test, the disks are partitioned, the raid arrays created and an
>> > ext2 filesystem created. iozone sequential read/write tests are run with
>> > increasing number of processes up to 64. Each test creates 8G of files. i.e.
>> > 1 process = 8G. 2 processes = 2x4G etc
>> >
>> >        iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
>> >        iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
>> >        etc.
>> >
>> > Metrics
>> > =======
>> >
>> > Each kernel was instrumented to collected the following stats
>> >
>> >        pg-Stall        Page allocator stalled calling congestion_wait
>> >        pg-Wait         The amount of time spent in congestion_wait
>> >        pg-Rclm         Pages reclaimed by direct reclaim
>> >        ksd-stall       balance_pgdat() (ie kswapd) staled on congestion_wait
>> >        ksd-wait        Time spend by balance_pgdat in congestion_wait
>> >
>> > Large differences in this do not necessarily show up in iozone because the
>> > disks are so slow that the stalls are a tiny percentage overall. However, in
>> > the event that there are many disks, it might be a greater problem. I believe
>> > Christian is hitting a corner case where small delays trigger a much larger
>> > stall.
>> >
>> > Why The Increases
>> > =================
>> >
>> > The big problem here is that there was no one change. Instead, it has been
>> > a steady build-up of a number of problems. The ones I identified are in the
>> > block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
>> > but need backporting and others I expect are a major surprise. Whether they
>> > are worth backporting or not heavily depends on whether Christian's problem
>> > is resolved.
>> >
>> > Some of the "fixes" below are obviously not fixes at all. Gathering this data
>> > took a significant amount of time. It'd be nice if people more familiar with
>> > the relevant problem patches could spring a theory or patch.
>> >
>> > The Problems
>> > ============
>> >
>> > 1. Block layer congestion queue async/sync difficulty
>> >        fix title: asyncconfusion
>> >        fixed in mainline? yes, in 2.6.31
>> >        affects: 2.6.30
>> >
>> >        2.6.30 replaced congestion queues based on read/write with sync/async
>> >        in commit 1faa16d2. Problems were identified with this and fixed in
>> >        2.6.31 but not backported. Backporting 8aa7e847 and 373c0a7e brings
>> >        2.6.30 in line with 2.6.29 performance. It's not an issue for 2.6.31.
>> >
>> > 2. TTY using high order allocations more frequently
>> >        fix title: ttyfix
>> >        fixed in mainline? yes, in 2.6.34-rc2
>> >        affects: 2.6.31 to 2.6.34-rc1
>> >
>> >        2.6.31 made pty's use the same buffering logic as tty.  Unfortunately,
>> >        it was also allowed to make high-order GFP_ATOMIC allocations. This
>> >        triggers some high-order reclaim and introduces some stalls. It's
>> >        fixed in 2.6.34-rc2 but needs back-porting.
>> >
>> > 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>> >        fix title: revertevict
>> >        fixed in mainline? no
>> >        affects: 2.6.31 to now
>> >
>> >        For reasons that are not immediately obvious, the evict-once patches
>> >        *really* hurt the time spent on congestion and the number of pages
>> >        reclaimed. Rik, I'm afaid I'm punting this to you for explanation
>> >        because clearly you tested this for AIM7 and might have some
>> >        theories. For the purposes of testing, I just reverted the changes.
>> >
>> > 4. CFQ scheduler fairness commit 718eee057 causes some hurt
>> >        fix title: none available
>> >        fixed in mainline? no
>> >        affects: 2.6.33 to now
>> >
>> >        A bisection finger printed this patch as being a problem introduced
>> >        between 2.6.32 and 2.6.33. It increases a small amount the number of
>> >        times the page allocator stalls but drastically increased the number
>> >        of pages reclaimed. It's not clear why the commit is such a problem.
>> >
>> >        Unfortunately, I could not test a revert of this patch. The CFQ and
>> >        block IO changes made in this window were extremely convulated and
>> >        overlapped heavily with a large number of patches altering the same
>> >        code as touched by commit 718eee057. I tried reverting everything
>> >        made on and after this commit but the results were unsatisfactory.
>> >
>> >        Hence, there is no fix in the results below
>> >
>> > Results
>> > =======
>> >
>> > Here are the highlights of kernels tested. I'm omitting the bisection
>> > results for obvious reasons. The metrics were gathered at two points;
>> > after filesystem creation and after IOZone completed.
>> >
>> > The lower the number for each metric, the better.
>> >
>> >                                                     After Filesystem Setup                                       After IOZone
>> >                                         pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait
>> > 2.6.29                                          0        0        0          2         1               4        3      183        152         0
>> > 2.6.30                                          1        5       34          1        25             783     3752    31939         76         0
>> > 2.6.30-asyncconfusion                           0        0        0          3         1              44       60     2656        893         0
>> > 2.6.30.10                                       0        0        0          2        43             777     3699    32661         74         0
>> > 2.6.30.10-asyncconfusion                        0        0        0          2         1              36       88     1699       1114         0
>> >
>> > asyncconfusion can be back-ported easily to 2.6.30.10. Performance is not
>> > perfectly in line with 2.6.29 but it's better.
>> >
>> > 2.6.31                                          0        0        0          3         1           49175   245727  2730626     176344         0
>> > 2.6.31-revertevict                              0        0        0          3         2              31      147     1887        114         0
>> > 2.6.31-ttyfix                                   0        0        0          2         2           46238   231000  2549462     170912         0
>> > 2.6.31-ttyfix-revertevict                       0        0        0          3         0               7       35      448        121         0
>> > 2.6.31.12                                       0        0        0          2         0           68897   344268  4050646     183523         0
>> > 2.6.31.12-revertevict                           0        0        0          3         1              18       87     1009        147         0
>> > 2.6.31.12-ttyfix                                0        0        0          2         0           62797   313805  3786539     173398         0
>> > 2.6.31.12-ttyfix-revertevict                    0        0        0          3         2               7       35      448        199         0
>> >
>> > Applying the tty fixes from 2.6.34-rc2 and getting rid of the evict-once
>> > patches bring things back in line with 2.6.29 again.
>> >
>> > Rik, any theory on evict-once?
>> >
>> > 2.6.32                                          0        0        0          3         2           44437   221753  2760857     132517         0
>> > 2.6.32-revertevict                              0        0        0          3         2              35       14     1570        460         0
>> > 2.6.32-ttyfix                                   0        0        0          2         0           60770   303206  3659254     166293         0
>> > 2.6.32-ttyfix-revertevict                       0        0        0          3         0              55       62     2496        494         0
>> > 2.6.32.10                                       0        0        0          2         1           90769   447702  4251448     234868         0
>> > 2.6.32.10-revertevict                           0        0        0          3         2             148      597     8642        478         0
>> > 2.6.32.10-ttyfix                                0        0        0          3         0           91729   453337  4374070     238593         0
>> > 2.6.32.10-ttyfix-revertevict                    0        0        0          3         1              65      146     3408        347         0
>> >
>> > Again, fixing tty and reverting evict-once helps bring figures more in line
>> > with 2.6.29.
>> >
>> > 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
>> > 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
>> > 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
>> > 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
>> > 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
>> > 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
>> > 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
>> > 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0
>> >
>> > At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
>> > queues" has lodged itself deep within CGQ and I couldn't tear it out or
>> > see how to fix it. Fixing tty and reverting evict-once helps but the number
>> > of stalls is significantly increased and a much larger number of pages get
>> > reclaimed overall.
>> >
>> > Corrado?
>>
>> The major changes in I/O scheduing behaviour are:
>> * buffered writes:
>>    before we could schedule few writes, then interrupt them to do
>>   some reads, and then go back to writes; now we guarantee some
>>   uninterruptible time slice for writes, but the delay between two
>>   slices is increased. The total write throughput averaged over a time
>>   window larger than 300ms should be comparable, or even better with
>>   2.6.33. Note that the commit you cite has introduced a bug regarding
>>   write throughput on NCQ disks that was later fixed by 1efe8fe1, merged
>>   before 2.6.33 (this may lead to confusing bisection results).
>
> This is true. The CFQ and block IO changes in that window are almost
> impossible to properly bisect and isolate individual changes. There were
> multiple dependant patches that modified each others changes. It's unclear
> if this modification can even be isolated although your suggestion below
> is the best bet.
>
>> * reads (and sync writes):
>>   * before, we serviced a single process for 100ms, then switched to
>>     an other, and so on.
>>   * after, we go round robin for random requests (they get a unified
>>     time slice, like buffered writes do), and we have consecutive time
>>     slices for sequential requests, but the length of the slice is reduced
>>     when the number of concurrent processes doing I/O increases.
>>
>> This means that with 16 processes doing sequential I/O on the same
>> disk, before you were switching between processes every 100ms, and now
>> every 32ms. The old behaviour can be brought back by setting
>> /sys/block/sd*/queue/iosched/low_latency to 0.
>
> Will try this and see what happens.
>
>> For random I/O, the situation (going round robin, it will translate to
>> switching every 8 ms on average) is not revertable via flags.
>>
>
> At the moment, I'm not testing random IO so it shouldn't be a factor in
> the tests.
>
>> >
>> > 2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0
>> > 2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0
>> > 2.6.34-rc1-ttyfix                               0        0        0          1         1          159603   791056  5186082     223458         0
>> > 2.6.34-rc1-ttyfix-revertevict                   0        0        0          0         0            1549     7641    50484        679         0
>> >
>> > Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get
>> > back to 2.6.29 performance.
>> >
>> > Next Steps
>> > ==========
>> >
>> > Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
>> > 2.6.30.x (assuming that is still maintained, Greg?)?
>> >
>> > Rik, any suggestions on what can be done with evict-once?
>> >
>> > Corrado, any suggestions on what can be done with CFQ?
>>
>> If my intuition that switching between processes too often is
>> detrimental when you have memory pressure (higher probability to need
>> to re-page-in some of the pages that were just discarded), I suggest
>> trying setting low_latency to 0, and maybe increasing the slice_sync
>> (to get more slice to a single process before switching to an other),
>> slice_async (to give more uninterruptible time to buffered writes) and
>> slice_async_rq (to higher the limit of consecutive write requests can
>> be sent to disk).
>> While this would normally lead to a bad user experience on a system
>> with plenty of memory, it should keep things acceptable when paging in
>> / swapping / dirty page writeback is overwhelming.
>>
>
> Christian, would you be able to follow the same instructions and see can
> you make a difference to your test? It is known for your situation that
> memory is unusually low for size of your workload so it's a possibility.

An other parameter that is worth tweaking in this case is the
readahead size. If readahead size is too large for the available
memory, we might be reading, then discarding, and then reading again
the same pages.

I would also like to see some iostats output (iostats -kx 5 >
iostats.log) during the experiment run, to better understand what's
happening.

Thanks,
Corrado

>
> Thanks Corrado.
>
>> Corrado
>>
>> >
>> > Christian, can you test the following amalgamated patch on 2.6.32.10 and
>> > 2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
>> > cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
>> > and revertevict. If your problem goes away, it implies that the stalls I
>> > can measure are roughly correlated to the more significant problem you have.
>> >
>> > ===== CUT HERE =====
>> >
>> > From d9661adfb8e53a7647360140af3b92284cbe52d4 Mon Sep 17 00:00:00 2001
>> > From: Alan Cox <alan@linux.intel.com>
>> > Date: Thu, 18 Feb 2010 16:43:47 +0000
>> > Subject: [PATCH] tty: Keep the default buffering to sub-page units
>> >
>> > We allocate during interrupts so while our buffering is normally diced up
>> > small anyway on some hardware at speed we can pressure the VM excessively
>> > for page pairs. We don't really need big buffers to be linear so don't try
>> > so hard.
>> >
>> > In order to make this work well we will tidy up excess callers to request_room,
>> > which cannot itself enforce this break up.
>> >
>> > Signed-off-by: Alan Cox <alan@linux.intel.com>
>> > Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
>> >
>> > diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
>> > index 66fa4e1..f27c4d6 100644
>> > --- a/drivers/char/tty_buffer.c
>> > +++ b/drivers/char/tty_buffer.c
>> > @@ -247,7 +247,8 @@ int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars,
>> >  {
>> >        int copied = 0;
>> >        do {
>> > -               int space = tty_buffer_request_room(tty, size - copied);
>> > +               int goal = min(size - copied, TTY_BUFFER_PAGE);
>> > +               int space = tty_buffer_request_room(tty, goal);
>> >                struct tty_buffer *tb = tty->buf.tail;
>> >                /* If there is no space then tb may be NULL */
>> >                if (unlikely(space == 0))
>> > @@ -283,7 +284,8 @@ int tty_insert_flip_string_flags(struct tty_struct *tty,
>> >  {
>> >        int copied = 0;
>> >        do {
>> > -               int space = tty_buffer_request_room(tty, size - copied);
>> > +               int goal = min(size - copied, TTY_BUFFER_PAGE);
>> > +               int space = tty_buffer_request_room(tty, goal);
>> >                struct tty_buffer *tb = tty->buf.tail;
>> >                /* If there is no space then tb may be NULL */
>> >                if (unlikely(space == 0))
>> > diff --git a/include/linux/tty.h b/include/linux/tty.h
>> > index 6abfcf5..d96e588 100644
>> > --- a/include/linux/tty.h
>> > +++ b/include/linux/tty.h
>> > @@ -68,6 +68,16 @@ struct tty_buffer {
>> >        unsigned long data[0];
>> >  };
>> >
>> > +/*
>> > + * We default to dicing tty buffer allocations to this many characters
>> > + * in order to avoid multiple page allocations. We assume tty_buffer itself
>> > + * is under 256 bytes. See tty_buffer_find for the allocation logic this
>> > + * must match
>> > + */
>> > +
>> > +#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
>> > +
>> > +
>> >  struct tty_bufhead {
>> >        struct delayed_work work;
>> >        spinlock_t lock;
>> > From 352fa6ad16b89f8ffd1a93b4419b1a8f2259feab Mon Sep 17 00:00:00 2001
>> > From: Mel Gorman <mel@csn.ul.ie>
>> > Date: Tue, 2 Mar 2010 22:24:19 +0000
>> > Subject: [PATCH] tty: Take a 256 byte padding into account when buffering below sub-page units
>> >
>> > The TTY layer takes some care to ensure that only sub-page allocations
>> > are made with interrupts disabled. It does this by setting a goal of
>> > "TTY_BUFFER_PAGE" to allocate. Unfortunately, while TTY_BUFFER_PAGE takes the
>> > size of tty_buffer into account, it fails to account that tty_buffer_find()
>> > rounds the buffer size out to the next 256 byte boundary before adding on
>> > the size of the tty_buffer.
>> >
>> > This patch adjusts the TTY_BUFFER_PAGE calculation to take into account the
>> > size of the tty_buffer and the padding. Once applied, tty_buffer_alloc()
>> > should not require high-order allocations.
>> >
>> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>> > Cc: stable <stable@kernel.org>
>> > Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
>> >
>> > diff --git a/include/linux/tty.h b/include/linux/tty.h
>> > index 568369a..593228a 100644
>> > --- a/include/linux/tty.h
>> > +++ b/include/linux/tty.h
>> > @@ -70,12 +70,13 @@ struct tty_buffer {
>> >
>> >  /*
>> >  * We default to dicing tty buffer allocations to this many characters
>> > - * in order to avoid multiple page allocations. We assume tty_buffer itself
>> > - * is under 256 bytes. See tty_buffer_find for the allocation logic this
>> > - * must match
>> > + * in order to avoid multiple page allocations. We know the size of
>> > + * tty_buffer itself but it must also be taken into account that the
>> > + * the buffer is 256 byte aligned. See tty_buffer_find for the allocation
>> > + * logic this must match
>> >  */
>> >
>> > -#define TTY_BUFFER_PAGE                ((PAGE_SIZE  - 256) / 2)
>> > +#define TTY_BUFFER_PAGE        (((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)
>> >
>> >
>> >  struct tty_bufhead {
>> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> > index bf9213b..5ba0d9a 100644
>> > --- a/include/linux/memcontrol.h
>> > +++ b/include/linux/memcontrol.h
>> > @@ -94,7 +94,6 @@ extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
>> >  extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
>> >                                                        int priority);
>> >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>> > -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
>> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>> >                                       struct zone *zone,
>> >                                       enum lru_list lru);
>> > @@ -243,12 +242,6 @@ mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>> >        return 1;
>> >  }
>> >
>> > -static inline int
>> > -mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>> > -{
>> > -       return 1;
>> > -}
>> > -
>> >  static inline unsigned long
>> >  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>> >                         enum lru_list lru)
>> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> > index 66035bf..bbb0eda 100644
>> > --- a/mm/memcontrol.c
>> > +++ b/mm/memcontrol.c
>> > @@ -843,17 +843,6 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>> >        return 0;
>> >  }
>> >
>> > -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>> > -{
>> > -       unsigned long active;
>> > -       unsigned long inactive;
>> > -
>> > -       inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
>> > -       active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
>> > -
>> > -       return (active > inactive);
>> > -}
>> > -
>> >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>> >                                       struct zone *zone,
>> >                                       enum lru_list lru)
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 692807f..5512301 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1428,59 +1428,13 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
>> >        return low;
>> >  }
>> >
>> > -static int inactive_file_is_low_global(struct zone *zone)
>> > -{
>> > -       unsigned long active, inactive;
>> > -
>> > -       active = zone_page_state(zone, NR_ACTIVE_FILE);
>> > -       inactive = zone_page_state(zone, NR_INACTIVE_FILE);
>> > -
>> > -       return (active > inactive);
>> > -}
>> > -
>> > -/**
>> > - * inactive_file_is_low - check if file pages need to be deactivated
>> > - * @zone: zone to check
>> > - * @sc:   scan control of this context
>> > - *
>> > - * When the system is doing streaming IO, memory pressure here
>> > - * ensures that active file pages get deactivated, until more
>> > - * than half of the file pages are on the inactive list.
>> > - *
>> > - * Once we get to that situation, protect the system's working
>> > - * set from being evicted by disabling active file page aging.
>> > - *
>> > - * This uses a different ratio than the anonymous pages, because
>> > - * the page cache uses a use-once replacement algorithm.
>> > - */
>> > -static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
>> > -{
>> > -       int low;
>> > -
>> > -       if (scanning_global_lru(sc))
>> > -               low = inactive_file_is_low_global(zone);
>> > -       else
>> > -               low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
>> > -       return low;
>> > -}
>> > -
>> > -static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
>> > -                               int file)
>> > -{
>> > -       if (file)
>> > -               return inactive_file_is_low(zone, sc);
>> > -       else
>> > -               return inactive_anon_is_low(zone, sc);
>> > -}
>> > -
>> >  static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>> >        struct zone *zone, struct scan_control *sc, int priority)
>> >  {
>> >        int file = is_file_lru(lru);
>> >
>> > -       if (is_active_lru(lru)) {
>> > -               if (inactive_list_is_low(zone, sc, file))
>> > -                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> > +       if (lru == LRU_ACTIVE_FILE) {
>> > +               shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> >                return 0;
>> >        }
>> >
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-22 23:50                   ` Mel Gorman
@ 2010-03-24 13:13                     ` Johannes Weiner
  -1 siblings, 0 replies; 136+ messages in thread
From: Johannes Weiner @ 2010-03-24 13:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo,
	Rik van Riel

Hi,

On Mon, Mar 22, 2010 at 11:50:54PM +0000, Mel Gorman wrote:
> On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> > On Mon, 15 Mar 2010 13:34:50 +0100
> > Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> > 
> > > c) If direct reclaim did reasonable progress in try_to_free but did not
> > > get a page, AND there is no write in flight at all then let it try again
> > > to free up something.
> > > This could be extended by some kind of max retry to avoid some weird
> > > looping cases as well.
> > > 
> > > d) Another way might be as easy as letting congestion_wait return
> > > immediately if there are no outstanding writes - this would keep the 
> > > behavior for cases with write and avoid the "running always in full 
> > > timeout" issue without writes.
> > 
> > They're pretty much equivalent and would work.  But there are two
> > things I still don't understand:
> > 
> > 1: Why is direct reclaim calling congestion_wait() at all?  If no
> > writes are going on there's lots of clean pagecache around so reclaim
> > should trivially succeed.  What's preventing it from doing so?
> > 
> > 2: This is, I think, new behaviour.  A regression.  What caused it?
> > 
> 
> 120+ kernels and a lot of hurt later;
> 
> Short summary - The number of times kswapd and the page allocator have been
> 	calling congestion_wait and the length of time it spends in there
> 	has been increasing since 2.6.29. Oddly, it has little to do
> 	with the page allocator itself.
> 
> Test scenario
> =============
> X86-64 machine 1 socket 4 cores
> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
> 	on-board and a piece of crap, and a decent RAID card could blow
> 	the budget.
> Booted mem=256 to ensure it is fully IO-bound and match closer to what
> 	Christian was doing
> 
> At each test, the disks are partitioned, the raid arrays created and an
> ext2 filesystem created. iozone sequential read/write tests are run with
> increasing number of processes up to 64. Each test creates 8G of files. i.e.
> 1 process = 8G. 2 processes = 2x4G etc
> 
> 	iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
> 	iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
> 	etc.
> 
> Metrics
> =======
> 
> Each kernel was instrumented to collected the following stats
> 
> 	pg-Stall	Page allocator stalled calling congestion_wait
> 	pg-Wait		The amount of time spent in congestion_wait
> 	pg-Rclm		Pages reclaimed by direct reclaim
> 	ksd-stall	balance_pgdat() (ie kswapd) staled on congestion_wait
> 	ksd-wait	Time spend by balance_pgdat in congestion_wait
> 
> Large differences in this do not necessarily show up in iozone because the
> disks are so slow that the stalls are a tiny percentage overall. However, in
> the event that there are many disks, it might be a greater problem. I believe
> Christian is hitting a corner case where small delays trigger a much larger
> stall.
> 
> Why The Increases
> =================
> 
> The big problem here is that there was no one change. Instead, it has been
> a steady build-up of a number of problems. The ones I identified are in the
> block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
> but need backporting and others I expect are a major surprise. Whether they
> are worth backporting or not heavily depends on whether Christian's problem
> is resolved.
> 
> Some of the "fixes" below are obviously not fixes at all. Gathering this data
> took a significant amount of time. It'd be nice if people more familiar with
> the relevant problem patches could spring a theory or patch.
> 
> The Problems
> ============

[...]

> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
> 	fix title: revertevict
> 	fixed in mainline? no
> 	affects: 2.6.31 to now
> 
> 	For reasons that are not immediately obvious, the evict-once patches
> 	*really* hurt the time spent on congestion and the number of pages
> 	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
> 	because clearly you tested this for AIM7 and might have some
> 	theories. For the purposes of testing, I just reverted the changes.
> 

[...]

> Results
> =======
> 
> Here are the highlights of kernels tested. I'm omitting the bisection
> results for obvious reasons. The metrics were gathered at two points;
> after filesystem creation and after IOZone completed.
> 
> The lower the number for each metric, the better.
> 
>                                                      After Filesystem Setup                                       After IOZone
>                                          pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait

[...]

> Again, fixing tty and reverting evict-once helps bring figures more in line
> with 2.6.29.
> 
> 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
> 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
> 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
> 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
> 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
> 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
> 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
> 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0
> 
> At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
> queues" has lodged itself deep within CGQ and I couldn't tear it out or
> see how to fix it. Fixing tty and reverting evict-once helps but the number
> of stalls is significantly increased and a much larger number of pages get
> reclaimed overall.
> 
> Corrado?
> 
> 2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0
> 2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0

I was wondering why kswapd would not make any progress and stall without
dirty pages, luckily Rik has better eyes than me.

So if he is right and most inactive pages are under IO (thus locked and
skipped) when kswapd is running, we have two choices:

  1) deactivate pages and reclaim them instead
  2) sleep and wait for IO to finish

The patch in question changes 1) to 2) because it won't scan small active
lists and the inactive list does not shrink in size when rotating busy
pages.

You said pg-Rclm is only direct reclaim.  I assume the sum of reclaimed
pages from kswapd and direct reclaim stays in the same ballpark, only
the ratio shifted towards direct reclaim?

Waiting for the disks seems to be better than going after the working set
but I have a feeling we are waiting for the wrong event to happen there.

I am amazingly ignorant when it comes to the block layer, but glancing over
the queue congestion code, it seems we are waiting for the queue to shrink
below a certain threshold.  Is this correct?

When it comes to the reclaim scanner, however, aren't we more interested in
single completions than in the overall state of the queue?

With such a constant stream of IO as in Mel's test, I could imagine that
the queue never really gets below that threshold (here goes the ignorance part)
and we always hit the timeout.  While what we really want is to be woken
up when, say, SWAP_CLUSTER_MAX pages finished since we went to sleep.

Because at that point there is a chance to reclaim some pages again,
even if a lot of requests are still pending.

	Hannes

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-24 13:13                     ` Johannes Weiner
  0 siblings, 0 replies; 136+ messages in thread
From: Johannes Weiner @ 2010-03-24 13:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo,
	Rik van Riel

Hi,

On Mon, Mar 22, 2010 at 11:50:54PM +0000, Mel Gorman wrote:
> On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> > On Mon, 15 Mar 2010 13:34:50 +0100
> > Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> wrote:
> > 
> > > c) If direct reclaim did reasonable progress in try_to_free but did not
> > > get a page, AND there is no write in flight at all then let it try again
> > > to free up something.
> > > This could be extended by some kind of max retry to avoid some weird
> > > looping cases as well.
> > > 
> > > d) Another way might be as easy as letting congestion_wait return
> > > immediately if there are no outstanding writes - this would keep the 
> > > behavior for cases with write and avoid the "running always in full 
> > > timeout" issue without writes.
> > 
> > They're pretty much equivalent and would work.  But there are two
> > things I still don't understand:
> > 
> > 1: Why is direct reclaim calling congestion_wait() at all?  If no
> > writes are going on there's lots of clean pagecache around so reclaim
> > should trivially succeed.  What's preventing it from doing so?
> > 
> > 2: This is, I think, new behaviour.  A regression.  What caused it?
> > 
> 
> 120+ kernels and a lot of hurt later;
> 
> Short summary - The number of times kswapd and the page allocator have been
> 	calling congestion_wait and the length of time it spends in there
> 	has been increasing since 2.6.29. Oddly, it has little to do
> 	with the page allocator itself.
> 
> Test scenario
> =============
> X86-64 machine 1 socket 4 cores
> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
> 	on-board and a piece of crap, and a decent RAID card could blow
> 	the budget.
> Booted mem=256 to ensure it is fully IO-bound and match closer to what
> 	Christian was doing
> 
> At each test, the disks are partitioned, the raid arrays created and an
> ext2 filesystem created. iozone sequential read/write tests are run with
> increasing number of processes up to 64. Each test creates 8G of files. i.e.
> 1 process = 8G. 2 processes = 2x4G etc
> 
> 	iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
> 	iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
> 	etc.
> 
> Metrics
> =======
> 
> Each kernel was instrumented to collected the following stats
> 
> 	pg-Stall	Page allocator stalled calling congestion_wait
> 	pg-Wait		The amount of time spent in congestion_wait
> 	pg-Rclm		Pages reclaimed by direct reclaim
> 	ksd-stall	balance_pgdat() (ie kswapd) staled on congestion_wait
> 	ksd-wait	Time spend by balance_pgdat in congestion_wait
> 
> Large differences in this do not necessarily show up in iozone because the
> disks are so slow that the stalls are a tiny percentage overall. However, in
> the event that there are many disks, it might be a greater problem. I believe
> Christian is hitting a corner case where small delays trigger a much larger
> stall.
> 
> Why The Increases
> =================
> 
> The big problem here is that there was no one change. Instead, it has been
> a steady build-up of a number of problems. The ones I identified are in the
> block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
> but need backporting and others I expect are a major surprise. Whether they
> are worth backporting or not heavily depends on whether Christian's problem
> is resolved.
> 
> Some of the "fixes" below are obviously not fixes at all. Gathering this data
> took a significant amount of time. It'd be nice if people more familiar with
> the relevant problem patches could spring a theory or patch.
> 
> The Problems
> ============

[...]

> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
> 	fix title: revertevict
> 	fixed in mainline? no
> 	affects: 2.6.31 to now
> 
> 	For reasons that are not immediately obvious, the evict-once patches
> 	*really* hurt the time spent on congestion and the number of pages
> 	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
> 	because clearly you tested this for AIM7 and might have some
> 	theories. For the purposes of testing, I just reverted the changes.
> 

[...]

> Results
> =======
> 
> Here are the highlights of kernels tested. I'm omitting the bisection
> results for obvious reasons. The metrics were gathered at two points;
> after filesystem creation and after IOZone completed.
> 
> The lower the number for each metric, the better.
> 
>                                                      After Filesystem Setup                                       After IOZone
>                                          pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait        pg-Stall  pg-Wait  pg-Rclm  ksd-stall  ksd-wait

[...]

> Again, fixing tty and reverting evict-once helps bring figures more in line
> with 2.6.29.
> 
> 2.6.33                                          0        0        0          3         0          152248   754226  4940952     267214         0
> 2.6.33-revertevict                              0        0        0          3         0             883     4306    28918        507         0
> 2.6.33-ttyfix                                   0        0        0          3         0          157831   782473  5129011     237116         0
> 2.6.33-ttyfix-revertevict                       0        0        0          2         0            1056     5235    34796        519         0
> 2.6.33.1                                        0        0        0          3         1          156422   776724  5078145     234938         0
> 2.6.33.1-revertevict                            0        0        0          2         0            1095     5405    36058        477         0
> 2.6.33.1-ttyfix                                 0        0        0          3         1          136324   673148  4434461     236597         0
> 2.6.33.1-ttyfix-revertevict                     0        0        0          1         1            1339     6624    43583        466         0
> 
> At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
> queues" has lodged itself deep within CGQ and I couldn't tear it out or
> see how to fix it. Fixing tty and reverting evict-once helps but the number
> of stalls is significantly increased and a much larger number of pages get
> reclaimed overall.
> 
> Corrado?
> 
> 2.6.34-rc1                                      0        0        0          1         1          150629   746901  4895328     239233         0
> 2.6.34-rc1-revertevict                          0        0        0          1         0            2595    12901    84988        622         0

I was wondering why kswapd would not make any progress and stall without
dirty pages, luckily Rik has better eyes than me.

So if he is right and most inactive pages are under IO (thus locked and
skipped) when kswapd is running, we have two choices:

  1) deactivate pages and reclaim them instead
  2) sleep and wait for IO to finish

The patch in question changes 1) to 2) because it won't scan small active
lists and the inactive list does not shrink in size when rotating busy
pages.

You said pg-Rclm is only direct reclaim.  I assume the sum of reclaimed
pages from kswapd and direct reclaim stays in the same ballpark, only
the ratio shifted towards direct reclaim?

Waiting for the disks seems to be better than going after the working set
but I have a feeling we are waiting for the wrong event to happen there.

I am amazingly ignorant when it comes to the block layer, but glancing over
the queue congestion code, it seems we are waiting for the queue to shrink
below a certain threshold.  Is this correct?

When it comes to the reclaim scanner, however, aren't we more interested in
single completions than in the overall state of the queue?

With such a constant stream of IO as in Mel's test, I could imagine that
the queue never really gets below that threshold (here goes the ignorance part)
and we always hit the timeout.  While what we really want is to be woken
up when, say, SWAP_CLUSTER_MAX pages finished since we went to sleep.

Because at that point there is a chance to reclaim some pages again,
even if a lot of requests are still pending.

	Hannes

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-23 22:29                     ` Rik van Riel
@ 2010-03-24 14:50                       ` Mel Gorman
  -1 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-24 14:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo,
	Johannes Weiner

On Tue, Mar 23, 2010 at 06:29:59PM -0400, Rik van Riel wrote:
> On 03/22/2010 07:50 PM, Mel Gorman wrote:
>
>> Test scenario
>> =============
>> X86-64 machine 1 socket 4 cores
>> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
>> 	on-board and a piece of crap, and a decent RAID card could blow
>> 	the budget.
>> Booted mem=256 to ensure it is fully IO-bound and match closer to what
>> 	Christian was doing
>
> With that many disks, you can easily have dozens of megabytes
> of data in flight to the disk at once.  That is a major
> fraction of memory.
>

That is easily possible. Note, I'm not maintaining this workload configuration
is a good idea.

The background to this problem is Christian running a disk-intensive iozone
workload over many CPUs and disks with limited memory. It's already known
that if he added a small amount of extra memory, the problem went away.
The problem was a massive throughput regression and a bisect pinpointed
two patches (both mine) but neither make sense. One altered the order pages
come back from lists but not availability and his hardware does no automatic
merging. A second does alter the availility of pages via the per-cpu lists
but reverting the behaviour didn't help.

The first fix to this was to replace congestion_wait with a waitqueue
that woke up processes if the watermarks were met. This fixed
Christian's problem but Andrew wants to pin the underlying cause.

I strongly suspect that evict-once behaves sensibly when memory is ample
but in this particular case, it's not helping.

> In fact, you might have all of the inactive file pages under
> IO...
>

Possibly. The tests have a write and a read phase but I wasn't
collecting the data with sufficient granularity to see which of the
tests are actually stalling.

>> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>> 	fix title: revertevict
>> 	fixed in mainline? no
>> 	affects: 2.6.31 to now
>>
>> 	For reasons that are not immediately obvious, the evict-once patches
>> 	*really* hurt the time spent on congestion and the number of pages
>> 	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
>> 	because clearly you tested this for AIM7 and might have some
>> 	theories. For the purposes of testing, I just reverted the changes.
>
> The patch helped IO tests with reasonable amounts of memory
> available, because the VM can cache frequently used data
> much more effectively.
>
> This comes at the cost of caching less recently accessed
> use-once data, which should not be an issue since the data
> is only used once...
>

Indeed. With or without evict-once, I'd have an expectation of all the
pages being recycled anyway because of the amount of data involved.

>> Rik, any theory on evict-once?
>
> No real theories yet, just the observation that your revert
> appears to be buggy (see below) and the possibility that your
> test may have all of the inactive file pages under IO...
>

Bah. I had the initial revert right and screwed up reverting from
2.6.32.10 on. I'm rerunning the tests. Is this right?

-       if (is_active_lru(lru)) {
-               if (inactive_list_is_low(zone, sc, file))
-                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
+       if (is_active_lru(lru)) {
+               shrink_active_list(nr_to_scan, zone, sc, priority, file);
                return 0;


> Can you reproduce the stall if you lower the dirty limits?
>

I'm rerunning the revertevict patches at the moment. When they complete,
I'll experiment with dirty limits. Any suggested values or will I just
increase it by some arbitrary amount and see what falls out? e.g.
increse dirty_ratio to 80.

>>   static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>>   	struct zone *zone, struct scan_control *sc, int priority)
>>   {
>>   	int file = is_file_lru(lru);
>>
>> -	if (is_active_lru(lru)) {
>> -		if (inactive_list_is_low(zone, sc, file))
>> -		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> +	if (lru == LRU_ACTIVE_FILE) {
>> +		shrink_active_list(nr_to_scan, zone, sc, priority, file);
>>   		return 0;
>>   	}
>
> Your revert is buggy.  With this change, anonymous pages will
> never get deactivated via shrink_list.
>

/me slaps self

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-03-24 14:50                       ` Mel Gorman
  0 siblings, 0 replies; 136+ messages in thread
From: Mel Gorman @ 2010-03-24 14:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Christian Ehrhardt, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo,
	Johannes Weiner

On Tue, Mar 23, 2010 at 06:29:59PM -0400, Rik van Riel wrote:
> On 03/22/2010 07:50 PM, Mel Gorman wrote:
>
>> Test scenario
>> =============
>> X86-64 machine 1 socket 4 cores
>> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
>> 	on-board and a piece of crap, and a decent RAID card could blow
>> 	the budget.
>> Booted mem=256 to ensure it is fully IO-bound and match closer to what
>> 	Christian was doing
>
> With that many disks, you can easily have dozens of megabytes
> of data in flight to the disk at once.  That is a major
> fraction of memory.
>

That is easily possible. Note, I'm not maintaining this workload configuration
is a good idea.

The background to this problem is Christian running a disk-intensive iozone
workload over many CPUs and disks with limited memory. It's already known
that if he added a small amount of extra memory, the problem went away.
The problem was a massive throughput regression and a bisect pinpointed
two patches (both mine) but neither make sense. One altered the order pages
come back from lists but not availability and his hardware does no automatic
merging. A second does alter the availility of pages via the per-cpu lists
but reverting the behaviour didn't help.

The first fix to this was to replace congestion_wait with a waitqueue
that woke up processes if the watermarks were met. This fixed
Christian's problem but Andrew wants to pin the underlying cause.

I strongly suspect that evict-once behaves sensibly when memory is ample
but in this particular case, it's not helping.

> In fact, you might have all of the inactive file pages under
> IO...
>

Possibly. The tests have a write and a read phase but I wasn't
collecting the data with sufficient granularity to see which of the
tests are actually stalling.

>> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>> 	fix title: revertevict
>> 	fixed in mainline? no
>> 	affects: 2.6.31 to now
>>
>> 	For reasons that are not immediately obvious, the evict-once patches
>> 	*really* hurt the time spent on congestion and the number of pages
>> 	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
>> 	because clearly you tested this for AIM7 and might have some
>> 	theories. For the purposes of testing, I just reverted the changes.
>
> The patch helped IO tests with reasonable amounts of memory
> available, because the VM can cache frequently used data
> much more effectively.
>
> This comes at the cost of caching less recently accessed
> use-once data, which should not be an issue since the data
> is only used once...
>

Indeed. With or without evict-once, I'd have an expectation of all the
pages being recycled anyway because of the amount of data involved.

>> Rik, any theory on evict-once?
>
> No real theories yet, just the observation that your revert
> appears to be buggy (see below) and the possibility that your
> test may have all of the inactive file pages under IO...
>

Bah. I had the initial revert right and screwed up reverting from
2.6.32.10 on. I'm rerunning the tests. Is this right?

-       if (is_active_lru(lru)) {
-               if (inactive_list_is_low(zone, sc, file))
-                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
+       if (is_active_lru(lru)) {
+               shrink_active_list(nr_to_scan, zone, sc, priority, file);
                return 0;


> Can you reproduce the stall if you lower the dirty limits?
>

I'm rerunning the revertevict patches at the moment. When they complete,
I'll experiment with dirty limits. Any suggested values or will I just
increase it by some arbitrary amount and see what falls out? e.g.
increse dirty_ratio to 80.

>>   static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>>   	struct zone *zone, struct scan_control *sc, int priority)
>>   {
>>   	int file = is_file_lru(lru);
>>
>> -	if (is_active_lru(lru)) {
>> -		if (inactive_list_is_low(zone, sc, file))
>> -		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> +	if (lru == LRU_ACTIVE_FILE) {
>> +		shrink_active_list(nr_to_scan, zone, sc, priority, file);
>>   		return 0;
>>   	}
>
> Your revert is buggy.  With this change, anonymous pages will
> never get deactivated via shrink_list.
>

/me slaps self

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-03-24 14:50                       ` Mel Gorman
@ 2010-04-19 12:22                         ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-19 12:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo,
	Johannes Weiner, Ehrhardt Christian

Sorry for replying that late, but after digging through another pile of tasks I'm happy to come back to this issue and I'll try to answer all open questions.
Fortunately I'm also able to add a few new insights that might resurrect this discussion^^

For the requested CFQ scheduler tuning, its deadline what is here :-)
So I can't apply all that. But in the past I was already able to show that all the "slowdown" occurs above the block device layer (read back through our threads if interessted about details). But eventually that leaves all lower layer tuning out of the critical zone.

Corrado also asked for iostat data, due to the reason explained above (issue above BDL) it doesn't contain anything much useful as expected.
So I'll just add a one liner of good/bad case to show that things like req-sz etc are the same, but just slower.
This "being slower" is caused by the request arriving in the BDL at a lower rate - caused by our beloved full timeouts in congestion_wait.

Device:         rrqm/s   wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
bad sdb         0.00     0.00    154.50    0.00  70144.00     0.00   908.01     0.62    4.05   2.72  42.00
good sdb        0.00     0.00    270.50    0.00 122624.00     0.00   906.65     1.32    4.94   2.92  79.00


So now coming to the probably most critical part - the evict once discussion in this thread.
I'll try to explain what I found in the meanwhile - let me know whats unclear and I'll add data etc.

In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
Therefore I ran all tests and verifications with that drops.
In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
Without it seems to be bad all the time. So don't cast the patch away due to that discovery :-)

On the good side I was also able to analyze a few more things due to that insight - and it might give us new data to debug the root cause.
Like Mel I also had identified "56e49d21 vmscan: evict use-once pages first" to be related in the past. But without the watermark wait fix, unapplying it 56e49d21 didn't change much for my case so I left this analysis path.

But now after I found dropping caches is the key to "get back good performance" and "subsequent writes for bad performance" even with watermark wait applied I checked what else changes:
- first write/read load after reboot or dropping caches -> read TP good
- second write/read load after reboot or dropping caches -> read TP bad
=> so what changed.

I went through all kind of logs and found something in the system activity report which very probably is related to 56e49d21.
When issuing subsequent writes after I dropped caches to get a clean start I get this in Buffers/Caches from Meminfo:

pre write 1
Buffers:             484 kB
Cached:             5664 kB
pre write 2
Buffers:           33500 kB
Cached:           149856 kB
pre write 3
Buffers:           65564 kB
Cached:           115888 kB
pre write 4
Buffers:           85556 kB
Cached:            97184 kB

It stays at ~85M with more writes which is approx 50% of my free 160M memory.
It can be said that once Buffers reached the 65M level all (no matter how much read load I throw at the system) following read loads will have the bad throughput.
Dropping caches - and by that removing these buffers - gives back the good performance.

So far I found no alternative to a manual drop_caches, but recommending a 30 second cron job dropping caches to get good read performance for customers is not that good anyway.
I checked if the buffers get cleaned some when, but neither a lot of subsequent read loads pushing the pressure towards read page cache (I hoped the buffers would age or something to eventually get thrown out) nor waiting a long time helped.
The system seems to be totally unable to get rid of these buffers without my manual help via drop_caches.

I imagine a huge customer DB running wirtes&reads fine at day, with a nightly large backup that losses 50% read throughput because the kernel keeps 50% buffers all the night - and by that doesn't fit in their night slot - just to draw one realistic scenario.
Is there anything to avoid that behavior to "never free these buffers", but still get all/some of the intended benefits of 56e49d21?

Ideas welcome

P.S. This is still a .32 stable kernel + Mels watermark wait patch based analysis - I plan to check current kernels as well once I find the time, but let me know if there are known obvious fixes related to this issue I should test asap. 

Mel Gorman wrote:
> On Tue, Mar 23, 2010 at 06:29:59PM -0400, Rik van Riel wrote:
>> On 03/22/2010 07:50 PM, Mel Gorman wrote:
>>
>>> Test scenario
>>> =============
>>> X86-64 machine 1 socket 4 cores
>>> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
>>> 	on-board and a piece of crap, and a decent RAID card could blow
>>> 	the budget.
>>> Booted mem=256 to ensure it is fully IO-bound and match closer to what
>>> 	Christian was doing
>> With that many disks, you can easily have dozens of megabytes
>> of data in flight to the disk at once.  That is a major
>> fraction of memory.
>>
> 
> That is easily possible. Note, I'm not maintaining this workload configuration
> is a good idea.
> 
> The background to this problem is Christian running a disk-intensive iozone
> workload over many CPUs and disks with limited memory. It's already known
> that if he added a small amount of extra memory, the problem went away.
> The problem was a massive throughput regression and a bisect pinpointed
> two patches (both mine) but neither make sense. One altered the order pages
> come back from lists but not availability and his hardware does no automatic
> merging. A second does alter the availility of pages via the per-cpu lists
> but reverting the behaviour didn't help.
> 
> The first fix to this was to replace congestion_wait with a waitqueue
> that woke up processes if the watermarks were met. This fixed
> Christian's problem but Andrew wants to pin the underlying cause.
> 
> I strongly suspect that evict-once behaves sensibly when memory is ample
> but in this particular case, it's not helping.
> 
>> In fact, you might have all of the inactive file pages under
>> IO...
>>
> 
> Possibly. The tests have a write and a read phase but I wasn't
> collecting the data with sufficient granularity to see which of the
> tests are actually stalling.
> 
>>> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>>> 	fix title: revertevict
>>> 	fixed in mainline? no
>>> 	affects: 2.6.31 to now
>>>
>>> 	For reasons that are not immediately obvious, the evict-once patches
>>> 	*really* hurt the time spent on congestion and the number of pages
>>> 	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
>>> 	because clearly you tested this for AIM7 and might have some
>>> 	theories. For the purposes of testing, I just reverted the changes.
>> The patch helped IO tests with reasonable amounts of memory
>> available, because the VM can cache frequently used data
>> much more effectively.
>>
>> This comes at the cost of caching less recently accessed
>> use-once data, which should not be an issue since the data
>> is only used once...
>>
> 
> Indeed. With or without evict-once, I'd have an expectation of all the
> pages being recycled anyway because of the amount of data involved.
> 
>>> Rik, any theory on evict-once?
>> No real theories yet, just the observation that your revert
>> appears to be buggy (see below) and the possibility that your
>> test may have all of the inactive file pages under IO...
>>
> 
> Bah. I had the initial revert right and screwed up reverting from
> 2.6.32.10 on. I'm rerunning the tests. Is this right?
> 
> -       if (is_active_lru(lru)) {
> -               if (inactive_list_is_low(zone, sc, file))
> -                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +       if (is_active_lru(lru)) {
> +               shrink_active_list(nr_to_scan, zone, sc, priority, file);
>                 return 0;
> 
> 
>> Can you reproduce the stall if you lower the dirty limits?
>>
> 
> I'm rerunning the revertevict patches at the moment. When they complete,
> I'll experiment with dirty limits. Any suggested values or will I just
> increase it by some arbitrary amount and see what falls out? e.g.
> increse dirty_ratio to 80.
> 
>>>   static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>>>   	struct zone *zone, struct scan_control *sc, int priority)
>>>   {
>>>   	int file = is_file_lru(lru);
>>>
>>> -	if (is_active_lru(lru)) {
>>> -		if (inactive_list_is_low(zone, sc, file))
>>> -		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
>>> +	if (lru == LRU_ACTIVE_FILE) {
>>> +		shrink_active_list(nr_to_scan, zone, sc, priority, file);
>>>   		return 0;
>>>   	}
>> Your revert is buggy.  With this change, anonymous pages will
>> never get deactivated via shrink_list.
>>
> 
> /me slaps self
> 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-19 12:22                         ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-19 12:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Andrew Morton, linux-mm, Nick Piggin, Chris Mason,
	Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo,
	Johannes Weiner, Ehrhardt Christian

Sorry for replying that late, but after digging through another pile of tasks I'm happy to come back to this issue and I'll try to answer all open questions.
Fortunately I'm also able to add a few new insights that might resurrect this discussion^^

For the requested CFQ scheduler tuning, its deadline what is here :-)
So I can't apply all that. But in the past I was already able to show that all the "slowdown" occurs above the block device layer (read back through our threads if interessted about details). But eventually that leaves all lower layer tuning out of the critical zone.

Corrado also asked for iostat data, due to the reason explained above (issue above BDL) it doesn't contain anything much useful as expected.
So I'll just add a one liner of good/bad case to show that things like req-sz etc are the same, but just slower.
This "being slower" is caused by the request arriving in the BDL at a lower rate - caused by our beloved full timeouts in congestion_wait.

Device:         rrqm/s   wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
bad sdb         0.00     0.00    154.50    0.00  70144.00     0.00   908.01     0.62    4.05   2.72  42.00
good sdb        0.00     0.00    270.50    0.00 122624.00     0.00   906.65     1.32    4.94   2.92  79.00


So now coming to the probably most critical part - the evict once discussion in this thread.
I'll try to explain what I found in the meanwhile - let me know whats unclear and I'll add data etc.

In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
Therefore I ran all tests and verifications with that drops.
In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
Without it seems to be bad all the time. So don't cast the patch away due to that discovery :-)

On the good side I was also able to analyze a few more things due to that insight - and it might give us new data to debug the root cause.
Like Mel I also had identified "56e49d21 vmscan: evict use-once pages first" to be related in the past. But without the watermark wait fix, unapplying it 56e49d21 didn't change much for my case so I left this analysis path.

But now after I found dropping caches is the key to "get back good performance" and "subsequent writes for bad performance" even with watermark wait applied I checked what else changes:
- first write/read load after reboot or dropping caches -> read TP good
- second write/read load after reboot or dropping caches -> read TP bad
=> so what changed.

I went through all kind of logs and found something in the system activity report which very probably is related to 56e49d21.
When issuing subsequent writes after I dropped caches to get a clean start I get this in Buffers/Caches from Meminfo:

pre write 1
Buffers:             484 kB
Cached:             5664 kB
pre write 2
Buffers:           33500 kB
Cached:           149856 kB
pre write 3
Buffers:           65564 kB
Cached:           115888 kB
pre write 4
Buffers:           85556 kB
Cached:            97184 kB

It stays at ~85M with more writes which is approx 50% of my free 160M memory.
It can be said that once Buffers reached the 65M level all (no matter how much read load I throw at the system) following read loads will have the bad throughput.
Dropping caches - and by that removing these buffers - gives back the good performance.

So far I found no alternative to a manual drop_caches, but recommending a 30 second cron job dropping caches to get good read performance for customers is not that good anyway.
I checked if the buffers get cleaned some when, but neither a lot of subsequent read loads pushing the pressure towards read page cache (I hoped the buffers would age or something to eventually get thrown out) nor waiting a long time helped.
The system seems to be totally unable to get rid of these buffers without my manual help via drop_caches.

I imagine a huge customer DB running wirtes&reads fine at day, with a nightly large backup that losses 50% read throughput because the kernel keeps 50% buffers all the night - and by that doesn't fit in their night slot - just to draw one realistic scenario.
Is there anything to avoid that behavior to "never free these buffers", but still get all/some of the intended benefits of 56e49d21?

Ideas welcome

P.S. This is still a .32 stable kernel + Mels watermark wait patch based analysis - I plan to check current kernels as well once I find the time, but let me know if there are known obvious fixes related to this issue I should test asap. 

Mel Gorman wrote:
> On Tue, Mar 23, 2010 at 06:29:59PM -0400, Rik van Riel wrote:
>> On 03/22/2010 07:50 PM, Mel Gorman wrote:
>>
>>> Test scenario
>>> =============
>>> X86-64 machine 1 socket 4 cores
>>> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
>>> 	on-board and a piece of crap, and a decent RAID card could blow
>>> 	the budget.
>>> Booted mem=256 to ensure it is fully IO-bound and match closer to what
>>> 	Christian was doing
>> With that many disks, you can easily have dozens of megabytes
>> of data in flight to the disk at once.  That is a major
>> fraction of memory.
>>
> 
> That is easily possible. Note, I'm not maintaining this workload configuration
> is a good idea.
> 
> The background to this problem is Christian running a disk-intensive iozone
> workload over many CPUs and disks with limited memory. It's already known
> that if he added a small amount of extra memory, the problem went away.
> The problem was a massive throughput regression and a bisect pinpointed
> two patches (both mine) but neither make sense. One altered the order pages
> come back from lists but not availability and his hardware does no automatic
> merging. A second does alter the availility of pages via the per-cpu lists
> but reverting the behaviour didn't help.
> 
> The first fix to this was to replace congestion_wait with a waitqueue
> that woke up processes if the watermarks were met. This fixed
> Christian's problem but Andrew wants to pin the underlying cause.
> 
> I strongly suspect that evict-once behaves sensibly when memory is ample
> but in this particular case, it's not helping.
> 
>> In fact, you might have all of the inactive file pages under
>> IO...
>>
> 
> Possibly. The tests have a write and a read phase but I wasn't
> collecting the data with sufficient granularity to see which of the
> tests are actually stalling.
> 
>>> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>>> 	fix title: revertevict
>>> 	fixed in mainline? no
>>> 	affects: 2.6.31 to now
>>>
>>> 	For reasons that are not immediately obvious, the evict-once patches
>>> 	*really* hurt the time spent on congestion and the number of pages
>>> 	reclaimed. Rik, I'm afaid I'm punting this to you for explanation
>>> 	because clearly you tested this for AIM7 and might have some
>>> 	theories. For the purposes of testing, I just reverted the changes.
>> The patch helped IO tests with reasonable amounts of memory
>> available, because the VM can cache frequently used data
>> much more effectively.
>>
>> This comes at the cost of caching less recently accessed
>> use-once data, which should not be an issue since the data
>> is only used once...
>>
> 
> Indeed. With or without evict-once, I'd have an expectation of all the
> pages being recycled anyway because of the amount of data involved.
> 
>>> Rik, any theory on evict-once?
>> No real theories yet, just the observation that your revert
>> appears to be buggy (see below) and the possibility that your
>> test may have all of the inactive file pages under IO...
>>
> 
> Bah. I had the initial revert right and screwed up reverting from
> 2.6.32.10 on. I'm rerunning the tests. Is this right?
> 
> -       if (is_active_lru(lru)) {
> -               if (inactive_list_is_low(zone, sc, file))
> -                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +       if (is_active_lru(lru)) {
> +               shrink_active_list(nr_to_scan, zone, sc, priority, file);
>                 return 0;
> 
> 
>> Can you reproduce the stall if you lower the dirty limits?
>>
> 
> I'm rerunning the revertevict patches at the moment. When they complete,
> I'll experiment with dirty limits. Any suggested values or will I just
> increase it by some arbitrary amount and see what falls out? e.g.
> increse dirty_ratio to 80.
> 
>>>   static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>>>   	struct zone *zone, struct scan_control *sc, int priority)
>>>   {
>>>   	int file = is_file_lru(lru);
>>>
>>> -	if (is_active_lru(lru)) {
>>> -		if (inactive_list_is_low(zone, sc, file))
>>> -		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
>>> +	if (lru == LRU_ACTIVE_FILE) {
>>> +		shrink_active_list(nr_to_scan, zone, sc, priority, file);
>>>   		return 0;
>>>   	}
>> Your revert is buggy.  With this change, anonymous pages will
>> never get deactivated via shrink_list.
>>
> 
> /me slaps self
> 

-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-19 12:22                         ` Christian Ehrhardt
@ 2010-04-19 21:44                           ` Johannes Weiner
  -1 siblings, 0 replies; 136+ messages in thread
From: Johannes Weiner @ 2010-04-19 21:44 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, Rik van Riel, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo

On Mon, Apr 19, 2010 at 02:22:36PM +0200, Christian Ehrhardt wrote:
> So now coming to the probably most critical part - the evict once discussion in this thread.
> I'll try to explain what I found in the meanwhile - let me know whats unclear and I'll add data etc.
> 
> In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
> Therefore I ran all tests and verifications with that drops.
> In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
> Without it seems to be bad all the time. So don't cast the patch away due to that discovery :-)
> 
> On the good side I was also able to analyze a few more things due to that insight - and it might give us new data to debug the root cause.
> Like Mel I also had identified "56e49d21 vmscan: evict use-once pages first" to be related in the past. But without the watermark wait fix, unapplying it 56e49d21 didn't change much for my case so I left this analysis path.
> 
> But now after I found dropping caches is the key to "get back good performance" and "subsequent writes for bad performance" even with watermark wait applied I checked what else changes:
> - first write/read load after reboot or dropping caches -> read TP good
> - second write/read load after reboot or dropping caches -> read TP bad
> => so what changed.
> 
> I went through all kind of logs and found something in the system activity report which very probably is related to 56e49d21.
> When issuing subsequent writes after I dropped caches to get a clean start I get this in Buffers/Caches from Meminfo:
> 
> pre write 1
> Buffers:             484 kB
> Cached:             5664 kB
> pre write 2
> Buffers:           33500 kB
> Cached:           149856 kB
> pre write 3
> Buffers:           65564 kB
> Cached:           115888 kB
> pre write 4
> Buffers:           85556 kB
> Cached:            97184 kB
> 
> It stays at ~85M with more writes which is approx 50% of my free 160M memory.

Ok, so I am the idiot that got quoted on 'the active set is not too big, so
buffer heads are not a problem when avoiding to scan it' in eternal history.

But the threshold inactive/active ratio for skipping active file pages is
actually 1:1.

The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) appears
to be a bit more natural anyway?  Below is a patch that changes it to 2:1.
Christian, can you check if it fixes your regression?

Additionally, we can always scan active file pages but only deactivate them
when the ratio is off and otherwise strip buffers of clean pages.

What do people think?

	Hannes

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4ede99..a4aea76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -898,7 +898,7 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > inactive / 2);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..8f1a846 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1466,7 +1466,7 @@ static int inactive_file_is_low_global(struct zone *zone)
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
 	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > inactive / 2);
 }
 
 /**


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-19 21:44                           ` Johannes Weiner
  0 siblings, 0 replies; 136+ messages in thread
From: Johannes Weiner @ 2010-04-19 21:44 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, Rik van Riel, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo

On Mon, Apr 19, 2010 at 02:22:36PM +0200, Christian Ehrhardt wrote:
> So now coming to the probably most critical part - the evict once discussion in this thread.
> I'll try to explain what I found in the meanwhile - let me know whats unclear and I'll add data etc.
> 
> In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
> Therefore I ran all tests and verifications with that drops.
> In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
> Without it seems to be bad all the time. So don't cast the patch away due to that discovery :-)
> 
> On the good side I was also able to analyze a few more things due to that insight - and it might give us new data to debug the root cause.
> Like Mel I also had identified "56e49d21 vmscan: evict use-once pages first" to be related in the past. But without the watermark wait fix, unapplying it 56e49d21 didn't change much for my case so I left this analysis path.
> 
> But now after I found dropping caches is the key to "get back good performance" and "subsequent writes for bad performance" even with watermark wait applied I checked what else changes:
> - first write/read load after reboot or dropping caches -> read TP good
> - second write/read load after reboot or dropping caches -> read TP bad
> => so what changed.
> 
> I went through all kind of logs and found something in the system activity report which very probably is related to 56e49d21.
> When issuing subsequent writes after I dropped caches to get a clean start I get this in Buffers/Caches from Meminfo:
> 
> pre write 1
> Buffers:             484 kB
> Cached:             5664 kB
> pre write 2
> Buffers:           33500 kB
> Cached:           149856 kB
> pre write 3
> Buffers:           65564 kB
> Cached:           115888 kB
> pre write 4
> Buffers:           85556 kB
> Cached:            97184 kB
> 
> It stays at ~85M with more writes which is approx 50% of my free 160M memory.

Ok, so I am the idiot that got quoted on 'the active set is not too big, so
buffer heads are not a problem when avoiding to scan it' in eternal history.

But the threshold inactive/active ratio for skipping active file pages is
actually 1:1.

The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) appears
to be a bit more natural anyway?  Below is a patch that changes it to 2:1.
Christian, can you check if it fixes your regression?

Additionally, we can always scan active file pages but only deactivate them
when the ratio is off and otherwise strip buffers of clean pages.

What do people think?

	Hannes

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4ede99..a4aea76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -898,7 +898,7 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > inactive / 2);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..8f1a846 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1466,7 +1466,7 @@ static int inactive_file_is_low_global(struct zone *zone)
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
 	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > inactive / 2);
 }
 
 /**

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-19 21:44                           ` Johannes Weiner
@ 2010-04-20  7:20                             ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-20  7:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Rik van Riel, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo



Johannes Weiner wrote:
> On Mon, Apr 19, 2010 at 02:22:36PM +0200, Christian Ehrhardt wrote:
>> So now coming to the probably most critical part - the evict once discussion in this thread.
>> I'll try to explain what I found in the meanwhile - let me know whats unclear and I'll add data etc.
>>
>> In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
>> Therefore I ran all tests and verifications with that drops.
>> In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
>> Without it seems to be bad all the time. So don't cast the patch away due to that discovery :-)
>>
>> On the good side I was also able to analyze a few more things due to that insight - and it might give us new data to debug the root cause.
>> Like Mel I also had identified "56e49d21 vmscan: evict use-once pages first" to be related in the past. But without the watermark wait fix, unapplying it 56e49d21 didn't change much for my case so I left this analysis path.
>>
>> But now after I found dropping caches is the key to "get back good performance" and "subsequent writes for bad performance" even with watermark wait applied I checked what else changes:
>> - first write/read load after reboot or dropping caches -> read TP good
>> - second write/read load after reboot or dropping caches -> read TP bad
>> => so what changed.
>>
>> I went through all kind of logs and found something in the system activity report which very probably is related to 56e49d21.
>> When issuing subsequent writes after I dropped caches to get a clean start I get this in Buffers/Caches from Meminfo:
>>
>> pre write 1
>> Buffers:             484 kB
>> Cached:             5664 kB
>> pre write 2
>> Buffers:           33500 kB
>> Cached:           149856 kB
>> pre write 3
>> Buffers:           65564 kB
>> Cached:           115888 kB
>> pre write 4
>> Buffers:           85556 kB
>> Cached:            97184 kB
>>
>> It stays at ~85M with more writes which is approx 50% of my free 160M memory.
> 
> Ok, so I am the idiot that got quoted on 'the active set is not too big, so
> buffer heads are not a problem when avoiding to scan it' in eternal history.
> 
> But the threshold inactive/active ratio for skipping active file pages is
> actually 1:1.
> 
> The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) appears
> to be a bit more natural anyway?  Below is a patch that changes it to 2:1.
> Christian, can you check if it fixes your regression?

I'll check it out.
from the numbers I have up to now I know that the good->bad transition 
for my case is somewhere between 30M/60M e.g. first and second write.
The ratio 2:1 will eat max 53M of my ~160M that gets split up.

That means setting the ratio to 2:1 or whatever else might help or not, 
but eventually there is just another setting of workload vs. memory 
constraints that would still be affected. Still I guess 3:1 (and I'll 
try that as well) should be enough to be a bit more towards the save side.

> Additionally, we can always scan active file pages but only deactivate them
> when the ratio is off and otherwise strip buffers of clean pages.

In think we need something that allows the system to forget its history 
somewhen - be it 1:1 or x:1 - if the workload changes "long enough"(tm) 
it should eventually throw all old things out.
Like I described before many systems have different usage patterns when 
e.g. comparing day/night workload. So it is far from optimal if e.g. day 
write loads eat so much cache and never give it back for nightly huge 
reads tasks or something similar.

Would your suggestion achieve that already?
If not what kind change could?

> What do people think?
> 
> 	Hannes
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4ede99..a4aea76 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -898,7 +898,7 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>  	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
>  	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
> 
> -	return (active > inactive);
> +	return (active > inactive / 2);
>  }
> 
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..8f1a846 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1466,7 +1466,7 @@ static int inactive_file_is_low_global(struct zone *zone)
>  	active = zone_page_state(zone, NR_ACTIVE_FILE);
>  	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> 
> -	return (active > inactive);
> +	return (active > inactive / 2);
>  }
> 
>  /**
> 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-20  7:20                             ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-20  7:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Rik van Riel, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo



Johannes Weiner wrote:
> On Mon, Apr 19, 2010 at 02:22:36PM +0200, Christian Ehrhardt wrote:
>> So now coming to the probably most critical part - the evict once discussion in this thread.
>> I'll try to explain what I found in the meanwhile - let me know whats unclear and I'll add data etc.
>>
>> In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
>> Therefore I ran all tests and verifications with that drops.
>> In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
>> Without it seems to be bad all the time. So don't cast the patch away due to that discovery :-)
>>
>> On the good side I was also able to analyze a few more things due to that insight - and it might give us new data to debug the root cause.
>> Like Mel I also had identified "56e49d21 vmscan: evict use-once pages first" to be related in the past. But without the watermark wait fix, unapplying it 56e49d21 didn't change much for my case so I left this analysis path.
>>
>> But now after I found dropping caches is the key to "get back good performance" and "subsequent writes for bad performance" even with watermark wait applied I checked what else changes:
>> - first write/read load after reboot or dropping caches -> read TP good
>> - second write/read load after reboot or dropping caches -> read TP bad
>> => so what changed.
>>
>> I went through all kind of logs and found something in the system activity report which very probably is related to 56e49d21.
>> When issuing subsequent writes after I dropped caches to get a clean start I get this in Buffers/Caches from Meminfo:
>>
>> pre write 1
>> Buffers:             484 kB
>> Cached:             5664 kB
>> pre write 2
>> Buffers:           33500 kB
>> Cached:           149856 kB
>> pre write 3
>> Buffers:           65564 kB
>> Cached:           115888 kB
>> pre write 4
>> Buffers:           85556 kB
>> Cached:            97184 kB
>>
>> It stays at ~85M with more writes which is approx 50% of my free 160M memory.
> 
> Ok, so I am the idiot that got quoted on 'the active set is not too big, so
> buffer heads are not a problem when avoiding to scan it' in eternal history.
> 
> But the threshold inactive/active ratio for skipping active file pages is
> actually 1:1.
> 
> The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) appears
> to be a bit more natural anyway?  Below is a patch that changes it to 2:1.
> Christian, can you check if it fixes your regression?

I'll check it out.
from the numbers I have up to now I know that the good->bad transition 
for my case is somewhere between 30M/60M e.g. first and second write.
The ratio 2:1 will eat max 53M of my ~160M that gets split up.

That means setting the ratio to 2:1 or whatever else might help or not, 
but eventually there is just another setting of workload vs. memory 
constraints that would still be affected. Still I guess 3:1 (and I'll 
try that as well) should be enough to be a bit more towards the save side.

> Additionally, we can always scan active file pages but only deactivate them
> when the ratio is off and otherwise strip buffers of clean pages.

In think we need something that allows the system to forget its history 
somewhen - be it 1:1 or x:1 - if the workload changes "long enough"(tm) 
it should eventually throw all old things out.
Like I described before many systems have different usage patterns when 
e.g. comparing day/night workload. So it is far from optimal if e.g. day 
write loads eat so much cache and never give it back for nightly huge 
reads tasks or something similar.

Would your suggestion achieve that already?
If not what kind change could?

> What do people think?
> 
> 	Hannes
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4ede99..a4aea76 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -898,7 +898,7 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
>  	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
>  	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
> 
> -	return (active > inactive);
> +	return (active > inactive / 2);
>  }
> 
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..8f1a846 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1466,7 +1466,7 @@ static int inactive_file_is_low_global(struct zone *zone)
>  	active = zone_page_state(zone, NR_ACTIVE_FILE);
>  	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> 
> -	return (active > inactive);
> +	return (active > inactive / 2);
>  }
> 
>  /**
> 

-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-20  7:20                             ` Christian Ehrhardt
@ 2010-04-20  8:54                               ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-20  8:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Rik van Riel, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo



Christian Ehrhardt wrote:
> 
> 
> Johannes Weiner wrote:
[...]

>>>
>>> It stays at ~85M with more writes which is approx 50% of my free 160M 
>>> memory.
>>
>> Ok, so I am the idiot that got quoted on 'the active set is not too 
>> big, so
>> buffer heads are not a problem when avoiding to scan it' in eternal 
>> history.
>>
>> But the threshold inactive/active ratio for skipping active file pages is
>> actually 1:1.
>>
>> The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) 
>> appears
>> to be a bit more natural anyway?  Below is a patch that changes it to 
>> 2:1.
>> Christian, can you check if it fixes your regression?
> 
> I'll check it out.
> from the numbers I have up to now I know that the good->bad transition 
> for my case is somewhere between 30M/60M e.g. first and second write.
> The ratio 2:1 will eat max 53M of my ~160M that gets split up.
> 
> That means setting the ratio to 2:1 or whatever else might help or not, 
> but eventually there is just another setting of workload vs. memory 
> constraints that would still be affected. Still I guess 3:1 (and I'll 
> try that as well) should be enough to be a bit more towards the save side.

For "my case" 2:1 is not enough, 3:1 almost and 4:1 fixes the issue.
Still as I mentioned before I think any value carved in stone can and 
will be bad to some use case - as 1:1 is for mine.

If we end up being unable to fix it internally by allowing the system to 
"forget" and eventually free old unused buffers at least somewhen - then 
we should neither implement it as 2:1 nor 3:1 nor whatsoever, but as 
userspace configurable e.g. /proc/sys/vm/active_inactive_ratio.

I hope your suggestion below or an extension to it will allow the kernel 
to free the buffers somewhen. Depending on how good/fast this solution 
then will work we can still modify the ratio if needed.

>> Additionally, we can always scan active file pages but only deactivate 
>> them
>> when the ratio is off and otherwise strip buffers of clean pages.
> 
> In think we need something that allows the system to forget its history 
> somewhen - be it 1:1 or x:1 - if the workload changes "long enough"(tm) 
> it should eventually throw all old things out.
> Like I described before many systems have different usage patterns when 
> e.g. comparing day/night workload. So it is far from optimal if e.g. day 
> write loads eat so much cache and never give it back for nightly huge 
> reads tasks or something similar.
> 
> Would your suggestion achieve that already?
> If not what kind change could?
> 
[...]
-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-20  8:54                               ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-20  8:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Rik van Riel, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo



Christian Ehrhardt wrote:
> 
> 
> Johannes Weiner wrote:
[...]

>>>
>>> It stays at ~85M with more writes which is approx 50% of my free 160M 
>>> memory.
>>
>> Ok, so I am the idiot that got quoted on 'the active set is not too 
>> big, so
>> buffer heads are not a problem when avoiding to scan it' in eternal 
>> history.
>>
>> But the threshold inactive/active ratio for skipping active file pages is
>> actually 1:1.
>>
>> The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) 
>> appears
>> to be a bit more natural anyway?  Below is a patch that changes it to 
>> 2:1.
>> Christian, can you check if it fixes your regression?
> 
> I'll check it out.
> from the numbers I have up to now I know that the good->bad transition 
> for my case is somewhere between 30M/60M e.g. first and second write.
> The ratio 2:1 will eat max 53M of my ~160M that gets split up.
> 
> That means setting the ratio to 2:1 or whatever else might help or not, 
> but eventually there is just another setting of workload vs. memory 
> constraints that would still be affected. Still I guess 3:1 (and I'll 
> try that as well) should be enough to be a bit more towards the save side.

For "my case" 2:1 is not enough, 3:1 almost and 4:1 fixes the issue.
Still as I mentioned before I think any value carved in stone can and 
will be bad to some use case - as 1:1 is for mine.

If we end up being unable to fix it internally by allowing the system to 
"forget" and eventually free old unused buffers at least somewhen - then 
we should neither implement it as 2:1 nor 3:1 nor whatsoever, but as 
userspace configurable e.g. /proc/sys/vm/active_inactive_ratio.

I hope your suggestion below or an extension to it will allow the kernel 
to free the buffers somewhen. Depending on how good/fast this solution 
then will work we can still modify the ratio if needed.

>> Additionally, we can always scan active file pages but only deactivate 
>> them
>> when the ratio is off and otherwise strip buffers of clean pages.
> 
> In think we need something that allows the system to forget its history 
> somewhen - be it 1:1 or x:1 - if the workload changes "long enough"(tm) 
> it should eventually throw all old things out.
> Like I described before many systems have different usage patterns when 
> e.g. comparing day/night workload. So it is far from optimal if e.g. day 
> write loads eat so much cache and never give it back for nightly huge 
> reads tasks or something similar.
> 
> Would your suggestion achieve that already?
> If not what kind change could?
> 
[...]
-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-19 21:44                           ` Johannes Weiner
@ 2010-04-20 14:40                             ` Rik van Riel
  -1 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-20 14:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christian Ehrhardt, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

On 04/19/2010 05:44 PM, Johannes Weiner wrote:

> What do people think?

It has potential advantages and disadvantages.

On smaller desktop systems, it is entirely possible that
the working set is close to half of the page cache.  Your
patch reduces the amount of memory that is protected on
the active file list, so it may cause part of the working
set to get evicted.

On the other hand, having a smaller active list frees up
more memory for sequential (streaming, use-once) disk IO.
This can be useful on systems with large IO subsystems
and small memory (like Christian's s390 virtual machine,
with 256MB RAM and 4 disks!).

I wonder if we could not find some automatic way to
balance between these two situations, for example by
excluding currently-in-flight pages from the calculations.

In Christian's case, he could have 160MB of cache (buffer
+ page cache), of which 70MB is in flight to disk at a
time.  It may be worthwhile to exclude that 70MB from the
total and aim for 45MB active file and 45MB inactive file
pages on his system.  That way IO does not get starved.

On a desktop system, which needs the working set protected
and does less IO, we will automatically protect more of
the working set - since there is no IO to starve.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-20 14:40                             ` Rik van Riel
  0 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-20 14:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christian Ehrhardt, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

On 04/19/2010 05:44 PM, Johannes Weiner wrote:

> What do people think?

It has potential advantages and disadvantages.

On smaller desktop systems, it is entirely possible that
the working set is close to half of the page cache.  Your
patch reduces the amount of memory that is protected on
the active file list, so it may cause part of the working
set to get evicted.

On the other hand, having a smaller active list frees up
more memory for sequential (streaming, use-once) disk IO.
This can be useful on systems with large IO subsystems
and small memory (like Christian's s390 virtual machine,
with 256MB RAM and 4 disks!).

I wonder if we could not find some automatic way to
balance between these two situations, for example by
excluding currently-in-flight pages from the calculations.

In Christian's case, he could have 160MB of cache (buffer
+ page cache), of which 70MB is in flight to disk at a
time.  It may be worthwhile to exclude that 70MB from the
total and aim for 45MB active file and 45MB inactive file
pages on his system.  That way IO does not get starved.

On a desktop system, which needs the working set protected
and does less IO, we will automatically protect more of
the working set - since there is no IO to starve.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-20  7:20                             ` Christian Ehrhardt
@ 2010-04-20 15:32                               ` Johannes Weiner
  -1 siblings, 0 replies; 136+ messages in thread
From: Johannes Weiner @ 2010-04-20 15:32 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, Rik van Riel, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo

On Tue, Apr 20, 2010 at 09:20:58AM +0200, Christian Ehrhardt wrote:
> 
> 
> Johannes Weiner wrote:
> >On Mon, Apr 19, 2010 at 02:22:36PM +0200, Christian Ehrhardt wrote:
> >>So now coming to the probably most critical part - the evict once 
> >>discussion in this thread.
> >>I'll try to explain what I found in the meanwhile - let me know whats 
> >>unclear and I'll add data etc.
> >>
> >>In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps 
> >>to improve the accuracy of the used testcase by lowering the noise from 
> >>5-8% to <1%.
> >>Therefore I ran all tests and verifications with that drops.
> >>In the meanwhile I unfortunately discovered that Mel's fix only helps for 
> >>the cases when the caches are dropped.
> >>Without it seems to be bad all the time. So don't cast the patch away due 
> >>to that discovery :-)
> >>
> >>On the good side I was also able to analyze a few more things due to that 
> >>insight - and it might give us new data to debug the root cause.
> >>Like Mel I also had identified "56e49d21 vmscan: evict use-once pages 
> >>first" to be related in the past. But without the watermark wait fix, 
> >>unapplying it 56e49d21 didn't change much for my case so I left this 
> >>analysis path.
> >>
> >>But now after I found dropping caches is the key to "get back good 
> >>performance" and "subsequent writes for bad performance" even with 
> >>watermark wait applied I checked what else changes:
> >>- first write/read load after reboot or dropping caches -> read TP good
> >>- second write/read load after reboot or dropping caches -> read TP bad
> >>=> so what changed.
> >>
> >>I went through all kind of logs and found something in the system 
> >>activity report which very probably is related to 56e49d21.
> >>When issuing subsequent writes after I dropped caches to get a clean 
> >>start I get this in Buffers/Caches from Meminfo:
> >>
> >>pre write 1
> >>Buffers:             484 kB
> >>Cached:             5664 kB
> >>pre write 2
> >>Buffers:           33500 kB
> >>Cached:           149856 kB
> >>pre write 3
> >>Buffers:           65564 kB
> >>Cached:           115888 kB
> >>pre write 4
> >>Buffers:           85556 kB
> >>Cached:            97184 kB
> >>
> >>It stays at ~85M with more writes which is approx 50% of my free 160M 
> >>memory.
> >
> >Ok, so I am the idiot that got quoted on 'the active set is not too big, so
> >buffer heads are not a problem when avoiding to scan it' in eternal 
> >history.
> >
> >But the threshold inactive/active ratio for skipping active file pages is
> >actually 1:1.
> >
> >The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) 
> >appears
> >to be a bit more natural anyway?  Below is a patch that changes it to 2:1.
> >Christian, can you check if it fixes your regression?
> 
> I'll check it out.
> from the numbers I have up to now I know that the good->bad transition 
> for my case is somewhere between 30M/60M e.g. first and second write.
> The ratio 2:1 will eat max 53M of my ~160M that gets split up.
> 
> That means setting the ratio to 2:1 or whatever else might help or not, 
> but eventually there is just another setting of workload vs. memory 
> constraints that would still be affected. Still I guess 3:1 (and I'll 
> try that as well) should be enough to be a bit more towards the save side.
> 
> >Additionally, we can always scan active file pages but only deactivate them
> >when the ratio is off and otherwise strip buffers of clean pages.
> 
> In think we need something that allows the system to forget its history 
> somewhen - be it 1:1 or x:1 - if the workload changes "long enough"(tm) 
> it should eventually throw all old things out.

The idea is that it pans out on its own.  If the workload changes, new
pages get activated and when that set grows too large, we start shrinking
it again.

Of course, right now this unscanned set is way too large and we can end
up wasting up to 50% of usable page cache on false active pages.

A fixed ratio does not scale with varying workloads, obviously, but having
it at a safe level still seems like a good trade-off.

We can still do the optimization, and in the worst case the amount of
memory wasted on false active pages is small enough that it should leave
the system performant.

You have a rather extreme page cache load.  If 4:1 works for you, I think
this is a safe bet for now because we only frob the knobs into the
direction of earlier kernel behaviour.

We still have a nice amount of pages we do not need to scan regularly
(up to 50k file pages for a streaming IO load on a 1G machine).

	Hannes

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-20 15:32                               ` Johannes Weiner
  0 siblings, 0 replies; 136+ messages in thread
From: Johannes Weiner @ 2010-04-20 15:32 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Mel Gorman, Rik van Riel, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo

On Tue, Apr 20, 2010 at 09:20:58AM +0200, Christian Ehrhardt wrote:
> 
> 
> Johannes Weiner wrote:
> >On Mon, Apr 19, 2010 at 02:22:36PM +0200, Christian Ehrhardt wrote:
> >>So now coming to the probably most critical part - the evict once 
> >>discussion in this thread.
> >>I'll try to explain what I found in the meanwhile - let me know whats 
> >>unclear and I'll add data etc.
> >>
> >>In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps 
> >>to improve the accuracy of the used testcase by lowering the noise from 
> >>5-8% to <1%.
> >>Therefore I ran all tests and verifications with that drops.
> >>In the meanwhile I unfortunately discovered that Mel's fix only helps for 
> >>the cases when the caches are dropped.
> >>Without it seems to be bad all the time. So don't cast the patch away due 
> >>to that discovery :-)
> >>
> >>On the good side I was also able to analyze a few more things due to that 
> >>insight - and it might give us new data to debug the root cause.
> >>Like Mel I also had identified "56e49d21 vmscan: evict use-once pages 
> >>first" to be related in the past. But without the watermark wait fix, 
> >>unapplying it 56e49d21 didn't change much for my case so I left this 
> >>analysis path.
> >>
> >>But now after I found dropping caches is the key to "get back good 
> >>performance" and "subsequent writes for bad performance" even with 
> >>watermark wait applied I checked what else changes:
> >>- first write/read load after reboot or dropping caches -> read TP good
> >>- second write/read load after reboot or dropping caches -> read TP bad
> >>=> so what changed.
> >>
> >>I went through all kind of logs and found something in the system 
> >>activity report which very probably is related to 56e49d21.
> >>When issuing subsequent writes after I dropped caches to get a clean 
> >>start I get this in Buffers/Caches from Meminfo:
> >>
> >>pre write 1
> >>Buffers:             484 kB
> >>Cached:             5664 kB
> >>pre write 2
> >>Buffers:           33500 kB
> >>Cached:           149856 kB
> >>pre write 3
> >>Buffers:           65564 kB
> >>Cached:           115888 kB
> >>pre write 4
> >>Buffers:           85556 kB
> >>Cached:            97184 kB
> >>
> >>It stays at ~85M with more writes which is approx 50% of my free 160M 
> >>memory.
> >
> >Ok, so I am the idiot that got quoted on 'the active set is not too big, so
> >buffer heads are not a problem when avoiding to scan it' in eternal 
> >history.
> >
> >But the threshold inactive/active ratio for skipping active file pages is
> >actually 1:1.
> >
> >The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) 
> >appears
> >to be a bit more natural anyway?  Below is a patch that changes it to 2:1.
> >Christian, can you check if it fixes your regression?
> 
> I'll check it out.
> from the numbers I have up to now I know that the good->bad transition 
> for my case is somewhere between 30M/60M e.g. first and second write.
> The ratio 2:1 will eat max 53M of my ~160M that gets split up.
> 
> That means setting the ratio to 2:1 or whatever else might help or not, 
> but eventually there is just another setting of workload vs. memory 
> constraints that would still be affected. Still I guess 3:1 (and I'll 
> try that as well) should be enough to be a bit more towards the save side.
> 
> >Additionally, we can always scan active file pages but only deactivate them
> >when the ratio is off and otherwise strip buffers of clean pages.
> 
> In think we need something that allows the system to forget its history 
> somewhen - be it 1:1 or x:1 - if the workload changes "long enough"(tm) 
> it should eventually throw all old things out.

The idea is that it pans out on its own.  If the workload changes, new
pages get activated and when that set grows too large, we start shrinking
it again.

Of course, right now this unscanned set is way too large and we can end
up wasting up to 50% of usable page cache on false active pages.

A fixed ratio does not scale with varying workloads, obviously, but having
it at a safe level still seems like a good trade-off.

We can still do the optimization, and in the worst case the amount of
memory wasted on false active pages is small enough that it should leave
the system performant.

You have a rather extreme page cache load.  If 4:1 works for you, I think
this is a safe bet for now because we only frob the knobs into the
direction of earlier kernel behaviour.

We still have a nice amount of pages we do not need to scan regularly
(up to 50k file pages for a streaming IO load on a 1G machine).

	Hannes

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-20 15:32                               ` Johannes Weiner
@ 2010-04-20 17:22                                 ` Rik van Riel
  -1 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-20 17:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christian Ehrhardt, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

On 04/20/2010 11:32 AM, Johannes Weiner wrote:

> The idea is that it pans out on its own.  If the workload changes, new
> pages get activated and when that set grows too large, we start shrinking
> it again.
>
> Of course, right now this unscanned set is way too large and we can end
> up wasting up to 50% of usable page cache on false active pages.

Thing is, changing workloads often change back.

Specifically, think of a desktop system that is doing
work for the user during the day and gets backed up
at night.

You do not want the backup to kick the working set
out of memory, because when the user returns in the
morning the desktop should come back quickly after
the screensaver is unlocked.

The big question is, what workload suffers from
having the inactive list at 50% of the page cache?

So far the only big problem we have seen is on a
very unbalanced virtual machine, with 256MB RAM
and 4 fast disks.  The disks simply have more IO
in flight at once than what fits in the inactive
list.

This is a very untypical situation, and we can
probably solve it by excluding the in-flight pages
from the active/inactive file calculation.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-20 17:22                                 ` Rik van Riel
  0 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-20 17:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Christian Ehrhardt, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

On 04/20/2010 11:32 AM, Johannes Weiner wrote:

> The idea is that it pans out on its own.  If the workload changes, new
> pages get activated and when that set grows too large, we start shrinking
> it again.
>
> Of course, right now this unscanned set is way too large and we can end
> up wasting up to 50% of usable page cache on false active pages.

Thing is, changing workloads often change back.

Specifically, think of a desktop system that is doing
work for the user during the day and gets backed up
at night.

You do not want the backup to kick the working set
out of memory, because when the user returns in the
morning the desktop should come back quickly after
the screensaver is unlocked.

The big question is, what workload suffers from
having the inactive list at 50% of the page cache?

So far the only big problem we have seen is on a
very unbalanced virtual machine, with 256MB RAM
and 4 fast disks.  The disks simply have more IO
in flight at once than what fits in the inactive
list.

This is a very untypical situation, and we can
probably solve it by excluding the in-flight pages
from the active/inactive file calculation.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-20 17:22                                 ` Rik van Riel
@ 2010-04-21  4:23                                   ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-21  4:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo



Rik van Riel wrote:
> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
> 
>> The idea is that it pans out on its own.  If the workload changes, new
>> pages get activated and when that set grows too large, we start shrinking
>> it again.
>>
>> Of course, right now this unscanned set is way too large and we can end
>> up wasting up to 50% of usable page cache on false active pages.
> 
> Thing is, changing workloads often change back.
> 
> Specifically, think of a desktop system that is doing
> work for the user during the day and gets backed up
> at night.
> 
> You do not want the backup to kick the working set
> out of memory, because when the user returns in the
> morning the desktop should come back quickly after
> the screensaver is unlocked.

IMHO it is fine to prevent that nightly backup job from not being 
finished when the user arrives at morning because we didn't give him 
some more cache - and e.g. a 30 sec transition from/to both optimized 
states is fine.
But eventually I guess the point is that both behaviors are reasonable 
to achieve - depending on the users needs.

What we could do is combine all our thoughts we had so far:
a) Rik could create an experimental patch that excludes the in flight pages
b) Johannes could create one for his suggestion to "always scan active 
file pages but only deactivate them when the ratio is off and otherwise 
strip buffers of clean pages"
c) I would extend the patch from Johannes setting the ratio of 
active/inactive pages to be a userspace tunable

a,b,a+b would then need to be tested if they achieve a better behavior.

c on the other hand would be a fine tunable to let administrators 
(knowing their workloads) or distributions (e.g. different values for 
Desktop/Server defaults) adapt their installations.

In theory a,b and c should work fine together in case we need all of them.

> The big question is, what workload suffers from
> having the inactive list at 50% of the page cache?
> 
> So far the only big problem we have seen is on a
> very unbalanced virtual machine, with 256MB RAM
> and 4 fast disks.  The disks simply have more IO
> in flight at once than what fits in the inactive
> list.

Did I get you right that this means the write case - explaining why it 
is building up buffers to the 50% max?

Note: It even uses up to 64 disks, with 1 disk per thread so e.g. 16 
threads => 16 disks.

For being "unbalanced" I'd like to mention that over the years I learned 
that sometimes, after a while, virtualized systems look that way without 
being intended - this happens by adding more and more guests and let 
guest memory balooning take care of it.

> This is a very untypical situation, and we can
> probably solve it by excluding the in-flight pages
> from the active/inactive file calculation.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-21  4:23                                   ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-21  4:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo



Rik van Riel wrote:
> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
> 
>> The idea is that it pans out on its own.  If the workload changes, new
>> pages get activated and when that set grows too large, we start shrinking
>> it again.
>>
>> Of course, right now this unscanned set is way too large and we can end
>> up wasting up to 50% of usable page cache on false active pages.
> 
> Thing is, changing workloads often change back.
> 
> Specifically, think of a desktop system that is doing
> work for the user during the day and gets backed up
> at night.
> 
> You do not want the backup to kick the working set
> out of memory, because when the user returns in the
> morning the desktop should come back quickly after
> the screensaver is unlocked.

IMHO it is fine to prevent that nightly backup job from not being 
finished when the user arrives at morning because we didn't give him 
some more cache - and e.g. a 30 sec transition from/to both optimized 
states is fine.
But eventually I guess the point is that both behaviors are reasonable 
to achieve - depending on the users needs.

What we could do is combine all our thoughts we had so far:
a) Rik could create an experimental patch that excludes the in flight pages
b) Johannes could create one for his suggestion to "always scan active 
file pages but only deactivate them when the ratio is off and otherwise 
strip buffers of clean pages"
c) I would extend the patch from Johannes setting the ratio of 
active/inactive pages to be a userspace tunable

a,b,a+b would then need to be tested if they achieve a better behavior.

c on the other hand would be a fine tunable to let administrators 
(knowing their workloads) or distributions (e.g. different values for 
Desktop/Server defaults) adapt their installations.

In theory a,b and c should work fine together in case we need all of them.

> The big question is, what workload suffers from
> having the inactive list at 50% of the page cache?
> 
> So far the only big problem we have seen is on a
> very unbalanced virtual machine, with 256MB RAM
> and 4 fast disks.  The disks simply have more IO
> in flight at once than what fits in the inactive
> list.

Did I get you right that this means the write case - explaining why it 
is building up buffers to the 50% max?

Note: It even uses up to 64 disks, with 1 disk per thread so e.g. 16 
threads => 16 disks.

For being "unbalanced" I'd like to mention that over the years I learned 
that sometimes, after a while, virtualized systems look that way without 
being intended - this happens by adding more and more guests and let 
guest memory balooning take care of it.

> This is a very untypical situation, and we can
> probably solve it by excluding the in-flight pages
> from the active/inactive file calculation.

-- 

GrA 1/4 sse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-21  4:23                                   ` Christian Ehrhardt
@ 2010-04-21  7:35                                     ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-21  7:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

[-- Attachment #1: Type: text/plain, Size: 3723 bytes --]



Christian Ehrhardt wrote:
> 
> 
> Rik van Riel wrote:
>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>
>>> The idea is that it pans out on its own.  If the workload changes, new
>>> pages get activated and when that set grows too large, we start 
>>> shrinking
>>> it again.
>>>
>>> Of course, right now this unscanned set is way too large and we can end
>>> up wasting up to 50% of usable page cache on false active pages.
>>
>> Thing is, changing workloads often change back.
>>
>> Specifically, think of a desktop system that is doing
>> work for the user during the day and gets backed up
>> at night.
>>
>> You do not want the backup to kick the working set
>> out of memory, because when the user returns in the
>> morning the desktop should come back quickly after
>> the screensaver is unlocked.
> 
> IMHO it is fine to prevent that nightly backup job from not being 
> finished when the user arrives at morning because we didn't give him 
> some more cache - and e.g. a 30 sec transition from/to both optimized 
> states is fine.
> But eventually I guess the point is that both behaviors are reasonable 
> to achieve - depending on the users needs.
> 
> What we could do is combine all our thoughts we had so far:
> a) Rik could create an experimental patch that excludes the in flight pages
> b) Johannes could create one for his suggestion to "always scan active 
> file pages but only deactivate them when the ratio is off and otherwise 
> strip buffers of clean pages"
> c) I would extend the patch from Johannes setting the ratio of 
> active/inactive pages to be a userspace tunable

A first revision of patch c is attached.
I tested assigning different percentages, so far e.g. 50 really behave 
like before and 25 protects ~42M Buffers in my example which would match 
the intended behavior - see patch for more details.

Checkpatch and some basic function tests went fine.
While it may be not perfect yet, I think it is ready for feedback now.

> a,b,a+b would then need to be tested if they achieve a better behavior.
> 
> c on the other hand would be a fine tunable to let administrators 
> (knowing their workloads) or distributions (e.g. different values for 
> Desktop/Server defaults) adapt their installations.
> 
> In theory a,b and c should work fine together in case we need all of them.
> 
>> The big question is, what workload suffers from
>> having the inactive list at 50% of the page cache?
>>
>> So far the only big problem we have seen is on a
>> very unbalanced virtual machine, with 256MB RAM
>> and 4 fast disks.  The disks simply have more IO
>> in flight at once than what fits in the inactive
>> list.
> 
> Did I get you right that this means the write case - explaining why it 
> is building up buffers to the 50% max?
> 

Thinking about it I wondered for what these Buffers are protected.
If the intention to save these buffers is for reuse with similar loads I 
wonder why I "need" three iozones to build up the 85M in my case.

Buffers start at ~0, after iozone run 1 they are at ~35, then after #2 
~65 and after run #3 ~85.
Shouldn't that either allocate 85M for the first directly in case that 
much is needed for a single run - or if not the second and third run 
just "resuse" the 35M Buffers from the first run still held?

Note - "1 iozone run" means "iozone ... -i 0" which sequentially writes 
and then rewrites a 2Gb file on 16 disks in my current case.

looking forward especially to patch b as I'd really like to see a kernel 
able to win back these buffers if they are no more used for a longer 
period while still allowing to grow&protect them while needed.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

[-- Attachment #2: active-inacte-ratio-tunable.diff --]
[-- Type: text/x-patch, Size: 4672 bytes --]

Subject: [PATCH] mm: make working set portion that is protected tunable

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

In discussion with Rik van Riel and Joannes Weiner we came up that there are
cases that want the current "save 50%" for the working set all the time and
others that would benefit from protectig only a smaller amount.

Eventually no "carved in stone" in kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0%  - like a kernel pre "56e49d21 vmscan: evict use-once pages first"
- x%  - any other percentage to allow customizing the system to its needs.

Due to our experiments the suggested default in this patch is 25%, but if
preferred I'm fine keeping 50% and letting admins/distros adapt as needed.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---

[diffstat]

[diff]
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt	2010-04-21 06:32:23.000000000 +0200
+++ linux-2.6/Documentation/sysctl/vm.txt	2010-04-21 07:24:35.000000000 +0200
@@ -18,6 +18,7 @@
 
 Currently, these files are in /proc/sys/vm:
 
+- active_inactive_ratio
 - block_dump
 - dirty_background_bytes
 - dirty_background_ratio
@@ -57,6 +58,15 @@
 
 ==============================================================
 
+active_inactive_ratio
+
+The kernel tries to protect the active working set. Therefore a portion of the
+file pages is protected, meaning they are omitted when eviting pages until this
+ratio is reached.
+This tunable represents that ratio in percent and specifies the protected part
+
+==============================================================
+
 block_dump
 
 block_dump enables block I/O debugging when set to a nonzero value. More
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2010-04-21 06:33:43.000000000 +0200
+++ linux-2.6/kernel/sysctl.c	2010-04-21 07:26:35.000000000 +0200
@@ -1271,6 +1271,15 @@
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "active_inactive_ratio",
+		.data		= &sysctl_active_inactive_ratio,
+		.maxlen		= sizeof(sysctl_active_inactive_ratio),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/memcontrol.c	2010-04-21 09:00:22.000000000 +0200
@@ -893,12 +893,12 @@
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 {
 	unsigned long active;
-	unsigned long inactive;
+	unsigned long file;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+	file = active + mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > file * sysctl_active_inactive_ratio / 100);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/vmscan.c	2010-04-21 09:00:13.000000000 +0200
@@ -1459,14 +1459,23 @@
 	return low;
 }
 
+/*
+ * sysctl_active_inactive_ratio
+ *
+ * Defines the portion of file pages within the active working set is going to
+ * be protected. The value represents the percentage that will be protected.
+ */
+int sysctl_active_inactive_ratio __read_mostly = 25;
+
 static int inactive_file_is_low_global(struct zone *zone)
 {
-	unsigned long active, inactive;
+	unsigned long active, file;
 
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+	file = active + zone_page_state(zone, NR_INACTIVE_FILE);
+
+	return (active > file * sysctl_active_inactive_ratio / 100);
 
-	return (active > inactive);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-04-21 09:02:37.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2010-04-21 09:02:51.000000000 +0200
@@ -1467,5 +1467,7 @@
 
 extern void dump_page(struct page *page);
 
+extern int sysctl_active_inactive_ratio;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-21  7:35                                     ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-21  7:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

[-- Attachment #1: Type: text/plain, Size: 3727 bytes --]



Christian Ehrhardt wrote:
> 
> 
> Rik van Riel wrote:
>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>
>>> The idea is that it pans out on its own.  If the workload changes, new
>>> pages get activated and when that set grows too large, we start 
>>> shrinking
>>> it again.
>>>
>>> Of course, right now this unscanned set is way too large and we can end
>>> up wasting up to 50% of usable page cache on false active pages.
>>
>> Thing is, changing workloads often change back.
>>
>> Specifically, think of a desktop system that is doing
>> work for the user during the day and gets backed up
>> at night.
>>
>> You do not want the backup to kick the working set
>> out of memory, because when the user returns in the
>> morning the desktop should come back quickly after
>> the screensaver is unlocked.
> 
> IMHO it is fine to prevent that nightly backup job from not being 
> finished when the user arrives at morning because we didn't give him 
> some more cache - and e.g. a 30 sec transition from/to both optimized 
> states is fine.
> But eventually I guess the point is that both behaviors are reasonable 
> to achieve - depending on the users needs.
> 
> What we could do is combine all our thoughts we had so far:
> a) Rik could create an experimental patch that excludes the in flight pages
> b) Johannes could create one for his suggestion to "always scan active 
> file pages but only deactivate them when the ratio is off and otherwise 
> strip buffers of clean pages"
> c) I would extend the patch from Johannes setting the ratio of 
> active/inactive pages to be a userspace tunable

A first revision of patch c is attached.
I tested assigning different percentages, so far e.g. 50 really behave 
like before and 25 protects ~42M Buffers in my example which would match 
the intended behavior - see patch for more details.

Checkpatch and some basic function tests went fine.
While it may be not perfect yet, I think it is ready for feedback now.

> a,b,a+b would then need to be tested if they achieve a better behavior.
> 
> c on the other hand would be a fine tunable to let administrators 
> (knowing their workloads) or distributions (e.g. different values for 
> Desktop/Server defaults) adapt their installations.
> 
> In theory a,b and c should work fine together in case we need all of them.
> 
>> The big question is, what workload suffers from
>> having the inactive list at 50% of the page cache?
>>
>> So far the only big problem we have seen is on a
>> very unbalanced virtual machine, with 256MB RAM
>> and 4 fast disks.  The disks simply have more IO
>> in flight at once than what fits in the inactive
>> list.
> 
> Did I get you right that this means the write case - explaining why it 
> is building up buffers to the 50% max?
> 

Thinking about it I wondered for what these Buffers are protected.
If the intention to save these buffers is for reuse with similar loads I 
wonder why I "need" three iozones to build up the 85M in my case.

Buffers start at ~0, after iozone run 1 they are at ~35, then after #2 
~65 and after run #3 ~85.
Shouldn't that either allocate 85M for the first directly in case that 
much is needed for a single run - or if not the second and third run 
just "resuse" the 35M Buffers from the first run still held?

Note - "1 iozone run" means "iozone ... -i 0" which sequentially writes 
and then rewrites a 2Gb file on 16 disks in my current case.

looking forward especially to patch b as I'd really like to see a kernel 
able to win back these buffers if they are no more used for a longer 
period while still allowing to grow&protect them while needed.

-- 

GrA 1/4 sse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

[-- Attachment #2: active-inacte-ratio-tunable.diff --]
[-- Type: text/x-patch, Size: 4672 bytes --]

Subject: [PATCH] mm: make working set portion that is protected tunable

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

In discussion with Rik van Riel and Joannes Weiner we came up that there are
cases that want the current "save 50%" for the working set all the time and
others that would benefit from protectig only a smaller amount.

Eventually no "carved in stone" in kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0%  - like a kernel pre "56e49d21 vmscan: evict use-once pages first"
- x%  - any other percentage to allow customizing the system to its needs.

Due to our experiments the suggested default in this patch is 25%, but if
preferred I'm fine keeping 50% and letting admins/distros adapt as needed.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---

[diffstat]

[diff]
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt	2010-04-21 06:32:23.000000000 +0200
+++ linux-2.6/Documentation/sysctl/vm.txt	2010-04-21 07:24:35.000000000 +0200
@@ -18,6 +18,7 @@
 
 Currently, these files are in /proc/sys/vm:
 
+- active_inactive_ratio
 - block_dump
 - dirty_background_bytes
 - dirty_background_ratio
@@ -57,6 +58,15 @@
 
 ==============================================================
 
+active_inactive_ratio
+
+The kernel tries to protect the active working set. Therefore a portion of the
+file pages is protected, meaning they are omitted when eviting pages until this
+ratio is reached.
+This tunable represents that ratio in percent and specifies the protected part
+
+==============================================================
+
 block_dump
 
 block_dump enables block I/O debugging when set to a nonzero value. More
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2010-04-21 06:33:43.000000000 +0200
+++ linux-2.6/kernel/sysctl.c	2010-04-21 07:26:35.000000000 +0200
@@ -1271,6 +1271,15 @@
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "active_inactive_ratio",
+		.data		= &sysctl_active_inactive_ratio,
+		.maxlen		= sizeof(sysctl_active_inactive_ratio),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/memcontrol.c	2010-04-21 09:00:22.000000000 +0200
@@ -893,12 +893,12 @@
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 {
 	unsigned long active;
-	unsigned long inactive;
+	unsigned long file;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+	file = active + mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > file * sysctl_active_inactive_ratio / 100);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/vmscan.c	2010-04-21 09:00:13.000000000 +0200
@@ -1459,14 +1459,23 @@
 	return low;
 }
 
+/*
+ * sysctl_active_inactive_ratio
+ *
+ * Defines the portion of file pages within the active working set is going to
+ * be protected. The value represents the percentage that will be protected.
+ */
+int sysctl_active_inactive_ratio __read_mostly = 25;
+
 static int inactive_file_is_low_global(struct zone *zone)
 {
-	unsigned long active, inactive;
+	unsigned long active, file;
 
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+	file = active + zone_page_state(zone, NR_INACTIVE_FILE);
+
+	return (active > file * sysctl_active_inactive_ratio / 100);
 
-	return (active > inactive);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-04-21 09:02:37.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2010-04-21 09:02:51.000000000 +0200
@@ -1467,5 +1467,7 @@
 
 extern void dump_page(struct page *page);
 
+extern int sysctl_active_inactive_ratio;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-21  4:23                                   ` Christian Ehrhardt
@ 2010-04-21  9:03                                     ` Johannes Weiner
  -1 siblings, 0 replies; 136+ messages in thread
From: Johannes Weiner @ 2010-04-21  9:03 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Rik van Riel, Mel Gorman, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo

On Wed, Apr 21, 2010 at 06:23:45AM +0200, Christian Ehrhardt wrote:
> Rik van Riel wrote:
> >You do not want the backup to kick the working set
> >out of memory, because when the user returns in the
> >morning the desktop should come back quickly after
> >the screensaver is unlocked.
> 
> IMHO it is fine to prevent that nightly backup job from not being 
> finished when the user arrives at morning because we didn't give him 
> some more cache - and e.g. a 30 sec transition from/to both optimized 
> states is fine.

For batched work maybe :-)

> What we could do is combine all our thoughts we had so far:
> a) Rik could create an experimental patch that excludes the in flight pages
> b) Johannes could create one for his suggestion to "always scan active 
> file pages but only deactivate them when the ratio is off and otherwise 
> strip buffers of clean pages"

Please drop that idea, that 'Buffers:' is a red herring.  It's just pages
that do not back files but block devices.  Stripping buffer_heads won't
achieve anything, we need to get rid of the pages.  Sorry, I should have
slept and thought before writing that suggestion.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-21  9:03                                     ` Johannes Weiner
  0 siblings, 0 replies; 136+ messages in thread
From: Johannes Weiner @ 2010-04-21  9:03 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Rik van Riel, Mel Gorman, Andrew Morton, linux-mm, Nick Piggin,
	Chris Mason, Jens Axboe, linux-kernel, gregkh, Corrado Zoccolo

On Wed, Apr 21, 2010 at 06:23:45AM +0200, Christian Ehrhardt wrote:
> Rik van Riel wrote:
> >You do not want the backup to kick the working set
> >out of memory, because when the user returns in the
> >morning the desktop should come back quickly after
> >the screensaver is unlocked.
> 
> IMHO it is fine to prevent that nightly backup job from not being 
> finished when the user arrives at morning because we didn't give him 
> some more cache - and e.g. a 30 sec transition from/to both optimized 
> states is fine.

For batched work maybe :-)

> What we could do is combine all our thoughts we had so far:
> a) Rik could create an experimental patch that excludes the in flight pages
> b) Johannes could create one for his suggestion to "always scan active 
> file pages but only deactivate them when the ratio is off and otherwise 
> strip buffers of clean pages"

Please drop that idea, that 'Buffers:' is a red herring.  It's just pages
that do not back files but block devices.  Stripping buffer_heads won't
achieve anything, we need to get rid of the pages.  Sorry, I should have
slept and thought before writing that suggestion.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-21  7:35                                     ` Christian Ehrhardt
@ 2010-04-21 13:19                                       ` Rik van Riel
  -1 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-21 13:19 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

On 04/21/2010 03:35 AM, Christian Ehrhardt wrote:
>
>
> Christian Ehrhardt wrote:
>>
>>
>> Rik van Riel wrote:
>>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>>
>>>> The idea is that it pans out on its own. If the workload changes, new
>>>> pages get activated and when that set grows too large, we start
>>>> shrinking
>>>> it again.
>>>>
>>>> Of course, right now this unscanned set is way too large and we can end
>>>> up wasting up to 50% of usable page cache on false active pages.
>>>
>>> Thing is, changing workloads often change back.
>>>
>>> Specifically, think of a desktop system that is doing
>>> work for the user during the day and gets backed up
>>> at night.
>>>
>>> You do not want the backup to kick the working set
>>> out of memory, because when the user returns in the
>>> morning the desktop should come back quickly after
>>> the screensaver is unlocked.
>>
>> IMHO it is fine to prevent that nightly backup job from not being
>> finished when the user arrives at morning because we didn't give him
>> some more cache - and e.g. a 30 sec transition from/to both optimized
>> states is fine.
>> But eventually I guess the point is that both behaviors are reasonable
>> to achieve - depending on the users needs.
>>
>> What we could do is combine all our thoughts we had so far:
>> a) Rik could create an experimental patch that excludes the in flight
>> pages
>> b) Johannes could create one for his suggestion to "always scan active
>> file pages but only deactivate them when the ratio is off and
>> otherwise strip buffers of clean pages"

I think you are confusing "buffer heads" with "buffers".

You can strip buffer heads off pages, but that is not
your problem.

"buffers" in /proc/meminfo stands for cached metadata,
eg. the filesystem journal, inodes, directories, etc...
Caching such metadata is legitimate, because it reduces
the number of disk seeks down the line.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-21 13:19                                       ` Rik van Riel
  0 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-21 13:19 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

On 04/21/2010 03:35 AM, Christian Ehrhardt wrote:
>
>
> Christian Ehrhardt wrote:
>>
>>
>> Rik van Riel wrote:
>>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>>
>>>> The idea is that it pans out on its own. If the workload changes, new
>>>> pages get activated and when that set grows too large, we start
>>>> shrinking
>>>> it again.
>>>>
>>>> Of course, right now this unscanned set is way too large and we can end
>>>> up wasting up to 50% of usable page cache on false active pages.
>>>
>>> Thing is, changing workloads often change back.
>>>
>>> Specifically, think of a desktop system that is doing
>>> work for the user during the day and gets backed up
>>> at night.
>>>
>>> You do not want the backup to kick the working set
>>> out of memory, because when the user returns in the
>>> morning the desktop should come back quickly after
>>> the screensaver is unlocked.
>>
>> IMHO it is fine to prevent that nightly backup job from not being
>> finished when the user arrives at morning because we didn't give him
>> some more cache - and e.g. a 30 sec transition from/to both optimized
>> states is fine.
>> But eventually I guess the point is that both behaviors are reasonable
>> to achieve - depending on the users needs.
>>
>> What we could do is combine all our thoughts we had so far:
>> a) Rik could create an experimental patch that excludes the in flight
>> pages
>> b) Johannes could create one for his suggestion to "always scan active
>> file pages but only deactivate them when the ratio is off and
>> otherwise strip buffers of clean pages"

I think you are confusing "buffer heads" with "buffers".

You can strip buffer heads off pages, but that is not
your problem.

"buffers" in /proc/meminfo stands for cached metadata,
eg. the filesystem journal, inodes, directories, etc...
Caching such metadata is legitimate, because it reduces
the number of disk seeks down the line.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-21  4:23                                   ` Christian Ehrhardt
@ 2010-04-21 13:20                                     ` Rik van Riel
  -1 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-21 13:20 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

On 04/21/2010 12:23 AM, Christian Ehrhardt wrote:

> IMHO it is fine to prevent that nightly backup job from not being
> finished when the user arrives at morning because we didn't give him
> some more cache

How on earth would a backup job benefit from cache?

It only accesses each bit of data once, so caching the
to-be-backed-up data is a waste of memory.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-21 13:20                                     ` Rik van Riel
  0 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-21 13:20 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo

On 04/21/2010 12:23 AM, Christian Ehrhardt wrote:

> IMHO it is fine to prevent that nightly backup job from not being
> finished when the user arrives at morning because we didn't give him
> some more cache

How on earth would a backup job benefit from cache?

It only accesses each bit of data once, so caching the
to-be-backed-up data is a waste of memory.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
  2010-04-21 13:19                                       ` Rik van Riel
@ 2010-04-22  6:21                                         ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-22  6:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo, Ehrhardt Christian

[-- Attachment #1: Type: text/plain, Size: 4476 bytes --]

Trying to answer and consolidate all open parts of this thread down below.

Rik van Riel wrote:
> On 04/21/2010 03:35 AM, Christian Ehrhardt wrote:
>>
>>
>> Christian Ehrhardt wrote:
>>>
>>>
>>> Rik van Riel wrote:
>>>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>>>
>>>>> The idea is that it pans out on its own. If the workload changes, new
>>>>> pages get activated and when that set grows too large, we start
>>>>> shrinking
>>>>> it again.
>>>>>
>>>>> Of course, right now this unscanned set is way too large and we can 
>>>>> end
>>>>> up wasting up to 50% of usable page cache on false active pages.
>>>>
>>>> Thing is, changing workloads often change back.
>>>>
>>>> Specifically, think of a desktop system that is doing
>>>> work for the user during the day and gets backed up
>>>> at night.
>>>>
>>>> You do not want the backup to kick the working set
>>>> out of memory, because when the user returns in the
>>>> morning the desktop should come back quickly after
>>>> the screensaver is unlocked.
>>>
>>> IMHO it is fine to prevent that nightly backup job from not being
>>> finished when the user arrives at morning because we didn't give him
>>> some more cache - and e.g. a 30 sec transition from/to both optimized
>>> states is fine.
>>> But eventually I guess the point is that both behaviors are reasonable
>>> to achieve - depending on the users needs.
>>>
>>> What we could do is combine all our thoughts we had so far:
>>> a) Rik could create an experimental patch that excludes the in flight
>>> pages
>>> b) Johannes could create one for his suggestion to "always scan active
>>> file pages but only deactivate them when the ratio is off and
>>> otherwise strip buffers of clean pages"
> 
> I think you are confusing "buffer heads" with "buffers".
> 
> You can strip buffer heads off pages, but that is not
> your problem.
> 
> "buffers" in /proc/meminfo stands for cached metadata,
> eg. the filesystem journal, inodes, directories, etc...
> Caching such metadata is legitimate, because it reduces
> the number of disk seeks down the line.

Yeah I mixed that as well, thanks for clarification (Johannes wrote a 
similar response effectively kicking b) from the list of things we could 
do).

Regarding your question from thread reply#3
 > How on earth would a backup job benefit from cache?
 >
 > It only accesses each bit of data once, so caching the
 > to-be-backed-up data is a waste of memory.

If it is a low memory system with a lot of disks (like in my case) 
giving it more cache allows e.g. larger readaheads or less cache 
trashing - but it might be ok, as it might be rare case to hit all those 
constraints at once.
But as we discussed before on virtual servers it can happen from time to 
time due to balooning and much more disk attachments etc.



So definitely not the majority of cases around, but some corner cases 
here and there that would benefit at least from making the preserved 
ratio configurable if we don't find a good way to let it take the memory 
back without hurting the intended preservation functionality.

For that reason - how about the patch I posted yesterday (to consolidate 
this spread out thread I attach it here again)



And finally I still would like to understand why writing the same files 
three times increase the active file pages each time instead of reusing 
those already brought into memory by the first run.
To collect that last open thread as well I'll cite my own question here:

 > Thinking about it I wondered for what these Buffers are protected.
 > If the intention to save these buffers is for reuse with similar 
loads > I wonder why I "need" three iozones to build up the 85M in my case.

 > Buffers start at ~0, after iozone run 1 they are at ~35, then after 
#2 > ~65 and after run #3 ~85.
 > Shouldn't that either allocate 85M for the first directly in case 
that > much is needed for a single run - or if not the second and third 
run > > just "resuse" the 35M Buffers from the first run still held?

 > Note - "1 iozone run" means "iozone ... -i 0" which sequentially
 > writes and then rewrites a 2Gb file on 16 disks in my current case.

Trying to answering this question my self using your buffer details 
above doesn't completely fit without further clarification, as the same 
files should have the same dir, inode, ... (all ext2 in my case, so no 
journal data as well).


-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

[-- Attachment #2: active-inacte-ratio-tunable.diff --]
[-- Type: text/x-patch, Size: 4672 bytes --]

Subject: [PATCH] mm: make working set portion that is protected tunable

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

In discussion with Rik van Riel and Joannes Weiner we came up that there are
cases that want the current "save 50%" for the working set all the time and
others that would benefit from protectig only a smaller amount.

Eventually no "carved in stone" in kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0%  - like a kernel pre "56e49d21 vmscan: evict use-once pages first"
- x%  - any other percentage to allow customizing the system to its needs.

Due to our experiments the suggested default in this patch is 25%, but if
preferred I'm fine keeping 50% and letting admins/distros adapt as needed.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---

[diffstat]

[diff]
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt	2010-04-21 06:32:23.000000000 +0200
+++ linux-2.6/Documentation/sysctl/vm.txt	2010-04-21 07:24:35.000000000 +0200
@@ -18,6 +18,7 @@
 
 Currently, these files are in /proc/sys/vm:
 
+- active_inactive_ratio
 - block_dump
 - dirty_background_bytes
 - dirty_background_ratio
@@ -57,6 +58,15 @@
 
 ==============================================================
 
+active_inactive_ratio
+
+The kernel tries to protect the active working set. Therefore a portion of the
+file pages is protected, meaning they are omitted when eviting pages until this
+ratio is reached.
+This tunable represents that ratio in percent and specifies the protected part
+
+==============================================================
+
 block_dump
 
 block_dump enables block I/O debugging when set to a nonzero value. More
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2010-04-21 06:33:43.000000000 +0200
+++ linux-2.6/kernel/sysctl.c	2010-04-21 07:26:35.000000000 +0200
@@ -1271,6 +1271,15 @@
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "active_inactive_ratio",
+		.data		= &sysctl_active_inactive_ratio,
+		.maxlen		= sizeof(sysctl_active_inactive_ratio),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/memcontrol.c	2010-04-21 09:00:22.000000000 +0200
@@ -893,12 +893,12 @@
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 {
 	unsigned long active;
-	unsigned long inactive;
+	unsigned long file;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+	file = active + mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > file * sysctl_active_inactive_ratio / 100);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/vmscan.c	2010-04-21 09:00:13.000000000 +0200
@@ -1459,14 +1459,23 @@
 	return low;
 }
 
+/*
+ * sysctl_active_inactive_ratio
+ *
+ * Defines the portion of file pages within the active working set is going to
+ * be protected. The value represents the percentage that will be protected.
+ */
+int sysctl_active_inactive_ratio __read_mostly = 25;
+
 static int inactive_file_is_low_global(struct zone *zone)
 {
-	unsigned long active, inactive;
+	unsigned long active, file;
 
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+	file = active + zone_page_state(zone, NR_INACTIVE_FILE);
+
+	return (active > file * sysctl_active_inactive_ratio / 100);
 
-	return (active > inactive);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-04-21 09:02:37.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2010-04-21 09:02:51.000000000 +0200
@@ -1467,5 +1467,7 @@
 
 extern void dump_page(struct page *page);
 
+extern int sysctl_active_inactive_ratio;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
@ 2010-04-22  6:21                                         ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-22  6:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, linux-mm,
	Nick Piggin, Chris Mason, Jens Axboe, linux-kernel, gregkh,
	Corrado Zoccolo, Ehrhardt Christian

[-- Attachment #1: Type: text/plain, Size: 4480 bytes --]

Trying to answer and consolidate all open parts of this thread down below.

Rik van Riel wrote:
> On 04/21/2010 03:35 AM, Christian Ehrhardt wrote:
>>
>>
>> Christian Ehrhardt wrote:
>>>
>>>
>>> Rik van Riel wrote:
>>>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>>>
>>>>> The idea is that it pans out on its own. If the workload changes, new
>>>>> pages get activated and when that set grows too large, we start
>>>>> shrinking
>>>>> it again.
>>>>>
>>>>> Of course, right now this unscanned set is way too large and we can 
>>>>> end
>>>>> up wasting up to 50% of usable page cache on false active pages.
>>>>
>>>> Thing is, changing workloads often change back.
>>>>
>>>> Specifically, think of a desktop system that is doing
>>>> work for the user during the day and gets backed up
>>>> at night.
>>>>
>>>> You do not want the backup to kick the working set
>>>> out of memory, because when the user returns in the
>>>> morning the desktop should come back quickly after
>>>> the screensaver is unlocked.
>>>
>>> IMHO it is fine to prevent that nightly backup job from not being
>>> finished when the user arrives at morning because we didn't give him
>>> some more cache - and e.g. a 30 sec transition from/to both optimized
>>> states is fine.
>>> But eventually I guess the point is that both behaviors are reasonable
>>> to achieve - depending on the users needs.
>>>
>>> What we could do is combine all our thoughts we had so far:
>>> a) Rik could create an experimental patch that excludes the in flight
>>> pages
>>> b) Johannes could create one for his suggestion to "always scan active
>>> file pages but only deactivate them when the ratio is off and
>>> otherwise strip buffers of clean pages"
> 
> I think you are confusing "buffer heads" with "buffers".
> 
> You can strip buffer heads off pages, but that is not
> your problem.
> 
> "buffers" in /proc/meminfo stands for cached metadata,
> eg. the filesystem journal, inodes, directories, etc...
> Caching such metadata is legitimate, because it reduces
> the number of disk seeks down the line.

Yeah I mixed that as well, thanks for clarification (Johannes wrote a 
similar response effectively kicking b) from the list of things we could 
do).

Regarding your question from thread reply#3
 > How on earth would a backup job benefit from cache?
 >
 > It only accesses each bit of data once, so caching the
 > to-be-backed-up data is a waste of memory.

If it is a low memory system with a lot of disks (like in my case) 
giving it more cache allows e.g. larger readaheads or less cache 
trashing - but it might be ok, as it might be rare case to hit all those 
constraints at once.
But as we discussed before on virtual servers it can happen from time to 
time due to balooning and much more disk attachments etc.



So definitely not the majority of cases around, but some corner cases 
here and there that would benefit at least from making the preserved 
ratio configurable if we don't find a good way to let it take the memory 
back without hurting the intended preservation functionality.

For that reason - how about the patch I posted yesterday (to consolidate 
this spread out thread I attach it here again)



And finally I still would like to understand why writing the same files 
three times increase the active file pages each time instead of reusing 
those already brought into memory by the first run.
To collect that last open thread as well I'll cite my own question here:

 > Thinking about it I wondered for what these Buffers are protected.
 > If the intention to save these buffers is for reuse with similar 
loads > I wonder why I "need" three iozones to build up the 85M in my case.

 > Buffers start at ~0, after iozone run 1 they are at ~35, then after 
#2 > ~65 and after run #3 ~85.
 > Shouldn't that either allocate 85M for the first directly in case 
that > much is needed for a single run - or if not the second and third 
run > > just "resuse" the 35M Buffers from the first run still held?

 > Note - "1 iozone run" means "iozone ... -i 0" which sequentially
 > writes and then rewrites a 2Gb file on 16 disks in my current case.

Trying to answering this question my self using your buffer details 
above doesn't completely fit without further clarification, as the same 
files should have the same dir, inode, ... (all ext2 in my case, so no 
journal data as well).


-- 

GrA 1/4 sse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

[-- Attachment #2: active-inacte-ratio-tunable.diff --]
[-- Type: text/x-patch, Size: 4672 bytes --]

Subject: [PATCH] mm: make working set portion that is protected tunable

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

In discussion with Rik van Riel and Joannes Weiner we came up that there are
cases that want the current "save 50%" for the working set all the time and
others that would benefit from protectig only a smaller amount.

Eventually no "carved in stone" in kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0%  - like a kernel pre "56e49d21 vmscan: evict use-once pages first"
- x%  - any other percentage to allow customizing the system to its needs.

Due to our experiments the suggested default in this patch is 25%, but if
preferred I'm fine keeping 50% and letting admins/distros adapt as needed.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---

[diffstat]

[diff]
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt	2010-04-21 06:32:23.000000000 +0200
+++ linux-2.6/Documentation/sysctl/vm.txt	2010-04-21 07:24:35.000000000 +0200
@@ -18,6 +18,7 @@
 
 Currently, these files are in /proc/sys/vm:
 
+- active_inactive_ratio
 - block_dump
 - dirty_background_bytes
 - dirty_background_ratio
@@ -57,6 +58,15 @@
 
 ==============================================================
 
+active_inactive_ratio
+
+The kernel tries to protect the active working set. Therefore a portion of the
+file pages is protected, meaning they are omitted when eviting pages until this
+ratio is reached.
+This tunable represents that ratio in percent and specifies the protected part
+
+==============================================================
+
 block_dump
 
 block_dump enables block I/O debugging when set to a nonzero value. More
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2010-04-21 06:33:43.000000000 +0200
+++ linux-2.6/kernel/sysctl.c	2010-04-21 07:26:35.000000000 +0200
@@ -1271,6 +1271,15 @@
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "active_inactive_ratio",
+		.data		= &sysctl_active_inactive_ratio,
+		.maxlen		= sizeof(sysctl_active_inactive_ratio),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/memcontrol.c	2010-04-21 09:00:22.000000000 +0200
@@ -893,12 +893,12 @@
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 {
 	unsigned long active;
-	unsigned long inactive;
+	unsigned long file;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+	file = active + mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > file * sysctl_active_inactive_ratio / 100);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/vmscan.c	2010-04-21 09:00:13.000000000 +0200
@@ -1459,14 +1459,23 @@
 	return low;
 }
 
+/*
+ * sysctl_active_inactive_ratio
+ *
+ * Defines the portion of file pages within the active working set is going to
+ * be protected. The value represents the percentage that will be protected.
+ */
+int sysctl_active_inactive_ratio __read_mostly = 25;
+
 static int inactive_file_is_low_global(struct zone *zone)
 {
-	unsigned long active, inactive;
+	unsigned long active, file;
 
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+	file = active + zone_page_state(zone, NR_INACTIVE_FILE);
+
+	return (active > file * sysctl_active_inactive_ratio / 100);
 
-	return (active > inactive);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-04-21 09:02:37.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2010-04-21 09:02:51.000000000 +0200
@@ -1467,5 +1467,7 @@
 
 extern void dump_page(struct page *page);
 
+extern int sysctl_active_inactive_ratio;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
  2010-04-22  6:21                                         ` Christian Ehrhardt
@ 2010-04-26 10:59                                           ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-26 10:59 UTC (permalink / raw)
  To: Rik van Riel, Johannes Weiner, Andrew Morton, Nick Piggin, gregkh
  Cc: Mel Gorman, linux-mm, Chris Mason, Jens Axboe, linux-kernel,
	Corrado Zoccolo

Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

*updates in v2*
- use do_div

This patch creates a knob to help users that have workloads suffering from the
fix 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
evict use-once pages first".
It also provides the tuning mechanisms for other users that want an even bigger
working set to be protected.

To be honest the best solution would be to allow a system not using the working
set to regain that memory *somewhen*, and therefore without drawbacks to the
scenarios it was implemented for e.g. UI interactivity while copying a lot of
data. But up to now there was no idea how to get that behaviour implemented.

In the old thread started by Elladan that finally led to 56e49d21 Wu Fengguang
wrote:
 "In the worse scenario, it could waste half the memory that could
 otherwise be used for readahead buffer and to prevent thrashing, in a
 server serving large datasets that are hardly reused, but still slowly
 builds up its active list during the long uptime (think about a slowly
 performance downgrade that can be fixed by a crude dropcache action).

 That said, the actual performance degradation could be much smaller -
 say 15% - all memories are not equal."

We now identified a case with up to -60% Throughput, therefore this patch tries
to provide a more gentle interface than drop_caches to help a system stuck in
this.

In discussion with Rik van Riel and Joannes Weiner we came up that there are
cases that want the current "save 50%" for the working set all the time and
others that would benefit from protectig only a smaller amount.

Eventually no "carved in stone" in kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0%  - like a kernel pre 56e49d21
- x%  - allow customizing the system to someones needs

Due to our experiments the suggested default in this patch is 25%, but if
preferred I'm fine keeping 50% and letting admins/distros adapt as needed.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---

[diffstat]
 Documentation/sysctl/vm.txt |   10 ++++++++++
 include/linux/mm.h          |    2 ++
 kernel/sysctl.c             |    9 +++++++++
 mm/memcontrol.c             |    9 ++++++---
 mm/vmscan.c                 |   17 ++++++++++++++---
 5 files changed, 41 insertions(+), 6 deletions(-)

[diff]
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt	2010-04-21 06:32:23.000000000 +0200
+++ linux-2.6/Documentation/sysctl/vm.txt	2010-04-21 07:24:35.000000000 +0200
@@ -18,6 +18,7 @@
 
 Currently, these files are in /proc/sys/vm:
 
+- active_inactive_ratio
 - block_dump
 - dirty_background_bytes
 - dirty_background_ratio
@@ -57,6 +58,15 @@
 
 ==============================================================
 
+active_inactive_ratio
+
+The kernel tries to protect the active working set. Therefore a portion of the
+file pages is protected, meaning they are omitted when eviting pages until this
+ratio is reached.
+This tunable represents that ratio in percent and specifies the protected part
+
+==============================================================
+
 block_dump
 
 block_dump enables block I/O debugging when set to a nonzero value. More
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2010-04-21 06:33:43.000000000 +0200
+++ linux-2.6/kernel/sysctl.c	2010-04-21 07:26:35.000000000 +0200
@@ -1271,6 +1271,15 @@
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "active_inactive_ratio",
+		.data		= &sysctl_active_inactive_ratio,
+		.maxlen		= sizeof(sysctl_active_inactive_ratio),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/memcontrol.c	2010-04-26 12:45:46.000000000 +0200
@@ -893,12 +893,15 @@
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 {
 	unsigned long active;
-	unsigned long inactive;
+	unsigned long activetoprotect;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+	activetoprotect = active
+		+ mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE)
+		* sysctl_active_inactive_ratio;
+	activetoprotect = do_div(activetoprotect, 100);
 
-	return (active > inactive);
+	return (active > activetoprotect);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/vmscan.c	2010-04-26 12:50:47.000000000 +0200
@@ -1459,14 +1459,25 @@
 	return low;
 }
 
+/*
+ * sysctl_active_inactive_ratio
+ *
+ * Defines the portion of file pages within the active working set is going to
+ * be protected. The value represents the percentage that will be protected.
+ */
+int sysctl_active_inactive_ratio __read_mostly = 25;
+
 static int inactive_file_is_low_global(struct zone *zone)
 {
-	unsigned long active, inactive;
+	unsigned long active, activetoprotect;
 
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+	activetoprotect = zone_page_state(zone, NR_FILE)
+			* sysctl_active_inactive_ratio;
+	activetoprotect = do_div(activetoprotect, 100);
+
+	return (active > activetoprotect);
 
-	return (active > inactive);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-04-21 09:02:37.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2010-04-21 09:02:51.000000000 +0200
@@ -1467,5 +1467,7 @@
 
 extern void dump_page(struct page *page);
 
+extern int sysctl_active_inactive_ratio;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
@ 2010-04-26 10:59                                           ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-26 10:59 UTC (permalink / raw)
  To: Rik van Riel, Johannes Weiner, Andrew Morton, Nick Piggin, gregkh
  Cc: Mel Gorman, linux-mm, Chris Mason, Jens Axboe, linux-kernel,
	Corrado Zoccolo

Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

*updates in v2*
- use do_div

This patch creates a knob to help users that have workloads suffering from the
fix 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
evict use-once pages first".
It also provides the tuning mechanisms for other users that want an even bigger
working set to be protected.

To be honest the best solution would be to allow a system not using the working
set to regain that memory *somewhen*, and therefore without drawbacks to the
scenarios it was implemented for e.g. UI interactivity while copying a lot of
data. But up to now there was no idea how to get that behaviour implemented.

In the old thread started by Elladan that finally led to 56e49d21 Wu Fengguang
wrote:
 "In the worse scenario, it could waste half the memory that could
 otherwise be used for readahead buffer and to prevent thrashing, in a
 server serving large datasets that are hardly reused, but still slowly
 builds up its active list during the long uptime (think about a slowly
 performance downgrade that can be fixed by a crude dropcache action).

 That said, the actual performance degradation could be much smaller -
 say 15% - all memories are not equal."

We now identified a case with up to -60% Throughput, therefore this patch tries
to provide a more gentle interface than drop_caches to help a system stuck in
this.

In discussion with Rik van Riel and Joannes Weiner we came up that there are
cases that want the current "save 50%" for the working set all the time and
others that would benefit from protectig only a smaller amount.

Eventually no "carved in stone" in kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0%  - like a kernel pre 56e49d21
- x%  - allow customizing the system to someones needs

Due to our experiments the suggested default in this patch is 25%, but if
preferred I'm fine keeping 50% and letting admins/distros adapt as needed.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---

[diffstat]
 Documentation/sysctl/vm.txt |   10 ++++++++++
 include/linux/mm.h          |    2 ++
 kernel/sysctl.c             |    9 +++++++++
 mm/memcontrol.c             |    9 ++++++---
 mm/vmscan.c                 |   17 ++++++++++++++---
 5 files changed, 41 insertions(+), 6 deletions(-)

[diff]
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt	2010-04-21 06:32:23.000000000 +0200
+++ linux-2.6/Documentation/sysctl/vm.txt	2010-04-21 07:24:35.000000000 +0200
@@ -18,6 +18,7 @@
 
 Currently, these files are in /proc/sys/vm:
 
+- active_inactive_ratio
 - block_dump
 - dirty_background_bytes
 - dirty_background_ratio
@@ -57,6 +58,15 @@
 
 ==============================================================
 
+active_inactive_ratio
+
+The kernel tries to protect the active working set. Therefore a portion of the
+file pages is protected, meaning they are omitted when eviting pages until this
+ratio is reached.
+This tunable represents that ratio in percent and specifies the protected part
+
+==============================================================
+
 block_dump
 
 block_dump enables block I/O debugging when set to a nonzero value. More
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2010-04-21 06:33:43.000000000 +0200
+++ linux-2.6/kernel/sysctl.c	2010-04-21 07:26:35.000000000 +0200
@@ -1271,6 +1271,15 @@
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "active_inactive_ratio",
+		.data		= &sysctl_active_inactive_ratio,
+		.maxlen		= sizeof(sysctl_active_inactive_ratio),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/memcontrol.c	2010-04-26 12:45:46.000000000 +0200
@@ -893,12 +893,15 @@
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 {
 	unsigned long active;
-	unsigned long inactive;
+	unsigned long activetoprotect;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+	activetoprotect = active
+		+ mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE)
+		* sysctl_active_inactive_ratio;
+	activetoprotect = do_div(activetoprotect, 100);
 
-	return (active > inactive);
+	return (active > activetoprotect);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/vmscan.c	2010-04-26 12:50:47.000000000 +0200
@@ -1459,14 +1459,25 @@
 	return low;
 }
 
+/*
+ * sysctl_active_inactive_ratio
+ *
+ * Defines the portion of file pages within the active working set is going to
+ * be protected. The value represents the percentage that will be protected.
+ */
+int sysctl_active_inactive_ratio __read_mostly = 25;
+
 static int inactive_file_is_low_global(struct zone *zone)
 {
-	unsigned long active, inactive;
+	unsigned long active, activetoprotect;
 
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+	activetoprotect = zone_page_state(zone, NR_FILE)
+			* sysctl_active_inactive_ratio;
+	activetoprotect = do_div(activetoprotect, 100);
+
+	return (active > activetoprotect);
 
-	return (active > inactive);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-04-21 09:02:37.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2010-04-21 09:02:51.000000000 +0200
@@ -1467,5 +1467,7 @@
 
 extern void dump_page(struct page *page);
 
+extern int sysctl_active_inactive_ratio;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Subject: [PATCH][RFC] mm: make working set portion that is  protected tunable v2
  2010-04-26 10:59                                           ` Christian Ehrhardt
@ 2010-04-26 11:59                                             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 136+ messages in thread
From: KOSAKI Motohiro @ 2010-04-26 11:59 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Rik van Riel, Johannes Weiner, Andrew Morton, Nick Piggin,
	gregkh, Mel Gorman, linux-mm, Chris Mason, Jens Axboe,
	linux-kernel, Corrado Zoccolo

Hi

I've quick reviewed your patch. but unfortunately I can't write my
reviewed-by sign.

> Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> *updates in v2*
> - use do_div
>
> This patch creates a knob to help users that have workloads suffering from the
> fix 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
> evict use-once pages first".
> It also provides the tuning mechanisms for other users that want an even bigger
> working set to be protected.

We certainly need no knob. because typical desktop users use various
application,
various workload. then, the knob doesn't help them.

Probably, I've missed previous discussion. I'm going to find your previous mail.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
@ 2010-04-26 11:59                                             ` KOSAKI Motohiro
  0 siblings, 0 replies; 136+ messages in thread
From: KOSAKI Motohiro @ 2010-04-26 11:59 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Rik van Riel, Johannes Weiner, Andrew Morton, Nick Piggin,
	gregkh, Mel Gorman, linux-mm, Chris Mason, Jens Axboe,
	linux-kernel, Corrado Zoccolo

Hi

I've quick reviewed your patch. but unfortunately I can't write my
reviewed-by sign.

> Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>
> *updates in v2*
> - use do_div
>
> This patch creates a knob to help users that have workloads suffering from the
> fix 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
> evict use-once pages first".
> It also provides the tuning mechanisms for other users that want an even bigger
> working set to be protected.

We certainly need no knob. because typical desktop users use various
application,
various workload. then, the knob doesn't help them.

Probably, I've missed previous discussion. I'm going to find your previous mail.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
  2010-04-26 11:59                                             ` KOSAKI Motohiro
@ 2010-04-26 12:43                                               ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-26 12:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Johannes Weiner, Andrew Morton, Nick Piggin,
	gregkh, Mel Gorman, linux-mm, Chris Mason, Jens Axboe,
	linux-kernel, Corrado Zoccolo



KOSAKI Motohiro wrote:
> Hi
> 
> I've quick reviewed your patch. but unfortunately I can't write my
> reviewed-by sign.

Not a problem, atm I'm happy about any review and comment :-)

>> Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
>> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>>
>> *updates in v2*
>> - use do_div
>>
>> This patch creates a knob to help users that have workloads suffering from the
>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
>> evict use-once pages first".
>> It also provides the tuning mechanisms for other users that want an even bigger
>> working set to be protected.
> 
> We certainly need no knob. because typical desktop users use various
> application,
> various workload. then, the knob doesn't help them.

Briefly - We had discussed non desktop scenarios where like a day load 
that builds up the working set to 50% and a nightly backup job which 
then is unable to use that protected 50% when sequentially reading a lot 
of disks and due to that doesn't finish before morning.

The knob should help those people that know their system would suffer 
from this or similar cases to e.g. set the protected ratio smaller or 
even to zero if wanted.

As mentioned before, being able to gain back those protected 50% would 
be even better - if it can be done in a way not hurting the original 
intention of protecting them.

I personally just don't feel too good knowing that 50% of my memory 
might hang around unused for many hours while they could be of some use.
I absolutely agree with the old intention and see how the patch helped 
with the latency issue Elladan brought up in the past - but it just 
looks way too aggressive to protect it "forever" for some server use cases.

> Probably, I've missed previous discussion. I'm going to find your previous mail.

The discussion ends at http://lkml.org/lkml/2010/4/22/38 - feel free to 
click through it.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
@ 2010-04-26 12:43                                               ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-26 12:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Johannes Weiner, Andrew Morton, Nick Piggin,
	gregkh, Mel Gorman, linux-mm, Chris Mason, Jens Axboe,
	linux-kernel, Corrado Zoccolo



KOSAKI Motohiro wrote:
> Hi
> 
> I've quick reviewed your patch. but unfortunately I can't write my
> reviewed-by sign.

Not a problem, atm I'm happy about any review and comment :-)

>> Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
>> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>>
>> *updates in v2*
>> - use do_div
>>
>> This patch creates a knob to help users that have workloads suffering from the
>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
>> evict use-once pages first".
>> It also provides the tuning mechanisms for other users that want an even bigger
>> working set to be protected.
> 
> We certainly need no knob. because typical desktop users use various
> application,
> various workload. then, the knob doesn't help them.

Briefly - We had discussed non desktop scenarios where like a day load 
that builds up the working set to 50% and a nightly backup job which 
then is unable to use that protected 50% when sequentially reading a lot 
of disks and due to that doesn't finish before morning.

The knob should help those people that know their system would suffer 
from this or similar cases to e.g. set the protected ratio smaller or 
even to zero if wanted.

As mentioned before, being able to gain back those protected 50% would 
be even better - if it can be done in a way not hurting the original 
intention of protecting them.

I personally just don't feel too good knowing that 50% of my memory 
might hang around unused for many hours while they could be of some use.
I absolutely agree with the old intention and see how the patch helped 
with the latency issue Elladan brought up in the past - but it just 
looks way too aggressive to protect it "forever" for some server use cases.

> Probably, I've missed previous discussion. I'm going to find your previous mail.

The discussion ends at http://lkml.org/lkml/2010/4/22/38 - feel free to 
click through it.

-- 

Grusse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
  2010-04-26 12:43                                               ` Christian Ehrhardt
@ 2010-04-26 14:20                                                 ` Rik van Riel
  -1 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-26 14:20 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: KOSAKI Motohiro, Johannes Weiner, Andrew Morton, Nick Piggin,
	gregkh, Mel Gorman, linux-mm, Chris Mason, Jens Axboe,
	linux-kernel, Corrado Zoccolo

On 04/26/2010 08:43 AM, Christian Ehrhardt wrote:

>>> This patch creates a knob to help users that have workloads suffering
>>> from the
>>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21
>>> vmscan:
>>> evict use-once pages first".
>>> It also provides the tuning mechanisms for other users that want an
>>> even bigger
>>> working set to be protected.
>>
>> We certainly need no knob. because typical desktop users use various
>> application,
>> various workload. then, the knob doesn't help them.
>
> Briefly - We had discussed non desktop scenarios where like a day load
> that builds up the working set to 50% and a nightly backup job which
> then is unable to use that protected 50% when sequentially reading a lot
> of disks and due to that doesn't finish before morning.

This is a red herring.  A backup touches all of the
data once, so it does not need a lot of page cache
and will not "not finish before morning" due to the
working set being protected.

You're going to have to come up with a more realistic
scenario than that.

> I personally just don't feel too good knowing that 50% of my memory
> might hang around unused for many hours while they could be of some use.
> I absolutely agree with the old intention and see how the patch helped
> with the latency issue Elladan brought up in the past - but it just
> looks way too aggressive to protect it "forever" for some server use cases.

So far we have seen exactly one workload where it helps
to reduce the size of the active file list, and that is
not due to any need for caching more inactive pages.

On the contrary, it is because ALL OF THE INACTIVE PAGES
are in flight to disk, all under IO at the same time.

Caching has absolutely nothing to do with the regression
you ran into.


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
@ 2010-04-26 14:20                                                 ` Rik van Riel
  0 siblings, 0 replies; 136+ messages in thread
From: Rik van Riel @ 2010-04-26 14:20 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: KOSAKI Motohiro, Johannes Weiner, Andrew Morton, Nick Piggin,
	gregkh, Mel Gorman, linux-mm, Chris Mason, Jens Axboe,
	linux-kernel, Corrado Zoccolo

On 04/26/2010 08:43 AM, Christian Ehrhardt wrote:

>>> This patch creates a knob to help users that have workloads suffering
>>> from the
>>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21
>>> vmscan:
>>> evict use-once pages first".
>>> It also provides the tuning mechanisms for other users that want an
>>> even bigger
>>> working set to be protected.
>>
>> We certainly need no knob. because typical desktop users use various
>> application,
>> various workload. then, the knob doesn't help them.
>
> Briefly - We had discussed non desktop scenarios where like a day load
> that builds up the working set to 50% and a nightly backup job which
> then is unable to use that protected 50% when sequentially reading a lot
> of disks and due to that doesn't finish before morning.

This is a red herring.  A backup touches all of the
data once, so it does not need a lot of page cache
and will not "not finish before morning" due to the
working set being protected.

You're going to have to come up with a more realistic
scenario than that.

> I personally just don't feel too good knowing that 50% of my memory
> might hang around unused for many hours while they could be of some use.
> I absolutely agree with the old intention and see how the patch helped
> with the latency issue Elladan brought up in the past - but it just
> looks way too aggressive to protect it "forever" for some server use cases.

So far we have seen exactly one workload where it helps
to reduce the size of the active file list, and that is
not due to any need for caching more inactive pages.

On the contrary, it is because ALL OF THE INACTIVE PAGES
are in flight to disk, all under IO at the same time.

Caching has absolutely nothing to do with the regression
you ran into.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
  2010-04-26 14:20                                                 ` Rik van Riel
@ 2010-04-27 14:00                                                   ` Christian Ehrhardt
  -1 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-27 14:00 UTC (permalink / raw)
  To: Rik van Riel, Mel Gorman
  Cc: KOSAKI Motohiro, Johannes Weiner, Andrew Morton, Nick Piggin,
	gregkh, linux-mm, Chris Mason, Jens Axboe, linux-kernel,
	Corrado Zoccolo



Rik van Riel wrote:
> On 04/26/2010 08:43 AM, Christian Ehrhardt wrote:
> 
>>>> This patch creates a knob to help users that have workloads suffering
>>>> from the
>>>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21
>>>> vmscan:
>>>> evict use-once pages first".
>>>> It also provides the tuning mechanisms for other users that want an
>>>> even bigger
>>>> working set to be protected.
>>>
>>> We certainly need no knob. because typical desktop users use various
>>> application,
>>> various workload. then, the knob doesn't help them.
>>
>> Briefly - We had discussed non desktop scenarios where like a day load
>> that builds up the working set to 50% and a nightly backup job which
>> then is unable to use that protected 50% when sequentially reading a lot
>> of disks and due to that doesn't finish before morning.
> 
> This is a red herring.  A backup touches all of the
> data once, so it does not need a lot of page cache
> and will not "not finish before morning" due to the
> working set being protected.
>
> You're going to have to come up with a more realistic
> scenario than that.

I completely agree that a backup case is read once and therefore doesn't
benefit from caching itself, but you know my scenario from the thread
where this patch emerged from.
="Parallel iozone sequential read - resembling the classic backup case
(read once + sequential)."

While caching isn't helping the classic way, by having data in cache
ready on the next access it is still used transparently as the system
is reading ahead into page cache to assist the sequentially reading
process.
Yes it doesn't happen with direct IO and some, but unfortunately not
all backup tools use DIO. Additionally not all backup jobs have a whole
night, and this can really be a decision maker if you can quickly pump
out your 100 TB main database in 10 or 20 minutes.

So here comes the problem, due to the 50% preserved I assume it comes
into trouble allocating that page cache memory in time. So much that it
even slows down the load - meaning long enough to let the application
completely consume the data already read and then still letting it wait.
More about that below.

Now IMHO this feels comparable to a classic backup job, and by loosing
60% Throughput (more than a Gb/s) is seems neither red nor smells like
fish to me.

>> I personally just don't feel too good knowing that 50% of my memory
>> might hang around unused for many hours while they could be of some use.
>> I absolutely agree with the old intention and see how the patch helped
>> with the latency issue Elladan brought up in the past - but it just
>> looks way too aggressive to protect it "forever" for some server use 
>> cases.
> 
> So far we have seen exactly one workload where it helps
> to reduce the size of the active file list, and that is
> not due to any need for caching more inactive pages.
>
> On the contrary, it is because ALL OF THE INACTIVE PAGES
> are in flight to disk, all under IO at the same time.

Ok this time I think I got your point much better - sorry for 
being confused.
Discard my patch, but I'd really like to clarify and verify your 
assumption in conjunction with my findings and would be happy
if you can help me with that.

As mentioned the case that suffers from the 50% memory protected is
iozone read - so it would be "in flight FROM disk", but I guess that
it is not important if it is from or to right ?

Effectively I have two read cases, one with caches dropped which then 
has almost full memory for page cache in the read case. And the other 
one with a few writes before filling up the protected 50% leading to a 
read case with only half of the memory for page cache.
Now if I really got you right this time the issue is caused by the
fact that the parallel read ahead on all 16 disks creates so much I/O
in flight that the 128M (=50% that are left) are not enough.
>From the past we know that the time lost for the -60% Throughput was 
spent in a loop around direct_reclaim&congestion_wait trying to get the
memory for the page cache reads - would you consider it possible that
we now run into a scenario splitting the memory like this?:
- 50% active file protected
- a lot of the other half related to I/O that is currently
  in flight from the disk -> not free-able too?
- almost nothing to free when allocating for the next read to page 
  cache (can only take pages above low watermark) -> waiting

I updated my old counter patch, that I used to verify the old issue were
we spent so much time in a full timeout of congestion wait. Thanks to
Mel this was fixed (I have his watermark wait patch applied), but I
assume having 50% protected I just run into the shortened wait more
often or wait longer for watermarks to still be an issue (due to 50%
not free-able).
See the patch inlined at the end of the mail for details what/how
it is exactly counted.

As before the scenario is iozone on 16 disks in parallel with 1 iozone
child per disk.
I ran:
- write, write, write, read -> bad case
- drop cache, read -> good case
Read throughput still drops by ~60% comparing good to bad case.
Here are the numbers I got for those two cases by my counters and
meminfo:

Value                           Initial state          Write 1            Write 2             Write 3     Read after writes (bad)      Read after DC (good)	
watermark_wait_duration (ns)                0    9,902,333,643     12,288,444,574      24,197,098,221             317,175,021,553            35,002,926,894
watermark_wait                              0            24102              26708               35285                       29720                     15515
pages_direct_reclaim                        0            59195              65010               86777                       90883                     66672
failed_pages_direct_reclaim                 0            24144              26768               35343                       29733                     15525
failed_pages_direct_reclaim_but_progress    0            24144              26768               35343                       29733                     15525

MemTotal:                              248912           248912             248912              248912                      248912                    248912
MemFree:                               185732             4868               5028                3780                        3064                      7136
Buffers:                                  536            33588              65660               84296                       81868                     32072
Cached:                                  9480           145252             111672               93736                       98424                    149724
Active:                                 11052            43920              76032               89084                       87780                     38024
Inactive:                                6860           142628             108980               96528                      100280                    151572
Active(anon):                            5092             4452               4428                4364                        4516                      4492
Inactive(anon):                          6480             6608               6604                6604                        6604                      6604
Active(file):                            5960            39468              71604               84720                       83264                     33532
Inactive(file):                           380           136020             102376               89924                       93676                    144968
Unevictable:                             3952             3952               3952                3952                        3952                      3952
							
Real Time passed in seconds                              48.83             49.38                50.35                       40.62                      22.61	
AVG wait time waitduration/#                           410,851           460,104              685,762                  10,672,107                  2,256,070	=> x5 longer waits in avg
                                                                                                                                                      -52.20%	bad case runs about twice as often into waits

These numbers seem to point toward my assumption, that the 50% preserved
cause the system to be unable to find memory fast enough.
Happening twice as often to run into the wait after a direct_reclaim
that made progress, but not finding a free page.
And then in average waiting about 5 times longer to get things freed up
enough reaching the watermark and get woken up.


####

Eventually I'd also really like to completely understand why the active
file pages grow when I execute the same iozone write load three times.
They effectively write the same files in the same directories without 
being a journaling file system (The effect can be seen in the table
above as well).

If one of these write runs would use more than ~30M active file pages
they would be allocated and afterwards protected, but they aren't.
Then after the second run I see ~60M active file pages.
As mentioned before I would assume that it either just reuses what is
in memory from the first run, or if it really uses new stuff then the
time has come to throw the old away.

Therefore I would assume that it should never get much more after the
first run as long as they are essentially doing the same.
Does someone already know or has a good assumption what might be
growing in these buffers?
Is there a good interface to check what is buffered and protected atm?

> Caching has absolutely nothing to do with the regression
> you ran into.

As mentioned above not by means of "having it in the cache for another
fast access" yes.
But maybe by "not getting memory for reads into page cache fast enough".

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


#### patch for the counters shown in table above ######
Subject: [PATCH][DEBUGONLY] mm: track allocation waits

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

This patch adds some debug counters to track how often a system runs into
waits after direct reclaim (happens in case of did_some_progress & !page)
and how much time it spends there waiting.

#for debugging only#

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---

[diffstat]
 include/linux/sysctl.h |    1
 kernel/sysctl.c        |   57 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |   17 ++++++++++++++
 3 files changed, 75 insertions(+)

[diff]
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/include/linux/sysctl.h linux-2.6.32.11-0.3.99.6.626e022/include/linux/sysctl.h
--- linux-2.6.32.11-0.3.99.6.626e022.orig/include/linux/sysctl.h	2010-04-27 12:01:54.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/include/linux/sysctl.h	2010-04-27 12:03:56.000000000 +0200
@@ -68,6 +68,7 @@
 	CTL_BUS=8,		/* Busses */
 	CTL_ABI=9,		/* Binary emulation */
 	CTL_CPU=10,		/* CPU stuff (speed scaling, etc) */
+	CTL_PERF=11,		/* Performance counters and timer sums for debugging */
 	CTL_XEN=123,		/* Xen info and control */
 	CTL_ARLAN=254,		/* arlan wireless driver */
 	CTL_S390DBF=5677,	/* s390 debug */
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/kernel/sysctl.c linux-2.6.32.11-0.3.99.6.626e022/kernel/sysctl.c
--- linux-2.6.32.11-0.3.99.6.626e022.orig/kernel/sysctl.c	2010-04-27 14:26:04.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/kernel/sysctl.c	2010-04-27 15:44:54.000000000 +0200
@@ -183,6 +183,7 @@
 	.default_set.list = LIST_HEAD_INIT(root_table_header.ctl_entry),
 };
 
+static struct ctl_table perf_table[];
 static struct ctl_table kern_table[];
 static struct ctl_table vm_table[];
 static struct ctl_table fs_table[];
@@ -236,6 +237,13 @@
 		.mode		= 0555,
 		.child		= dev_table,
 	},
+	{
+		.ctl_name	= CTL_PERF,
+		.procname	= "perf",
+		.mode		= 0555,
+		.child		= perf_table,
+	},
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
@@ -254,6 +262,55 @@
 static int max_sched_shares_ratelimit = NSEC_PER_SEC; /* 1 second */
 #endif
 
+extern unsigned long perf_count_watermark_wait;
+extern unsigned long perf_count_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim_but_progress;
+extern unsigned long perf_count_watermark_wait_duration;
+static struct ctl_table perf_table[] = {
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_watermark_wait_duration",
+		.data           = &perf_count_watermark_wait_duration,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_watermark_wait",
+		.data           = &perf_count_watermark_wait,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_pages_direct_reclaim",
+		.data           = &perf_count_pages_direct_reclaim,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_failed_pages_direct_reclaim",
+		.data           = &perf_count_failed_pages_direct_reclaim,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_failed_pages_direct_reclaim_but_progress",
+		.data           = &perf_count_failed_pages_direct_reclaim_but_progress,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{ .ctl_name = 0 }
+};
+
 static struct ctl_table kern_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/mm/page_alloc.c linux-2.6.32.11-0.3.99.6.626e022/mm/page_alloc.c
--- linux-2.6.32.11-0.3.99.6.626e022.orig/mm/page_alloc.c	2010-04-27 12:01:55.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/mm/page_alloc.c	2010-04-27 14:06:40.000000000 +0200
@@ -191,6 +191,7 @@
 		wake_up_interruptible(&watermark_wq);
 }
 
+unsigned long perf_count_watermark_wait = 0;
 /**
  * watermark_wait - Wait for watermark to go above low
  * @timeout: Wait until watermark is reached or this timeout is reached
@@ -202,6 +203,7 @@
 	long ret;
 	DEFINE_WAIT(wait);
 
+	perf_count_watermark_wait++;
 	prepare_to_wait(&watermark_wq, &wait, TASK_INTERRUPTIBLE);
 
 	/*
@@ -1725,6 +1727,10 @@
 	return page;
 }
 
+unsigned long perf_count_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim_but_progress = 0;
+
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1761,6 +1767,13 @@
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	perf_count_pages_direct_reclaim++;
+	if (!page)
+		perf_count_failed_pages_direct_reclaim++;
+	if (!page && *did_some_progress)
+		perf_count_failed_pages_direct_reclaim_but_progress++;
+
 	return page;
 }
 
@@ -1841,6 +1854,7 @@
 	return alloc_flags;
 }
 
+unsigned long perf_count_watermark_wait_duration = 0;
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1961,8 +1975,11 @@
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+		unsigned long t1;
 		/* Too much pressure, back off a bit at let reclaimers do work */
+		t1 = get_clock();
 		watermark_wait(HZ/50);
+		perf_count_watermark_wait_duration += ((get_clock() - t1) * 125) >> 9;
 		goto rebalance;
 	}
 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
@ 2010-04-27 14:00                                                   ` Christian Ehrhardt
  0 siblings, 0 replies; 136+ messages in thread
From: Christian Ehrhardt @ 2010-04-27 14:00 UTC (permalink / raw)
  To: Rik van Riel, Mel Gorman
  Cc: KOSAKI Motohiro, Johannes Weiner, Andrew Morton, Nick Piggin,
	gregkh, linux-mm, Chris Mason, Jens Axboe, linux-kernel,
	Corrado Zoccolo



Rik van Riel wrote:
> On 04/26/2010 08:43 AM, Christian Ehrhardt wrote:
> 
>>>> This patch creates a knob to help users that have workloads suffering
>>>> from the
>>>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21
>>>> vmscan:
>>>> evict use-once pages first".
>>>> It also provides the tuning mechanisms for other users that want an
>>>> even bigger
>>>> working set to be protected.
>>>
>>> We certainly need no knob. because typical desktop users use various
>>> application,
>>> various workload. then, the knob doesn't help them.
>>
>> Briefly - We had discussed non desktop scenarios where like a day load
>> that builds up the working set to 50% and a nightly backup job which
>> then is unable to use that protected 50% when sequentially reading a lot
>> of disks and due to that doesn't finish before morning.
> 
> This is a red herring.  A backup touches all of the
> data once, so it does not need a lot of page cache
> and will not "not finish before morning" due to the
> working set being protected.
>
> You're going to have to come up with a more realistic
> scenario than that.

I completely agree that a backup case is read once and therefore doesn't
benefit from caching itself, but you know my scenario from the thread
where this patch emerged from.
="Parallel iozone sequential read - resembling the classic backup case
(read once + sequential)."

While caching isn't helping the classic way, by having data in cache
ready on the next access it is still used transparently as the system
is reading ahead into page cache to assist the sequentially reading
process.
Yes it doesn't happen with direct IO and some, but unfortunately not
all backup tools use DIO. Additionally not all backup jobs have a whole
night, and this can really be a decision maker if you can quickly pump
out your 100 TB main database in 10 or 20 minutes.

So here comes the problem, due to the 50% preserved I assume it comes
into trouble allocating that page cache memory in time. So much that it
even slows down the load - meaning long enough to let the application
completely consume the data already read and then still letting it wait.
More about that below.

Now IMHO this feels comparable to a classic backup job, and by loosing
60% Throughput (more than a Gb/s) is seems neither red nor smells like
fish to me.

>> I personally just don't feel too good knowing that 50% of my memory
>> might hang around unused for many hours while they could be of some use.
>> I absolutely agree with the old intention and see how the patch helped
>> with the latency issue Elladan brought up in the past - but it just
>> looks way too aggressive to protect it "forever" for some server use 
>> cases.
> 
> So far we have seen exactly one workload where it helps
> to reduce the size of the active file list, and that is
> not due to any need for caching more inactive pages.
>
> On the contrary, it is because ALL OF THE INACTIVE PAGES
> are in flight to disk, all under IO at the same time.

Ok this time I think I got your point much better - sorry for 
being confused.
Discard my patch, but I'd really like to clarify and verify your 
assumption in conjunction with my findings and would be happy
if you can help me with that.

As mentioned the case that suffers from the 50% memory protected is
iozone read - so it would be "in flight FROM disk", but I guess that
it is not important if it is from or to right ?

Effectively I have two read cases, one with caches dropped which then 
has almost full memory for page cache in the read case. And the other 
one with a few writes before filling up the protected 50% leading to a 
read case with only half of the memory for page cache.
Now if I really got you right this time the issue is caused by the
fact that the parallel read ahead on all 16 disks creates so much I/O
in flight that the 128M (=50% that are left) are not enough.
>From the past we know that the time lost for the -60% Throughput was 
spent in a loop around direct_reclaim&congestion_wait trying to get the
memory for the page cache reads - would you consider it possible that
we now run into a scenario splitting the memory like this?:
- 50% active file protected
- a lot of the other half related to I/O that is currently
  in flight from the disk -> not free-able too?
- almost nothing to free when allocating for the next read to page 
  cache (can only take pages above low watermark) -> waiting

I updated my old counter patch, that I used to verify the old issue were
we spent so much time in a full timeout of congestion wait. Thanks to
Mel this was fixed (I have his watermark wait patch applied), but I
assume having 50% protected I just run into the shortened wait more
often or wait longer for watermarks to still be an issue (due to 50%
not free-able).
See the patch inlined at the end of the mail for details what/how
it is exactly counted.

As before the scenario is iozone on 16 disks in parallel with 1 iozone
child per disk.
I ran:
- write, write, write, read -> bad case
- drop cache, read -> good case
Read throughput still drops by ~60% comparing good to bad case.
Here are the numbers I got for those two cases by my counters and
meminfo:

Value                           Initial state          Write 1            Write 2             Write 3     Read after writes (bad)      Read after DC (good)	
watermark_wait_duration (ns)                0    9,902,333,643     12,288,444,574      24,197,098,221             317,175,021,553            35,002,926,894
watermark_wait                              0            24102              26708               35285                       29720                     15515
pages_direct_reclaim                        0            59195              65010               86777                       90883                     66672
failed_pages_direct_reclaim                 0            24144              26768               35343                       29733                     15525
failed_pages_direct_reclaim_but_progress    0            24144              26768               35343                       29733                     15525

MemTotal:                              248912           248912             248912              248912                      248912                    248912
MemFree:                               185732             4868               5028                3780                        3064                      7136
Buffers:                                  536            33588              65660               84296                       81868                     32072
Cached:                                  9480           145252             111672               93736                       98424                    149724
Active:                                 11052            43920              76032               89084                       87780                     38024
Inactive:                                6860           142628             108980               96528                      100280                    151572
Active(anon):                            5092             4452               4428                4364                        4516                      4492
Inactive(anon):                          6480             6608               6604                6604                        6604                      6604
Active(file):                            5960            39468              71604               84720                       83264                     33532
Inactive(file):                           380           136020             102376               89924                       93676                    144968
Unevictable:                             3952             3952               3952                3952                        3952                      3952
							
Real Time passed in seconds                              48.83             49.38                50.35                       40.62                      22.61	
AVG wait time waitduration/#                           410,851           460,104              685,762                  10,672,107                  2,256,070	=> x5 longer waits in avg
                                                                                                                                                      -52.20%	bad case runs about twice as often into waits

These numbers seem to point toward my assumption, that the 50% preserved
cause the system to be unable to find memory fast enough.
Happening twice as often to run into the wait after a direct_reclaim
that made progress, but not finding a free page.
And then in average waiting about 5 times longer to get things freed up
enough reaching the watermark and get woken up.


####

Eventually I'd also really like to completely understand why the active
file pages grow when I execute the same iozone write load three times.
They effectively write the same files in the same directories without 
being a journaling file system (The effect can be seen in the table
above as well).

If one of these write runs would use more than ~30M active file pages
they would be allocated and afterwards protected, but they aren't.
Then after the second run I see ~60M active file pages.
As mentioned before I would assume that it either just reuses what is
in memory from the first run, or if it really uses new stuff then the
time has come to throw the old away.

Therefore I would assume that it should never get much more after the
first run as long as they are essentially doing the same.
Does someone already know or has a good assumption what might be
growing in these buffers?
Is there a good interface to check what is buffered and protected atm?

> Caching has absolutely nothing to do with the regression
> you ran into.

As mentioned above not by means of "having it in the cache for another
fast access" yes.
But maybe by "not getting memory for reads into page cache fast enough".

-- 

GrA 1/4 sse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


#### patch for the counters shown in table above ######
Subject: [PATCH][DEBUGONLY] mm: track allocation waits

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

This patch adds some debug counters to track how often a system runs into
waits after direct reclaim (happens in case of did_some_progress & !page)
and how much time it spends there waiting.

#for debugging only#

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
---

[diffstat]
 include/linux/sysctl.h |    1
 kernel/sysctl.c        |   57 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c        |   17 ++++++++++++++
 3 files changed, 75 insertions(+)

[diff]
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/include/linux/sysctl.h linux-2.6.32.11-0.3.99.6.626e022/include/linux/sysctl.h
--- linux-2.6.32.11-0.3.99.6.626e022.orig/include/linux/sysctl.h	2010-04-27 12:01:54.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/include/linux/sysctl.h	2010-04-27 12:03:56.000000000 +0200
@@ -68,6 +68,7 @@
 	CTL_BUS=8,		/* Busses */
 	CTL_ABI=9,		/* Binary emulation */
 	CTL_CPU=10,		/* CPU stuff (speed scaling, etc) */
+	CTL_PERF=11,		/* Performance counters and timer sums for debugging */
 	CTL_XEN=123,		/* Xen info and control */
 	CTL_ARLAN=254,		/* arlan wireless driver */
 	CTL_S390DBF=5677,	/* s390 debug */
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/kernel/sysctl.c linux-2.6.32.11-0.3.99.6.626e022/kernel/sysctl.c
--- linux-2.6.32.11-0.3.99.6.626e022.orig/kernel/sysctl.c	2010-04-27 14:26:04.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/kernel/sysctl.c	2010-04-27 15:44:54.000000000 +0200
@@ -183,6 +183,7 @@
 	.default_set.list = LIST_HEAD_INIT(root_table_header.ctl_entry),
 };
 
+static struct ctl_table perf_table[];
 static struct ctl_table kern_table[];
 static struct ctl_table vm_table[];
 static struct ctl_table fs_table[];
@@ -236,6 +237,13 @@
 		.mode		= 0555,
 		.child		= dev_table,
 	},
+	{
+		.ctl_name	= CTL_PERF,
+		.procname	= "perf",
+		.mode		= 0555,
+		.child		= perf_table,
+	},
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
@@ -254,6 +262,55 @@
 static int max_sched_shares_ratelimit = NSEC_PER_SEC; /* 1 second */
 #endif
 
+extern unsigned long perf_count_watermark_wait;
+extern unsigned long perf_count_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim_but_progress;
+extern unsigned long perf_count_watermark_wait_duration;
+static struct ctl_table perf_table[] = {
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_watermark_wait_duration",
+		.data           = &perf_count_watermark_wait_duration,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_watermark_wait",
+		.data           = &perf_count_watermark_wait,
+		.mode           = 0666,
+		.maxlen		= sizeof(unsigned long),
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_pages_direct_reclaim",
+		.data           = &perf_count_pages_direct_reclaim,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_failed_pages_direct_reclaim",
+		.data           = &perf_count_failed_pages_direct_reclaim,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname       = "perf_count_failed_pages_direct_reclaim_but_progress",
+		.data           = &perf_count_failed_pages_direct_reclaim_but_progress,
+		.maxlen		= sizeof(unsigned long),
+		.mode           = 0666,
+		.proc_handler   = &proc_doulongvec_minmax,
+	},
+	{ .ctl_name = 0 }
+};
+
 static struct ctl_table kern_table[] = {
 	{
 		.ctl_name	= CTL_UNNUMBERED,
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/mm/page_alloc.c linux-2.6.32.11-0.3.99.6.626e022/mm/page_alloc.c
--- linux-2.6.32.11-0.3.99.6.626e022.orig/mm/page_alloc.c	2010-04-27 12:01:55.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/mm/page_alloc.c	2010-04-27 14:06:40.000000000 +0200
@@ -191,6 +191,7 @@
 		wake_up_interruptible(&watermark_wq);
 }
 
+unsigned long perf_count_watermark_wait = 0;
 /**
  * watermark_wait - Wait for watermark to go above low
  * @timeout: Wait until watermark is reached or this timeout is reached
@@ -202,6 +203,7 @@
 	long ret;
 	DEFINE_WAIT(wait);
 
+	perf_count_watermark_wait++;
 	prepare_to_wait(&watermark_wq, &wait, TASK_INTERRUPTIBLE);
 
 	/*
@@ -1725,6 +1727,10 @@
 	return page;
 }
 
+unsigned long perf_count_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim_but_progress = 0;
+
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1761,6 +1767,13 @@
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	perf_count_pages_direct_reclaim++;
+	if (!page)
+		perf_count_failed_pages_direct_reclaim++;
+	if (!page && *did_some_progress)
+		perf_count_failed_pages_direct_reclaim_but_progress++;
+
 	return page;
 }
 
@@ -1841,6 +1854,7 @@
 	return alloc_flags;
 }
 
+unsigned long perf_count_watermark_wait_duration = 0;
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1961,8 +1975,11 @@
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+		unsigned long t1;
 		/* Too much pressure, back off a bit at let reclaimers do work */
+		t1 = get_clock();
 		watermark_wait(HZ/50);
+		perf_count_watermark_wait_duration += ((get_clock() - t1) * 125) >> 9;
 		goto rebalance;
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

end of thread, other threads:[~2010-04-27 14:01 UTC | newest]

Thread overview: 136+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-08 11:48 [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure Mel Gorman
2010-03-08 11:48 ` Mel Gorman
2010-03-08 11:48 ` [PATCH 1/3] page-allocator: Under memory pressure, wait on pressure to relieve instead of congestion Mel Gorman
2010-03-08 11:48   ` Mel Gorman
2010-03-09 13:35   ` Nick Piggin
2010-03-09 13:35     ` Nick Piggin
2010-03-09 14:17     ` Mel Gorman
2010-03-09 14:17       ` Mel Gorman
2010-03-09 15:03       ` Nick Piggin
2010-03-09 15:03         ` Nick Piggin
2010-03-09 15:42         ` Christian Ehrhardt
2010-03-09 15:42           ` Christian Ehrhardt
2010-03-09 18:22           ` Mel Gorman
2010-03-09 18:22             ` Mel Gorman
2010-03-10  2:38             ` Nick Piggin
2010-03-10  2:38               ` Nick Piggin
2010-03-09 17:35         ` Mel Gorman
2010-03-09 17:35           ` Mel Gorman
2010-03-10  2:35           ` Nick Piggin
2010-03-10  2:35             ` Nick Piggin
2010-03-09 15:50   ` Christoph Lameter
2010-03-09 15:50     ` Christoph Lameter
2010-03-09 15:56     ` Christian Ehrhardt
2010-03-09 15:56       ` Christian Ehrhardt
2010-03-09 16:09       ` Christoph Lameter
2010-03-09 16:09         ` Christoph Lameter
2010-03-09 17:01         ` Mel Gorman
2010-03-09 17:01           ` Mel Gorman
2010-03-09 17:11           ` Christoph Lameter
2010-03-09 17:11             ` Christoph Lameter
2010-03-09 17:30             ` Mel Gorman
2010-03-09 17:30               ` Mel Gorman
2010-03-08 11:48 ` [PATCH 2/3] page-allocator: Check zone pressure when batch of pages are freed Mel Gorman
2010-03-08 11:48   ` Mel Gorman
2010-03-09  9:53   ` Nick Piggin
2010-03-09  9:53     ` Nick Piggin
2010-03-09 10:08     ` Mel Gorman
2010-03-09 10:08       ` Mel Gorman
2010-03-09 10:23       ` Nick Piggin
2010-03-09 10:23         ` Nick Piggin
2010-03-09 10:36         ` Mel Gorman
2010-03-09 10:36           ` Mel Gorman
2010-03-09 11:11           ` Nick Piggin
2010-03-09 11:11             ` Nick Piggin
2010-03-09 11:29             ` Mel Gorman
2010-03-09 11:29               ` Mel Gorman
2010-03-08 11:48 ` [PATCH 3/3] vmscan: Put kswapd to sleep on its own waitqueue, not congestion Mel Gorman
2010-03-08 11:48   ` Mel Gorman
2010-03-09 10:00   ` Nick Piggin
2010-03-09 10:00     ` Nick Piggin
2010-03-09 10:21     ` Mel Gorman
2010-03-09 10:21       ` Mel Gorman
2010-03-09 10:32       ` Nick Piggin
2010-03-09 10:32         ` Nick Piggin
2010-03-11 23:41 ` [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure Andrew Morton
2010-03-11 23:41   ` Andrew Morton
2010-03-12  6:39   ` Christian Ehrhardt
2010-03-12  6:39     ` Christian Ehrhardt
2010-03-12  7:05     ` Andrew Morton
2010-03-12  7:05       ` Andrew Morton
2010-03-12 10:47       ` Mel Gorman
2010-03-12 10:47         ` Mel Gorman
2010-03-12 12:15         ` Christian Ehrhardt
2010-03-12 12:15           ` Christian Ehrhardt
2010-03-12 14:37           ` Andrew Morton
2010-03-12 14:37             ` Andrew Morton
2010-03-15 12:29             ` Mel Gorman
2010-03-15 12:29               ` Mel Gorman
2010-03-15 14:45               ` Christian Ehrhardt
2010-03-15 14:45                 ` Christian Ehrhardt
2010-03-15 12:34             ` Christian Ehrhardt
2010-03-15 12:34               ` Christian Ehrhardt
2010-03-15 20:09               ` Andrew Morton
2010-03-15 20:09                 ` Andrew Morton
2010-03-16 10:11                 ` Mel Gorman
2010-03-16 10:11                   ` Mel Gorman
2010-03-18 17:42                 ` Mel Gorman
2010-03-18 17:42                   ` Mel Gorman
2010-03-22 23:50                 ` Mel Gorman
2010-03-22 23:50                   ` Mel Gorman
2010-03-23 14:35                   ` Christian Ehrhardt
2010-03-23 14:35                     ` Christian Ehrhardt
2010-03-23 21:35                   ` Corrado Zoccolo
2010-03-23 21:35                     ` Corrado Zoccolo
2010-03-24 11:48                     ` Mel Gorman
2010-03-24 11:48                       ` Mel Gorman
2010-03-24 12:56                       ` Corrado Zoccolo
2010-03-24 12:56                         ` Corrado Zoccolo
2010-03-23 22:29                   ` Rik van Riel
2010-03-23 22:29                     ` Rik van Riel
2010-03-24 14:50                     ` Mel Gorman
2010-03-24 14:50                       ` Mel Gorman
2010-04-19 12:22                       ` Christian Ehrhardt
2010-04-19 12:22                         ` Christian Ehrhardt
2010-04-19 21:44                         ` Johannes Weiner
2010-04-19 21:44                           ` Johannes Weiner
2010-04-20  7:20                           ` Christian Ehrhardt
2010-04-20  7:20                             ` Christian Ehrhardt
2010-04-20  8:54                             ` Christian Ehrhardt
2010-04-20  8:54                               ` Christian Ehrhardt
2010-04-20 15:32                             ` Johannes Weiner
2010-04-20 15:32                               ` Johannes Weiner
2010-04-20 17:22                               ` Rik van Riel
2010-04-20 17:22                                 ` Rik van Riel
2010-04-21  4:23                                 ` Christian Ehrhardt
2010-04-21  4:23                                   ` Christian Ehrhardt
2010-04-21  7:35                                   ` Christian Ehrhardt
2010-04-21  7:35                                     ` Christian Ehrhardt
2010-04-21 13:19                                     ` Rik van Riel
2010-04-21 13:19                                       ` Rik van Riel
2010-04-22  6:21                                       ` Christian Ehrhardt
2010-04-22  6:21                                         ` Christian Ehrhardt
2010-04-26 10:59                                         ` Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2 Christian Ehrhardt
2010-04-26 10:59                                           ` Christian Ehrhardt
2010-04-26 11:59                                           ` KOSAKI Motohiro
2010-04-26 11:59                                             ` KOSAKI Motohiro
2010-04-26 12:43                                             ` Christian Ehrhardt
2010-04-26 12:43                                               ` Christian Ehrhardt
2010-04-26 14:20                                               ` Rik van Riel
2010-04-26 14:20                                                 ` Rik van Riel
2010-04-27 14:00                                                 ` Christian Ehrhardt
2010-04-27 14:00                                                   ` Christian Ehrhardt
2010-04-21  9:03                                   ` [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure Johannes Weiner
2010-04-21  9:03                                     ` Johannes Weiner
2010-04-21 13:20                                   ` Rik van Riel
2010-04-21 13:20                                     ` Rik van Riel
2010-04-20 14:40                           ` Rik van Riel
2010-04-20 14:40                             ` Rik van Riel
2010-03-24  2:38                   ` Greg KH
2010-03-24  2:38                     ` Greg KH
2010-03-24 11:49                     ` Mel Gorman
2010-03-24 11:49                       ` Mel Gorman
2010-03-24 13:13                   ` Johannes Weiner
2010-03-24 13:13                     ` Johannes Weiner
2010-03-12  9:09   ` Mel Gorman
2010-03-12  9:09     ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.