* [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0
@ 2013-06-05 15:10 Andrea Arcangeli
  2013-06-05 15:10 ` [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
                   ` (8 more replies)
  0 siblings, 9 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-05 15:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

Hello everyone,

I got a bugreport showing some problem with NUMA affinity with CPU
node bindings when THP is enabled and /proc/sys/vm/zone_reclaim_mode
is > 0.

When THP is disabled, zone_reclaim_mode set to 1 (or higher) tends to
allocate memory in the local node quite accurately in the presence of
CPU node bindings (and weak or no memory bindings). With THP enabled,
however, memory tends to be spread to other nodes erroneously.

I also found zone_reclaim_mode is quite unreliable in the presence of
multiple threads allocating memory at the same time from different
CPUs in the same node, even when THP is disabled and there's plenty of
clean cache to trivially reclaim.

The major problem with THP enabled is that zone_reclaim doesn't even
try to use compaction. The series also suggests further changes to
make the whole compaction process more reliable than it is now.

After setting zone_reclaim_mode to 1 and booting with
numa_zonelist_order=n, with this patchset applied I get this NUMA placement:

  PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
 7088 breakthp              0      2.1M [   2.1M     0  ]
 7089 breakthp              1      2.1M [   2.1M     0  ]
 7090 breakthp              2      2.1M [   2.1M     0  ]
 7091 breakthp              3      2.1M [   2.1M     0  ]
 7092 breakthp              6      2.1M [     0    2.1M ]
 7093 breakthp              7      2.1M [     0    2.1M ]
 7094 breakthp              8      2.1M [     0    2.1M ]
 7095 breakthp              9      2.1M [     0    2.1M ]
 7097 breakthp              0      2.1M [   2.1M     0  ]
 7098 breakthp              1      2.1M [   2.1M     0  ]
 7099 breakthp              2      2.1M [   2.1M     0  ]
 7100 breakthp              3      2.1M [   2.1M     0  ]
 7101 breakthp              6      2.1M [     0    2.1M ]
 7102 breakthp              7      2.1M [     0    2.1M ]
 7103 breakthp              8      2.1M [     0    2.1M ]
 7104 breakthp              9      2.1M [     0    2.1M ]
  PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
 7106 usemem                0     1.00G [  1.00G     0  ]
 7107 usemem                1     1.00G [  1.00G     0  ]
 7108 usemem                2     1.00G [  1.00G     0  ]
 7109 usemem                3     1.00G [  1.00G     0  ]
 7110 usemem                6     1.00G [     0   1.00G ]
 7111 usemem                7     1.00G [     0   1.00G ]
 7112 usemem                8     1.00G [     0   1.00G ]
 7113 usemem                9     1.00G [     0   1.00G ]

With current upstream (without the patchset), still with
zone_reclaim_mode = 1 and booting with numa_zonelist_order=n:

  PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
 2950 breakthp              0      2.1M [   2.1M     0  ]
 2951 breakthp              1      2.1M [   2.1M     0  ]
 2952 breakthp              2      2.1M [   2.1M     0  ]
 2953 breakthp              3      2.1M [   2.1M     0  ]
 2954 breakthp              6      2.1M [     0    2.1M ]
 2955 breakthp              7      2.1M [     0    2.1M ]
 2956 breakthp              8      2.1M [     0    2.1M ]
 2957 breakthp              9      2.1M [     0    2.1M ]
 2966 breakthp              0      2.1M [   2.0M    96K ]
 2967 breakthp              1      2.1M [   2.0M    96K ]
 2968 breakthp              2      1.9M [   1.9M    96K ]
 2969 breakthp              3      2.1M [   2.0M    96K ]
 2970 breakthp              6      2.1M [   228K   1.8M ]
 2971 breakthp              7      2.1M [    72K   2.0M ]
 2972 breakthp              8      2.1M [    60K   2.0M ]
 2973 breakthp              9      2.1M [   204K   1.9M ]
  PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
 3088 usemem                0     1.00G [ 856.2M 168.0M ]
 3089 usemem                1     1.00G [ 860.2M 164.0M ]
 3090 usemem                2     1.00G [ 860.2M 164.0M ]
 3091 usemem                3     1.00G [ 858.2M 166.0M ]
 3092 usemem                6     1.00G [ 248.0M 776.2M ]
 3093 usemem                7     1.00G [ 248.0M 776.2M ]
 3094 usemem                8     1.00G [ 250.0M 774.2M ]
 3095 usemem                9     1.00G [ 246.0M 778.2M ]

Allocation speed seems a bit faster with the patchset applied, likely
because the increased NUMA locality, even during a simple
initialization, more than offsets the compaction cost.

The testcase always uses CPU bindings (half of the processes bound to
one node and half to the other). It first fragments all memory
(breakthp) by breaking lots of hugepages with mremap, and then another
process (usemem) allocates lots of memory, in turn exercising the
reliability of compaction with zone_reclaim_mode > 0.

Very few hugepages are available when usemem starts, but compaction
can trivially generate as many hugepages as needed without any risk of
failure.

The memory layout when usemem starts is like this:

4k page anon
4k page free
another 512-2 4k pages free
4k page anon
4k page free
another 512-2 4k pages free
[..]
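
For reference, below is a minimal userspace sketch of the
fragmentation step. It is my own reconstruction, not the original
breakthp testcase (which isn't included in this posting), and it uses
munmap rather than mremap, but the effect on the layout should be
similar: fault in THP-backed 2MB regions and drop all but the first 4k
page of each, so every hugepage gets split and 512-1 subpages are
freed. The real testcase additionally pins each process to a CPU (half
per node), which the sketch omits.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)	/* assumes 2MB THP */
#define PAGE_SIZE_4K	4096UL
#define NR_REGIONS	512UL		/* ~1G per process, arbitrary */

int main(void)
{
	unsigned long i;

	for (i = 0; i < NR_REGIONS; i++) {
		/* over-allocate so a 2MB aligned window is available */
		char *raw = mmap(NULL, 2 * HPAGE_SIZE,
				 PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char *p;

		if (raw == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		p = (char *)(((unsigned long)raw + HPAGE_SIZE - 1) &
			     ~(HPAGE_SIZE - 1));
		/* fault the aligned 2MB in; with THP enabled this is
		 * expected to become one hugepage */
		memset(p, 0, HPAGE_SIZE);
		/* unmap everything but the first 4k page: the hugepage
		 * is split and 512-1 subpages go back to the buddy */
		if (munmap(p + PAGE_SIZE_4K, HPAGE_SIZE - PAGE_SIZE_4K)) {
			perror("munmap");
			return 1;
		}
	}
	pause();	/* keep the leftover 4k anon pages mapped */
	return 0;
}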

If automatic NUMA balancing is enabled, this isn't as critical an
issue as without it (the placement will be fixed later at runtime by
THP NUMA migration faults), but it still looks worth optimizing the
initial placement, both to avoid those migrations and to help short
lived computations (where automatic NUMA balancing can't help),
especially if the process has already been pinned to the CPUs of a
node as in the bugreport I got.

The main changes of behavior are the removal of
compact_blockskip_flush (with __reset_isolation_suitable now executed
immediately when a compaction pass completes) and the slightly
increased amount of hugepages required to meet the low/min watermarks.
The rest of the changes mostly apply to zone_reclaim_mode > 0 and
don't affect the default value of 0 (though some large systems may
boot with zone_reclaim_mode set to 1 by default if the node distance
is very high).

Andrea Arcangeli (7):
  mm: remove ZONE_RECLAIM_LOCKED
  mm: compaction: scan all memory with /proc/sys/vm/compact_memory
  mm: compaction: don't depend on kswapd to invoke
    reset_isolation_suitable
  mm: compaction: reset before initializing the scan cursors
  mm: compaction: increase the high order pages in the watermarks
  mm: compaction: export compact_zone_order()
  mm: compaction: add compaction to zone_reclaim_mode

 include/linux/compaction.h |  10 +++--
 include/linux/mmzone.h     |   9 ----
 mm/compaction.c            |  40 +++++++++---------
 mm/internal.h              |   1 -
 mm/page_alloc.c            | 103 +++++++++++++++++++++++++++++++++------------
 mm/vmscan.c                |  29 -------------
 6 files changed, 105 insertions(+), 87 deletions(-)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
@ 2013-06-05 15:10 ` Andrea Arcangeli
  2013-06-05 19:23   ` Rik van Riel
                     ` (3 more replies)
  2013-06-05 15:10 ` [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
                   ` (7 subsequent siblings)
  8 siblings, 4 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-05 15:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
thread allocates memory at the same time, it forces a premature
allocation into remote NUMA nodes even when there's plenty of clean
cache to reclaim in the local nodes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mmzone.h | 6 ------
 mm/vmscan.c            | 4 ----
 2 files changed, 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5c76737..f23b080 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -490,7 +490,6 @@ struct zone {
 } ____cacheline_internodealigned_in_smp;
 
 typedef enum {
-	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 	ZONE_OOM_LOCKED,		/* zone is in OOM killer zonelist */
 	ZONE_CONGESTED,			/* zone has many dirty pages backed by
 					 * a congested BDI
@@ -517,11 +516,6 @@ static inline int zone_is_reclaim_congested(const struct zone *zone)
 	return test_bit(ZONE_CONGESTED, &zone->flags);
 }
 
-static inline int zone_is_reclaim_locked(const struct zone *zone)
-{
-	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
-}
-
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa6a853..cc5bb01 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3424,11 +3424,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
-
 	ret = __zone_reclaim(zone, gfp_mask, order);
-	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
 
 	if (!ret)
 		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory
  2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
  2013-06-05 15:10 ` [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
@ 2013-06-05 15:10 ` Andrea Arcangeli
  2013-06-05 19:34   ` Rik van Riel
                     ` (2 more replies)
  2013-06-05 15:10 ` [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-05 15:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

Reset the stats so /proc/sys/vm/compact_memory will scan all memory.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..cac9594 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1136,12 +1136,14 @@ void compact_pgdat(pg_data_t *pgdat, int order)
 
 static void compact_node(int nid)
 {
+	pg_data_t *pgdat = NODE_DATA(nid);
 	struct compact_control cc = {
 		.order = -1,
 		.sync = true,
 	};
 
-	__compact_pgdat(NODE_DATA(nid), &cc);
+	reset_isolation_suitable(pgdat);
+	__compact_pgdat(pgdat, &cc);
 }
 
 /* Compact all nodes in the system */


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable
  2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
  2013-06-05 15:10 ` [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
  2013-06-05 15:10 ` [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
@ 2013-06-05 15:10 ` Andrea Arcangeli
  2013-06-05 19:49   ` Rik van Riel
                     ` (2 more replies)
  2013-06-05 15:10 ` [PATCH 4/7] mm: compaction: reset before initializing the scan cursors Andrea Arcangeli
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-05 15:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

If kswapd never needs to run (only __GFP_NO_KSWAPD allocations and
plenty of free memory), compaction is otherwise crippled and stops
running for a while after the free/isolation cursors meet. After that,
allocations can fail for a full cycle of compaction_deferred, until
compaction_restarting finally resets it again.

Stopping compaction for a full cycle after the cursors meet, even if
it never failed and it's not going to fail, doesn't make sense.

We already throttle compaction CPU utilization using
defer_compaction. We shouldn't prevent compaction from running after
each pass completes when the cursors meet, unless it failed.

This makes direct compaction functional again. The throttling of
direct compaction is still controlled by the defer_compaction
logic.

kswapd still won't risk resetting compaction, and will wait for direct
compaction to do so. Not sure if this is ideal but it at least
decreases the risk of kswapd doing too much work. kswapd will only run
one pass of compaction until some allocation invokes compaction again.

This decreased reliability of compaction was introduced in commit
62997027ca5b3d4618198ed8b1aba40b61b1137b.
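
To put "a full cycle" in numbers, here is a small standalone model of
the deferral bookkeeping (my reading of compaction_deferred() and
defer_compaction() in include/linux/compaction.h at the time; treat it
as an approximation, not kernel code): once compact_defer_shift has
reached its maximum, on the order of 64 allocation attempts skip
compaction before it is tried again.

#include <stdio.h>

#define COMPACT_MAX_DEFER_SHIFT 6	/* do not skip more than 64 times */

struct zone_model {
	unsigned int compact_considered;
	unsigned int compact_defer_shift;
};

/* called when a compaction run fails: back off exponentially */
static void defer_compaction(struct zone_model *z)
{
	z->compact_considered = 0;
	if (++z->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
		z->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* called before each attempt: true means "skip compaction this time" */
static int compaction_deferred(struct zone_model *z)
{
	unsigned int limit = 1U << z->compact_defer_shift;

	if (++z->compact_considered > limit)
		z->compact_considered = limit;
	return z->compact_considered < limit;
}

int main(void)
{
	struct zone_model z = { 0, 0 };
	int skipped = 0;

	/* pretend compaction kept "failing" until the backoff is maxed out */
	while (z.compact_defer_shift < COMPACT_MAX_DEFER_SHIFT)
		defer_compaction(&z);

	/* count how many allocation attempts then skip compaction entirely */
	while (compaction_deferred(&z))
		skipped++;

	printf("attempts that skip compaction: %d\n", skipped);
	return 0;
}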

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/compaction.h |  5 -----
 include/linux/mmzone.h     |  3 ---
 mm/compaction.c            | 15 ++++++---------
 mm/page_alloc.c            |  1 -
 mm/vmscan.c                |  8 --------
 5 files changed, 6 insertions(+), 26 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 091d72e..fc3f266 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -24,7 +24,6 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			bool sync, bool *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
-extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
 /* Do not skip compaction more than 64 times */
@@ -84,10 +83,6 @@ static inline void compact_pgdat(pg_data_t *pgdat, int order)
 {
 }
 
-static inline void reset_isolation_suitable(pg_data_t *pgdat)
-{
-}
-
 static inline unsigned long compaction_suitable(struct zone *zone, int order)
 {
 	return COMPACT_SKIPPED;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f23b080..9e9d285 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -354,9 +354,6 @@ struct zone {
 	spinlock_t		lock;
 	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-	/* Set to true when the PG_migrate_skip bits should be cleared */
-	bool			compact_blockskip_flush;
-
 	/* pfns where compaction scanners should start */
 	unsigned long		compact_cached_free_pfn;
 	unsigned long		compact_cached_migrate_pfn;
diff --git a/mm/compaction.c b/mm/compaction.c
index cac9594..525baaa 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -91,7 +91,6 @@ static void __reset_isolation_suitable(struct zone *zone)
 
 	zone->compact_cached_migrate_pfn = start_pfn;
 	zone->compact_cached_free_pfn = end_pfn;
-	zone->compact_blockskip_flush = false;
 
 	/* Walk the zone and mark every pageblock as suitable for isolation */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -110,7 +109,7 @@ static void __reset_isolation_suitable(struct zone *zone)
 	}
 }
 
-void reset_isolation_suitable(pg_data_t *pgdat)
+static void reset_isolation_suitable(pg_data_t *pgdat)
 {
 	int zoneid;
 
@@ -120,8 +119,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
 			continue;
 
 		/* Only flush if a full compaction finished recently */
-		if (zone->compact_blockskip_flush)
-			__reset_isolation_suitable(zone);
+		__reset_isolation_suitable(zone);
 	}
 }
 
@@ -828,13 +826,12 @@ static int compact_finished(struct zone *zone,
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn) {
 		/*
-		 * Mark that the PG_migrate_skip information should be cleared
-		 * by kswapd when it goes to sleep. kswapd does not set the
-		 * flag itself as the decision to be clear should be directly
-		 * based on an allocation request.
+		 * Clear the PG_migrate_skip information. kswapd does
+		 * not clear it as the decision to be clear should be
+		 * directly based on an allocation request.
 		 */
 		if (!current_is_kswapd())
-			zone->compact_blockskip_flush = true;
+			__reset_isolation_suitable(zone);
 
 		return COMPACT_COMPLETE;
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 378a15b..3931d16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2188,7 +2188,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
 				preferred_zone, migratetype);
 		if (page) {
-			preferred_zone->compact_blockskip_flush = false;
 			preferred_zone->compact_considered = 0;
 			preferred_zone->compact_defer_shift = 0;
 			if (order >= preferred_zone->compact_order_failed)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc5bb01..825c631 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2920,14 +2920,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
-		/*
-		 * Compaction records what page blocks it recently failed to
-		 * isolate pages from and skips them in the future scanning.
-		 * When kswapd is going to sleep, it is reasonable to assume
-		 * that pages and compaction may succeed so reset the cache.
-		 */
-		reset_isolation_suitable(pgdat);
-
 		if (!kthread_should_stop())
 			schedule();
 


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 4/7] mm: compaction: reset before initializing the scan cursors
  2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2013-06-05 15:10 ` [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
@ 2013-06-05 15:10 ` Andrea Arcangeli
  2013-06-05 20:04   ` Rik van Riel
                     ` (2 more replies)
  2013-06-05 15:10 ` [PATCH 5/7] mm: compaction: increase the high order pages in the watermarks Andrea Arcangeli
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-05 15:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

Otherwise the first iteration of compaction after restarting it will
only do a partial scan.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 525baaa..afaf692 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -934,6 +934,17 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	}
 
 	/*
+	 * Clear pageblock skip if there were failures recently and
+	 * compaction is about to be retried after being
+	 * deferred. kswapd does not do this reset and it will wait
+	 * direct compaction to do so either when the cursor meets
+	 * after one compaction pass is complete or if compaction is
+	 * restarted after being deferred for a while.
+	 */
+	if ((compaction_restarting(zone, cc->order)) && !current_is_kswapd())
+		__reset_isolation_suitable(zone);
+
+	/*
 	 * Setup to move all movable pages to the end of the zone. Used cached
 	 * information on where the scanners should start but check that it
 	 * is initialised by ensuring the values are within zone boundaries.
@@ -949,14 +960,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
 	}
 
-	/*
-	 * Clear pageblock skip if there were failures recently and compaction
-	 * is about to be retried after being deferred. kswapd does not do
-	 * this reset as it'll reset the cached information when going to sleep.
-	 */
-	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
-		__reset_isolation_suitable(zone);
-
 	migrate_prep_local();
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 5/7] mm: compaction: increase the high order pages in the watermarks
  2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2013-06-05 15:10 ` [PATCH 4/7] mm: compaction: reset before initializing the scan cursors Andrea Arcangeli
@ 2013-06-05 15:10 ` Andrea Arcangeli
  2013-06-05 20:18   ` Rik van Riel
  2013-06-06  9:19   ` Mel Gorman
  2013-06-05 15:10 ` [PATCH 6/7] mm: compaction: export compact_zone_order() Andrea Arcangeli
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-05 15:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

Require more high order pages to be free in the watermark checks, to
give more margin for concurrent allocations. If only a few high order
pages are required to be free, they can disappear too soon under
concurrent allocations.
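
To illustrate the intended effect, below is a userspace simulation of
the __zone_watermark_ok() loop with invented numbers (assumptions:
pageblock_order = 9, MAX_ORDER = 11, a mark of 4096 pages, and a zone
with plenty of low order pages but only four free pageblocks).
Upstream halves min at every order, so with numbers like these a
single free pageblock is already enough to pass an order-9 check; with
this change min only shrinks near pageblock_order, so considerably
more high order memory must stay free before the check passes.

#include <stdio.h>

#define MAX_ORDER	11
#define PAGEBLOCK_ORDER	9

static int watermark_ok(const long *nr_free, int order, long mark,
			int patched)
{
	long free_pages = 0, min = mark;
	int o;

	for (o = 0; o < MAX_ORDER; o++)
		free_pages += nr_free[o] << o;

	if (free_pages <= min)
		return 0;

	for (o = 0; o < order; o++) {
		/* pages of this order can no longer satisfy the request */
		free_pages -= nr_free[o] << o;

		/* upstream: require fewer higher order pages to be free;
		 * patched: only relax the requirement near pageblock_order */
		if (!patched || o >= PAGEBLOCK_ORDER - 1)
			min >>= 1;

		if (free_pages <= min)
			return 0;
	}
	return 1;
}

int main(void)
{
	/* per-order free counts: lots of low orders, 4 pageblocks free */
	long nr_free[MAX_ORDER] = {
		2048, 2048, 2048, 1024, 512, 256, 128, 64, 0, 4, 0
	};

	printf("order-9, upstream: %d\n", watermark_ok(nr_free, 9, 4096, 0));
	printf("order-9, patched:  %d\n", watermark_ok(nr_free, 9, 4096, 1));
	return 0;
}

With these numbers the upstream check passes and the patched one
fails, which is the extra margin for concurrent allocations the
changelog refers to.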

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3931d16..c13e062 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1646,7 +1646,8 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		free_pages -= z->free_area[o].nr_free << o;
 
 		/* Require fewer higher order pages to be free */
-		min >>= 1;
+		if (o >= pageblock_order-1)
+			min >>= 1;
 
 		if (free_pages <= min)
 			return false;


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 6/7] mm: compaction: export compact_zone_order()
  2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2013-06-05 15:10 ` [PATCH 5/7] mm: compaction: increase the high order pages in the watermarks Andrea Arcangeli
@ 2013-06-05 15:10 ` Andrea Arcangeli
  2013-06-05 20:24   ` Rik van Riel
  2013-06-05 15:10 ` [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-05 15:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

Needed by zone_reclaim_mode compaction-awareness.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/compaction.h | 9 +++++++++
 mm/compaction.c            | 2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index fc3f266..77fdd8a 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -23,6 +23,9 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			bool sync, bool *contended);
+extern unsigned long compact_zone_order(struct zone *zone,
+					int order, gfp_t gfp_mask,
+					bool sync, bool *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
@@ -79,6 +82,12 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_CONTINUE;
 }
 
+static inline unsigned long compact_zone_order(struct zone *zone,
+					       int order, gfp_t gfp_mask,
+					       bool sync, bool *contended)
+{
+}
+
 static inline void compact_pgdat(pg_data_t *pgdat, int order)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index afaf692..a1154c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1008,7 +1008,7 @@ out:
 	return ret;
 }
 
-static unsigned long compact_zone_order(struct zone *zone,
+unsigned long compact_zone_order(struct zone *zone,
 				 int order, gfp_t gfp_mask,
 				 bool sync, bool *contended)
 {


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2013-06-05 15:10 ` [PATCH 6/7] mm: compaction: export compact_zone_order() Andrea Arcangeli
@ 2013-06-05 15:10 ` Andrea Arcangeli
  2013-06-05 22:21   ` Rik van Riel
                     ` (2 more replies)
  2013-06-06  8:53 ` [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Mel Gorman
  2013-06-06 10:09 ` Mel Gorman
  8 siblings, 3 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-05 15:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

This fixes zone_reclaim_mode by using the min watermark, so it won't
fail in the presence of concurrent allocations. This greatly increases
the reliability of zone_reclaim_mode > 0 even with just cache
shrinking and THP disabled.

This also adds compaction to zone_reclaim, so that enabling THP won't
decrease NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.

Some checks for __GFP_WAIT and numa_node_id() are moved from
zone_reclaim() to the caller so they also apply to the compaction
logic.

It is important to boot with numa_zonelist_order=n (n means nodes) to
get more accurate NUMA locality if there are multiple zones per node.
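
As a practical note, something like the tiny helper below can be used
when experimenting with this series (illustrative only, not part of
the series; the sysctl paths are the ones documented in
Documentation/sysctl/vm.txt). It enables zone reclaim and switches the
zonelist ordering at runtime, though booting with
numa_zonelist_order=n as described above remains the recommended
setup.

#include <stdio.h>

static int write_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int bad;

	if (!f) {
		perror(path);
		return -1;
	}
	bad = (fputs(val, f) == EOF);
	bad |= (fclose(f) == EOF);
	if (bad) {
		perror(path);
		return -1;
	}
	return 0;
}

int main(void)
{
	/* reclaim (and, with this series, compact) locally before
	 * falling back to a remote node */
	if (write_sysctl("/proc/sys/vm/zone_reclaim_mode", "1"))
		return 1;
	/* order zonelists by node, the runtime equivalent of booting
	 * with numa_zonelist_order=n */
	if (write_sysctl("/proc/sys/vm/numa_zonelist_order", "node"))
		return 1;
	return 0;
}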

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/internal.h   |  1 -
 mm/page_alloc.c | 99 +++++++++++++++++++++++++++++++++++++++++++--------------
 mm/vmscan.c     | 17 ----------
 3 files changed, 75 insertions(+), 42 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 8562de0..560a1ec 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -339,7 +339,6 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 }
 #endif /* CONFIG_SPARSEMEM */
 
-#define ZONE_RECLAIM_NOSCAN	-2
 #define ZONE_RECLAIM_FULL	-1
 #define ZONE_RECLAIM_SOME	0
 #define ZONE_RECLAIM_SUCCESS	1
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c13e062..3ca905a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1902,7 +1902,9 @@ zonelist_scan:
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
-			int ret;
+			int ret, node_id, c_ret;
+			bool repeated_compaction, need_compaction;
+			bool contended = false;
 
 			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 			if (zone_watermark_ok(zone, order, mark,
@@ -1933,35 +1935,84 @@ zonelist_scan:
 				!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
 
-			ret = zone_reclaim(zone, gfp_mask, order);
-			switch (ret) {
-			case ZONE_RECLAIM_NOSCAN:
-				/* did not scan */
+			if (!(gfp_mask & __GFP_WAIT) ||
+			    (current->flags & PF_MEMALLOC))
 				continue;
-			case ZONE_RECLAIM_FULL:
-				/* scanned but unreclaimable */
+
+			/*
+			 * Only reclaim the local zone or on zones
+			 * that do not have associated
+			 * processors. This will favor the local
+			 * processor over remote processors and spread
+			 * off node memory allocations as wide as
+			 * possible.
+			 */
+			node_id = zone_to_nid(zone);
+			if (node_state(node_id, N_CPU) &&
+			    node_id != numa_node_id())
 				continue;
-			default:
-				/* did we reclaim enough */
+
+			/*
+			 * We're going to do reclaim so allow
+			 * allocations up to the MIN watermark, so less
+			 * concurrent allocation will fail.
+			 */
+			mark = min_wmark_pages(zone);
+
+			/* initialize to avoid warnings */
+			c_ret = COMPACT_SKIPPED;
+			ret = ZONE_RECLAIM_FULL;
+
+			repeated_compaction = false;
+			need_compaction = false;
+			if (!compaction_deferred(preferred_zone, order))
+				need_compaction = order &&
+					(gfp_mask & GFP_KERNEL) == GFP_KERNEL;
+			if (need_compaction) {
+			repeat_compaction:
+				c_ret = compact_zone_order(zone, order,
+							   gfp_mask,
+							   repeated_compaction,
+							   &contended);
+				if (c_ret != COMPACT_SKIPPED &&
+				    zone_watermark_ok(zone, order, mark,
+						      classzone_idx,
+						      alloc_flags)) {
+#ifdef CONFIG_COMPACTION
+					preferred_zone->compact_considered = 0;
+					preferred_zone->compact_defer_shift = 0;
+#endif
+					goto try_this_zone;
+				}
+			}
+			/*
+			 * reclaim if compaction failed because not
+			 * enough memory was available or if
+			 * compaction didn't run (order 0) or didn't
+			 * succeed.
+			 */
+			if (!repeated_compaction || c_ret == COMPACT_SKIPPED) {
+				ret = zone_reclaim(zone, gfp_mask, order);
 				if (zone_watermark_ok(zone, order, mark,
-						classzone_idx, alloc_flags))
+						      classzone_idx,
+						      alloc_flags))
 					goto try_this_zone;
+			}
+			if (need_compaction &&
+			    (!repeated_compaction ||
+			     (c_ret == COMPACT_SKIPPED &&
+			      ret == ZONE_RECLAIM_SUCCESS))) {
+				repeated_compaction = true;
+				cond_resched();
+				goto repeat_compaction;
+			}
+			if (need_compaction)
+				defer_compaction(preferred_zone, order);
 
-				/*
-				 * Failed to reclaim enough to meet watermark.
-				 * Only mark the zone full if checking the min
-				 * watermark or if we failed to reclaim just
-				 * 1<<order pages or else the page allocator
-				 * fastpath will prematurely mark zones full
-				 * when the watermark is between the low and
-				 * min watermarks.
-				 */
-				if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
-				    ret == ZONE_RECLAIM_SOME)
-					goto this_zone_full;
-
+			if (!order)
+				goto this_zone_full;
+			else
 				continue;
-			}
 		}
 
 try_this_zone:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 825c631..6a65107 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3380,7 +3380,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 
 int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 {
-	int node_id;
 	int ret;
 
 	/*
@@ -3400,22 +3399,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	if (zone->all_unreclaimable)
 		return ZONE_RECLAIM_FULL;
 
-	/*
-	 * Do not scan if the allocation should not be delayed.
-	 */
-	if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
-		return ZONE_RECLAIM_NOSCAN;
-
-	/*
-	 * Only run zone reclaim on the local zone or on zones that do not
-	 * have associated processors. This will favor the local processor
-	 * over remote processors and spread off node memory allocations
-	 * as wide as possible.
-	 */
-	node_id = zone_to_nid(zone);
-	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
-		return ZONE_RECLAIM_NOSCAN;
-
 	ret = __zone_reclaim(zone, gfp_mask, order);
 
 	if (!ret)


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-05 15:10 ` [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
@ 2013-06-05 19:23   ` Rik van Riel
  2013-06-05 20:37   ` KOSAKI Motohiro
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2013-06-05 19:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On 06/05/2013 11:10 AM, Andrea Arcangeli wrote:
> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
> thread allocates memory at the same time, it forces a premature
> allocation into remote NUMA nodes even when there's plenty of clean
> cache to reclaim in the local nodes.

Since we can get into concurrent reclaim from the direct reclaim
path anyway, and seem to handle that correctly, removing this
special case from zone reclaim looks fine.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory
  2013-06-05 15:10 ` [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
@ 2013-06-05 19:34   ` Rik van Riel
  2013-06-05 21:39   ` Rafael Aquini
  2013-06-06  9:05   ` Mel Gorman
  2 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2013-06-05 19:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On 06/05/2013 11:10 AM, Andrea Arcangeli wrote:
> Reset the stats so /proc/sys/vm/compact_memory will scan all memory.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable
  2013-06-05 15:10 ` [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
@ 2013-06-05 19:49   ` Rik van Riel
  2013-06-26 20:38     ` Andrea Arcangeli
  2013-06-06  9:11   ` Mel Gorman
  2013-06-06 12:47   ` Rafael Aquini
  2 siblings, 1 reply; 48+ messages in thread
From: Rik van Riel @ 2013-06-05 19:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On 06/05/2013 11:10 AM, Andrea Arcangeli wrote:
> If kswapd never need to run (only __GFP_NO_KSWAPD allocations and
> plenty of free memory) compaction is otherwise crippled down and stops
> running for a while after the free/isolation cursor meets. After that
> allocation can fail for a full cycle of compaction_deferred, until
> compaction_restarting finally reset it again.
>
> Stopping compaction for a full cycle after the cursor meets, even if
> it never failed and it's not going to fail, doesn't make sense.
>
> We already throttle compaction CPU utilization using
> defer_compaction. We shouldn't prevent compaction to run after each
> pass completes when the cursor meets, unless it failed.
>
> This makes direct compaction functional again. The throttling of
> direct compaction is still controlled by the defer_compaction
> logic.
>
> kswapd still won't risk to reset compaction, and it will wait direct
> compaction to do so. Not sure if this is ideal but it at least
> decreases the risk of kswapd doing too much work. kswapd will only run
> one pass of compaction until some allocation invokes compaction again.

Won't kswapd reset compaction even with your patch,
but only when kswapd invokes compaction and the cursors
meet?

In other words, the behaviour should be correct even
for cases where kswapd is the only thing in the system
doing compaction (eg. GFP_ATOMIC higher order allocations),
but your changelog does not describe it completely.

> This decreased reliability of compaction was introduced in commit
> 62997027ca5b3d4618198ed8b1aba40b61b1137b .
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

The code looks good to me.

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/7] mm: compaction: reset before initializing the scan cursors
  2013-06-05 15:10 ` [PATCH 4/7] mm: compaction: reset before initializing the scan cursors Andrea Arcangeli
@ 2013-06-05 20:04   ` Rik van Riel
  2013-06-06  9:14   ` Mel Gorman
  2013-06-06 12:49   ` Rafael Aquini
  2 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2013-06-05 20:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On 06/05/2013 11:10 AM, Andrea Arcangeli wrote:
> Otherwise the first iteration of compaction after restarting it, will
> only do a partial scan.

Changelog could be a little more verbose :)

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

>   mm/compaction.c | 19 +++++++++++--------
>   1 file changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 525baaa..afaf692 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -934,6 +934,17 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>   	}
>
>   	/*
> +	 * Clear pageblock skip if there were failures recently and
> +	 * compaction is about to be retried after being
> +	 * deferred. kswapd does not do this reset and it will wait
> +	 * direct compaction to do so either when the cursor meets
> +	 * after one compaction pass is complete or if compaction is
> +	 * restarted after being deferred for a while.
> +	 */
> +	if ((compaction_restarting(zone, cc->order)) && !current_is_kswapd())
> +		__reset_isolation_suitable(zone);
> +
> +	/*
>   	 * Setup to move all movable pages to the end of the zone. Used cached
>   	 * information on where the scanners should start but check that it
>   	 * is initialised by ensuring the values are within zone boundaries.
> @@ -949,14 +960,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>   		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
>   	}
>
> -	/*
> -	 * Clear pageblock skip if there were failures recently and compaction
> -	 * is about to be retried after being deferred. kswapd does not do
> -	 * this reset as it'll reset the cached information when going to sleep.
> -	 */
> -	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
> -		__reset_isolation_suitable(zone);
> -
>   	migrate_prep_local();
>
>   	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/7] mm: compaction: increase the high order pages in the watermarks
  2013-06-05 15:10 ` [PATCH 5/7] mm: compaction: increase the high order pages in the watermarks Andrea Arcangeli
@ 2013-06-05 20:18   ` Rik van Riel
  2013-06-28 22:14     ` Andrea Arcangeli
  2013-06-06  9:19   ` Mel Gorman
  1 sibling, 1 reply; 48+ messages in thread
From: Rik van Riel @ 2013-06-05 20:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On 06/05/2013 11:10 AM, Andrea Arcangeli wrote:
> Require more high order pages in the watermarks, to give more margin
> for concurrent allocations. If there are too few pages, they can
> disappear too soon.

Not sure what to do with this patch.

Not scaling min for pageblock_order-2 allocations seems like
it could be excessive.

Presumably this scaling was introduced for a good reason.

Why is that reason no longer valid?

Why is it safe to make this change?

Would it be safer to simply scale min less steeply?

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   mm/page_alloc.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3931d16..c13e062 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1646,7 +1646,8 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>   		free_pages -= z->free_area[o].nr_free << o;
>
>   		/* Require fewer higher order pages to be free */
> -		min >>= 1;
> +		if (o >= pageblock_order-1)
> +			min >>= 1;

Why this and not this?

		if (order & 1)
			min >>=1;

Not saying my idea is any better than yours, just saying that
a change like this needs more justification than provided by
your changelog...

>
>   		if (free_pages <= min)
>   			return false;
>


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/7] mm: compaction: export compact_zone_order()
  2013-06-05 15:10 ` [PATCH 6/7] mm: compaction: export compact_zone_order() Andrea Arcangeli
@ 2013-06-05 20:24   ` Rik van Riel
  0 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2013-06-05 20:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On 06/05/2013 11:10 AM, Andrea Arcangeli wrote:
> Needed by zone_reclaim_mode compaction-awareness.

> @@ -79,6 +82,12 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>   	return COMPACT_CONTINUE;
>   }
>
> +static inline unsigned long compact_zone_order(struct zone *zone,
> +					       int order, gfp_t gfp_mask,
> +					       bool sync, bool *contended)
> +{
> +}

An unsigned long function should probably return something.


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-05 15:10 ` [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
  2013-06-05 19:23   ` Rik van Riel
@ 2013-06-05 20:37   ` KOSAKI Motohiro
  2013-06-05 20:51     ` Christoph Lameter
  2013-06-05 21:33   ` Rafael Aquini
  2013-06-06  9:04   ` Mel Gorman
  3 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-06-05 20:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini, kosaki.motohiro, Christoph Lameter

(6/5/13 11:10 AM), Andrea Arcangeli wrote:
> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
> thread allocates memory at the same time, it forces a premature
> allocation into remote NUMA nodes even when there's plenty of clean
> cache to reclaim in the local nodes.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

You should CC Christoph Lameter, who made this lock; I've CCed him.
I couldn't find any problem with this removal, but I also didn't find
a reason why this lock is needed.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-05 20:37   ` KOSAKI Motohiro
@ 2013-06-05 20:51     ` Christoph Lameter
  2013-06-05 21:03       ` KOSAKI Motohiro
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Lameter @ 2013-06-05 20:51 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrea Arcangeli, linux-mm, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini

On Wed, 5 Jun 2013, KOSAKI Motohiro wrote:

> (6/5/13 11:10 AM), Andrea Arcangeli wrote:
> > Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
> > thread allocates memory at the same time, it forces a premature
> > allocation into remote NUMA nodes even when there's plenty of clean
> > cache to reclaim in the local nodes.
> >
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>
> You should Christoph Lameter who make this lock. I've CCed. I couldn't
> find any problem in this removing. But I also didn't find a reason why
> this lock is needed.

Early on there was an issue with multiple zone reclaims from different
processors causing an extreme slowdown, and the system would go OOM.
The flag was used to enforce that only a single zone reclaim pass
occurred at a time on a zone. This minimized contention and avoided
the failure.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-05 20:51     ` Christoph Lameter
@ 2013-06-05 21:03       ` KOSAKI Motohiro
  2013-06-06 14:15         ` Christoph Lameter
  0 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-06-05 21:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Andrea Arcangeli, linux-mm, Mel Gorman,
	Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

(6/5/13 4:51 PM), Christoph Lameter wrote:
> On Wed, 5 Jun 2013, KOSAKI Motohiro wrote:
>
>> (6/5/13 11:10 AM), Andrea Arcangeli wrote:
>>> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
>>> thread allocates memory at the same time, it forces a premature
>>> allocation into remote NUMA nodes even when there's plenty of clean
>>> cache to reclaim in the local nodes.
>>>
>>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>>
>> You should Christoph Lameter who make this lock. I've CCed. I couldn't
>> find any problem in this removing. But I also didn't find a reason why
>> this lock is needed.
>
> There was early on an issue with multiple zone reclaims from
> different processors causing an extreme slowdown and the system would go
> OOM. The flag was used to enforce that only a single zone reclaim pass was
> occurring at one time on a zone. This minimized contention and avoided
> the failure.

OK. I'm now convinced we can remove it because sc->nr_to_reclaim
protects us from this issue.

Thank you, guys.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-05 15:10 ` [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
  2013-06-05 19:23   ` Rik van Riel
  2013-06-05 20:37   ` KOSAKI Motohiro
@ 2013-06-05 21:33   ` Rafael Aquini
  2013-06-06  9:04   ` Mel Gorman
  3 siblings, 0 replies; 48+ messages in thread
From: Rafael Aquini @ 2013-06-05 21:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li

On Wed, Jun 05, 2013 at 05:10:31PM +0200, Andrea Arcangeli wrote:
> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
> thread allocates memory at the same time, it forces a premature
> allocation into remote NUMA nodes even when there's plenty of clean
> cache to reclaim in the local nodes.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---

Acked-by: Rafael Aquini <aquini@redhat.com>


>  include/linux/mmzone.h | 6 ------
>  mm/vmscan.c            | 4 ----
>  2 files changed, 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 5c76737..f23b080 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -490,7 +490,6 @@ struct zone {
>  } ____cacheline_internodealigned_in_smp;
>  
>  typedef enum {
> -	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
>  	ZONE_OOM_LOCKED,		/* zone is in OOM killer zonelist */
>  	ZONE_CONGESTED,			/* zone has many dirty pages backed by
>  					 * a congested BDI
> @@ -517,11 +516,6 @@ static inline int zone_is_reclaim_congested(const struct zone *zone)
>  	return test_bit(ZONE_CONGESTED, &zone->flags);
>  }
>  
> -static inline int zone_is_reclaim_locked(const struct zone *zone)
> -{
> -	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
> -}
> -
>  static inline int zone_is_oom_locked(const struct zone *zone)
>  {
>  	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fa6a853..cc5bb01 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3424,11 +3424,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
>  		return ZONE_RECLAIM_NOSCAN;
>  
> -	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> -		return ZONE_RECLAIM_NOSCAN;
> -
>  	ret = __zone_reclaim(zone, gfp_mask, order);
> -	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>  
>  	if (!ret)
>  		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory
  2013-06-05 15:10 ` [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
  2013-06-05 19:34   ` Rik van Riel
@ 2013-06-05 21:39   ` Rafael Aquini
  2013-06-06  9:05   ` Mel Gorman
  2 siblings, 0 replies; 48+ messages in thread
From: Rafael Aquini @ 2013-06-05 21:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li

On Wed, Jun 05, 2013 at 05:10:32PM +0200, Andrea Arcangeli wrote:
> Reset the stats so /proc/sys/vm/compact_memory will scan all memory.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/compaction.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>

Acked-by: Rafael Aquini <aquini@redhat.com>

 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 05ccb4c..cac9594 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1136,12 +1136,14 @@ void compact_pgdat(pg_data_t *pgdat, int order)
>  
>  static void compact_node(int nid)
>  {
> +	pg_data_t *pgdat = NODE_DATA(nid);
>  	struct compact_control cc = {
>  		.order = -1,
>  		.sync = true,
>  	};
>  
> -	__compact_pgdat(NODE_DATA(nid), &cc);
> +	reset_isolation_suitable(pgdat);
> +	__compact_pgdat(pgdat, &cc);
>  }
>  
>  /* Compact all nodes in the system */


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-06-05 15:10 ` [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
@ 2013-06-05 22:21   ` Rik van Riel
  2013-06-06 10:05   ` Mel Gorman
  2013-07-12 23:57   ` Hush Bensen
  2 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2013-06-05 22:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On 06/05/2013 11:10 AM, Andrea Arcangeli wrote:
> This fixes zone_reclaim_mode by using the min watermark so it won't
> fail in presence of concurrent allocations. This greatly increases the
> reliability of zone_reclaim_mode > 0 also with cache shrinking and THP
> disabled.
>
> This also adds compaction to zone_reclaim so THP enabled won't
> decrease the NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.
>
> Some checks for __GFP_WAIT and numa_node_id() are moved from the
> zone_reclaim() to the caller so they also apply to the compaction
> logic.
>
> It is important to boot with numa_zonelist_order=n (n means nodes) to
> get more accurate NUMA locality if there are multiple zones per node.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   mm/internal.h   |  1 -
>   mm/page_alloc.c | 99 +++++++++++++++++++++++++++++++++++++++++++--------------
>   mm/vmscan.c     | 17 ----------
>   3 files changed, 75 insertions(+), 42 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 8562de0..560a1ec 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -339,7 +339,6 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>   }
>   #endif /* CONFIG_SPARSEMEM */
>
> -#define ZONE_RECLAIM_NOSCAN	-2
>   #define ZONE_RECLAIM_FULL	-1
>   #define ZONE_RECLAIM_SOME	0
>   #define ZONE_RECLAIM_SUCCESS	1
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c13e062..3ca905a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1902,7 +1902,9 @@ zonelist_scan:
>   		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
>   		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>   			unsigned long mark;
> -			int ret;
> +			int ret, node_id, c_ret;
> +			bool repeated_compaction, need_compaction;
> +			bool contended = false;
>
>   			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
>   			if (zone_watermark_ok(zone, order, mark,
> @@ -1933,35 +1935,84 @@ zonelist_scan:
>   				!zlc_zone_worth_trying(zonelist, z, allowednodes))
>   				continue;
>
> -			ret = zone_reclaim(zone, gfp_mask, order);
> -			switch (ret) {
> -			case ZONE_RECLAIM_NOSCAN:
> -				/* did not scan */
> +			if (!(gfp_mask & __GFP_WAIT) ||
> +			    (current->flags & PF_MEMALLOC))
>   				continue;
> -			case ZONE_RECLAIM_FULL:
> -				/* scanned but unreclaimable */
> +
> +			/*
> +			 * Only reclaim the local zone or on zones
> +			 * that do not have associated
> +			 * processors. This will favor the local
> +			 * processor over remote processors and spread
> +			 * off node memory allocations as wide as
> +			 * possible.
> +			 */
> +			node_id = zone_to_nid(zone);
> +			if (node_state(node_id, N_CPU) &&
> +			    node_id != numa_node_id())
>   				continue;
> -			default:
> -				/* did we reclaim enough */
> +
> +			/*
> +			 * We're going to do reclaim so allow
> +			 * allocations up to the MIN watermark, so less
> +			 * concurrent allocation will fail.
> +			 */
> +			mark = min_wmark_pages(zone);
> +
> +			/* initialize to avoid warnings */
> +			c_ret = COMPACT_SKIPPED;
> +			ret = ZONE_RECLAIM_FULL;
> +
> +			repeated_compaction = false;
> +			need_compaction = false;
> +			if (!compaction_deferred(preferred_zone, order))
> +				need_compaction = order &&
> +					(gfp_mask & GFP_KERNEL) == GFP_KERNEL;
> +			if (need_compaction) {
> +			repeat_compaction:

That is indented strangely. Took me a while to find this label.
Could it be moved to the beginning of the line, like all the
other goto labels?

> +				c_ret = compact_zone_order(zone, order,
> +							   gfp_mask,
> +							   repeated_compaction,
> +							   &contended);
> +				if (c_ret != COMPACT_SKIPPED &&
> +				    zone_watermark_ok(zone, order, mark,
> +						      classzone_idx,
> +						      alloc_flags)) {
> +#ifdef CONFIG_COMPACTION
> +					preferred_zone->compact_considered = 0;
> +					preferred_zone->compact_defer_shift = 0;
> +#endif
> +					goto try_this_zone;
> +				}
> +			}
> +			/*
> +			 * reclaim if compaction failed because not
> +			 * enough memory was available or if
> +			 * compaction didn't run (order 0) or didn't
> +			 * succeed.
> +			 */
> +			if (!repeated_compaction || c_ret == COMPACT_SKIPPED) {
> +				ret = zone_reclaim(zone, gfp_mask, order);
>   				if (zone_watermark_ok(zone, order, mark,
> -						classzone_idx, alloc_flags))
> +						      classzone_idx,
> +						      alloc_flags))
>   					goto try_this_zone;
> +			}
> +			if (need_compaction &&
> +			    (!repeated_compaction ||
> +			     (c_ret == COMPACT_SKIPPED &&
> +			      ret == ZONE_RECLAIM_SUCCESS))) {
> +				repeated_compaction = true;
> +				cond_resched();
> +				goto repeat_compaction;
> +			}
> +			if (need_compaction)
> +				defer_compaction(preferred_zone, order);
>
> -				/*
> -				 * Failed to reclaim enough to meet watermark.
> -				 * Only mark the zone full if checking the min
> -				 * watermark or if we failed to reclaim just
> -				 * 1<<order pages or else the page allocator
> -				 * fastpath will prematurely mark zones full
> -				 * when the watermark is between the low and
> -				 * min watermarks.
> -				 */
> -				if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
> -				    ret == ZONE_RECLAIM_SOME)
> -					goto this_zone_full;
> -
> +			if (!order)
> +				goto this_zone_full;
> +			else
>   				continue;
> -			}
>   		}
>
>   try_this_zone:


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0
  2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2013-06-05 15:10 ` [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
@ 2013-06-06  8:53 ` Mel Gorman
  2013-06-06 10:09 ` Mel Gorman
  8 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2013-06-06  8:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 05:10:30PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
> 
> I got a bugreport showing some problem with NUMA affinity with CPU
> node bindings when THP is enabled and /proc/sys/vm/zone_reclaim_mode
> is > 0.
> 

zone_reclaim_mode has generated a number of bugs for me in the past. In
my experience almost all of them were related to excessive stalling. The
most recent example was stalls measured in minutes while memory was free
on other nodes, although that was with an old kernel and would not affect
mainline today.

> When THP is disabled, zone_reclaim_mode set to 1 (or higher) tends to
> allocate memory in the local node with quite some accuracy in presence
> of CPU node bindings (and weak or no memory bindings). However THP
> enabled tends to spread the memory to other nodes erroneously.
> 

This does not surprise me as such. zone reclaim is expected today to
fail quickly. I would expect a process to make more forward progress if
it uses remote memory quickly or allocates base pages locally than if it
stalls trying to allocate a THP locally. There was a time when THP was
allocated aggressively with lumpy reclaim, repeated allocation attempts
etc. and we moved away from that. zone_reclaim is very similar.

> I also found zone_reclaim_mode is quite unreliable in presence of
> multiple threads allocating memory at the same time from different
> CPUs in the same node, even when THP is disabled and there's plenty of
> clean cache to trivially reclaim.
> 

There is a flag that prevents multiple zone_reclaims within a zone and it
had at least two purposes. One was that reclaim in the past was excessive
and multiple processes entering reclaim simultaneously tended to reclaim
way more than required. Second, multiple processes in zone reclaim mode
interfered with each other, reclaiming each other's memory in a stupid
loop. It should be better today for a variety of reasons but these are
the two primary problems to watch out for.
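
For context, the guard that patch 1/7 removes is essentially a per-zone
trylock around __zone_reclaim(). Roughly (recalled from memory, not
quoted from the tree), zone_reclaim() does:

	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
		return ZONE_RECLAIM_NOSCAN;	/* someone already reclaims this zone */

	ret = __zone_reclaim(zone, gfp_mask, order);

	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

so a second allocator hitting the same zone gives up immediately and
falls back to the next zone in the zonelist instead of waiting.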

> The major problem with THP enabled is that zone_reclaim doesn't even
> try to use compaction. Then there are more changes suggested to make
> the whole compaction process more reliable than it is now.
> 

It has less *overhead*. The success rate is actually lower today because it
no longer reclaims the world and heavily disrupts the system to allocate
a THP. It's a long-standing TODO item to implement proper capture logic
in compaction but it has not happened yet. It's a fair bit down the TODO
list unfortunately.

> After setting zone_reclaim_mode to 1 and booting with
> numa_zonelist_order=n, with this patchset applied I get this NUMA placement:
> 
>   PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
>  7088 breakthp              0      2.1M [   2.1M     0  ]
>  7089 breakthp              1      2.1M [   2.1M     0  ]
>  7090 breakthp              2      2.1M [   2.1M     0  ]
>  7091 breakthp              3      2.1M [   2.1M     0  ]
>  7092 breakthp              6      2.1M [     0    2.1M ]
>  7093 breakthp              7      2.1M [     0    2.1M ]
>  7094 breakthp              8      2.1M [     0    2.1M ]
>  7095 breakthp              9      2.1M [     0    2.1M ]
>  7097 breakthp              0      2.1M [   2.1M     0  ]
>  7098 breakthp              1      2.1M [   2.1M     0  ]
>  7099 breakthp              2      2.1M [   2.1M     0  ]
>  7100 breakthp              3      2.1M [   2.1M     0  ]
>  7101 breakthp              6      2.1M [     0    2.1M ]
>  7102 breakthp              7      2.1M [     0    2.1M ]
>  7103 breakthp              8      2.1M [     0    2.1M ]
>  7104 breakthp              9      2.1M [     0    2.1M ]
>   PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
>  7106 usemem                0     1.00G [  1.00G     0  ]
>  7107 usemem                1     1.00G [  1.00G     0  ]
>  7108 usemem                2     1.00G [  1.00G     0  ]
>  7109 usemem                3     1.00G [  1.00G     0  ]
>  7110 usemem                6     1.00G [     0   1.00G ]
>  7111 usemem                7     1.00G [     0   1.00G ]
>  7112 usemem                8     1.00G [     0   1.00G ]
>  7113 usemem                9     1.00G [     0   1.00G ]
> 
> Without current upstream without the patchset and still
> zone_reclaim_mode = 1 and booting with numa_zonelist_order=n:
> 
>   PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
>  2950 breakthp              0      2.1M [   2.1M     0  ]
>  2951 breakthp              1      2.1M [   2.1M     0  ]
>  2952 breakthp              2      2.1M [   2.1M     0  ]
>  2953 breakthp              3      2.1M [   2.1M     0  ]
>  2954 breakthp              6      2.1M [     0    2.1M ]
>  2955 breakthp              7      2.1M [     0    2.1M ]
>  2956 breakthp              8      2.1M [     0    2.1M ]
>  2957 breakthp              9      2.1M [     0    2.1M ]
>  2966 breakthp              0      2.1M [   2.0M    96K ]
>  2967 breakthp              1      2.1M [   2.0M    96K ]
>  2968 breakthp              2      1.9M [   1.9M    96K ]
>  2969 breakthp              3      2.1M [   2.0M    96K ]
>  2970 breakthp              6      2.1M [   228K   1.8M ]
>  2971 breakthp              7      2.1M [    72K   2.0M ]
>  2972 breakthp              8      2.1M [    60K   2.0M ]
>  2973 breakthp              9      2.1M [   204K   1.9M ]
>   PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
>  3088 usemem                0     1.00G [ 856.2M 168.0M ]
>  3089 usemem                1     1.00G [ 860.2M 164.0M ]
>  3090 usemem                2     1.00G [ 860.2M 164.0M ]
>  3091 usemem                3     1.00G [ 858.2M 166.0M ]
>  3092 usemem                6     1.00G [ 248.0M 776.2M ]
>  3093 usemem                7     1.00G [ 248.0M 776.2M ]
>  3094 usemem                8     1.00G [ 250.0M 774.2M ]
>  3095 usemem                9     1.00G [ 246.0M 778.2M ]
> 
> Allocation speed seems a bit faster with the patchset applied likely
> thanks to the increased NUMA locality that even during a simple
> initialization, more than offsets the compaction costs.
> 
> The testcase always uses CPU bindings (half processes in one node, and
> half processes in the other node). It first fragments all memory
> (breakthp) by breaking lots of hugepages with mremap, and then another
> process (usemem) allocates lots of memory, in turn exercising the
> reliability of compaction with zone_reclaim_mode > 0.
> 

Ok.

> Very few hugepages are available when usemem starts, but compaction
> has a trivial time to generate as many hugepages as needed without any
> risk of failure.
> 
> The memory layout when usemem starts is like this:
> 
> 4k page anon
> 4k page free
> another 512-2 4k pages free
> 4k page anon
> 4k page free
> another 512-2 4k pages free
> [..]
> 
> If automatic NUMA balancing is enabled, this isn't as critical an issue
> as without it (the placement will be fixed later at runtime by THP NUMA
> migration faults), but it still looks worth optimizing the initial
> placement to avoid those migrations and for short lived computations
> (where automatic NUMA balancing can't help). Especially if the process
> has already been pinned to the CPUs of a node like in the bugreport I
> got.
> 

I had debated with myself whether zone_reclaim_mode and automatic numa
placement would be mutually exclusive. I felt that it would be sufficient for
the local memory allocation policy to handle initial placement without using
zone_reclaim_mode and have automatic numa placement fix it up later. This
would minimise application startup time and initial poor placement would not
be a problem for long-lived processes. I did not do anything about actually
setting the defaults but I became less concerned with zone_reclaim_mode
as a result. Still, it would not kill to improve zone_reclaim_mode either.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-05 15:10 ` [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
                     ` (2 preceding siblings ...)
  2013-06-05 21:33   ` Rafael Aquini
@ 2013-06-06  9:04   ` Mel Gorman
  2013-06-06 17:37     ` Rik van Riel
  3 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2013-06-06  9:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 05:10:31PM +0200, Andrea Arcangeli wrote:
> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
> thread allocates memory at the same time, it forces a premature
> allocation into remote NUMA nodes even when there's plenty of clean
> cache to reclaim in the local nodes.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Be aware that after this patch is applied it is possible to have a
situation like this:

1. 4 processes running on node 1
2. Each process tries to allocate 30% of memory
3. Each process reads the full buffer in a loop (stupid, just an example)

In this situation the processes will continually interfere with each
other until one of them gets migrated to another zone by the scheduler.
Watch for excessive reclaim, swapping and page writes from reclaim context
as a result of this patch. A less stupid example is four file intensive
workloads running in one node interfering with each other.

Before this patch, one process would make forward progress and the others
would fall back to using remote memory until all 4 processes had all the
memory they need. At that point they are no longer allocating new pages or
in reclaim. Most users will not notice additional remote accesses but I bet
you they will notice swap/reclaim storms when there is plenty of memory
on other nodes.
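
A minimal userspace sketch of one such interfering process (illustrative
only; run several copies pinned to CPUs of the same node, the CPU number
and the 30% sizing are arbitrary):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		cpu_set_t set;
		CPU_ZERO(&set);
		CPU_SET(0, &set);		/* stand-in for "a CPU on node 1" */
		if (sched_setaffinity(0, sizeof(set), &set))
			perror("sched_setaffinity");

		long pages = sysconf(_SC_PHYS_PAGES);
		long psize = sysconf(_SC_PAGESIZE);
		size_t len = (size_t)pages * psize / 10 * 3;	/* ~30% of RAM */

		char *buf = malloc(len);
		if (!buf)
			return 1;
		memset(buf, 1, len);		/* fault everything in */

		for (;;)			/* keep re-reading the buffer */
			for (size_t i = 0; i < len; i += (size_t)psize)
				(void)*(volatile char *)(buf + i);
	}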

Direct reclaim suffers a similar problem but to a much lesser extent.
Users of direct reclaim will fall back to other zones in the zonelist and
kswapd mitigates entry into direct reclaim in a number of cases.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory
  2013-06-05 15:10 ` [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
  2013-06-05 19:34   ` Rik van Riel
  2013-06-05 21:39   ` Rafael Aquini
@ 2013-06-06  9:05   ` Mel Gorman
  2 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2013-06-06  9:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 05:10:32PM +0200, Andrea Arcangeli wrote:
> Reset the stats so /proc/sys/vm/compact_memory will scan all memory.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable
  2013-06-05 15:10 ` [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
  2013-06-05 19:49   ` Rik van Riel
@ 2013-06-06  9:11   ` Mel Gorman
  2013-06-26 20:48     ` Andrea Arcangeli
  2013-06-06 12:47   ` Rafael Aquini
  2 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2013-06-06  9:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 05:10:33PM +0200, Andrea Arcangeli wrote:
> If kswapd never need to run (only __GFP_NO_KSWAPD allocations and
> plenty of free memory) compaction is otherwise crippled down and stops
> running for a while after the free/isolation cursor meets. After that
> allocation can fail for a full cycle of compaction_deferred, until
> compaction_restarting finally reset it again.
> 
> Stopping compaction for a full cycle after the cursor meets, even if
> it never failed and it's not going to fail, doesn't make sense.
> 
> We already throttle compaction CPU utilization using
> defer_compaction. We shouldn't prevent compaction to run after each
> pass completes when the cursor meets, unless it failed.
> 
> This makes direct compaction functional again. The throttling of
> direct compaction is still controlled by the defer_compaction
> logic.
> 
> kswapd still won't risk to reset compaction, and it will wait direct
> compaction to do so. Not sure if this is ideal but it at least
> decreases the risk of kswapd doing too much work. kswapd will only run
> one pass of compaction until some allocation invokes compaction again.
> 
> This decreased reliability of compaction was introduced in commit
> 62997027ca5b3d4618198ed8b1aba40b61b1137b .
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

That was part of a series that addressed a problem where processes
stalled for prolonged periods of time in compaction. I see your point
and I do not have a better suggestion at this time but I'll keep an eye
out for regressions in that area.

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/7] mm: compaction: reset before initializing the scan cursors
  2013-06-05 15:10 ` [PATCH 4/7] mm: compaction: reset before initializing the scan cursors Andrea Arcangeli
  2013-06-05 20:04   ` Rik van Riel
@ 2013-06-06  9:14   ` Mel Gorman
  2013-06-06 12:49   ` Rafael Aquini
  2 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2013-06-06  9:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 05:10:34PM +0200, Andrea Arcangeli wrote:
> Otherwise the first iteration of compaction after restarting it, will
> only do a partial scan.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Whoops!

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/7] mm: compaction: increase the high order pages in the watermarks
  2013-06-05 15:10 ` [PATCH 5/7] mm: compaction: increase the high order pages in the watermarks Andrea Arcangeli
  2013-06-05 20:18   ` Rik van Riel
@ 2013-06-06  9:19   ` Mel Gorman
  1 sibling, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2013-06-06  9:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 05:10:35PM +0200, Andrea Arcangeli wrote:
> Require more high order pages in the watermarks, to give more margin
> for concurrent allocations. If there are too few pages, they can
> disappear too soon.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

This seems to be special-casing THP allocations to allow them to relax the
watermark requirements. FWIW, I have seen cases where hugepage
allocations fail even though there are pages free because memory is low
overall. It was very marginal in terms of overall success rates, but that
was also a long time ago when I was checking. How much of a difference
did you see with this patch?
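
For readers following along, the per-order part of __zone_watermark_ok()
that this patch tweaks looks roughly like this (quoted from memory, not
from the patch):

	for (o = 0; o < order; o++) {
		/* At the next order, this order's pages become unavailable */
		free_pages -= z->free_area[o].nr_free << o;

		/* Require fewer higher order pages to be free */
		min >>= 1;

		if (free_pages <= min)
			return false;
	}

so requiring more high order pages means more order >= 1 blocks must
remain free before the watermark check passes.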

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-06-05 15:10 ` [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
  2013-06-05 22:21   ` Rik van Riel
@ 2013-06-06 10:05   ` Mel Gorman
  2013-07-11 16:02     ` Andrea Arcangeli
  2013-07-12 23:57   ` Hush Bensen
  2 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2013-06-06 10:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 05:10:37PM +0200, Andrea Arcangeli wrote:
> This fixes zone_reclaim_mode by using the min watermark so it won't
> fail in presence of concurrent allocations. This greatly increases the
> reliability of zone_reclaim_mode > 0 also with cache shrinking and THP
> disabled.
> 

Again be mindful that improved reliability of zone_reclaim_mode can come
at the cost of stalling and process interference for workloads where the
processes are not NUMA aware or do not fit in individual nodes.

> This also adds compaction to zone_reclaim so THP enabled won't
> decrease the NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.
> 
> Some checks for __GFP_WAIT and numa_node_id() are moved from the
> zone_reclaim() to the caller so they also apply to the compaction
> logic.
> 
> It is important to boot with numa_zonelist_order=n (n means nodes) to
> get more accurate NUMA locality if there are multiple zones per node.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/internal.h   |  1 -
>  mm/page_alloc.c | 99 +++++++++++++++++++++++++++++++++++++++++++--------------
>  mm/vmscan.c     | 17 ----------
>  3 files changed, 75 insertions(+), 42 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 8562de0..560a1ec 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -339,7 +339,6 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>  }
>  #endif /* CONFIG_SPARSEMEM */
>  
> -#define ZONE_RECLAIM_NOSCAN	-2
>  #define ZONE_RECLAIM_FULL	-1
>  #define ZONE_RECLAIM_SOME	0
>  #define ZONE_RECLAIM_SUCCESS	1
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c13e062..3ca905a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1902,7 +1902,9 @@ zonelist_scan:
>  		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
>  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>  			unsigned long mark;
> -			int ret;
> +			int ret, node_id, c_ret;
> +			bool repeated_compaction, need_compaction;
> +			bool contended = false;
>  
>  			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
>  			if (zone_watermark_ok(zone, order, mark,
> @@ -1933,35 +1935,84 @@ zonelist_scan:
>  				!zlc_zone_worth_trying(zonelist, z, allowednodes))
>  				continue;
>  
> -			ret = zone_reclaim(zone, gfp_mask, order);
> -			switch (ret) {
> -			case ZONE_RECLAIM_NOSCAN:
> -				/* did not scan */
> +			if (!(gfp_mask & __GFP_WAIT) ||
> +			    (current->flags & PF_MEMALLOC))
>  				continue;

Instead of moving the logic from zone_reclaim to here, why was the
compaction logic not moved to zone_reclaim or a separate function? This
patch adds a lot of logic to get_page_from_freelist() which is unused
for most users.

> -			case ZONE_RECLAIM_FULL:
> -				/* scanned but unreclaimable */
> +
> +			/*
> +			 * Only reclaim the local zone or on zones
> +			 * that do not have associated
> +			 * processors. This will favor the local
> +			 * processor over remote processors and spread
> +			 * off node memory allocations as wide as
> +			 * possible.
> +			 */
> +			node_id = zone_to_nid(zone);
> +			if (node_state(node_id, N_CPU) &&
> +			    node_id != numa_node_id())
>  				continue;

And this?

> -			default:
> -				/* did we reclaim enough */
> +
> +			/*
> +			 * We're going to do reclaim so allow
> +			 * allocations up to the MIN watermark, so less
> +			 * concurrent allocation will fail.
> +			 */
> +			mark = min_wmark_pages(zone);
> +

If we arrived here from the page allocator fast path then it also means
that we potentially miss going into the slow path and waking kswapd. If
kswapd is not woken at the low watermark as normal then there will be
stalls due to direct reclaim and the stalls will be abrupt.

> +			/* initialize to avoid warnings */
> +			c_ret = COMPACT_SKIPPED;
> +			ret = ZONE_RECLAIM_FULL;
> +
> +			repeated_compaction = false;
> +			need_compaction = false;
> +			if (!compaction_deferred(preferred_zone, order))
> +				need_compaction = order &&
> +					(gfp_mask & GFP_KERNEL) == GFP_KERNEL;

need_compaction = order will always be true. Because of the bracketing,
the comparison is within the conditional block so the second comparison
is doing nothing. Not sure what is going on there at all.

> +			if (need_compaction) {
> +			repeat_compaction:
> +				c_ret = compact_zone_order(zone, order,
> +							   gfp_mask,
> +							   repeated_compaction,
> +							   &contended);
> +				if (c_ret != COMPACT_SKIPPED &&
> +				    zone_watermark_ok(zone, order, mark,
> +						      classzone_idx,
> +						      alloc_flags)) {
> +#ifdef CONFIG_COMPACTION
> +					preferred_zone->compact_considered = 0;
> +					preferred_zone->compact_defer_shift = 0;
> +#endif
> +					goto try_this_zone;
> +				}
> +			}

It's a question of taste, but overall I think this could have been done in
zone_reclaim, renaming it to zone_reclaim_compact to match the concept
of reclaim/compaction if you like. Split the compaction part out to have
__zone_reclaim and __zone_compact if you like and it'll be a hell of a lot
easier to follow. Right now, it's a bit twisty and while I can follow it,
it's headache inducing.

With that arrangement it will be a lot easier to add a new zone_reclaim
flag if it turns out that zone reclaim compacts too aggressively, leading
to excessive stalls. Right now, I think this loops in compaction until
it gets deferred, because of how need_compaction gets set, and that could
go on for a long time. I'm not sure that's what you intended.
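
An illustrative sketch of that arrangement (hypothetical names, not a
counter-patch; the retry loop and watermark rechecks are omitted):

	static int zone_reclaim_compact(struct zone *zone, gfp_t gfp_mask,
					unsigned int order)
	{
		/* try compaction first for order > 0 requests */
		if (order && !compaction_deferred(zone, order) &&
		    __zone_compact(zone, gfp_mask, order) != COMPACT_SKIPPED)
			return ZONE_RECLAIM_SUCCESS;

		/* compaction skipped, deferred or not attempted: reclaim */
		return __zone_reclaim(zone, gfp_mask, order);
	}

with get_page_from_freelist() only checking the watermark against the
return value.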


> +			/*
> +			 * reclaim if compaction failed because not
> +			 * enough memory was available or if
> +			 * compaction didn't run (order 0) or didn't
> +			 * succeed.
> +			 */
> +			if (!repeated_compaction || c_ret == COMPACT_SKIPPED) {
> +				ret = zone_reclaim(zone, gfp_mask, order);
>  				if (zone_watermark_ok(zone, order, mark,
> -						classzone_idx, alloc_flags))
> +						      classzone_idx,
> +						      alloc_flags))
>  					goto try_this_zone;
> +			}
> +			if (need_compaction &&
> +			    (!repeated_compaction ||
> +			     (c_ret == COMPACT_SKIPPED &&
> +			      ret == ZONE_RECLAIM_SUCCESS))) {
> +				repeated_compaction = true;
> +				cond_resched();
> +				goto repeat_compaction;
> +			}
> +			if (need_compaction)
> +				defer_compaction(preferred_zone, order);
>  
> -				/*
> -				 * Failed to reclaim enough to meet watermark.
> -				 * Only mark the zone full if checking the min
> -				 * watermark or if we failed to reclaim just
> -				 * 1<<order pages or else the page allocator
> -				 * fastpath will prematurely mark zones full
> -				 * when the watermark is between the low and
> -				 * min watermarks.
> -				 */
> -				if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
> -				    ret == ZONE_RECLAIM_SOME)
> -					goto this_zone_full;
> -
> +			if (!order)
> +				goto this_zone_full;
> +			else
>  				continue;
> -			}
>  		}
>  
>  try_this_zone:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 825c631..6a65107 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3380,7 +3380,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  
>  int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  {
> -	int node_id;
>  	int ret;
>  
>  	/*
> @@ -3400,22 +3399,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	if (zone->all_unreclaimable)
>  		return ZONE_RECLAIM_FULL;
>  
> -	/*
> -	 * Do not scan if the allocation should not be delayed.
> -	 */
> -	if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> -		return ZONE_RECLAIM_NOSCAN;
> -
> -	/*
> -	 * Only run zone reclaim on the local zone or on zones that do not
> -	 * have associated processors. This will favor the local processor
> -	 * over remote processors and spread off node memory allocations
> -	 * as wide as possible.
> -	 */
> -	node_id = zone_to_nid(zone);
> -	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> -		return ZONE_RECLAIM_NOSCAN;
> -
>  	ret = __zone_reclaim(zone, gfp_mask, order);
>  
>  	if (!ret)

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0
  2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2013-06-06  8:53 ` [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Mel Gorman
@ 2013-06-06 10:09 ` Mel Gorman
  8 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2013-06-06 10:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 05:10:30PM +0200, Andrea Arcangeli wrote:
> <SNIP>
> The main change of behavior is the removal of compact_blockskip_flush
> and the __reset_isolation_suitable immediately executed when a
> compaction pass completes and the slightly increased amount of
> hugepages required to meet the low/min watermarks. The rest of the
> changes mostly applies to zone_reclaim_mode > 0 and doesn't affect the
> default 0 value (some large system may boot with zone_reclaim_mode set
> to 1 by default though, if the node distance is very high).
> 

I'm fine with patches 2, 3 and 4 which make sense independant of the rest
of the series. I'm less sure of the rest of the series. Can 2, 3 and 4 be
split out and sent separately and then treat 1, 5, 6 and 7 exclusively as
a zone_reclaim set please?

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable
  2013-06-05 15:10 ` [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
  2013-06-05 19:49   ` Rik van Riel
  2013-06-06  9:11   ` Mel Gorman
@ 2013-06-06 12:47   ` Rafael Aquini
  2 siblings, 0 replies; 48+ messages in thread
From: Rafael Aquini @ 2013-06-06 12:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li

On Wed, Jun 05, 2013 at 05:10:33PM +0200, Andrea Arcangeli wrote:
> If kswapd never need to run (only __GFP_NO_KSWAPD allocations and
> plenty of free memory) compaction is otherwise crippled down and stops
> running for a while after the free/isolation cursor meets. After that
> allocation can fail for a full cycle of compaction_deferred, until
> compaction_restarting finally reset it again.
> 
> Stopping compaction for a full cycle after the cursor meets, even if
> it never failed and it's not going to fail, doesn't make sense.
> 
> We already throttle compaction CPU utilization using
> defer_compaction. We shouldn't prevent compaction to run after each
> pass completes when the cursor meets, unless it failed.
> 
> This makes direct compaction functional again. The throttling of
> direct compaction is still controlled by the defer_compaction
> logic.
> 
> kswapd still won't risk to reset compaction, and it will wait direct
> compaction to do so. Not sure if this is ideal but it at least
> decreases the risk of kswapd doing too much work. kswapd will only run
> one pass of compaction until some allocation invokes compaction again.
> 
> This decreased reliability of compaction was introduced in commit
> 62997027ca5b3d4618198ed8b1aba40b61b1137b .
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---

Acked-by: Rafael Aquini <aquini@redhat.com>


>  include/linux/compaction.h |  5 -----
>  include/linux/mmzone.h     |  3 ---
>  mm/compaction.c            | 15 ++++++---------
>  mm/page_alloc.c            |  1 -
>  mm/vmscan.c                |  8 --------
>  5 files changed, 6 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 091d72e..fc3f266 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -24,7 +24,6 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *mask,
>  			bool sync, bool *contended);
>  extern void compact_pgdat(pg_data_t *pgdat, int order);
> -extern void reset_isolation_suitable(pg_data_t *pgdat);
>  extern unsigned long compaction_suitable(struct zone *zone, int order);
>  
>  /* Do not skip compaction more than 64 times */
> @@ -84,10 +83,6 @@ static inline void compact_pgdat(pg_data_t *pgdat, int order)
>  {
>  }
>  
> -static inline void reset_isolation_suitable(pg_data_t *pgdat)
> -{
> -}
> -
>  static inline unsigned long compaction_suitable(struct zone *zone, int order)
>  {
>  	return COMPACT_SKIPPED;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f23b080..9e9d285 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -354,9 +354,6 @@ struct zone {
>  	spinlock_t		lock;
>  	int                     all_unreclaimable; /* All pages pinned */
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> -	/* Set to true when the PG_migrate_skip bits should be cleared */
> -	bool			compact_blockskip_flush;
> -
>  	/* pfns where compaction scanners should start */
>  	unsigned long		compact_cached_free_pfn;
>  	unsigned long		compact_cached_migrate_pfn;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index cac9594..525baaa 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -91,7 +91,6 @@ static void __reset_isolation_suitable(struct zone *zone)
>  
>  	zone->compact_cached_migrate_pfn = start_pfn;
>  	zone->compact_cached_free_pfn = end_pfn;
> -	zone->compact_blockskip_flush = false;
>  
>  	/* Walk the zone and mark every pageblock as suitable for isolation */
>  	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> @@ -110,7 +109,7 @@ static void __reset_isolation_suitable(struct zone *zone)
>  	}
>  }
>  
> -void reset_isolation_suitable(pg_data_t *pgdat)
> +static void reset_isolation_suitable(pg_data_t *pgdat)
>  {
>  	int zoneid;
>  
> @@ -120,8 +119,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>  			continue;
>  
>  		/* Only flush if a full compaction finished recently */
> -		if (zone->compact_blockskip_flush)
> -			__reset_isolation_suitable(zone);
> +		__reset_isolation_suitable(zone);
>  	}
>  }
>  
> @@ -828,13 +826,12 @@ static int compact_finished(struct zone *zone,
>  	/* Compaction run completes if the migrate and free scanner meet */
>  	if (cc->free_pfn <= cc->migrate_pfn) {
>  		/*
> -		 * Mark that the PG_migrate_skip information should be cleared
> -		 * by kswapd when it goes to sleep. kswapd does not set the
> -		 * flag itself as the decision to be clear should be directly
> -		 * based on an allocation request.
> +		 * Clear the PG_migrate_skip information. kswapd does
> +		 * not clear it as the decision to be clear should be
> +		 * directly based on an allocation request.
>  		 */
>  		if (!current_is_kswapd())
> -			zone->compact_blockskip_flush = true;
> +			__reset_isolation_suitable(zone);
>  
>  		return COMPACT_COMPLETE;
>  	}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 378a15b..3931d16 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2188,7 +2188,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  				alloc_flags & ~ALLOC_NO_WATERMARKS,
>  				preferred_zone, migratetype);
>  		if (page) {
> -			preferred_zone->compact_blockskip_flush = false;
>  			preferred_zone->compact_considered = 0;
>  			preferred_zone->compact_defer_shift = 0;
>  			if (order >= preferred_zone->compact_order_failed)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc5bb01..825c631 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2920,14 +2920,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  		 */
>  		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
>  
> -		/*
> -		 * Compaction records what page blocks it recently failed to
> -		 * isolate pages from and skips them in the future scanning.
> -		 * When kswapd is going to sleep, it is reasonable to assume
> -		 * that pages and compaction may succeed so reset the cache.
> -		 */
> -		reset_isolation_suitable(pgdat);
> -
>  		if (!kthread_should_stop())
>  			schedule();
>  
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/7] mm: compaction: reset before initializing the scan cursors
  2013-06-05 15:10 ` [PATCH 4/7] mm: compaction: reset before initializing the scan cursors Andrea Arcangeli
  2013-06-05 20:04   ` Rik van Riel
  2013-06-06  9:14   ` Mel Gorman
@ 2013-06-06 12:49   ` Rafael Aquini
  2 siblings, 0 replies; 48+ messages in thread
From: Rafael Aquini @ 2013-06-06 12:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li

On Wed, Jun 05, 2013 at 05:10:34PM +0200, Andrea Arcangeli wrote:
> Otherwise the first iteration of compaction after restarting it, will
> only do a partial scan.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---

Others have said it already, but it looks like the changelog was stripped.

Acked-by: Rafael Aquini <aquini@redhat.com>


>  mm/compaction.c | 19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 525baaa..afaf692 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -934,6 +934,17 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  	}
>  
>  	/*
> +	 * Clear pageblock skip if there were failures recently and
> +	 * compaction is about to be retried after being
> +	 * deferred. kswapd does not do this reset and it will wait
> +	 * direct compaction to do so either when the cursor meets
> +	 * after one compaction pass is complete or if compaction is
> +	 * restarted after being deferred for a while.
> +	 */
> +	if ((compaction_restarting(zone, cc->order)) && !current_is_kswapd())
> +		__reset_isolation_suitable(zone);
> +
> +	/*
>  	 * Setup to move all movable pages to the end of the zone. Used cached
>  	 * information on where the scanners should start but check that it
>  	 * is initialised by ensuring the values are within zone boundaries.
> @@ -949,14 +960,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
>  	}
>  
> -	/*
> -	 * Clear pageblock skip if there were failures recently and compaction
> -	 * is about to be retried after being deferred. kswapd does not do
> -	 * this reset as it'll reset the cached information when going to sleep.
> -	 */
> -	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
> -		__reset_isolation_suitable(zone);
> -
>  	migrate_prep_local();
>  
>  	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-05 21:03       ` KOSAKI Motohiro
@ 2013-06-06 14:15         ` Christoph Lameter
  2013-06-06 17:17           ` KOSAKI Motohiro
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Lameter @ 2013-06-06 14:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrea Arcangeli, linux-mm, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini

On Wed, 5 Jun 2013, KOSAKI Motohiro wrote:

> > There was early on an issue with multiple zone reclaims from
> > different processors causing an extreme slowdown and the system would go
> > OOM. The flag was used to enforce that only a single zone reclaim pass was
> > occurring at one time on a zone. This minimized contention and avoided
> > the failure.
>
> OK. I'm now convinced we can remove it, because sc->nr_to_reclaim
> protects us from this issue.

How does nr_to_reclaim limit the concurrency of zone reclaim?

What happens if multiple processes are allocating from the same zone and
they all go into direct reclaim and therefore hit zone reclaim?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-06 14:15         ` Christoph Lameter
@ 2013-06-06 17:17           ` KOSAKI Motohiro
  2013-06-06 18:16             ` Christoph Lameter
  0 siblings, 1 reply; 48+ messages in thread
From: KOSAKI Motohiro @ 2013-06-06 17:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, linux-mm, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini

> How does nr_to_reclaim limit the concurrency of zone reclaim?

No, it doesn't prevent concurrent reclaim itself. It only prevents
concurrent reclaims from reclaiming many more pages than
SWAP_CLUSTER_MAX. Note, zone reclaim uses priority 4 by default.

> What happens if multiple processes are allocating from the same zone and
> they all go into direct reclaim and therefore hit zone reclaim?

When zone reclaim was created, 16 (1<<4) concurrent reclaims could drop
all page cache because zone reclaim uses priority 4 by default. However,
now we have reclaim bail-out logic, so priority 4 no longer directly
means each zone reclaim drops 1/16 of the caches.
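
For reference, the scan_control that __zone_reclaim() sets up is roughly
the following (recalled from that era's mm/vmscan.c, field values
approximate):

	struct scan_control sc = {
		.may_writepage	= !!(zone_reclaim_mode & RECLAIM_WRITE),
		.may_unmap	= !!(zone_reclaim_mode & RECLAIM_SWAP),
		.may_swap	= 1,
		.nr_to_reclaim	= SWAP_CLUSTER_MAX,	/* small batch, then bail out */
		.gfp_mask	= gfp_mask,
		.order		= order,
		.priority	= ZONE_RECLAIM_PRIORITY,	/* 4 */
	};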


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-06  9:04   ` Mel Gorman
@ 2013-06-06 17:37     ` Rik van Riel
  2013-06-14 16:16       ` Rik van Riel
  0 siblings, 1 reply; 48+ messages in thread
From: Rik van Riel @ 2013-06-06 17:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-mm, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

On 06/06/2013 05:04 AM, Mel Gorman wrote:
> On Wed, Jun 05, 2013 at 05:10:31PM +0200, Andrea Arcangeli wrote:
>> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
>> thread allocates memory at the same time, it forces a premature
>> allocation into remote NUMA nodes even when there's plenty of clean
>> cache to reclaim in the local nodes.
>>
>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>
> Be aware that after this patch is applied that it is possible to have a
> situation like this
>
> 1. 4 processes running on node 1
> 2. Each process tries to allocate 30% of memory
> 3. Each process reads the full buffer in a loop (stupid, just an example)
>
> In this situation the processes will continually interfere with each
> other until one of them gets migrated to another zone by the scheduler.

This is a very good point.

Andrea, I suspect we will need some kind of safeguard against
this problem.

I can see how ZONE_RECLAIM_LOCKED may not be that safeguard,
but I do not have any good ideas on what it would be.

Should we limit zone reclaim to priority == DEF_PRIORITY, and
fall back if we fail to free enough memory at DEF_PRIORITY?

Does anybody have better ideas?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-06 17:17           ` KOSAKI Motohiro
@ 2013-06-06 18:16             ` Christoph Lameter
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2013-06-06 18:16 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrea Arcangeli, linux-mm, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini

On Thu, 6 Jun 2013, KOSAKI Motohiro wrote:

> When zone reclaim was created, 16 (1<<4) concurrent reclaims could drop
> all page cache because zone reclaim uses priority 4 by default. However,
> now we have reclaim bail-out logic, so priority 4 no longer directly
> means each zone reclaim drops 1/16 of the caches.

Sounds good.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-06 17:37     ` Rik van Riel
@ 2013-06-14 16:16       ` Rik van Riel
  2013-06-17  9:30         ` Mel Gorman
  0 siblings, 1 reply; 48+ messages in thread
From: Rik van Riel @ 2013-06-14 16:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-mm, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

On 06/06/2013 01:37 PM, Rik van Riel wrote:
> On 06/06/2013 05:04 AM, Mel Gorman wrote:
>> On Wed, Jun 05, 2013 at 05:10:31PM +0200, Andrea Arcangeli wrote:
>>> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
>>> thread allocates memory at the same time, it forces a premature
>>> allocation into remote NUMA nodes even when there's plenty of clean
>>> cache to reclaim in the local nodes.
>>>
>>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>>
>> Be aware that after this patch is applied that it is possible to have a
>> situation like this
>>
>> 1. 4 processes running on node 1
>> 2. Each process tries to allocate 30% of memory
>> 3. Each process reads the full buffer in a loop (stupid, just an example)
>>
>> In this situation the processes will continually interfere with each
>> other until one of them gets migrated to another zone by the scheduler.
>
> This is a very good point.
>
> Andrea, I suspect we will need some kind of safeguard against
> this problem.

Never mind me.

In __zone_reclaim we set the flags in scan_control so
we never unmap pages or swap pages out at all by
default, so this should not be an issue at all.

In order to get the problem illustrated above, the
user will have to enable RECLAIM_SWAP through the
zone_reclaim_mode sysctl manually.
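
The relevant mode bits, as defined in mm/vmscan.c around this time
(reproduced from memory):

	#define RECLAIM_OFF	0
	#define RECLAIM_ZONE	(1<<0)	/* Run shrink_inactive_list on the zone */
	#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
	#define RECLAIM_SWAP	(1<<2)	/* Swap pages out during reclaim */

so e.g. writing 5 to /proc/sys/vm/zone_reclaim_mode enables local reclaim
plus swapping, while the setups discussed here stay at 0 or 1.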

The series looks fine as is.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-14 16:16       ` Rik van Riel
@ 2013-06-17  9:30         ` Mel Gorman
  2013-06-17 18:12           ` Rik van Riel
  0 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2013-06-17  9:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-mm, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

On Fri, Jun 14, 2013 at 12:16:47PM -0400, Rik van Riel wrote:
> On 06/06/2013 01:37 PM, Rik van Riel wrote:
> >On 06/06/2013 05:04 AM, Mel Gorman wrote:
> >>On Wed, Jun 05, 2013 at 05:10:31PM +0200, Andrea Arcangeli wrote:
> >>>Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
> >>>thread allocates memory at the same time, it forces a premature
> >>>allocation into remote NUMA nodes even when there's plenty of clean
> >>>cache to reclaim in the local nodes.
> >>>
> >>>Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> >>
> >>Be aware that after this patch is applied that it is possible to have a
> >>situation like this
> >>
> >>1. 4 processes running on node 1
> >>2. Each process tries to allocate 30% of memory
> >>3. Each process reads the full buffer in a loop (stupid, just an example)
> >>
> >>In this situation the processes will continually interfere with each
> >>other until one of them gets migrated to another zone by the scheduler.
> >
> >This is a very good point.
> >
> >Andrea, I suspect we will need some kind of safeguard against
> >this problem.
> 
> Never mind me.
> 
> In __zone_reclaim we set the flags in swap_control so
> we never unmap pages or swap pages out at all by
> default, so this should not be an issue at all.
> 
> In order to get the problem illustrated above, the
> user will have to enable RECLAIM_SWAP through sysfs
> manually.
> 

For the mapped case and the default tuning for zone_reclaim_mode then
yes. If instead of allocating 30% of memory the processes are using
buffered reads/writes then they'll reach each other's page cache pages
and it's a very similar problem.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-17  9:30         ` Mel Gorman
@ 2013-06-17 18:12           ` Rik van Riel
  2013-06-26 20:10             ` Andrea Arcangeli
  0 siblings, 1 reply; 48+ messages in thread
From: Rik van Riel @ 2013-06-17 18:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-mm, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

On 06/17/2013 05:30 AM, Mel Gorman wrote:
> On Fri, Jun 14, 2013 at 12:16:47PM -0400, Rik van Riel wrote:
>> On 06/06/2013 01:37 PM, Rik van Riel wrote:
>>> On 06/06/2013 05:04 AM, Mel Gorman wrote:
>>>> On Wed, Jun 05, 2013 at 05:10:31PM +0200, Andrea Arcangeli wrote:
>>>>> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
>>>>> thread allocates memory at the same time, it forces a premature
>>>>> allocation into remote NUMA nodes even when there's plenty of clean
>>>>> cache to reclaim in the local nodes.
>>>>>
>>>>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>>>>
>>>> Be aware that after this patch is applied that it is possible to have a
>>>> situation like this
>>>>
>>>> 1. 4 processes running on node 1
>>>> 2. Each process tries to allocate 30% of memory
>>>> 3. Each process reads the full buffer in a loop (stupid, just an example)
>>>>
>>>> In this situation the processes will continually interfere with each
>>>> other until one of them gets migrated to another zone by the scheduler.
>>>
>>> This is a very good point.
>>>
>>> Andrea, I suspect we will need some kind of safeguard against
>>> this problem.
>>
>> Never mind me.
>>
>> In __zone_reclaim we set the flags in swap_control so
>> we never unmap pages or swap pages out at all by
>> default, so this should not be an issue at all.
>>
>> In order to get the problem illustrated above, the
>> user will have to enable RECLAIM_SWAP through sysfs
>> manually.
>>
>
> For the mapped case and the default tuning for zone_reclaim_mode then
> yes. If instead of allocating 30% of memory the processes are using
> buffered reads/writes then they'll reach each other's page cache pages
> and it's a very similar problem.

Could we fix that problem by simply allowing page cache
allocations (__GFP_WRITE) to fall back to other zones,
regardless of the zone_reclaim setting?

The ZONE_RECLAIM_LOCKED function seems to break as many
things as it fixes, so replacing it with something else
seems like a worthwhile pursuit...
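
An illustrative sketch of that idea, in the zone_reclaim branch of
get_page_from_freelist() (not an actual patch):

	/*
	 * Let new page cache writes spill to other nodes instead of
	 * zone-reclaiming locally, regardless of zone_reclaim_mode.
	 */
	if (gfp_mask & __GFP_WRITE)
		continue;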


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED
  2013-06-17 18:12           ` Rik van Riel
@ 2013-06-26 20:10             ` Andrea Arcangeli
  0 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-26 20:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, linux-mm, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

Hi!

On Mon, Jun 17, 2013 at 02:12:44PM -0400, Rik van Riel wrote:
> On 06/17/2013 05:30 AM, Mel Gorman wrote:
> > On Fri, Jun 14, 2013 at 12:16:47PM -0400, Rik van Riel wrote:
> >> On 06/06/2013 01:37 PM, Rik van Riel wrote:
> >>> On 06/06/2013 05:04 AM, Mel Gorman wrote:
> >>>> On Wed, Jun 05, 2013 at 05:10:31PM +0200, Andrea Arcangeli wrote:
> >>>>> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
> >>>>> thread allocates memory at the same time, it forces a premature
> >>>>> allocation into remote NUMA nodes even when there's plenty of clean
> >>>>> cache to reclaim in the local nodes.
> >>>>>
> >>>>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> >>>>
> >>>> Be aware that after this patch is applied that it is possible to have a
> >>>> situation like this
> >>>>
> >>>> 1. 4 processes running on node 1
> >>>> 2. Each process tries to allocate 30% of memory
> >>>> 3. Each process reads the full buffer in a loop (stupid, just an example)
> >>>>
> >>>> In this situation the processes will continually interfere with each
> >>>> other until one of them gets migrated to another zone by the scheduler.
> >>>
> >>> This is a very good point.
> >>>
> >>> Andrea, I suspect we will need some kind of safeguard against
> >>> this problem.
> >>
> >> Never mind me.
> >>
> >> In __zone_reclaim we set the flags in swap_control so
> >> we never unmap pages or swap pages out at all by
> >> default, so this should not be an issue at all.
> >>
> >> In order to get the problem illustrated above, the
> >> user will have to enable RECLAIM_SWAP through sysfs
> >> manually.
> >>
> >
> > For the mapped case and the default tuning for zone_reclaim_mode then
> > yes. If instead of allocating 30% of memory the processes are using using
> > buffered reads/writes then they'll reach each others page cache pages and
> > it's a very similar problem.
> 
> Could we fix that problem by simply allowing page cache
> allocations (__GFP_WRITE) to fall back to other zones,
> regardless of the zone_reclaim setting?
> 
> The ZONE_RECLAIM_LOCKED function seems to break as many
> things as it fixes, so replacing it with something else
> seems like a worthwhile pursuit...

I actually don't see a connection between the various scenarios
described above and ZONE_RECLAIM_LOCKED. I mean, whatever problem you
are having with swapping or excessive reclaim in a single zone/node
while the other zones/nodes are completely free could materialize
the same way with the current ZONE_RECLAIM_LOCKED code if you just used
a mutex in userland to serialize the memory allocations. Or if they
just happened to run serially for other reasons.

If repeatedly calling zone_reclaim on any given zone were a problem, the
problem would eventually materialize anyway by just running a single
thread in the whole system pinned to a single node.

ZONE_RECLAIM_LOCKED isn't about swapping or memory pressure, it is
only about preventing more than one zone_reclaim invocation from
running at once in any given zone. But running them in parallel should
be ok: if all the zone_reclaim() instances running in parallel are
doing a .nr_to_reclaim = SWAP_CLUSTER_MAX shrinkage attempt with
may_unmap/may_writepage unset (zone_reclaim_mode <= 1), there should
be no problem. And zone_reclaim won't be called anymore as soon as the
watermark is back above "low".

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable
  2013-06-05 19:49   ` Rik van Riel
@ 2013-06-26 20:38     ` Andrea Arcangeli
  0 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-26 20:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 03:49:04PM -0400, Rik van Riel wrote:
> On 06/05/2013 11:10 AM, Andrea Arcangeli wrote:
> > If kswapd never need to run (only __GFP_NO_KSWAPD allocations and
> > plenty of free memory) compaction is otherwise crippled down and stops
> > running for a while after the free/isolation cursor meets. After that
> > allocation can fail for a full cycle of compaction_deferred, until
> > compaction_restarting finally reset it again.
> >
> > Stopping compaction for a full cycle after the cursor meets, even if
> > it never failed and it's not going to fail, doesn't make sense.
> >
> > We already throttle compaction CPU utilization using
> > defer_compaction. We shouldn't prevent compaction to run after each
> > pass completes when the cursor meets, unless it failed.
> >
> > This makes direct compaction functional again. The throttling of
> > direct compaction is still controlled by the defer_compaction
> > logic.
> >
> > kswapd still won't risk to reset compaction, and it will wait direct
> > compaction to do so. Not sure if this is ideal but it at least
> > decreases the risk of kswapd doing too much work. kswapd will only run
> > one pass of compaction until some allocation invokes compaction again.
> 
> Won't kswapd reset compaction even with your patch,
> but only when kswapd invokes compaction and the cursors
> meet?

kswapd won't ever reset it because of the current_is_kswapd check.

> In other words, the behaviour should be correct even
> for cases where kswapd is the only thing in the system
> doing compaction (eg. GFP_ATOMIC higher order allocations),
> but your changelog does not describe it completely.

In that case, with the previous code, compact_blockskip_flush would
never get set and yet compaction would never be reset.
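
The relevant bit in compact_finished() looks roughly like this
(simplified from memory, not a verbatim quote of the kernel code):

	if (cc->free_pfn <= cc->migrate_pfn) {
		/* the scanners met: restart the next pass from scratch */
		zone->compact_cached_migrate_pfn = zone->zone_start_pfn;
		zone->compact_cached_free_pfn = zone_end_pfn(zone);
		/*
		 * Ask kswapd to flush the pageblock skip hints the next
		 * time it goes to sleep, but only for direct compaction.
		 */
		if (!current_is_kswapd())
			zone->compact_blockskip_flush = true;
		return COMPACT_COMPLETE;
	}

So if kswapd is the only one running compaction, the flush flag never
gets set and the skip hints are never reset.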

> > This decreased reliability of compaction was introduced in commit
> > 62997027ca5b3d4618198ed8b1aba40b61b1137b .
> >
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> The code looks good to me.
> 
> Reviewed-by: Rik van Riel <riel@redhat.com>

Thanks. Not sure if this does exactly what you expected, but it
doesn't change the behavior of the GFP_ATOMIC load compared to the
current upstream compaction code.

The first attempt to add compaction to kswapd a few years back failed
because it slowed down NFS network loads with jumbo frames: kswapd was
doing too much compaction for very short lived allocations (worthless,
totally wasted CPU). So I believe compaction in kswapd worked out this
time only because we don't activate it for the GFP_ATOMIC network
jumbo frame allocations. The above therefore sounds to me more like a
feature than a bug, and that's why I've been careful never to reset
compaction from within kswapd (just like the previous code).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable
  2013-06-06  9:11   ` Mel Gorman
@ 2013-06-26 20:48     ` Andrea Arcangeli
  0 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-26 20:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Thu, Jun 06, 2013 at 10:11:48AM +0100, Mel Gorman wrote:
> That was part of a series that addressed a problem where processes
> stalled for prolonged periods of time in compaction. I see your point

Yes.

> and I do not have a better suggestion at this time but I'll keep an eye
> out for regressions in that area.

That's my exact concern too, and there's not much we can do about
it. But not calling compaction reliably simply guarantees spurious
failures in cases where it would be trivial to allocate a THP and we
just don't even try to compact memory.

Of course we have khugepaged that fixes it up for THP.

But in the NUMA case (without automatic NUMA balancing enabled), the
transparent hugepage could be allocated in the wrong node and it will
stay there forever.

In general it should be better not to require khugepaged or automatic
NUMA balancing to fix up allocator errors after the fact, especially
because neither of them helps with short lived allocations. And the
NUMA effect could be measurable for short lived allocations that end
up in the wrong node (like while building with gcc).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/7] mm: compaction: increase the high order pages in the watermarks
  2013-06-05 20:18   ` Rik van Riel
@ 2013-06-28 22:14     ` Andrea Arcangeli
  0 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-06-28 22:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, Mel Gorman, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Wed, Jun 05, 2013 at 04:18:37PM -0400, Rik van Riel wrote:
> On 06/05/2013 11:10 AM, Andrea Arcangeli wrote:
> > Require more high order pages in the watermarks, to give more margin
> > for concurrent allocations. If there are too few pages, they can
> > disappear too soon.
> 
> Not sure what to do with this patch.
> 
> Not scaling min for pageblock_order-2 allocations seems like
> it could be excessive.
> 
> Presumably this scaling was introduced for a good reason.

I think the reason is that even if we generate plenty of hugepages,
they may be split into lower orders well before further hugepage
allocations are invoked. So we risk doing too much work.

> 
> Why is that reason no longer valid?
> 
> Why is it safe to make this change?
> 
> Would it be safer to simply scale min less steeply?

This simply makes the scaling down for pageblock_order-1 equal to the
scaling for pageblock_order-2, so compaction will generate more
hugepages before stopping.

The larger the allocation order and the more aggressively we scale
"min" down for larger allocations, the fewer high order pages there
will be within the low-min wmarks. This change increases the number
of hugepages within the low-min wmarks.

In my testing, what happened with too few large order pages within the
low-min wmark is that those few pages may be allocated or split into
lower orders well before the CPU doing the compaction has a chance to
allocate them. This patch made a difference in increasing the
reliability of a threaded workload running with CPU node pinning and
zone_reclaim_mode=1, and in turn it may be beneficial in general.

The patch simply reduces the scaling factor, because having a few more
pages of margin between the low-min wmarks is beneficial not just for
4k pages but for any order size in the presence of threads. The
downside is the risk of doing more compaction work even if nobody
then keeps allocating hugepages.

> >   		/* Require fewer higher order pages to be free */
> > -		min >>= 1;
> > +		if (o >= pageblock_order-1)
> > +			min >>= 1;

On my laptop I get min:44764kB low:55952kB high:67144kB

Scaling down like upstream we get:

order9 -> 218kB low wmark

Not even 1 hugepage of buffer between low and min -> very unreliable.

With my patch:

order9 -> 27MB low wmark
order9 -> 22MB min wmark

That's 2 hugepages of margin: at least some headroom for concurrent
allocations from different CPUs.
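
Here is a tiny userspace model of the two scalings (not kernel code;
it ignores the subtraction of lower-order free pages, so the upstream
kB figure can differ by a factor of two from the number quoted above,
but the conclusion is the same):

	#include <stdio.h>

	#define PAGEBLOCK_ORDER 9	/* x86 assumption */

	static unsigned long scale(unsigned long mark, int order, int patched)
	{
		int o;

		for (o = 0; o < order; o++)
			if (!patched || o >= PAGEBLOCK_ORDER - 1)
				mark >>= 1;	/* require fewer high order pages */
		return mark;
	}

	int main(void)
	{
		unsigned long low = 55952, min = 44764;	/* kB, from above */

		printf("upstream order9: low=%lukB min=%lukB\n",
		       scale(low, 9, 0), scale(min, 9, 0));
		printf("patched  order9: low=%lukB min=%lukB\n",
		       scale(low, 9, 1), scale(min, 9, 1));
		return 0;
	}

Upstream the order 9 "low" margin ends up well below a single 2MB
hugepage, while with the patch it stays around 27MB/22MB as above.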

But this whole logic is wrong and needs to be improved further: it
makes no sense to have the min wmark scaled down to only 22MB for
order 9. The min must be scaled down far more aggressively, and
_differently_ from "low", for large order allocations. The "min"
exists only for PF_MEMALLOC and other emergency allocations, and those
must never require order > 0. Hence the min should shift down towards
0, while "low" must not.

What I mean is that right now we generate just 1 hugepage; we should
generate 3 to have more margin for concurrent allocations, but not
22MB worth of hugepages that won't even have a chance to be used for
anything but PF_MEMALLOC paths, which shouldn't depend on high order.

We still need plenty of free memory at the "min" wmark, but having no
hugepages at the min is ok.

> Why this and not this?
> 
> 		if (order & 1)
> 			min >>=1;

I can test it but it would be less aggressive.

> Not saying my idea is any better than yours, just saying that
> a change like this needs more justification than provided by
> your changelog...

I'll improve the changelog and I'll try to improve the logic so that
it stays as reliable as with my patch, but with the min scaled down
more aggressively in hugepage terms.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-06-06 10:05   ` Mel Gorman
@ 2013-07-11 16:02     ` Andrea Arcangeli
  2013-07-12 12:26       ` Hush Bensen
  0 siblings, 1 reply; 48+ messages in thread
From: Andrea Arcangeli @ 2013-07-11 16:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On Thu, Jun 06, 2013 at 11:05:03AM +0100, Mel Gorman wrote:
> Again be mindful that improved reliability of zone_reclaim_mode can come
> at the cost of stalling and process interference for workloads where the
> processes are not NUMA aware or fit in individual nodes.

This won't risk affecting any system where NUMA locality is a
secondary priority, unless zone_reclaim_mode is set to 1 manually, so
it should be ok.

> Instead of moving the logic from zone_reclaim to here, why was the
> compaction logic not moved to zone_reclaim or a separate function? This
> patch adds a lot of logic to get_page_from_freelist() which is unused
> for most users

I can try to move all checks that aren't passing information
across different passes from the caller to the callee.

> > -			default:
> > -				/* did we reclaim enough */
> > +
> > +			/*
> > +			 * We're going to do reclaim so allow
> > +			 * allocations up to the MIN watermark, so less
> > +			 * concurrent allocation will fail.
> > +			 */
> > +			mark = min_wmark_pages(zone);
> > +
> 
> If we arrived here from the page allocator fast path then it also means
> that we potentially miss going into the slow patch and waking kswapd. If
> kswapd is not woken at the low watermark as normal then there will be
> stalls due to direct reclaim and the stalls will be abrupt.

What is going on here without my changes is something like this:

1)   hit low wmark
2)   zone_reclaim()
3)   check if we're above low wmark
4)   if yes alloc memory
5)   if no try the next zone in the zonelist
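
In code form, that flow is roughly (paraphrasing the upstream
get_page_from_freelist() loop, not a verbatim quote):

	mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];	/* "low" */
	if (!zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags)) {
		ret = zone_reclaim(zone, gfp_mask, order);		/* step 2 */
		if (!zone_watermark_ok(zone, order, mark,		/* step 3 */
				       classzone_idx, alloc_flags))
			continue;	/* step 5: try the next zone */
	}
	goto try_this_zone;		/* step 4: allocate locally */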

The current code already calls zone_reclaim fully synchronously from
the direct reclaim path, and kswapd never gets invoked anyway.

This isn't VM reclaim: we're not globally low on memory, and we can't
wake kswapd until we have expired all memory from all zones, or we
risk screwing up the LRU rotation even further (by having kswapd and
the thread allocating memory racing in a single zone, not even a
single node, which would at least be better).

But the current code totally lacks any kind of hysteresis. If
zone_reclaim provides feedback and confirms it made progress, but
another CPU steals the page before the current CPU has a chance to get
it, we're going to fall back to the wrong zone (point 5 above).
Allowing allocations to go deeper into the "min" wmark won't change
anything in the kswapd wake cycle, and we'll still invoke zone_reclaim
synchronously until the "low" wmark is restored, so over time we
should still linger above the low wmark. So we should still wake
kswapd well before all zones are at the "min" (the moment all zones
are below "low" we'll wake kswapd).

So it shouldn't regress in terms of "stalls": zone_reclaim is already
fully synchronous. It should only greatly reduce the NUMA allocation
false negatives. The need for hysteresis that the min wmark check
fixes is exactly related to the generation of more than 1 hugepage
between the low-min wmarks in the previous patches that altered the
wmark calculation for high order allocations.
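
Concretely, the min wmark check boils down to this (simplified from
the 7/7 hunk quoted later in this thread):

	mark = min_wmark_pages(zone);	/* allow dipping below "low" */
	ret = zone_reclaim(zone, gfp_mask, order);
	if (zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags))
		goto try_this_zone;	/* keep the allocation on the local node */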

BTW, I improved that calculation further in the meanwhile to avoid
generating any hugepage below the min wmark (min becomes 0 in the
order > 0 checks, for any order > 0). You'll see it in the next
submit.

Ideally, in fact, zone_reclaim should get a per-node zonelist, not
just a zone, to be fairer in the LRU rotations. All this code is
pretty bad. I'm not trying to fix it all at once (sticking to passing
a zone instead of a node to zone_reclaim even if it's wrong in LRU
terms). In order to pass a node to zone_reclaim I should completely
drop the numa_zonelist_order=z mode (which happens to be the default
on my hardware according to the flaky heuristic in
default_zonelist_order(), which should also be dropped entirely).

Johannes' round-robin allocator, which makes the LRU rotations fairer
in a multi-LRU VM, will completely invalidate any benefit provided by
the (default on my NUMA hardware) numa_zonelist_order=z model. So
that's one more reason to nuke that whole zonelist order badness.

I introduced the lowmem reserve ratio a long time ago; that is the
thing that is supposed to avoid OOM conditions with a shortage of
lowmem zones. We don't need to special-case the zones that aren't
usable by all allocations anymore: the lowmem reserve provides a
significant margin, and the round-robin allocator will depend entirely
on the lowmem reserve alone to be safe. In fact we should also add
some lowmem reserve calculation to the compaction free memory checks
to be more accurate (low priority).

> > +			/* initialize to avoid warnings */
> > +			c_ret = COMPACT_SKIPPED;
> > +			ret = ZONE_RECLAIM_FULL;
> > +
> > +			repeated_compaction = false;
> > +			need_compaction = false;
> > +			if (!compaction_deferred(preferred_zone, order))
> > +				need_compaction = order &&
> > +					(gfp_mask & GFP_KERNEL) == GFP_KERNEL;
> 
> need_compaction = order will always be true. Because of the bracketing,
> the comparison is within the conditional block so the second comparison
> is doing nothing. Not sure what is going on there at all.

If order is zero, need_compaction will be false. If order is not zero,
need_compaction will be true only if it's a GFP_KERNEL
allocation. Maybe I'm missing something, but I don't see how
need_compaction can be true if order is 0 or if gfp_mask is GFP_ATOMIC.
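
A standalone check of the parsing (illustrative flag values, not the
kernel's definitions):

	#include <stdbool.h>
	#include <stdio.h>

	#define __GFP_WAIT	0x10u
	#define __GFP_HIGH	0x20u
	#define __GFP_IO	0x40u
	#define __GFP_FS	0x80u
	#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
	#define GFP_ATOMIC	(__GFP_HIGH)

	/* == binds tighter than &&, so this is
	 * order && ((gfp_mask & GFP_KERNEL) == GFP_KERNEL) */
	static bool need_compaction(unsigned int order, unsigned int gfp_mask)
	{
		return order && (gfp_mask & GFP_KERNEL) == GFP_KERNEL;
	}

	int main(void)
	{
		printf("%d %d %d\n",
		       need_compaction(0, GFP_KERNEL),	/* 0: order 0 */
		       need_compaction(9, GFP_KERNEL),	/* 1 */
		       need_compaction(9, GFP_ATOMIC));	/* 0: not GFP_KERNEL */
		return 0;
	}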

> > +			if (need_compaction) {
> > +			repeat_compaction:
> > +				c_ret = compact_zone_order(zone, order,
> > +							   gfp_mask,
> > +							   repeated_compaction,
> > +							   &contended);
> > +				if (c_ret != COMPACT_SKIPPED &&
> > +				    zone_watermark_ok(zone, order, mark,
> > +						      classzone_idx,
> > +						      alloc_flags)) {
> > +#ifdef CONFIG_COMPACTION
> > +					preferred_zone->compact_considered = 0;
> > +					preferred_zone->compact_defer_shift = 0;
> > +#endif
> > +					goto try_this_zone;
> > +				}
> > +			}
> 
> It's a question of taste, but overall I think this could have been done in
> zone_reclaim and rename it to zone_reclaim_compact to match the concept
> of reclaim/compaction if you like. Split the compaction part out to have
> __zone_reclaim and __zone_compact if you like and it'll be hell of a lot
> easier to follow. Right now, it's a bit twisty and while I can follow it,
> it's headache inducing.

I'll try to move it into a callee function and clean it up.

> 
> With that arrangement it will be a lot easier to add a new zone_reclaim
> flag if it turns out that zone reclaim compacts too aggressively leading
> to excessive stalls. Right now, I think this loops in compaction until
> it gets deferred because of how need_compaction gets set which could be
> for a long time. I'm not sure that's what you intended.

I intended to keep shrinking cache (zone_reclaim won't unmap if
zone_reclaim_mode == 1 etc.) only until compaction has enough free
memory to do its work; it shouldn't require that much memory. Once
compaction has enough memory (it no longer returns COMPACT_SKIPPED) we
stop calling zone_reclaim to shrink caches and we just try compaction
once.

This is the last patch I need to update before resending, I hope to
clean it up for good.

Thanks!
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-07-11 16:02     ` Andrea Arcangeli
@ 2013-07-12 12:26       ` Hush Bensen
  2013-07-12 16:01         ` Andrea Arcangeli
  0 siblings, 1 reply; 48+ messages in thread
From: Hush Bensen @ 2013-07-12 12:26 UTC (permalink / raw)
  To: Andrea Arcangeli, Mel Gorman
  Cc: linux-mm, Rik van Riel, Hugh Dickins, Richard Davies, Shaohua Li,
	Rafael Aquini

On 2013/7/11 10:02, Andrea Arcangeli wrote:
> On Thu, Jun 06, 2013 at 11:05:03AM +0100, Mel Gorman wrote:
>> Again be mindful that improved reliability of zone_reclaim_mode can come
>> at the cost of stalling and process interference for workloads where the
>> processes are not NUMA aware or fit in individual nodes.
> This won't risk to affect any system where NUMA locality is a
> secondary priority, unless zone_reclaim_mode is set to 1 manually, so
> it shall be ok.
>
>> Instead of moving the logic from zone_reclaim to here, why was the
>> compaction logic not moved to zone_reclaim or a separate function? This
>> patch adds a lot of logic to get_page_from_freelist() which is unused
>> for most users
> I can try to move all checks that aren't passing information
> across different passes from the caller to the callee.
>
>>> -			default:
>>> -				/* did we reclaim enough */
>>> +
>>> +			/*
>>> +			 * We're going to do reclaim so allow
>>> +			 * allocations up to the MIN watermark, so less
>>> +			 * concurrent allocation will fail.
>>> +			 */
>>> +			mark = min_wmark_pages(zone);
>>> +
>> If we arrived here from the page allocator fast path then it also means
>> that we potentially miss going into the slow patch and waking kswapd. If
>> kswapd is not woken at the low watermark as normal then there will be
>> stalls due to direct reclaim and the stalls will be abrupt.
> What is going on here without my changes is something like this:
>
> 1)   hit low wmark
> 2)   zone_reclaim()
> 3)   check if we're above low wmark
> 4)   if yes alloc memory
> 5)   if no try the next zone in the zonelist
>
> The current code is already fully synchronous in direct reclaim in the
> way it calls zone_reclaim and kswapd never gets invoked anyway.
>
> This isn't VM reclaim, we're not globally low on memory and we can't
> wake kswapd until we expired all memory from all zones or we risk to
> screw the lru rotation even further (by having kswapd and the thread

What's the meaning of lru rotation?

> allocating memory racing in a single zone, not even a single node
> which would at least be better).
>
> But the current code totally lacks any kind of hysteresis. If
> zone_reclaim provides feedback and confirms it did progress, and
> another CPU steals the page before the current CPU has a chance to get
> it, we're going to fall in the wrong zone (point 5 above). Allowing to
> go deeper in to the "min" wmark won't change anything in kswapd wake
> cycle, and we'll still invoke zone_reclaim synchronously forever until
> the "low wmark" is restored so over time we should still linger above
> the low wmark over time. So we should still wake kswapd well before
> all zones are at the "min" (the moment all zones are below "low" we'll
> wake kswapd).
>
> So it shouldn't regress in terms of "stalls", zone_reclaim is already
> fully synchronous. It should only reduce a lot the numa allocation
> false negatives. The need of hysteresis that the min wmark check
> fixes, is exactly related the generation of more than 1 hugepage
> between low-min wmarks in previous patches that altered the wmark
> calculation for high order allocations.
>
> BTW, I improved that calculation further in the meanwhile to avoid
> generating any hugepage below the min wmark (min becomes in the order
>> 0 checks, for any order > 0). You'll see it in next submit.
> Ideally in fact zone_reclaim should get a per-node zonelist, not just
> a zone, to be more fair in the lru rotations. All this code is pretty
> bad. Not trying to fix it all at once (sticking to passing a zone
> instead of a node to zone_reclaim even if it's wrong in lru terms). In
> order to pass a node to zone_reclaim I should completely drop the
> numa_zonelist_order=z mode (which happens to be the default on my
> hardware according to the flakey heuristic in default_zonelist_order()
> which should be also entirely dropped).
>
> Johannes roundrobin allocator that makes the lru rotations more fair
> in a multi-LRU VM, will completely invalidate any benefit provided by
> the (default on my NUMA hardware) numa_zonelist_order=z model. So
> that's one more reason to nuke that whole zonelist order =n badness.
>
> I introduced long time ago a lowmem reserve ratio, that is the thing
> that is supposed to avoid the OOM conditions with shortage of lowmem
> zones. We don't need to prioritize anymore on the zones that are
> aren't usable by all allocations. lowmem reserve provides a
> significant margin. And the roundrobin allocator will entirely depend
> on the lowmem reserve alone to be safe. In fact we should also add
> some lowmem reserve calculation to compaction free memory checks to be
> more accurate. (low priority)
>
>>> +			/* initialize to avoid warnings */
>>> +			c_ret = COMPACT_SKIPPED;
>>> +			ret = ZONE_RECLAIM_FULL;
>>> +
>>> +			repeated_compaction = false;
>>> +			need_compaction = false;
>>> +			if (!compaction_deferred(preferred_zone, order))
>>> +				need_compaction = order &&
>>> +					(gfp_mask & GFP_KERNEL) == GFP_KERNEL;
>> need_compaction = order will always be true. Because of the bracketing,
>> the comparison is within the conditional block so the second comparison
>> is doing nothing. Not sure what is going on there at all.
> if order is zero, need_compaction will be false. If order is not zero,
> need_compaction will be true only if it's a GFP_KERNEL
> allocation. Maybe I'm missing something, I don't see how
> need_compaction is true if order is 0 or if gfp_mask is GFP_ATOMIC.
>
>>> +			if (need_compaction) {
>>> +			repeat_compaction:
>>> +				c_ret = compact_zone_order(zone, order,
>>> +							   gfp_mask,
>>> +							   repeated_compaction,
>>> +							   &contended);
>>> +				if (c_ret != COMPACT_SKIPPED &&
>>> +				    zone_watermark_ok(zone, order, mark,
>>> +						      classzone_idx,
>>> +						      alloc_flags)) {
>>> +#ifdef CONFIG_COMPACTION
>>> +					preferred_zone->compact_considered = 0;
>>> +					preferred_zone->compact_defer_shift = 0;
>>> +#endif
>>> +					goto try_this_zone;
>>> +				}
>>> +			}
>> It's a question of taste, but overall I think this could have been done in
>> zone_reclaim and rename it to zone_reclaim_compact to match the concept
>> of reclaim/compaction if you like. Split the compaction part out to have
>> __zone_reclaim and __zone_compact if you like and it'll be hell of a lot
>> easier to follow. Right now, it's a bit twisty and while I can follow it,
>> it's headache inducing.
> I'll try to move it in a callee function and clean it up.
>
>> With that arrangement it will be a lot easier to add a new zone_reclaim
>> flag if it turns out that zone reclaim compacts too aggressively leading
>> to excessive stalls. Right now, I think this loops in compaction until
>> it gets deferred because of how need_compaction gets set which could be
>> for a long time. I'm not sure that's what you intended.
> I intended to shrink until we successfully shrink cache (zone_reclaim
> won't unmap if zone_reclaim_mode == 1 etc...), until compaction has
> enough free memory to do its work. It shouldn't require that much
> memory. After compaction has enough memory (not return COMPACT_SKIPPED
> anymore) then we stop calling zone_reclaim to shrink caches and we
> just try compaction once.
>
> This is the last patch I need to update before resending, I hope to
> clean it up for good.
>
> Thanks!
> Andrea
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-07-12 12:26       ` Hush Bensen
@ 2013-07-12 16:01         ` Andrea Arcangeli
  2013-07-12 23:23           ` Hush Bensen
  0 siblings, 1 reply; 48+ messages in thread
From: Andrea Arcangeli @ 2013-07-12 16:01 UTC (permalink / raw)
  To: Hush Bensen
  Cc: Mel Gorman, linux-mm, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

Hi,

On Fri, Jul 12, 2013 at 06:26:37AM -0600, Hush Bensen wrote:
> > This isn't VM reclaim, we're not globally low on memory and we can't
> > wake kswapd until we expired all memory from all zones or we risk to
> > screw the lru rotation even further (by having kswapd and the thread
> 
> What's the meaning of lru rotation?

I mean the per-zone LRU walks that shrink memory (they rotate pages
through the LRU). To provide better global working set information in
the LRUs, we should walk all the zone LRUs in a fair way.

zone_reclaim_mode however makes it unfair by always shrinking from
the first NUMA-local zone even if the other zones could be shrunk
too.

When zone_reclaim_mode is disabled instead (the default for most
hardware out there), we wait for all candidate zones to be at the low
wmark before starting to shrink any zone (and then we shrink all
zones, not just one). So when zone_reclaim_mode is disabled, we don't
insist on aging a single zone indefinitely while leaving the others
un-aged.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-07-12 16:01         ` Andrea Arcangeli
@ 2013-07-12 23:23           ` Hush Bensen
  2013-07-15  9:16             ` Andrea Arcangeli
  0 siblings, 1 reply; 48+ messages in thread
From: Hush Bensen @ 2013-07-12 23:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, linux-mm, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

Hi Andrea,
On 2013/7/13 0:01, Andrea Arcangeli wrote:
> Hi,
>
> On Fri, Jul 12, 2013 at 06:26:37AM -0600, Hush Bensen wrote:
>>> This isn't VM reclaim, we're not globally low on memory and we can't
>>> wake kswapd until we expired all memory from all zones or we risk to
>>> screw the lru rotation even further (by having kswapd and the thread
>> What's the meaning of lru rotation?
> I mean the per-zone LRU walks to shrink the memory (they rotate pages
> through the LRU). To provide for better global working set information
> in the LRUs, we should walk all the zone LRUs in a fair
> way.
>
> zone_reclaim_mode however makes it non fair by always shrinking from
> the first NUMA local zone even if the other zones could be shrunk
> too.
>
> When zone_reclaim_mode is disabled instead (default for most hardware
> out there), we wait all candidate zones to be at the low wmark before
> starting the shrinking from any zone (and then we shrink all zones,
> not just one). So when zone_reclaim_mode is disabled, we don't insist
> aging a single zone indefinitely, while leaving the others un-aged.

Do you mean your patch makes this fair? There is targeted zone
shrinking in the vanilla kernel as you mentioned; however, your patch
also does targeted compaction/reclaim, so is this fair?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-06-05 15:10 ` [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
  2013-06-05 22:21   ` Rik van Riel
  2013-06-06 10:05   ` Mel Gorman
@ 2013-07-12 23:57   ` Hush Bensen
  2013-07-15  9:25     ` Andrea Arcangeli
  2 siblings, 1 reply; 48+ messages in thread
From: Hush Bensen @ 2013-07-12 23:57 UTC (permalink / raw)
  To: Andrea Arcangeli, linux-mm
  Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

Hi Andrea,
On 2013/6/5 10:10, Andrea Arcangeli wrote:
> This fixes zone_reclaim_mode by using the min watermark so it won't
> fail in presence of concurrent allocations. This greatly increases the
> reliability of zone_reclaim_mode > 0 also with cache shrinking and THP
> disabled.
>
> This also adds compaction to zone_reclaim so THP enabled won't
> decrease the NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.
>
> Some checks for __GFP_WAIT and numa_node_id() are moved from the
> zone_reclaim() to the caller so they also apply to the compaction
> logic.
>
> It is important to boot with numa_zonelist_order=n (n means nodes) to
> get more accurate NUMA locality if there are multiple zones per node.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/internal.h   |  1 -
>  mm/page_alloc.c | 99 +++++++++++++++++++++++++++++++++++++++++++--------------
>  mm/vmscan.c     | 17 ----------
>  3 files changed, 75 insertions(+), 42 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 8562de0..560a1ec 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -339,7 +339,6 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>  }
>  #endif /* CONFIG_SPARSEMEM */
>  
> -#define ZONE_RECLAIM_NOSCAN	-2
>  #define ZONE_RECLAIM_FULL	-1
>  #define ZONE_RECLAIM_SOME	0
>  #define ZONE_RECLAIM_SUCCESS	1
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c13e062..3ca905a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1902,7 +1902,9 @@ zonelist_scan:
>  		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
>  		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>  			unsigned long mark;
> -			int ret;
> +			int ret, node_id, c_ret;
> +			bool repeated_compaction, need_compaction;
> +			bool contended = false;
>  
>  			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
>  			if (zone_watermark_ok(zone, order, mark,
> @@ -1933,35 +1935,84 @@ zonelist_scan:
>  				!zlc_zone_worth_trying(zonelist, z, allowednodes))
>  				continue;
>  
> -			ret = zone_reclaim(zone, gfp_mask, order);
> -			switch (ret) {
> -			case ZONE_RECLAIM_NOSCAN:
> -				/* did not scan */
> +			if (!(gfp_mask & __GFP_WAIT) ||
> +			    (current->flags & PF_MEMALLOC))
>  				continue;
> -			case ZONE_RECLAIM_FULL:
> -				/* scanned but unreclaimable */
> +
> +			/*
> +			 * Only reclaim the local zone or on zones
> +			 * that do not have associated
> +			 * processors. This will favor the local
> +			 * processor over remote processors and spread
> +			 * off node memory allocations as wide as
> +			 * possible.
> +			 */
> +			node_id = zone_to_nid(zone);
> +			if (node_state(node_id, N_CPU) &&
> +			    node_id != numa_node_id())
>  				continue;
> -			default:
> -				/* did we reclaim enough */
> +
> +			/*
> +			 * We're going to do reclaim so allow
> +			 * allocations up to the MIN watermark, so less
> +			 * concurrent allocation will fail.
> +			 */
> +			mark = min_wmark_pages(zone);

Targeted reclaim is done once under the low wmark in the vanilla
kernel; however, your patch changes it to the min wmark. Why this
behavior change?

> +
> +			/* initialize to avoid warnings */
> +			c_ret = COMPACT_SKIPPED;
> +			ret = ZONE_RECLAIM_FULL;
> +
> +			repeated_compaction = false;
> +			need_compaction = false;
> +			if (!compaction_deferred(preferred_zone, order))
> +				need_compaction = order &&
> +					(gfp_mask & GFP_KERNEL) == GFP_KERNEL;
> +			if (need_compaction) {
> +			repeat_compaction:
> +				c_ret = compact_zone_order(zone, order,
> +							   gfp_mask,
> +							   repeated_compaction,
> +							   &contended);
> +				if (c_ret != COMPACT_SKIPPED &&
> +				    zone_watermark_ok(zone, order, mark,
> +						      classzone_idx,
> +						      alloc_flags)) {
> +#ifdef CONFIG_COMPACTION
> +					preferred_zone->compact_considered = 0;
> +					preferred_zone->compact_defer_shift = 0;
> +#endif
> +					goto try_this_zone;
> +				}
> +			}
> +			/*
> +			 * reclaim if compaction failed because not
> +			 * enough memory was available or if
> +			 * compaction didn't run (order 0) or didn't
> +			 * succeed.
> +			 */
> +			if (!repeated_compaction || c_ret == COMPACT_SKIPPED) {
> +				ret = zone_reclaim(zone, gfp_mask, order);
>  				if (zone_watermark_ok(zone, order, mark,
> -						classzone_idx, alloc_flags))
> +						      classzone_idx,
> +						      alloc_flags))
>  					goto try_this_zone;
> +			}
> +			if (need_compaction &&
> +			    (!repeated_compaction ||
> +			     (c_ret == COMPACT_SKIPPED &&
> +			      ret == ZONE_RECLAIM_SUCCESS))) {
> +				repeated_compaction = true;
> +				cond_resched();
> +				goto repeat_compaction;
> +			}
> +			if (need_compaction)
> +				defer_compaction(preferred_zone, order);
>  
> -				/*
> -				 * Failed to reclaim enough to meet watermark.
> -				 * Only mark the zone full if checking the min
> -				 * watermark or if we failed to reclaim just
> -				 * 1<<order pages or else the page allocator
> -				 * fastpath will prematurely mark zones full
> -				 * when the watermark is between the low and
> -				 * min watermarks.
> -				 */
> -				if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
> -				    ret == ZONE_RECLAIM_SOME)
> -					goto this_zone_full;
> -
> +			if (!order)
> +				goto this_zone_full;
> +			else

You are doing work here that should be done in the slow path; is it
worth it?

>  				continue;
> -			}
>  		}
>  
>  try_this_zone:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 825c631..6a65107 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3380,7 +3380,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  
>  int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  {
> -	int node_id;
>  	int ret;
>  
>  	/*
> @@ -3400,22 +3399,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	if (zone->all_unreclaimable)
>  		return ZONE_RECLAIM_FULL;
>  
> -	/*
> -	 * Do not scan if the allocation should not be delayed.
> -	 */
> -	if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> -		return ZONE_RECLAIM_NOSCAN;
> -
> -	/*
> -	 * Only run zone reclaim on the local zone or on zones that do not
> -	 * have associated processors. This will favor the local processor
> -	 * over remote processors and spread off node memory allocations
> -	 * as wide as possible.
> -	 */
> -	node_id = zone_to_nid(zone);
> -	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> -		return ZONE_RECLAIM_NOSCAN;
> -
>  	ret = __zone_reclaim(zone, gfp_mask, order);
>  
>  	if (!ret)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-07-12 23:23           ` Hush Bensen
@ 2013-07-15  9:16             ` Andrea Arcangeli
  0 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-07-15  9:16 UTC (permalink / raw)
  To: Hush Bensen
  Cc: Mel Gorman, linux-mm, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

On Sat, Jul 13, 2013 at 07:23:58AM +0800, Hush Bensen wrote:
> Do you mean your patch done this fair? There is target zone shrink as 
> you mentiond in the vanilla kernel, however, your patch also done target 
> compaction/reclaim, is this fair?

It's still not fair; zone_reclaim_mode cannot be fair (modulo a major
rework at least), as its whole point is to reclaim memory from the
local node indefinitely, even if there's plenty of "free" or
"reclaimable" memory in remote nodes.

But waking kswapd before all nodes are below the low wmark would
probably make it even less fair than it is now, or at least it
wouldn't provide a fairness increase.

The idea of allowing allocations in the min-low wmark range is that
the "low" wmark would be restored soon anyway at the next
zone_reclaim() invocation, and zone_reclaim will still behave
synchronously (like direct reclaim) without ever waking kswapd,
regardless of whether we stop at the low or at the min wmark. But if
we stop at the "low" wmark we're more susceptible to parallel
allocation jitter, as the jitter-error margin then becomes:

		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),

which is just 1 single high order page in case of (1<<order) >=
SWAP_CLUSTER_MAX. If instead we use the "min" wmark after a successful
zone_reclaim(zone) to decide whether to allocate from that zone (the
one passed to zone_reclaim), we have more margin for allocation jitter
from other CPUs of the same node, or from interrupts.
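
Just to put a number on "1 single high order page": SWAP_CLUSTER_MAX
is 32 pages, so for a THP-sized allocation the margin restored by one
pass is exactly one hugepage (quick standalone check, not kernel
code):

	#include <stdio.h>

	#define SWAP_CLUSTER_MAX 32UL

	int main(void)
	{
		unsigned long order = 9, nr_pages = 1UL << order;
		unsigned long nr_to_reclaim = nr_pages > SWAP_CLUSTER_MAX ?
					      nr_pages : SWAP_CLUSTER_MAX;

		printf("nr_to_reclaim = %lu pages = %lu hugepage(s)\n",
		       nr_to_reclaim, nr_to_reclaim >> order);
		return 0;
	}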

So this again is connected to altering the wmark calculation for high
order pages in the previous patch (which is also intended to allow
having more than 1 THP page in the low-min wmark range). We don't need
many (too many is just a waste of CPU), but a few more than 1
significantly improve the NUMA locality on first allocation if all
CPUs in the node are allocating memory at the same time. I also
trimmed the high order page requirement for the min wmark down to
zero, as we don't need to guarantee hugepage availability for
PF_MEMALLOC (which avoids useless compaction work).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode
  2013-07-12 23:57   ` Hush Bensen
@ 2013-07-15  9:25     ` Andrea Arcangeli
  0 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2013-07-15  9:25 UTC (permalink / raw)
  To: Hush Bensen
  Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
	Shaohua Li, Rafael Aquini

On Fri, Jul 12, 2013 at 06:57:25PM -0500, Hush Bensen wrote:
> Target reclaim will be done once under low wmark in vanilla kernel,
> however, your patch change it to min wmark, why this behavior change?

This was connected to the previous question, so I tried to answer this
as well in the context of the previous email.

> > +			if (!order)
> > +				goto this_zone_full;
> > +			else
> 
> You do the works should be done in slow path, is it worth?

Not sure I understand the question, sorry. The reason for checking
order is that I was skeptical about marking the zone as full just
because a high order allocation failed. The problem is that if the
cache says "full" and it was just jitter (like compaction not having
run), we'll fall back to the other nodes.

In the previous patches however I made compaction a lot more reliable
(no more random skips where compaction isn't even tried for a while
after the cursors meet, for example), so maybe I could still mark the
zone full without noticeable effects. The above code has changed in
the meanwhile as I moved it elsewhere, so it's better to wait until I
send out the new version before reviewing it further.

Thanks!
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2013-07-15  9:26 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-05 15:10 [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Andrea Arcangeli
2013-06-05 15:10 ` [PATCH 1/7] mm: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
2013-06-05 19:23   ` Rik van Riel
2013-06-05 20:37   ` KOSAKI Motohiro
2013-06-05 20:51     ` Christoph Lameter
2013-06-05 21:03       ` KOSAKI Motohiro
2013-06-06 14:15         ` Christoph Lameter
2013-06-06 17:17           ` KOSAKI Motohiro
2013-06-06 18:16             ` Christoph Lameter
2013-06-05 21:33   ` Rafael Aquini
2013-06-06  9:04   ` Mel Gorman
2013-06-06 17:37     ` Rik van Riel
2013-06-14 16:16       ` Rik van Riel
2013-06-17  9:30         ` Mel Gorman
2013-06-17 18:12           ` Rik van Riel
2013-06-26 20:10             ` Andrea Arcangeli
2013-06-05 15:10 ` [PATCH 2/7] mm: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
2013-06-05 19:34   ` Rik van Riel
2013-06-05 21:39   ` Rafael Aquini
2013-06-06  9:05   ` Mel Gorman
2013-06-05 15:10 ` [PATCH 3/7] mm: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
2013-06-05 19:49   ` Rik van Riel
2013-06-26 20:38     ` Andrea Arcangeli
2013-06-06  9:11   ` Mel Gorman
2013-06-26 20:48     ` Andrea Arcangeli
2013-06-06 12:47   ` Rafael Aquini
2013-06-05 15:10 ` [PATCH 4/7] mm: compaction: reset before initializing the scan cursors Andrea Arcangeli
2013-06-05 20:04   ` Rik van Riel
2013-06-06  9:14   ` Mel Gorman
2013-06-06 12:49   ` Rafael Aquini
2013-06-05 15:10 ` [PATCH 5/7] mm: compaction: increase the high order pages in the watermarks Andrea Arcangeli
2013-06-05 20:18   ` Rik van Riel
2013-06-28 22:14     ` Andrea Arcangeli
2013-06-06  9:19   ` Mel Gorman
2013-06-05 15:10 ` [PATCH 6/7] mm: compaction: export compact_zone_order() Andrea Arcangeli
2013-06-05 20:24   ` Rik van Riel
2013-06-05 15:10 ` [PATCH 7/7] mm: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
2013-06-05 22:21   ` Rik van Riel
2013-06-06 10:05   ` Mel Gorman
2013-07-11 16:02     ` Andrea Arcangeli
2013-07-12 12:26       ` Hush Bensen
2013-07-12 16:01         ` Andrea Arcangeli
2013-07-12 23:23           ` Hush Bensen
2013-07-15  9:16             ` Andrea Arcangeli
2013-07-12 23:57   ` Hush Bensen
2013-07-15  9:25     ` Andrea Arcangeli
2013-06-06  8:53 ` [PATCH 0/7] RFC: adding compaction to zone_reclaim_mode > 0 Mel Gorman
2013-06-06 10:09 ` Mel Gorman
