* [PATCH 0/9] adding compaction to zone_reclaim_mode v3
@ 2013-08-02 16:06 Andrea Arcangeli
  2013-08-02 16:06 ` [PATCH 1/9] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

Hi,

This v3 is incremental on top of Johannes' roundrobin allocator v2. It
should be ready to be included in -mm and applied right after the
roundrobin allocator (the roundrobin allocator also improves
zone_reclaim_mode for those nodes where there's more than one zone,
like node 0 with the pci32 zone).

Without the patchset, when THP is disabled, zone_reclaim_mode set to 1
(or higher) tends to allocate memory in the local node fairly
accurately in the presence of CPU node bindings (and weak or no memory
bindings). However, when THP is enabled, it erroneously tends to
spread the memory to other nodes.

This patchset adds compaction to zone_reclaim_mode (enabled by default
on big NUMA systems) so THP doesn't risk regressing the NUMA placement
when zone_reclaim_mode is enabled.

I also found zone_reclaim_mode is quite unreliable in the presence of
multiple threads allocating memory at the same time from different
CPUs in the same node, even when THP is disabled and there's plenty of
clean cache to trivially reclaim.

After setting zone_reclaim_mode to 1 and booting with
numa_zonelist_order=n, with this patchset applied I get this NUMA placement:

  PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
 7088 breakthp              0      2.1M [   2.1M     0  ]
 7089 breakthp              1      2.1M [   2.1M     0  ]
 7090 breakthp              2      2.1M [   2.1M     0  ]
 7091 breakthp              3      2.1M [   2.1M     0  ]
 7092 breakthp              6      2.1M [     0    2.1M ]
 7093 breakthp              7      2.1M [     0    2.1M ]
 7094 breakthp              8      2.1M [     0    2.1M ]
 7095 breakthp              9      2.1M [     0    2.1M ]
 7097 breakthp              0      2.1M [   2.1M     0  ]
 7098 breakthp              1      2.1M [   2.1M     0  ]
 7099 breakthp              2      2.1M [   2.1M     0  ]
 7100 breakthp              3      2.1M [   2.1M     0  ]
 7101 breakthp              6      2.1M [     0    2.1M ]
 7102 breakthp              7      2.1M [     0    2.1M ]
 7103 breakthp              8      2.1M [     0    2.1M ]
 7104 breakthp              9      2.1M [     0    2.1M ]
  PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
 7106 usemem                0     1.00G [  1.00G     0  ]
 7107 usemem                1     1.00G [  1.00G     0  ]
 7108 usemem                2     1.00G [  1.00G     0  ]
 7109 usemem                3     1.00G [  1.00G     0  ]
 7110 usemem                6     1.00G [     0   1.00G ]
 7111 usemem                7     1.00G [     0   1.00G ]
 7112 usemem                8     1.00G [     0   1.00G ]
 7113 usemem                9     1.00G [     0   1.00G ]

With current upstream, without the patchset, still with
zone_reclaim_mode = 1 and booting with numa_zonelist_order=n:

  PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
 2950 breakthp              0      2.1M [   2.1M     0  ]
 2951 breakthp              1      2.1M [   2.1M     0  ]
 2952 breakthp              2      2.1M [   2.1M     0  ]
 2953 breakthp              3      2.1M [   2.1M     0  ]
 2954 breakthp              6      2.1M [     0    2.1M ]
 2955 breakthp              7      2.1M [     0    2.1M ]
 2956 breakthp              8      2.1M [     0    2.1M ]
 2957 breakthp              9      2.1M [     0    2.1M ]
 2966 breakthp              0      2.1M [   2.0M    96K ]
 2967 breakthp              1      2.1M [   2.0M    96K ]
 2968 breakthp              2      1.9M [   1.9M    96K ]
 2969 breakthp              3      2.1M [   2.0M    96K ]
 2970 breakthp              6      2.1M [   228K   1.8M ]
 2971 breakthp              7      2.1M [    72K   2.0M ]
 2972 breakthp              8      2.1M [    60K   2.0M ]
 2973 breakthp              9      2.1M [   204K   1.9M ]
  PID COMMAND         CPUMASK     TOTAL [     N0     N1 ]
 3088 usemem                0     1.00G [ 856.2M 168.0M ]
 3089 usemem                1     1.00G [ 860.2M 164.0M ]
 3090 usemem                2     1.00G [ 860.2M 164.0M ]
 3091 usemem                3     1.00G [ 858.2M 166.0M ]
 3092 usemem                6     1.00G [ 248.0M 776.2M ]
 3093 usemem                7     1.00G [ 248.0M 776.2M ]
 3094 usemem                8     1.00G [ 250.0M 774.2M ]
 3095 usemem                9     1.00G [ 246.0M 778.2M ]

The testcase always uses CPU bindings (half of the processes in one
node and half in the other node). It first fragments all memory
(breakthp) by breaking lots of hugepages with mremap, and then other
processes (usemem) allocate lots of memory, in turn exercising the
reliability of compaction with zone_reclaim_mode > 0.

Very few hugepages are available when usemem starts, but compaction
has a trivial time generating as many hugepages as needed, without
any risk of failure.

The memory layout when usemem starts is like this:

4k page anon
4k page free
another 512-2 4k pages free
4k page anon
4k page free
another 512-2 4k pages free
[..]
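
For reference, this is a minimal user-space sketch of what the
fragmenter could look like (an editorial illustration, not the actual
breakthp source; the 1G region size, the assumption that the mapping
is 2M aligned, and the final pause() are guesses):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

#define HPAGE	(2UL << 20)
#define SUBPAGE	4096UL
#define REGION	(1UL << 30)	/* assumed per-process working set */

int main(void)
{
	unsigned long off;
	char *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0, REGION);	/* fault everything in, ideally as THP */

	for (off = 0; off < REGION; off += HPAGE)
		/*
		 * Shrinking each 2M block to a single 4k page splits
		 * the hugepage and frees the other 511 subpages,
		 * producing the interleaved layout above while keeping
		 * only ~2M resident per process.
		 */
		mremap(p + off, HPAGE, SUBPAGE, 0);

	pause();	/* keep the fragmented layout in place for usemem */
	return 0;
}

usemem is then presumably little more than an mmap of ~1G followed by
a memset in each bound process, run with THP enabled, so every fault
tries to allocate hugepages from the already fragmented node.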

The heuristic that decides the numa_zonelist_order=z default should
also be dropped, or at least improved (not addressed by this
patchset). The =z default makes no sense on my hardware, for example,
and the coming roundrobin allocator from Johannes will defeat any
benefit of =z. Only an =n default will make sense with the roundrobin
allocator.

The roundrobin allocator entirely depends on the lowmem_reserve logic
for its safety with regard to lowmem zones.

Andrea Arcangeli (9):
  mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED
  mm: zone_reclaim: compaction: scan all memory with
    /proc/sys/vm/compact_memory
  mm: zone_reclaim: compaction: don't depend on kswapd to invoke
    reset_isolation_suitable
  mm: zone_reclaim: compaction: reset before initializing the scan
    cursors
  mm: compaction: don't require high order pages below min wmark
  mm: zone_reclaim: compaction: increase the high order pages in the
    watermarks
  mm: zone_reclaim: compaction: export compact_zone_order()
  mm: zone_reclaim: after a successful zone_reclaim check the min
    watermark
  mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode

 include/linux/compaction.h |  11 +++--
 include/linux/mmzone.h     |   9 ----
 include/linux/swap.h       |   8 ++-
 mm/compaction.c            |  40 ++++++++-------
 mm/page_alloc.c            |  45 +++++++++++++++--
 mm/vmscan.c                | 121 ++++++++++++++++++++++++++++++++++-----------
 6 files changed, 168 insertions(+), 66 deletions(-)


* [PATCH 1/9] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED
  2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
@ 2013-08-02 16:06 ` Andrea Arcangeli
  2013-08-05 18:30   ` Johannes Weiner
  2013-08-07 15:13   ` Mel Gorman
  2013-08-02 16:06 ` [PATCH 2/9] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
thread allocates memory at the same time, it forces a premature
allocation into remote NUMA nodes even when there's plenty of clean
cache to reclaim in the local nodes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mmzone.h | 6 ------
 mm/vmscan.c            | 4 ----
 2 files changed, 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dcad2ab..1badcc9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -497,7 +497,6 @@ struct zone {
 } ____cacheline_internodealigned_in_smp;
 
 typedef enum {
-	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 	ZONE_OOM_LOCKED,		/* zone is in OOM killer zonelist */
 	ZONE_CONGESTED,			/* zone has many dirty pages backed by
 					 * a congested BDI
@@ -541,11 +540,6 @@ static inline int zone_is_reclaim_writeback(const struct zone *zone)
 	return test_bit(ZONE_WRITEBACK, &zone->flags);
 }
 
-static inline int zone_is_reclaim_locked(const struct zone *zone)
-{
-	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
-}
-
 static inline int zone_is_oom_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 758540d..4b2da9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3595,11 +3595,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
-
 	ret = __zone_reclaim(zone, gfp_mask, order);
-	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
 
 	if (!ret)
 		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);


* [PATCH 2/9] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory
  2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
  2013-08-02 16:06 ` [PATCH 1/9] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
@ 2013-08-02 16:06 ` Andrea Arcangeli
  2013-08-05 18:45   ` Johannes Weiner
  2013-08-02 16:06 ` [PATCH 3/9] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

Reset the stats so /proc/sys/vm/compact_memory will scan all memory.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..cac9594 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1136,12 +1136,14 @@ void compact_pgdat(pg_data_t *pgdat, int order)
 
 static void compact_node(int nid)
 {
+	pg_data_t *pgdat = NODE_DATA(nid);
 	struct compact_control cc = {
 		.order = -1,
 		.sync = true,
 	};
 
-	__compact_pgdat(NODE_DATA(nid), &cc);
+	reset_isolation_suitable(pgdat);
+	__compact_pgdat(pgdat, &cc);
 }
 
 /* Compact all nodes in the system */


* [PATCH 3/9] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable
  2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
  2013-08-02 16:06 ` [PATCH 1/9] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
  2013-08-02 16:06 ` [PATCH 2/9] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
@ 2013-08-02 16:06 ` Andrea Arcangeli
  2013-08-05 19:32   ` Johannes Weiner
  2013-08-02 16:06 ` [PATCH 4/9] mm: zone_reclaim: compaction: reset before initializing the scan cursors Andrea Arcangeli
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

If kswapd never needs to run (only __GFP_NO_KSWAPD allocations and
plenty of free memory), compaction is crippled and stops running for a
while after the free/isolation cursors meet. After that, allocations
can fail for a full cycle of compaction_deferred, until
compaction_restarting finally resets it again.

Stopping compaction for a full cycle after the cursors meet, even if
it never failed and it's not going to fail, doesn't make sense.

We already throttle compaction CPU utilization using
defer_compaction. We shouldn't prevent compaction from running after
each pass completes when the cursors meet, unless it failed.

This makes direct compaction functional again. The throttling of
direct compaction is still controlled by the defer_compaction
logic.

kswapd still won't risk resetting compaction, and it will wait for
direct compaction to do so. Not sure if this is ideal, but it at least
decreases the risk of kswapd doing too much work. kswapd will only run
one pass of compaction until some allocation invokes compaction again.

This decreased reliability of compaction was introduced in commit
62997027ca5b3d4618198ed8b1aba40b61b1137b.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/compaction.h |  5 -----
 include/linux/mmzone.h     |  3 ---
 mm/compaction.c            | 15 ++++++---------
 mm/page_alloc.c            |  1 -
 mm/vmscan.c                |  8 --------
 5 files changed, 6 insertions(+), 26 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 091d72e..fc3f266 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -24,7 +24,6 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			bool sync, bool *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
-extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
 /* Do not skip compaction more than 64 times */
@@ -84,10 +83,6 @@ static inline void compact_pgdat(pg_data_t *pgdat, int order)
 {
 }
 
-static inline void reset_isolation_suitable(pg_data_t *pgdat)
-{
-}
-
 static inline unsigned long compaction_suitable(struct zone *zone, int order)
 {
 	return COMPACT_SKIPPED;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1badcc9..5904b32 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -355,9 +355,6 @@ struct zone {
 	int			alloc_batch;
 	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-	/* Set to true when the PG_migrate_skip bits should be cleared */
-	bool			compact_blockskip_flush;
-
 	/* pfns where compaction scanners should start */
 	unsigned long		compact_cached_free_pfn;
 	unsigned long		compact_cached_migrate_pfn;
diff --git a/mm/compaction.c b/mm/compaction.c
index cac9594..525baaa 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -91,7 +91,6 @@ static void __reset_isolation_suitable(struct zone *zone)
 
 	zone->compact_cached_migrate_pfn = start_pfn;
 	zone->compact_cached_free_pfn = end_pfn;
-	zone->compact_blockskip_flush = false;
 
 	/* Walk the zone and mark every pageblock as suitable for isolation */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -110,7 +109,7 @@ static void __reset_isolation_suitable(struct zone *zone)
 	}
 }
 
-void reset_isolation_suitable(pg_data_t *pgdat)
+static void reset_isolation_suitable(pg_data_t *pgdat)
 {
 	int zoneid;
 
@@ -120,8 +119,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
 			continue;
 
 		/* Only flush if a full compaction finished recently */
-		if (zone->compact_blockskip_flush)
-			__reset_isolation_suitable(zone);
+		__reset_isolation_suitable(zone);
 	}
 }
 
@@ -828,13 +826,12 @@ static int compact_finished(struct zone *zone,
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn) {
 		/*
-		 * Mark that the PG_migrate_skip information should be cleared
-		 * by kswapd when it goes to sleep. kswapd does not set the
-		 * flag itself as the decision to be clear should be directly
-		 * based on an allocation request.
+		 * Clear the PG_migrate_skip information. kswapd does
+		 * not clear it as the decision to be clear should be
+		 * directly based on an allocation request.
 		 */
 		if (!current_is_kswapd())
-			zone->compact_blockskip_flush = true;
+			__reset_isolation_suitable(zone);
 
 		return COMPACT_COMPLETE;
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ccedd5..092b30d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2222,7 +2222,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
 				preferred_zone, migratetype);
 		if (page) {
-			preferred_zone->compact_blockskip_flush = false;
 			preferred_zone->compact_considered = 0;
 			preferred_zone->compact_defer_shift = 0;
 			if (order >= preferred_zone->compact_order_failed)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4b2da9e..f2ada36 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3091,14 +3091,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
-		/*
-		 * Compaction records what page blocks it recently failed to
-		 * isolate pages from and skips them in the future scanning.
-		 * When kswapd is going to sleep, it is reasonable to assume
-		 * that pages and compaction may succeed so reset the cache.
-		 */
-		reset_isolation_suitable(pgdat);
-
 		if (!kthread_should_stop())
 			schedule();
 


* [PATCH 4/9] mm: zone_reclaim: compaction: reset before initializing the scan cursors
  2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2013-08-02 16:06 ` [PATCH 3/9] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
@ 2013-08-02 16:06 ` Andrea Arcangeli
  2013-08-05 19:44   ` Johannes Weiner
  2013-08-02 16:06 ` [PATCH 5/9] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

Correct the location where we reset the scan cursors, otherwise the
first iteration of compaction (after restarting it) will only do a
partial scan.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rafael Aquini <aquini@redhat.com>
---
 mm/compaction.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 525baaa..afaf692 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -934,6 +934,17 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	}
 
 	/*
+	 * Clear pageblock skip if there were failures recently and
+	 * compaction is about to be retried after being
+	 * deferred. kswapd does not do this reset and it will wait
+	 * direct compaction to do so either when the cursor meets
+	 * after one compaction pass is complete or if compaction is
+	 * restarted after being deferred for a while.
+	 */
+	if ((compaction_restarting(zone, cc->order)) && !current_is_kswapd())
+		__reset_isolation_suitable(zone);
+
+	/*
 	 * Setup to move all movable pages to the end of the zone. Used cached
 	 * information on where the scanners should start but check that it
 	 * is initialised by ensuring the values are within zone boundaries.
@@ -949,14 +960,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
 	}
 
-	/*
-	 * Clear pageblock skip if there were failures recently and compaction
-	 * is about to be retried after being deferred. kswapd does not do
-	 * this reset as it'll reset the cached information when going to sleep.
-	 */
-	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
-		__reset_isolation_suitable(zone);
-
 	migrate_prep_local();
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {


* [PATCH 5/9] mm: compaction: don't require high order pages below min wmark
  2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2013-08-02 16:06 ` [PATCH 4/9] mm: zone_reclaim: compaction: reset before initializing the scan cursors Andrea Arcangeli
@ 2013-08-02 16:06 ` Andrea Arcangeli
  2013-08-05 18:35   ` Rik van Riel
                     ` (2 more replies)
  2013-08-02 16:06 ` [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks Andrea Arcangeli
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

The min wmark should be satisfied with just 1 hugepage, and the other
wmarks should be adjusted accordingly. The low wmark check still needs
to succeed if there's a significant amount of order-0 pages, but we
don't need plenty of high order pages because the PF_MEMALLOC paths
don't require those. Creating a ton of high order pages that cannot be
allocated by the high order allocation paths (no PF_MEMALLOC) is quite
wasteful, because they can be split into lower order pages before
anybody has a chance to allocate them.
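
As a worked example with made-up numbers: if the min wmark is 1024
pages and the low wmark is 1280 pages, a high order check against the
low wmark now starts the per-order loop from min = 1280 - 1024 = 256
pages instead of 1280, and a check against the min wmark itself starts
from min = 0, so it only requires that at least one page of the
requested (or higher) order remains free.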

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 092b30d..4401983 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1643,6 +1643,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 
 	if (free_pages - free_cma <= min + lowmem_reserve)
 		return false;
+	if (!order)
+		return true;
+
+	/*
+	 * Don't require any high order page under the min
+	 * wmark. Invoking compaction to create lots of high order
+	 * pages below the min wmark is wasteful because those
+	 * hugepages cannot be allocated without PF_MEMALLOC and the
+	 * PF_MEMALLOC paths must not depend on high order allocations
+	 * to succeed.
+	 */
+	min = mark - z->watermark[WMARK_MIN];
+	WARN_ON(min < 0);
+	if (alloc_flags & ALLOC_HIGH)
+		min -= min / 2;
+	if (alloc_flags & ALLOC_HARDER)
+		min -= min / 4;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
 		free_pages -= z->free_area[o].nr_free << o;


* [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks
  2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2013-08-02 16:06 ` [PATCH 5/9] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
@ 2013-08-02 16:06 ` Andrea Arcangeli
  2013-08-05 18:36   ` Rik van Riel
                     ` (2 more replies)
  2013-08-02 16:06 ` [PATCH 7/9] mm: zone_reclaim: compaction: export compact_zone_order() Andrea Arcangeli
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

Prevent the scaling down from reducing the watermarks too much for
the higher orders.
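
As a worked example, assuming pageblock_order = 9 (2M pageblocks): min
is now halved only for orders below pageblock_order >> 2 = 2, so from
order 2 upwards the per-order requirement stays at min/4 instead of
decaying all the way down to min/512 by the time an order-9
allocation is checked.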

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4401983..b32ecde 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1665,7 +1665,8 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		free_pages -= z->free_area[o].nr_free << o;
 
 		/* Require fewer higher order pages to be free */
-		min >>= 1;
+		if (o < (pageblock_order >> 2))
+			min >>= 1;
 
 		if (free_pages <= min)
 			return false;


* [PATCH 7/9] mm: zone_reclaim: compaction: export compact_zone_order()
  2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2013-08-02 16:06 ` [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks Andrea Arcangeli
@ 2013-08-02 16:06 ` Andrea Arcangeli
  2013-08-05 18:37   ` Rik van Riel
  2013-08-07 15:48   ` Mel Gorman
  2013-08-02 16:06 ` [PATCH 8/9] mm: zone_reclaim: after a successful zone_reclaim check the min watermark Andrea Arcangeli
  2013-08-02 16:06 ` [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
  8 siblings, 2 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

Needed by zone_reclaim_mode compaction-awareness.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/compaction.h | 10 ++++++++++
 mm/compaction.c            |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index fc3f266..e953acb 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -23,6 +23,9 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			bool sync, bool *contended);
+extern unsigned long compact_zone_order(struct zone *zone,
+					int order, gfp_t gfp_mask,
+					bool sync, bool *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
@@ -79,6 +82,13 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_CONTINUE;
 }
 
+static inline unsigned long compact_zone_order(struct zone *zone,
+					       int order, gfp_t gfp_mask,
+					       bool sync, bool *contended)
+{
+	return COMPACT_CONTINUE;
+}
+
 static inline void compact_pgdat(pg_data_t *pgdat, int order)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index afaf692..a1154c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1008,7 +1008,7 @@ out:
 	return ret;
 }
 
-static unsigned long compact_zone_order(struct zone *zone,
+unsigned long compact_zone_order(struct zone *zone,
 				 int order, gfp_t gfp_mask,
 				 bool sync, bool *contended)
 {


* [PATCH 8/9] mm: zone_reclaim: after a successful zone_reclaim check the min watermark
  2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2013-08-02 16:06 ` [PATCH 7/9] mm: zone_reclaim: compaction: export compact_zone_order() Andrea Arcangeli
@ 2013-08-02 16:06 ` Andrea Arcangeli
  2013-08-05 18:38   ` Rik van Riel
  2013-08-07 15:53   ` Mel Gorman
  2013-08-02 16:06 ` [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
  8 siblings, 2 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

If we're in the fast path and zone_reclaim() succeeded, it means we
freed enough memory and we can use the min watermark to have some
margin against concurrent allocations from other CPUs or interrupts.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b32ecde..879a3fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1990,8 +1990,26 @@ zonelist_scan:
 			case ZONE_RECLAIM_FULL:
 				/* scanned but unreclaimable */
 				continue;
+			case ZONE_RECLAIM_SUCCESS:
+				/*
+				 * If we successfully reclaimed
+				 * enough, allow allocations up to the
+				 * min watermark (instead of stopping
+				 * at "mark"). This provides some more
+				 * margin against parallel
+				 * allocations. Using the min
+				 * watermark doesn't alter when we
+				 * wakeup kswapd. It also doesn't
+				 * alter the synchronous direct
+				 * reclaim behavior of zone_reclaim()
+				 * that will still be invoked at the
+				 * next pass if we're still below the
+				 * low watermark (even if kswapd isn't
+				 * woken).
+				 */
+				mark = min_wmark_pages(zone);
+				/* Fall through */
 			default:
-				/* did we reclaim enough */
 				if (zone_watermark_ok(zone, order, mark,
 						classzone_idx, alloc_flags))
 					goto try_this_zone;


* [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
  2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2013-08-02 16:06 ` [PATCH 8/9] mm: zone_reclaim: after a successful zone_reclaim check the min watermark Andrea Arcangeli
@ 2013-08-02 16:06 ` Andrea Arcangeli
  2013-08-04 16:55   ` Andrea Arcangeli
  8 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-02 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

This adds compaction to zone_reclaim so that enabling THP won't
decrease NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.

It is important to boot with numa_zonelist_order=n (n means nodes) to
get more accurate NUMA locality if there are multiple zones per node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/swap.h |   8 +++-
 mm/page_alloc.c      |   4 +-
 mm/vmscan.c          | 111 ++++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 102 insertions(+), 21 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d95cde5..d076a54 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -289,10 +289,14 @@ extern unsigned long vm_total_pages;
 extern int zone_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int zone_reclaim(struct zone *, struct zone *, gfp_t, unsigned int,
+			unsigned long, int, int);
 #else
 #define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+static inline int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
+			       gfp_t mask, unsigned int order,
+			       unsigned long mark, int classzone_idx,
+			       int alloc_flags)
 {
 	return 0;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 879a3fd..c0bdde6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1982,7 +1982,9 @@ zonelist_scan:
 				!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
 
-			ret = zone_reclaim(zone, gfp_mask, order);
+			ret = zone_reclaim(preferred_zone, zone, gfp_mask,
+					   order,
+					   mark, classzone_idx, alloc_flags);
 			switch (ret) {
 			case ZONE_RECLAIM_NOSCAN:
 				/* did not scan */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f2ada36..f28dc00 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3488,6 +3488,24 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	unsigned long nr_slab_pages0, nr_slab_pages1;
 
 	cond_resched();
+
+	/*
+	 * Zone reclaim reclaims unmapped file backed pages and
+	 * slab pages if we are over the defined limits.
+	 *
+	 * A small portion of unmapped file backed pages is needed for
+	 * file I/O otherwise pages read by file I/O will be immediately
+	 * thrown out if the zone is overallocated. So we do not reclaim
+	 * if less than a specified percentage of the zone is used by
+	 * unmapped file backed pages.
+	 */
+	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
+	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
+		return ZONE_RECLAIM_FULL;
+
+	if (zone->all_unreclaimable)
+		return ZONE_RECLAIM_FULL;
+
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_SWAP
 	 * and we also need to be able to write out pages for RECLAIM_WRITE
@@ -3549,27 +3567,35 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	return sc.nr_reclaimed >= nr_pages;
 }
 
-int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+static int zone_reclaim_compact(struct zone *preferred_zone,
+				struct zone *zone, gfp_t gfp_mask,
+				unsigned int order,
+				bool sync_compaction,
+				bool *need_compaction)
 {
-	int node_id;
-	int ret;
+	bool contended;
 
-	/*
-	 * Zone reclaim reclaims unmapped file backed pages and
-	 * slab pages if we are over the defined limits.
-	 *
-	 * A small portion of unmapped file backed pages is needed for
-	 * file I/O otherwise pages read by file I/O will be immediately
-	 * thrown out if the zone is overallocated. So we do not reclaim
-	 * if less than a specified percentage of the zone is used by
-	 * unmapped file backed pages.
-	 */
-	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
-	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
-		return ZONE_RECLAIM_FULL;
+	if (compaction_deferred(preferred_zone, order) ||
+	    !order ||
+	    (gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
+		need_compaction = false;
+		return COMPACT_SKIPPED;
+	}
 
-	if (zone->all_unreclaimable)
-		return ZONE_RECLAIM_FULL;
+	*need_compaction = true;
+	return compact_zone_order(zone, order,
+				  gfp_mask,
+				  sync_compaction,
+				  &contended);
+}
+
+int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
+		 gfp_t gfp_mask, unsigned int order,
+		 unsigned long mark, int classzone_idx, int alloc_flags)
+{
+	int node_id;
+	int ret, c_ret;
+	bool sync_compaction = false, need_compaction = false;
 
 	/*
 	 * Do not scan if the allocation should not be delayed.
@@ -3587,7 +3613,56 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
+repeat_compaction:
+	/*
+	 * If this allocation may be satisfied by memory compaction,
+	 * run compaction before reclaim.
+	 */
+	c_ret = zone_reclaim_compact(preferred_zone,
+				     zone, gfp_mask, order,
+				     sync_compaction,
+				     &need_compaction);
+	if (need_compaction &&
+	    c_ret != COMPACT_SKIPPED &&
+	    zone_watermark_ok(zone, order, mark,
+			      classzone_idx,
+			      alloc_flags)) {
+#ifdef CONFIG_COMPACTION
+		zone->compact_considered = 0;
+		zone->compact_defer_shift = 0;
+#endif
+		return ZONE_RECLAIM_SUCCESS;
+	}
+
+	/*
+	 * reclaim if compaction failed because not enough memory was
+	 * available or if compaction didn't run (order 0) or didn't
+	 * succeed.
+	 */
 	ret = __zone_reclaim(zone, gfp_mask, order);
+	if (ret == ZONE_RECLAIM_SUCCESS) {
+		if (zone_watermark_ok(zone, order, mark,
+				      classzone_idx,
+				      alloc_flags))
+			return ZONE_RECLAIM_SUCCESS;
+
+		/*
+		 * If compaction run but it was skipped and reclaim was
+		 * successful keep going.
+		 */
+		if (need_compaction && c_ret == COMPACT_SKIPPED) {
+			/*
+			 * If it's ok to wait for I/O we can as well run sync
+			 * compaction
+			 */
+			sync_compaction = !!(zone_reclaim_mode &
+					     (RECLAIM_WRITE|RECLAIM_SWAP));
+			cond_resched();
+			goto repeat_compaction;
+		}
+	}
+	if (need_compaction)
+		defer_compaction(preferred_zone, order);
 
 	if (!ret)
 		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);


* Re: [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
  2013-08-02 16:06 ` [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
@ 2013-08-04 16:55   ` Andrea Arcangeli
  2013-08-05 18:38     ` Rik van Riel
  2013-08-07 16:18     ` Mel Gorman
  0 siblings, 2 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-04 16:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Mel Gorman, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Fri, Aug 02, 2013 at 06:06:36PM +0200, Andrea Arcangeli wrote:
> +		need_compaction = false;

This should be changed to "*need_compaction = false". It's actually
only a cleanup, because the fix is a no-op at runtime:
need_compaction was initialized to false by the only caller, so the
bug couldn't do any harm. But it's better to fix it to avoid
confusion. Alternatively the above line could be dropped entirely,
but I thought it was cleaner to have a defined value as the result
of the function.

Found by Fengguang's kbuild robot.

A new replacement patch 9/9 is appended below:

===
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] mm: zone_reclaim: compaction: add compaction to
 zone_reclaim_mode

This adds compaction to zone_reclaim so that enabling THP won't
decrease NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.

It is important to boot with numa_zonelist_order=n (n means nodes) to
get more accurate NUMA locality if there are multiple zones per node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/swap.h |   8 +++-
 mm/page_alloc.c      |   4 +-
 mm/vmscan.c          | 111 ++++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 102 insertions(+), 21 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d95cde5..d076a54 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -289,10 +289,14 @@ extern unsigned long vm_total_pages;
 extern int zone_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int zone_reclaim(struct zone *, struct zone *, gfp_t, unsigned int,
+			unsigned long, int, int);
 #else
 #define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+static inline int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
+			       gfp_t mask, unsigned int order,
+			       unsigned long mark, int classzone_idx,
+			       int alloc_flags)
 {
 	return 0;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 879a3fd..c0bdde6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1982,7 +1982,9 @@ zonelist_scan:
 				!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
 
-			ret = zone_reclaim(zone, gfp_mask, order);
+			ret = zone_reclaim(preferred_zone, zone, gfp_mask,
+					   order,
+					   mark, classzone_idx, alloc_flags);
 			switch (ret) {
 			case ZONE_RECLAIM_NOSCAN:
 				/* did not scan */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f2ada36..fedb246 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3488,6 +3488,24 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	unsigned long nr_slab_pages0, nr_slab_pages1;
 
 	cond_resched();
+
+	/*
+	 * Zone reclaim reclaims unmapped file backed pages and
+	 * slab pages if we are over the defined limits.
+	 *
+	 * A small portion of unmapped file backed pages is needed for
+	 * file I/O otherwise pages read by file I/O will be immediately
+	 * thrown out if the zone is overallocated. So we do not reclaim
+	 * if less than a specified percentage of the zone is used by
+	 * unmapped file backed pages.
+	 */
+	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
+	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
+		return ZONE_RECLAIM_FULL;
+
+	if (zone->all_unreclaimable)
+		return ZONE_RECLAIM_FULL;
+
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_SWAP
 	 * and we also need to be able to write out pages for RECLAIM_WRITE
@@ -3549,27 +3567,35 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	return sc.nr_reclaimed >= nr_pages;
 }
 
-int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+static int zone_reclaim_compact(struct zone *preferred_zone,
+				struct zone *zone, gfp_t gfp_mask,
+				unsigned int order,
+				bool sync_compaction,
+				bool *need_compaction)
 {
-	int node_id;
-	int ret;
+	bool contended;
 
-	/*
-	 * Zone reclaim reclaims unmapped file backed pages and
-	 * slab pages if we are over the defined limits.
-	 *
-	 * A small portion of unmapped file backed pages is needed for
-	 * file I/O otherwise pages read by file I/O will be immediately
-	 * thrown out if the zone is overallocated. So we do not reclaim
-	 * if less than a specified percentage of the zone is used by
-	 * unmapped file backed pages.
-	 */
-	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
-	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
-		return ZONE_RECLAIM_FULL;
+	if (compaction_deferred(preferred_zone, order) ||
+	    !order ||
+	    (gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
+		*need_compaction = false;
+		return COMPACT_SKIPPED;
+	}
 
-	if (zone->all_unreclaimable)
-		return ZONE_RECLAIM_FULL;
+	*need_compaction = true;
+	return compact_zone_order(zone, order,
+				  gfp_mask,
+				  sync_compaction,
+				  &contended);
+}
+
+int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
+		 gfp_t gfp_mask, unsigned int order,
+		 unsigned long mark, int classzone_idx, int alloc_flags)
+{
+	int node_id;
+	int ret, c_ret;
+	bool sync_compaction = false, need_compaction = false;
 
 	/*
 	 * Do not scan if the allocation should not be delayed.
@@ -3587,7 +3613,56 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
+repeat_compaction:
+	/*
+	 * If this allocation may be satisfied by memory compaction,
+	 * run compaction before reclaim.
+	 */
+	c_ret = zone_reclaim_compact(preferred_zone,
+				     zone, gfp_mask, order,
+				     sync_compaction,
+				     &need_compaction);
+	if (need_compaction &&
+	    c_ret != COMPACT_SKIPPED &&
+	    zone_watermark_ok(zone, order, mark,
+			      classzone_idx,
+			      alloc_flags)) {
+#ifdef CONFIG_COMPACTION
+		zone->compact_considered = 0;
+		zone->compact_defer_shift = 0;
+#endif
+		return ZONE_RECLAIM_SUCCESS;
+	}
+
+	/*
+	 * reclaim if compaction failed because not enough memory was
+	 * available or if compaction didn't run (order 0) or didn't
+	 * succeed.
+	 */
 	ret = __zone_reclaim(zone, gfp_mask, order);
+	if (ret == ZONE_RECLAIM_SUCCESS) {
+		if (zone_watermark_ok(zone, order, mark,
+				      classzone_idx,
+				      alloc_flags))
+			return ZONE_RECLAIM_SUCCESS;
+
+		/*
+		 * If compaction run but it was skipped and reclaim was
+		 * successful keep going.
+		 */
+		if (need_compaction && c_ret == COMPACT_SKIPPED) {
+			/*
+			 * If it's ok to wait for I/O we can as well run sync
+			 * compaction
+			 */
+			sync_compaction = !!(zone_reclaim_mode &
+					     (RECLAIM_WRITE|RECLAIM_SWAP));
+			cond_resched();
+			goto repeat_compaction;
+		}
+	}
+	if (need_compaction)
+		defer_compaction(preferred_zone, order);
 
 	if (!ret)
 		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);


* Re: [PATCH 1/9] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED
  2013-08-02 16:06 ` [PATCH 1/9] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
@ 2013-08-05 18:30   ` Johannes Weiner
  2013-08-07 15:13   ` Mel Gorman
  1 sibling, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2013-08-05 18:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini,
	Andrew Morton, Hush Bensen

On Fri, Aug 02, 2013 at 06:06:28PM +0200, Andrea Arcangeli wrote:
> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
> thread allocates memory at the same time, it forces a premature
> allocation into remote NUMA nodes even when there's plenty of clean
> cache to reclaim in the local nodes.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Rafael Aquini <aquini@redhat.com>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 5/9] mm: compaction: don't require high order pages below min wmark
  2013-08-02 16:06 ` [PATCH 5/9] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
@ 2013-08-05 18:35   ` Rik van Riel
  2013-08-06 13:23   ` Johannes Weiner
  2013-08-07 15:42   ` Mel Gorman
  2 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2013-08-05 18:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On 08/02/2013 12:06 PM, Andrea Arcangeli wrote:
> The min wmark should be satisfied with just 1 hugepage, and the other
> wmarks should be adjusted accordingly. The low wmark check still needs
> to succeed if there's a significant amount of order-0 pages, but we
> don't need plenty of high order pages because the PF_MEMALLOC paths
> don't require those. Creating a ton of high order pages that cannot be
> allocated by the high order allocation paths (no PF_MEMALLOC) is quite
> wasteful, because they can be split into lower order pages before
> anybody has a chance to allocate them.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks
  2013-08-02 16:06 ` [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks Andrea Arcangeli
@ 2013-08-05 18:36   ` Rik van Riel
  2013-08-06 16:08   ` Johannes Weiner
  2013-08-07 15:43   ` Mel Gorman
  2 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2013-08-05 18:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On 08/02/2013 12:06 PM, Andrea Arcangeli wrote:
> Prevent the scaling down from reducing the watermarks too much for
> the higher orders.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Not super fond of the magic number, but I don't have a better idea...

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [PATCH 7/9] mm: zone_reclaim: compaction: export compact_zone_order()
  2013-08-02 16:06 ` [PATCH 7/9] mm: zone_reclaim: compaction: export compact_zone_order() Andrea Arcangeli
@ 2013-08-05 18:37   ` Rik van Riel
  2013-08-07 15:48   ` Mel Gorman
  1 sibling, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2013-08-05 18:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On 08/02/2013 12:06 PM, Andrea Arcangeli wrote:
> Needed by zone_reclaim_mode compaction-awareness.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


* Re: [PATCH 8/9] mm: zone_reclaim: after a successful zone_reclaim check the min watermark
  2013-08-02 16:06 ` [PATCH 8/9] mm: zone_reclaim: after a successful zone_reclaim check the min watermark Andrea Arcangeli
@ 2013-08-05 18:38   ` Rik van Riel
  2013-08-07 15:53   ` Mel Gorman
  1 sibling, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2013-08-05 18:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On 08/02/2013 12:06 PM, Andrea Arcangeli wrote:
> If we're in the fast path and zone_reclaim() succeeded, it means we
> freed enough memory and we can use the min watermark to have some
> margin against concurrent allocations from other CPUs or interrupts.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


* Re: [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
  2013-08-04 16:55   ` Andrea Arcangeli
@ 2013-08-05 18:38     ` Rik van Riel
  2013-08-07 16:18     ` Mel Gorman
  1 sibling, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2013-08-05 18:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On 08/04/2013 12:55 PM, Andrea Arcangeli wrote:
> On Fri, Aug 02, 2013 at 06:06:36PM +0200, Andrea Arcangeli wrote:
>> +		need_compaction = false;
>
> This should be changed to "*need_compaction = false". It's actually
> only a cleanup, because the fix is a no-op at runtime:
> need_compaction was initialized to false by the only caller, so the
> bug couldn't do any harm. But it's better to fix it to avoid
> confusion. Alternatively the above line could be dropped entirely,
> but I thought it was cleaner to have a defined value as the result
> of the function.
>
> Found by Fengguang's kbuild robot.
>
> A new replacement patch 9/9 is appended below:
>
> ===
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: [PATCH] mm: zone_reclaim: compaction: add compaction to
>   zone_reclaim_mode
>
> This adds compaction to zone_reclaim so that enabling THP won't
> decrease NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.
>
> It is important to boot with numa_zonelist_order=n (n means nodes) to
> get more accurate NUMA locality if there are multiple zones per node.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [PATCH 2/9] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory
  2013-08-02 16:06 ` [PATCH 2/9] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
@ 2013-08-05 18:45   ` Johannes Weiner
  2013-08-06  5:50     ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Johannes Weiner @ 2013-08-05 18:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini,
	Andrew Morton, Hush Bensen

On Fri, Aug 02, 2013 at 06:06:29PM +0200, Andrea Arcangeli wrote:
> Reset the stats so /proc/sys/vm/compact_memory will scan all memory.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Rafael Aquini <aquini@redhat.com>
> Acked-by: Mel Gorman <mgorman@suse.de>

It somehow feels wrong that this operation should have a destructive
side effect, rather than just ignore the cached info for the one run
(like cc.ignore_skip_hint).  But I don't really have a strong reason
against it, so...

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 3/9] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable
  2013-08-02 16:06 ` [PATCH 3/9] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
@ 2013-08-05 19:32   ` Johannes Weiner
  0 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2013-08-05 19:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini,
	Andrew Morton, Hush Bensen

On Fri, Aug 02, 2013 at 06:06:30PM +0200, Andrea Arcangeli wrote:
> If kswapd never needs to run (only __GFP_NO_KSWAPD allocations and
> plenty of free memory) compaction is otherwise crippled and stops
> running for a while after the free/isolation cursors meet. After that
> allocations can fail for a full cycle of compaction_deferred, until
> compaction_restarting finally resets it again.
> 
> Stopping compaction for a full cycle after the cursors meet, even if
> it never failed and it's not going to fail, doesn't make sense.
> 
> We already throttle compaction CPU utilization using
> defer_compaction. We shouldn't prevent compaction from running after
> each pass completes when the cursors meet, unless it failed.
> 
> This makes direct compaction functional again. The throttling of
> direct compaction is still controlled by the defer_compaction
> logic.
> 
> kswapd still won't risk resetting compaction, and it will wait for
> direct compaction to do so. Not sure if this is ideal but it at least
> decreases the risk of kswapd doing too much work. kswapd will only run
> one pass of compaction until some allocation invokes compaction again.
> 
> This decreased reliability of compaction was introduced in commit
> 62997027ca5b3d4618198ed8b1aba40b61b1137b .
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Rafael Aquini <aquini@redhat.com>
> Acked-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 4/9] mm: zone_reclaim: compaction: reset before initializing the scan cursors
  2013-08-02 16:06 ` [PATCH 4/9] mm: zone_reclaim: compaction: reset before initializing the scan cursors Andrea Arcangeli
@ 2013-08-05 19:44   ` Johannes Weiner
  0 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2013-08-05 19:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini,
	Andrew Morton, Hush Bensen

On Fri, Aug 02, 2013 at 06:06:31PM +0200, Andrea Arcangeli wrote:
> Correct the location where we reset the scan cursors, otherwise the
> first iteration of compaction (after restarting it) will only do a
> partial scan.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rafael Aquini <aquini@redhat.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Yes, it does not make sense to read the situation from the cache
first, then two lines later invalidate it because it's stale data.

That being said, why are we maintaining the pageblock skip bits in
addition to the scanner offset caches?  Sometimes we only set the
pageblock skip bit without updating the position cache, but the next
invocation will skip over these blocks anyway because of
!isolation_suitable().  And they are invalidated together.  Aren't
they redundant?


* Re: [PATCH 2/9] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory
  2013-08-05 18:45   ` Johannes Weiner
@ 2013-08-06  5:50     ` Andrea Arcangeli
  0 siblings, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-06  5:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini,
	Andrew Morton, Hush Bensen

Hi!

On Mon, Aug 05, 2013 at 02:45:59PM -0400, Johannes Weiner wrote:
> On Fri, Aug 02, 2013 at 06:06:29PM +0200, Andrea Arcangeli wrote:
> > Reset the stats so /proc/sys/vm/compact_memory will scan all memory.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Acked-by: Rafael Aquini <aquini@redhat.com>
> > Acked-by: Mel Gorman <mgorman@suse.de>
> 
> It somehow feels wrong that this operation should have a destructive
> side effect, rather than just ignore the cached info for the one run
> (like cc.ignore_skip_hint).  But I don't really have a strong reason
> against it, so...

But what benefit would it provide to keep the cached cursor positions
alive after we have already compacted the whole memory from start to
end? The cached cursors provide useful information when we compact in
small steps, as they represent the unscanned part of the memory. But
after a full compaction has completed, unless some memory activity has
happened, there will be nothing left to compact. So we just need to
find what may have changed as a result of that memory activity, and in
turn there should be no benefit in starting at the previously cached
cursor positions.
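
As a rough sketch of the intent (field and helper names here follow the
current compaction code but are an assumption, not the patch itself):

	static void reset_zone_compaction_state(struct zone *zone)
	{
		/* forget the cached scanner positions so the next pass
		 * covers the whole zone ... */
		zone->compact_cached_migrate_pfn = zone->zone_start_pfn;
		zone->compact_cached_free_pfn = zone_end_pfn(zone);
		/* ... and stop deferring compaction for this zone */
		zone->compact_considered = 0;
		zone->compact_defer_shift = 0;
	}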

> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!
Andrea


* Re: [PATCH 5/9] mm: compaction: don't require high order pages below min wmark
  2013-08-02 16:06 ` [PATCH 5/9] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
  2013-08-05 18:35   ` Rik van Riel
@ 2013-08-06 13:23   ` Johannes Weiner
  2013-08-07 15:42   ` Mel Gorman
  2 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2013-08-06 13:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini,
	Andrew Morton, Hush Bensen

On Fri, Aug 02, 2013 at 06:06:32PM +0200, Andrea Arcangeli wrote:
> The min wmark should be satisfied with just 1 hugepage. And the other
> wmarks should be adjusted accordingly. We need to succeed the low
> wmark check if there's some significant amount of 0 order pages, but
> we don't need plenty of high order pages because the PF_MEMALLOC paths
> don't require those. Creating a ton of high order pages that cannot be
> allocated by the high order allocation paths (no PF_MEMALLOC) is quite
> wasteful because they can be split into lower order pages before
> anybody has a chance to allocate them.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks
  2013-08-02 16:06 ` [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks Andrea Arcangeli
  2013-08-05 18:36   ` Rik van Riel
@ 2013-08-06 16:08   ` Johannes Weiner
  2013-08-06 18:52     ` Johannes Weiner
  2013-08-07 13:18     ` Andrea Arcangeli
  2013-08-07 15:43   ` Mel Gorman
  2 siblings, 2 replies; 36+ messages in thread
From: Johannes Weiner @ 2013-08-06 16:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini,
	Andrew Morton, Hush Bensen

On Fri, Aug 02, 2013 at 06:06:33PM +0200, Andrea Arcangeli wrote:
> Prevent the scaling down to reduce the watermarks too much.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4401983..b32ecde 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1665,7 +1665,8 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  		free_pages -= z->free_area[o].nr_free << o;
>  
>  		/* Require fewer higher order pages to be free */
> -		min >>= 1;
> +		if (o < (pageblock_order >> 2))
> +			min >>= 1;
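
For reference, this is roughly the full shape of the loop being
modified (pre-patch, simplified; a sketch of the existing code, not
part of the patch):

	for (o = 0; o < order; o++) {
		/* At the next order, this order's pages become unavailable */
		free_pages -= z->free_area[o].nr_free << o;

		/* Require fewer higher order pages to be free */
		min >>= 1;

		if (free_pages <= min)
			return false;
	}
	return true;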

Okay, bear with me here:

After an allocation, the watermark has to be met, all available pages
considered.  That is reasonable because we don't want to deadlock and
order-0 pages can be served from any page block.

Now say we have an order-2 allocation: in addition to the order-0 view
on the watermark, after the allocation a quarter of the watermark has
to be met with order-2 pages and up.  Okay, maybe we always want a few
< PAGE_ALLOC_COSTLY_ORDER pages at our disposal, who knows.

Then it kind of sizzles out towards higher order pages but it always
requires the remainder to be positive, i.e. at least one page at the
checked order available.  Only the actually requested order is not
checked, so for an order-9 we only make sure that we could serve at
least one order-8 page, maybe more depending on the zone size.

You're proposing to check at least for

  (watermark - min_watermark) >> (pageblock_order >> 2)

worth of order-8 pages instead.

How does any of this make any sense?


* Re: [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks
  2013-08-06 16:08   ` Johannes Weiner
@ 2013-08-06 18:52     ` Johannes Weiner
  2013-08-07 13:18     ` Andrea Arcangeli
  1 sibling, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2013-08-06 18:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini,
	Andrew Morton, Hush Bensen

On Tue, Aug 06, 2013 at 12:08:38PM -0400, Johannes Weiner wrote:
> On Fri, Aug 02, 2013 at 06:06:33PM +0200, Andrea Arcangeli wrote:
> > Prevent the scaling down to reduce the watermarks too much.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  mm/page_alloc.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4401983..b32ecde 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1665,7 +1665,8 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> >  		free_pages -= z->free_area[o].nr_free << o;
> >  
> >  		/* Require fewer higher order pages to be free */
> > -		min >>= 1;
> > +		if (o < (pageblock_order >> 2))
> > +			min >>= 1;
> 
> Okay, bear with me here:
>
> After an allocation, the watermark has to be met, all available pages
> considered.  That is reasonable because we don't want to deadlock and
> order-0 pages can be served from any page block.
> 
> Now say we have an order-2 allocation: in addition to the order-0 view
> on the watermark, after the allocation a quarter of the watermark has
> to be met with order-2 pages and up.  Okay, maybe we always want a few
> < PAGE_ALLOC_COSTLY_ORDER pages at our disposal, who knows.
>
> Then it kind of sizzles out towards higher order pages but it always
> requires the remainder to be positive, i.e. at least one page at the
> checked order available.  Only the actually requested order is not
> checked, so for an order-9 we only make sure that we could serve at
> least one order-8 page, maybe more depending on the zone size.

Argh, brainfart.  It subtracts free order-8 and what's left is order-9
and up.  So it makes sure there are watermark >> 9 worth of order-9
and up page blocks available after the allocation of an order-9 page.

> You're proposing to check at least for
> 
>   (watermark - min_watermark) >> (pageblock_order >> 2)
> 
> worth of order-8 pages instead.

Wouldn't it be enough to meet watermarks from an order-0 perspective,
maybe require *some* buffer up to PAGE_ALLOC_COSTLY_ORDER, and then be
okay with a single order-n page?  Why do we need more than one?

Anyway, I think the changelog should be a bit more verbose about the
motivation ;-)

Thanks!


* Re: [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks
  2013-08-06 16:08   ` Johannes Weiner
  2013-08-06 18:52     ` Johannes Weiner
@ 2013-08-07 13:18     ` Andrea Arcangeli
  1 sibling, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-07 13:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Johannes Weiner, Mel Gorman, Rik van Riel,
	Hugh Dickins, Richard Davies, Shaohua Li, Rafael Aquini,
	Andrew Morton, Hush Bensen

On Tue, Aug 06, 2013 at 12:08:38PM -0400, Johannes Weiner wrote:
> Okay, bear with me here:
> 
> After an allocation, the watermark has to be met, all available pages
> considered.  That is reasonable because we don't want to deadlock and
> order-0 pages can be served from any page block.
> 
> Now say we have an order-2 allocation: in addition to the order-0 view
> on the watermark, after the allocation a quarter of the watermark has
> to be met with order-2 pages and up.  Okay, maybe we always want a few
> < PAGE_ALLOC_COSTLY_ORDER pages at our disposal, who knows.
> 
> Then it kind of sizzles out towards higher order pages but it always
> requires the remainder to be positive, i.e. at least one page at the
> checked order available.  Only the actually requested order is not
> checked, so for an order-9 we only make sure that we could serve at
> least one order-8 page, maybe more depending on the zone size.
> 
> You're proposing to check at least for
> 
>   (watermark - min_watermark) >> (pageblock_order >> 2)
> 
> worth of order-8 pages instead.

worth of order 9 plus all higher order than 9, pages.

> 
> How does any of this make any sense?

The loop removes from the total number of free pages all the free pages
in buddy allocator orders too low to be relevant for an order 9
allocation (orders 0 to 8), so the check is against order 9 pages or
higher orders of free memory.

But it's still good to review this in detail, as this is one tricky
bit of math.

Independently of the above (correcting free_pages to only account for
order 9 or higher), the loop also scales down the watermark that we'll
check the corrected free_pages level against. With this change we now
stop the scaling down once we reach o >= 2 (pageblock_order >> 2 == 2).

So we remain identical to before: we still scale down (shift right)
once for order 1 and twice for order 2. But for all higher orders we
scale down at most twice and no more:

	(mark - min) >> 2

And we set to 0 the wmark of high order pages required for
ALLOC_WMARK_MIN. That means the MIN wmark check will succeed and stop
compaction if there's 1 page of the right order, not more than that
("free_pages <= min" will not succeed the allocation, which means
free_pages <= 0 because min is 0; if instead it's > 0 it means at least
1 order 9 page has been generated, which is what we need at the MIN
level).

About ALLOC_WMARK_LOW, my current normal zone has min 11191 and low
13988. So the amount of free memory in order 9 or higher required in
the zone for the ALLOC_WMARK_LOW check to succeed in compaction now
becomes ((13988-11191)>>2)*4096 = 2.7MB. That means "free_pages >
2.7MB" will succeed with order 9 or higher only with at least 2 pages
generated by compaction, to have a bit more margin than for the min
wmark and for concurrent allocations that could split or allocate the
order 9 page just generated. 2 pages are better than one, and it seems
to make a measurable difference in the allocation success rate here
with all threads allocating from all CPUs at the same time.

Before, the ALLOC_WMARK_LOW check for order 9 or higher was set at
(13988 >> 9)*4096 = 110KB (practically the same as zero, requiring only
one order 9 page), and ALLOC_WMARK_MIN was set at (11191 >> 9)*4096 =
86KB (again practically zero, requiring just 1 page).

In short this should allow at least 2 pages free at LOW for order 9
allocations, and still 1 for MIN like before. But it should be mostly
beneficial in reducing compaction overhead for orders lower than 9,
where previously we would stop compaction much later for MIN and for
LOW, despite nobody (except a few atomic allocations) being able to
access the pages below MIN. We were working for nothing. And
potentially we did more work, yet we had fewer usable pages in between
LOW and MIN that we could really allocate as high order pages than we
do now.

I could have used PAGE_ALLOC_COSTLY_ORDER instead of "pageblock_order
>> 2" but I saw no obvious connection between that value and the fact
that the allocation is more costly. Keeping it as a function of
pageblock_order seemed better because if the pageblock order is bigger
than 9 (signaling a very big hugepage), we keep scaling down a little
more to avoid generating too many very huge ones. It's not black and
white anyway, but it felt better than PAGE_ALLOC_COSTLY_ORDER.

To be sure, I would suggest writing a script that prints the low and
min wmarks to be checked against the high order free_pages, given the
standard low/min wmarks, for all allocation orders from 0 to MAX_ORDER,
with the new and old code, and running it with different min/low wmarks
as generated by default on very different zone sizes.
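
A minimal sketch of such a script, written as a standalone C program;
the watermark numbers are the ones from the normal zone above,
MAX_ORDER and pageblock_order are assumed to be 11 and 9 (x86_64 with
THP), and only the per-order scaling of the mark is modeled, not the
free_pages accounting:

#include <stdio.h>

#define MAX_ORDER	11	/* assumed */
#define PAGEBLOCK_ORDER	9	/* assumed, x86_64 with THP */

/*
 * Scale the starting mark the way the __zone_watermark_ok() loop does:
 * the old code shifts right once per order, the new code stops
 * shifting once o reaches pageblock_order >> 2.
 */
static unsigned long scale(unsigned long mark, int order, int new_code)
{
	int o;

	for (o = 0; o < order; o++)
		if (!new_code || o < (PAGEBLOCK_ORDER >> 2))
			mark >>= 1;
	return mark;
}

int main(void)
{
	/* example min/low watermarks (in pages) from the zone above */
	unsigned long wmark_min = 11191, wmark_low = 13988;
	int order;

	printf("order old_min old_low new_min new_low (pages)\n");
	for (order = 1; order < MAX_ORDER; order++)
		printf("%5d %7lu %7lu %7lu %7lu\n", order,
		       scale(wmark_min, order, 0),
		       scale(wmark_low, order, 0),
		       /* the new code starts from mark - WMARK_MIN */
		       scale(0, order, 1),
		       scale(wmark_low - wmark_min, order, 1));
	return 0;
}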


* Re: [PATCH 1/9] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED
  2013-08-02 16:06 ` [PATCH 1/9] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
  2013-08-05 18:30   ` Johannes Weiner
@ 2013-08-07 15:13   ` Mel Gorman
  1 sibling, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2013-08-07 15:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Fri, Aug 02, 2013 at 06:06:28PM +0200, Andrea Arcangeli wrote:
> Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
> thread allocates memory at the same time, it forces a premature
> allocation into remote NUMA nodes even when there's plenty of clean
> cache to reclaim in the local nodes.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Rafael Aquini <aquini@redhat.com>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

I would have preferred if the changelog included the history of why this
exists at all (prevents excessive reclaim from parallel allocation requests)
and why it should not currently be a problem (SWAP_CLUSTER_MAX should
be strictly obeyed limiting the excessive reclaim to SWAP_CLUSTER_MAX *
nr_parallel_requests). Hopefully we'll remember to connect any bugs about
excessive reclaim + zone_reclaim to this patch :)

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 5/9] mm: compaction: don't require high order pages below min wmark
  2013-08-02 16:06 ` [PATCH 5/9] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
  2013-08-05 18:35   ` Rik van Riel
  2013-08-06 13:23   ` Johannes Weiner
@ 2013-08-07 15:42   ` Mel Gorman
  2013-08-07 16:14     ` Andrea Arcangeli
  2 siblings, 1 reply; 36+ messages in thread
From: Mel Gorman @ 2013-08-07 15:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Fri, Aug 02, 2013 at 06:06:32PM +0200, Andrea Arcangeli wrote:
> The min wmark should be satisfied with just 1 hugepage.

This depends on the size of the machine and if THP is enabled or not
(which adjusts min_free_kbytes).  I expect that it is generally true but
wonder how often it is true on something like ARM which does high-order
allocations for stacks.

> And the other
> wmarks should be adjusted accordingly. We need to succeed the low
> wmark check if there's some significant amount of 0 order pages, but
> we don't need plenty of high order pages because the PF_MEMALLOC paths
> don't require those. Creating a ton of high order pages that cannot be
> allocated by the high order allocation paths (no PF_MEMALLOC) is quite
> > wasteful because they can be split into lower order pages before
> anybody has a chance to allocate them.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/page_alloc.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 092b30d..4401983 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1643,6 +1643,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  
>  	if (free_pages - free_cma <= min + lowmem_reserve)
>  		return false;
> +	if (!order)
> +		return true;
> +
> +	/*
> +	 * Don't require any high order page under the min
> +	 * wmark. Invoking compaction to create lots of high order
> +	 * pages below the min wmark is wasteful because those
> +	 * hugepages cannot be allocated without PF_MEMALLOC and the
> +	 * PF_MEMALLOC paths must not depend on high order allocations
> +	 * to succeed.
> +	 */
> +	min = mark - z->watermark[WMARK_MIN];
> +	WARN_ON(min < 0);

It would be hard to hit but you may be able to trigger this warning if

process a			process b
read min watermark
				increase min_free_kbytes
__zone_watermark_ok




if (min < 0)
	return false;

?

> +	if (alloc_flags & ALLOC_HIGH)
> +		min -= min / 2;
> +	if (alloc_flags & ALLOC_HARDER)
> +		min -= min / 4;
>  	for (o = 0; o < order; o++) {
>  		/* At the next order, this order's pages become unavailable */
>  		free_pages -= z->free_area[o].nr_free << o;

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks
  2013-08-02 16:06 ` [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks Andrea Arcangeli
  2013-08-05 18:36   ` Rik van Riel
  2013-08-06 16:08   ` Johannes Weiner
@ 2013-08-07 15:43   ` Mel Gorman
  2 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2013-08-07 15:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Fri, Aug 02, 2013 at 06:06:33PM +0200, Andrea Arcangeli wrote:
> Prevent the scaling down to reduce the watermarks too much.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4401983..b32ecde 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1665,7 +1665,8 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>  		free_pages -= z->free_area[o].nr_free << o;
>  
>  		/* Require fewer higher order pages to be free */
> -		min >>= 1;
> +		if (o < (pageblock_order >> 2))
> +			min >>= 1;
>  

Why pageblock_order? That thing depends on the huge page size of the
system in question so it'll vary between platforms and kernel configs.
It seems arbitrary but I suspect it happened to work well for THP
allocations on x86_64.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 7/9] mm: zone_reclaim: compaction: export compact_zone_order()
  2013-08-02 16:06 ` [PATCH 7/9] mm: zone_reclaim: compaction: export compact_zone_order() Andrea Arcangeli
  2013-08-05 18:37   ` Rik van Riel
@ 2013-08-07 15:48   ` Mel Gorman
  1 sibling, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2013-08-07 15:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Fri, Aug 02, 2013 at 06:06:34PM +0200, Andrea Arcangeli wrote:
> Needed by zone_reclaim_mode compaction-awareness.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

> @@ -79,6 +82,13 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  	return COMPACT_CONTINUE;
>  }
>  
> +static inline unsigned long compact_zone_order(struct zone *zone,
> +					       int order, gfp_t gfp_mask,
> +					       bool sync, bool *contended)
> +{
> +	return COMPACT_CONTINUE;
> +}
> +

COMPACT_SKIPPED to indicate that compaction did not even start and there
is no point rechecking watermarks or trying to allocate?

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 8/9] mm: zone_reclaim: after a successful zone_reclaim check the min watermark
  2013-08-02 16:06 ` [PATCH 8/9] mm: zone_reclaim: after a successful zone_reclaim check the min watermark Andrea Arcangeli
  2013-08-05 18:38   ` Rik van Riel
@ 2013-08-07 15:53   ` Mel Gorman
  1 sibling, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2013-08-07 15:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Fri, Aug 02, 2013 at 06:06:35PM +0200, Andrea Arcangeli wrote:
> If we're in the fast path and we succeeded zone_reclaim(), it means we
> freed enough memory and we can use the min watermark to have some
> margin against concurrent allocations from other CPUs or interrupts.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

This seems very light on explanation. The fast path is meant to obey the
low watermark and wake up kswapd in the slow path if the watermark cannot
be obeyed. This patch appears to allow a kswapd wakeup to be missed.
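
Schematically (this is not the actual get_page_from_freelist() code,
only a sketch of the flow being questioned, with the zone_reclaim()
return value handling simplified):

	/* fast path, for each zone in the zonelist */
	if (!zone_watermark_ok(zone, order, low_wmark_pages(zone),
			       classzone_idx, alloc_flags)) {
		ret = zone_reclaim(zone, gfp_mask, order);
		if (ret == ZONE_RECLAIM_SUCCESS &&
		    zone_watermark_ok(zone, order, min_wmark_pages(zone),
				      classzone_idx, alloc_flags))
			/* succeeds without reaching the slowpath,
			 * so kswapd is never woken */
			goto try_this_zone;
		/* otherwise may fall back to the slowpath,
		 * which does wake kswapd */
		continue;
	}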

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 5/9] mm: compaction: don't require high order pages below min wmark
  2013-08-07 15:42   ` Mel Gorman
@ 2013-08-07 16:14     ` Andrea Arcangeli
  2013-08-07 16:47       ` Mel Gorman
  0 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-07 16:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

Hi Mel,

On Wed, Aug 07, 2013 at 04:42:01PM +0100, Mel Gorman wrote:
> On Fri, Aug 02, 2013 at 06:06:32PM +0200, Andrea Arcangeli wrote:
> > The min wmark should be satisfied with just 1 hugepage.
> 
> This depends on the size of the machine and if THP is enabled or not
> (which adjusts min_free_kbytes).  I expect that it is generally true but
> wonder how often it is true on something like ARM which does high-order
> allocators for stack.

I doubt ARM is allocating stacks with GFP_ATOMIC, or how could it be
reliable? If it's not an atomic allocation, it should make no
difference as it wouldn't be allowed to eat from the reservation below
MIN anyway; only the area between LOW and MIN matters, no?

> It would be hard to hit but you may be able to trigger this warning if
> 
> process a			process b
> read min watermark
> 				increase min_free_kbytes
> __zone_watermark_ok
> 
> 
> 
> 
> if (min < 0)
> 	return false;
> 
> ?

Correct, this is why it's a WARN_ON and not a BUG_ON. It's signed so
nothing shall go wrong after the warn_on. I just wanted to be sure it
never triggers when people aren't altering min_free_kbytes. If you
prefer to drop it, it's fine though (it never triggered as expected).


* Re: [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
  2013-08-04 16:55   ` Andrea Arcangeli
  2013-08-05 18:38     ` Rik van Riel
@ 2013-08-07 16:18     ` Mel Gorman
  2013-08-07 23:48       ` Andrea Arcangeli
  1 sibling, 1 reply; 36+ messages in thread
From: Mel Gorman @ 2013-08-07 16:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Sun, Aug 04, 2013 at 06:55:26PM +0200, Andrea Arcangeli wrote:
> On Fri, Aug 02, 2013 at 06:06:36PM +0200, Andrea Arcangeli wrote:
> > +		need_compaction = false;
> 
> This should be changed to "*need_compaction = false". It's actually a
> > cleanup because it's a non-operational change at runtime.
> need_compaction was initialized to false by the only caller so it
> couldn't harm. But it's better to fix it to avoid
> confusion. Alternatively the above line can be dropped entirely but I
> > thought it was cleaner to have a defined value as a result of the
> function.
> 
> Found by Fengguang kbuild robot.
> 
> A new replacement patch 9/9 is appended below:
> 
> ===
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: [PATCH] mm: zone_reclaim: compaction: add compaction to
>  zone_reclaim_mode
> 
> > This adds compaction to zone_reclaim so that having THP enabled won't
> > decrease NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.
> 

That is a light explanation.

> It is important to boot with numa_zonelist_order=n (n means nodes) to
> get more accurate NUMA locality if there are multiple zones per node.
> 

This appears to be an unrelated observation.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/swap.h |   8 +++-
>  mm/page_alloc.c      |   4 +-
>  mm/vmscan.c          | 111 ++++++++++++++++++++++++++++++++++++++++++---------
>  3 files changed, 102 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f2ada36..fedb246 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
>
> <SNIP>
>
> @@ -3549,27 +3567,35 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	return sc.nr_reclaimed >= nr_pages;
>  }
>  
> -int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> +static int zone_reclaim_compact(struct zone *preferred_zone,
> +				struct zone *zone, gfp_t gfp_mask,
> +				unsigned int order,
> +				bool sync_compaction,
> +				bool *need_compaction)
>  {
> -	int node_id;
> -	int ret;
> +	bool contended;
>  
> -	/*
> -	 * Zone reclaim reclaims unmapped file backed pages and
> -	 * slab pages if we are over the defined limits.
> -	 *
> -	 * A small portion of unmapped file backed pages is needed for
> -	 * file I/O otherwise pages read by file I/O will be immediately
> -	 * thrown out if the zone is overallocated. So we do not reclaim
> -	 * if less than a specified percentage of the zone is used by
> -	 * unmapped file backed pages.
> -	 */
> -	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> -	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> -		return ZONE_RECLAIM_FULL;
> +	if (compaction_deferred(preferred_zone, order) ||
> +	    !order ||
> +	    (gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
> +		*need_compaction = false;
> +		return COMPACT_SKIPPED;
> +	}
>  
> -	if (zone->all_unreclaimable)
> -		return ZONE_RECLAIM_FULL;
> +	*need_compaction = true;
> +	return compact_zone_order(zone, order,
> +				  gfp_mask,
> +				  sync_compaction,
> +				  &contended);
> +}
> +
> +int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
> +		 gfp_t gfp_mask, unsigned int order,
> +		 unsigned long mark, int classzone_idx, int alloc_flags)
> +{
> +	int node_id;
> +	int ret, c_ret;
> +	bool sync_compaction = false, need_compaction = false;
>  
>  	/*
>  	 * Do not scan if the allocation should not be delayed.
> @@ -3587,7 +3613,56 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
>  		return ZONE_RECLAIM_NOSCAN;
>  
> +repeat_compaction:
> +	/*
> +	 * If this allocation may be satisfied by memory compaction,
> +	 * run compaction before reclaim.
> +	 */
> +	c_ret = zone_reclaim_compact(preferred_zone,
> +				     zone, gfp_mask, order,
> +				     sync_compaction,
> +				     &need_compaction);
> +	if (need_compaction &&
> +	    c_ret != COMPACT_SKIPPED &&

need_compaction records whether compaction was attempted or not. Why
not just check for COMPACT_SKIPPED and have compact_zone_order return
COMPACT_SKIPPED if !CONFIG_COMPACTION?

> +	    zone_watermark_ok(zone, order, mark,
> +			      classzone_idx,
> +			      alloc_flags)) {
> +#ifdef CONFIG_COMPACTION
> +		zone->compact_considered = 0;
> +		zone->compact_defer_shift = 0;
> +#endif
> +		return ZONE_RECLAIM_SUCCESS;
> +	}
> +
> +	/*
> +	 * reclaim if compaction failed because not enough memory was
> +	 * available or if compaction didn't run (order 0) or didn't
> +	 * succeed.
> +	 */
>  	ret = __zone_reclaim(zone, gfp_mask, order);
> +	if (ret == ZONE_RECLAIM_SUCCESS) {
> +		if (zone_watermark_ok(zone, order, mark,
> +				      classzone_idx,
> +				      alloc_flags))
> +			return ZONE_RECLAIM_SUCCESS;
> +
> +		/*
> +		 * If compaction run but it was skipped and reclaim was
> +		 * successful keep going.
> +		 */
> +		if (need_compaction && c_ret == COMPACT_SKIPPED) {

And I recognise that you use need_compaction to see if it had been possible
to attempt compaction before, but the way this is organised it appears that
it is possible to loop until __zone_reclaim fails which could be a lot
of reclaiming. If compaction always makes an attempt but always fails
(pinned/locked pages, fragmentation etc) then potentially we reclaim the
entire zone. I recognise that a single __zone_reclaim pass may not reclaim
enough pages to allow compaction to succeed so a single pass is not the
answer either but I worry that this will create a variation of the bug
where an excessive amount of memory is reclaimed to satisfy a THP
allocation.

> +			/*
> +			 * If it's ok to wait for I/O we can as well run sync
> +			 * compaction
> +			 */
> +			sync_compaction = !!(zone_reclaim_mode &
> +					     (RECLAIM_WRITE|RECLAIM_SWAP));
> +			cond_resched();
> +			goto repeat_compaction;
> +		}
> +	}
> +	if (need_compaction)
> +		defer_compaction(preferred_zone, order);
>  
>  	if (!ret)
>  		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 5/9] mm: compaction: don't require high order pages below min wmark
  2013-08-07 16:14     ` Andrea Arcangeli
@ 2013-08-07 16:47       ` Mel Gorman
  2013-08-08  0:59         ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Mel Gorman @ 2013-08-07 16:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Wed, Aug 07, 2013 at 06:14:37PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Wed, Aug 07, 2013 at 04:42:01PM +0100, Mel Gorman wrote:
> > On Fri, Aug 02, 2013 at 06:06:32PM +0200, Andrea Arcangeli wrote:
> > > The min wmark should be satisfied with just 1 hugepage.
> > 
> > This depends on the size of the machine and if THP is enabled or not
> > (which adjusts min_free_kbytes).  I expect that it is generally true but
> > wonder how often it is true on something like ARM which does high-order
> > allocations for stacks.
> 
> > I doubt ARM is allocating stacks with GFP_ATOMIC, or how could it be
> reliable?

I assumed they were GFP_KERNEL allocations. High-order atomic allocations would
be jumbo frame allocations.

Anyway, my general concern is that min watermark can be smaller than 1
hugepage on some platforms and making assumptions about the watermarks
and their relative size in comparison to THP seems dangerous.  If it is
possible that ((low - min) > requested_size) then a high-order allocation
from the allocator slowpath will be allowed to go below the min
watermark which is not expected.

> If it's not an atomic allocation, it should make no
> difference as it wouldn't be allowed to eat from the reservation below
> MIN anyway, just the area between LOW and MIN matters, no?
> 
> > It would be hard to hit but you may be able to trigger this warning if
> > 
> > process a			process b
> > read min watermark
> > 				increase min_free_kbytes
> > __zone_watermark_ok
> > 
> > 
> > 
> > 
> > if (min < 0)
> > 	return false;
> > 
> > ?
> 
> Correct, this is why it's a WARN_ON and not a BUG_ON.

if (WARN_ON(min < 0))
	return false;

?

Seems odd to just fall through and make decisions based on a negative
watermark. It'll erroneously return true when the revised watermark should
have returned false. As it is likely due to a min_free_kbytes or hot-add
event then it probably does not matter a great deal. I'm not massively
pushed and the WARN_ON is fine if you like. It just struck me as being
strange looking.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
  2013-08-07 16:18     ` Mel Gorman
@ 2013-08-07 23:48       ` Andrea Arcangeli
  2013-08-08  8:22         ` Mel Gorman
  0 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-07 23:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Wed, Aug 07, 2013 at 05:18:37PM +0100, Mel Gorman wrote:
> > It is important to boot with numa_zonelist_order=n (n means nodes) to
> > get more accurate NUMA locality if there are multiple zones per node.
> > 
> 
> This appears to be an unrelated observation.

But things still don't work ok without it. After alloc_batch changes
it matters only in the slowpath but it is still related.

> 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  include/linux/swap.h |   8 +++-
> >  mm/page_alloc.c      |   4 +-
> >  mm/vmscan.c          | 111 ++++++++++++++++++++++++++++++++++++++++++---------
> >  3 files changed, 102 insertions(+), 21 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f2ada36..fedb246 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> >
> > <SNIP>
> >
> > @@ -3549,27 +3567,35 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> >  	return sc.nr_reclaimed >= nr_pages;
> >  }
> >  
> > -int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > +static int zone_reclaim_compact(struct zone *preferred_zone,
> > +				struct zone *zone, gfp_t gfp_mask,
> > +				unsigned int order,
> > +				bool sync_compaction,
> > +				bool *need_compaction)
> >  {
> > -	int node_id;
> > -	int ret;
> > +	bool contended;
> >  
> > -	/*
> > -	 * Zone reclaim reclaims unmapped file backed pages and
> > -	 * slab pages if we are over the defined limits.
> > -	 *
> > -	 * A small portion of unmapped file backed pages is needed for
> > -	 * file I/O otherwise pages read by file I/O will be immediately
> > -	 * thrown out if the zone is overallocated. So we do not reclaim
> > -	 * if less than a specified percentage of the zone is used by
> > -	 * unmapped file backed pages.
> > -	 */
> > -	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> > -	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> > -		return ZONE_RECLAIM_FULL;
> > +	if (compaction_deferred(preferred_zone, order) ||
> > +	    !order ||
> > +	    (gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
> > +		*need_compaction = false;
> > +		return COMPACT_SKIPPED;
> > +	}
> >  
> > -	if (zone->all_unreclaimable)
> > -		return ZONE_RECLAIM_FULL;
> > +	*need_compaction = true;
> > +	return compact_zone_order(zone, order,
> > +				  gfp_mask,
> > +				  sync_compaction,
> > +				  &contended);
> > +}
> > +
> > +int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
> > +		 gfp_t gfp_mask, unsigned int order,
> > +		 unsigned long mark, int classzone_idx, int alloc_flags)
> > +{
> > +	int node_id;
> > +	int ret, c_ret;
> > +	bool sync_compaction = false, need_compaction = false;
> >  
> >  	/*
> >  	 * Do not scan if the allocation should not be delayed.
> > @@ -3587,7 +3613,56 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> >  	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> >  		return ZONE_RECLAIM_NOSCAN;
> >  
> > +repeat_compaction:
> > +	/*
> > +	 * If this allocation may be satisfied by memory compaction,
> > +	 * run compaction before reclaim.
> > +	 */
> > +	c_ret = zone_reclaim_compact(preferred_zone,
> > +				     zone, gfp_mask, order,
> > +				     sync_compaction,
> > +				     &need_compaction);
> > +	if (need_compaction &&
> > +	    c_ret != COMPACT_SKIPPED &&
> 
> need_compaction records whether compaction was attempted or not. Why
> not just check for COMPACT_SKIPPED and have compact_zone_order return
> COMPACT_SKIPPED if !CONFIG_COMPACTION?

How can it be ok that try_to_compact_pages returns COMPACT_CONTINUE
but compact_zone_order returns the opposite? I mean either we change
both or none.

static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
			int order, gfp_t gfp_mask, nodemask_t *nodemask,
			bool sync, bool *contended)
{
	return COMPACT_CONTINUE;
}

static inline unsigned long compact_zone_order(struct zone *zone,
					       int order, gfp_t gfp_mask,
					       bool sync, bool *contended)
{
	return COMPACT_CONTINUE;
}


* Re: [PATCH 5/9] mm: compaction: don't require high order pages below min wmark
  2013-08-07 16:47       ` Mel Gorman
@ 2013-08-08  0:59         ` Andrea Arcangeli
  0 siblings, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2013-08-08  0:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Wed, Aug 07, 2013 at 05:47:41PM +0100, Mel Gorman wrote:
> On Wed, Aug 07, 2013 at 06:14:37PM +0200, Andrea Arcangeli wrote:
> > Hi Mel,
> > 
> > On Wed, Aug 07, 2013 at 04:42:01PM +0100, Mel Gorman wrote:
> > > On Fri, Aug 02, 2013 at 06:06:32PM +0200, Andrea Arcangeli wrote:
> > > > The min wmark should be satisfied with just 1 hugepage.
> > > 
> > > This depends on the size of the machine and if THP is enabled or not
> > > (which adjusts min_free_kbytes).  I expect that it is generally true but
> > > wonder how often it is true on something like ARM which does high-order
> > > allocations for stacks.
> > 
> > I doubt ARM is allocating stacks with GFP_ATOMIC, or how could it be
> > reliable?
> 
> I assumed they were GFP_KERNEL allocations. High-order atomic allocations would
> be jumbo frame allocations.
> 
> Anyway, my general concern is that min watermark can be smaller than 1
> hugepage on some platforms and making assumptions about the watermarks

Without my patch the min wmark is always smaller than 1 hugepage
anyway. I tried to go over the details of this in my previous email to
Johannes.

> and their relative size in comparison to THP seems dangerous.  If it is
> possible that ((low - min) > requested_size) then a high-order allocation
> from the allocator slowpath will be allowed to go below the min
> watermark which is not expected.

I don't see what you mean here with regard to the min watermark and
THP allocations.

Please elaborate: what difference does it make to order 9 THP
allocations if the min is 0 as with my patch, or set to a few dozen
kbytes like in upstream? Does it make any difference? If yes, where?

> if (WARN_ON(min < 0))
> 	return false;
> 
> ?
> 
> Seems odd to just fall through and make decisions based on a negative
> watermark. It'll erroneously return true when the revised watermark should
> have returned false. As it is likely due to a min_free_kbytes or hot-add
> event then it probably does not matter a great deal. I'm not massively
> pushed and the WARN_ON is fine if you like. It just struck me as being
> strange looking.

This is a high order allocation check that has no relevance to the
reliability of PF_MEMALLOC. So I don't think it's worth keeping a
special branch just to be sure to obey the min in the race condition
case with a write to /proc/sys. If we temporarily try to get more
pages, nothing goes wrong like it would with order 0 allocations. We
could actually nuke the WARN_ON as well.


* Re: [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
  2013-08-07 23:48       ` Andrea Arcangeli
@ 2013-08-08  8:22         ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2013-08-08  8:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins,
	Richard Davies, Shaohua Li, Rafael Aquini, Andrew Morton,
	Hush Bensen

On Thu, Aug 08, 2013 at 01:48:00AM +0200, Andrea Arcangeli wrote:
> On Wed, Aug 07, 2013 at 05:18:37PM +0100, Mel Gorman wrote:
> > > It is important to boot with numa_zonelist_order=n (n means nodes) to
> > > get more accurate NUMA locality if there are multiple zones per node.
> > > 
> > 
> > This appears to be an unrelated observation.
> 
> But things still don't work ok without it. After alloc_batch changes
> it matters only in the slowpath but it is still related.
> 

Ok, that's curious in itself but I'm not going to dig into the why.

> > > <SNIP>
> > > @@ -3587,7 +3613,56 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > >  	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> > >  		return ZONE_RECLAIM_NOSCAN;
> > >  
> > > +repeat_compaction:
> > > +	/*
> > > +	 * If this allocation may be satisfied by memory compaction,
> > > +	 * run compaction before reclaim.
> > > +	 */
> > > +	c_ret = zone_reclaim_compact(preferred_zone,
> > > +				     zone, gfp_mask, order,
> > > +				     sync_compaction,
> > > +				     &need_compaction);
> > > +	if (need_compaction &&
> > > +	    c_ret != COMPACT_SKIPPED &&
> > 
> > need_compaction records whether compaction was attempted or not. Why
> > not just check for COMPACT_SKIPPED and have compact_zone_order return
> > COMPACT_SKIPPED if !CONFIG_COMPACTION?
> 
> How can it be ok that try_to_compact_pages returns COMPACT_CONTINUE
> but compact_zone_order returns the opposite?

Good question and I expect it was because the return value of
try_to_compact_pages was never used in the !CONFIG_COMPACTION case and I
did not think it through properly. try_to_compact_pages has only one caller
in the CONFIG_COMPACTION case and zero callers in the !CONFIG_COMPACTION
case, making the return value irrelevant. COMPACT_SKIPPED would still have
been a better choice to indicate "compaction didn't start as it was not
possible or direct reclaim was more suitable".

> I mean either we change both or none.
> 

I think changing both to COMPACT_SKIPPED would be a better fit for the
documented meaning of COMPACT_SKIPPED.
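
I.e., something along these lines for the !CONFIG_COMPACTION stubs
(sketch only, based on the stubs quoted above):

static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
			int order, gfp_t gfp_mask, nodemask_t *nodemask,
			bool sync, bool *contended)
{
	return COMPACT_SKIPPED;
}

static inline unsigned long compact_zone_order(struct zone *zone,
					       int order, gfp_t gfp_mask,
					       bool sync, bool *contended)
{
	return COMPACT_SKIPPED;
}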

-- 
Mel Gorman
SUSE Labs


end of thread

Thread overview: 36+ messages
2013-08-02 16:06 [PATCH 0/9] adding compaction to zone_reclaim_mode v3 Andrea Arcangeli
2013-08-02 16:06 ` [PATCH 1/9] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
2013-08-05 18:30   ` Johannes Weiner
2013-08-07 15:13   ` Mel Gorman
2013-08-02 16:06 ` [PATCH 2/9] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
2013-08-05 18:45   ` Johannes Weiner
2013-08-06  5:50     ` Andrea Arcangeli
2013-08-02 16:06 ` [PATCH 3/9] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
2013-08-05 19:32   ` Johannes Weiner
2013-08-02 16:06 ` [PATCH 4/9] mm: zone_reclaim: compaction: reset before initializing the scan cursors Andrea Arcangeli
2013-08-05 19:44   ` Johannes Weiner
2013-08-02 16:06 ` [PATCH 5/9] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
2013-08-05 18:35   ` Rik van Riel
2013-08-06 13:23   ` Johannes Weiner
2013-08-07 15:42   ` Mel Gorman
2013-08-07 16:14     ` Andrea Arcangeli
2013-08-07 16:47       ` Mel Gorman
2013-08-08  0:59         ` Andrea Arcangeli
2013-08-02 16:06 ` [PATCH 6/9] mm: zone_reclaim: compaction: increase the high order pages in the watermarks Andrea Arcangeli
2013-08-05 18:36   ` Rik van Riel
2013-08-06 16:08   ` Johannes Weiner
2013-08-06 18:52     ` Johannes Weiner
2013-08-07 13:18     ` Andrea Arcangeli
2013-08-07 15:43   ` Mel Gorman
2013-08-02 16:06 ` [PATCH 7/9] mm: zone_reclaim: compaction: export compact_zone_order() Andrea Arcangeli
2013-08-05 18:37   ` Rik van Riel
2013-08-07 15:48   ` Mel Gorman
2013-08-02 16:06 ` [PATCH 8/9] mm: zone_reclaim: after a successful zone_reclaim check the min watermark Andrea Arcangeli
2013-08-05 18:38   ` Rik van Riel
2013-08-07 15:53   ` Mel Gorman
2013-08-02 16:06 ` [PATCH 9/9] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
2013-08-04 16:55   ` Andrea Arcangeli
2013-08-05 18:38     ` Rik van Riel
2013-08-07 16:18     ` Mel Gorman
2013-08-07 23:48       ` Andrea Arcangeli
2013-08-08  8:22         ` Mel Gorman
