* [patch 0/3] mm: improve page aging fairness between zones/nodes
@ 2013-07-19 20:55 ` Johannes Weiner
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-19 20:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Andrea Arcangeli, linux-mm, linux-kernel

The way the page allocator interacts with kswapd creates aging
imbalances, where the amount of time a userspace page spends in memory
under reclaim pressure depends on which zone, and which node, the
allocator took the page frame from.

#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
   nodes falling behind for a full reclaim cycle relative to the other
   nodes in the system

#3 fixes an interaction where kswapd and a continuous stream of page
   allocations keep the preferred zone of a task between the high and
   low watermark (allocations succeed + kswapd does not go to sleep)
   indefinitely, completely underutilizing the lower zones and
   thrashing on the preferred zone

These patches are the aging fairness part of the thrash-detection
based file LRU balancing.  Andrea recommended submitting them
separately, as they are bugfixes in their own right.

The following test ran a foreground workload (memcachetest) with
background IO of various sizes on a 4-node, 8G system (similar results
were observed with single-node 4G systems):

parallelio
                                              BASE                   FAIRALLOC
Ops memcachetest-0M              5170.00 (  0.00%)           5283.00 (  2.19%)
Ops memcachetest-791M            4740.00 (  0.00%)           5293.00 ( 11.67%)
Ops memcachetest-2639M           2551.00 (  0.00%)           4950.00 ( 94.04%)
Ops memcachetest-4487M           2606.00 (  0.00%)           3922.00 ( 50.50%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-791M               55.00 (  0.00%)             18.00 ( 67.27%)
Ops io-duration-2639M             235.00 (  0.00%)            103.00 ( 56.17%)
Ops io-duration-4487M             278.00 (  0.00%)            173.00 ( 37.77%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-791M             245184.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-2639M            468069.00 (  0.00%)         108778.00 ( 76.76%)
Ops swaptotal-4487M            452529.00 (  0.00%)          76623.00 ( 83.07%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-791M                108297.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2639M               169537.00 (  0.00%)          50031.00 ( 70.49%)
Ops swapin-4487M               167435.00 (  0.00%)          34178.00 ( 79.59%)
Ops minorfaults-0M            1518666.00 (  0.00%)        1503993.00 (  0.97%)
Ops minorfaults-791M          1676963.00 (  0.00%)        1520115.00 (  9.35%)
Ops minorfaults-2639M         1606035.00 (  0.00%)        1799717.00 (-12.06%)
Ops minorfaults-4487M         1612118.00 (  0.00%)        1583825.00 (  1.76%)
Ops majorfaults-0M                  6.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-791M            13836.00 (  0.00%)             10.00 ( 99.93%)
Ops majorfaults-2639M           22307.00 (  0.00%)           6490.00 ( 70.91%)
Ops majorfaults-4487M           21631.00 (  0.00%)           4380.00 ( 79.75%)

                BASE   FAIRALLOC
User          287.78      460.97
System       2151.67     3142.51
Elapsed      9737.00     8879.34

                                  BASE   FAIRALLOC
Minor Faults                  53721925    57188551
Major Faults                    392195       15157
Swap Ins                       2994854      112770
Swap Outs                      4907092      134982
Direct pages scanned                 0       41824
Kswapd pages scanned          32975063     8128269
Kswapd pages reclaimed         6323069     7093495
Direct pages reclaimed               0       41824
Kswapd efficiency                  19%         87%
Kswapd velocity               3386.573     915.414
Direct efficiency                 100%        100%
Direct velocity                  0.000       4.710
Percentage direct scans             0%          0%
Zone normal velocity          2011.338     550.661
Zone dma32 velocity           1365.623     369.221
Zone dma velocity                9.612       0.242
Page writes by reclaim    18732404.000  614807.000
Page writes file              13825312      479825
Page writes anon               4907092      134982
Page reclaim immediate           85490        5647
Sector Reads                  12080532      483244
Sector Writes                 88740508    65438876
Page rescued immediate               0           0
Slabs scanned                    82560       12160
Direct inode steals                  0           0
Kswapd inode steals              24401       40013
Kswapd skipped wait                  0           0
THP fault alloc                      6           8
THP collapse alloc                5481        5812
THP splits                          75          22
THP fault fallback                   0           0
THP collapse fail                    0           0
Compaction stalls                    0          54
Compaction success                   0          45
Compaction failures                  0           9
Page migrate success            881492       82278
Page migrate failure                 0           0
Compaction pages isolated            0       60334
Compaction migrate scanned           0       53505
Compaction free scanned              0     1537605
Compaction cost                    914          86
NUMA PTE updates              46738231    41988419
NUMA hint faults              31175564    24213387
NUMA hint local faults        10427393     6411593
NUMA pages migrated             881492       55344
AutoNUMA cost                   156221      121361

The overall runtime was reduced, throughput improved for both the
foreground workload and the background IO, major faults, swapping, and
reclaim activity shrank significantly, and reclaim efficiency more
than quadrupled.
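
For reference, "Kswapd efficiency" above is reclaimed pages divided by
scanned pages, so from the table:

  BASE:       6323069 / 32975063 ~= 19%
  FAIRALLOC:  7093495 /  8128269 ~= 87%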

 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 55 +++++++++++++++++++++++++++++++++++++------------------
 mm/vmscan.c            |  2 +-
 3 files changed, 39 insertions(+), 19 deletions(-)


* [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd
  2013-07-19 20:55 ` Johannes Weiner
@ 2013-07-19 20:55   ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-19 20:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Andrea Arcangeli, linux-mm, linux-kernel

When the page allocator fails to get a page from all zones in its
given zonelist, it wakes up the per-node kswapds for all zones that
are at their low watermark.

However, on a system under load, the free page counters are per-cpu
approximations, and the observed counter value in a zone can fluctuate
enough that the allocation fails while the kswapd wakeup is also
skipped, even though the zone is still very close to the low
watermark.

When one node misses a wakeup like this, it won't be aged before all
the other nodes' zones are down to their low watermarks again.  And
skipping a full aging cycle is an obvious fairness problem.

Kswapd runs until the high watermarks are restored, so it should also
be woken when the high watermarks are not met.  This ages nodes more
equally and creates a safety margin for the page counter fluctuation.

By using zone_balanced(), the wakeup check now also considers, in
addition to the watermark, whether compaction requires more order-0
pages to create a higher-order page.
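
For reference, zone_balanced() in mm/vmscan.c looks roughly like the
following at this point (a paraphrased sketch for illustration, not
part of the diff below):

  static bool zone_balanced(struct zone *zone, int order,
                            unsigned long balance_gap, int classzone_idx)
  {
          /* The high watermark (plus any balance gap) must be met... */
          if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
                                      balance_gap, classzone_idx, 0))
                  return false;

          /*
           * ...and, for higher-order requests, compaction must not be
           * starved of the order-0 pages it needs to work with.
           */
          if (IS_ENABLED(CONFIG_COMPACTION) && order &&
              !compaction_suitable(zone, order))
                  return false;

          return true;
  }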

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e364542..bccc6d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3277,7 +3277,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+	if (zone_balanced(zone, order, 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
1.8.3.2


* [patch 2/3] mm: page_alloc: rearrange watermark checking in get_page_from_freelist
  2013-07-19 20:55 ` Johannes Weiner
@ 2013-07-19 20:55   ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-19 20:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Andrea Arcangeli, linux-mm, linux-kernel

Allocations that do not have to respect the watermarks are rare
high-priority events.  Reorder the code so that per-zone dirty limits,
and future checks that matter only to regular page allocations, are
skipped in these extraordinary situations.
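
For context, the BUILD_BUG_ON() that moves along with the check guards
the watermark-index arithmetic.  The relevant alloc flags are defined
next to this code roughly as follows (paraphrased for illustration,
exact values not guaranteed):

  #define ALLOC_WMARK_MIN         WMARK_MIN
  #define ALLOC_WMARK_LOW         WMARK_LOW
  #define ALLOC_WMARK_HIGH        WMARK_HIGH
  #define ALLOC_NO_WATERMARKS     0x04    /* don't check watermarks at all */
  #define ALLOC_WMARK_MASK        (ALLOC_NO_WATERMARKS-1)

The assertion documents that ALLOC_NO_WATERMARKS does not overlap with
the watermark index bits selected by ALLOC_WMARK_MASK.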

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d9df57d..af1d956b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1867,12 +1867,17 @@ zonelist_scan:
 	 */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						high_zoneidx, nodemask) {
+		unsigned long mark;
+
 		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
+		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
+		if (alloc_flags & ALLOC_NO_WATERMARKS)
+			goto try_this_zone;
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
@@ -1903,16 +1908,11 @@ zonelist_scan:
 		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
 			goto this_zone_full;
 
-		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
-		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
-			unsigned long mark;
+		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+		if (!zone_watermark_ok(zone, order, mark,
+				       classzone_idx, alloc_flags)) {
 			int ret;
 
-			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
-			if (zone_watermark_ok(zone, order, mark,
-				    classzone_idx, alloc_flags))
-				goto try_this_zone;
-
 			if (IS_ENABLED(CONFIG_NUMA) &&
 					!did_zlc_setup && nr_online_nodes > 1) {
 				/*
-- 
1.8.3.2


* [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-19 20:55 ` Johannes Weiner
@ 2013-07-19 20:55   ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-19 20:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Andrea Arcangeli, linux-mm, linux-kernel

Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size.  Otherwise, the time an
individual page gets to stay in memory depends on the zone it happened
to be allocated in.  Asymmetry in the zone aging creates rather
unpredictable aging behavior and results in the wrong pages being
reclaimed, activated etc.

But exactly this happens right now because of the way the page
allocator and kswapd interact.  The page allocator uses per-node lists
of all zones in the system, ordered by preference, when allocating a
new page.  When the first iteration does not yield any results, kswapd
is woken up and the allocator retries.  Because kswapd reclaims a zone
until its high watermark is restored, while a zone can be allocated
from as soon as it is above its low watermark, the allocator may keep
kswapd running while kswapd's reclaim, in turn, ensures that the
allocator can keep allocating from the first zone in the zonelist for
extended periods of time.  Meanwhile the other zones rarely see new
allocations and thus age much more slowly in comparison.

The result is that the occasional page placed in lower zones gets
relatively more time in memory, and may even get promoted to the
active list after its peers have long been evicted.  Meanwhile, the
bulk of the working set may be thrashing on the preferred zone even
though there may be significant amounts of memory available in the
lower zones.

Even the most basic test -- repeatedly reading a file slightly bigger
than memory -- shows how broken the zone aging is.  In this scenario,
no single page should be able to stay in memory long enough to get
referenced twice and activated, but activation happens in spades:

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 0
      nr_active_file 8
      nr_inactive_file 1582
      nr_active_file 11994
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 70
      nr_inactive_file 258753
      nr_active_file 443214
      nr_inactive_file 149793
      nr_active_file 12021

Fix this with a very simple round-robin allocator.  Each zone is
allowed a batch of allocations that is proportional to the zone's
size, after which it is treated as full.  The batch counters are reset
when all zones have been tried and the allocator enters the slowpath
and kicks off kswapd reclaim:

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 174
      nr_active_file 4865
      nr_inactive_file 53
      nr_active_file 860
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 666622
      nr_active_file 4988
      nr_inactive_file 190969
      nr_active_file 937
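
The batching effect can be illustrated with a minimal userspace model
(for illustration only -- not kernel code; the zone sizes and the
batch fraction standing in for the high-low watermark delta are made
up):

  #include <stdio.h>

  struct zone { const char *name; long managed; long batch; long allocated; };

  /* Stand-in for prepare_slowpath(): refill every zone's batch. */
  static void reset_batches(struct zone *z, int nr)
  {
          for (int i = 0; i < nr; i++)
                  z[i].batch = z[i].managed / 8;
  }

  int main(void)
  {
          struct zone zones[] = {
                  { "Normal", 178491, 0, 0 },
                  { "DMA32",  829295, 0, 0 },
          };
          const int nr = sizeof(zones) / sizeof(zones[0]);
          const long requests = 1000000;
          long total = 0;

          for (int i = 0; i < nr; i++)
                  total += zones[i].managed;

          reset_batches(zones, nr);
          for (long r = 0; r < requests; ) {
                  int i;

                  /* Fastpath: take the first zone with batch left. */
                  for (i = 0; i < nr; i++) {
                          if (zones[i].batch > 0) {
                                  zones[i].batch--;       /* atomic_sub() */
                                  zones[i].allocated++;
                                  r++;
                                  break;
                          }
                  }
                  /* All zones treated as full: "slowpath", reset batches. */
                  if (i == nr)
                          reset_batches(zones, nr);
          }

          for (int i = 0; i < nr; i++)
                  printf("%-6s %7ld allocations (%.1f%%), %.1f%% of memory\n",
                         zones[i].name, zones[i].allocated,
                         100.0 * zones[i].allocated / requests,
                         100.0 * zones[i].managed / total);
          return 0;
  }

Running this distributes allocations roughly in proportion to zone
size, which is the property that keeps the per-zone aging speeds
comparable.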

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 39 +++++++++++++++++++++++++++++----------
 2 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index af4a3b7..0c41d59 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -352,6 +352,7 @@ struct zone {
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
+	atomic_t		alloc_batch;
 	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 	/* Set to true when the PG_migrate_skip bits should be cleared */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index af1d956b..d938b67 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1879,6 +1879,14 @@ zonelist_scan:
 		if (alloc_flags & ALLOC_NO_WATERMARKS)
 			goto try_this_zone;
 		/*
+		 * Distribute pages in proportion to the individual
+		 * zone size to ensure fair page aging.  The zone a
+		 * page was allocated in should have no effect on the
+		 * time the page has in memory before being reclaimed.
+		 */
+		if (atomic_read(&zone->alloc_batch) <= 0)
+			continue;
+		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
 		 * limit, such that no single zone holds more than its
@@ -1984,7 +1992,8 @@ this_zone_full:
 		goto zonelist_scan;
 	}
 
-	if (page)
+	if (page) {
+		atomic_sub(1U << order, &zone->alloc_batch);
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -1993,6 +2002,7 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+	}
 
 	return page;
 }
@@ -2342,16 +2352,20 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static inline
-void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
-						enum zone_type high_zoneidx,
-						enum zone_type classzone_idx)
+static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
+			     struct zonelist *zonelist,
+			     enum zone_type high_zoneidx,
+			     enum zone_type classzone_idx)
 {
 	struct zoneref *z;
 	struct zone *zone;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order, classzone_idx);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		atomic_set(&zone->alloc_batch,
+			   high_wmark_pages(zone) - low_wmark_pages(zone));
+		if (!(gfp_mask & __GFP_NO_KSWAPD))
+			wakeup_kswapd(zone, order, classzone_idx);
+	}
 }
 
 static inline int
@@ -2447,9 +2461,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 restart:
-	if (!(gfp_mask & __GFP_NO_KSWAPD))
-		wake_all_kswapd(order, zonelist, high_zoneidx,
-						zone_idx(preferred_zone));
+	prepare_slowpath(gfp_mask, order, zonelist,
+			 high_zoneidx, zone_idx(preferred_zone));
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -4758,6 +4771,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
+		/* For bootup, initialized properly in watermark setup */
+		atomic_set(&zone->alloc_batch, zone->managed_pages);
+
 		zone_pcp_init(zone);
 		lruvec_init(&zone->lruvec);
 		if (!size)
@@ -5533,6 +5549,9 @@ static void __setup_per_zone_wmarks(void)
 		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
 		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
 
+		atomic_set(&zone->alloc_batch,
+			   high_wmark_pages(zone) - low_wmark_pages(zone));
+
 		setup_zone_migrate_reserve(zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
-- 
1.8.3.2


* Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
  2013-07-19 20:55 ` Johannes Weiner
@ 2013-07-22 16:48   ` Zlatko Calusic
  -1 siblings, 0 replies; 66+ messages in thread
From: Zlatko Calusic @ 2013-07-22 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

On 19.07.2013 22:55, Johannes Weiner wrote:
> The way the page allocator interacts with kswapd creates aging
> imbalances, where the amount of time a userspace page gets in memory
> under reclaim pressure is dependent on which zone, which node the
> allocator took the page frame from.
>
> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>     nodes falling behind for a full reclaim cycle relative to the other
>     nodes in the system
>
> #3 fixes an interaction where kswapd and a continuous stream of page
>     allocations keep the preferred zone of a task between the high and
>     low watermark (allocations succeed + kswapd does not go to sleep)
>     indefinitely, completely underutilizing the lower zones and
>     thrashing on the preferred zone
>
> These patches are the aging fairness part of the thrash-detection
> based file LRU balancing.  Andrea recommended to submit them
> separately as they are bugfixes in their own right.
>

I have the patch applied and under testing. So far, so good. It looks
like it could finally fix the bug that I was chasing a few months ago
(nicely described in your bullet #3). But a few more days of testing
will be needed before I can reach a quality verdict.

Good job!
-- 
Zlatko


* Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
  2013-07-22 16:48   ` Zlatko Calusic
@ 2013-07-22 17:01     ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-22 17:01 UTC (permalink / raw)
  To: Zlatko Calusic
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

Hi Zlatko,

On Mon, Jul 22, 2013 at 06:48:52PM +0200, Zlatko Calusic wrote:
> On 19.07.2013 22:55, Johannes Weiner wrote:
> >The way the page allocator interacts with kswapd creates aging
> >imbalances, where the amount of time a userspace page gets in memory
> >under reclaim pressure is dependent on which zone, which node the
> >allocator took the page frame from.
> >
> >#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
> >    nodes falling behind for a full reclaim cycle relative to the other
> >    nodes in the system
> >
> >#3 fixes an interaction where kswapd and a continuous stream of page
> >    allocations keep the preferred zone of a task between the high and
> >    low watermark (allocations succeed + kswapd does not go to sleep)
> >    indefinitely, completely underutilizing the lower zones and
> >    thrashing on the preferred zone
> >
> >These patches are the aging fairness part of the thrash-detection
> >based file LRU balancing.  Andrea recommended to submit them
> >separately as they are bugfixes in their own right.
> >
> 
> I have the patch applied and under testing. So far, so good. It
> looks like it could finally fix the bug that I was chasing few
> months ago (nicely described in your bullet #3). But, few more days
> of testing will be needed before I can reach a quality verdict.

I should have remembered that you talked about this problem... Thanks
a lot for testing!

May I ask for the zone layout of your test machine(s)?  I.e. how many
nodes if NUMA, how big Normal and DMA32 (on Node 0) are.

* Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
  2013-07-22 17:01     ` Johannes Weiner
@ 2013-07-22 17:14     ` Zlatko Calusic
  -1 siblings, 0 replies; 66+ messages in thread
From: Zlatko Calusic @ 2013-07-22 17:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

On 22.07.2013 19:01, Johannes Weiner wrote:
> Hi Zlatko,
>
> On Mon, Jul 22, 2013 at 06:48:52PM +0200, Zlatko Calusic wrote:
>> On 19.07.2013 22:55, Johannes Weiner wrote:
>>> The way the page allocator interacts with kswapd creates aging
>>> imbalances, where the amount of time a userspace page gets in memory
>>> under reclaim pressure is dependent on which zone, which node the
>>> allocator took the page frame from.
>>>
>>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>>>     nodes falling behind for a full reclaim cycle relative to the other
>>>     nodes in the system
>>>
>>> #3 fixes an interaction where kswapd and a continuous stream of page
>>>     allocations keep the preferred zone of a task between the high and
>>>     low watermark (allocations succeed + kswapd does not go to sleep)
>>>     indefinitely, completely underutilizing the lower zones and
>>>     thrashing on the preferred zone
>>>
>>> These patches are the aging fairness part of the thrash-detection
>>> based file LRU balancing.  Andrea recommended to submit them
>>> separately as they are bugfixes in their own right.
>>>
>>
>> I have the patch applied and under testing. So far, so good. It
>> looks like it could finally fix the bug that I was chasing few
>> months ago (nicely described in your bullet #3). But, few more days
>> of testing will be needed before I can reach a quality verdict.
>
> I should have remembered that you talked about this problem... Thanks
> a lot for testing!
>
> May I ask for the zone layout of your test machine(s)?  I.e. how many
> nodes if NUMA, how big Normal and DMA32 (on Node 0) are.
>

I have been reading about NUMA hw for at least a decade, but I guess 
another one will pass before I actually see one. ;) Find /proc/zoneinfo 
attached.

If your patchset fails my case, then nr_{in,}active_file in the Normal
zone will drop close to zero in a matter of days. If it fixes this
particular imbalance, and I have faith it will, then those two counters
will stay in relative balance with nr_{in,}active_anon in the same
zone. I also applied Konstantin's excellent
lru-milestones-timestamps-and-ages patch, with graphing of interesting
numbers on top of that, which is why I already have faith in your
patchset. I can see much better balance between zones already. But
let's give it some more time...

-- 
Zlatko

[-- Attachment #2: zoneinfo --]
[-- Type: text/plain, Size: 3970 bytes --]

Node 0, zone      DMA
  pages free     3975
        min      132
        low      165
        high     198
        scanned  0
        spanned  4095
        present  3998
        managed  3977
    nr_free_pages 3975
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 2
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    nr_dirtied   0
    nr_written   0
    nr_anon_transparent_hugepages 0
    nr_free_cma  0
        protection: (0, 3236, 3933, 3933)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 4
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 4
  all_unreclaimable: 1
  start_pfn:         1
  inactive_ratio:    1
  avg_age_inactive_anon: 0
  avg_age_active_anon:   0
  avg_age_inactive_file: 0
  avg_age_active_file:   0
Node 0, zone    DMA32
  pages free     83177
        min      27693
        low      34616
        high     41539
        scanned  0
        spanned  1044480
        present  847429
        managed  829295
    nr_free_pages 83177
    nr_inactive_anon 2061
    nr_active_anon 313380
    nr_inactive_file 199460
    nr_active_file 207097
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 239688
    nr_mapped    38888
    nr_file_pages 424978
    nr_dirty     87
    nr_writeback 0
    nr_slab_reclaimable 9119
    nr_slab_unreclaimable 2054
    nr_page_table_pages 1795
    nr_kernel_stack 144
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     18421
    nr_dirtied   725414
    nr_written   768505
    nr_anon_transparent_hugepages 112
    nr_free_cma  0
        protection: (0, 0, 697, 697)
  pagesets
    cpu: 0
              count: 132
              high:  186
              batch: 31
  vm stats threshold: 24
    cpu: 1
              count: 146
              high:  186
              batch: 31
  vm stats threshold: 24
  all_unreclaimable: 0
  start_pfn:         4096
  inactive_ratio:    5
  avg_age_inactive_anon: 5467648
  avg_age_active_anon:   5467648
  avg_age_inactive_file: 3184128
  avg_age_active_file:   5467648
Node 0, zone   Normal
  pages free     17164
        min      5965
        low      7456
        high     8947
        scanned  0
        spanned  196607
        present  196607
        managed  178491
    nr_free_pages 17164
    nr_inactive_anon 294
    nr_active_anon 64754
    nr_inactive_file 42191
    nr_active_file 44925
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 51456
    nr_mapped    9580
    nr_file_pages 91492
    nr_dirty     27
    nr_writeback 0
    nr_slab_reclaimable 2686
    nr_slab_unreclaimable 1194
    nr_page_table_pages 401
    nr_kernel_stack 65
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     4376
    nr_dirtied   163250
    nr_written   172369
    nr_anon_transparent_hugepages 18
    nr_free_cma  0
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 177
              high:  186
              batch: 31
  vm stats threshold: 16
    cpu: 1
              count: 170
              high:  186
              batch: 31
  vm stats threshold: 16
  all_unreclaimable: 0
  start_pfn:         1048576
  inactive_ratio:    1
  avg_age_inactive_anon: 5468672
  avg_age_active_anon:   5468672
  avg_age_inactive_file: 3382628
  avg_age_active_file:   5468672

* Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd
  2013-07-19 20:55   ` Johannes Weiner
@ 2013-07-22 19:47     ` Rik van Riel
  -1 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2013-07-22 19:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, linux-kernel

On 07/19/2013 04:55 PM, Johannes Weiner wrote:
> When the page allocator fails to get a page from all zones in its
> given zonelist, it wakes up the per-node kswapds for all zones that
> are at their low watermark.
>
> However, with a system under load and the free page counters being
> per-cpu approximations, the observed counter value in a zone can
> fluctuate enough that the allocation fails but the kswapd wakeup is
> also skipped while the zone is still really close to the low
> watermark.
>
> When one node misses a wakeup like this, it won't be aged before all
> the other node's zones are down to their low watermarks again.  And
> skipping a full aging cycle is an obvious fairness problem.
>
> Kswapd runs until the high watermarks are restored, so it should also
> be woken when the high watermarks are not met.  This ages nodes more
> equally and creates a safety margin for the page counter fluctuation.
>
> By using zone_balanced(), it will now check, in addition to the
> watermark, if compaction requires more order-0 pages to create a
> higher order page.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

This patch alone looks like it could have the effect of increasing
the pressure on the first zone in the zonelist, keeping its free
memory above the low watermark essentially forever, without having
the allocator fall back to other zones.

However, your third patch fixes that problem, and missed wakeups
would still hurt, so...

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

* Re: [patch 2/3] mm: page_alloc: rearrange watermark checking in get_page_from_freelist
  2013-07-19 20:55   ` Johannes Weiner
@ 2013-07-22 19:51     ` Rik van Riel
  -1 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2013-07-22 19:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, linux-kernel

On 07/19/2013 04:55 PM, Johannes Weiner wrote:
> Allocations that do not have to respect the watermarks are rare
> high-priority events.  Reorder the code such that per-zone dirty
> limits and future checks important only to regular page allocations
> are ignored in these extraordinary situations.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

* Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd
  2013-07-22 19:47     ` Rik van Riel
@ 2013-07-22 20:14       ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-22 20:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, linux-kernel

On Mon, Jul 22, 2013 at 03:47:03PM -0400, Rik van Riel wrote:
> On 07/19/2013 04:55 PM, Johannes Weiner wrote:
> >When the page allocator fails to get a page from all zones in its
> >given zonelist, it wakes up the per-node kswapds for all zones that
> >are at their low watermark.
> >
> >However, with a system under load and the free page counters being
> >per-cpu approximations, the observed counter value in a zone can
> >fluctuate enough that the allocation fails but the kswapd wakeup is
> >also skipped while the zone is still really close to the low
> >watermark.
> >
> >When one node misses a wakeup like this, it won't be aged before all
> >the other node's zones are down to their low watermarks again.  And
> >skipping a full aging cycle is an obvious fairness problem.
> >
> >Kswapd runs until the high watermarks are restored, so it should also
> >be woken when the high watermarks are not met.  This ages nodes more
> >equally and creates a safety margin for the page counter fluctuation.
> >
> >By using zone_balanced(), it will now check, in addition to the
> >watermark, if compaction requires more order-0 pages to create a
> >higher order page.
> >
> >Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> This patch alone looks like it could have the effect of increasing
> the pressure on the first zone in the zonelist, keeping its free
> memory above the low watermark essentially forever, without having
> the allocator fall back to other zones.
> 
> However, your third patch fixes that problem, and missed wakeups
> would still hurt, so...

The kswapd wakeups happen in the slowpath, after the fastpath tried
all zones in the zonelist, not just the first one.

With the problem fixed in #3, the slowpath is rarely entered (even
when kswapds should be woken).  From that point of view, the effects
of #1 are further improved by #3, but #1 on its own does not worsen
the situation.
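
For reference, the gist of #1 is the wakeup condition itself; sketched from
the changelog rather than quoted from the actual diff against
wakeup_kswapd() in mm/vmscan.c:

	/*
	 * Old early return: skip the wakeup as soon as the zone meets its
	 * LOW watermark -- which per-cpu counter drift can make true even
	 * though the allocation just failed:
	 *
	 *	if (zone_watermark_ok_safe(zone, order,
	 *				   low_wmark_pages(zone), 0, 0))
	 *		return;
	 *
	 * New early return: only skip the wakeup when the zone is balanced
	 * against its HIGH watermark, the same target kswapd reclaims
	 * towards; zone_balanced() also factors in whether compaction
	 * needs more order-0 pages:
	 */
	if (zone_balanced(zone, order, 0, 0))
		return;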

> Reviewed-by: Rik van Riel <riel@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-19 20:55   ` Johannes Weiner
@ 2013-07-22 20:21     ` Rik van Riel
  -1 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2013-07-22 20:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, linux-kernel

On 07/19/2013 04:55 PM, Johannes Weiner wrote:

> @@ -1984,7 +1992,8 @@ this_zone_full:
>   		goto zonelist_scan;
>   	}
>
> -	if (page)
> +	if (page) {
> +		atomic_sub(1U << order, &zone->alloc_batch);
>   		/*
>   		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
>   		 * necessary to allocate the page. The expectation is

Could this be moved into the slow path in buffered_rmqueue and
rmqueue_bulk, or would the effect of ignoring the pcp buffers be
too detrimental to keeping the balance between zones?

It would be kind of nice to not have this atomic operation on every
page allocation...

As a side benefit, higher-order buffered_rmqueue and rmqueue_bulk
both happen under the zone->lock, so moving this accounting down
to that layer might allow you to get rid of the atomics altogether.
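
Purely to illustrate the suggestion (hypothetical, not part of the posted
patch): rmqueue_bulk() already runs under zone->lock, so the accounting
could in principle become a plain field updated during the pcp refill,
along these lines:

/* HYPOTHETICAL sketch of the suggestion above, not the posted patch */
static int rmqueue_bulk(struct zone *zone, unsigned int order,
			unsigned long count, struct list_head *list,
			int migratetype, int cold)
{
	int i;

	spin_lock(&zone->lock);
	for (i = 0; i < count; ++i) {
		struct page *page = __rmqueue(zone, order, migratetype);

		if (unlikely(page == NULL))
			break;
		/* ... link the page into *list as before ... */
	}
	/*
	 * Charge the whole refill against the zone's allocation batch
	 * (hypothetical: alloc_batch as a plain long, serialized by
	 * zone->lock instead of being an atomic_t).
	 */
	zone->alloc_batch -= i << order;
	spin_unlock(&zone->lock);
	return i;
}

Johannes explains in his reply why accounting only at pcp-refill time may
not be enough.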

I like the overall approach though. This is something Linux has needed
for a long time, and could be extremely useful to automatic NUMA
balancing as well...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-22 20:21     ` Rik van Riel
@ 2013-07-22 21:04       ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-22 21:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, linux-kernel

On Mon, Jul 22, 2013 at 04:21:07PM -0400, Rik van Riel wrote:
> On 07/19/2013 04:55 PM, Johannes Weiner wrote:
> 
> >@@ -1984,7 +1992,8 @@ this_zone_full:
> >  		goto zonelist_scan;
> >  	}
> >
> >-	if (page)
> >+	if (page) {
> >+		atomic_sub(1U << order, &zone->alloc_batch);
> >  		/*
> >  		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> >  		 * necessary to allocate the page. The expectation is
> 
> Could this be moved into the slow path in buffered_rmqueue and
> rmqueue_bulk, or would the effect of ignoring the pcp buffers be
> too detrimental to keeping the balance between zones?

What I'm worried about is not the inaccuracy that comes from the buffer
size but the fact that there are no guaranteed buffer empty+refill
cycles.  The reclaimer could end up feeding the pcp list that the
allocator is using indefinitely, which brings us back to the original
problem.  If you have >= NR_CPU jobs running, the kswapds are bound to
share CPUs with the allocating tasks, so the scenario is not unlikely.

> It would be kind of nice to not have this atomic operation on every
> page allocation...

No argument there ;)

> I like the overall approach though. This is something Linux has needed
> for a long time, and could be extremely useful to automatic NUMA
> balancing as well...
> 
> -- 
> All rights reversed

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-22 21:04       ` Johannes Weiner
@ 2013-07-22 22:48         ` Rik van Riel
  -1 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2013-07-22 22:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, linux-kernel

On 07/22/2013 05:04 PM, Johannes Weiner wrote:
> On Mon, Jul 22, 2013 at 04:21:07PM -0400, Rik van Riel wrote:
>> On 07/19/2013 04:55 PM, Johannes Weiner wrote:
>>
>>> @@ -1984,7 +1992,8 @@ this_zone_full:
>>>   		goto zonelist_scan;
>>>   	}
>>>
>>> -	if (page)
>>> +	if (page) {
>>> +		atomic_sub(1U << order, &zone->alloc_batch);
>>>   		/*
>>>   		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
>>>   		 * necessary to allocate the page. The expectation is
>>
>> Could this be moved into the slow path in buffered_rmqueue and
>> rmqueue_bulk, or would the effect of ignoring the pcp buffers be
>> too detrimental to keeping the balance between zones?
>
> What I'm worried about is not the inaccuracy that comes from the buffer
> size but the fact that there are no guaranteed buffer empty+refill
> cycles.  The reclaimer could end up feeding the pcp list that the
> allocator is using indefinitely, which brings us back to the original
> problem.  If you have >= NR_CPU jobs running, the kswapds are bound to
> share CPUs with the allocating tasks, so the scenario is not unlikely.

You are absolutely right.  Thinking about it some more,
I cannot think of a better way to do this than your patch.

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
  2013-07-22 16:48   ` Zlatko Calusic
@ 2013-07-24 11:18   ` Zlatko Calusic
  2013-07-24 12:46       ` Hush Bensen
  2013-07-31  9:33       ` Zlatko Calusic
  -1 siblings, 2 replies; 66+ messages in thread
From: Zlatko Calusic @ 2013-07-24 11:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2134 bytes --]

On 22.07.2013 18:48, Zlatko Calusic wrote:
> On 19.07.2013 22:55, Johannes Weiner wrote:
>> The way the page allocator interacts with kswapd creates aging
>> imbalances, where the amount of time a userspace page gets in memory
>> under reclaim pressure is dependent on which zone, which node the
>> allocator took the page frame from.
>>
>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>>     nodes falling behind for a full reclaim cycle relative to the other
>>     nodes in the system
>>
>> #3 fixes an interaction where kswapd and a continuous stream of page
>>     allocations keep the preferred zone of a task between the high and
>>     low watermark (allocations succeed + kswapd does not go to sleep)
>>     indefinitely, completely underutilizing the lower zones and
>>     thrashing on the preferred zone
>>
>> These patches are the aging fairness part of the thrash-detection
>> based file LRU balancing.  Andrea recommended to submit them
>> separately as they are bugfixes in their own right.
>>
>
> I have the patch applied and under testing. So far, so good. It looks
> like it could finally fix the bug that I was chasing few months ago
> (nicely described in your bullet #3). But, few more days of testing will
> be needed before I can reach a quality verdict.
>

Well, only 2 days later it's already obvious that the patch is perfect! :)

In the attached image, in the left column are the graphs covering the last 
day and a half. It can be observed that the zones are really balanced, and 
that aging is practically perfect. The graphs in the right column cover the 
last 10-day period, and the left side of the upper graph shows how it would 
look with the stock kernel after about 20 days of uptime (although only a 
few days is enough to reach such an imbalance). File pages in the Normal 
zone are an extinct species (red) and the zone is chock-full of anon pages 
(blue). Having seen a lot of these graphs, I'm certain that it won't 
happen anymore with your patch applied. The balance is restored! Thank 
you for your work. Feel free to add:

Tested-by: Zlatko Calusic <zcalusic@bitsync.net>

Regards,
-- 
Zlatko

[-- Attachment #2: screenshot1.png --]
[-- Type: image/png, Size: 79572 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
  2013-07-24 11:18   ` Zlatko Calusic
@ 2013-07-24 12:46       ` Hush Bensen
  2013-07-31  9:33       ` Zlatko Calusic
  1 sibling, 0 replies; 66+ messages in thread
From: Hush Bensen @ 2013-07-24 12:46 UTC (permalink / raw)
  To: Zlatko Calusic, Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

On 2013/7/24 19:18, Zlatko Calusic wrote:
> On 22.07.2013 18:48, Zlatko Calusic wrote:
>> On 19.07.2013 22:55, Johannes Weiner wrote:
>>> The way the page allocator interacts with kswapd creates aging
>>> imbalances, where the amount of time a userspace page gets in memory
>>> under reclaim pressure is dependent on which zone, which node the
>>> allocator took the page frame from.
>>>
>>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>>> nodes falling behind for a full reclaim cycle relative to the other
>>> nodes in the system
>>>
>>> #3 fixes an interaction where kswapd and a continuous stream of page
>>> allocations keep the preferred zone of a task between the high and
>>> low watermark (allocations succeed + kswapd does not go to sleep)
>>> indefinitely, completely underutilizing the lower zones and
>>> thrashing on the preferred zone
>>>
>>> These patches are the aging fairness part of the thrash-detection
>>> based file LRU balancing. Andrea recommended to submit them
>>> separately as they are bugfixes in their own right.
>>>
>>
>> I have the patch applied and under testing. So far, so good. It looks
>> like it could finally fix the bug that I was chasing few months ago
>> (nicely described in your bullet #3). But, few more days of testing will
>> be needed before I can reach a quality verdict.
>>
>
> Well, only 2 days later it's already obvious that the patch is 
> perfect! :)
>
> In the attached image, in the left column are the graphs covering the 
> last day and a half. It can be observed that the zones are really 
> balanced, and that aging is practically perfect. The graphs in the right 
> column cover the last 10-day period, and the left side of the upper graph 
> shows how it would look with the stock kernel after about 20 days of 
> uptime (although only a few days is enough to reach such an imbalance). 
> File pages in the Normal zone are an extinct species (red) and the zone 
> is chock-full of anon pages (blue). Having seen a lot of these graphs, 
> I'm certain that it won't happen anymore with your patch applied. The 
> balance is restored! Thank you for your work. Feel free to add:
>
> Tested-by: Zlatko Calusic <zcalusic@bitsync.net>

Thanks for your testing, Zlatko. Could you tell me which benchmark or 
workload you are using? Btw, which tool did you use to draw these nice 
pictures? ;-)

>
> Regards,


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
  2013-07-24 12:46       ` Hush Bensen
@ 2013-07-24 13:59         ` Zlatko Calusic
  -1 siblings, 0 replies; 66+ messages in thread
From: Zlatko Calusic @ 2013-07-24 13:59 UTC (permalink / raw)
  To: Hush Bensen
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Rik van Riel,
	Andrea Arcangeli, linux-mm, linux-kernel

On 24.07.2013 14:46, Hush Bensen wrote:
> On 2013/7/24 19:18, Zlatko Calusic wrote:
>> On 22.07.2013 18:48, Zlatko Calusic wrote:
>>> On 19.07.2013 22:55, Johannes Weiner wrote:
>>>> The way the page allocator interacts with kswapd creates aging
>>>> imbalances, where the amount of time a userspace page gets in memory
>>>> under reclaim pressure is dependent on which zone, which node the
>>>> allocator took the page frame from.
>>>>
>>>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>>>> nodes falling behind for a full reclaim cycle relative to the other
>>>> nodes in the system
>>>>
>>>> #3 fixes an interaction where kswapd and a continuous stream of page
>>>> allocations keep the preferred zone of a task between the high and
>>>> low watermark (allocations succeed + kswapd does not go to sleep)
>>>> indefinitely, completely underutilizing the lower zones and
>>>> thrashing on the preferred zone
>>>>
>>>> These patches are the aging fairness part of the thrash-detection
>>>> based file LRU balancing. Andrea recommended to submit them
>>>> separately as they are bugfixes in their own right.
>>>>
>>>
>>> I have the patch applied and under testing. So far, so good. It looks
>>> like it could finally fix the bug that I was chasing few months ago
>>> (nicely described in your bullet #3). But, few more days of testing will
>>> be needed before I can reach a quality verdict.
>>>
>>
>> Well, only 2 days later it's already obvious that the patch is
>> perfect! :)
>>
>> In the attached image, in the left column are the graphs covering the
>> last day and a half. It can be observed that the zones are really
>> balanced, and that aging is practically perfect. The graphs in the right
>> column cover the last 10-day period, and the left side of the upper graph
>> shows how it would look with the stock kernel after about 20 days of
>> uptime (although only a few days is enough to reach such an imbalance).
>> File pages in the Normal zone are an extinct species (red) and the zone
>> is chock-full of anon pages (blue). Having seen a lot of these graphs,
>> I'm certain that it won't happen anymore with your patch applied. The
>> balance is restored! Thank you for your work. Feel free to add:
>>
>> Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
>
> Thanks for your testing, Zlatko. Could you tell me which benchmark or
> workload you are using? Btw, which tool did you use to draw these nice
> pictures? ;-)
>

The workload is mixed (various services, light load). What generates the 
biggest I/O load is the backup procedure that runs every evening. The 
graphs are home-made: a little bit of rrd, a little bit of perl, nothing 
too complex. I'm actually slowly getting rid of these extra graphs, because 
I used them only for debugging this specific problem, which is now fixed 
thanks to Johannes.

-- 
Zlatko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-22 20:21     ` Rik van Riel
@ 2013-07-25  6:50       ` Paul Bolle
  -1 siblings, 0 replies; 66+ messages in thread
From: Paul Bolle @ 2013-07-25  6:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, linux-kernel

On 07/23/2013 04:21 AM, Rik van Riel wrote:
> On 07/19/2013 04:55 PM, Johannes Weiner wrote:
>
>> @@ -1984,7 +1992,8 @@ this_zone_full:
>>           goto zonelist_scan;
>>       }
>>
>> -    if (page)
>> +    if (page) {
>> +        atomic_sub(1U << order, &zone->alloc_batch);
>>           /*
>>            * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
>>            * necessary to allocate the page. The expectation is
>
> Could this be moved into the slow path in buffered_rmqueue and
> rmqueue_bulk, or would the effect of ignoring the pcp buffers be
> too detrimental to keeping the balance between zones?
>
> It would be kind of nice to not have this atomic operation on every
> page allocation...

Will an atomic operation lock the cache line or the memory bus? And will 
cmpxchg lock the cache line or the memory bus? ;-)

>
> As a side benefit, higher-order buffered_rmqueue and rmqueue_bulk
> both happen under the zone->lock, so moving this accounting down
> to that layer might allow you to get rid of the atomics altogether.
>
> I like the overall approach though. This is something Linux has needed
> for a long time, and could be extremely useful to automatic NUMA
> balancing as well...
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-25  6:50       ` Paul Bolle
@ 2013-07-25 15:10         ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-25 15:10 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Rik van Riel, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	Paul Bolle, linux-mm, linux-kernel

Hi Paul Bolle^W^W Sam Ben^W^W Hush Bensen^W^W Mtrr Patt^W^W Ric
Mason^W^W Will Huck^W^W Simon Jeons^W^W Jaeguk Hanse^W^W Ni zhan
Chen^W^W^W Wanpeng Li

[ I Cc'd Paul Bolle at pebolle@tiscali.nl as well, his English was
  better from there ]

On Thu, Jul 25, 2013 at 02:50:54PM +0800, Paul Bolle wrote:
> On 07/23/2013 04:21 AM, Rik van Riel wrote:
> >On 07/19/2013 04:55 PM, Johannes Weiner wrote:
> >
> >>@@ -1984,7 +1992,8 @@ this_zone_full:
> >>          goto zonelist_scan;
> >>      }
> >>
> >>-    if (page)
> >>+    if (page) {
> >>+        atomic_sub(1U << order, &zone->alloc_batch);
> >>          /*
> >>           * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> >>           * necessary to allocate the page. The expectation is
> >
> >Could this be moved into the slow path in buffered_rmqueue and
> >rmqueue_bulk, or would the effect of ignoring the pcp buffers be
> >too detrimental to keeping the balance between zones?
> >
> >It would be kind of nice to not have this atomic operation on every
> >page allocation...
> 
> Will an atomic operation lock the cache line or the memory bus? And will
> cmpxchg lock the cache line or the memory bus? ;-)

Sure, why not ;-)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-25 15:10         ` Johannes Weiner
@ 2013-07-25 15:20           ` Paul Bolle
  -1 siblings, 0 replies; 66+ messages in thread
From: Paul Bolle @ 2013-07-25 15:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Paul Bolle, Rik van Riel, Andrew Morton, Mel Gorman,
	Andrea Arcangeli, linux-mm, linux-kernel

On Thu, 2013-07-25 at 11:10 -0400, Johannes Weiner wrote:
> [ I Cc'd Paul Bolle at pebolle@tiscali.nl as well, his English was
>   better from there ]

Namespace collision! That's not me!

Could my double perhaps use another name, like say, Paul Bollee? Or
something even less confusing?

Thanks!


Paul Bolle


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
  2013-07-19 20:55 ` Johannes Weiner
@ 2013-07-26 22:45   ` Andrew Morton
  -1 siblings, 0 replies; 66+ messages in thread
From: Andrew Morton @ 2013-07-26 22:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Rik van Riel, Andrea Arcangeli, linux-mm, linux-kernel

On Fri, 19 Jul 2013 16:55:22 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> The way the page allocator interacts with kswapd creates aging
> imbalances, where the amount of time a userspace page gets in memory
> under reclaim pressure is dependent on which zone, which node the
> allocator took the page frame from.
> 
> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>    nodes falling behind for a full reclaim cycle relative to the other
>    nodes in the system
> 
> #3 fixes an interaction where kswapd and a continuous stream of page
>    allocations keep the preferred zone of a task between the high and
>    low watermark (allocations succeed + kswapd does not go to sleep)
>    indefinitely, completely underutilizing the lower zones and
>    thrashing on the preferred zone
> 
> These patches are the aging fairness part of the thrash-detection
> based file LRU balancing.  Andrea recommended to submit them
> separately as they are bugfixes in their own right.
> 
> The following test ran a foreground workload (memcachetest) with
> background IO of various sizes on a 4 node 8G system (similar results
> were observed with single-node 4G systems):
> 
> parallelio
>                                                BAS                    FAIRALLO
>                                               BASE                   FAIRALLOC
> Ops memcachetest-0M              5170.00 (  0.00%)           5283.00 (  2.19%)
> Ops memcachetest-791M            4740.00 (  0.00%)           5293.00 ( 11.67%)
> Ops memcachetest-2639M           2551.00 (  0.00%)           4950.00 ( 94.04%)
> Ops memcachetest-4487M           2606.00 (  0.00%)           3922.00 ( 50.50%)
> Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
> Ops io-duration-791M               55.00 (  0.00%)             18.00 ( 67.27%)
> Ops io-duration-2639M             235.00 (  0.00%)            103.00 ( 56.17%)
> Ops io-duration-4487M             278.00 (  0.00%)            173.00 ( 37.77%)
> Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
> Ops swaptotal-791M             245184.00 (  0.00%)              0.00 (  0.00%)
> Ops swaptotal-2639M            468069.00 (  0.00%)         108778.00 ( 76.76%)
> Ops swaptotal-4487M            452529.00 (  0.00%)          76623.00 ( 83.07%)
> Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-791M                108297.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-2639M               169537.00 (  0.00%)          50031.00 ( 70.49%)
> Ops swapin-4487M               167435.00 (  0.00%)          34178.00 ( 79.59%)
> Ops minorfaults-0M            1518666.00 (  0.00%)        1503993.00 (  0.97%)
> Ops minorfaults-791M          1676963.00 (  0.00%)        1520115.00 (  9.35%)
> Ops minorfaults-2639M         1606035.00 (  0.00%)        1799717.00 (-12.06%)
> Ops minorfaults-4487M         1612118.00 (  0.00%)        1583825.00 (  1.76%)
> Ops majorfaults-0M                  6.00 (  0.00%)              0.00 (  0.00%)
> Ops majorfaults-791M            13836.00 (  0.00%)             10.00 ( 99.93%)
> Ops majorfaults-2639M           22307.00 (  0.00%)           6490.00 ( 70.91%)
> Ops majorfaults-4487M           21631.00 (  0.00%)           4380.00 ( 79.75%)

A reminder whether positive numbers are good or bad would be useful ;)

>                  BAS    FAIRALLO
>                 BASE   FAIRALLOC
> User          287.78      460.97
> System       2151.67     3142.51
> Elapsed      9737.00     8879.34

Confused.  Why would the amount of user time increase so much?

And that's a tremendous increase in system time.  Am I interpreting
this correctly?
 
>                                    BAS    FAIRALLO
>                                   BASE   FAIRALLOC
> Minor Faults                  53721925    57188551
> Major Faults                    392195       15157
> Swap Ins                       2994854      112770
> Swap Outs                      4907092      134982
> Direct pages scanned                 0       41824
> Kswapd pages scanned          32975063     8128269
> Kswapd pages reclaimed         6323069     7093495
> Direct pages reclaimed               0       41824
> Kswapd efficiency                  19%         87%
> Kswapd velocity               3386.573     915.414
> Direct efficiency                 100%        100%
> Direct velocity                  0.000       4.710
> Percentage direct scans             0%          0%
> Zone normal velocity          2011.338     550.661
> Zone dma32 velocity           1365.623     369.221
> Zone dma velocity                9.612       0.242
> Page writes by reclaim    18732404.000  614807.000
> Page writes file              13825312      479825
> Page writes anon               4907092      134982
> Page reclaim immediate           85490        5647
> Sector Reads                  12080532      483244
> Sector Writes                 88740508    65438876
> Page rescued immediate               0           0
> Slabs scanned                    82560       12160
> Direct inode steals                  0           0
> Kswapd inode steals              24401       40013
> Kswapd skipped wait                  0           0
> THP fault alloc                      6           8
> THP collapse alloc                5481        5812
> THP splits                          75          22
> THP fault fallback                   0           0
> THP collapse fail                    0           0
> Compaction stalls                    0          54
> Compaction success                   0          45
> Compaction failures                  0           9
> Page migrate success            881492       82278
> Page migrate failure                 0           0
> Compaction pages isolated            0       60334
> Compaction migrate scanned           0       53505
> Compaction free scanned              0     1537605
> Compaction cost                    914          86
> NUMA PTE updates              46738231    41988419
> NUMA hint faults              31175564    24213387
> NUMA hint local faults        10427393     6411593
> NUMA pages migrated             881492       55344
> AutoNUMA cost                   156221      121361

Some nice numbers there.

> The overall runtime was reduced, throughput for both the foreground
> workload as well as the background IO improved, major faults, swapping
> and reclaim activity shrunk significantly, reclaim efficiency more
> than quadrupled.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd
  2013-07-19 20:55   ` Johannes Weiner
@ 2013-07-26 22:53     ` Andrew Morton
  -1 siblings, 0 replies; 66+ messages in thread
From: Andrew Morton @ 2013-07-26 22:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Rik van Riel, Andrea Arcangeli, linux-mm, linux-kernel

On Fri, 19 Jul 2013 16:55:23 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> When the page allocator fails to get a page from all zones in its
> given zonelist, it wakes up the per-node kswapds for all zones that
> are at their low watermark.
> 
> However, with a system under load and the free page counters being
> per-cpu approximations, the observed counter value in a zone can
> fluctuate enough that the allocation fails but the kswapd wakeup is
> also skipped while the zone is still really close to the low
> watermark.
> 
> When one node misses a wakeup like this, it won't be aged before all
> the other node's zones are down to their low watermarks again.  And
> skipping a full aging cycle is an obvious fairness problem.
> 
> Kswapd runs until the high watermarks are restored, so it should also
> be woken when the high watermarks are not met.  This ages nodes more
> equally and creates a safety margin for the page counter fluctuation.

Well yes, but what guarantee is there that the per-cpu counter error
problem is reliably fixed?  AFAICT this patch "fixes" it because the
gap between the low and high watermarks happens to be larger than the
per-cpu counter fluctuation, yes?  If so, there are surely all sorts of
situations where it will break again.

To fix this reliably, we should be looking at constraining counter
batch sizes or performing a counter summation to get the more accurate
estimate?
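
As context, the kernel already has a summation helper of this kind; roughly
(reconstructed from memory, so treat it as a sketch rather than the exact
source):

/* include/linux/vmstat.h -- approximate */
static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	/* fold in the per-cpu deltas that have not been flushed yet */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
#endif
	return x;
}

If memory serves, zone_watermark_ok_safe() already falls back to such a
snapshot for NR_FREE_PAGES once a zone drops below its percpu_drift_mark,
so the wakeup path could lean on the same mechanism if counter drift turns
out to be the real problem.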


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
  2013-07-26 22:45   ` Andrew Morton
@ 2013-07-26 23:14     ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-26 23:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Andrea Arcangeli, linux-mm, linux-kernel

On Fri, Jul 26, 2013 at 03:45:33PM -0700, Andrew Morton wrote:
> On Fri, 19 Jul 2013 16:55:22 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > The way the page allocator interacts with kswapd creates aging
> > imbalances, where the amount of time a userspace page gets in memory
> > under reclaim pressure is dependent on which zone, which node the
> > allocator took the page frame from.
> > 
> > #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
> >    nodes falling behind for a full reclaim cycle relative to the other
> >    nodes in the system
> > 
> > #3 fixes an interaction where kswapd and a continuous stream of page
> >    allocations keep the preferred zone of a task between the high and
> >    low watermark (allocations succeed + kswapd does not go to sleep)
> >    indefinitely, completely underutilizing the lower zones and
> >    thrashing on the preferred zone
> > 
> > These patches are the aging fairness part of the thrash-detection
> > based file LRU balancing.  Andrea recommended to submit them
> > separately as they are bugfixes in their own right.
> > 
> > The following test ran a foreground workload (memcachetest) with
> > background IO of various sizes on a 4 node 8G system (similar results
> > were observed with single-node 4G systems):
> > 
> > parallelio
> >                                                BAS                    FAIRALLO
> >                                               BASE                   FAIRALLOC
> > Ops memcachetest-0M              5170.00 (  0.00%)           5283.00 (  2.19%)
> > Ops memcachetest-791M            4740.00 (  0.00%)           5293.00 ( 11.67%)
> > Ops memcachetest-2639M           2551.00 (  0.00%)           4950.00 ( 94.04%)
> > Ops memcachetest-4487M           2606.00 (  0.00%)           3922.00 ( 50.50%)
> > Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
> > Ops io-duration-791M               55.00 (  0.00%)             18.00 ( 67.27%)
> > Ops io-duration-2639M             235.00 (  0.00%)            103.00 ( 56.17%)
> > Ops io-duration-4487M             278.00 (  0.00%)            173.00 ( 37.77%)
> > Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
> > Ops swaptotal-791M             245184.00 (  0.00%)              0.00 (  0.00%)
> > Ops swaptotal-2639M            468069.00 (  0.00%)         108778.00 ( 76.76%)
> > Ops swaptotal-4487M            452529.00 (  0.00%)          76623.00 ( 83.07%)
> > Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
> > Ops swapin-791M                108297.00 (  0.00%)              0.00 (  0.00%)
> > Ops swapin-2639M               169537.00 (  0.00%)          50031.00 ( 70.49%)
> > Ops swapin-4487M               167435.00 (  0.00%)          34178.00 ( 79.59%)
> > Ops minorfaults-0M            1518666.00 (  0.00%)        1503993.00 (  0.97%)
> > Ops minorfaults-791M          1676963.00 (  0.00%)        1520115.00 (  9.35%)
> > Ops minorfaults-2639M         1606035.00 (  0.00%)        1799717.00 (-12.06%)
> > Ops minorfaults-4487M         1612118.00 (  0.00%)        1583825.00 (  1.76%)
> > Ops majorfaults-0M                  6.00 (  0.00%)              0.00 (  0.00%)
> > Ops majorfaults-791M            13836.00 (  0.00%)             10.00 ( 99.93%)
> > Ops majorfaults-2639M           22307.00 (  0.00%)           6490.00 ( 70.91%)
> > Ops majorfaults-4487M           21631.00 (  0.00%)           4380.00 ( 79.75%)
> 
> A reminder whether positive numbers are good or bad would be useful ;)

It depends on the datapoint, but a positive percentage number is an
improvement, a negative one a regression.

> >                  BAS    FAIRALLO
> >                 BASE   FAIRALLOC
> > User          287.78      460.97
> > System       2151.67     3142.51
> > Elapsed      9737.00     8879.34
> 
> Confused.  Why would the amount of user time increase so much?
> 
> And that's a tremendous increase in system time.  Am I interpreting
> this correctly?

It is because each memcachetest is running for a fixed duration (only
the background IO is fixed in size).  The time memcachetest previously
spent waiting on major faults is now spent doing actual work (more
user time, more syscalls).  The number of operations memcachetest
could actually perform increased.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-19 20:55   ` Johannes Weiner
@ 2013-07-29 17:48     ` Andrea Arcangeli
  -1 siblings, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2013-07-29 17:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-mm, linux-kernel

Hi Johannes,

On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index af1d956b..d938b67 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1879,6 +1879,14 @@ zonelist_scan:
>  		if (alloc_flags & ALLOC_NO_WATERMARKS)
>  			goto try_this_zone;
>  		/*
> +		 * Distribute pages in proportion to the individual
> +		 * zone size to ensure fair page aging.  The zone a
> +		 * page was allocated in should have no effect on the
> +		 * time the page has in memory before being reclaimed.
> +		 */
> +		if (atomic_read(&zone->alloc_batch) <= 0)
> +			continue;
> +		/*
>  		 * When allocating a page cache page for writing, we
>  		 * want to get it from a zone that is within its dirty
>  		 * limit, such that no single zone holds more than its

I rebased the zone_reclaim_mode and compaction fixes on top of the
zone fair allocator (it applied without rejects, luckily), but the above
breaks zone_reclaim_mode (it regresses for pagecache too, which
currently works), so in turn my THP/compaction tests break too.

zone_reclaim_mode isn't LRU-fair, and cannot be... (even migrating
cache around nodes to try to keep LRU fairness would not be worth it,
especially with SSDs). But we can still increase the fairness within
the zones of the current node (for those nodes that have more than one
zone).

I think to fix it we need an additional first pass of the fast path:
if alloc_batch is <= 0 for any zone in the current node, we forbid
allocating from the zones not in the current node (even if alloc_batch
would allow it) during that first pass, but only if zone_reclaim_mode
is enabled. If the first pass fails, we need to reset alloc_batch for
all zones in the current node (and only in the current node), goto
zonelist_scan and continue as we do now.
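
A rough sketch of that two-pass idea, purely to illustrate the control
flow; the first_pass variable and its placement are made up, this is not
a tested patch, and it leaves out the refinement of only restricting the
scan once a local zone has actually exhausted its batch:

	bool first_pass = zone_reclaim_mode != 0;

zonelist_scan:
	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
					nodemask) {
		/* pass 1: stay on the local node even if remote batches allow more */
		if (first_pass &&
		    zone_to_nid(zone) != zone_to_nid(preferred_zone))
			continue;
		if (atomic_read(&zone->alloc_batch) <= 0)
			continue;
		/* ... existing watermark/dirty checks, try_this_zone ... */
	}

	if (!page && first_pass) {
		/* pass 1 failed: reset the local node's batches, rescan normally */
		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
			if (zone_to_nid(zone) == zone_to_nid(preferred_zone))
				atomic_set(&zone->alloc_batch,
					   high_wmark_pages(zone) -
					   low_wmark_pages(zone));
		first_pass = false;
		goto zonelist_scan;
	}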

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-29 17:48     ` Andrea Arcangeli
@ 2013-07-29 22:24       ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-29 22:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-mm, linux-kernel

On Mon, Jul 29, 2013 at 07:48:21PM +0200, Andrea Arcangeli wrote:
> Hi Johannes,
> 
> On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index af1d956b..d938b67 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1879,6 +1879,14 @@ zonelist_scan:
> >  		if (alloc_flags & ALLOC_NO_WATERMARKS)
> >  			goto try_this_zone;
> >  		/*
> > +		 * Distribute pages in proportion to the individual
> > +		 * zone size to ensure fair page aging.  The zone a
> > +		 * page was allocated in should have no effect on the
> > +		 * time the page has in memory before being reclaimed.
> > +		 */
> > +		if (atomic_read(&zone->alloc_batch) <= 0)
> > +			continue;
> > +		/*
> >  		 * When allocating a page cache page for writing, we
> >  		 * want to get it from a zone that is within its dirty
> >  		 * limit, such that no single zone holds more than its
> 
> I rebased the zone_reclaim_mode and compaction fixes on top of the
> zone fair allocator (it applied without rejects, luckily), but the above
> breaks zone_reclaim_mode (it regresses for pagecache too, which
> currently works), so in turn my THP/compaction tests break too.

Ah yeah, we spill too eagerly to other nodes when we should try to
reclaim the local one first.

> zone_reclaim_mode isn't LRU-fair, and cannot be... (even migrating
> cache around nodes to try to keep LRU fairness would not be worth it,
> especially with SSDs). But we can still increase the fairness within
> the zones of the current node (for those nodes that have more than one
> zone).
> 
> I think to fix it we need an additional first pass of the fast path:
> if alloc_batch is <= 0 for any zone in the current node, we forbid
> allocating from the zones not in the current node (even if alloc_batch
> would allow it) during that first pass, but only if zone_reclaim_mode
> is enabled. If the first pass fails, we need to reset alloc_batch for
> all zones in the current node (and only in the current node), goto
> zonelist_scan and continue as we do now.

How sensible are the various settings of zone_reclaim_mode and the way
zone reclaim is invoked right now?

zone_reclaim_mode == 1 tries to reclaim clean page cache in the
preferred zone before falling back to other zones.  Great, kswapd also
tries clean cache first and avoids writeout and swapping as long as
possible.  And if zone_reclaim() fails, kswapd has to be woken up
anyway because filling remote zones without reclaiming them is hardly
sustainable.  Using kswapd would have the advantage that it reclaims
the whole local node and not just the first zone in it, which would
make much more sense in the first place.

Same for zone_reclaim_mode at higher settings.

So could we cut all this short and just restrict any allocation with
zone_reclaim_mode != 0 to zones in reclaim distance, and if that
fails, wake kswapd and enter the slowpath? (ALLOC_WMARK_LOW marks the
fast path)

It would be even cooler to remove the zone_reclaim() call further
down, but that might be too much reliance on kswapd...

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f03d2f2..8ddf9ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1875,6 +1875,10 @@ zonelist_scan:
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (alloc_flags & ALLOC_NO_WATERMARKS)
 			goto try_this_zone;
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    zone_reclaim_mode &&
+		    !zone_allows_reclaim(preferred_zone, zone))
+			continue;
 		/*
 		 * Distribute pages in proportion to the individual
 		 * zone size to ensure fair page aging.  The zone a


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd
  2013-07-26 22:53     ` Andrew Morton
@ 2013-07-30 17:45       ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-30 17:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Andrea Arcangeli, linux-mm, linux-kernel

On Fri, Jul 26, 2013 at 03:53:19PM -0700, Andrew Morton wrote:
> On Fri, 19 Jul 2013 16:55:23 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > When the page allocator fails to get a page from all zones in its
> > given zonelist, it wakes up the per-node kswapds for all zones that
> > are at their low watermark.
> > 
> > However, with a system under load and the free page counters being
> > per-cpu approximations, the observed counter value in a zone can
> > fluctuate enough that the allocation fails but the kswapd wakeup is
> > also skipped while the zone is still really close to the low
> > watermark.
> > 
> > When one node misses a wakeup like this, it won't be aged before all
> > the other node's zones are down to their low watermarks again.  And
> > skipping a full aging cycle is an obvious fairness problem.
> > 
> > Kswapd runs until the high watermarks are restored, so it should also
> > be woken when the high watermarks are not met.  This ages nodes more
> > equally and creates a safety margin for the page counter fluctuation.
> 
> Well yes, but what guarantee is there that the per-cpu counter error
> problem is reliably fixed?  AFAICT this patch "fixes" it because the
> gap between the low and high watermarks happens to be larger than the
> per-cpu counter fluctuation, yes?  If so, there are surely all sorts of
> situations where it will break again.
> 
> To fix this reliably, we should be looking at constraining counter
> batch sizes or performing a counter summation to get the more accurate
> estimate?

Thinking about this some more, the per-cpu fluctuation appears to be a
red herring in this case.  Kswapd wakeup uses the safe version of the
watermark checking code.  A percpu inaccuracy would result in the
allocator waking kswapd a little too early or too late, but once it
decides to wake it, we are getting an accurate picture on all zones.
The safe watermark checks are designed to catch percpu inaccuracies
between the low and the min watermark, the distance between low and
high is twice as big, so we should be good.

The fluctuation that makes individual zones miss full aging cycles
comes from true free page variation under load when the NOT OK
threshold for the allocator is the same value as the OK threshold for
skipping kswapd wakeups.
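
For reference, the safe check falls back to a summed per-cpu snapshot of
NR_FREE_PAGES once the zone's free count drops below its
percpu_drift_mark; roughly, as a sketch rather than a verbatim copy:

	bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
				    int classzone_idx, int alloc_flags)
	{
		long free_pages = zone_page_state(z, NR_FREE_PAGES);

		/* close to the watermark: pay for an accurate summation */
		if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
			free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

		return __zone_watermark_ok(z, order, mark, classzone_idx,
					   alloc_flags, free_pages);
	}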

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 0/3] mm: improve page aging fairness between zones/nodes
  2013-07-24 11:18   ` Zlatko Calusic
@ 2013-07-31  9:33       ` Zlatko Calusic
  2013-07-31  9:33       ` Zlatko Calusic
  1 sibling, 0 replies; 66+ messages in thread
From: Zlatko Calusic @ 2013-07-31  9:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

On 24.07.2013 13:18, Zlatko Calusic wrote:
> On 22.07.2013 18:48, Zlatko Calusic wrote:
>> On 19.07.2013 22:55, Johannes Weiner wrote:
>>> The way the page allocator interacts with kswapd creates aging
>>> imbalances, where the amount of time a userspace page gets in memory
>>> under reclaim pressure is dependent on which zone, which node the
>>> allocator took the page frame from.
>>>
>>> #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
>>>     nodes falling behind for a full reclaim cycle relative to the other
>>>     nodes in the system
>>>
>>> #3 fixes an interaction where kswapd and a continuous stream of page
>>>     allocations keep the preferred zone of a task between the high and
>>>     low watermark (allocations succeed + kswapd does not go to sleep)
>>>     indefinitely, completely underutilizing the lower zones and
>>>     thrashing on the preferred zone
>>>
>>> These patches are the aging fairness part of the thrash-detection
>>> based file LRU balancing.  Andrea recommended to submit them
>>> separately as they are bugfixes in their own right.
>>>
>>
>> I have the patch applied and under testing. So far, so good. It looks
>> like it could finally fix the bug that I was chasing a few months ago
>> (nicely described in your bullet #3). But a few more days of testing will
>> be needed before I can reach a quality verdict.
>>
>
> Well, only 2 days later it's already obvious that the patch is perfect! :)
>

Additionally, on the patched kernel kswapd burns 30% fewer CPU cycles.
Nice to see that the restored balance also eases kswapd's job, but that
was to be expected. Measured on the real workload, twice, to be sure.

Regards,
-- 
Zlatko


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd
  2013-07-30 17:45       ` Johannes Weiner
@ 2013-07-31 12:43         ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-07-31 12:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Andrea Arcangeli, linux-mm, linux-kernel

On Tue, Jul 30, 2013 at 01:45:39PM -0400, Johannes Weiner wrote:
> On Fri, Jul 26, 2013 at 03:53:19PM -0700, Andrew Morton wrote:
> > On Fri, 19 Jul 2013 16:55:23 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > > When the page allocator fails to get a page from all zones in its
> > > given zonelist, it wakes up the per-node kswapds for all zones that
> > > are at their low watermark.
> > > 
> > > However, with a system under load and the free page counters being
> > > per-cpu approximations, the observed counter value in a zone can
> > > fluctuate enough that the allocation fails but the kswapd wakeup is
> > > also skipped while the zone is still really close to the low
> > > watermark.
> > > 
> > > When one node misses a wakeup like this, it won't be aged before all
> > > the other node's zones are down to their low watermarks again.  And
> > > skipping a full aging cycle is an obvious fairness problem.
> > > 
> > > Kswapd runs until the high watermarks are restored, so it should also
> > > be woken when the high watermarks are not met.  This ages nodes more
> > > equally and creates a safety margin for the page counter fluctuation.
> > 
> > Well yes, but what guarantee is there that the per-cpu counter error
> > problem is reliably fixed?  AFAICT this patch "fixes" it because the
> > gap between the low and high watermarks happens to be larger than the
> > per-cpu counter fluctuation, yes?  If so, there are surely all sorts of
> > situations where it will break again.
> > 
> > To fix this reliably, we should be looking at constraining counter
> > batch sizes or performing a counter summation to get the more accurate
> > estimate?
> 
> Thinking about this some more, the per-cpu fluctuation appears to be a
> red herring in this case.  Kswapd wakeup uses the safe version of the
> watermark checking code.  A percpu inaccuracy would result in the
> allocator waking kswapd a little too early or too late, but once it
> decides to wake it, we are getting an accurate picture on all zones.
> The safe watermark checks are designed to catch percpu inaccuracies
> between the low and the min watermark, the distance between low and
> high is twice as big, so we should be good.
> 
> The fluctuation that makes individual zones miss full aging cycles
> comes from true free page variation under load when the NOT OK
> threshold for the allocator is the same value as the OK threshold for
> skipping kswapd wakeups.

Andrew, could you please update the patch to this new changelog with
the percpu counter drivel removed?  Thanks!

---
When the page allocator fails to get a page from all zones in its
given zonelist, it wakes up the per-node kswapds for all zones that
are at their low watermark.

However, with a system under load the free pages in a zone can
fluctuate enough that the allocation fails but the kswapd wakeup is
also skipped while the zone is still really close to the low
watermark.

When one node misses a wakeup like this, it won't be aged before all
the other node's zones are down to their low watermarks again.  And
skipping a full aging cycle is an obvious fairness problem.

Kswapd runs until the high watermarks are restored, so it should also
be woken when the high watermarks are not met.  This ages nodes more
equally and creates a safety margin for the page counter fluctuation.

By using zone_balanced(), it will now check, in addition to the
watermark, if compaction requires more order-0 pages to create a
higher order page.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
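
For context, zone_balanced() in mm/vmscan.c does roughly the following;
paraphrased sketch, not a verbatim copy:

	static bool zone_balanced(struct zone *zone, int order,
				  unsigned long balance_gap, int classzone_idx)
	{
		/* the zone must clear its high watermark (plus any balance gap) */
		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
					    balance_gap, classzone_idx, 0))
			return false;

		/* for order > 0, compaction must see enough order-0 pages to run */
		if (IS_ENABLED(CONFIG_COMPACTION) && order &&
		    compaction_suitable(zone, order) == COMPACT_SKIPPED)
			return false;

		return true;
	}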

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-07-19 20:55   ` Johannes Weiner
@ 2013-08-01  2:56     ` Minchan Kim
  -1 siblings, 0 replies; 66+ messages in thread
From: Minchan Kim @ 2013-08-01  2:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

Hi Hannes,

On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> Each zone that holds userspace pages of one workload must be aged at a
> speed proportional to the zone size.  Otherwise, the time an
> individual page gets to stay in memory depends on the zone it happened
> to be allocated in.  Asymmetry in the zone aging creates rather
> unpredictable aging behavior and results in the wrong pages being
> reclaimed, activated etc.
> 
> But exactly this happens right now because of the way the page
> allocator and kswapd interact.  The page allocator uses per-node lists
> of all zones in the system, ordered by preference, when allocating a
> new page.  When the first iteration does not yield any results, kswapd
> is woken up and the allocator retries.  Due to the way kswapd reclaims
> zones below the high watermark while a zone can be allocated from when
> it is above the low watermark, the allocator may keep kswapd running
> while kswapd reclaim ensures that the page allocator can keep
> allocating from the first zone in the zonelist for extended periods of
> time.  Meanwhile the other zones rarely see new allocations and thus
> get aged much slower in comparison.
> 
> The result is that the occasional page placed in lower zones gets
> relatively more time in memory, even get promoted to the active list
> after its peers have long been evicted.  Meanwhile, the bulk of the
> working set may be thrashing on the preferred zone even though there
> may be significant amounts of memory available in the lower zones.
> 
> Even the most basic test -- repeatedly reading a file slightly bigger
> than memory -- shows how broken the zone aging is.  In this scenario,
> no single page should be able stay in memory long enough to get
> referenced twice and activated, but activation happens in spades:
> 
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_inactive_file 0
>       nr_active_file 8
>       nr_inactive_file 1582
>       nr_active_file 11994
>   $ cat data data data data >/dev/null
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 70
>       nr_inactive_file 258753
>       nr_active_file 443214
>       nr_inactive_file 149793
>       nr_active_file 12021
> 
> Fix this with a very simple round robin allocator.  Each zone is
> allowed a batch of allocations that is proportional to the zone's
> size, after which it is treated as full.  The batch counters are reset
> when all zones have been tried and the allocator enters the slowpath
> and kicks off kswapd reclaim:
> 
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_inactive_file 174
>       nr_active_file 4865
>       nr_inactive_file 53
>       nr_active_file 860
>   $ cat data data data data >/dev/null
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_inactive_file 666622
>       nr_active_file 4988
>       nr_inactive_file 190969
>       nr_active_file 937

First of all, I appreciate your great work!
It's amazing, and I saw that Zlatko confirmed it helps real workloads.
Thanks, Zlatko, too!

I don't want to block merging, but I think we should at least discuss
some issues.

The concern I have is that it could aggravate lowmem pinning problems
like mlock. I don't have a workload myself that pins lots of pages,
but such problems are why we introduced lowmem_reserve_ratio, as you
know well, so we should at least cover this issue.

My other concern is the overhead added to the fast path. We are
usually reluctant to add even a simple "if" condition to the fastpath,
but this adds an atomic op for every allocated page and enters the
slowpath whenever every given zone's batch count has dropped to zero.
Yes, it's not really the slow path, because prepare_slowpath() just
resets the batch counts and we return to normal without calling any of
the really slow functions.

I think it's a tradeoff, and I am biased toward your approach even if
we lose a little performance, because fair aging should make up for the
fastpath overhead. But who knows? Someone may have a concern.

So we should at least mention such problems.

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c        | 39 +++++++++++++++++++++++++++++----------
>  2 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index af4a3b7..0c41d59 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -352,6 +352,7 @@ struct zone {
>  	 * free areas of different sizes
>  	 */
>  	spinlock_t		lock;
> +	atomic_t		alloc_batch;
>  	int                     all_unreclaimable; /* All pages pinned */
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  	/* Set to true when the PG_migrate_skip bits should be cleared */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index af1d956b..d938b67 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1879,6 +1879,14 @@ zonelist_scan:
>  		if (alloc_flags & ALLOC_NO_WATERMARKS)
>  			goto try_this_zone;
>  		/*
> +		 * Distribute pages in proportion to the individual
> +		 * zone size to ensure fair page aging.  The zone a
> +		 * page was allocated in should have no effect on the
> +		 * time the page has in memory before being reclaimed.
> +		 */
> +		if (atomic_read(&zone->alloc_batch) <= 0)
> +			continue;
> +		/*
>  		 * When allocating a page cache page for writing, we
>  		 * want to get it from a zone that is within its dirty
>  		 * limit, such that no single zone holds more than its
> @@ -1984,7 +1992,8 @@ this_zone_full:
>  		goto zonelist_scan;
>  	}
>  
> -	if (page)
> +	if (page) {
> +		atomic_sub(1U << order, &zone->alloc_batch);
>  		/*
>  		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
>  		 * necessary to allocate the page. The expectation is
> @@ -1993,6 +2002,7 @@ this_zone_full:
>  		 * for !PFMEMALLOC purposes.
>  		 */
>  		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
> +	}
>  
>  	return page;
>  }
> @@ -2342,16 +2352,20 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
>  
> -static inline
> -void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> -						enum zone_type high_zoneidx,
> -						enum zone_type classzone_idx)
> +static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
> +			     struct zonelist *zonelist,
> +			     enum zone_type high_zoneidx,
> +			     enum zone_type classzone_idx)
>  {
>  	struct zoneref *z;
>  	struct zone *zone;
>  
> -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> -		wakeup_kswapd(zone, order, classzone_idx);
> +	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> +		atomic_set(&zone->alloc_batch,
> +			   high_wmark_pages(zone) - low_wmark_pages(zone));
> +		if (!(gfp_mask & __GFP_NO_KSWAPD))
> +			wakeup_kswapd(zone, order, classzone_idx);
> +	}
>  }
>  
>  static inline int
> @@ -2447,9 +2461,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto nopage;
>  
>  restart:
> -	if (!(gfp_mask & __GFP_NO_KSWAPD))
> -		wake_all_kswapd(order, zonelist, high_zoneidx,
> -						zone_idx(preferred_zone));
> +	prepare_slowpath(gfp_mask, order, zonelist,
> +			 high_zoneidx, zone_idx(preferred_zone));
>  
>  	/*
>  	 * OK, we're below the kswapd watermark and have kicked background
> @@ -4758,6 +4771,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  		zone_seqlock_init(zone);
>  		zone->zone_pgdat = pgdat;
>  
> +		/* For bootup, initialized properly in watermark setup */
> +		atomic_set(&zone->alloc_batch, zone->managed_pages);
> +
>  		zone_pcp_init(zone);
>  		lruvec_init(&zone->lruvec);
>  		if (!size)
> @@ -5533,6 +5549,9 @@ static void __setup_per_zone_wmarks(void)
>  		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
>  		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
>  
> +		atomic_set(&zone->alloc_batch,
> +			   high_wmark_pages(zone) - low_wmark_pages(zone));
> +
>  		setup_zone_migrate_reserve(zone);
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	}
> -- 
> 1.8.3.2
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
@ 2013-08-01  2:56     ` Minchan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Minchan Kim @ 2013-08-01  2:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

Hi Hannes,

On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> Each zone that holds userspace pages of one workload must be aged at a
> speed proportional to the zone size.  Otherwise, the time an
> individual page gets to stay in memory depends on the zone it happened
> to be allocated in.  Asymmetry in the zone aging creates rather
> unpredictable aging behavior and results in the wrong pages being
> reclaimed, activated etc.
> 
> But exactly this happens right now because of the way the page
> allocator and kswapd interact.  The page allocator uses per-node lists
> of all zones in the system, ordered by preference, when allocating a
> new page.  When the first iteration does not yield any results, kswapd
> is woken up and the allocator retries.  Due to the way kswapd reclaims
> zones below the high watermark while a zone can be allocated from when
> it is above the low watermark, the allocator may keep kswapd running
> while kswapd reclaim ensures that the page allocator can keep
> allocating from the first zone in the zonelist for extended periods of
> time.  Meanwhile the other zones rarely see new allocations and thus
> get aged much slower in comparison.
> 
> The result is that the occasional page placed in lower zones gets
> relatively more time in memory, even get promoted to the active list
> after its peers have long been evicted.  Meanwhile, the bulk of the
> working set may be thrashing on the preferred zone even though there
> may be significant amounts of memory available in the lower zones.
> 
> Even the most basic test -- repeatedly reading a file slightly bigger
> than memory -- shows how broken the zone aging is.  In this scenario,
> no single page should be able stay in memory long enough to get
> referenced twice and activated, but activation happens in spades:
> 
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_inactive_file 0
>       nr_active_file 8
>       nr_inactive_file 1582
>       nr_active_file 11994
>   $ cat data data data data >/dev/null
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 70
>       nr_inactive_file 258753
>       nr_active_file 443214
>       nr_inactive_file 149793
>       nr_active_file 12021
> 
> Fix this with a very simple round robin allocator.  Each zone is
> allowed a batch of allocations that is proportional to the zone's
> size, after which it is treated as full.  The batch counters are reset
> when all zones have been tried and the allocator enters the slowpath
> and kicks off kswapd reclaim:
> 
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_inactive_file 174
>       nr_active_file 4865
>       nr_inactive_file 53
>       nr_active_file 860
>   $ cat data data data data >/dev/null
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_inactive_file 666622
>       nr_active_file 4988
>       nr_inactive_file 190969
>       nr_active_file 937

First of all, I should appreciate your great work!
It's amazing and I saw Zlatko proved it enhances real works.
Thanks Zlatko, too!

So, I don't want to prevent merging but I think at least, we should
discuss some issues.

The concern I have is that it could accelerate low memory pinning
problems like mlock. Actually, I don't have such workload that makes
pin lots of pages but that's why we introduced lowmem_reserve_ratio,
as you know well so we should cover this issue, at least.

Other thing of my concerns is to add overhead in fast path.
Sometime, we are really reluctant to add simple even "if" condition
in fastpath but you are adding atomic op whenever page is allocated and
enter slowpath whenever all of given zones's batchcount is zero.
Yes, it's not really slow path because it could return to normal status
without calling significant slow functions by reset batchcount of
prepare_slowpath.

I think it's tradeoff and I am biased your approach although we would
lose a little performance because fair aging would recover the loss by
fastpath's overhead. But who knows? Someone has a concern.

So we should mention about such problems.

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c        | 39 +++++++++++++++++++++++++++++----------
>  2 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index af4a3b7..0c41d59 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -352,6 +352,7 @@ struct zone {
>  	 * free areas of different sizes
>  	 */
>  	spinlock_t		lock;
> +	atomic_t		alloc_batch;
>  	int                     all_unreclaimable; /* All pages pinned */
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  	/* Set to true when the PG_migrate_skip bits should be cleared */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index af1d956b..d938b67 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1879,6 +1879,14 @@ zonelist_scan:
>  		if (alloc_flags & ALLOC_NO_WATERMARKS)
>  			goto try_this_zone;
>  		/*
> +		 * Distribute pages in proportion to the individual
> +		 * zone size to ensure fair page aging.  The zone a
> +		 * page was allocated in should have no effect on the
> +		 * time the page has in memory before being reclaimed.
> +		 */
> +		if (atomic_read(&zone->alloc_batch) <= 0)
> +			continue;
> +		/*
>  		 * When allocating a page cache page for writing, we
>  		 * want to get it from a zone that is within its dirty
>  		 * limit, such that no single zone holds more than its
> @@ -1984,7 +1992,8 @@ this_zone_full:
>  		goto zonelist_scan;
>  	}
>  
> -	if (page)
> +	if (page) {
> +		atomic_sub(1U << order, &zone->alloc_batch);
>  		/*
>  		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
>  		 * necessary to allocate the page. The expectation is
> @@ -1993,6 +2002,7 @@ this_zone_full:
>  		 * for !PFMEMALLOC purposes.
>  		 */
>  		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
> +	}
>  
>  	return page;
>  }
> @@ -2342,16 +2352,20 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
>  
> -static inline
> -void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> -						enum zone_type high_zoneidx,
> -						enum zone_type classzone_idx)
> +static void prepare_slowpath(gfp_t gfp_mask, unsigned int order,
> +			     struct zonelist *zonelist,
> +			     enum zone_type high_zoneidx,
> +			     enum zone_type classzone_idx)
>  {
>  	struct zoneref *z;
>  	struct zone *zone;
>  
> -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> -		wakeup_kswapd(zone, order, classzone_idx);
> +	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> +		atomic_set(&zone->alloc_batch,
> +			   high_wmark_pages(zone) - low_wmark_pages(zone));
> +		if (!(gfp_mask & __GFP_NO_KSWAPD))
> +			wakeup_kswapd(zone, order, classzone_idx);
> +	}
>  }
>  
>  static inline int
> @@ -2447,9 +2461,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto nopage;
>  
>  restart:
> -	if (!(gfp_mask & __GFP_NO_KSWAPD))
> -		wake_all_kswapd(order, zonelist, high_zoneidx,
> -						zone_idx(preferred_zone));
> +	prepare_slowpath(gfp_mask, order, zonelist,
> +			 high_zoneidx, zone_idx(preferred_zone));
>  
>  	/*
>  	 * OK, we're below the kswapd watermark and have kicked background
> @@ -4758,6 +4771,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  		zone_seqlock_init(zone);
>  		zone->zone_pgdat = pgdat;
>  
> +		/* For bootup, initialized properly in watermark setup */
> +		atomic_set(&zone->alloc_batch, zone->managed_pages);
> +
>  		zone_pcp_init(zone);
>  		lruvec_init(&zone->lruvec);
>  		if (!size)
> @@ -5533,6 +5549,9 @@ static void __setup_per_zone_wmarks(void)
>  		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
>  		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
>  
> +		atomic_set(&zone->alloc_batch,
> +			   high_wmark_pages(zone) - low_wmark_pages(zone));
> +
>  		setup_zone_migrate_reserve(zone);
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	}
> -- 
> 1.8.3.2
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-08-01  2:56     ` Minchan Kim
@ 2013-08-01  4:31       ` Rik van Riel
  -1 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2013-08-01  4:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, linux-kernel

On 07/31/2013 10:56 PM, Minchan Kim wrote:

> Granted, it's not really a slow path, because it can return to normal
> operation without calling any significantly slow functions once
> prepare_slowpath resets the batch counts.
>
> I think it's a tradeoff, and I am biased toward your approach even if
> we lose a little performance, because fair aging should make up for
> the fastpath overhead.  But who knows?  Someone may have a concern.
>
> So we should at least mention such problems.

If the atomic operation in the fast path turns out to be a problem,
I suspect we may be able to fix it by using per-cpu counters, and
consolidating those every once in a while.

However, it may be good to see whether there is a problem in the
first place, before adding complexity.
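
For illustration only, an untested sketch of what such a per-cpu scheme
could look like; the alloc_batch_delta field in per_cpu_pageset is
invented here for the example and is not part of the patch:

	#define ALLOC_BATCH_FOLD	64

	static void zone_batch_account(struct zone *zone, unsigned int order)
	{
		struct per_cpu_pageset *pset;

		preempt_disable();
		pset = this_cpu_ptr(zone->pageset);
		/* accumulate locally, no shared cacheline touched */
		pset->alloc_batch_delta += 1U << order;
		if (pset->alloc_batch_delta >= ALLOC_BATCH_FOLD) {
			/* fold the per-cpu delta into the zone counter */
			atomic_sub(pset->alloc_batch_delta, &zone->alloc_batch);
			pset->alloc_batch_delta = 0;
		}
		preempt_enable();
	}

The tradeoff is that the batch can overshoot by up to
ALLOC_BATCH_FOLD * nr_cpus pages before prepare_slowpath is entered,
which is the kind of consolidation error such a scheme would have to
tolerate.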

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
@ 2013-08-01  4:31       ` Rik van Riel
  0 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2013-08-01  4:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, linux-kernel

On 07/31/2013 10:56 PM, Minchan Kim wrote:

> Granted, it's not really a slow path, because it can return to normal
> operation without calling any significantly slow functions once
> prepare_slowpath resets the batch counts.
>
> I think it's a tradeoff, and I am biased toward your approach even if
> we lose a little performance, because fair aging should make up for
> the fastpath overhead.  But who knows?  Someone may have a concern.
>
> So we should at least mention such problems.

If the atomic operation in the fast path turns out to be a problem,
I suspect we may be able to fix it by using per-cpu counters, and
consolidating those every once in a while.

However, it may be good to see whether there is a problem in the
first place, before adding complexity.

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-08-01  4:31       ` Rik van Riel
@ 2013-08-01 15:51         ` Andrea Arcangeli
  -1 siblings, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2013-08-01 15:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, Johannes Weiner, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel

On Thu, Aug 01, 2013 at 12:31:34AM -0400, Rik van Riel wrote:
> On 07/31/2013 10:56 PM, Minchan Kim wrote:
> 
> > Granted, it's not really a slow path, because it can return to normal
> > operation without calling any significantly slow functions once
> > prepare_slowpath resets the batch counts.
> >
> > I think it's a tradeoff, and I am biased toward your approach even if
> > we lose a little performance, because fair aging should make up for
> > the fastpath overhead.  But who knows?  Someone may have a concern.
> >
> > So we should at least mention such problems.
> 
> If the atomic operation in the fast path turns out to be a problem,
> I suspect we may be able to fix it by using per-cpu counters, and
> consolidating those every once in a while.
> 
> However, it may be good to see whether there is a problem in the
> first place, before adding complexity.

prepare_slowpath is racy anyway, so I don't see the point in wanting
to do atomic ops in the first place.

Much better to do the increment in assembly in a single insn to reduce
the window, but there's no point in using the lock prefix when
atomic_set will screw it over while somebody does atomic_dec under it
anyway. atomic_set cannot be atomic, so it'll screw over any
perfectly accurate accounted atomic_dec anyway.

I think it should be done in C without atomic_dec or by introducing an
__atomic_dec that isn't using the lock prefix and just tries to do the
dec in a single (or as few as possible) asm insn.
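
Something along these lines, just to illustrate the plain-C variant
(untested, and only a sketch of the idea, not a tested patch):

	/*
	 * Unlocked batch update: a plain load/subtract/store on the
	 * counter.  Races with the atomic_set() reset or with other
	 * CPUs just lose an occasional decrement, which is tolerable
	 * for a heuristic like alloc_batch.
	 */
	static inline void alloc_batch_sub(struct zone *zone, int nr)
	{
		atomic_set(&zone->alloc_batch,
			   atomic_read(&zone->alloc_batch) - nr);
	}

On x86 the compiler may still emit separate load and store instructions
for this, so a dedicated single-instruction helper, as suggested above,
would narrow the race window further.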

And because the whole thing is racy, after prepare_slowpath (which I
think is a too generic name and should be renamed
reset_zonelist_alloc_batch), this last attempt before invoking
reclaim:

	/* This is the last chance, in general, before the goto nopage. */
	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
			preferred_zone, migratetype);

which is trying again with the MIN wmark, should be also altered so
that we do one pass totally disregarding alloc_batch, just before
declaring MIN-wmark failure in the allocation, and invoking direct
reclaim.

Either we add a (alloc_flags | ALLOC_NO_ALLOC_BATCH) to the above
alloc_flags in the above call, or we run a second
get_page_from_freelist if the above one fails with
ALLOC_NO_ALLOC_BATCH set.

Otherwise because of the inherent racy behavior of the alloc_batch
logic, what will happen is that we'll eventually have enough hugepage
allocations in enough CPUs to hit the alloc_batch limit again in
between the prepare_slowpath and the above get_page_from_freelist,
that will lead us to call spurious direct_reclaim invocations, when it
was just a false negative of alloc_batch.
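
To make the second option concrete, an untested sketch; ALLOC_NO_ALLOC_BATCH
is only the flag name proposed above, it does not exist in the patch:

	/* in get_page_from_freelist(): skip the batch check on request */
	if (!(alloc_flags & ALLOC_NO_ALLOC_BATCH) &&
	    atomic_read(&zone->alloc_batch) <= 0)
		continue;

	/* in __alloc_pages_slowpath(): one more try before direct reclaim */
	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
			high_zoneidx,
			(alloc_flags | ALLOC_NO_ALLOC_BATCH) &
				~ALLOC_NO_WATERMARKS,
			preferred_zone, migratetype);
	if (page)
		goto got_pg;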

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
@ 2013-08-01 15:51         ` Andrea Arcangeli
  0 siblings, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2013-08-01 15:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, Johannes Weiner, Andrew Morton, Mel Gorman,
	linux-mm, linux-kernel

On Thu, Aug 01, 2013 at 12:31:34AM -0400, Rik van Riel wrote:
> On 07/31/2013 10:56 PM, Minchan Kim wrote:
> 
> > Granted, it's not really a slow path, because it can return to normal
> > operation without calling any significantly slow functions once
> > prepare_slowpath resets the batch counts.
> >
> > I think it's a tradeoff, and I am biased toward your approach even if
> > we lose a little performance, because fair aging should make up for
> > the fastpath overhead.  But who knows?  Someone may have a concern.
> >
> > So we should at least mention such problems.
> 
> If the atomic operation in the fast path turns out to be a problem,
> I suspect we may be able to fix it by using per-cpu counters, and
> consolidating those every once in a while.
> 
> However, it may be good to see whether there is a problem in the
> first place, before adding complexity.

prepare_slowpath is racy anyway, so I don't see the point in wanting
to do atomic ops in the first place.

Much better to do the increment in assembly in a single insn to reduce
the window, but there's no point in using the lock prefix when
atomic_set will screw it over while somebody does atomic_dec under it
anyway. atomic_set cannot be atomic, so it'll screw over any
perfectly accurate accounted atomic_dec anyway.

I think it should be done in C without atomic_dec or by introducing an
__atomic_dec that isn't using the lock prefix and just tries to do the
dec in a single (or as few as possible) asm insn.

And because the whole thing is racy, after prepare_slowpath (which I
think is a too generic name and should be renamed
reset_zonelist_alloc_batch), this last attempt before invoking
reclaim:

	/* This is the last chance, in general, before the goto nopage. */
	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
			preferred_zone, migratetype);

which is trying again with the MIN wmark, should be also altered so
that we do one pass totally disregarding alloc_batch, just before
declaring MIN-wmark failure in the allocation, and invoking direct
reclaim.

Either we add a (alloc_flags | ALLOC_NO_ALLOC_BATCH) to the above
alloc_flags in the above call, or we run a second
get_page_from_freelist if the above one fails with
ALLOC_NO_ALLOC_BATCH set.

Otherwise because of the inherent racy behavior of the alloc_batch
logic, what will happen is that we'll eventually have enough hugepage
allocations in enough CPUs to hit the alloc_batch limit again in
between the prepare_slowpath and the above get_page_from_freelist,
that will lead us to call spurious direct_reclaim invocations, when it
was just a false negative of alloc_batch.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-08-01 15:51         ` Andrea Arcangeli
@ 2013-08-01 19:58           ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-08-01 19:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Minchan Kim, Andrew Morton, Mel Gorman, linux-mm,
	linux-kernel

On Thu, Aug 01, 2013 at 05:51:11PM +0200, Andrea Arcangeli wrote:
> On Thu, Aug 01, 2013 at 12:31:34AM -0400, Rik van Riel wrote:
> > On 07/31/2013 10:56 PM, Minchan Kim wrote:
> > 
> > > Granted, it's not really a slow path, because it can return to
> > > normal operation without calling any significantly slow functions
> > > once prepare_slowpath resets the batch counts.
> > >
> > > I think it's a tradeoff, and I am biased toward your approach even
> > > if we lose a little performance, because fair aging should make up
> > > for the fastpath overhead.  But who knows?  Someone may have a
> > > concern.
> > >
> > > So we should at least mention such problems.
> > 
> > If the atomic operation in the fast path turns out to be a problem,
> > I suspect we may be able to fix it by using per-cpu counters, and
> > consolidating those every once in a while.
> > 
> > However, it may be good to see whether there is a problem in the
> > first place, before adding complexity.
> 
> prepare_slowpath is racy anyway, so I don't see the point in wanting
> to do atomic ops in the first place.
> 
> Much better to do the increment in assembly in a single insn to reduce
> the window, but there's no point in using the lock prefix when
> atomic_set will screw it over while somebody does atomic_dec under it
> anyway. atomic_set cannot be atomic, so it'll screw over any
> perfectly accurate accounted atomic_dec anyway.
> 
> I think it should be done in C without atomic_dec or by introducing an
> __atomic_dec that isn't using the lock prefix and just tries to do the
> dec in a single (or as few as possible) asm insn.

Well the reset is not synchronized to the atomic_dec(), but we are
guaranteed that after n page allocations we wake up kswapd, while
without atomic ops we might miss a couple allocations when allocators
race during the RMW.

But we might be able to get away with a small error.

> And because the whole thing is racy, after prepare_slowpath (which I
> think is a too generic name and should be renamed
> reset_zonelist_alloc_batch), this last attempt before invoking
> reclaim:
> 
> 	/* This is the last chance, in general, before the goto nopage. */
> 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> 			preferred_zone, migratetype);
> 
> which is trying again with the MIN wmark, should be also altered so
> that we do one pass totally disregarding alloc_batch, just before
> declaring MIN-wmark failure in the allocation, and invoking direct
> reclaim.
> 
> Either we add a (alloc_flags | ALLOC_NO_ALLOC_BATCH) to the above
> alloc_flags in the above call, or we run a second
> get_page_from_freelist if the above one fails with
> ALLOC_NO_ALLOC_BATCH set.
> 
> Otherwise because of the inherent racy behavior of the alloc_batch
> logic, what will happen is that we'll eventually have enough hugepage
> allocations in enough CPUs to hit the alloc_batch limit again in
> between the prepare_slowpath and the above get_page_from_freelist,
> that will lead us to call spurious direct_reclaim invocations, when it
> was just a false negative of alloc_batch.

So the discussion diverged between on-list and off-list.  I suggested
ignoring the alloc_batch in the slowpath completely (since allocations
are very unlikely to fail due to the batches immediately after they
were reset, and the occasional glitch does not matter).  This should
solve the problem of spurious direct reclaim invocations.  The
zone_reclaim_mode allocation sequence would basically be:

1. Try zones in the local node with low watermark while maintaining
   fairness with the allocation batch and invoking zone reclaim if
   necessary

2. If that fails, wake up kswapd and reset the alloc batches on the
   local node

3. Try zones in order of preference with the min watermark, invoking
   zone reclaim if necessary, ignoring allocation batches.

So in the fastpath we try to be local and fair, in the slow path we
are likely to be local and fair, unless we have to spill into remote
nodes.  But then there is nothing we can do anyway.
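
One way to express that, as an untested sketch (ALLOC_FAIR is a
hypothetical flag name used only for this example):

	/* in get_page_from_freelist(): enforce fairness only on request */
	if ((alloc_flags & ALLOC_FAIR) &&
	    atomic_read(&zone->alloc_batch) <= 0)
		continue;

The fastpath would pass ALLOC_FAIR along with the low watermark, while
the slowpath would leave it out after resetting the batches, so
allocations past that point are only throttled by the watermarks and
zone reclaim, never by a stale batch count.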

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
@ 2013-08-01 19:58           ` Johannes Weiner
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-08-01 19:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Minchan Kim, Andrew Morton, Mel Gorman, linux-mm,
	linux-kernel

On Thu, Aug 01, 2013 at 05:51:11PM +0200, Andrea Arcangeli wrote:
> On Thu, Aug 01, 2013 at 12:31:34AM -0400, Rik van Riel wrote:
> > On 07/31/2013 10:56 PM, Minchan Kim wrote:
> > 
> > > Granted, it's not really a slow path, because it can return to
> > > normal operation without calling any significantly slow functions
> > > once prepare_slowpath resets the batch counts.
> > >
> > > I think it's a tradeoff, and I am biased toward your approach even
> > > if we lose a little performance, because fair aging should make up
> > > for the fastpath overhead.  But who knows?  Someone may have a
> > > concern.
> > >
> > > So we should at least mention such problems.
> > 
> > If the atomic operation in the fast path turns out to be a problem,
> > I suspect we may be able to fix it by using per-cpu counters, and
> > consolidating those every once in a while.
> > 
> > However, it may be good to see whether there is a problem in the
> > first place, before adding complexity.
> 
> prepare_slowpath is racy anyway, so I don't see the point in wanting
> to do atomic ops in the first place.
> 
> Much better to do the increment in assembly in a single insn to reduce
> the window, but there's no point in using the lock prefix when
> atomic_set will screw it over while somebody does atomic_dec under it
> anyway. atomic_set cannot be atomic, so it'll screw over any
> perfectly accurate accounted atomic_dec anyway.
> 
> I think it should be done in C without atomic_dec or by introducing an
> __atomic_dec that isn't using the lock prefix and just tries to do the
> dec in a single (or as few as possible) asm insn.

Well the reset is not synchronized to the atomic_dec(), but we are
guaranteed that after n page allocations we wake up kswapd, while
without atomic ops we might miss a couple allocations when allocators
race during the RMW.

But we might be able to get away with a small error.

> And because the whole thing is racy, after prepare_slowpath (which I
> think is a too generic name and should be renamed
> reset_zonelist_alloc_batch), this last attempt before invoking
> reclaim:
> 
> 	/* This is the last chance, in general, before the goto nopage. */
> 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> 			preferred_zone, migratetype);
> 
> which is trying again with the MIN wmark, should be also altered so
> that we do one pass totally disregarding alloc_batch, just before
> declaring MIN-wmark failure in the allocation, and invoking direct
> reclaim.
> 
> Either we add a (alloc_flags | ALLOC_NO_ALLOC_BATCH) to the above
> alloc_flags in the above call, or we run a second
> get_page_from_freelist if the above one fails with
> ALLOC_NO_ALLOC_BATCH set.
> 
> Otherwise because of the inherent racy behavior of the alloc_batch
> logic, what will happen is that we'll eventually have enough hugepage
> allocations in enough CPUs to hit the alloc_batch limit again in
> between the prepare_slowpath and the above get_page_from_freelist,
> that will lead us to call spurious direct_reclaim invocations, when it
> was just a false negative of alloc_batch.

So the discussion diverged between on-list and off-list.  I suggested
ignoring the alloc_batch in the slowpath completely (since allocations
are very unlikely to fail due to the batches immediately after they
were reset, and the occasional glitch does not matter).  This should
solve the problem of spurious direct reclaim invocations.  The
zone_reclaim_mode allocation sequence would basically be:

1. Try zones in the local node with low watermark while maintaining
   fairness with the allocation batch and invoking zone reclaim if
   necessary

2. If that fails, wake up kswapd and reset the alloc batches on the
   local node

3. Try zones in order of preference with the min watermark, invoking
   zone reclaim if necessary, ignoring allocation batches.

So in the fastpath we try to be local and fair, in the slow path we
are likely to be local and fair, unless we have to spill into remote
nodes.  But then there is nothing we can do anyway.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-08-01 19:58           ` Johannes Weiner
@ 2013-08-01 22:16             ` Andrea Arcangeli
  -1 siblings, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2013-08-01 22:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Minchan Kim, Andrew Morton, Mel Gorman, linux-mm,
	linux-kernel

On Thu, Aug 01, 2013 at 03:58:23PM -0400, Johannes Weiner wrote:
> But we might be able to get away with a small error.

The idea is that there's a small error anyway, because multiple CPUs
can reset the counter at the same time while another CPU is decreasing
it, so a decrease may sometimes get lost regardless of whether it is
atomic or not.  How much worse the error becomes I don't know for sure,
but I would expect it to stay at noise level.

Removing the locked op won't help much when multiple processes allocate
from the same node, but it'll speed things up when just one CPU runs.

> So the discussion diverged between on-list and off-list.  I suggested
> ignoring the alloc_batch in the slowpath completely (since allocations
> are very unlikely to fail due to the batches immediately after they
> were reset, and the occasional glitch does not matter).  This should
> solve the problem of spurious direct reclaim invocations.  The

I'm fine with this solution of ignoring alloc_batch in the slowpath;
it definitely removes the risk of spurious direct reclaim and it
contributes to solving zone_reclaim_mode too.

> zone_reclaim_mode allocation sequence would basically be:
> 
> 1. Try zones in the local node with low watermark while maintaining
>    fairness with the allocation batch and invoking zone reclaim if
>    necessary
> 
> 2. If that fails, wake up kswapd and reset the alloc batches on the
>    local node
> 
> 3. Try zones in order of preference with the min watermark, invoking
>    zone reclaim if necessary, ignoring allocation batches.

One difference is that kswapd wasn't involved before. But maybe it's
beneficial.

> So in the fastpath we try to be local and fair, in the slow path we
> are likely to be local and fair, unless we have to spill into remote
> nodes.  But then there is nothing we can do anyway.

I think it should work well.

And with these changes, alloc_batch will help zone_reclaim_mode
tremendously too (not just prevent its breakage): allocations will be
spread evenly across the multiple zones of the same node, instead of
insistently calling zone_reclaim on the highest zone of the node until
zone_reclaim finally fails on it (with the current behavior, if
zone_reclaim_mode is enabled, only the highest zone's LRU is rotated
until it is full).  And it's common to have at least one node with two
zones (the pci32 zone).

Skipping any non-local zone in the fast path when zone_reclaim_mode is
enabled will prevent multiple CPUs on different nodes from stepping on
each other's toes, and with the zone_local check introduced in
prepare_slowpath, alloc_batch will never be reset by remote CPUs.

Changing prepare_slowpath to reset only the local node will also reduce
the long-distance interconnect traffic on very big NUMA machines where
zone_reclaim_mode is enabled by default, so it'll speed up
prepare_slowpath too.
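
Roughly like this, as an untested sketch (using the rename suggested
earlier; zone_local() is assumed here to be a small helper comparing
the zones' node ids):

	static bool zone_local(struct zone *local_zone, struct zone *zone)
	{
		return zone_to_nid(local_zone) == zone_to_nid(zone);
	}

	static void reset_zonelist_alloc_batch(gfp_t gfp_mask, unsigned int order,
					       struct zonelist *zonelist,
					       enum zone_type high_zoneidx,
					       struct zone *preferred_zone,
					       enum zone_type classzone_idx)
	{
		struct zoneref *z;
		struct zone *zone;

		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
			/* only reset batches on the preferred (local) node */
			if (zone_local(preferred_zone, zone))
				atomic_set(&zone->alloc_batch,
					   high_wmark_pages(zone) -
					   low_wmark_pages(zone));
			if (!(gfp_mask & __GFP_NO_KSWAPD))
				wakeup_kswapd(zone, order, classzone_idx);
		}
	}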

Can you submit the fixes to Andrew? I'm very happy with your patch, so
then I can submit the compaction zone_reclaim_mode fixes on top of
it.

Changing topic to my zone_reclaim compaction fixes: I hope to get more
review on those, especially the high-order watermark algorithm
improvements.  I'll also try to get a specsfs run with jumbo frames on
them, to see if there's any measurable positive effect from the
watermark changes and the substantial improvement in compaction
reliability.  (The increase of the low watermark level for high-order
pages is needed to get more accuracy out of zone_reclaim_mode hugepage
placement, and the decrease of the min watermark level avoids wasting
precious CPU generating a zillion high-order pages that cannot possibly
be useful to PF_MEMALLOC.  Right now we have a ton of 8k pages that any
GFP_ATOMIC or GFP_KERNEL allocation is totally forbidden to use, yet we
still generate them for nothing, and over time we split them right away
at the first 4k allocation, so they don't even stick.)

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
@ 2013-08-01 22:16             ` Andrea Arcangeli
  0 siblings, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2013-08-01 22:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Minchan Kim, Andrew Morton, Mel Gorman, linux-mm,
	linux-kernel

On Thu, Aug 01, 2013 at 03:58:23PM -0400, Johannes Weiner wrote:
> But we might be able to get away with a small error.

The idea is that there's a small error anyway, because multiple CPUs
can reset the counter at the same time while another CPU is decreasing
it, so a decrease may sometimes get lost regardless of whether it is
atomic or not.  How much worse the error becomes I don't know for sure,
but I would expect it to stay at noise level.

Removing the locked op won't help much when multiple processes allocate
from the same node, but it'll speed things up when just one CPU runs.

> So the discussion diverged between on-list and off-list.  I suggested
> ignoring the alloc_batch in the slowpath completely (since allocations
> are very unlikely to fail due to the batches immediately after they
> were reset, and the occasional glitch does not matter).  This should
> solve the problem of spurious direct reclaim invocations.  The

I'm fine with this solution of ignoring alloc_batch in the slowpath;
it definitely removes the risk of spurious direct reclaim and it
contributes to solving zone_reclaim_mode too.

> zone_reclaim_mode allocation sequence would basically be:
> 
> 1. Try zones in the local node with low watermark while maintaining
>    fairness with the allocation batch and invoking zone reclaim if
>    necessary
> 
> 2. If that fails, wake up kswapd and reset the alloc batches on the
>    local node
> 
> 3. Try zones in order of preference with the min watermark, invoking
>    zone reclaim if necessary, ignoring allocation batches.

One difference is that kswapd wasn't involved before. But maybe it's
beneficial.

> So in the fastpath we try to be local and fair, in the slow path we
> are likely to be local and fair, unless we have to spill into remote
> nodes.  But then there is nothing we can do anyway.

I think it should work well.

And with these changes, alloc_batch will help zone_reclaim_mode
tremendously too (not just prevent its breakage): allocations will be
spread evenly across the multiple zones of the same node, instead of
insistently calling zone_reclaim on the highest zone of the node until
zone_reclaim finally fails on it (with the current behavior, if
zone_reclaim_mode is enabled, only the highest zone's LRU is rotated
until it is full).  And it's common to have at least one node with two
zones (the pci32 zone).

Skipping any non-local zone in the fast path when zone_reclaim_mode is
enabled will prevent multiple CPUs on different nodes from stepping on
each other's toes, and with the zone_local check introduced in
prepare_slowpath, alloc_batch will never be reset by remote CPUs.

Changing prepare_slowpath to reset only the local node will also reduce
the long-distance interconnect traffic on very big NUMA machines where
zone_reclaim_mode is enabled by default, so it'll speed up
prepare_slowpath too.

Can you submit the fixes to Andrew? I'm very happy with your patch, so
then I can submit the compaction zone_reclaim_mode fixes on top of
it.

Changing topic to my zone_reclaim compaction fixes: I hope to get more
review on those, especially the high-order watermark algorithm
improvements.  I'll also try to get a specsfs run with jumbo frames on
them, to see if there's any measurable positive effect from the
watermark changes and the substantial improvement in compaction
reliability.  (The increase of the low watermark level for high-order
pages is needed to get more accuracy out of zone_reclaim_mode hugepage
placement, and the decrease of the min watermark level avoids wasting
precious CPU generating a zillion high-order pages that cannot possibly
be useful to PF_MEMALLOC.  Right now we have a ton of 8k pages that any
GFP_ATOMIC or GFP_KERNEL allocation is totally forbidden to use, yet we
still generate them for nothing, and over time we split them right away
at the first 4k allocation, so they don't even stick.)

Thanks!
Andrea


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-08-01  2:56     ` Minchan Kim
@ 2013-08-02  6:22       ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-08-02  6:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

On Thu, Aug 01, 2013 at 11:56:36AM +0900, Minchan Kim wrote:
> Hi Hannes,
> 
> On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> > Each zone that holds userspace pages of one workload must be aged at a
> > speed proportional to the zone size.  Otherwise, the time an
> > individual page gets to stay in memory depends on the zone it happened
> > to be allocated in.  Asymmetry in the zone aging creates rather
> > unpredictable aging behavior and results in the wrong pages being
> > reclaimed, activated etc.
> > 
> > But exactly this happens right now because of the way the page
> > allocator and kswapd interact.  The page allocator uses per-node lists
> > of all zones in the system, ordered by preference, when allocating a
> > new page.  When the first iteration does not yield any results, kswapd
> > is woken up and the allocator retries.  Due to the way kswapd reclaims
> > zones below the high watermark while a zone can be allocated from when
> > it is above the low watermark, the allocator may keep kswapd running
> > while kswapd reclaim ensures that the page allocator can keep
> > allocating from the first zone in the zonelist for extended periods of
> > time.  Meanwhile the other zones rarely see new allocations and thus
> > get aged much slower in comparison.
> > 
> > The result is that the occasional page placed in lower zones gets
> > relatively more time in memory, even get promoted to the active list
> > after its peers have long been evicted.  Meanwhile, the bulk of the
> > working set may be thrashing on the preferred zone even though there
> > may be significant amounts of memory available in the lower zones.
> > 
> > Even the most basic test -- repeatedly reading a file slightly bigger
> > than memory -- shows how broken the zone aging is.  In this scenario,
> > no single page should be able stay in memory long enough to get
> > referenced twice and activated, but activation happens in spades:
> > 
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 0
> >       nr_inactive_file 0
> >       nr_active_file 8
> >       nr_inactive_file 1582
> >       nr_active_file 11994
> >   $ cat data data data data >/dev/null
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 70
> >       nr_inactive_file 258753
> >       nr_active_file 443214
> >       nr_inactive_file 149793
> >       nr_active_file 12021
> > 
> > Fix this with a very simple round robin allocator.  Each zone is
> > allowed a batch of allocations that is proportional to the zone's
> > size, after which it is treated as full.  The batch counters are reset
> > when all zones have been tried and the allocator enters the slowpath
> > and kicks off kswapd reclaim:
> > 
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 0
> >       nr_inactive_file 174
> >       nr_active_file 4865
> >       nr_inactive_file 53
> >       nr_active_file 860
> >   $ cat data data data data >/dev/null
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 0
> >       nr_inactive_file 666622
> >       nr_active_file 4988
> >       nr_inactive_file 190969
> >       nr_active_file 937
> 
> First of all, I really appreciate your great work!
> It's impressive, and I saw that Zlatko showed it improves real workloads.
> Thanks, Zlatko, too!
> 
> So I don't want to block merging, but I think we should at least
> discuss a few issues.
> 
> The concern I have is that it could aggravate lowmem pinning problems
> like mlock.  I don't actually have a workload that pins lots of pages,
> but such workloads are why we introduced lowmem_reserve_ratio, as you
> know well, so we should at least cover this issue.

We are not actually using the lower zones more in terms of number of
pages at any given time, we are just using them more in terms of
allocation and reclaim rate.

Consider how the page allocator works: it will first fill up the
preferred zone, then the next best zone, etc. and then when all of
them are full, it will wake up kswapd, which will only reclaim them
back to the high watermark + balance gap.

After my change, all zones will be at the high watermarks + balance
gap during idle and between the low and high marks under load.  The
change is marginal.

The lowmem reserve makes sure, before and after, that a sizable
portion of low memory is reserved for non-user allocations.
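
As a reminder of how that works, a simplified sketch of the existing
watermark check (the real __zone_watermark_ok() also walks the
higher-order free areas, which is omitted here):

	static bool watermark_ok_sketch(struct zone *zone, unsigned int order,
					unsigned long mark, int classzone_idx)
	{
		long free = zone_page_state(zone, NR_FREE_PAGES);

		/*
		 * The per-zone lowmem reserve for this allocation class
		 * sits on top of the watermark, so userspace-style
		 * allocations cannot push a low zone below its reserve,
		 * with or without the fairness batches.
		 */
		return free - (1 << order) >
			mark + zone->lowmem_reserve[classzone_idx];
	}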

> My other concern is the overhead added to the fast path.
> Sometimes we are reluctant to add even a simple "if" condition to the
> fastpath, but this adds an atomic op whenever a page is allocated and
> enters the slowpath whenever all of the given zones' batch counts reach
> zero.  Granted, it's not really a slow path, because it can return to
> normal operation without calling any significantly slow functions once
> prepare_slowpath resets the batch counts.

Yes, no direct reclaim should occur.  I know that it comes at a cost,
but the alternative is broken page reclaim, page cache thrashing etc.
I don't see a way around it...

Thanks for your input, Minchan!

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
@ 2013-08-02  6:22       ` Johannes Weiner
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2013-08-02  6:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

On Thu, Aug 01, 2013 at 11:56:36AM +0900, Minchan Kim wrote:
> Hi Hannes,
> 
> On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> > Each zone that holds userspace pages of one workload must be aged at a
> > speed proportional to the zone size.  Otherwise, the time an
> > individual page gets to stay in memory depends on the zone it happened
> > to be allocated in.  Asymmetry in the zone aging creates rather
> > unpredictable aging behavior and results in the wrong pages being
> > reclaimed, activated etc.
> > 
> > But exactly this happens right now because of the way the page
> > allocator and kswapd interact.  The page allocator uses per-node lists
> > of all zones in the system, ordered by preference, when allocating a
> > new page.  When the first iteration does not yield any results, kswapd
> > is woken up and the allocator retries.  Due to the way kswapd reclaims
> > zones below the high watermark while a zone can be allocated from when
> > it is above the low watermark, the allocator may keep kswapd running
> > while kswapd reclaim ensures that the page allocator can keep
> > allocating from the first zone in the zonelist for extended periods of
> > time.  Meanwhile the other zones rarely see new allocations and thus
> > get aged much slower in comparison.
> > 
> > The result is that the occasional page placed in lower zones gets
> > relatively more time in memory, even get promoted to the active list
> > after its peers have long been evicted.  Meanwhile, the bulk of the
> > working set may be thrashing on the preferred zone even though there
> > may be significant amounts of memory available in the lower zones.
> > 
> > Even the most basic test -- repeatedly reading a file slightly bigger
> > than memory -- shows how broken the zone aging is.  In this scenario,
> > no single page should be able stay in memory long enough to get
> > referenced twice and activated, but activation happens in spades:
> > 
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 0
> >       nr_inactive_file 0
> >       nr_active_file 8
> >       nr_inactive_file 1582
> >       nr_active_file 11994
> >   $ cat data data data data >/dev/null
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 70
> >       nr_inactive_file 258753
> >       nr_active_file 443214
> >       nr_inactive_file 149793
> >       nr_active_file 12021
> > 
> > Fix this with a very simple round robin allocator.  Each zone is
> > allowed a batch of allocations that is proportional to the zone's
> > size, after which it is treated as full.  The batch counters are reset
> > when all zones have been tried and the allocator enters the slowpath
> > and kicks off kswapd reclaim:
> > 
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 0
> >       nr_inactive_file 174
> >       nr_active_file 4865
> >       nr_inactive_file 53
> >       nr_active_file 860
> >   $ cat data data data data >/dev/null
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 0
> >       nr_inactive_file 666622
> >       nr_active_file 4988
> >       nr_inactive_file 190969
> >       nr_active_file 937
> 
> First of all, I really appreciate your great work!
> It's impressive, and I saw that Zlatko showed it improves real workloads.
> Thanks, Zlatko, too!
> 
> So I don't want to block merging, but I think we should at least
> discuss a few issues.
> 
> The concern I have is that it could aggravate lowmem pinning problems
> like mlock.  I don't actually have a workload that pins lots of pages,
> but such workloads are why we introduced lowmem_reserve_ratio, as you
> know well, so we should at least cover this issue.

We are not actually using the lower zones more in terms of number of
pages at any given time, we are just using them more in terms of
allocation and reclaim rate.

Consider how the page allocator works: it will first fill up the
preferred zone, then the next best zone, etc. and then when all of
them are full, it will wake up kswapd, which will only reclaim them
back to the high watermark + balance gap.

After my change, all zones will be at the high watermarks + balance
gap during idle and between the low and high marks under load.  The
change is marginal.

The lowmem reserve makes sure, before and after, that a sizable
portion of low memory is reserved for non-user allocations.

> My other concern is the overhead added to the fast path.
> Sometimes we are reluctant to add even a simple "if" condition to the
> fastpath, but this adds an atomic op whenever a page is allocated and
> enters the slowpath whenever all of the given zones' batch counts reach
> zero.  Granted, it's not really a slow path, because it can return to
> normal operation without calling any significantly slow functions once
> prepare_slowpath resets the batch counts.

Yes, no direct reclaim should occur.  I know that it comes at a cost,
but the alternative is broken page reclaim, page cache thrashing etc.
I don't see a way around it...

Thanks for your input, Minchan!


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
  2013-08-02  6:22       ` Johannes Weiner
@ 2013-08-02  7:32         ` Minchan Kim
  -1 siblings, 0 replies; 66+ messages in thread
From: Minchan Kim @ 2013-08-02  7:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

On Fri, Aug 02, 2013 at 02:22:08AM -0400, Johannes Weiner wrote:
> On Thu, Aug 01, 2013 at 11:56:36AM +0900, Minchan Kim wrote:
> > Hi Hannes,
> > 
> > On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> > > Each zone that holds userspace pages of one workload must be aged at a
> > > speed proportional to the zone size.  Otherwise, the time an
> > > individual page gets to stay in memory depends on the zone it happened
> > > to be allocated in.  Asymmetry in the zone aging creates rather
> > > unpredictable aging behavior and results in the wrong pages being
> > > reclaimed, activated etc.
> > > 
> > > But exactly this happens right now because of the way the page
> > > allocator and kswapd interact.  The page allocator uses per-node lists
> > > of all zones in the system, ordered by preference, when allocating a
> > > new page.  When the first iteration does not yield any results, kswapd
> > > is woken up and the allocator retries.  Due to the way kswapd reclaims
> > > zones below the high watermark while a zone can be allocated from when
> > > it is above the low watermark, the allocator may keep kswapd running
> > > while kswapd reclaim ensures that the page allocator can keep
> > > allocating from the first zone in the zonelist for extended periods of
> > > time.  Meanwhile the other zones rarely see new allocations and thus
> > > get aged much slower in comparison.
> > > 
> > > The result is that the occasional page placed in lower zones gets
> > > relatively more time in memory, even get promoted to the active list
> > > after its peers have long been evicted.  Meanwhile, the bulk of the
> > > working set may be thrashing on the preferred zone even though there
> > > may be significant amounts of memory available in the lower zones.
> > > 
> > > Even the most basic test -- repeatedly reading a file slightly bigger
> > > than memory -- shows how broken the zone aging is.  In this scenario,
> > > no single page should be able stay in memory long enough to get
> > > referenced twice and activated, but activation happens in spades:
> > > 
> > >   $ grep active_file /proc/zoneinfo
> > >       nr_inactive_file 0
> > >       nr_active_file 0
> > >       nr_inactive_file 0
> > >       nr_active_file 8
> > >       nr_inactive_file 1582
> > >       nr_active_file 11994
> > >   $ cat data data data data >/dev/null
> > >   $ grep active_file /proc/zoneinfo
> > >       nr_inactive_file 0
> > >       nr_active_file 70
> > >       nr_inactive_file 258753
> > >       nr_active_file 443214
> > >       nr_inactive_file 149793
> > >       nr_active_file 12021
> > > 
> > > Fix this with a very simple round robin allocator.  Each zone is
> > > allowed a batch of allocations that is proportional to the zone's
> > > size, after which it is treated as full.  The batch counters are reset
> > > when all zones have been tried and the allocator enters the slowpath
> > > and kicks off kswapd reclaim:
> > > 
> > >   $ grep active_file /proc/zoneinfo
> > >       nr_inactive_file 0
> > >       nr_active_file 0
> > >       nr_inactive_file 174
> > >       nr_active_file 4865
> > >       nr_inactive_file 53
> > >       nr_active_file 860
> > >   $ cat data data data data >/dev/null
> > >   $ grep active_file /proc/zoneinfo
> > >       nr_inactive_file 0
> > >       nr_active_file 0
> > >       nr_inactive_file 666622
> > >       nr_active_file 4988
> > >       nr_inactive_file 190969
> > >       nr_active_file 937
> > 
> > First of all, I really appreciate your great work!
> > It's impressive, and I saw that Zlatko showed it improves real workloads.
> > Thanks, Zlatko, too!
> > 
> > So I don't want to block merging, but I think we should at least
> > discuss a few issues.
> > 
> > The concern I have is that it could aggravate lowmem pinning problems
> > like mlock.  I don't actually have a workload that pins lots of pages,
> > but such workloads are why we introduced lowmem_reserve_ratio, as you
> > know well, so we should at least cover this issue.
> 
> We are not actually using the lower zones more in terms of number of
> pages at any given time, we are just using them more in terms of
> allocation and reclaim rate.
> 
> Consider how the page allocator works: it will first fill up the
> preferred zone, then the next best zone, etc. and then when all of
> them are full, it will wake up kswapd, which will only reclaim them
> back to the high watermark + balance gap.
> 
> After my change, all zones will be at the high watermarks + balance
> gap during idle and between the low and high marks under load.  The
> change is marginal.

Sorry, but I couldn't quite follow.
The old logic uses the lower zones only after the higher zones are full,
so it is unlikely to end up with mlocked pages in the lower zones.
With your change, the lower zones are used even when the higher zones
aren't full, so the chance of mlocked pages landing in the lower zones
is higher than before.

Yes, if we think of the long term, where all zones are full and the
reclaimer keeps bringing all of them back to the high watermark +
balance gap, the change would be marginal, as you said.  But that's not
to say we can ignore the former case, I think.

> 
> The lowmem reserve makes sure, before and after, that a sizable
> portion of low memory is reserved for non-user allocations.
> 
> > My other concern is the overhead added to the fast path.
> > Sometimes we are reluctant to add even a simple "if" condition to the
> > fastpath, but this adds an atomic op whenever a page is allocated and
> > enters the slowpath whenever all of the given zones' batch counts reach
> > zero.  Granted, it's not really a slow path, because it can return to
> > normal operation without calling any significantly slow functions once
> > prepare_slowpath resets the batch counts.
> 
> Yes, no direct reclaim should occur.  I know that it comes at a cost,
> but the alternative is broken page reclaim, page cache thrashing etc.
> I don't see a way around it...

That's why I support your solution. :)

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [patch 3/3] mm: page_alloc: fair zone allocator policy
@ 2013-08-02  7:32         ` Minchan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Minchan Kim @ 2013-08-02  7:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andrea Arcangeli,
	linux-mm, linux-kernel

On Fri, Aug 02, 2013 at 02:22:08AM -0400, Johannes Weiner wrote:
> On Thu, Aug 01, 2013 at 11:56:36AM +0900, Minchan Kim wrote:
> > Hi Hannes,
> > 
> > On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> > > Each zone that holds userspace pages of one workload must be aged at a
> > > speed proportional to the zone size.  Otherwise, the time an
> > > individual page gets to stay in memory depends on the zone it happened
> > > to be allocated in.  Asymmetry in the zone aging creates rather
> > > unpredictable aging behavior and results in the wrong pages being
> > > reclaimed, activated etc.
> > > 
> > > But exactly this happens right now because of the way the page
> > > allocator and kswapd interact.  The page allocator uses per-node lists
> > > of all zones in the system, ordered by preference, when allocating a
> > > new page.  When the first iteration does not yield any results, kswapd
> > > is woken up and the allocator retries.  Due to the way kswapd reclaims
> > > zones below the high watermark while a zone can be allocated from when
> > > it is above the low watermark, the allocator may keep kswapd running
> > > while kswapd reclaim ensures that the page allocator can keep
> > > allocating from the first zone in the zonelist for extended periods of
> > > time.  Meanwhile the other zones rarely see new allocations and thus
> > > get aged much slower in comparison.
> > > 
> > > The result is that the occasional page placed in lower zones gets
> > > relatively more time in memory, even get promoted to the active list
> > > after its peers have long been evicted.  Meanwhile, the bulk of the
> > > working set may be thrashing on the preferred zone even though there
> > > may be significant amounts of memory available in the lower zones.
> > > 
> > > Even the most basic test -- repeatedly reading a file slightly bigger
> > > than memory -- shows how broken the zone aging is.  In this scenario,
> > > no single page should be able stay in memory long enough to get
> > > referenced twice and activated, but activation happens in spades:
> > > 
> > >   $ grep active_file /proc/zoneinfo
> > >       nr_inactive_file 0
> > >       nr_active_file 0
> > >       nr_inactive_file 0
> > >       nr_active_file 8
> > >       nr_inactive_file 1582
> > >       nr_active_file 11994
> > >   $ cat data data data data >/dev/null
> > >   $ grep active_file /proc/zoneinfo
> > >       nr_inactive_file 0
> > >       nr_active_file 70
> > >       nr_inactive_file 258753
> > >       nr_active_file 443214
> > >       nr_inactive_file 149793
> > >       nr_active_file 12021
> > > 
> > > Fix this with a very simple round robin allocator.  Each zone is
> > > allowed a batch of allocations that is proportional to the zone's
> > > size, after which it is treated as full.  The batch counters are reset
> > > when all zones have been tried and the allocator enters the slowpath
> > > and kicks off kswapd reclaim:
> > > 
> > >   $ grep active_file /proc/zoneinfo
> > >       nr_inactive_file 0
> > >       nr_active_file 0
> > >       nr_inactive_file 174
> > >       nr_active_file 4865
> > >       nr_inactive_file 53
> > >       nr_active_file 860
> > >   $ cat data data data data >/dev/null
> > >   $ grep active_file /proc/zoneinfo
> > >       nr_inactive_file 0
> > >       nr_active_file 0
> > >       nr_inactive_file 666622
> > >       nr_active_file 4988
> > >       nr_inactive_file 190969
> > >       nr_active_file 937
> > 
> > First of all, I really appreciate your great work!
> > It's impressive, and I saw that Zlatko showed it improves real workloads.
> > Thanks, Zlatko, too!
> > 
> > So I don't want to block merging, but I think we should at least
> > discuss a few issues.
> > 
> > The concern I have is that it could aggravate lowmem pinning problems
> > like mlock.  I don't actually have a workload that pins lots of pages,
> > but such workloads are why we introduced lowmem_reserve_ratio, as you
> > know well, so we should at least cover this issue.
> 
> We are not actually using the lower zones more in terms of number of
> pages at any given time, we are just using them more in terms of
> allocation and reclaim rate.
> 
> Consider how the page allocator works: it will first fill up the
> preferred zone, then the next best zone, etc. and then when all of
> them are full, it will wake up kswapd, which will only reclaim them
> back to the high watermark + balance gap.
> 
> After my change, all zones will be at the high watermarks + balance
> gap during idle and between the low and high marks under load.  The
> change is marginal.

Sorry, but I couldn't quite follow.
The old logic uses the lower zones only after the higher zones are full,
so it is unlikely to end up with mlocked pages in the lower zones.
With your change, the lower zones are used even when the higher zones
aren't full, so the chance of mlocked pages landing in the lower zones
is higher than before.

Yes, if we think of the long term, where all zones are full and the
reclaimer keeps bringing all of them back to the high watermark +
balance gap, the change would be marginal, as you said.  But that's not
to say we can ignore the former case, I think.

> 
> The lowmem reserve makes sure, before and after, that a sizable
> portion of low memory is reserved for non-user allocations.
> 
> > My other concern is the overhead added to the fast path.
> > Sometimes we are reluctant to add even a simple "if" condition to the
> > fastpath, but this adds an atomic op whenever a page is allocated and
> > enters the slowpath whenever all of the given zones' batch counts reach
> > zero.  Granted, it's not really a slow path, because it can return to
> > normal operation without calling any significantly slow functions once
> > prepare_slowpath resets the batch counts.
> 
> Yes, no direct reclaim should occur.  I know that it comes at a cost,
> but the alternative is broken page reclaim, page cache thrashing etc.
> I don't see a way around it...

That's why I support your solution. :)

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2013-08-02  7:31 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-19 20:55 [patch 0/3] mm: improve page aging fairness between zones/nodes Johannes Weiner
2013-07-19 20:55 ` Johannes Weiner
2013-07-19 20:55 ` [patch 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd Johannes Weiner
2013-07-19 20:55   ` Johannes Weiner
2013-07-22 19:47   ` Rik van Riel
2013-07-22 19:47     ` Rik van Riel
2013-07-22 20:14     ` Johannes Weiner
2013-07-22 20:14       ` Johannes Weiner
2013-07-26 22:53   ` Andrew Morton
2013-07-26 22:53     ` Andrew Morton
2013-07-30 17:45     ` Johannes Weiner
2013-07-30 17:45       ` Johannes Weiner
2013-07-31 12:43       ` Johannes Weiner
2013-07-31 12:43         ` Johannes Weiner
2013-07-19 20:55 ` [patch 2/3] mm: page_alloc: rearrange watermark checking in get_page_from_freelist Johannes Weiner
2013-07-19 20:55   ` Johannes Weiner
2013-07-22 19:51   ` Rik van Riel
2013-07-22 19:51     ` Rik van Riel
2013-07-19 20:55 ` [patch 3/3] mm: page_alloc: fair zone allocator policy Johannes Weiner
2013-07-19 20:55   ` Johannes Weiner
2013-07-22 20:21   ` Rik van Riel
2013-07-22 20:21     ` Rik van Riel
2013-07-22 21:04     ` Johannes Weiner
2013-07-22 21:04       ` Johannes Weiner
2013-07-22 22:48       ` Rik van Riel
2013-07-22 22:48         ` Rik van Riel
2013-07-25  6:50     ` Paul Bolle
2013-07-25  6:50       ` Paul Bolle
2013-07-25 15:10       ` Johannes Weiner
2013-07-25 15:10         ` Johannes Weiner
2013-07-25 15:20         ` Paul Bolle
2013-07-25 15:20           ` Paul Bolle
2013-07-29 17:48   ` Andrea Arcangeli
2013-07-29 17:48     ` Andrea Arcangeli
2013-07-29 22:24     ` Johannes Weiner
2013-07-29 22:24       ` Johannes Weiner
2013-08-01  2:56   ` Minchan Kim
2013-08-01  2:56     ` Minchan Kim
2013-08-01  4:31     ` Rik van Riel
2013-08-01  4:31       ` Rik van Riel
2013-08-01 15:51       ` Andrea Arcangeli
2013-08-01 15:51         ` Andrea Arcangeli
2013-08-01 19:58         ` Johannes Weiner
2013-08-01 19:58           ` Johannes Weiner
2013-08-01 22:16           ` Andrea Arcangeli
2013-08-01 22:16             ` Andrea Arcangeli
2013-08-02  6:22     ` Johannes Weiner
2013-08-02  6:22       ` Johannes Weiner
2013-08-02  7:32       ` Minchan Kim
2013-08-02  7:32         ` Minchan Kim
2013-07-22 16:48 ` [patch 0/3] mm: improve page aging fairness between zones/nodes Zlatko Calusic
2013-07-22 16:48   ` Zlatko Calusic
2013-07-22 17:01   ` Johannes Weiner
2013-07-22 17:01     ` Johannes Weiner
2013-07-22 17:14     ` Zlatko Calusic
2013-07-24 11:18   ` Zlatko Calusic
2013-07-24 12:46     ` Hush Bensen
2013-07-24 12:46       ` Hush Bensen
2013-07-24 13:59       ` Zlatko Calusic
2013-07-24 13:59         ` Zlatko Calusic
2013-07-31  9:33     ` Zlatko Calusic
2013-07-31  9:33       ` Zlatko Calusic
2013-07-26 22:45 ` Andrew Morton
2013-07-26 22:45   ` Andrew Morton
2013-07-26 23:14   ` Johannes Weiner
2013-07-26 23:14     ` Johannes Weiner
