linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2
@ 2016-02-23 15:04 Mel Gorman
  2016-02-23 15:04 ` [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone) Mel Gorman
                   ` (28 more replies)
  0 siblings, 29 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

(sorry for the resend, outgoing smtp went haywire so trying again)

This is a revisit of an RFC series from last year that moves LRUs from
the zones to the node. It is based on mmotm from February 9th as it had
to be rebased on top of work there and will not apply cleanly to 4.5-rc*
Conceptually, this is simple but there are a lot of details. Some of the
broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from.  This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in distribution of file/anon pages due
   to when they were allocated results can result in a difference in 
   again. While the fair zone allocation policy mitigates some of the
   problems here, the page reclaim results on a multi-zone node will
   always be different to a single-zone node.
   it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other but it's sensitive to timing.  This
   mitigates the page allocator using pages that were allocated very recently
   in the ideal case but it's sensitive to timing. When kswapd is allocating
   from lower zones then it's great but during the rebalancing of the highest
   zone, the page allocator and kswapd interfere with each other. It's worse
   if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes. 

It was tested on a UMA (16 cores single socket) and a NUMA machine (48 cores,
2 sockets). However, many of these results are from the UMA machine as the
NUMA machine had a bug that was causing numa balancing to push everything
out to swap. A fix for that issue has already been posted.

In many benchmarks, there is an obvious difference in the number of
allocations from each zone as the fair zone allocation policy is removed
towards the end of the series. For example, this is the allocation stats
when running blogbench that showed no difference in headling performance

                          mmotm-20160209   nodelru-v2
DMA allocs                           0           0
DMA32 allocs                   7218763      608067
Normal allocs                 12701806    18821286
Movable allocs                       0           0

bonnie
------

This was configured to do an IO test with a working set 2*RAM using the
ext4 filesystem.  For both machines, there was no significant performance
difference between them but this is the result for the UMA machine


bonnie
                                           4.5.0-rc3                   4.5.0-rc3
                                      mmotm-20160209                  nodelru-v2
Hmean    SeqOut Char            85457.62 (  0.00%)        85376.69 ( -0.09%)
Hmean    SeqOut Block           87031.13 (  0.00%)        87523.40 (  0.57%)
Hmean    SeqOut Rewrite         36685.66 (  0.00%)        36006.64 ( -1.85%)
Hmean    SeqIn  Char            76766.34 (  0.00%)        75935.63 ( -1.08%)
Hmean    SeqIn  Block          105405.02 (  0.00%)       105513.21 (  0.10%)
Hmean    Random seeks             333.03 (  0.00%)          332.82 ( -0.07%)
Hmean    SeqCreate ops              5.00 (  0.00%)            4.62 ( -7.69%)
Hmean    SeqCreate read             4.62 (  0.00%)            4.62 (  0.00%)
Hmean    SeqCreate del           1622.44 (  0.00%)         1633.46 (  0.68%)
Hmean    RandCreate ops             5.00 (  0.00%)            5.00 (  0.00%)
Hmean    RandCreate read            4.62 (  0.00%)            4.62 (  0.00%)
Hmean    RandCreate del          1664.51 (  0.00%)         1672.79 (  0.50%)

           4.5.0-rc3   4.5.0-rc3
        mmotm-20160209   nodelru-v2
User          892.43      896.96
System        160.86      156.56
Elapsed      5990.52     6005.04

However, the overall VM stats are interesting


                             4.5.0-rc3   4.5.0-rc3
                          mmotm-20160209   nodelru-v2
Swap Ins                             8           0
Swap Outs                          705          52
Allocation stalls                 6480           0
DMA allocs                           0           0
DMA32 allocs                  38287801    35274742
Normal allocs                 64983682    67494335
Movable allocs                       0           0
Direct pages scanned           1334296           0
Kswapd pages scanned          77617741    78643061
Kswapd pages reclaimed        77493866    78481909
Direct pages reclaimed         1334220           0
Kswapd efficiency                  99%         99%
Kswapd velocity              12956.762   13096.176
Direct efficiency                  99%        100%
Direct velocity                222.735       0.000
Percentage direct scans             1%          0%

Note that there were no allocation stalls with this patch applied and no
direct reclaim activity.

tiobench
--------

tiobench is a flawed benchmark but it's very important in this case. tiobench
benefited from a bug prior to the fair zone allocation policy that allowed
old pages to be artificially preserved. The visible impact was that performance
exceeded the physical capabilities of the disk. With this patch applied the results are

tiobench Throughput
                                               4.5.0-rc3                   4.5.0-rc3
                                          mmotm-20160209                  nodelru-v2
Hmean    PotentialReadSpeed        91.27 (  0.00%)       89.89 ( -1.51%)
Hmean    SeqRead-MB/sec-1          84.97 (  0.00%)       84.33 ( -0.75%)
Hmean    SeqRead-MB/sec-2          75.18 (  0.00%)       75.02 ( -0.20%)
Hmean    SeqRead-MB/sec-4          77.05 (  0.00%)       77.07 (  0.03%)
Hmean    SeqRead-MB/sec-8          68.13 (  0.00%)       67.90 ( -0.33%)
Hmean    SeqRead-MB/sec-16         61.64 (  0.00%)       61.99 (  0.57%)
Hmean    RandRead-MB/sec-1          0.92 (  0.00%)        0.86 ( -6.49%)
Hmean    RandRead-MB/sec-2          1.06 (  0.00%)        1.09 (  2.53%)
Hmean    RandRead-MB/sec-4          1.49 (  0.00%)        1.47 ( -1.54%)
Hmean    RandRead-MB/sec-8          1.64 (  0.00%)        1.73 (  5.72%)
Hmean    RandRead-MB/sec-16         2.02 (  0.00%)        1.91 ( -5.45%)
Hmean    SeqWrite-MB/sec-1         83.03 (  0.00%)       82.91 ( -0.15%)
Hmean    SeqWrite-MB/sec-2         77.46 (  0.00%)       77.43 ( -0.03%)
Hmean    SeqWrite-MB/sec-4         80.92 (  0.00%)       80.90 ( -0.02%)
Hmean    SeqWrite-MB/sec-8         77.71 (  0.00%)       77.36 ( -0.45%)
Hmean    SeqWrite-MB/sec-16        79.23 (  0.00%)       79.36 (  0.17%)
Hmean    RandWrite-MB/sec-1         1.19 (  0.00%)        1.16 ( -2.29%)
Hmean    RandWrite-MB/sec-2         1.00 (  0.00%)        1.07 (  7.03%)
Hmean    RandWrite-MB/sec-4         0.96 (  0.00%)        1.05 (  8.67%)
Hmean    RandWrite-MB/sec-8         0.94 (  0.00%)        0.97 (  2.76%)
Hmean    RandWrite-MB/sec-16        0.95 (  0.00%)        0.93 ( -2.42%)

Note that the performance is almost identical allowing us to conclude that
the correct reclaim behaviour granted by the fair zone allocation policy
is preserved.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                             4.5.0-rc3             4.5.0-rc3
                        mmotm-20160209            nodelru-v2
Min         mmap     12.5114 (  0.00%)     13.5315 ( -8.15%)
1st-qrtle   mmap     14.4985 (  0.00%)     14.3907 (  0.74%)
2nd-qrtle   mmap     14.7093 (  0.00%)     14.5478 (  1.10%)
3rd-qrtle   mmap     15.7381 (  0.00%)     14.7581 (  6.23%)
Max-90%     mmap     16.4561 (  0.00%)     15.6516 (  4.89%)
Max-93%     mmap     16.9571 (  0.00%)     15.8844 (  6.33%)
Max-95%     mmap     17.2948 (  0.00%)     16.3679 (  5.36%)
Max-99%     mmap     21.1054 (  0.00%)     19.9593 (  5.43%)
Max         mmap   2815.7509 (  0.00%)   2717.4201 (  3.49%)
Mean        mmap     16.6965 (  0.00%)     14.9653 ( 10.37%)

There is a consistent improvement in mmap latency and some of this may be due
to less direct reclaim and more kswapd activity

                             4.5.0-rc3   4.5.0-rc3
                          mmotm-20160209   nodelru-v2
Minor Faults                  89868559    78842249
Major Faults                      1037         899
Swap Ins                           362         583
Swap Outs                            0           0
Allocation stalls                65758       31410
DMA allocs                           0           0
DMA32 allocs                1196649783  2633682376
Normal allocs               2227851590  1110162400
Movable allocs                       0           0
Direct pages scanned          28776006    15074415
Kswapd pages scanned          13051818    30529292
Kswapd pages reclaimed        12936208    26704609
Direct pages reclaimed        28774473    15074044

Best1%Mean  mmap     14.0438 (  0.00%)     13.7945 (  1.77%)

Other pagereclaim workloads were tested but the results are often repetitive

lmbench lat_mmap: no major performance difference, less direct reclaim scanning
parallelio: This measures how much an anonymous memory workload is affected by
	large amounts of background IO. Impact on workload is roughly comparable.
fsmark: This created large numbers of zero-length files to target the shrinkers.
	Shrinker activity was comparable.

Page allocator intensive workloads showed little difference as the cost
of the fair zone allocation policy does not dominate from a userspace perspective
but a microbench of just the allocator shows a difference

                                           4.5.0-rc3                  4.5.0-rc3
                                      mmotm-20160209                 nodelru-v2
Min      total-odr0-1              1075.00 (  0.00%)           606.00 ( 43.63%)
Min      total-odr0-2               786.00 (  0.00%)           456.00 ( 41.98%)
Min      total-odr0-4               383.00 (  0.00%)           377.00 (  1.57%)
Min      total-odr0-8               355.00 (  0.00%)           554.00 (-56.06%)
Min      total-odr0-16              312.00 (  0.00%)           293.00 (  6.09%)
Min      total-odr0-32              309.00 (  0.00%)           284.00 (  8.09%)
Min      total-odr0-64              283.00 (  0.00%)           269.00 (  4.95%)
Min      total-odr0-128             292.00 (  0.00%)           274.00 (  6.16%)
Min      total-odr0-256             305.00 (  0.00%)           292.00 (  4.26%)
Min      total-odr0-512             335.00 (  0.00%)           333.00 (  0.60%)
Min      total-odr0-1024            347.00 (  0.00%)           347.00 (  0.00%)
Min      total-odr0-2048            361.00 (  0.00%)           356.00 (  1.39%)
Min      total-odr0-4096            371.00 (  0.00%)           366.00 (  1.35%)
Min      total-odr0-8192            376.00 (  0.00%)           368.00 (  2.13%)
Min      total-odr0-16384           377.00 (  0.00%)           368.00 (  2.39%)

 Documentation/cgroup-v1/memcg_test.txt    |   4 +-
 Documentation/cgroup-v1/memory.txt        |   4 +-
 arch/s390/appldata/appldata_mem.c         |   2 +-
 arch/tile/mm/pgtable.c                    |  18 +-
 drivers/base/node.c                       |  73 +--
 drivers/staging/android/lowmemorykiller.c |  12 +-
 fs/fs-writeback.c                         |   4 +-
 fs/fuse/file.c                            |   8 +-
 fs/nfs/internal.h                         |   2 +-
 fs/nfs/write.c                            |   2 +-
 fs/proc/meminfo.c                         |  14 +-
 include/linux/backing-dev.h               |   2 +-
 include/linux/memcontrol.h                |  30 +-
 include/linux/mm_inline.h                 |   4 +-
 include/linux/mm_types.h                  |   2 +-
 include/linux/mmzone.h                    | 156 +++---
 include/linux/swap.h                      |  15 +-
 include/linux/topology.h                  |   2 +-
 include/linux/vm_event_item.h             |  11 +-
 include/linux/vmstat.h                    | 106 +++-
 include/linux/writeback.h                 |   2 +-
 include/trace/events/vmscan.h             |  40 +-
 include/trace/events/writeback.h          |  10 +-
 kernel/power/snapshot.c                   |  10 +-
 kernel/sysctl.c                           |   4 +-
 mm/backing-dev.c                          |  15 +-
 mm/compaction.c                           |  28 +-
 mm/filemap.c                              |  14 +-
 mm/huge_memory.c                          |  14 +-
 mm/internal.h                             |  11 +-
 mm/memcontrol.c                           | 235 ++++-----
 mm/memory-failure.c                       |   4 +-
 mm/memory_hotplug.c                       |   7 +-
 mm/mempolicy.c                            |   2 +-
 mm/migrate.c                              |  35 +-
 mm/mlock.c                                |  12 +-
 mm/mmap.c                                 |   4 +-
 mm/nommu.c                                |   4 +-
 mm/page-writeback.c                       | 119 ++---
 mm/page_alloc.c                           | 269 +++++-----
 mm/page_idle.c                            |   4 +-
 mm/rmap.c                                 |  15 +-
 mm/shmem.c                                |  12 +-
 mm/swap.c                                 |  66 +--
 mm/swap_state.c                           |   4 +-
 mm/vmscan.c                               | 828 ++++++++++++++----------------
 mm/vmstat.c                               | 363 ++++++++++---
 mm/workingset.c                           |  51 +-
 48 files changed, 1455 insertions(+), 1198 deletions(-)

-- 
2.6.4

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone)
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 18:04   ` Johannes Weiner
  2016-03-03 10:37   ` Vlastimil Babka
  2016-02-23 15:04 ` [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim Mel Gorman
                   ` (27 subsequent siblings)
  28 siblings, 2 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

ac->classzone_idx is determined by the index of the preferred zone and cached
to avoid repeated calculations. wake_all_kswapds() should use it instead of
using zone_idx() within a loop.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c849a202b6f1..46ecbfa5b228 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3022,7 +3022,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 						ac->high_zoneidx, ac->nodemask)
-		wakeup_kswapd(zone, order, zone_idx(ac->preferred_zone));
+		wakeup_kswapd(zone, order, ac->classzone_idx);
 }
 
 static inline int
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
  2016-02-23 15:04 ` [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone) Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 18:06   ` Johannes Weiner
  2016-03-03 11:31   ` Vlastimil Babka
  2016-02-23 15:04 ` [PATCH 03/27] mm, vmstat: Add infrastructure for per-node vmstats Mel Gorman
                   ` (26 subsequent siblings)
  28 siblings, 2 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

Direct reclaim obeys cpusets but misses the cpusets_enabled() check.
The overhead is unlikely to be measurable in the direct reclaim
path which is expensive but there is no harm is doing it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86eb21491867..de8d6226e026 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2566,7 +2566,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		 * to global LRU.
 		 */
 		if (global_reclaim(sc)) {
-			if (!cpuset_zone_allowed(zone,
+			if (cpusets_enabled() && !cpuset_zone_allowed(zone,
 						 GFP_KERNEL | __GFP_HARDWALL))
 				continue;
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 03/27] mm, vmstat: Add infrastructure for per-node vmstats
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
  2016-02-23 15:04 ` [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone) Mel Gorman
  2016-02-23 15:04 ` [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 18:13   ` Johannes Weiner
  2016-02-23 15:04 ` [PATCH 04/27] mm, vmscan: Move lru_lock to the node Mel Gorman
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

VM statistic counters for reclaim decisions are zone-based. If the kernel
is to reclaim on a per-node basis then we need to track per-node statistics
but there is no infrastructure for that. The most notable change is that
the old node_page_state is renamed to sum_zone_node_page_state.  The new
node_page_state takes a pglist_data and uses per-node stats but none exist
yet. There is some renaming such as vm_stat to vm_zone_stat and the addition
of vm_node_stat and the renaming of mod_state to mod_zone_state. Otherwise,
this is mostly a mechanical patch with no functional change. There is a
lot of similarity between the node and zone helpers which is unfortunate
but there was no obvious way of reusing the code and maintaining type safety.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/base/node.c    |  72 +++++++------
 include/linux/mmzone.h |  13 +++
 include/linux/vmstat.h |  88 +++++++++++++--
 mm/page_alloc.c        |  10 +-
 mm/vmstat.c            | 287 +++++++++++++++++++++++++++++++++++++++++++------
 mm/workingset.c        |  10 +-
 6 files changed, 401 insertions(+), 79 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..efb81da250a8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -74,16 +74,16 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, K(node_page_state(nid, NR_ACTIVE_ANON) +
-				node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_INACTIVE_ANON) +
-				node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_ACTIVE_ANON)),
-		       nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
-		       nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_UNEVICTABLE)),
-		       nid, K(node_page_state(nid, NR_MLOCK)));
+		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
+				sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
+				sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
+		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
 	n += sprintf(buf + n,
@@ -115,28 +115,28 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d AnonHugePages:  %8lu kB\n"
 #endif
 			,
-		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
-		       nid, K(node_page_state(nid, NR_WRITEBACK)),
-		       nid, K(node_page_state(nid, NR_FILE_PAGES)),
-		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
-		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
+		       nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
+		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
+		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
+		       nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
-		       nid, node_page_state(nid, NR_KERNEL_STACK) *
+		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
-		       nid, K(node_page_state(nid, NR_PAGETABLE)),
-		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
-		       nid, K(node_page_state(nid, NR_BOUNCE)),
-		       nid, K(node_page_state(nid, NR_WRITEBACK_TEMP)),
-		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
-				node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
-		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_UNSTABLE_NFS)),
+		       nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK_TEMP)),
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) +
+				sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
 			, nid,
-			K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+			K(sum_zone_node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
 			HPAGE_PMD_NR));
 #else
-		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
 #endif
 	n += hugetlb_report_node_meminfo(nid, buf + n);
 	return n;
@@ -155,12 +155,12 @@ static ssize_t node_read_numastat(struct device *dev,
 		       "interleave_hit %lu\n"
 		       "local_node %lu\n"
 		       "other_node %lu\n",
-		       node_page_state(dev->id, NUMA_HIT),
-		       node_page_state(dev->id, NUMA_MISS),
-		       node_page_state(dev->id, NUMA_FOREIGN),
-		       node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
-		       node_page_state(dev->id, NUMA_LOCAL),
-		       node_page_state(dev->id, NUMA_OTHER));
+		       sum_zone_node_page_state(dev->id, NUMA_HIT),
+		       sum_zone_node_page_state(dev->id, NUMA_MISS),
+		       sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
+		       sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
+		       sum_zone_node_page_state(dev->id, NUMA_LOCAL),
+		       sum_zone_node_page_state(dev->id, NUMA_OTHER));
 }
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
@@ -168,12 +168,18 @@ static ssize_t node_read_vmstat(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
 	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	int i;
 	int n = 0;
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
-			     node_page_state(nid, i));
+			     sum_zone_node_page_state(nid, i));
+
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		n += sprintf(buf+n, "%s %lu\n",
+			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+			     node_page_state(pgdat, i));
 
 	return n;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bdd9a270a813..291719dadea6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -162,6 +162,10 @@ enum zone_stat_item {
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
+enum node_stat_item {
+	NR_VM_NODE_STAT_ITEMS
+};
+
 /*
  * We do arithmetic on the LRU lists in various places in the code,
  * so it is important to keep the active lists LRU_ACTIVE higher in
@@ -269,6 +273,11 @@ struct per_cpu_pageset {
 #endif
 };
 
+struct per_cpu_nodestat {
+	s8 stat_threshold;
+	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+};
+
 #endif /* !__GENERATING_BOUNDS.H */
 
 enum zone_type {
@@ -698,6 +707,10 @@ typedef struct pglist_data {
 	struct list_head split_queue;
 	unsigned long split_queue_len;
 #endif
+
+	/* Per-node vmstats */
+	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
+	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 73fae8c4a5fb..d9f8889263b6 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -106,20 +106,38 @@ static inline void vm_events_fold_cpu(int cpu)
 		zone_idx(zone), delta)
 
 /*
- * Zone based page accounting with per cpu differentials.
+ * Zone and node-based page accounting with per cpu differentials.
  */
-extern atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
 
 static inline void zone_page_state_add(long x, struct zone *zone,
 				 enum zone_stat_item item)
 {
 	atomic_long_add(x, &zone->vm_stat[item]);
-	atomic_long_add(x, &vm_stat[item]);
+	atomic_long_add(x, &vm_zone_stat[item]);
+}
+
+static inline void node_page_state_add(long x, struct pglist_data *pgdat,
+				 enum node_stat_item item)
+{
+	atomic_long_add(x, &pgdat->vm_stat[item]);
+	atomic_long_add(x, &vm_node_stat[item]);
 }
 
 static inline unsigned long global_page_state(enum zone_stat_item item)
 {
-	long x = atomic_long_read(&vm_stat[item]);
+	long x = atomic_long_read(&vm_zone_stat[item]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
+static inline unsigned long global_node_page_state(enum node_stat_item item)
+{
+	long x = atomic_long_read(&vm_node_stat[item]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -161,33 +179,48 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 }
 
 #ifdef CONFIG_NUMA
-
-extern unsigned long node_page_state(int node, enum zone_stat_item item);
+extern unsigned long sum_zone_node_page_state(int node,
+						enum zone_stat_item item);
+extern unsigned long node_page_state(struct pglist_data *pgdat,
+						enum node_stat_item item);
 extern void zone_statistics(struct zone *, struct zone *, gfp_t gfp);
 
 #else
 
-#define node_page_state(node, item) global_page_state(item)
+#define node_page_state(node, item) global_node_page_state(item)
 #define zone_statistics(_zl, _z, gfp) do { } while (0)
 
 #endif /* CONFIG_NUMA */
 
 #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
 #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))
+#define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d)
+#define sub_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, -(__d))
 
 #ifdef CONFIG_SMP
 void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
 void __inc_zone_page_state(struct page *, enum zone_stat_item);
 void __dec_zone_page_state(struct page *, enum zone_stat_item);
 
+void __mod_node_page_state(struct pglist_data *, enum node_stat_item item, long);
+void __inc_node_page_state(struct page *, enum node_stat_item);
+void __dec_node_page_state(struct page *, enum node_stat_item);
+
 void mod_zone_page_state(struct zone *, enum zone_stat_item, long);
 void inc_zone_page_state(struct page *, enum zone_stat_item);
 void dec_zone_page_state(struct page *, enum zone_stat_item);
 
+void mod_node_page_state(struct pglist_data *, enum node_stat_item, long);
+void inc_node_page_state(struct page *, enum node_stat_item);
+void dec_node_page_state(struct page *, enum node_stat_item);
+
 extern void inc_zone_state(struct zone *, enum zone_stat_item);
+extern void inc_node_state(struct pglist_data *, enum node_stat_item);
 extern void __inc_zone_state(struct zone *, enum zone_stat_item);
+extern void __inc_node_state(struct pglist_data *, enum node_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
+extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
@@ -211,16 +244,34 @@ static inline void __mod_zone_page_state(struct zone *zone,
 	zone_page_state_add(delta, zone, item);
 }
 
+static inline void __mod_node_page_state(struct pglist_data *pgdat,
+			enum node_stat_item item, int delta)
+{
+	node_page_state_add(delta, pgdat, item);
+}
+
 static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	atomic_long_inc(&zone->vm_stat[item]);
-	atomic_long_inc(&vm_stat[item]);
+	atomic_long_inc(&vm_zone_stat[item]);
+}
+
+static inline void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	atomic_long_inc(&pgdat->vm_stat[item]);
+	atomic_long_inc(&vm_node_stat[item]);
 }
 
 static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	atomic_long_dec(&zone->vm_stat[item]);
-	atomic_long_dec(&vm_stat[item]);
+	atomic_long_dec(&vm_zone_stat[item]);
+}
+
+static inline void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	atomic_long_dec(&pgdat->vm_stat[item]);
+	atomic_long_dec(&vm_node_stat[item]);
 }
 
 static inline void __inc_zone_page_state(struct page *page,
@@ -229,12 +280,26 @@ static inline void __inc_zone_page_state(struct page *page,
 	__inc_zone_state(page_zone(page), item);
 }
 
+static inline void __inc_node_page_state(struct page *page,
+			enum node_stat_item item)
+{
+	__inc_node_state(page_zone(page)->zone_pgdat, item);
+}
+
+
 static inline void __dec_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
 	__dec_zone_state(page_zone(page), item);
 }
 
+static inline void __dec_node_page_state(struct page *page,
+			enum node_stat_item item)
+{
+	__dec_node_state(page_zone(page)->zone_pgdat, item);
+}
+
+
 /*
  * We only use atomic operations to update counters. So there is no need to
  * disable interrupts.
@@ -243,7 +308,12 @@ static inline void __dec_zone_page_state(struct page *page,
 #define dec_zone_page_state __dec_zone_page_state
 #define mod_zone_page_state __mod_zone_page_state
 
+#define inc_node_page_state __inc_node_page_state
+#define dec_node_page_state __dec_node_page_state
+#define mod_node_page_state __mod_node_page_state
+
 #define inc_zone_state __inc_zone_state
+#define inc_node_state __inc_node_state
 #define dec_zone_state __dec_zone_state
 
 #define set_pgdat_percpu_threshold(pgdat, callback) { }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46ecbfa5b228..e06d548521af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3837,8 +3837,8 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
 		managed_pages += pgdat->node_zones[zone_type].managed_pages;
 	val->totalram = managed_pages;
-	val->sharedram = node_page_state(nid, NR_SHMEM);
-	val->freeram = node_page_state(nid, NR_FREE_PAGES);
+	val->sharedram = sum_zone_node_page_state(nid, NR_SHMEM);
+	val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
 #ifdef CONFIG_HIGHMEM
 	val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].managed_pages;
 	val->freehigh = zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM],
@@ -4941,6 +4941,11 @@ static void __meminit setup_zone_pageset(struct zone *zone)
 	zone->pageset = alloc_percpu(struct per_cpu_pageset);
 	for_each_possible_cpu(cpu)
 		zone_pageset_init(zone, cpu);
+
+	if (!zone->zone_pgdat->per_cpu_nodestats) {
+		zone->zone_pgdat->per_cpu_nodestats =
+			alloc_percpu(struct per_cpu_nodestat);
+	}
 }
 
 /*
@@ -5647,6 +5652,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	reset_deferred_meminit(pgdat);
 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
+	pgdat->per_cpu_nodestats = NULL;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 74f8c918ac4b..ea4e6f9f1094 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -86,8 +86,10 @@ void vm_events_fold_cpu(int cpu)
  *
  * vm_stat contains the global counters
  */
-atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
-EXPORT_SYMBOL(vm_stat);
+atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
+atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
+EXPORT_SYMBOL(vm_zone_stat);
+EXPORT_SYMBOL(vm_node_stat);
 
 #ifdef CONFIG_SMP
 
@@ -176,9 +178,13 @@ void refresh_zone_stat_thresholds(void)
 
 		threshold = calculate_normal_threshold(zone);
 
-		for_each_online_cpu(cpu)
+		for_each_online_cpu(cpu) {
+			struct pglist_data *pgdat = zone->zone_pgdat;
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold
+							= threshold;
+		}
 
 		/*
 		 * Only set percpu_drift_mark if there is a danger that
@@ -238,6 +244,26 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+				long delta)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long x;
+	long t;
+
+	x = delta + __this_cpu_read(*p);
+
+	t = __this_cpu_read(pcp->stat_threshold);
+
+	if (unlikely(x > t || x < -t)) {
+		node_page_state_add(x, pgdat, item);
+		x = 0;
+	}
+	__this_cpu_write(*p, x);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
 /*
  * Optimized increment and decrement functions.
  *
@@ -277,12 +303,34 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 	}
 }
 
+void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s8 v, t;
+
+	v = __this_cpu_inc_return(*p);
+	t = __this_cpu_read(pcp->stat_threshold);
+	if (unlikely(v > t)) {
+		s8 overstep = t >> 1;
+
+		node_page_state_add(v + overstep, pgdat, item);
+		__this_cpu_write(*p, -overstep);
+	}
+}
+
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	__inc_zone_state(page_zone(page), item);
 }
 EXPORT_SYMBOL(__inc_zone_page_state);
 
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	__inc_node_state(page_zone(page)->zone_pgdat, item);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
@@ -299,12 +347,34 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 	}
 }
 
+void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s8 v, t;
+
+	v = __this_cpu_dec_return(*p);
+	t = __this_cpu_read(pcp->stat_threshold);
+	if (unlikely(v < - t)) {
+		s8 overstep = t >> 1;
+
+		node_page_state_add(v - overstep, pgdat, item);
+		__this_cpu_write(*p, overstep);
+	}
+}
+
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	__dec_zone_state(page_zone(page), item);
 }
 EXPORT_SYMBOL(__dec_zone_page_state);
 
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	__dec_node_state(page_zone(page)->zone_pgdat, item);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+
 #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
 /*
  * If we have cmpxchg_local support then we do not need to incur the overhead
@@ -318,8 +388,8 @@ EXPORT_SYMBOL(__dec_zone_page_state);
  *     1       Overstepping half of threshold
  *     -1      Overstepping minus half of threshold
 */
-static inline void mod_state(struct zone *zone, enum zone_stat_item item,
-			     long delta, int overstep_mode)
+static inline void mod_zone_state(struct zone *zone,
+       enum zone_stat_item item, long delta, int overstep_mode)
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
@@ -359,26 +429,88 @@ static inline void mod_state(struct zone *zone, enum zone_stat_item item,
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 			 long delta)
 {
-	mod_state(zone, item, delta, 0);
+	mod_zone_state(zone, item, delta, 0);
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
 void inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	mod_state(zone, item, 1, 1);
+	mod_zone_state(zone, item, 1, 1);
 }
 
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	mod_state(page_zone(page), item, 1, 1);
+	mod_zone_state(page_zone(page), item, 1, 1);
 }
 EXPORT_SYMBOL(inc_zone_page_state);
 
 void dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	mod_state(page_zone(page), item, -1, -1);
+	mod_zone_state(page_zone(page), item, -1, -1);
 }
 EXPORT_SYMBOL(dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+       enum node_stat_item item, int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a zone.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (n > t || n < -t) {
+			int os = overstep_mode * (t >> 1) ;
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_zone(page)->zone_pgdat, item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_zone(page)->zone_pgdat, item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
 #else
 /*
  * Use interrupt disable to serialize counter updates
@@ -394,15 +526,6 @@ void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
-void inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__inc_zone_state(zone, item);
-	local_irq_restore(flags);
-}
-
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	unsigned long flags;
@@ -424,21 +547,69 @@ void dec_zone_page_state(struct page *page, enum zone_stat_item item)
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(dec_zone_page_state);
-#endif
 
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_node_state(pgdat, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_node_state);
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_node_page_state(node, item, delta);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	unsigned long flags;
+	struct pglist_data *pgdat;
+
+	pgdat = page_zone(page)->zone_pgdat;
+	local_irq_save(flags);
+	__inc_zone_state(pgdat, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_node_page_state(page, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+#endif
 
 /*
  * Fold a differential into the global counters.
  * Returns the number of counters updated.
  */
-static int fold_diff(int *diff)
+static int fold_diff(int *zone_diff, int *node_diff)
 {
 	int i;
 	int changes = 0;
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (diff[i]) {
-			atomic_long_add(diff[i], &vm_stat[i]);
+		if (zone_diff[i]) {
+			atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
+			changes++;
+	}
+
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		if (node_diff[i]) {
+			atomic_long_add(node_diff[i], &vm_node_stat[i]);
 			changes++;
 	}
 	return changes;
@@ -462,9 +633,11 @@ static int fold_diff(int *diff)
  */
 static int refresh_cpu_vm_stats(bool do_pagesets)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int i;
-	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
 	for_each_populated_zone(zone) {
@@ -477,7 +650,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			if (v) {
 
 				atomic_long_add(v, &zone->vm_stat[i]);
-				global_diff[i] += v;
+				global_zone_diff[i] += v;
 #ifdef CONFIG_NUMA
 				/* 3 seconds idle till flush */
 				__this_cpu_write(p->expire, 3);
@@ -516,7 +689,22 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		}
 #endif
 	}
-	changes += fold_diff(global_diff);
+
+	for_each_online_pgdat(pgdat) {
+		struct per_cpu_nodestat __percpu *p = pgdat->per_cpu_nodestats;
+
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+			int v;
+
+			v = this_cpu_xchg(p->vm_node_stat_diff[i], 0);
+			if (v) {
+				atomic_long_add(v, &pgdat->vm_stat[i]);
+				global_node_diff[i] += v;
+			}
+		}
+	}
+
+	changes += fold_diff(global_zone_diff, global_node_diff);
 	return changes;
 }
 
@@ -527,9 +715,11 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
  */
 void cpu_vm_stats_fold(int cpu)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int i;
-	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
@@ -543,11 +733,25 @@ void cpu_vm_stats_fold(int cpu)
 				v = p->vm_stat_diff[i];
 				p->vm_stat_diff[i] = 0;
 				atomic_long_add(v, &zone->vm_stat[i]);
-				global_diff[i] += v;
+				global_zone_diff[i] += v;
+			}
+	}
+
+	for_each_online_pgdat(pgdat) {
+		struct per_cpu_nodestat __percpu *p = pgdat->per_cpu_nodestats;
+
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+			if (p->vm_node_stat_diff[i]) {
+				int v;
+
+				v = p->vm_node_stat_diff[i];
+				p->vm_node_stat_diff[i] = 0;
+				atomic_long_add(v, &pgdat->vm_stat[i]);
+				global_node_diff[i] += v;
 			}
 	}
 
-	fold_diff(global_diff);
+	fold_diff(global_zone_diff, global_node_diff);
 }
 
 /*
@@ -563,7 +767,7 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 			int v = pset->vm_stat_diff[i];
 			pset->vm_stat_diff[i] = 0;
 			atomic_long_add(v, &zone->vm_stat[i]);
-			atomic_long_add(v, &vm_stat[i]);
+			atomic_long_add(v, &vm_zone_stat[i]);
 		}
 }
 #endif
@@ -595,9 +799,12 @@ void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags)
 }
 
 /*
- * Determine the per node value of a stat item.
+ * Determine the per node value of a stat item. This function
+ * is called frequently in a NUMA machine, so try to be as
+ * frugal as possible.
  */
-unsigned long node_page_state(int node, enum zone_stat_item item)
+unsigned long sum_zone_node_page_state(int node,
+				 enum zone_stat_item item)
 {
 	struct zone *zones = NODE_DATA(node)->node_zones;
 
@@ -615,6 +822,19 @@ unsigned long node_page_state(int node, enum zone_stat_item item)
 		zone_page_state(&zones[ZONE_MOVABLE], item);
 }
 
+/*
+ * Determine the per node value of a stat item.
+ */
+unsigned long node_page_state(struct pglist_data *pgdat,
+				enum node_stat_item item)
+{
+	long x = atomic_long_read(&pgdat->vm_stat[item]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
 #endif
 
 #ifdef CONFIG_COMPACTION
@@ -1303,6 +1523,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	if (*pos >= ARRAY_SIZE(vmstat_text))
 		return NULL;
 	stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
+			  NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
 			  NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
@@ -1317,6 +1538,10 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		v[i] = global_page_state(i);
 	v += NR_VM_ZONE_STAT_ITEMS;
 
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		v[i] = global_node_page_state(i);
+	v += NR_VM_NODE_STAT_ITEMS;
+
 	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
 			    v + NR_DIRTY_THRESHOLD);
 	v += NR_VM_WRITEBACK_STAT_ITEMS;
diff --git a/mm/workingset.c b/mm/workingset.c
index 8a75f8d2916a..173399b239be 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -349,12 +349,14 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
 	local_irq_enable();
 
-	if (memcg_kmem_enabled())
+	if (memcg_kmem_enabled()) {
 		pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
 						     LRU_ALL_FILE);
-	else
-		pages = node_page_state(sc->nid, NR_ACTIVE_FILE) +
-			node_page_state(sc->nid, NR_INACTIVE_FILE);
+	} else {
+		pg_data_t *pgdat = NODE_DATA(sc->nid);
+		pages = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			node_page_state(pgdat, NR_INACTIVE_FILE);
+	}
 
 	/*
 	 * Active cache pages are limited to 50% of memory, and shadow
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 04/27] mm, vmscan: Move lru_lock to the node
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (2 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 03/27] mm, vmstat: Add infrastructure for per-node vmstats Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 18:40   ` Johannes Weiner
  2016-02-23 15:04 ` [PATCH 05/27] mm, vmscan: Move LRU lists to node Mel Gorman
                   ` (24 subsequent siblings)
  28 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

Node-based reclaim requires node-based LRUs and locking. This is a
preparation patch that just moves the lru_lock to the node so later patches
are easier to review. It is a mechanical change but note this patch makes
contention worse because the LRU lock is hotter and direct reclaim and kswapd
can contend on the same lock even when reclaiming from different zones.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/cgroup-v1/memcg_test.txt |  4 +--
 Documentation/cgroup-v1/memory.txt     |  4 +--
 include/linux/mm_types.h               |  2 +-
 include/linux/mmzone.h                 | 10 +++++--
 mm/compaction.c                        |  6 ++---
 mm/filemap.c                           |  4 +--
 mm/huge_memory.c                       |  4 +--
 mm/memcontrol.c                        |  6 ++---
 mm/mlock.c                             | 10 +++----
 mm/page_alloc.c                        |  4 +--
 mm/page_idle.c                         |  4 +--
 mm/rmap.c                              |  2 +-
 mm/swap.c                              | 30 ++++++++++-----------
 mm/vmscan.c                            | 48 +++++++++++++++++-----------------
 14 files changed, 72 insertions(+), 66 deletions(-)

diff --git a/Documentation/cgroup-v1/memcg_test.txt b/Documentation/cgroup-v1/memcg_test.txt
index 8870b0212150..78a8c2963b38 100644
--- a/Documentation/cgroup-v1/memcg_test.txt
+++ b/Documentation/cgroup-v1/memcg_test.txt
@@ -107,9 +107,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global zone->lru_lock).
+	VM's control (means that it's handled under global zone_lru_lock).
 	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under zone->lru_lock().
+	list management functions under zone_lru_lock().
 
 	A special function is mem_cgroup_isolate_pages(). This scans
 	memcg's private LRU and call __isolate_lru_page() to extract a page
diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index ff71e16cc752..2060e402869c 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
    Other lock order is following:
    PG_locked.
    mm->page_table_lock
-       zone->lru_lock
+       zone_lru_lock
 	  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  zone->lru_lock, it has no lock of its own.
+  zone_lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 944b2b37313b..73a06e582bfa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -112,7 +112,7 @@ struct page {
 	 */
 	union {
 		struct list_head lru;	/* Pageout list, eg. active_list
-					 * protected by zone->lru_lock !
+					 * protected by zone_lru_lock !
 					 * Can be used as a generic list
 					 * by the page owner.
 					 */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 291719dadea6..86539225067e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,7 +100,7 @@ struct free_area {
 struct pglist_data;
 
 /*
- * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
@@ -498,7 +498,6 @@ struct zone {
 	/* Write-intensive fields used by page reclaim */
 
 	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
 	/*
@@ -693,6 +692,9 @@ typedef struct pglist_data {
 	/* Number of pages migrated during the rate limiting time interval */
 	unsigned long numabalancing_migrate_nr_pages;
 #endif
+	/* Write-intensive fields used from the page allocator */
+	ZONE_PADDING(_pad1_)
+	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
@@ -724,6 +726,10 @@ typedef struct pglist_data {
 
 #define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
+static inline spinlock_t *zone_lru_lock(struct zone *zone)
+{
+	return &zone->zone_pgdat->lru_lock;
+}
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
 {
diff --git a/mm/compaction.c b/mm/compaction.c
index 4cb1c2ef5abb..c85bd016754f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -670,7 +670,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 * if contended.
 		 */
 		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&zone->lru_lock, flags,
+		    && compact_unlock_should_abort(zone_lru_lock(zone), flags,
 								&locked, cc))
 			break;
 
@@ -747,7 +747,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
-			locked = compact_trylock_irqsave(&zone->lru_lock,
+			locked = compact_trylock_irqsave(zone_lru_lock(zone),
 								&flags, cc);
 			if (!locked)
 				break;
@@ -798,7 +798,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		low_pfn = end_pfn;
 
 	if (locked)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 
 	/*
 	 * Update the pageblock-skip information and cached scanner pfn,
diff --git a/mm/filemap.c b/mm/filemap.c
index 6414803972f7..99d921e1e1bb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -95,8 +95,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->tree_lock		(try_to_unmap_one)
- *    ->zone.lru_lock		(follow_page->mark_page_accessed)
- *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->zone_lru_lock(zone)	(follow_page->mark_page_accessed)
+ *    ->zone_lru_lock(zone)	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8caec8f13dfc..59ca13bb13f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3356,7 +3356,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 	int i;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	lruvec = mem_cgroup_page_lruvec(head, zone);
 
 	/* complete memcg works before add pages to LRU */
@@ -3366,7 +3366,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 		__split_huge_page_tail(head, i, lruvec, list);
 
 	ClearPageCompound(head);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	unfreeze_page(page_anon_vma(head), head);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae8b81c55685..c1acf0edd3b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2088,7 +2088,7 @@ static void lock_page_lru(struct page *page, int *isolated)
 {
 	struct zone *zone = page_zone(page);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
@@ -2112,7 +2112,7 @@ static void unlock_page_lru(struct page *page, int isolated)
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
@@ -2377,7 +2377,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * zone->lru_lock and migration entries setup in all page mappings.
+ * zone_lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 96f001041928..ce7dabd53e7e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -188,7 +188,7 @@ unsigned int munlock_vma_page(struct page *page)
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	nr_pages = hpage_nr_pages(page);
 	if (!TestClearPageMlocked(page))
@@ -197,14 +197,14 @@ unsigned int munlock_vma_page(struct page *page)
 	__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
 
 	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irq(zone_lru_lock(zone));
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 out:
 	return nr_pages - 1;
@@ -289,7 +289,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	pagevec_init(&pvec_putback, 0);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -315,7 +315,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	}
 	delta_munlocked = -nr + pagevec_count(&pvec_putback);
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e06d548521af..288015b5ee24 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5570,10 +5570,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
+		zone->zone_pgdat = pgdat;
 		spin_lock_init(&zone->lock);
-		spin_lock_init(&zone->lru_lock);
+		spin_lock_init(zone_lru_lock(zone));
 		zone_seqlock_init(zone);
-		zone->zone_pgdat = pgdat;
 		zone_pcp_init(zone);
 
 		/* For bootup, initialized properly in watermark setup */
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 4ea9c4ef5146..ae11aa914e55 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -41,12 +41,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
 		return NULL;
 
 	zone = page_zone(page);
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 	return page;
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 02f0bfc3c80a..6ed6699c45fc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
  *         mapping->i_mmap_rwsem
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               zone->lru_lock (in mark_page_accessed, isolate_lru_page)
+ *               zone_lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e97714a..4067911033e1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -59,12 +59,12 @@ static void __page_cache_release(struct page *page)
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&zone->lru_lock, flags);
+		spin_lock_irqsave(zone_lru_lock(zone), flags);
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 	}
 	mem_cgroup_uncharge(page);
 }
@@ -186,16 +186,16 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 		if (pagezone != zone) {
 			if (zone)
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 			zone = pagezone;
-			spin_lock_irqsave(&zone->lru_lock, flags);
+			spin_lock_irqsave(zone_lru_lock(zone), flags);
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		(*move_fn)(page, lruvec, arg);
 	}
 	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 	release_pages(pvec->pages, pvec->nr, pvec->cold);
 	pagevec_reinit(pvec);
 }
@@ -315,9 +315,9 @@ void activate_page(struct page *page)
 {
 	struct zone *zone = page_zone(page);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 }
 #endif
 
@@ -446,13 +446,13 @@ void add_page_to_unevictable_list(struct page *page)
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	lruvec = mem_cgroup_page_lruvec(page, zone);
 	ClearPageActive(page);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 }
 
 /**
@@ -724,7 +724,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 		 * same zone. The lock is held only if zone != NULL.
 		 */
 		if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 			zone = NULL;
 		}
 
@@ -734,7 +734,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 
 		if (PageCompound(page)) {
 			if (zone) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 				zone = NULL;
 			}
 			__put_compound_page(page);
@@ -746,11 +746,11 @@ void release_pages(struct page **pages, int nr, bool cold)
 
 			if (pagezone != zone) {
 				if (zone)
-					spin_unlock_irqrestore(&zone->lru_lock,
+					spin_unlock_irqrestore(zone_lru_lock(zone),
 									flags);
 				lock_batch = 0;
 				zone = pagezone;
-				spin_lock_irqsave(&zone->lru_lock, flags);
+				spin_lock_irqsave(zone_lru_lock(zone), flags);
 			}
 
 			lruvec = mem_cgroup_page_lruvec(page, zone);
@@ -765,7 +765,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_hot_cold_page_list(&pages_to_free, cold);
@@ -801,7 +801,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
 	VM_BUG_ON(NR_CPUS != 1 &&
-		  !spin_is_locked(&lruvec_zone(lruvec)->lru_lock));
+		  !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
 
 	if (!list)
 		SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index de8d6226e026..13a8ca37ab42 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1343,7 +1343,7 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 }
 
 /*
- * zone->lru_lock is heavily contended.  Some of the functions that
+ * zone_lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
@@ -1441,7 +1441,7 @@ int isolate_lru_page(struct page *page)
 		struct zone *zone = page_zone(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&zone->lru_lock);
+		spin_lock_irq(zone_lru_lock(zone));
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
@@ -1450,7 +1450,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irq(zone_lru_lock(zone));
 	}
 	return ret;
 }
@@ -1509,9 +1509,9 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&zone->lru_lock);
+			spin_unlock_irq(zone_lru_lock(zone));
 			putback_lru_page(page);
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irq(zone_lru_lock(zone));
 			continue;
 		}
 
@@ -1532,10 +1532,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irq(zone_lru_lock(zone));
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				spin_lock_irq(zone_lru_lock(zone));
 			} else
 				list_add(&page->lru, &pages_to_free);
 		}
@@ -1597,7 +1597,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, lru);
@@ -1612,7 +1612,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		else
 			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	if (nr_taken == 0)
 		return 0;
@@ -1622,7 +1622,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 				&nr_writeback, &nr_immediate,
 				false);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
@@ -1639,7 +1639,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_hot_cold_page_list(&page_list, true);
@@ -1713,9 +1713,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone->lru_lock across the whole operation.  But if
+ * appropriate to hold zone_lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone->lru_lock around each page.  It's impossible to balance
+ * should drop zone_lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
@@ -1752,10 +1752,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irq(zone_lru_lock(zone));
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				spin_lock_irq(zone_lru_lock(zone));
 			} else
 				list_add(&page->lru, pages_to_free);
 		}
@@ -1790,7 +1790,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
@@ -1802,7 +1802,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
 	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -1847,7 +1847,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -1859,7 +1859,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	mem_cgroup_uncharge_list(&l_hold);
 	free_hot_cold_page_list(&l_hold, true);
@@ -2094,7 +2094,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2115,7 +2115,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -3809,9 +3809,9 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 		pagezone = page_zone(page);
 		if (pagezone != zone) {
 			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irq(zone_lru_lock(zone));
 			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irq(zone_lru_lock(zone));
 		}
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 
@@ -3832,7 +3832,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 	if (zone) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irq(zone_lru_lock(zone));
 	}
 }
 #endif /* CONFIG_SHMEM */
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 05/27] mm, vmscan: Move LRU lists to node
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (3 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 04/27] mm, vmscan: Move lru_lock to the node Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 18:42   ` Johannes Weiner
  2016-02-23 15:04 ` [PATCH 06/27] mm, vmscan: Begin reclaiming pages on a per-node basis Mel Gorman
                   ` (23 subsequent siblings)
  28 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

This moves the LRU lists from the zone to the node and all related data
such as counters, tracing, congestion tracking and writeback tracking.
This is mostly a mechanical patch but note that it introduces a number
of anomalies. For example, the scans are per-zone but using per-node
counters. We also mark a node as congested when a zone is congested. This
causes weird problems that are fixed later but is easier to review.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 arch/tile/mm/pgtable.c                    |   8 +-
 drivers/base/node.c                       |  19 +--
 drivers/staging/android/lowmemorykiller.c |   8 +-
 include/linux/backing-dev.h               |   2 +-
 include/linux/memcontrol.h                |   8 +-
 include/linux/mm_inline.h                 |   4 +-
 include/linux/mmzone.h                    |  66 ++++++----
 include/linux/swap.h                      |   2 +-
 include/linux/vm_event_item.h             |  10 +-
 include/linux/vmstat.h                    |  18 +++
 include/trace/events/vmscan.h             |  12 +-
 kernel/power/snapshot.c                   |  10 +-
 mm/backing-dev.c                          |  15 ++-
 mm/compaction.c                           |  18 +--
 mm/huge_memory.c                          |   6 +-
 mm/internal.h                             |   2 +-
 mm/memcontrol.c                           |  18 +--
 mm/memory-failure.c                       |   4 +-
 mm/memory_hotplug.c                       |   2 +-
 mm/mempolicy.c                            |   2 +-
 mm/migrate.c                              |  21 +--
 mm/mlock.c                                |   2 +-
 mm/page-writeback.c                       |   8 +-
 mm/page_alloc.c                           |  99 ++++++++------
 mm/swap.c                                 |  56 ++++----
 mm/vmscan.c                               | 210 ++++++++++++++++--------------
 mm/vmstat.c                               |  45 +++----
 27 files changed, 368 insertions(+), 307 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 7bf2491a9c1f..3ed0a666d44a 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
 	struct zone *zone;
 
 	pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
-	       (global_page_state(NR_ACTIVE_ANON) +
-		global_page_state(NR_ACTIVE_FILE)),
-	       (global_page_state(NR_INACTIVE_ANON) +
-		global_page_state(NR_INACTIVE_FILE)),
+	       (global_node_page_state(NR_ACTIVE_ANON) +
+		global_node_page_state(NR_ACTIVE_FILE)),
+	       (global_node_page_state(NR_INACTIVE_ANON) +
+		global_node_page_state(NR_INACTIVE_FILE)),
 	       global_page_state(NR_FILE_DIRTY),
 	       global_page_state(NR_WRITEBACK),
 	       global_page_state(NR_UNSTABLE_NFS),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index efb81da250a8..4260c7f3ee1b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 {
 	int n;
 	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct sysinfo i;
 
 	si_meminfo_node(&i, nid);
@@ -74,15 +75,15 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
-				sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
-				sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
-		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
+		       nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
+				node_page_state(pgdat, NR_ACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
+				node_page_state(pgdat, NR_INACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
+		       nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
+		       nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
 		       nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 4b8a56cda6ca..7d677791d13a 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -70,10 +70,10 @@ static unsigned long lowmem_deathpending_timeout;
 static unsigned long lowmem_count(struct shrinker *s,
 				  struct shrink_control *sc)
 {
-	return global_page_state(NR_ACTIVE_ANON) +
-		global_page_state(NR_ACTIVE_FILE) +
-		global_page_state(NR_INACTIVE_ANON) +
-		global_page_state(NR_INACTIVE_FILE);
+	return global_node_page_state(NR_ACTIVE_ANON) +
+		global_node_page_state(NR_ACTIVE_FILE) +
+		global_node_page_state(NR_INACTIVE_ANON) +
+		global_node_page_state(NR_INACTIVE_FILE);
 }
 
 static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c82794f20110..491a91717788 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -197,7 +197,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 }
 
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(struct zone *zone, int sync, long timeout);
+long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1191d79aa495..f5626a5c88c2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -307,7 +307,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -595,13 +595,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 						    struct mem_cgroup *memcg)
 {
-	return &zone->lruvec;
+	return zone_lruvec(zone);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
-						    struct zone *zone)
+						    struct pglist_data *pgdat)
 {
-	return &zone->lruvec;
+	return &pgdat->lruvec;
 }
 
 static inline bool mm_match_cgroup(struct mm_struct *mm,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 712e8c37a200..5817ae41ba30 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -28,7 +28,7 @@ static __always_inline void add_page_to_lru_list(struct page *page,
 	int nr_pages = hpage_nr_pages(page);
 	mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
 	list_add(&page->lru, &lruvec->lists[lru]);
-	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
+	__mod_node_page_state(page_zone(page)->zone_pgdat, NR_LRU_BASE + lru, nr_pages);
 }
 
 static __always_inline void del_page_from_lru_list(struct page *page,
@@ -37,7 +37,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
 	int nr_pages = hpage_nr_pages(page);
 	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
 	list_del(&page->lru);
-	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
+	__mod_node_page_state(page_zone(page)->zone_pgdat, NR_LRU_BASE + lru, -nr_pages);
 }
 
 /**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 86539225067e..93151a9a4f56 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -118,12 +118,6 @@ enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
 	NR_ALLOC_BATCH,
-	NR_LRU_BASE,
-	NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
-	NR_ACTIVE_ANON,		/*  "     "     "   "       "         */
-	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
-	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
-	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
@@ -141,12 +135,9 @@ enum zone_stat_item {
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
-	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
-	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
-	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
@@ -163,6 +154,15 @@ enum zone_stat_item {
 	NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
+	NR_LRU_BASE,
+	NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
+	NR_ACTIVE_ANON,		/*  "     "     "   "       "         */
+	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
+	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
+	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
+	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
+	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 	NR_VM_NODE_STAT_ITEMS
 };
 
@@ -221,7 +221,7 @@ struct lruvec {
 	/* Evictions & activations on the inactive file list */
 	atomic_long_t			inactive_age;
 #ifdef CONFIG_MEMCG
-	struct zone			*zone;
+	struct pglist_data *pgdat;
 #endif
 };
 
@@ -359,13 +359,6 @@ struct zone {
 #ifdef CONFIG_NUMA
 	int node;
 #endif
-
-	/*
-	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
-	 * this zone's LRU.  Maintained by the pageout code.
-	 */
-	unsigned int inactive_ratio;
-
 	struct pglist_data	*zone_pgdat;
 	struct per_cpu_pageset __percpu *pageset;
 
@@ -497,9 +490,6 @@ struct zone {
 
 	/* Write-intensive fields used by page reclaim */
 
-	/* Fields commonly accessed by the page reclaim scanner */
-	struct lruvec		lruvec;
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
@@ -540,17 +530,20 @@ struct zone {
 enum zone_flags {
 	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 	ZONE_OOM_LOCKED,		/* zone is in OOM killer zonelist */
-	ZONE_CONGESTED,			/* zone has many dirty pages backed by
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
+};
+
+enum pgdat_flags {
+	PGDAT_CONGESTED,		/* zone has many dirty pages backed by
 					 * a congested BDI
 					 */
-	ZONE_DIRTY,			/* reclaim scanning has recently found
+	PGDAT_DIRTY,			/* reclaim scanning has recently found
 					 * many dirty file pages at the tail
 					 * of the LRU.
 					 */
-	ZONE_WRITEBACK,			/* reclaim scanning has recently found
+	PGDAT_WRITEBACK,		/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
-	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -704,12 +697,26 @@ typedef struct pglist_data {
 	unsigned long first_deferred_pfn;
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	spinlock_t split_queue_lock;
 	struct list_head split_queue;
 	unsigned long split_queue_len;
 #endif
 
+	/* Fields commonly accessed by the page reclaim scanner */
+	struct lruvec		lruvec;
+
+	/*
+	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
+	 * this node's LRU.  Maintained by the pageout code.
+	 */
+	unsigned int inactive_ratio;
+
+	unsigned long		flags;
+
+	ZONE_PADDING(_pad2_)
+
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
@@ -731,6 +738,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
 	return &zone->zone_pgdat->lru_lock;
 }
 
+static inline struct lruvec *zone_lruvec(struct zone *zone)
+{
+	return &zone->zone_pgdat->lruvec;
+}
+
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
 {
 	return pgdat->node_start_pfn + pgdat->node_spanned_pages;
@@ -778,12 +790,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
 
 extern void lruvec_init(struct lruvec *lruvec);
 
-static inline struct zone *lruvec_zone(struct lruvec *lruvec)
+static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
 {
 #ifdef CONFIG_MEMCG
-	return lruvec->zone;
+	return lruvec->pgdat;
 #else
-	return container_of(lruvec, struct zone, lruvec);
+	return container_of(lruvec, struct pglist_data, lruvec);
 #endif
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b14a2bb33514..a776f6506522 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,7 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
+extern unsigned long pgdat_reclaimable_pages(pg_data_t *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ec084321fe09..8dcb5a813163 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -26,11 +26,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
-		FOR_ALL_ZONES(PGREFILL),
-		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
-		FOR_ALL_ZONES(PGSTEAL_DIRECT),
-		FOR_ALL_ZONES(PGSCAN_KSWAPD),
-		FOR_ALL_ZONES(PGSCAN_DIRECT),
+		PGREFILL,
+		PGSTEAL_KSWAPD,
+		PGSTEAL_DIRECT,
+		PGSCAN_KSWAPD,
+		PGSCAN_DIRECT,
 		PGSCAN_DIRECT_THROTTLE,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d9f8889263b6..8a43f7b80c20 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -178,6 +178,23 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 	return x;
 }
 
+static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
+					enum zone_stat_item item)
+{
+	long x = atomic_long_read(&pgdat->vm_stat[item]);
+
+#ifdef CONFIG_SMP
+	int cpu;
+	for_each_online_cpu(cpu)
+		x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item];
+
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
+
 #ifdef CONFIG_NUMA
 extern unsigned long sum_zone_node_page_state(int node,
 						enum zone_stat_item item);
@@ -187,6 +204,7 @@ extern void zone_statistics(struct zone *, struct zone *, gfp_t gfp);
 
 #else
 
+#define sum_zone_node_page_state(node, item) global_node_page_state(item)
 #define node_page_state(node, item) global_node_page_state(item)
 #define zone_statistics(_zl, _z, gfp) do { } while (0)
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 0101ef37f1ee..897f1aa1ee5f 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -352,15 +352,14 @@ TRACE_EVENT(mm_vmscan_writepage,
 
 TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 
-	TP_PROTO(struct zone *zone,
+	TP_PROTO(int nid,
 		unsigned long nr_scanned, unsigned long nr_reclaimed,
 		int priority, int file),
 
-	TP_ARGS(zone, nr_scanned, nr_reclaimed, priority, file),
+	TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
 
 	TP_STRUCT__entry(
 		__field(int, nid)
-		__field(int, zid)
 		__field(unsigned long, nr_scanned)
 		__field(unsigned long, nr_reclaimed)
 		__field(int, priority)
@@ -368,16 +367,15 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 	),
 
 	TP_fast_assign(
-		__entry->nid = zone_to_nid(zone);
-		__entry->zid = zone_idx(zone);
+		__entry->nid = nid;
 		__entry->nr_scanned = nr_scanned;
 		__entry->nr_reclaimed = nr_reclaimed;
 		__entry->priority = priority;
 		__entry->reclaim_flags = trace_shrink_flags(file);
 	),
 
-	TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
-		__entry->nid, __entry->zid,
+	TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+		__entry->nid,
 		__entry->nr_scanned, __entry->nr_reclaimed,
 		__entry->priority,
 		show_reclaim_flags(__entry->reclaim_flags))
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 3a970604308f..24a06bc23f85 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1525,11 +1525,11 @@ static unsigned long minimum_image_size(unsigned long saveable)
 	unsigned long size;
 
 	size = global_page_state(NR_SLAB_RECLAIMABLE)
-		+ global_page_state(NR_ACTIVE_ANON)
-		+ global_page_state(NR_INACTIVE_ANON)
-		+ global_page_state(NR_ACTIVE_FILE)
-		+ global_page_state(NR_INACTIVE_FILE)
-		- global_page_state(NR_FILE_MAPPED);
+		+ global_node_page_state(NR_ACTIVE_ANON)
+		+ global_node_page_state(NR_INACTIVE_ANON)
+		+ global_node_page_state(NR_ACTIVE_FILE)
+		+ global_node_page_state(NR_INACTIVE_FILE)
+		- global_node_page_state(NR_FILE_MAPPED);
 
 	return saveable <= size ? 0 : saveable - size;
 }
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5683592f79d5..b59d4ebdf54a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -982,24 +982,24 @@ long congestion_wait(int sync, long timeout)
 EXPORT_SYMBOL(congestion_wait);
 
 /**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
- * @zone: A zone to check if it is heavily congested
+ * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
+ * @pgdat: A pgdat to check if it is heavily congested
  * @sync: SYNC or ASYNC IO
  * @timeout: timeout in jiffies
  *
  * In the event of a congested backing_dev (any backing_dev) and the given
- * @zone has experienced recent congestion, this waits for up to @timeout
+ * @pgdat has experienced recent congestion, this waits for up to @timeout
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
+ * In the absence of pgdat congestion, cond_resched() is called to yield
  * the processor if necessary but otherwise does not sleep.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
  * returned. return_value == timeout implies the function did not sleep.
  */
-long wait_iff_congested(struct zone *zone, int sync, long timeout)
+long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
@@ -1008,12 +1008,13 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 
 	/*
 	 * If there is no congestion, or heavy congestion is not being
-	 * encountered in the current zone, yield if necessary instead
+	 * encountered in the current pgdat, yield if necessary instead
 	 * of sleeping on the congestion queue
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
-	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
+	    !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
 		cond_resched();
+
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
 		if (ret < 0)
diff --git a/mm/compaction.c b/mm/compaction.c
index c85bd016754f..2ae7c7ea664e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -591,8 +591,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
 	list_for_each_entry(page, &cc->migratepages, lru)
 		count[!!page_is_file_cache(page)]++;
 
-	mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
-	mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON, count[0]);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, count[1]);
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
@@ -600,12 +600,12 @@ static bool too_many_isolated(struct zone *zone)
 {
 	unsigned long active, inactive, isolated;
 
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
-					zone_page_state(zone, NR_INACTIVE_ANON);
-	active = zone_page_state(zone, NR_ACTIVE_FILE) +
-					zone_page_state(zone, NR_ACTIVE_ANON);
-	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
-					zone_page_state(zone, NR_ISOLATED_ANON);
+	inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
+			node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
+	active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
+			node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
+	isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
+			node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
 
 	return isolated > (inactive + active) / 2;
 }
@@ -767,7 +767,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			}
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, isolate_mode) != 0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 59ca13bb13f9..709797510e10 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1995,7 +1995,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 static void release_pte_page(struct page *page)
 {
 	/* 0 stands for page_is_file_cache(page) == false */
-	dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+	dec_node_page_state(page, NR_ISOLATED_ANON + 0);
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -2090,7 +2090,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		/* 0 stands for page_is_file_cache(page) == false */
-		inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+		inc_node_page_state(page, NR_ISOLATED_ANON + 0);
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
@@ -3357,7 +3357,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(zone_lru_lock(zone));
-	lruvec = mem_cgroup_page_lruvec(head, zone);
+	lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
 
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
diff --git a/mm/internal.h b/mm/internal.h
index f9153e580d81..08c996d7b8fc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -84,7 +84,7 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
-extern bool zone_reclaimable(struct zone *zone);
+extern bool pgdat_reclaimable(struct pglist_data *pgdat);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c1acf0edd3b4..e9e35424c57e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -958,7 +958,7 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = &zone->lruvec;
+		lruvec = zone_lruvec(zone);
 		goto out;
 	}
 
@@ -970,8 +970,8 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 	 * we have to be prepared to initialize lruvec->zone here;
 	 * and if offlined then reonlined, we need to reinitialize it.
 	 */
-	if (unlikely(lruvec->zone != zone))
-		lruvec->zone = zone;
+	if (unlikely(lruvec->pgdat != zone->zone_pgdat))
+		lruvec->pgdat = zone->zone_pgdat;
 	return lruvec;
 }
 
@@ -984,14 +984,14 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
  * and putback protocol: the LRU lock must be held, and the page must
  * either be PageLRU() or the caller must have isolated/allocated it.
  */
-struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
+struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = &zone->lruvec;
+		lruvec = &pgdat->lruvec;
 		goto out;
 	}
 
@@ -1011,8 +1011,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
 	 * we have to be prepared to initialize lruvec->zone here;
 	 * and if offlined then reonlined, we need to reinitialize it.
 	 */
-	if (unlikely(lruvec->zone != zone))
-		lruvec->zone = zone;
+	if (unlikely(lruvec->pgdat != pgdat))
+		lruvec->pgdat = pgdat;
 	return lruvec;
 }
 
@@ -2092,7 +2092,7 @@ static void lock_page_lru(struct page *page, int *isolated)
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		*isolated = 1;
@@ -2107,7 +2107,7 @@ static void unlock_page_lru(struct page *page, int isolated)
 	if (isolated) {
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 67c30eb993f0..204709f9ae0f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1657,7 +1657,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	put_hwpoison_page(page);
 	if (!ret) {
 		LIST_HEAD(pagelist);
-		inc_zone_page_state(page, NR_ISOLATED_ANON +
+		inc_node_page_state(page, NR_ISOLATED_ANON +
 					page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
@@ -1665,7 +1665,7 @@ static int __soft_offline_page(struct page *page, int flags)
 		if (ret) {
 			if (!list_empty(&pagelist)) {
 				list_del(&page->lru);
-				dec_zone_page_state(page, NR_ISOLATED_ANON +
+				dec_node_page_state(page, NR_ISOLATED_ANON +
 						page_is_file_cache(page));
 				putback_lru_page(page);
 			}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 92402f8d17e8..b33fe895a35c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1524,7 +1524,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			put_page(page);
 			list_add_tail(&page->lru, &source);
 			move_pages--;
-			inc_zone_page_state(page, NR_ISOLATED_ANON +
+			inc_node_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
 
 		} else {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8c5fd08c253c..4871e4d66c44 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -963,7 +963,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
 		if (!isolate_lru_page(page)) {
 			list_add_tail(&page->lru, pagelist);
-			inc_zone_page_state(page, NR_ISOLATED_ANON +
+			inc_node_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
 		}
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 90cbf7c65cac..aea3e350f4fc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -92,7 +92,7 @@ void putback_movable_pages(struct list_head *l)
 			continue;
 		}
 		list_del(&page->lru);
-		dec_zone_page_state(page, NR_ISOLATED_ANON +
+		dec_node_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
 		if (unlikely(isolated_balloon_page(page)))
 			balloon_page_putback(page);
@@ -969,7 +969,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		 * restored.
 		 */
 		list_del(&page->lru);
-		dec_zone_page_state(page, NR_ISOLATED_ANON +
+		dec_node_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
 		/* Soft-offlined page shouldn't go through lru cache list */
 		if (reason == MR_MEMORY_FAILURE) {
@@ -1291,7 +1291,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		err = isolate_lru_page(page);
 		if (!err) {
 			list_add_tail(&page->lru, &pagelist);
-			inc_zone_page_state(page, NR_ISOLATED_ANON +
+			inc_node_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
 		}
 put_and_set:
@@ -1557,15 +1557,16 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 				   unsigned long nr_migrate_pages)
 {
 	int z;
+
+	if (!pgdat_reclaimable(pgdat))
+		return false;
+
 	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
 		struct zone *zone = pgdat->node_zones + z;
 
 		if (!populated_zone(zone))
 			continue;
 
-		if (!zone_reclaimable(zone))
-			continue;
-
 		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
 		if (!zone_watermark_ok(zone, 0,
 				       high_wmark_pages(zone) +
@@ -1659,7 +1660,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	}
 
 	page_lru = page_is_file_cache(page);
-	mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
+	mod_node_page_state(page_zone(page)->zone_pgdat, NR_ISOLATED_ANON + page_lru,
 				hpage_nr_pages(page));
 
 	/*
@@ -1717,7 +1718,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
-			dec_zone_page_state(page, NR_ISOLATED_ANON +
+			dec_node_page_state(page, NR_ISOLATED_ANON +
 					page_is_file_cache(page));
 			putback_lru_page(page);
 		}
@@ -1807,7 +1808,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		/* Retake the callers reference and putback on LRU */
 		get_page(page);
 		putback_lru_page(page);
-		mod_zone_page_state(page_zone(page),
+		mod_node_page_state(page_zone(page)->zone_pgdat,
 			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
 
 		goto out_unlock;
@@ -1860,7 +1861,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
 	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
 
-	mod_zone_page_state(page_zone(page),
+	mod_node_page_state(page_zone(page)->zone_pgdat,
 			NR_ISOLATED_ANON + page_lru,
 			-HPAGE_PMD_NR);
 	return isolated;
diff --git a/mm/mlock.c b/mm/mlock.c
index ce7dabd53e7e..40156665ccf0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -103,7 +103,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
+		lruvec = mem_cgroup_page_lruvec(page, page_zone(page)->zone_pgdat);
 		if (getpage)
 			get_page(page);
 		ClearPageLRU(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11ff8f758631..b0960ec94bc9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -285,8 +285,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
 	 */
 	nr_pages -= min(nr_pages, zone->totalreserve_pages);
 
-	nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
-	nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
+	nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
+	nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
 
 	return nr_pages;
 }
@@ -344,8 +344,8 @@ static unsigned long global_dirtyable_memory(void)
 	 */
 	x -= min(x, totalreserve_pages);
 
-	x += global_page_state(NR_INACTIVE_FILE);
-	x += global_page_state(NR_ACTIVE_FILE);
+	x += global_node_page_state(NR_INACTIVE_FILE);
+	x += global_node_page_state(NR_ACTIVE_FILE);
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 288015b5ee24..b03c7b5872bf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -801,9 +801,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	unsigned long nr_scanned;
 
 	spin_lock(&zone->lock);
-	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
 
 	while (to_free) {
 		struct page *page;
@@ -855,9 +855,9 @@ static void free_one_page(struct zone *zone,
 {
 	unsigned long nr_scanned;
 	spin_lock(&zone->lock);
-	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
 
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
@@ -3131,7 +3131,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		unsigned long available;
 		unsigned long reclaimable;
 
-		available = reclaimable = zone_reclaimable_pages(zone);
+		available = reclaimable = pgdat_reclaimable_pages(zone->zone_pgdat);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -3914,6 +3914,7 @@ void show_free_areas(unsigned int filter)
 	unsigned long free_pcp = 0;
 	int cpu;
 	struct zone *zone;
+	pg_data_t *pgdat;
 
 	for_each_populated_zone(zone) {
 		if (skip_free_areas_node(filter, zone_to_nid(zone)))
@@ -3929,13 +3930,13 @@ void show_free_areas(unsigned int filter)
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
-		global_page_state(NR_ACTIVE_ANON),
-		global_page_state(NR_INACTIVE_ANON),
-		global_page_state(NR_ISOLATED_ANON),
-		global_page_state(NR_ACTIVE_FILE),
-		global_page_state(NR_INACTIVE_FILE),
-		global_page_state(NR_ISOLATED_FILE),
-		global_page_state(NR_UNEVICTABLE),
+		global_node_page_state(NR_ACTIVE_ANON),
+		global_node_page_state(NR_INACTIVE_ANON),
+		global_node_page_state(NR_ISOLATED_ANON),
+		global_node_page_state(NR_ACTIVE_FILE),
+		global_node_page_state(NR_INACTIVE_FILE),
+		global_node_page_state(NR_ISOLATED_FILE),
+		global_node_page_state(NR_UNEVICTABLE),
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -3949,6 +3950,28 @@ void show_free_areas(unsigned int filter)
 		free_pcp,
 		global_page_state(NR_FREE_CMA_PAGES));
 
+	for_each_online_pgdat(pgdat) {
+		printk("Node %d"
+			" active_anon:%lukB"
+			" inactive_anon:%lukB"
+			" active_file:%lukB"
+			" inactive_file:%lukB"
+			" unevictable:%lukB"
+			" isolated(anon):%lukB"
+			" isolated(file):%lukB"
+			" all_unreclaimable? %s"
+			"\n",
+			pgdat->node_id,
+			K(node_page_state(pgdat, NR_ACTIVE_ANON)),
+			K(node_page_state(pgdat, NR_INACTIVE_ANON)),
+			K(node_page_state(pgdat, NR_ACTIVE_FILE)),
+			K(node_page_state(pgdat, NR_INACTIVE_FILE)),
+			K(node_page_state(pgdat, NR_UNEVICTABLE)),
+			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
+			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+			!pgdat_reclaimable(pgdat) ? "yes" : "no");
+	}
+
 	for_each_populated_zone(zone) {
 		int i;
 
@@ -3965,13 +3988,6 @@ void show_free_areas(unsigned int filter)
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
-			" active_anon:%lukB"
-			" inactive_anon:%lukB"
-			" active_file:%lukB"
-			" inactive_file:%lukB"
-			" unevictable:%lukB"
-			" isolated(anon):%lukB"
-			" isolated(file):%lukB"
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
@@ -3989,21 +4005,13 @@ void show_free_areas(unsigned int filter)
 			" local_pcp:%ukB"
 			" free_cma:%lukB"
 			" writeback_tmp:%lukB"
-			" pages_scanned:%lu"
-			" all_unreclaimable? %s"
+			" node_pages_scanned:%lu"
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
-			K(zone_page_state(zone, NR_ACTIVE_ANON)),
-			K(zone_page_state(zone, NR_INACTIVE_ANON)),
-			K(zone_page_state(zone, NR_ACTIVE_FILE)),
-			K(zone_page_state(zone, NR_INACTIVE_FILE)),
-			K(zone_page_state(zone, NR_UNEVICTABLE)),
-			K(zone_page_state(zone, NR_ISOLATED_ANON)),
-			K(zone_page_state(zone, NR_ISOLATED_FILE)),
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
@@ -4022,9 +4030,7 @@ void show_free_areas(unsigned int filter)
 			K(this_cpu_read(zone->pageset->pcp.count)),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
-			K(zone_page_state(zone, NR_PAGES_SCANNED)),
-			(!zone_reclaimable(zone) ? "yes" : "no")
-			);
+			K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
 			printk(" %ld", zone->lowmem_reserve[i]);
@@ -5579,7 +5585,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		/* For bootup, initialized properly in watermark setup */
 		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
 
-		lruvec_init(&zone->lruvec);
+		lruvec_init(zone_lruvec(zone));
 		if (!size)
 			continue;
 
@@ -6465,26 +6471,37 @@ void setup_per_zone_wmarks(void)
  *    1TB     101        10GB
  *   10TB     320        32GB
  */
-static void __meminit calculate_zone_inactive_ratio(struct zone *zone)
+static void __meminit calculate_node_inactive_ratio(struct pglist_data *pgdat)
 {
 	unsigned int gb, ratio;
+	int z;
+	unsigned long managed_pages = 0;
+
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		struct zone *zone = &pgdat->node_zones[z];
 
-	/* Zone size in gigabytes */
-	gb = zone->managed_pages >> (30 - PAGE_SHIFT);
+		if (!populated_zone(zone))
+			continue;
+
+		managed_pages += zone->managed_pages;
+	}
+
+	/* Node size in gigabytes */
+	gb = managed_pages >> (30 - PAGE_SHIFT);
 	if (gb)
 		ratio = int_sqrt(10 * gb);
 	else
 		ratio = 1;
 
-	zone->inactive_ratio = ratio;
+	pgdat->inactive_ratio = ratio;
 }
 
-static void __meminit setup_per_zone_inactive_ratio(void)
+static void __meminit setup_per_node_inactive_ratio(void)
 {
-	struct zone *zone;
+	struct pglist_data *pgdat;
 
-	for_each_zone(zone)
-		calculate_zone_inactive_ratio(zone);
+	for_each_online_pgdat(pgdat)
+		calculate_node_inactive_ratio(pgdat);
 }
 
 /*
@@ -6532,7 +6549,7 @@ int __meminit init_per_zone_wmark_min(void)
 	setup_per_zone_wmarks();
 	refresh_zone_stat_thresholds();
 	setup_per_zone_lowmem_reserve();
-	setup_per_zone_inactive_ratio();
+	setup_per_node_inactive_ratio();
 	return 0;
 }
 module_init(init_per_zone_wmark_min)
diff --git a/mm/swap.c b/mm/swap.c
index 4067911033e1..83d3fbd03316 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -60,7 +60,7 @@ static void __page_cache_release(struct page *page)
 		unsigned long flags;
 
 		spin_lock_irqsave(zone_lru_lock(zone), flags);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -191,7 +191,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(zone_lru_lock(zone), flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		(*move_fn)(page, lruvec, arg);
 	}
 	if (zone)
@@ -313,11 +313,11 @@ static bool need_activate_page_drain(int cpu)
 
 void activate_page(struct page *page)
 {
-	struct zone *zone = page_zone(page);
+	struct pglist_data *pgdat = page_zone(page)->zone_pgdat;
 
-	spin_lock_irq(zone_lru_lock(zone));
-	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
+	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
+	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
 
@@ -443,16 +443,16 @@ void lru_cache_add(struct page *page)
  */
 void add_page_to_unevictable_list(struct page *page)
 {
-	struct zone *zone = page_zone(page);
+	struct pglist_data *pgdat = page_zone(page)->zone_pgdat;
 	struct lruvec *lruvec;
 
-	spin_lock_irq(zone_lru_lock(zone));
-	lruvec = mem_cgroup_page_lruvec(page, zone);
+	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
 	ClearPageActive(page);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 }
 
 /**
@@ -710,7 +710,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct zone *zone = NULL;
+	struct pglist_data *pgdat = NULL;
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 	unsigned int uninitialized_var(lock_batch);
@@ -721,11 +721,11 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same zone. The lock is held only if zone != NULL.
+		 * same pgdat. The lock is held only if pgdat != NULL.
 		 */
-		if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(zone_lru_lock(zone), flags);
-			zone = NULL;
+		if (pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
+			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			pgdat = NULL;
 		}
 
 		page = compound_head(page);
@@ -733,27 +733,27 @@ void release_pages(struct page **pages, int nr, bool cold)
 			continue;
 
 		if (PageCompound(page)) {
-			if (zone) {
-				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
-				zone = NULL;
+			if (pgdat) {
+				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+				pgdat = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct zone *pagezone = page_zone(page);
+			struct pglist_data *page_pgdat = page_zone(page)->zone_pgdat;
 
-			if (pagezone != zone) {
-				if (zone)
-					spin_unlock_irqrestore(zone_lru_lock(zone),
+			if (page_pgdat != pgdat) {
+				if (pgdat)
+					spin_unlock_irqrestore(&pgdat->lru_lock,
 									flags);
 				lock_batch = 0;
-				zone = pagezone;
-				spin_lock_irqsave(zone_lru_lock(zone), flags);
+				pgdat = page_pgdat;
+				spin_lock_irqsave(&pgdat->lru_lock, flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, zone);
+			lruvec = mem_cgroup_page_lruvec(page, pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -764,8 +764,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (zone)
-		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
+	if (pgdat)
+		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_hot_cold_page_list(&pages_to_free, cold);
@@ -801,7 +801,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
 	VM_BUG_ON(NR_CPUS != 1 &&
-		  !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
+		  !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
 
 	if (!list)
 		SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13a8ca37ab42..760fdea19729 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -191,26 +191,26 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
 
-	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
-	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
-	     zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
+	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
+	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
+	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
 
 	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
-		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
-		      zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
+		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
+		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
+		      node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
 
 	return nr;
 }
 
-bool zone_reclaimable(struct zone *zone)
+bool pgdat_reclaimable(struct pglist_data *pgdat)
 {
-	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
-		zone_reclaimable_pages(zone) * 6;
+	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
+		pgdat_reclaimable_pages(pgdat) * 6;
 }
 
 unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -218,7 +218,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_lru_size(lruvec, lru);
 
-	return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
+	return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
 }
 
 /*
@@ -877,7 +877,7 @@ static void page_check_dirty_writeback(struct page *page,
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-				      struct zone *zone,
+				      struct pglist_data *pgdat,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
 				      unsigned long *ret_nr_dirty,
@@ -917,7 +917,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep;
 
 		VM_BUG_ON_PAGE(PageActive(page), page);
-		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		sc->nr_scanned++;
 
@@ -996,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			/* Case 1 above */
 			if (current_is_kswapd() &&
 			    PageReclaim(page) &&
-			    test_bit(ZONE_WRITEBACK, &zone->flags)) {
+			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
 				nr_immediate++;
 				goto keep_locked;
 
@@ -1086,7 +1085,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 */
 			if (page_is_file_cache(page) &&
 					(!current_is_kswapd() ||
-					 !test_bit(ZONE_DIRTY, &zone->flags))) {
+					 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -1260,11 +1259,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 		}
 	}
 
-	ret = shrink_page_list(&clean_pages, zone, &sc,
+	ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
 			TTU_UNMAP|TTU_IGNORE_ACCESS,
 			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
 	list_splice(&clean_pages, page_list);
-	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
 	return ret;
 }
 
@@ -1442,7 +1441,7 @@ int isolate_lru_page(struct page *page)
 		struct lruvec *lruvec;
 
 		spin_lock_irq(zone_lru_lock(zone));
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
 			get_page(page);
@@ -1462,7 +1461,7 @@ int isolate_lru_page(struct page *page)
  * the LRU list will go small and be scanned faster than necessary, leading to
  * unnecessary swapping, thrashing and OOM.
  */
-static int too_many_isolated(struct zone *zone, int file,
+static int too_many_isolated(struct pglist_data *pgdat, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
@@ -1474,11 +1473,11 @@ static int too_many_isolated(struct zone *zone, int file,
 		return 0;
 
 	if (file) {
-		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
+		isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
 	} else {
-		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
+		isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
 	}
 
 	/*
@@ -1496,7 +1495,7 @@ static noinline_for_stack void
 putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	LIST_HEAD(pages_to_free);
 
 	/*
@@ -1509,13 +1508,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(zone_lru_lock(zone));
+			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(zone_lru_lock(zone));
+			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		SetPageLRU(page);
 		lru = page_lru(page);
@@ -1532,10 +1531,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(zone_lru_lock(zone));
+				spin_unlock_irq(&pgdat->lru_lock);
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(zone_lru_lock(zone));
+				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		}
@@ -1579,10 +1578,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_immediate = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
+	while (unlikely(too_many_isolated(pgdat, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
@@ -1597,49 +1596,47 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, -nr_taken);
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 
 	if (global_reclaim(sc)) {
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
+		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
 		if (current_is_kswapd())
-			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
+			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
 		else
-			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
+			__count_vm_events(PGSCAN_DIRECT, nr_scanned);
 	}
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
 				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
 				&nr_writeback, &nr_immediate,
 				false);
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc)) {
 		if (current_is_kswapd())
-			__count_zone_vm_events(PGSTEAL_KSWAPD, zone,
-					       nr_reclaimed);
+			__count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
 		else
-			__count_zone_vm_events(PGSTEAL_DIRECT, zone,
-					       nr_reclaimed);
+			__count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
 	}
 
 	putback_inactive_pages(lruvec, &page_list);
 
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_hot_cold_page_list(&page_list, true);
@@ -1659,7 +1656,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 * are encountered in the nr_immediate check below.
 	 */
 	if (nr_writeback && nr_writeback == nr_taken)
-		set_bit(ZONE_WRITEBACK, &zone->flags);
+		set_bit(PGDAT_WRITEBACK, &pgdat->flags);
 
 	/*
 	 * Legacy memcg will stall in page writeback so avoid forcibly
@@ -1671,16 +1668,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		 * backed by a congested BDI and wait_iff_congested will stall.
 		 */
 		if (nr_dirty && nr_dirty == nr_congested)
-			set_bit(ZONE_CONGESTED, &zone->flags);
+			set_bit(PGDAT_CONGESTED, &pgdat->flags);
 
 		/*
 		 * If dirty pages are scanned that are not queued for IO, it
 		 * implies that flushers are not keeping up. In this case, flag
-		 * the zone ZONE_DIRTY and kswapd will start writing pages from
+		 * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
 		 * reclaim context.
 		 */
 		if (nr_unqueued_dirty == nr_taken)
-			set_bit(ZONE_DIRTY, &zone->flags);
+			set_bit(PGDAT_DIRTY, &pgdat->flags);
 
 		/*
 		 * If kswapd scans pages marked marked for immediate
@@ -1699,9 +1696,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 */
 	if (!sc->hibernation_mode && !current_is_kswapd() &&
 	    current_may_throttle())
-		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
 
-	trace_mm_vmscan_lru_shrink_inactive(zone, nr_scanned, nr_reclaimed,
+	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
+			nr_scanned, nr_reclaimed,
 			sc->priority, file);
 	return nr_reclaimed;
 }
@@ -1729,14 +1727,14 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 				     struct list_head *pages_to_free,
 				     enum lru_list lru)
 {
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long pgmoved = 0;
 	struct page *page;
 	int nr_pages;
 
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		SetPageLRU(page);
@@ -1752,15 +1750,15 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(zone_lru_lock(zone));
+				spin_unlock_irq(&pgdat->lru_lock);
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(zone_lru_lock(zone));
+				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, pages_to_free);
 		}
 	}
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
+	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, pgmoved);
 	if (!is_active_lru(lru))
 		__count_vm_events(PGDEACTIVATE, pgmoved);
 }
@@ -1781,7 +1779,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	unsigned long nr_rotated = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	lru_add_drain();
 
@@ -1790,19 +1788,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
 	if (global_reclaim(sc))
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
+		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
-	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
-	spin_unlock_irq(zone_lru_lock(zone));
+	__count_vm_events(PGREFILL, nr_scanned);
+	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, -nr_taken);
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -1847,7 +1845,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -1858,22 +1856,22 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(zone_lru_lock(zone));
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_hold);
 	free_hot_cold_page_list(&l_hold, true);
 }
 
 #ifdef CONFIG_SWAP
-static bool inactive_anon_is_low_global(struct zone *zone)
+static bool inactive_anon_is_low_global(struct pglist_data *pgdat)
 {
 	unsigned long active, inactive;
 
-	active = zone_page_state(zone, NR_ACTIVE_ANON);
-	inactive = zone_page_state(zone, NR_INACTIVE_ANON);
+	active = node_page_state(pgdat, NR_ACTIVE_ANON);
+	inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
 
-	return inactive * zone->inactive_ratio < active;
+	return inactive * pgdat->inactive_ratio < active;
 }
 
 /**
@@ -1895,7 +1893,7 @@ static bool inactive_anon_is_low(struct lruvec *lruvec)
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_inactive_anon_is_low(lruvec);
 
-	return inactive_anon_is_low_global(lruvec_zone(lruvec));
+	return inactive_anon_is_low_global(lruvec_pgdat(lruvec));
 }
 #else
 static inline bool inactive_anon_is_low(struct lruvec *lruvec)
@@ -1973,7 +1971,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
 	unsigned long anon, file;
@@ -1994,7 +1992,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * well.
 	 */
 	if (current_is_kswapd()) {
-		if (!zone_reclaimable(zone))
+		if (!pgdat_reclaimable(pgdat))
 			force_scan = true;
 		if (!mem_cgroup_online(memcg))
 			force_scan = true;
@@ -2040,14 +2038,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * anon pages.  Try to detect this based on file LRU size.
 	 */
 	if (global_reclaim(sc)) {
-		unsigned long zonefile;
-		unsigned long zonefree;
+		unsigned long pgdatfile;
+		unsigned long pgdatfree;
+		int z;
+		unsigned long total_high_wmark = 0;
+
+		pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			   node_page_state(pgdat, NR_INACTIVE_FILE);
 
-		zonefree = zone_page_state(zone, NR_FREE_PAGES);
-		zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
-			   zone_page_state(zone, NR_INACTIVE_FILE);
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+			if (!populated_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
 
-		if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
+		if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
 			scan_balance = SCAN_ANON;
 			goto out;
 		}
@@ -2094,7 +2102,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2115,7 +2123,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -2369,9 +2377,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
+	inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+		inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
@@ -2571,7 +2579,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 				continue;
 
 			if (sc->priority != DEF_PRIORITY &&
-			    !zone_reclaimable(zone))
+			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;	/* Let kswapd poll it */
 
 			/*
@@ -2709,7 +2717,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
 		if (!populated_zone(zone) ||
-		    zone_reclaimable_pages(zone) == 0)
+		    pgdat_reclaimable_pages(pgdat) == 0)
 			continue;
 
 		pfmemalloc_reserve += min_wmark_pages(zone);
@@ -3017,7 +3025,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 		 * DEF_PRIORITY. Effectively, it considers them balanced so
 		 * they must be considered balanced here as well!
 		 */
-		if (!zone_reclaimable(zone)) {
+		if (!pgdat_reclaimable(zone->zone_pgdat)) {
 			balanced_pages += zone->managed_pages;
 			continue;
 		}
@@ -3081,6 +3089,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	int testorder = sc->order;
 	unsigned long balance_gap;
 	bool lowmem_pressure;
+	struct pglist_data *pgdat = zone->zone_pgdat;
 
 	/* Reclaim above the high watermark. */
 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
@@ -3105,7 +3114,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
 
 	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
 
-	clear_bit(ZONE_WRITEBACK, &zone->flags);
+	/* TODO: ANOMALY */
+	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 
 	/*
 	 * If a zone reaches its high watermark, consider it to be no longer
@@ -3113,10 +3123,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * BDIs but as pressure is relieved, speculatively avoid congestion
 	 * waits.
 	 */
-	if (zone_reclaimable(zone) &&
+	if (pgdat_reclaimable(zone->zone_pgdat) &&
 	    zone_balanced(zone, testorder, false, 0, classzone_idx)) {
-		clear_bit(ZONE_CONGESTED, &zone->flags);
-		clear_bit(ZONE_DIRTY, &zone->flags);
+		clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+		clear_bit(PGDAT_DIRTY, &pgdat->flags);
 	}
 
 	return sc->nr_scanned >= sc->nr_to_reclaim;
@@ -3175,7 +3185,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				continue;
 
 			if (sc.priority != DEF_PRIORITY &&
-			    !zone_reclaimable(zone))
+			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;
 
 			/*
@@ -3202,9 +3212,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				/*
 				 * If balanced, clear the dirty and congested
 				 * flags
+				 *
+				 * TODO: ANOMALY
 				 */
-				clear_bit(ZONE_CONGESTED, &zone->flags);
-				clear_bit(ZONE_DIRTY, &zone->flags);
+				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 			}
 		}
 
@@ -3234,7 +3246,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				continue;
 
 			if (sc.priority != DEF_PRIORITY &&
-			    !zone_reclaimable(zone))
+			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;
 
 			sc.nr_scanned = 0;
@@ -3630,8 +3642,8 @@ int sysctl_min_slab_ratio = 5;
 static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
 {
 	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
-	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
-		zone_page_state(zone, NR_ACTIVE_FILE);
+	unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
+		node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
 
 	/*
 	 * It's possible for there to be more file mapped pages than
@@ -3734,7 +3746,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
 		return ZONE_RECLAIM_FULL;
 
-	if (!zone_reclaimable(zone))
+	if (!pgdat_reclaimable(zone->zone_pgdat))
 		return ZONE_RECLAIM_FULL;
 
 	/*
@@ -3813,7 +3825,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 			zone = pagezone;
 			spin_lock_irq(zone_lru_lock(zone));
 		}
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ea4e6f9f1094..19bd521e161b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -943,11 +943,6 @@ const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
 	"nr_alloc_batch",
-	"nr_inactive_anon",
-	"nr_active_anon",
-	"nr_inactive_file",
-	"nr_active_file",
-	"nr_unevictable",
 	"nr_mlock",
 	"nr_anon_pages",
 	"nr_mapped",
@@ -963,12 +958,9 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
 	"nr_writeback_temp",
-	"nr_isolated_anon",
-	"nr_isolated_file",
 	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
-	"nr_pages_scanned",
 
 #ifdef CONFIG_NUMA
 	"numa_hit",
@@ -984,6 +976,16 @@ const char * const vmstat_text[] = {
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 
+	/* Node-based counters */
+	"nr_inactive_anon",
+	"nr_active_anon",
+	"nr_inactive_file",
+	"nr_active_file",
+	"nr_unevictable",
+	"nr_isolated_anon",
+	"nr_isolated_file",
+	"nr_pages_scanned",
+
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
@@ -1005,11 +1007,11 @@ const char * const vmstat_text[] = {
 	"pgmajfault",
 	"pglazyfreed",
 
-	TEXTS_FOR_ZONES("pgrefill")
-	TEXTS_FOR_ZONES("pgsteal_kswapd")
-	TEXTS_FOR_ZONES("pgsteal_direct")
-	TEXTS_FOR_ZONES("pgscan_kswapd")
-	TEXTS_FOR_ZONES("pgscan_direct")
+	"pgrefill",
+	"pgsteal_kswapd",
+	"pgsteal_direct",
+	"pgscan_kswapd",
+	"pgscan_direct",
 	"pgscan_direct_throttle",
 
 #ifdef CONFIG_NUMA
@@ -1426,7 +1428,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n        scanned  %lu"
+		   "\n   node_scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu",
@@ -1434,13 +1436,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-		   zone_page_state(zone, NR_PAGES_SCANNED),
+		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone->managed_pages);
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
+		seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
 				zone_page_state(zone, i));
 
 	seq_printf(m,
@@ -1470,12 +1472,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 #endif
 	}
 	seq_printf(m,
-		   "\n  all_unreclaimable: %u"
-		   "\n  start_pfn:         %lu"
-		   "\n  inactive_ratio:    %u",
-		   !zone_reclaimable(zone),
+		   "\n  node_unreclaimable:  %u"
+		   "\n  start_pfn:           %lu"
+		   "\n  node_inactive_ratio: %u",
+		   !pgdat_reclaimable(zone->zone_pgdat),
 		   zone->zone_start_pfn,
-		   zone->inactive_ratio);
+		   zone->zone_pgdat->inactive_ratio);
 	seq_putc(m, '\n');
 }
 
@@ -1566,7 +1568,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
 {
 	unsigned long *l = arg;
 	unsigned long off = l - (unsigned long *)m->private;
-
 	seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
 	return 0;
 }
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 06/27] mm, vmscan: Begin reclaiming pages on a per-node basis
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (4 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 05/27] mm, vmscan: Move LRU lists to node Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 18:57   ` Johannes Weiner
  2016-02-23 15:04 ` [PATCH 07/27] mm, vmscan: Have kswapd only scan based on the highest requested zone Mel Gorman
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

This patch makes reclaim decisions on a per-node basis. A reclaimer knows
what zone is required by the allocation request and skips pages from
higher zones. In many cases this will be ok because it's a GFP_HIGHMEM
request of some description. On 64-bit, ZONE_DMA32 requests will cause
some problems but 32-bit devices on 64-bit platforms are increasingly
rare. Historically it would have been a major problem on 32-bit with big
Highmem:Lowmem ratios but such configurations are also now rare and even
where they exist, they are not encouraged. If it really becomes a problem,
it'll manifest as very low reclaim efficiencies.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 78 +++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 48 insertions(+), 30 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 760fdea19729..3e423993d374 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -84,6 +84,9 @@ struct scan_control {
 	/* Scan (total_size >> priority) pages at once */
 	int priority;
 
+	/* The highest zone to isolate pages for reclaim from */
+	enum zone_type reclaim_idx;
+
 	unsigned int may_writepage:1;
 
 	/* Can mapped pages be reclaimed? */
@@ -1369,6 +1372,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	struct list_head *src = &lruvec->lists[lru];
 	unsigned long nr_taken = 0;
 	unsigned long scan;
+	LIST_HEAD(pages_skipped);
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
 					!list_empty(src); scan++) {
@@ -1380,6 +1384,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 
+		if (page_zonenum(page) > sc->reclaim_idx) {
+			list_move(&page->lru, &pages_skipped);
+			continue;
+		}
+
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			nr_pages = hpage_nr_pages(page);
@@ -1398,6 +1407,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		}
 	}
 
+	/*
+	 * Splice any skipped pages to the start of the LRU list. Note that
+	 * this disrupts the LRU order when reclaiming for lower zones but
+	 * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
+	 * scanning would soon rescan the same pages to skip and put the
+	 * system at risk of premature OOM.
+	 */
+	if (!list_empty(&pages_skipped))
+		list_splice(&pages_skipped, src);
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
@@ -1560,7 +1578,7 @@ static int current_may_throttle(void)
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
+ * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
@@ -2394,12 +2412,14 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static bool shrink_zone(struct zone *zone, struct scan_control *sc,
-			bool is_classzone)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
+			enum zone_type reclaim_idx,
+			enum zone_type classzone_idx)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	struct zone *zone = &pgdat->node_zones[classzone_idx];
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2428,10 +2448,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			reclaimed = sc->nr_reclaimed;
 			scanned = sc->nr_scanned;
 
+			sc->reclaim_idx = reclaim_idx;
 			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
 			zone_lru_pages += lru_pages;
 
-			if (memcg && is_classzone)
+			if (!global_reclaim(sc) && reclaim_idx == classzone_idx)
 				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
@@ -2462,7 +2483,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 		 * Shrink the slab caches in the same proportion that
 		 * the eligible LRU pages were scanned.
 		 */
-		if (global_reclaim(sc) && is_classzone)
+		if (global_reclaim(sc) && reclaim_idx == classzone_idx)
 			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
 				    sc->nr_scanned - nr_scanned,
 				    zone_lru_pages);
@@ -2539,14 +2560,14 @@ static inline bool compaction_ready(struct zone *zone, int order)
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
+		enum zone_type reclaim_idx, enum zone_type classzone_idx)
 {
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
-	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2558,16 +2579,12 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
-					requested_highidx, sc->nodemask) {
-		enum zone_type classzone_idx;
-
-		if (!populated_zone(zone))
-			continue;
-
-		classzone_idx = requested_highidx;
-		while (!populated_zone(zone->zone_pgdat->node_zones +
-							classzone_idx))
+					classzone_idx, sc->nodemask) {
+		if (!populated_zone(zone)) {
+			reclaim_idx--;
 			classzone_idx--;
+			continue;
+		}
 
 		/*
 		 * Take care memory controller reclaiming has small influence
@@ -2593,7 +2610,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			 */
 			if (IS_ENABLED(CONFIG_COMPACTION) &&
 			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-			    zonelist_zone_idx(z) <= requested_highidx &&
+			    zonelist_zone_idx(z) <= classzone_idx &&
 			    compaction_ready(zone, sc->order)) {
 				sc->compaction_ready = true;
 				continue;
@@ -2611,10 +2628,10 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 						&nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
-			/* need some check for avoid more shrink_zone() */
+			/* need some check for avoid more shrink_node() */
 		}
 
-		shrink_zone(zone, sc, zone_idx(zone));
+		shrink_node(zone->zone_pgdat, sc, reclaim_idx, classzone_idx);
 	}
 
 	/*
@@ -2646,6 +2663,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	int initial_priority = sc->priority;
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
+	enum zone_type classzone_idx = gfp_zone(sc->gfp_mask);
 retry:
 	delayacct_freepages_start();
 
@@ -2656,7 +2674,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		shrink_zones(zonelist, sc);
+		shrink_zones(zonelist, sc, classzone_idx, classzone_idx);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -3112,7 +3130,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 						balance_gap, classzone_idx))
 		return true;
 
-	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+	shrink_node(zone->zone_pgdat, sc, zone_idx(zone), classzone_idx);
 
 	/* TODO: ANOMALY */
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
@@ -3161,6 +3179,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.order = order,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
@@ -3231,15 +3250,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.may_writepage = 1;
 
 		/*
-		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
+		 * Continue scanning in the highmem->dma direction stopping at
+		 * the last zone which needs scanning. This may reclaim lowmem
+		 * pages that are not necessary for zone balancing but it
+		 * preserves LRU ordering. It is assumed that the bulk of
+		 * allocation requests can use arbitrary zones with the
+		 * possible exception of big highmem:lowmem configurations.
 		 */
-		for (i = 0; i <= end_zone; i++) {
+		for (i = end_zone; i >= end_zone; i--) {
 			struct zone *zone = pgdat->node_zones + i;
 
 			if (!populated_zone(zone))
@@ -3717,7 +3735,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_zone(zone, &sc, true);
+			shrink_node(zone->zone_pgdat, &sc, zone_idx(zone), zone_idx(zone));
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 07/27] mm, vmscan: Have kswapd only scan based on the highest requested zone
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (5 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 06/27] mm, vmscan: Begin reclaiming pages on a per-node basis Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-25 22:17   ` Johannes Weiner
  2016-02-23 15:04 ` [PATCH 08/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

kswapd checks all eligible zones to see if they need balancing even if it was
woken for a lower zone. This made sense when we reclaimed on a per-zone basis
because we wanted to shrink zones fairly so avoid age-inversion problems.
Ideally this is completely unnecessary when reclaiming on a per-node basis.
In theory, there may still be anomalies when all requests are for lower
zones and very old pages are preserved in higher zones but this should be
the exceptional case.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3e423993d374..1decaa31c4c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3193,11 +3193,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		sc.nr_reclaimed = 0;
 
-		/*
-		 * Scan in the highmem->dma direction for the highest
-		 * zone which needs scanning
-		 */
-		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+		/* Scan from the highest requested zone to dma */
+		for (i = classzone_idx; i >= 0; i--) {
 			struct zone *zone = pgdat->node_zones + i;
 
 			if (!populated_zone(zone))
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 08/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (6 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 07/27] mm, vmscan: Have kswapd only scan based on the highest requested zone Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-28 16:16   ` Johannes Weiner
  2016-03-03 13:46   ` Vlastimil Babka
  2016-02-23 15:04 ` [PATCH 09/27] mm, vmscan: Simplify the logic deciding whether kswapd sleeps Mel Gorman
                   ` (20 subsequent siblings)
  28 siblings, 2 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
patch gets rid of many of the node-based versus zone-based decisions.

o A node is considered balanced when any eligible lower zone is balanced.
  This eliminates one class of age-inversion problem because we avoid
  reclaiming a newer page just because it's in the wrong zone
o pgdat_balanced disappears because we now only care about one zone being
  balanced.
o Some anomalies related to writeback and congestion tracking being based on
  zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
  that the page allocator uses.
o Most importantly of all, reclaim from node 0 with multiple zones will
  have similar aging and reclaiming characteristics as every
  other node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 285 +++++++++++++++++++++---------------------------------------
 1 file changed, 97 insertions(+), 188 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1decaa31c4c6..57114c130df1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2963,7 +2963,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 }
 #endif
 
-static void age_active_anon(struct zone *zone, struct scan_control *sc)
+static void age_active_anon(struct pglist_data *pgdat,
+				struct zone *zone, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
@@ -2982,85 +2983,15 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, bool highorder,
+static bool zone_balanced(struct zone *zone, int order,
 			unsigned long balance_gap, int classzone_idx)
 {
 	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	/*
-	 * When checking from pgdat_balanced(), kswapd should stop and sleep
-	 * when it reaches the high order-0 watermark and let kcompactd take
-	 * over. Other callers such as wakeup_kswapd() want to determine the
-	 * true high-order watermark.
-	 */
-	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
-		mark += (1UL << order);
-		order = 0;
-	}
-
 	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
- * pgdat_balanced() is used when checking if a node is balanced.
- *
- * For order-0, all zones must be balanced!
- *
- * For high-order allocations only zones that meet watermarks and are in a
- * zone allowed by the callers classzone_idx are added to balanced_pages. The
- * total of balanced pages must be at least 25% of the zones allowed by
- * classzone_idx for the node to be considered balanced. Forcing all zones to
- * be balanced for high orders can cause excessive reclaim when there are
- * imbalanced zones.
- * The choice of 25% is due to
- *   o a 16M DMA zone that is balanced will not balance a zone on any
- *     reasonable sized machine
- *   o On all other machines, the top zone must be at least a reasonable
- *     percentage of the middle zones. For example, on 32-bit x86, highmem
- *     would need to be at least 256M for it to be balance a whole node.
- *     Similarly, on x86-64 the Normal zone would need to be at least 1G
- *     to balance a node on its own. These seemed like reasonable ratios.
- */
-static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
-{
-	unsigned long managed_pages = 0;
-	unsigned long balanced_pages = 0;
-	int i;
-
-	/* Check the watermark levels */
-	for (i = 0; i <= classzone_idx; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		if (!populated_zone(zone))
-			continue;
-
-		managed_pages += zone->managed_pages;
-
-		/*
-		 * A special case here:
-		 *
-		 * balance_pgdat() skips over all_unreclaimable after
-		 * DEF_PRIORITY. Effectively, it considers them balanced so
-		 * they must be considered balanced here as well!
-		 */
-		if (!pgdat_reclaimable(zone->zone_pgdat)) {
-			balanced_pages += zone->managed_pages;
-			continue;
-		}
-
-		if (zone_balanced(zone, order, false, 0, i))
-			balanced_pages += zone->managed_pages;
-		else if (!order)
-			return false;
-	}
-
-	if (order)
-		return balanced_pages >= (managed_pages >> 2);
-	else
-		return true;
-}
-
-/*
  * Prepare kswapd for sleeping. This verifies that there are no processes
  * waiting in throttle_direct_reclaim() and that watermarks have been met.
  *
@@ -3069,6 +3000,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
 {
+	int i;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return false;
@@ -3089,97 +3022,84 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
-	return pgdat_balanced(pgdat, order, classzone_idx);
+	for (i = 0; i <= classzone_idx; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_balanced(zone, order, 0, classzone_idx))
+			return true;
+	}
+
+	return false;
 }
 
 /*
- * kswapd shrinks the zone by the number of pages required to reach
- * the high watermark.
+ * kswapd shrinks a node of pages that are at or below the highest usable
+ * zone that is currently unbalanced.
  *
  * Returns true if kswapd scanned at least the requested number of pages to
  * reclaim or if the lack of progress was due to pages under writeback.
  * This is used to determine if the scanning priority needs to be raised.
  */
-static bool kswapd_shrink_zone(struct zone *zone,
+static bool kswapd_shrink_node(pg_data_t *pgdat,
 			       int classzone_idx,
 			       struct scan_control *sc)
 {
-	int testorder = sc->order;
-	unsigned long balance_gap;
-	bool lowmem_pressure;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct zone *zone;
+	unsigned long nr_to_reclaim = 0;
+	int z;
 
-	/* Reclaim above the high watermark. */
-	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+	/* Reclaim a number of pages proportional to the number of zones */
+	for (z = 0; z <= classzone_idx; z++) {
+		zone = pgdat->node_zones + z;
+		if (!populated_zone(zone))
+			continue;
 
-	/*
-	 * We put equal pressure on every zone, unless one zone has way too
-	 * many pages free already. The "too many pages" is defined as the
-	 * high wmark plus a "gap" where the gap is either the low
-	 * watermark or 1% of the zone, whichever is smaller.
-	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
+		nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
+	}
 
 	/*
-	 * If there is no low memory pressure or the zone is balanced then no
-	 * reclaim is necessary
+	 * Historically care was taken to put equal pressure on all zones but
+	 * now pressure is applied based on node LRU order.
 	 */
-	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, testorder, false,
-						balance_gap, classzone_idx))
-		return true;
-
-	shrink_node(zone->zone_pgdat, sc, zone_idx(zone), classzone_idx);
-
-	/* TODO: ANOMALY */
-	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+	shrink_node(pgdat, sc, classzone_idx, classzone_idx);
 
 	/*
-	 * If a zone reaches its high watermark, consider it to be no longer
-	 * congested. It's possible there are dirty pages backed by congested
-	 * BDIs but as pressure is relieved, speculatively avoid congestion
-	 * waits.
+	 * Fragmentation may mean that the system cannot be rebalanced for
+	 * high-order allocations. If twice the allocation size has been
+	 * reclaimed then recheck watermarks only at order-0 to prevent
+	 * excessive reclaim. Assume that a process requested a high-order
+	 * can direct reclaim/compact.
 	 */
-	if (pgdat_reclaimable(zone->zone_pgdat) &&
-	    zone_balanced(zone, testorder, false, 0, classzone_idx)) {
-		clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-		clear_bit(PGDAT_DIRTY, &pgdat->flags);
-	}
+	if (sc->order && sc->nr_reclaimed >= 2UL << sc->order)
+		sc->order = 0;
 
 	return sc->nr_scanned >= sc->nr_to_reclaim;
 }
 
 /*
- * For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at high_wmark_pages(zone).
+ * For kswapd, balance_pgdat() will reclaim pages across a node from zones
+ * that are eligible for use by the caller until at least one zone is
+ * balanced.
  *
  * Returns the highest zone idx kswapd was reclaiming at
  *
- * There is special handling here for zones which are full of pinned pages.
- * This can happen if the pages are all mlocked, or if they are all used by
- * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
- * What we do is to detect the case where all pages in the zone have been
- * scanned twice and there has been zero successful reclaim.  Mark the zone as
- * dead and from now on, only perform a short scan.  Basically we're polling
- * the zone for when the problem goes away.
- *
  * kswapd scans the zones in the highmem->normal->dma direction.  It skips
  * zones which have free_pages > high_wmark_pages(zone), but once a zone is
- * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
- * lower zones regardless of the number of free pages in the lower zones. This
- * interoperates with the page allocator fallback scheme to ensure that aging
- * of pages is balanced across the zones.
+ * found to have free_pages <= high_wmark_pages(zone), any page is that zone
+ * or lower is eligible for reclaim until at least one usable zone is
+ * balanced.
  */
 static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
-	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
-		.reclaim_idx = MAX_NR_ZONES - 1,
 		.order = order,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
@@ -3195,21 +3115,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		/* Scan from the highest requested zone to dma */
 		for (i = classzone_idx; i >= 0; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
+			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			/*
-			 * Do some background aging of the anon list, to give
-			 * pages a chance to be referenced before reclaiming.
-			 */
-			age_active_anon(zone, &sc);
-
 			/*
 			 * If the number of buffer_heads in the machine
 			 * exceeds the maximum allowed level and this node
@@ -3217,19 +3126,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			 * it to relieve lowmem pressure.
 			 */
 			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				end_zone = i;
+				classzone_idx = i;
 				break;
 			}
 
-			if (!zone_balanced(zone, order, true, 0, 0)) {
-				end_zone = i;
+			if (!zone_balanced(zone, order, 0, 0)) {
+				classzone_idx = i;
 				break;
 			} else {
 				/*
-				 * If balanced, clear the dirty and congested
-				 * flags
-				 *
-				 * TODO: ANOMALY
+				 * If any eligible zone is balanced then the
+				 * node is not considered congested or dirty.
 				 */
 				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
 				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
@@ -3240,50 +3147,35 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			goto out;
 
 		/*
+		 * Do some background aging of the anon list, to give
+		 * pages a chance to be referenced before reclaiming. All
+		 * pages are rotated regardless of classzone as this is
+		 * about consistent aging.
+		 */
+		sc.reclaim_idx = MAX_NR_ZONES - 1;
+		age_active_anon(pgdat, &pgdat->node_zones[sc.reclaim_idx], &sc);
+
+		/*
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
 		 */
-		if (sc.priority < DEF_PRIORITY - 2)
+		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
 			sc.may_writepage = 1;
 
+		/* Call soft limit reclaim before calling shrink_node. */
+		sc.nr_scanned = 0;
+		nr_soft_scanned = 0;
+		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, order,
+						sc.gfp_mask, &nr_soft_scanned);
+		sc.nr_reclaimed += nr_soft_reclaimed;
+
 		/*
-		 * Continue scanning in the highmem->dma direction stopping at
-		 * the last zone which needs scanning. This may reclaim lowmem
-		 * pages that are not necessary for zone balancing but it
-		 * preserves LRU ordering. It is assumed that the bulk of
-		 * allocation requests can use arbitrary zones with the
-		 * possible exception of big highmem:lowmem configurations.
+		 * There should be no need to raise the scanning priority if
+		 * enough pages are already being scanned that that high
+		 * watermark would be met at 100% efficiency.
 		 */
-		for (i = end_zone; i >= end_zone; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			if (!populated_zone(zone))
-				continue;
-
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			sc.nr_scanned = 0;
-
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-
-			/*
-			 * There should be no need to raise the scanning
-			 * priority if enough pages are already being scanned
-			 * that that high watermark would be met at 100%
-			 * efficiency.
-			 */
-			if (kswapd_shrink_zone(zone, end_zone, &sc))
-				raise_priority = false;
-		}
+		if (kswapd_shrink_node(pgdat, classzone_idx, &sc))
+			raise_priority = false;
 
 		/*
 		 * If the low watermark is met there is no need for processes
@@ -3299,20 +3191,37 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			break;
 
 		/*
+		 * Stop reclaiming if any eligible zone is balanced and clear
+		 * node writeback or congested.
+		 */
+		for (i = 0; i <= classzone_idx; i++) {
+			zone = pgdat->node_zones + i;
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
+				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+				clear_bit(PGDAT_DIRTY, &pgdat->flags);
+				goto out;
+			}
+		}
+
+		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
 		if (raise_priority || !sc.nr_reclaimed)
 			sc.priority--;
-	} while (sc.priority >= 1 &&
-			!pgdat_balanced(pgdat, order, classzone_idx));
+	} while (sc.priority >= 1);
 
 out:
 	/*
-	 * Return the highest zone idx we were reclaiming at so
-	 * prepare_kswapd_sleep() makes the same decisions as here.
+	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
+	 * makes a decision on the order we were last reclaiming at. However,
+	 * if another caller entered the allocator slow path while kswapd
+	 * was awake, order will remain at the higher level
 	 */
-	return end_zone;
+	return order;
 }
 
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
@@ -3500,7 +3409,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, true, 0, 0))
+	if (zone_balanced(zone, order, 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 09/27] mm, vmscan: Simplify the logic deciding whether kswapd sleeps
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (7 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 08/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-28 16:16   ` Johannes Weiner
  2016-02-23 15:04 ` [PATCH 10/27] mm, vmscan: By default have direct reclaim only shrink once per node Mel Gorman
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

kswapd goes through some complex steps trying to figure out if it
should stay awake based on the classzone_idx and the requested order.
It is unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat().  What matters most of all is whether a larger order has
been requsted and whether kswapd successfully reclaimed at the previous
order. This patch irons out the logic to check just that and the end result
is less headache inducing.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  5 ++--
 mm/compaction.c        |  4 +--
 mm/memory_hotplug.c    |  5 ++--
 mm/page_alloc.c        |  2 +-
 mm/vmscan.c            | 80 +++++++++++++++++++++++++-------------------------
 5 files changed, 49 insertions(+), 47 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 93151a9a4f56..8f164ce06627 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -667,8 +667,9 @@ typedef struct pglist_data {
 	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
-	int kswapd_max_order;
-	enum zone_type classzone_idx;
+	int kswapd_order;
+	enum zone_type kswapd_classzone_idx;
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
diff --git a/mm/compaction.c b/mm/compaction.c
index 2ae7c7ea664e..a1274a1f5e66 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1828,8 +1828,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 	 */
 	if (pgdat->kcompactd_max_order <= cc.order)
 		pgdat->kcompactd_max_order = 0;
-	if (pgdat->classzone_idx >= cc.classzone_idx)
-		pgdat->classzone_idx = pgdat->nr_zones - 1;
+	if (pgdat->kswapd_classzone_idx >= cc.classzone_idx)
+		pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 }
 
 void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b33fe895a35c..1cb26f8bbda1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1147,9 +1147,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 
 		arch_refresh_nodedata(nid, pgdat);
 	} else {
-		/* Reset the nr_zones and classzone_idx to 0 before reuse */
+		/* Reset the nr_zones, order and classzone_idx before reuse */
 		pgdat->nr_zones = 0;
-		pgdat->classzone_idx = 0;
+		pgdat->kswapd_order = 0;
+		pgdat->kswapd_classzone_idx = -1;
 	}
 
 	/* we can use NODE_DATA(nid) from here */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b03c7b5872bf..34faebad3972 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5653,7 +5653,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	unsigned long end_pfn = 0;
 
 	/* pg_data_t should be reset to zero when it's allocated */
-	WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
+	WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
 
 	reset_deferred_meminit(pgdat);
 	pgdat->node_id = nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 57114c130df1..0417f20aead7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2750,7 +2750,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
 	/* kswapd must be awake if processes are being throttled */
 	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
-		pgdat->classzone_idx = min(pgdat->classzone_idx,
+		pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
 						(enum zone_type)ZONE_NORMAL);
 		wake_up_interruptible(&pgdat->kswapd_wait);
 	}
@@ -3235,6 +3235,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
 
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
+	/* If kswapd has not been woken recently, then full sleep */
+	if (classzone_idx == -1) {
+		classzone_idx = balanced_classzone_idx = MAX_NR_ZONES - 1;
+		goto full_sleep;
+	}
+
 	/* Try to sleep for a short interval */
 	if (prepare_kswapd_sleep(pgdat, order, remaining,
 						balanced_classzone_idx)) {
@@ -3243,6 +3249,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 	}
 
+full_sleep:
 	/*
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
@@ -3303,9 +3310,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
  */
 static int kswapd(void *p)
 {
-	unsigned long order, new_order;
-	int classzone_idx, new_classzone_idx;
-	int balanced_classzone_idx;
+	unsigned int order, classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 
@@ -3335,38 +3340,25 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
-	order = new_order = 0;
-	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
-	balanced_classzone_idx = classzone_idx;
+	pgdat->kswapd_order = order = 0;
+	pgdat->kswapd_classzone_idx = classzone_idx = -1;
 	for ( ; ; ) {
 		bool ret;
 
+kswapd_try_sleep:
+		kswapd_try_to_sleep(pgdat, order, classzone_idx, classzone_idx);
+
 		/*
-		 * While we were reclaiming, there might have been another
-		 * wakeup, so check the values.
+		 * Read the new order and classzone_idx which may be -1 if
+		 * kswapd_try_to_sleep() woke up after a short timeout instead
+		 * of being woken by the page allocator.
 		 */
-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order =  0;
-		pgdat->classzone_idx = pgdat->nr_zones - 1;
-
-		if (order < new_order || classzone_idx > new_classzone_idx) {
-			/*
-			 * Don't sleep if someone wants a larger 'order'
-			 * allocation or has tigher zone constraints
-			 */
-			order = new_order;
-			classzone_idx = new_classzone_idx;
-		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx,
-						balanced_classzone_idx);
-			order = pgdat->kswapd_max_order;
-			classzone_idx = pgdat->classzone_idx;
-			new_order = order;
-			new_classzone_idx = classzone_idx;
-			pgdat->kswapd_max_order = 0;
-			pgdat->classzone_idx = pgdat->nr_zones - 1;
-		}
+		order = pgdat->kswapd_order;
+		classzone_idx = pgdat->kswapd_classzone_idx;
+		if (classzone_idx == -1)
+			classzone_idx = MAX_NR_ZONES - 1;
+		pgdat->kswapd_order = 0;
+		pgdat->kswapd_classzone_idx = -1;
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -3376,11 +3368,19 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret) {
-			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balanced_classzone_idx = balance_pgdat(pgdat, order,
-								classzone_idx);
-		}
+		if (ret)
+			continue;
+
+		/*
+		 * Try reclaim the requested order but if that fails
+		 * then try sleeping on the basis of the order reclaimed.
+		 */
+		trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
+		if (balance_pgdat(pgdat, order, classzone_idx) < order)
+			goto kswapd_try_sleep;
+
+		order = pgdat->kswapd_order;
+		classzone_idx = pgdat->kswapd_classzone_idx;
 	}
 
 	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
@@ -3403,10 +3403,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
 		return;
 	pgdat = zone->zone_pgdat;
-	if (pgdat->kswapd_max_order < order) {
-		pgdat->kswapd_max_order = order;
-		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
-	}
+	if (pgdat->kswapd_classzone_idx == -1)
+		pgdat->kswapd_classzone_idx = classzone_idx;
+	pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 	if (zone_balanced(zone, order, 0, 0))
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 10/27] mm, vmscan: By default have direct reclaim only shrink once per node
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (8 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 09/27] mm, vmscan: Simplify the logic deciding whether kswapd sleeps Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-28 16:17   ` Johannes Weiner
  2016-02-23 15:04 ` [PATCH 11/27] mm, vmscan: Clear congestion, dirty and need for compaction on a per-node basis Mel Gorman
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

Direct reclaim iterates over all zones in the zonelist and shrinking them
but this is in conflict with node-based reclaim. In the default case,
only shrink once per node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0417f20aead7..fe004cb11b71 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2549,14 +2549,6 @@ static inline bool compaction_ready(struct zone *zone, int order)
  * try to reclaim pages from zones which will satisfy the caller's allocation
  * request.
  *
- * We reclaim from a zone even if that zone is over high_wmark_pages(zone).
- * Because:
- * a) The caller may be trying to free *extra* pages to satisfy a higher-order
- *    allocation or
- * b) The target zone may be at high_wmark_pages(zone) but the lower zones
- *    must go *over* high_wmark_pages(zone) to satisfy the `incremental min'
- *    zone defense algorithm.
- *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
@@ -2568,6 +2560,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
+	pg_data_t *last_pgdat = NULL;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2580,11 +2573,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					classzone_idx, sc->nodemask) {
-		if (!populated_zone(zone)) {
-			reclaim_idx--;
-			classzone_idx--;
+		BUG_ON(!populated_zone(zone));
+
+		/*
+		 * Shrink each node in the zonelist once. If the zonelist is
+		 * ordered by zone (not the default) then a node may be
+		 * shrunk multiple times but in that case the user prefers
+		 * lower zones being preserved
+		 */
+		if (zone->zone_pgdat == last_pgdat)
 			continue;
-		}
+		last_pgdat = zone->zone_pgdat;
+		reclaim_idx = zone_idx(zone);
 
 		/*
 		 * Take care memory controller reclaiming has small influence
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 11/27] mm, vmscan: Clear congestion, dirty and need for compaction on a per-node basis
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (9 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 10/27] mm, vmscan: By default have direct reclaim only shrink once per node Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 15:04 ` [PATCH 12/27] mm: vmscan: Do not reclaim from kswapd if there is any eligible zone Mel Gorman
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

Congested and dirty tracking of a node and whether reclaim should stall
is still based on zone activity. This patch considers whether the kernel
should stall based on node-based reclaim activity.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe004cb11b71..1f5d3829d35e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2988,7 +2988,17 @@ static bool zone_balanced(struct zone *zone, int order,
 {
 	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
+	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
+		return false;
+
+	/*
+	 * If any eligible zone is balanced then the node is not considered
+	 * to be congested or dirty
+	 */
+	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+
+	return true;
 }
 
 /*
@@ -3133,13 +3143,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			if (!zone_balanced(zone, order, 0, 0)) {
 				classzone_idx = i;
 				break;
-			} else {
-				/*
-				 * If any eligible zone is balanced then the
-				 * node is not considered congested or dirty.
-				 */
-				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
-				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 			}
 		}
 
@@ -3199,11 +3202,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
-				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-				clear_bit(PGDAT_DIRTY, &pgdat->flags);
+			if (zone_balanced(zone, sc.order, 0, classzone_idx))
 				goto out;
-			}
 		}
 
 		/*
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 12/27] mm: vmscan: Do not reclaim from kswapd if there is any eligible zone
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (10 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 11/27] mm, vmscan: Clear congestion, dirty and need for compaction on a per-node basis Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 15:04 ` [PATCH 13/27] mm, vmscan: Make shrink_node decisions more node-centric Mel Gorman
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

kswapd scans from highest to lowest for a zone that requires balancing.
This was necessary when reclaim was per-zone to fairly age pages on
lower zones. Now that we are reclaiming on a per-node basis, any eligible
zone can be used and pages will still be aged fairly. This patch avoids
reclaiming excessively unless buffer_heads are over the limit and it's
necessary to reclaim from a higher zone than requested by the waker of
kswapd to relieve low memory pressure.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1f5d3829d35e..39ad2ab76a2b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3123,24 +3123,30 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		sc.nr_reclaimed = 0;
 
-		/* Scan from the highest requested zone to dma */
+		/*
+		 * If the number of buffer_heads in the machine exceeds the
+		 * maximum allowed level and this node has a highmem zone,
+		 * force kswapd to reclaim from it to relieve lowmem pressure.
+		 */
+		if (buffer_heads_over_limit) {
+			for (i = MAX_NR_ZONES - 1; i >= 0; i++) {
+				zone = pgdat->node_zones + i;
+				if (!populated_zone(zone))
+					continue;
+
+				if (is_highmem_idx(i))
+					classzone_idx = i;
+				break;
+			}
+		}
+
+		/* Only reclaim if there are no eligible zones */
 		for (i = classzone_idx; i >= 0; i--) {
 			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			/*
-			 * If the number of buffer_heads in the machine
-			 * exceeds the maximum allowed level and this node
-			 * has a highmem zone, force kswapd to reclaim from
-			 * it to relieve lowmem pressure.
-			 */
-			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				classzone_idx = i;
-				break;
-			}
-
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order, 0, classzone_idx)) {
 				classzone_idx = i;
 				break;
 			}
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 13/27] mm, vmscan: Make shrink_node decisions more node-centric
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (11 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 12/27] mm: vmscan: Do not reclaim from kswapd if there is any eligible zone Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 15:04 ` [PATCH 14/27] mm, memcg: Move memcg limit enforcement from zones to nodes Mel Gorman
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

Earlier patches focused on having direct reclaim and kswapd use data that
is node-centric for reclaiming but shrink_node() itself still uses too much
zone information. This patch removes unnecessary zone-based information
with the most important decision being whether to continue reclaim or
not. Some memcg APIs are adjusted as a result even though memcg itself
still uses some zone information.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/memcontrol.h |  9 +++----
 include/linux/mmzone.h     |  4 ++--
 include/linux/swap.h       |  2 +-
 mm/memcontrol.c            | 17 +++++++-------
 mm/page_alloc.c            |  2 +-
 mm/vmscan.c                | 58 ++++++++++++++++++++++++++--------------------
 mm/workingset.c            |  6 ++---
 7 files changed, 54 insertions(+), 44 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f5626a5c88c2..51f4441a69fc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -306,7 +306,8 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_lruvec(struct pglist_data *, struct zone *zone,
+				 struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
@@ -592,10 +593,10 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 {
 }
 
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-						    struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+				struct zone *zone, struct mem_cgroup *memcg)
 {
-	return zone_lruvec(zone);
+	return node_lruvec(pgdat);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f164ce06627..ffe8a6e606b9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -739,9 +739,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
 	return &zone->zone_pgdat->lru_lock;
 }
 
-static inline struct lruvec *zone_lruvec(struct zone *zone)
+static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
-	return &zone->zone_pgdat->lruvec;
+	return &pgdat->lruvec;
 }
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a776f6506522..2137416c1f15 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -324,7 +324,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
 						  bool may_swap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
 						unsigned long *nr_scanned);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9e35424c57e..1cfb7c40fa36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -943,22 +943,23 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
 /**
- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
+ * @node: node of the wanted lruvec
  * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
- * Returns the lru list vector holding pages for the given @zone and
- * @mem.  This can be the global zone lruvec, if the memory controller
+ * Returns the lru list vector holding pages for a given @node or a given
+ * @memcg and @zone. This can be the node lruvec, if the memory controller
  * is disabled.
  */
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-				      struct mem_cgroup *memcg)
+struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+				 struct zone *zone, struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = zone_lruvec(zone);
+		lruvec = node_lruvec(pgdat);
 		goto out;
 	}
 
@@ -1454,8 +1455,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 			}
 			continue;
 		}
-		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-						     zone, &nr_scanned);
+		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+					zone, &nr_scanned);
 		*total_scanned += nr_scanned;
 		if (!soft_limit_excess(root_memcg))
 			break;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34faebad3972..d320524d68c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5521,6 +5521,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->kcompactd_wait);
 #endif
 	pgdat_page_ext_init(pgdat);
+	lruvec_init(node_lruvec(pgdat));
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
@@ -5585,7 +5586,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		/* For bootup, initialized properly in watermark setup */
 		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
 
-		lruvec_init(zone_lruvec(zone));
 		if (!size)
 			continue;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 39ad2ab76a2b..3f67edc69e04 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2219,10 +2219,11 @@ static inline void init_tlb_ubc(void)
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_zone_memcg(struct zone *zone, struct mem_cgroup *memcg,
+static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
 			      struct scan_control *sc, unsigned long *lru_pages)
 {
-	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	struct zone *zone = &pgdat->node_zones[sc->reclaim_idx];
+	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long targets[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -2355,13 +2356,14 @@ static bool in_reclaim_compaction(struct scan_control *sc)
  * calls try_to_compact_zone() that it will have enough free pages to succeed.
  * It will give up earlier than that if there is difficulty reclaiming pages.
  */
-static inline bool should_continue_reclaim(struct zone *zone,
+static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 					unsigned long nr_reclaimed,
 					unsigned long nr_scanned,
 					struct scan_control *sc)
 {
 	unsigned long pages_for_compaction;
 	unsigned long inactive_lru_pages;
+	int z;
 
 	/* If not in reclaim/compaction mode, stop */
 	if (!in_reclaim_compaction(sc))
@@ -2395,21 +2397,27 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
+	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
+		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
-	switch (compaction_suitable(zone, sc->order, 0, 0)) {
-	case COMPACT_PARTIAL:
-	case COMPACT_CONTINUE:
-		return false;
-	default:
-		return true;
+	for (z = 0; z <= sc->reclaim_idx; z++) {
+		struct zone *zone = &pgdat->node_zones[z];
+
+		switch (compaction_suitable(zone, sc->order, 0, 0)) {
+		case COMPACT_PARTIAL:
+		case COMPACT_CONTINUE:
+			return false;
+		default:
+			/* check next zone */
+			;
+		}
 	}
+	return true;
 }
 
 static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
@@ -2419,15 +2427,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
-	struct zone *zone = &pgdat->node_zones[classzone_idx];
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
 		struct mem_cgroup_reclaim_cookie reclaim = {
-			.zone = zone,
+			.zone = &pgdat->node_zones[classzone_idx],
 			.priority = sc->priority,
 		};
-		unsigned long zone_lru_pages = 0;
+		unsigned long node_lru_pages = 0;
 		struct mem_cgroup *memcg;
 
 		nr_reclaimed = sc->nr_reclaimed;
@@ -2449,11 +2456,11 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 			scanned = sc->nr_scanned;
 
 			sc->reclaim_idx = reclaim_idx;
-			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
-			zone_lru_pages += lru_pages;
+			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
+			node_lru_pages += lru_pages;
 
 			if (!global_reclaim(sc) && reclaim_idx == classzone_idx)
-				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
+				shrink_slab(sc->gfp_mask, pgdat->node_id,
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
 
@@ -2465,7 +2472,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 			/*
 			 * Direct reclaim and kswapd have to scan all memory
 			 * cgroups to fulfill the overall scan target for the
-			 * zone.
+			 * node.
 			 *
 			 * Limit reclaim, on the other hand, only cares about
 			 * nr_to_reclaim pages to be reclaimed and it will
@@ -2484,9 +2491,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 		 * the eligible LRU pages were scanned.
 		 */
 		if (global_reclaim(sc) && reclaim_idx == classzone_idx)
-			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
+			shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
 				    sc->nr_scanned - nr_scanned,
-				    zone_lru_pages);
+				    node_lru_pages);
 
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2501,7 +2508,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 		if (sc->nr_reclaimed - nr_reclaimed)
 			reclaimable = true;
 
-	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
+	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
 	return reclaimable;
@@ -2887,7 +2894,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 #ifdef CONFIG_MEMCG
 
-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
+unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
 						unsigned long *nr_scanned)
@@ -2897,6 +2904,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 		.target_mem_cgroup = memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
+		.reclaim_idx = zone_idx(zone),
 		.may_swap = !noswap,
 	};
 	unsigned long lru_pages;
@@ -2911,11 +2919,11 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 	/*
 	 * NOTE: Although we can get the priority field, using it
 	 * here is not a good idea, since it limits the pages we can scan.
-	 * if we don't reclaim here, the shrink_zone from balance_pgdat
+	 * if we don't reclaim here, the shrink_node from balance_pgdat
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone_memcg(zone, memcg, &sc, &lru_pages);
+	shrink_node_memcg(zone->zone_pgdat, memcg, &sc, &lru_pages);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2973,7 +2981,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
 
 		if (inactive_anon_is_low(lruvec))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
diff --git a/mm/workingset.c b/mm/workingset.c
index 173399b239be..f68cc74a7795 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -218,7 +218,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
 	return pack_shadow(memcgid, zone, eviction);
 }
@@ -267,7 +267,7 @@ bool workingset_refault(void *shadow)
 		rcu_read_unlock();
 		return false;
 	}
-	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 	rcu_read_unlock();
@@ -317,7 +317,7 @@ void workingset_activation(struct page *page)
 	 */
 	if (!mem_cgroup_disabled() && !page_memcg(page))
 		goto out;
-	lruvec = mem_cgroup_zone_lruvec(page_zone(page), page_memcg(page));
+	lruvec = mem_cgroup_lruvec(page_zone(page)->zone_pgdat, page_zone(page), page_memcg(page));
 	atomic_long_inc(&lruvec->inactive_age);
 out:
 	unlock_page_memcg(page);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 14/27] mm, memcg: Move memcg limit enforcement from zones to nodes
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (12 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 13/27] mm, vmscan: Make shrink_node decisions more node-centric Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-23 15:04 ` [PATCH 15/27] mm, workingset: Make working set detection node-aware Mel Gorman
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

Memcg was broken by the move of all LRUs to nodes because it is tracking
limits on a per-zone basis while receiving reclaim requests on a per-node
basis. This patch moves limit enforcement to the nodes. Technically, all
the variable names should also change but people are already familiar by
the meaning of "mz" even if "mn" would be a more appropriate name now.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/memcontrol.h |  21 ++---
 include/linux/swap.h       |   2 +-
 mm/memcontrol.c            | 204 +++++++++++++++++++--------------------------
 mm/vmscan.c                |  21 +++--
 mm/workingset.c            |   6 +-
 5 files changed, 110 insertions(+), 144 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 51f4441a69fc..96b973e567f6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,7 +60,7 @@ enum mem_cgroup_stat_index {
 };
 
 struct mem_cgroup_reclaim_cookie {
-	struct zone *zone;
+	pg_data_t *pgdat;
 	int priority;
 	unsigned int generation;
 };
@@ -113,7 +113,7 @@ struct mem_cgroup_reclaim_iter {
 /*
  * per-zone information in memory controller.
  */
-struct mem_cgroup_per_zone {
+struct mem_cgroup_per_node {
 	struct lruvec		lruvec;
 	unsigned long		lru_size[NR_LRU_LISTS];
 
@@ -127,10 +127,6 @@ struct mem_cgroup_per_zone {
 						/* use container_of	   */
 };
 
-struct mem_cgroup_per_node {
-	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
-};
-
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
 	unsigned long threshold;
@@ -306,8 +302,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
-struct lruvec *mem_cgroup_lruvec(struct pglist_data *, struct zone *zone,
-				 struct mem_cgroup *);
+struct lruvec *mem_cgroup_lruvec(struct pglist_data *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
@@ -410,9 +405,9 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 static inline
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 
-	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	return mz->lru_size[lru];
 }
 
@@ -502,7 +497,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
 
@@ -594,7 +589,7 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 }
 
 static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-				struct zone *zone, struct mem_cgroup *memcg)
+				struct mem_cgroup *memcg)
 {
 	return node_lruvec(pgdat);
 }
@@ -718,7 +713,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
 					    unsigned long *total_scanned)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2137416c1f15..4b1a95798560 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -326,7 +326,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  bool may_swap);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
-						struct zone *zone,
+						pg_data_t *pgdat,
 						unsigned long *nr_scanned);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1cfb7c40fa36..5aac5d9827b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -132,15 +132,11 @@ static const char * const mem_cgroup_lru_names[] = {
  * their hierarchy representation
  */
 
-struct mem_cgroup_tree_per_zone {
+struct mem_cgroup_tree_per_node {
 	struct rb_root rb_root;
 	spinlock_t lock;
 };
 
-struct mem_cgroup_tree_per_node {
-	struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
 struct mem_cgroup_tree {
 	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
 };
@@ -322,13 +318,10 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
 
 #endif /* !CONFIG_SLOB */
 
-static struct mem_cgroup_per_zone *
-mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
+static struct mem_cgroup_per_node *
+mem_cgroup_nodeinfo(struct mem_cgroup *memcg, pg_data_t *pgdat)
 {
-	int nid = zone_to_nid(zone);
-	int zid = zone_idx(zone);
-
-	return &memcg->nodeinfo[nid]->zoneinfo[zid];
+	return memcg->nodeinfo[pgdat->node_id];
 }
 
 /**
@@ -382,37 +375,35 @@ ino_t page_cgroup_ino(struct page *page)
 	return ino;
 }
 
-static struct mem_cgroup_per_zone *
+static struct mem_cgroup_per_node *
 mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
 {
 	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
 
-	return &memcg->nodeinfo[nid]->zoneinfo[zid];
+	return memcg->nodeinfo[nid];
 }
 
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
+static struct mem_cgroup_tree_per_node *
+soft_limit_tree_node(int nid)
 {
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+	return soft_limit_tree.rb_tree_per_node[nid];
 }
 
-static struct mem_cgroup_tree_per_zone *
+static struct mem_cgroup_tree_per_node *
 soft_limit_tree_from_page(struct page *page)
 {
 	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
 
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+	return soft_limit_tree.rb_tree_per_node[nid];
 }
 
-static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
-					 struct mem_cgroup_tree_per_zone *mctz,
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
+					 struct mem_cgroup_tree_per_node *mctz,
 					 unsigned long new_usage_in_excess)
 {
 	struct rb_node **p = &mctz->rb_root.rb_node;
 	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_zone *mz_node;
+	struct mem_cgroup_per_node *mz_node;
 
 	if (mz->on_tree)
 		return;
@@ -422,7 +413,7 @@ static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
 		return;
 	while (*p) {
 		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
+		mz_node = rb_entry(parent, struct mem_cgroup_per_node,
 					tree_node);
 		if (mz->usage_in_excess < mz_node->usage_in_excess)
 			p = &(*p)->rb_left;
@@ -438,8 +429,8 @@ static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
 	mz->on_tree = true;
 }
 
-static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
-					 struct mem_cgroup_tree_per_zone *mctz)
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+					 struct mem_cgroup_tree_per_node *mctz)
 {
 	if (!mz->on_tree)
 		return;
@@ -447,8 +438,8 @@ static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 	mz->on_tree = false;
 }
 
-static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
-				       struct mem_cgroup_tree_per_zone *mctz)
+static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+				       struct mem_cgroup_tree_per_node *mctz)
 {
 	unsigned long flags;
 
@@ -472,8 +463,8 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
 static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 {
 	unsigned long excess;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
+	struct mem_cgroup_per_node *mz;
+	struct mem_cgroup_tree_per_node *mctz;
 
 	mctz = soft_limit_tree_from_page(page);
 	/*
@@ -506,24 +497,22 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 
 static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup_tree_per_zone *mctz;
-	struct mem_cgroup_per_zone *mz;
-	int nid, zid;
+	struct mem_cgroup_tree_per_node *mctz;
+	struct mem_cgroup_per_node *mz;
+	int nid;
 
 	for_each_node(nid) {
-		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-			mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-			mctz = soft_limit_tree_node_zone(nid, zid);
-			mem_cgroup_remove_exceeded(mz, mctz);
-		}
+		mz = mem_cgroup_nodeinfo(memcg, NODE_DATA(nid));
+		mctz = soft_limit_tree_node(nid);
+		mem_cgroup_remove_exceeded(mz, mctz);
 	}
 }
 
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+static struct mem_cgroup_per_node *
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 {
 	struct rb_node *rightmost = NULL;
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 
 retry:
 	mz = NULL;
@@ -531,7 +520,7 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	if (!rightmost)
 		goto done;		/* Nothing to reclaim from */
 
-	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
+	mz = rb_entry(rightmost, struct mem_cgroup_per_node, tree_node);
 	/*
 	 * Remove the node now but someone else can add it back,
 	 * we will to add it back at the end of reclaim to its correct
@@ -545,10 +534,10 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+static struct mem_cgroup_per_node *
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 
 	spin_lock_irq(&mctz->lock);
 	mz = __mem_cgroup_largest_soft_limit_node(mctz);
@@ -642,20 +631,16 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 					   int nid, unsigned int lru_mask)
 {
 	unsigned long nr = 0;
-	int zid;
+	struct mem_cgroup_per_node *mz;
+	enum lru_list lru;
 
 	VM_BUG_ON((unsigned)nid >= nr_node_ids);
 
-	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-		struct mem_cgroup_per_zone *mz;
-		enum lru_list lru;
-
-		for_each_lru(lru) {
-			if (!(BIT(lru) & lru_mask))
-				continue;
-			mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-			nr += mz->lru_size[lru];
-		}
+	for_each_lru(lru) {
+		if (!(BIT(lru) & lru_mask))
+			continue;
+		mz = mem_cgroup_nodeinfo(memcg, NODE_DATA(nid));
+		nr += mz->lru_size[lru];
 	}
 	return nr;
 }
@@ -808,9 +793,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	rcu_read_lock();
 
 	if (reclaim) {
-		struct mem_cgroup_per_zone *mz;
+		struct mem_cgroup_per_node *mz;
 
-		mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone);
+		mz = mem_cgroup_nodeinfo(root, reclaim->pgdat);
 		iter = &mz->iter[reclaim->priority];
 
 		if (prev && reclaim->generation != iter->generation)
@@ -909,19 +894,17 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
 {
 	struct mem_cgroup *memcg = dead_memcg;
 	struct mem_cgroup_reclaim_iter *iter;
-	struct mem_cgroup_per_zone *mz;
-	int nid, zid;
+	struct mem_cgroup_per_node *mz;
+	int nid;
 	int i;
 
 	while ((memcg = parent_mem_cgroup(memcg))) {
 		for_each_node(nid) {
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-				mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-				for (i = 0; i <= DEF_PRIORITY; i++) {
-					iter = &mz->iter[i];
-					cmpxchg(&iter->position,
-						dead_memcg, NULL);
-				}
+			mz = mem_cgroup_nodeinfo(memcg, NODE_DATA(nid));
+			for (i = 0; i <= DEF_PRIORITY; i++) {
+				iter = &mz->iter[i];
+				cmpxchg(&iter->position,
+					dead_memcg, NULL);
 			}
 		}
 	}
@@ -945,7 +928,6 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
 /**
  * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
  * @node: node of the wanted lruvec
- * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
  * Returns the lru list vector holding pages for a given @node or a given
@@ -953,9 +935,9 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * is disabled.
  */
 struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-				 struct zone *zone, struct mem_cgroup *memcg)
+				 struct mem_cgroup *memcg)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
@@ -963,7 +945,7 @@ struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 		goto out;
 	}
 
-	mz = mem_cgroup_zone_zoneinfo(memcg, zone);
+	mz = mem_cgroup_nodeinfo(memcg, pgdat);
 	lruvec = &mz->lruvec;
 out:
 	/*
@@ -971,8 +953,8 @@ struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 	 * we have to be prepared to initialize lruvec->zone here;
 	 * and if offlined then reonlined, we need to reinitialize it.
 	 */
-	if (unlikely(lruvec->pgdat != zone->zone_pgdat))
-		lruvec->pgdat = zone->zone_pgdat;
+	if (unlikely(lruvec->pgdat != pgdat))
+		lruvec->pgdat = pgdat;
 	return lruvec;
 }
 
@@ -987,7 +969,7 @@ struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
  */
 struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
@@ -1029,13 +1011,13 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 				int nr_pages)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 	unsigned long *lru_size;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	lru_size = mz->lru_size + lru;
 	*lru_size += nr_pages;
 	VM_BUG_ON((long)(*lru_size) < 0);
@@ -1415,7 +1397,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 #endif
 
 static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
-				   struct zone *zone,
+				   pg_data_t *pgdat,
 				   gfp_t gfp_mask,
 				   unsigned long *total_scanned)
 {
@@ -1425,7 +1407,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 	unsigned long excess;
 	unsigned long nr_scanned;
 	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
+		.pgdat = pgdat,
 		.priority = 0,
 	};
 
@@ -1456,7 +1438,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 			continue;
 		}
 		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
-					zone, &nr_scanned);
+					pgdat, &nr_scanned);
 		*total_scanned += nr_scanned;
 		if (!soft_limit_excess(root_memcg))
 			break;
@@ -2548,22 +2530,22 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
 					    unsigned long *total_scanned)
 {
 	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
+	struct mem_cgroup_per_node *mz, *next_mz = NULL;
 	unsigned long reclaimed;
 	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
+	struct mem_cgroup_tree_per_node *mctz;
 	unsigned long excess;
 	unsigned long nr_scanned;
 
 	if (order > 0)
 		return 0;
 
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
+	mctz = soft_limit_tree_node(pgdat->node_id);
 	/*
 	 * This loop can run a while, specially if mem_cgroup's continuously
 	 * keep exceeding their soft limit and putting the system under
@@ -2578,7 +2560,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 			break;
 
 		nr_scanned = 0;
-		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone,
+		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
 						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
@@ -3197,22 +3179,21 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 
 #ifdef CONFIG_DEBUG_VM
 	{
-		int nid, zid;
-		struct mem_cgroup_per_zone *mz;
+		pg_data_t *pgdat;
+		struct mem_cgroup_per_node *mz;
 		struct zone_reclaim_stat *rstat;
 		unsigned long recent_rotated[2] = {0, 0};
 		unsigned long recent_scanned[2] = {0, 0};
 
-		for_each_online_node(nid)
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-				mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-				rstat = &mz->lruvec.reclaim_stat;
+		for_each_online_pgdat(pgdat) {
+			mz = mem_cgroup_nodeinfo(memcg, pgdat);
+			rstat = &mz->lruvec.reclaim_stat;
 
-				recent_rotated[0] += rstat->recent_rotated[0];
-				recent_rotated[1] += rstat->recent_rotated[1];
-				recent_scanned[0] += rstat->recent_scanned[0];
-				recent_scanned[1] += rstat->recent_scanned[1];
-			}
+			recent_rotated[0] += rstat->recent_rotated[0];
+			recent_rotated[1] += rstat->recent_rotated[1];
+			recent_scanned[0] += rstat->recent_scanned[0];
+			recent_scanned[1] += rstat->recent_scanned[1];
+		}
 		seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
 		seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
 		seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
@@ -4038,11 +4019,10 @@ static struct cftype mem_cgroup_legacy_files[] = {
 	{ },	/* terminate */
 };
 
-static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
+static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
-	struct mem_cgroup_per_zone *mz;
-	int zone, tmp = node;
+	int tmp = node;
 	/*
 	 * This routine is called against possible nodes.
 	 * But it's BUG to call kmalloc() against offline node.
@@ -4057,18 +4037,16 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-		mz = &pn->zoneinfo[zone];
-		lruvec_init(&mz->lruvec);
-		mz->usage_in_excess = 0;
-		mz->on_tree = false;
-		mz->memcg = memcg;
-	}
+	lruvec_init(&pn->lruvec);
+	pn->usage_in_excess = 0;
+	pn->on_tree = false;
+	pn->memcg = memcg;
+
 	memcg->nodeinfo[node] = pn;
 	return 0;
 }
 
-static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
+static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	kfree(memcg->nodeinfo[node]);
 }
@@ -4079,7 +4057,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 
 	memcg_wb_domain_exit(memcg);
 	for_each_node(node)
-		free_mem_cgroup_per_zone_info(memcg, node);
+		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->stat);
 	kfree(memcg);
 }
@@ -4102,7 +4080,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 
 	for_each_node(node)
-		if (alloc_mem_cgroup_per_zone_info(memcg, node))
+		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
 
 	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
@@ -5650,18 +5628,12 @@ static int __init mem_cgroup_init(void)
 
 	for_each_node(node) {
 		struct mem_cgroup_tree_per_node *rtpn;
-		int zone;
 
 		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
 				    node_online(node) ? node : NUMA_NO_NODE);
 
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			struct mem_cgroup_tree_per_zone *rtpz;
-
-			rtpz = &rtpn->rb_tree_per_zone[zone];
-			rtpz->rb_root = RB_ROOT;
-			spin_lock_init(&rtpz->lock);
-		}
+		rtpn->rb_root = RB_ROOT;
+		spin_lock_init(&rtpn->lock);
 		soft_limit_tree.rb_tree_per_node[node] = rtpn;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f67edc69e04..607a12c77da7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2222,8 +2222,7 @@ static inline void init_tlb_ubc(void)
 static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
 			      struct scan_control *sc, unsigned long *lru_pages)
 {
-	struct zone *zone = &pgdat->node_zones[sc->reclaim_idx];
-	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
+	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long targets[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -2431,7 +2430,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
 		struct mem_cgroup_reclaim_cookie reclaim = {
-			.zone = &pgdat->node_zones[classzone_idx],
+			.pgdat = pgdat,
 			.priority = sc->priority,
 		};
 		unsigned long node_lru_pages = 0;
@@ -2630,7 +2629,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
 			 * and balancing, not for a memcg's limit.
 			 */
 			nr_soft_scanned = 0;
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
+			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
 						sc->order, sc->gfp_mask,
 						&nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
@@ -2896,7 +2895,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 						gfp_t gfp_mask, bool noswap,
-						struct zone *zone,
+						pg_data_t *pgdat,
 						unsigned long *nr_scanned)
 {
 	struct scan_control sc = {
@@ -2904,7 +2903,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 		.target_mem_cgroup = memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
-		.reclaim_idx = zone_idx(zone),
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
 	};
 	unsigned long lru_pages;
@@ -2923,7 +2922,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_node_memcg(zone->zone_pgdat, memcg, &sc, &lru_pages);
+	shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2972,7 +2971,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 #endif
 
 static void age_active_anon(struct pglist_data *pgdat,
-				struct zone *zone, struct scan_control *sc)
+				struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
@@ -2981,7 +2980,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
+		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 
 		if (inactive_anon_is_low(lruvec))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
@@ -3170,7 +3169,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * about consistent aging.
 		 */
 		sc.reclaim_idx = MAX_NR_ZONES - 1;
-		age_active_anon(pgdat, &pgdat->node_zones[sc.reclaim_idx], &sc);
+		age_active_anon(pgdat, &sc);
 
 		/*
 		 * If we're getting trouble reclaiming, start doing writepage
@@ -3182,7 +3181,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		/* Call soft limit reclaim before calling shrink_node. */
 		sc.nr_scanned = 0;
 		nr_soft_scanned = 0;
-		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, order,
+		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, order,
 						sc.gfp_mask, &nr_soft_scanned);
 		sc.nr_reclaimed += nr_soft_reclaimed;
 
diff --git a/mm/workingset.c b/mm/workingset.c
index f68cc74a7795..d06d69670b5d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -218,7 +218,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
 	return pack_shadow(memcgid, zone, eviction);
 }
@@ -267,7 +267,7 @@ bool workingset_refault(void *shadow)
 		rcu_read_unlock();
 		return false;
 	}
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 	rcu_read_unlock();
@@ -317,7 +317,7 @@ void workingset_activation(struct page *page)
 	 */
 	if (!mem_cgroup_disabled() && !page_memcg(page))
 		goto out;
-	lruvec = mem_cgroup_lruvec(page_zone(page)->zone_pgdat, page_zone(page), page_memcg(page));
+	lruvec = mem_cgroup_lruvec(page_zone(page)->zone_pgdat, page_memcg(page));
 	atomic_long_inc(&lruvec->inactive_age);
 out:
 	unlock_page_memcg(page);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 15/27] mm, workingset: Make working set detection node-aware
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (13 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 14/27] mm, memcg: Move memcg limit enforcement from zones to nodes Mel Gorman
@ 2016-02-23 15:04 ` Mel Gorman
  2016-02-28 16:17   ` Johannes Weiner
  2016-02-23 15:17 ` [PATCH 16/27] mm, page_alloc: Consider dirtyable memory in terms of nodes Mel Gorman
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:04 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

Working set and refault detection is still zone-based, fix it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  6 +++---
 mm/vmstat.c            |  6 +++---
 mm/workingset.c        | 37 +++++++++++++++++--------------------
 3 files changed, 23 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ffe8a6e606b9..cef476813581 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -146,9 +146,6 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
-	WORKINGSET_REFAULT,
-	WORKINGSET_ACTIVATE,
-	WORKINGSET_NODERECLAIM,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
@@ -163,6 +160,9 @@ enum node_stat_item {
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
+	WORKINGSET_REFAULT,
+	WORKINGSET_ACTIVATE,
+	WORKINGSET_NODERECLAIM,
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 19bd521e161b..0746dd5e1e73 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -970,9 +970,6 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
-	"workingset_refault",
-	"workingset_activate",
-	"workingset_nodereclaim",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 
@@ -985,6 +982,9 @@ const char * const vmstat_text[] = {
 	"nr_isolated_anon",
 	"nr_isolated_file",
 	"nr_pages_scanned",
+	"workingset_refault",
+	"workingset_activate",
+	"workingset_nodereclaim",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index d06d69670b5d..82a3fedef0df 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -16,7 +16,7 @@
 /*
  *		Double CLOCK lists
  *
- * Per zone, two clock lists are maintained for file pages: the
+ * Per node, two clock lists are maintained for file pages: the
  * inactive and the active list.  Freshly faulted pages start out at
  * the head of the inactive list and page reclaim scans pages from the
  * tail.  Pages that are accessed multiple times on the inactive list
@@ -141,11 +141,11 @@
  *
  *		Implementation
  *
- * For each zone's file LRU lists, a counter for inactive evictions
- * and activations is maintained (zone->inactive_age).
+ * For each node's file LRU lists, a counter for inactive evictions
+ * and activations is maintained (node->inactive_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
- * identify the zone) is stored in the now empty page cache radix tree
+ * identify the node) is stored in the now empty page cache radix tree
  * slot of the evicted page.  This is called a shadow entry.
  *
  * On cache misses for which there are shadow entries, an eligible
@@ -167,33 +167,30 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
-	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
-	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
+	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
+static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 			  unsigned long *evictionp)
 {
 	unsigned long entry = (unsigned long)shadow;
-	int memcgid, nid, zid;
+	int memcgid, nid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
-	zid = entry & ((1UL << ZONES_SHIFT) - 1);
-	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
 	entry >>= MEM_CGROUP_ID_SHIFT;
 
 	*memcgidp = memcgid;
-	*zonep = NODE_DATA(nid)->node_zones + zid;
+	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
 }
 
@@ -220,7 +217,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 
 	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, zone, eviction);
+	return pack_shadow(memcgid, zone->zone_pgdat, eviction);
 }
 
 /**
@@ -228,7 +225,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
- * evicted page in the context of the zone it was allocated in.
+ * evicted page in the context of the node it was allocated in.
  *
  * Returns %true if the page should be activated, %false otherwise.
  */
@@ -240,10 +237,10 @@ bool workingset_refault(void *shadow)
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct zone *zone;
+	struct pglist_data *pgdat;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &zone, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
 
 	rcu_read_lock();
 	/*
@@ -267,7 +264,7 @@ bool workingset_refault(void *shadow)
 		rcu_read_unlock();
 		return false;
 	}
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
+	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 	rcu_read_unlock();
@@ -290,10 +287,10 @@ bool workingset_refault(void *shadow)
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
-	inc_zone_state(zone, WORKINGSET_REFAULT);
+	inc_node_state(pgdat, WORKINGSET_REFAULT);
 
 	if (refault_distance <= active_file) {
-		inc_zone_state(zone, WORKINGSET_ACTIVATE);
+		inc_node_state(pgdat, WORKINGSET_ACTIVATE);
 		return true;
 	}
 	return false;
@@ -435,7 +432,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 		}
 	}
 	BUG_ON(node->count);
-	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+	inc_node_state(page_zone(virt_to_page(node))->zone_pgdat, WORKINGSET_NODERECLAIM);
 	if (!__radix_tree_delete_node(&mapping->page_tree, node))
 		BUG();
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 16/27] mm, page_alloc: Consider dirtyable memory in terms of nodes
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (14 preceding siblings ...)
  2016-02-23 15:04 ` [PATCH 15/27] mm, workingset: Make working set detection node-aware Mel Gorman
@ 2016-02-23 15:17 ` Mel Gorman
  2016-02-28 16:17   ` Johannes Weiner
  2016-02-23 15:18 ` [PATCH 17/27] mm: Move page mapped accounting to the node Mel Gorman
                   ` (12 subsequent siblings)
  28 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:17 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

Historically dirty pages were spread among zones but now that LRUs are
per-node it is more appropriate to consider dirty pages in a node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h    | 12 +++----
 include/linux/writeback.h |  2 +-
 mm/page-writeback.c       | 89 ++++++++++++++++++++++++++---------------------
 mm/page_alloc.c           |  4 +--
 4 files changed, 58 insertions(+), 49 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cef476813581..a5968e8b88e9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -362,12 +362,6 @@ struct zone {
 	struct pglist_data	*zone_pgdat;
 	struct per_cpu_pageset __percpu *pageset;
 
-	/*
-	 * This is a per-zone reserve of pages that are not available
-	 * to userspace allocations.
-	 */
-	unsigned long		totalreserve_pages;
-
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
@@ -686,6 +680,12 @@ typedef struct pglist_data {
 	/* Number of pages migrated during the rate limiting time interval */
 	unsigned long numabalancing_migrate_nr_pages;
 #endif
+	/*
+	 * This is a per-zone reserve of pages that are not available
+	 * to userspace allocations.
+	 */
+	unsigned long		totalreserve_pages;
+
 	/* Write-intensive fields used from the page allocator */
 	ZONE_PADDING(_pad1_)
 	spinlock_t		lru_lock;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b333c945e571..f6a35510b629 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -315,7 +315,7 @@ void laptop_mode_timer_fn(unsigned long data);
 static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
-bool zone_dirty_ok(struct zone *zone);
+bool node_dirty_ok(struct pglist_data *pgdat);
 int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
 #ifdef CONFIG_CGROUP_WRITEBACK
 void wb_domain_exit(struct wb_domain *dom);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b0960ec94bc9..4abce6295f7c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -267,26 +267,35 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
  */
 
 /**
- * zone_dirtyable_memory - number of dirtyable pages in a zone
- * @zone: the zone
+ * node_dirtyable_memory - number of dirtyable pages in a node
+ * @pgdat: the node
  *
- * Returns the zone's number of pages potentially available for dirty
- * page cache.  This is the base value for the per-zone dirty limits.
+ * Returns the node's number of pages potentially available for dirty
+ * page cache.  This is the base value for the per-node dirty limits.
  */
-static unsigned long zone_dirtyable_memory(struct zone *zone)
+static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 {
-	unsigned long nr_pages;
+	unsigned long nr_pages = 0;
+	int z;
+
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		nr_pages += zone_page_state(zone, NR_FREE_PAGES);
+	}
 
-	nr_pages = zone_page_state(zone, NR_FREE_PAGES);
 	/*
 	 * Pages reserved for the kernel should not be considered
 	 * dirtyable, to prevent a situation where reclaim has to
 	 * clean pages in order to balance the zones.
 	 */
-	nr_pages -= min(nr_pages, zone->totalreserve_pages);
+	nr_pages -= min(nr_pages, pgdat->totalreserve_pages);
 
-	nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
-	nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
+	nr_pages += node_page_state(pgdat, NR_INACTIVE_FILE);
+	nr_pages += node_page_state(pgdat, NR_ACTIVE_FILE);
 
 	return nr_pages;
 }
@@ -297,22 +306,11 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 	int node;
 	unsigned long x = 0;
 
-	for_each_node_state(node, N_HIGH_MEMORY) {
-		struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
-
-		x += zone_dirtyable_memory(z);
-	}
 	/*
-	 * Unreclaimable memory (kernel memory or anonymous memory
-	 * without swap) can bring down the dirtyable pages below
-	 * the zone's dirty balance reserve and the above calculation
-	 * will underflow.  However we still want to add in nodes
-	 * which are below threshold (negative values) to get a more
-	 * accurate calculation but make sure that the total never
-	 * underflows.
+	 * LRU lists are per-node so there is accurate way of accurately
+	 * calculating dirtyable memory of just the high zone
 	 */
-	if ((long)x < 0)
-		x = 0;
+	x = totalhigh_pages;
 
 	/*
 	 * Make sure that the number of highmem pages is never larger
@@ -438,23 +436,23 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 }
 
 /**
- * zone_dirty_limit - maximum number of dirty pages allowed in a zone
- * @zone: the zone
+ * node_dirty_limit - maximum number of dirty pages allowed in a node
+ * @pgdat: the node
  *
- * Returns the maximum number of dirty pages allowed in a zone, based
- * on the zone's dirtyable memory.
+ * Returns the maximum number of dirty pages allowed in a node, based
+ * on the node's dirtyable memory.
  */
-static unsigned long zone_dirty_limit(struct zone *zone)
+static unsigned long node_dirty_limit(struct pglist_data *pgdat)
 {
-	unsigned long zone_memory = zone_dirtyable_memory(zone);
+	unsigned long node_memory = node_dirtyable_memory(pgdat);
 	struct task_struct *tsk = current;
 	unsigned long dirty;
 
 	if (vm_dirty_bytes)
 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
-			zone_memory / global_dirtyable_memory();
+			node_memory / global_dirtyable_memory();
 	else
-		dirty = vm_dirty_ratio * zone_memory / 100;
+		dirty = vm_dirty_ratio * node_memory / 100;
 
 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
 		dirty += dirty / 4;
@@ -463,19 +461,30 @@ static unsigned long zone_dirty_limit(struct zone *zone)
 }
 
 /**
- * zone_dirty_ok - tells whether a zone is within its dirty limits
- * @zone: the zone to check
+ * node_dirty_ok - tells whether a node is within its dirty limits
+ * @pgdat: the node to check
  *
- * Returns %true when the dirty pages in @zone are within the zone's
+ * Returns %true when the dirty pages in @pgdat are within the node's
  * dirty limit, %false if the limit is exceeded.
  */
-bool zone_dirty_ok(struct zone *zone)
+bool node_dirty_ok(struct pglist_data *pgdat)
 {
-	unsigned long limit = zone_dirty_limit(zone);
+	int z;
+	unsigned long limit = node_dirty_limit(pgdat);
+	unsigned long nr_pages = 0;
+
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		nr_pages += zone_page_state(zone, NR_FILE_DIRTY);
+		nr_pages += zone_page_state(zone, NR_UNSTABLE_NFS);
+		nr_pages += zone_page_state(zone, NR_WRITEBACK);
+	}
 
-	return zone_page_state(zone, NR_FILE_DIRTY) +
-	       zone_page_state(zone, NR_UNSTABLE_NFS) +
-	       zone_page_state(zone, NR_WRITEBACK) <= limit;
+	return nr_pages <= limit;
 }
 
 int dirty_background_ratio_handler(struct ctl_table *table, int write,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d320524d68c6..4b99c28f7f9e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2665,7 +2665,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
+		if (ac->spread_dirty_pages && !node_dirty_ok(zone->zone_pgdat))
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
@@ -6333,7 +6333,7 @@ static void calculate_totalreserve_pages(void)
 			if (max > zone->managed_pages)
 				max = zone->managed_pages;
 
-			zone->totalreserve_pages = max;
+			pgdat->totalreserve_pages += max;
 
 			reserve_pages += max;
 		}
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 17/27] mm: Move page mapped accounting to the node
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (15 preceding siblings ...)
  2016-02-23 15:17 ` [PATCH 16/27] mm, page_alloc: Consider dirtyable memory in terms of nodes Mel Gorman
@ 2016-02-23 15:18 ` Mel Gorman
  2016-02-23 15:18 ` [PATCH 18/27] mm: Rename NR_ANON_PAGES to NR_ANON_MAPPED Mel Gorman
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:18 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

Reclaim makes decisions based on the number of file pages that are mapped but
it's mixing node and zone information. Account NR_FILE_MAPPED pages on the node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 arch/tile/mm/pgtable.c |  2 +-
 drivers/base/node.c    |  4 ++--
 fs/proc/meminfo.c      |  4 ++--
 include/linux/mmzone.h |  6 +++---
 mm/page_alloc.c        |  6 +++---
 mm/rmap.c              | 13 +++++++------
 mm/vmscan.c            |  2 +-
 mm/vmstat.c            |  4 ++--
 8 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 3ed0a666d44a..2e784e84bd6f 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -55,7 +55,7 @@ void show_mem(unsigned int filter)
 	       global_page_state(NR_FREE_PAGES),
 	       (global_page_state(NR_SLAB_RECLAIMABLE) +
 		global_page_state(NR_SLAB_UNRECLAIMABLE)),
-	       global_page_state(NR_FILE_MAPPED),
+	       global_node_page_state(NR_FILE_MAPPED),
 	       global_page_state(NR_PAGETABLE),
 	       global_page_state(NR_BOUNCE),
 	       global_page_state(NR_FILE_PAGES),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 4260c7f3ee1b..66aed68a0fdc 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,8 +119,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
 		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
 		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
-		       nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
+		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
+		       nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index df4661abadc4..2c0dfee0f29f 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -167,8 +167,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.freeswap),
 		K(global_page_state(NR_FILE_DIRTY)),
 		K(global_page_state(NR_WRITEBACK)),
-		K(global_page_state(NR_ANON_PAGES)),
-		K(global_page_state(NR_FILE_MAPPED)),
+		K(global_node_page_state(NR_ANON_PAGES)),
+		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
 				global_page_state(NR_SLAB_UNRECLAIMABLE)),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a5968e8b88e9..b901dbfb6cc9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -119,9 +119,6 @@ enum zone_stat_item {
 	NR_FREE_PAGES,
 	NR_ALLOC_BATCH,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
-	NR_ANON_PAGES,	/* Mapped anonymous pages */
-	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
-			   only modified from process context */
 	NR_FILE_PAGES,
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
@@ -163,6 +160,9 @@ enum node_stat_item {
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
+	NR_ANON_PAGES,	/* Mapped anonymous pages */
+	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
+			   only modified from process context */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b99c28f7f9e..710c1bc26847 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3942,7 +3942,7 @@ void show_free_areas(unsigned int filter)
 		global_page_state(NR_UNSTABLE_NFS),
 		global_page_state(NR_SLAB_RECLAIMABLE),
 		global_page_state(NR_SLAB_UNRECLAIMABLE),
-		global_page_state(NR_FILE_MAPPED),
+		global_node_page_state(NR_FILE_MAPPED),
 		global_page_state(NR_SHMEM),
 		global_page_state(NR_PAGETABLE),
 		global_page_state(NR_BOUNCE),
@@ -3959,6 +3959,7 @@ void show_free_areas(unsigned int filter)
 			" unevictable:%lukB"
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
+			" mapped:%lukB"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -3969,6 +3970,7 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+			K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			!pgdat_reclaimable(pgdat) ? "yes" : "no");
 	}
 
@@ -3993,7 +3995,6 @@ void show_free_areas(unsigned int filter)
 			" mlocked:%lukB"
 			" dirty:%lukB"
 			" writeback:%lukB"
-			" mapped:%lukB"
 			" shmem:%lukB"
 			" slab_reclaimable:%lukB"
 			" slab_unreclaimable:%lukB"
@@ -4017,7 +4018,6 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_MLOCK)),
 			K(zone_page_state(zone, NR_FILE_DIRTY)),
 			K(zone_page_state(zone, NR_WRITEBACK)),
-			K(zone_page_state(zone, NR_FILE_MAPPED)),
 			K(zone_page_state(zone, NR_SHMEM)),
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
diff --git a/mm/rmap.c b/mm/rmap.c
index 6ed6699c45fc..8f405fd9cfe1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1231,7 +1231,8 @@ void do_page_add_anon_rmap(struct page *page,
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
 		}
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
+		__mod_node_page_state(page_zone(page)->zone_pgdat,
+				NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1275,7 +1276,7 @@ void page_add_new_anon_rmap(struct page *page,
 		/* increment count (starts at -1) */
 		atomic_set(&page->_mapcount, 0);
 	}
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
+	__mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1289,7 +1290,7 @@ void page_add_file_rmap(struct page *page)
 {
 	lock_page_memcg(page);
 	if (atomic_inc_and_test(&page->_mapcount)) {
-		__inc_zone_page_state(page, NR_FILE_MAPPED);
+		__inc_node_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
 	unlock_page_memcg(page);
@@ -1315,7 +1316,7 @@ static void page_remove_file_rmap(struct page *page)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_zone_page_state(page, NR_FILE_MAPPED);
+	__dec_node_page_state(page, NR_FILE_MAPPED);
 	mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 
 	if (unlikely(PageMlocked(page)))
@@ -1357,7 +1358,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		clear_page_mlock(page);
 
 	if (nr) {
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+		__mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_PAGES, -nr);
 		deferred_split_huge_page(page);
 	}
 }
@@ -1389,7 +1390,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_zone_page_state(page, NR_ANON_PAGES);
+	__dec_node_page_state(page, NR_ANON_PAGES);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 607a12c77da7..20497c5c6bb9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3578,7 +3578,7 @@ int sysctl_min_slab_ratio = 5;
 
 static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
 {
-	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+	unsigned long file_mapped = node_page_state(zone->zone_pgdat, NR_FILE_MAPPED);
 	unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
 		node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0746dd5e1e73..c60df3090ed7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -944,8 +944,6 @@ const char * const vmstat_text[] = {
 	"nr_free_pages",
 	"nr_alloc_batch",
 	"nr_mlock",
-	"nr_anon_pages",
-	"nr_mapped",
 	"nr_file_pages",
 	"nr_dirty",
 	"nr_writeback",
@@ -985,6 +983,8 @@ const char * const vmstat_text[] = {
 	"workingset_refault",
 	"workingset_activate",
 	"workingset_nodereclaim",
+	"nr_anon_pages",
+	"nr_mapped",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 18/27] mm: Rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (16 preceding siblings ...)
  2016-02-23 15:18 ` [PATCH 17/27] mm: Move page mapped accounting to the node Mel Gorman
@ 2016-02-23 15:18 ` Mel Gorman
  2016-02-23 15:18 ` [PATCH 19/27] mm: Move most file-based accounting to the node Mel Gorman
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:18 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES  is the number of mapped anon pages.

This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and NR_ANON_PAGES for
mapped pages. This patch renames NR_ANON_PAGES so we have

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/base/node.c    | 2 +-
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 2 +-
 mm/migrate.c           | 2 +-
 mm/rmap.c              | 8 ++++----
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 66aed68a0fdc..897b6bcb36be 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -120,7 +120,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
 		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
-		       nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
+		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 2c0dfee0f29f..b3583d75b211 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -167,7 +167,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.freeswap),
 		K(global_page_state(NR_FILE_DIRTY)),
 		K(global_page_state(NR_WRITEBACK)),
-		K(global_node_page_state(NR_ANON_PAGES)),
+		K(global_node_page_state(NR_ANON_MAPPED)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b901dbfb6cc9..05a273171ab0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -160,7 +160,7 @@ enum node_stat_item {
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
-	NR_ANON_PAGES,	/* Mapped anonymous pages */
+	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_VM_NODE_STAT_ITEMS
diff --git a/mm/migrate.c b/mm/migrate.c
index aea3e350f4fc..0492a035df80 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -409,7 +409,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 * new page and drop references to the old page.
 	 *
 	 * Note that anonymous pages are accounted for
-	 * via NR_FILE_PAGES and NR_ANON_PAGES if they
+	 * via NR_FILE_PAGES and NR_ANON_MAPPED if they
 	 * are mapped to swap space.
 	 */
 	if (newzone != oldzone) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8f405fd9cfe1..fd4dd2cdc390 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1232,7 +1232,7 @@ void do_page_add_anon_rmap(struct page *page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
 		}
 		__mod_node_page_state(page_zone(page)->zone_pgdat,
-				NR_ANON_PAGES, nr);
+				NR_ANON_MAPPED, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1276,7 +1276,7 @@ void page_add_new_anon_rmap(struct page *page,
 		/* increment count (starts at -1) */
 		atomic_set(&page->_mapcount, 0);
 	}
-	__mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_PAGES, nr);
+	__mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_MAPPED, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1358,7 +1358,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		clear_page_mlock(page);
 
 	if (nr) {
-		__mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_PAGES, -nr);
+		__mod_node_page_state(page_zone(page)->zone_pgdat, NR_ANON_MAPPED, -nr);
 		deferred_split_huge_page(page);
 	}
 }
@@ -1390,7 +1390,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_node_page_state(page, NR_ANON_PAGES);
+	__dec_node_page_state(page, NR_ANON_MAPPED);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 19/27] mm: Move most file-based accounting to the node
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (17 preceding siblings ...)
  2016-02-23 15:18 ` [PATCH 18/27] mm: Rename NR_ANON_PAGES to NR_ANON_MAPPED Mel Gorman
@ 2016-02-23 15:18 ` Mel Gorman
  2016-02-23 15:19 ` [PATCH 20/27] mm: Move vmscan writes and file write " Mel Gorman
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:18 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone. This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 arch/s390/appldata/appldata_mem.c         |  2 +-
 arch/tile/mm/pgtable.c                    |  8 +++---
 drivers/base/node.c                       | 10 ++++----
 drivers/staging/android/lowmemorykiller.c |  4 +--
 fs/fs-writeback.c                         |  4 +--
 fs/fuse/file.c                            |  8 +++---
 fs/nfs/internal.h                         |  2 +-
 fs/nfs/write.c                            |  2 +-
 fs/proc/meminfo.c                         | 10 ++++----
 include/linux/mmzone.h                    | 12 ++++-----
 include/trace/events/writeback.h          |  6 ++---
 mm/filemap.c                              | 10 ++++----
 mm/migrate.c                              | 12 ++++-----
 mm/mmap.c                                 |  4 +--
 mm/nommu.c                                |  4 +--
 mm/page-writeback.c                       | 42 +++++++++++++------------------
 mm/page_alloc.c                           | 34 ++++++++++++-------------
 mm/shmem.c                                | 12 ++++-----
 mm/swap_state.c                           |  4 +--
 mm/vmscan.c                               | 16 ++++++------
 mm/vmstat.c                               | 12 ++++-----
 21 files changed, 105 insertions(+), 113 deletions(-)

diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
index edcf2a706942..598df5708501 100644
--- a/arch/s390/appldata/appldata_mem.c
+++ b/arch/s390/appldata/appldata_mem.c
@@ -102,7 +102,7 @@ static void appldata_get_mem_data(void *data)
 	mem_data->totalhigh = P2K(val.totalhigh);
 	mem_data->freehigh  = P2K(val.freehigh);
 	mem_data->bufferram = P2K(val.bufferram);
-	mem_data->cached    = P2K(global_page_state(NR_FILE_PAGES)
+	mem_data->cached    = P2K(global_node_page_state(NR_FILE_PAGES)
 				- val.bufferram);
 
 	si_swapinfo(&val);
diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 2e784e84bd6f..dad42acd0f84 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -49,16 +49,16 @@ void show_mem(unsigned int filter)
 		global_node_page_state(NR_ACTIVE_FILE)),
 	       (global_node_page_state(NR_INACTIVE_ANON) +
 		global_node_page_state(NR_INACTIVE_FILE)),
-	       global_page_state(NR_FILE_DIRTY),
-	       global_page_state(NR_WRITEBACK),
-	       global_page_state(NR_UNSTABLE_NFS),
+	       global_node_page_state(NR_FILE_DIRTY),
+	       global_node_page_state(NR_WRITEBACK),
+	       global_node_page_state(NR_UNSTABLE_NFS),
 	       global_page_state(NR_FREE_PAGES),
 	       (global_page_state(NR_SLAB_RECLAIMABLE) +
 		global_page_state(NR_SLAB_UNRECLAIMABLE)),
 	       global_node_page_state(NR_FILE_MAPPED),
 	       global_page_state(NR_PAGETABLE),
 	       global_page_state(NR_BOUNCE),
-	       global_page_state(NR_FILE_PAGES),
+	       global_node_page_state(NR_FILE_PAGES),
 	       get_nr_swap_pages());
 
 	for_each_zone(zone) {
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 897b6bcb36be..ec733919bc6b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -116,18 +116,18 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d AnonHugePages:  %8lu kB\n"
 #endif
 			,
-		       nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
-		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
-		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
+		       nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+		       nid, K(node_page_state(pgdat, NR_WRITEBACK)),
+		       nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
 		       nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_UNSTABLE_NFS)),
+		       nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 		       nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK_TEMP)),
+		       nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) +
 				sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)),
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 7d677791d13a..e0b01b162303 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -89,8 +89,8 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 	short selected_oom_score_adj;
 	int array_size = ARRAY_SIZE(lowmem_adj);
 	int other_free = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
-	int other_file = global_page_state(NR_FILE_PAGES) -
-						global_page_state(NR_SHMEM) -
+	int other_file = global_node_page_state(NR_FILE_PAGES) -
+						global_node_page_state(NR_SHMEM) -
 						total_swapcache_pages();
 
 	if (lowmem_adj_size < array_size)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6915c950e6e8..491990aa78e1 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1723,8 +1723,8 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
  */
 static unsigned long get_nr_dirty_pages(void)
 {
-	return global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS) +
+	return global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS) +
 		get_nr_dirty_inodes();
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b03d253ece15..7ed9f10aa673 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1446,7 +1446,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	list_del(&req->writepages_entry);
 	for (i = 0; i < req->num_pages; i++) {
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+		dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
 		wb_writeout_inc(&bdi->wb);
 	}
 	wake_up(&fi->page_waitq);
@@ -1636,7 +1636,7 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+	inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	spin_lock(&fc->lock);
 	list_add(&req->writepages_entry, &fi->writepages);
@@ -1750,7 +1750,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 		spin_unlock(&fc->lock);
 
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-		dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+		dec_node_page_state(page, NR_WRITEBACK_TEMP);
 		wb_writeout_inc(&bdi->wb);
 		fuse_writepage_free(fc, new_req);
 		fuse_request_free(new_req);
@@ -1849,7 +1849,7 @@ static int fuse_writepages_fill(struct page *page,
 	req->page_descs[req->num_pages].length = PAGE_SIZE;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+	inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	err = 0;
 	if (is_writeback && fuse_writepage_in_flight(req, page)) {
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9a547aa3ec8e..4a605c36b585 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -626,7 +626,7 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
 	if (!cinfo->dreq) {
 		struct inode *inode = page_file_mapping(page)->host;
 
-		inc_zone_page_state(page, NR_UNSTABLE_NFS);
+		inc_node_page_state(page, NR_UNSTABLE_NFS);
 		inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
 		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 	}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5754835a2886..d1e8d70f644b 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -897,7 +897,7 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg,
 static void
 nfs_clear_page_commit(struct page *page)
 {
-	dec_zone_page_state(page, NR_UNSTABLE_NFS);
+	dec_node_page_state(page, NR_UNSTABLE_NFS);
 	dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
 		    WB_RECLAIMABLE);
 }
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b3583d75b211..2f1ae07cac98 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -43,7 +43,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	si_swapinfo(&i);
 	committed = percpu_counter_read_positive(&vm_committed_as);
 
-	cached = global_page_state(NR_FILE_PAGES) -
+	cached = global_node_page_state(NR_FILE_PAGES) -
 			total_swapcache_pages() - i.bufferram;
 	if (cached < 0)
 		cached = 0;
@@ -165,8 +165,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 		K(i.totalswap),
 		K(i.freeswap),
-		K(global_page_state(NR_FILE_DIRTY)),
-		K(global_page_state(NR_WRITEBACK)),
+		K(global_node_page_state(NR_FILE_DIRTY)),
+		K(global_node_page_state(NR_WRITEBACK)),
 		K(global_node_page_state(NR_ANON_MAPPED)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
@@ -179,9 +179,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_QUICKLIST
 		K(quicklist_total_size()),
 #endif
-		K(global_page_state(NR_UNSTABLE_NFS)),
+		K(global_node_page_state(NR_UNSTABLE_NFS)),
 		K(global_page_state(NR_BOUNCE)),
-		K(global_page_state(NR_WRITEBACK_TEMP)),
+		K(global_node_page_state(NR_WRITEBACK_TEMP)),
 		K(vm_commit_limit()),
 		K(committed),
 		(unsigned long)VMALLOC_TOTAL >> 10,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 05a273171ab0..3e65e2abd1bb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -119,20 +119,14 @@ enum zone_stat_item {
 	NR_FREE_PAGES,
 	NR_ALLOC_BATCH,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
-	NR_FILE_PAGES,
-	NR_FILE_DIRTY,
-	NR_WRITEBACK,
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
 	NR_KERNEL_STACK,
 	/* Second 128 byte cacheline */
-	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
-	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
-	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 #ifdef CONFIG_NUMA
@@ -163,6 +157,12 @@ enum node_stat_item {
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
+	NR_FILE_PAGES,
+	NR_FILE_DIRTY,
+	NR_WRITEBACK,
+	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
+	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
+	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index fff846b512e6..af584b2931d5 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -442,9 +442,9 @@ TRACE_EVENT(global_dirty_state,
 	),
 
 	TP_fast_assign(
-		__entry->nr_dirty	= global_page_state(NR_FILE_DIRTY);
-		__entry->nr_writeback	= global_page_state(NR_WRITEBACK);
-		__entry->nr_unstable	= global_page_state(NR_UNSTABLE_NFS);
+		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
+		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
+		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
 		__entry->nr_dirtied	= global_page_state(NR_DIRTIED);
 		__entry->nr_written	= global_page_state(NR_WRITTEN);
 		__entry->background_thresh = background_thresh;
diff --git a/mm/filemap.c b/mm/filemap.c
index 99d921e1e1bb..0fa2f236a0fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -200,9 +200,9 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 	/* hugetlb pages do not participate in page cache accounting. */
 	if (!PageHuge(page))
-		__dec_zone_page_state(page, NR_FILE_PAGES);
+		__dec_node_page_state(page, NR_FILE_PAGES);
 	if (PageSwapBacked(page))
-		__dec_zone_page_state(page, NR_SHMEM);
+		__dec_node_page_state(page, NR_SHMEM);
 	VM_BUG_ON_PAGE(page_mapped(page), page);
 
 	/*
@@ -545,9 +545,9 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		 * hugetlb pages do not participate in page cache accounting.
 		 */
 		if (!PageHuge(new))
-			__inc_zone_page_state(new, NR_FILE_PAGES);
+			__inc_node_page_state(new, NR_FILE_PAGES);
 		if (PageSwapBacked(new))
-			__inc_zone_page_state(new, NR_SHMEM);
+			__inc_node_page_state(new, NR_SHMEM);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		mem_cgroup_migrate(old, new);
 		radix_tree_preload_end();
@@ -644,7 +644,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
 	/* hugetlb pages do not participate in page cache accounting. */
 	if (!huge)
-		__inc_zone_page_state(page, NR_FILE_PAGES);
+		__inc_node_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index 0492a035df80..0653ab74bd58 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -413,15 +413,15 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 * are mapped to swap space.
 	 */
 	if (newzone != oldzone) {
-		__dec_zone_state(oldzone, NR_FILE_PAGES);
-		__inc_zone_state(newzone, NR_FILE_PAGES);
+		__dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
+		__inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
 		if (PageSwapBacked(page) && !PageSwapCache(page)) {
-			__dec_zone_state(oldzone, NR_SHMEM);
-			__inc_zone_state(newzone, NR_SHMEM);
+			__dec_node_state(oldzone->zone_pgdat, NR_SHMEM);
+			__inc_node_state(newzone->zone_pgdat, NR_SHMEM);
 		}
 		if (dirty && mapping_cap_account_dirty(mapping)) {
-			__dec_zone_state(oldzone, NR_FILE_DIRTY);
-			__inc_zone_state(newzone, NR_FILE_DIRTY);
+			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
+			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
 		}
 	}
 	local_irq_enable();
diff --git a/mm/mmap.c b/mm/mmap.c
index e2e9f48b06c2..5acee7f5af62 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -184,7 +184,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 
 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
 		free = global_page_state(NR_FREE_PAGES);
-		free += global_page_state(NR_FILE_PAGES);
+		free += global_node_page_state(NR_FILE_PAGES);
 
 		/*
 		 * shmem pages shouldn't be counted as free in this
@@ -192,7 +192,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		 * that won't affect the overall amount of available
 		 * memory in the system.
 		 */
-		free -= global_page_state(NR_SHMEM);
+		free -= global_node_page_state(NR_SHMEM);
 
 		free += get_nr_swap_pages();
 
diff --git a/mm/nommu.c b/mm/nommu.c
index fbf6f0f1d6c9..fcdd9f3cf632 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1859,7 +1859,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 
 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
 		free = global_page_state(NR_FREE_PAGES);
-		free += global_page_state(NR_FILE_PAGES);
+		free += global_node_page_state(NR_FILE_PAGES);
 
 		/*
 		 * shmem pages shouldn't be counted as free in this
@@ -1867,7 +1867,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		 * that won't affect the overall amount of available
 		 * memory in the system.
 		 */
-		free -= global_page_state(NR_SHMEM);
+		free -= global_node_page_state(NR_SHMEM);
 
 		free += get_nr_swap_pages();
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4abce6295f7c..f778baeff78e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -469,20 +469,12 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
  */
 bool node_dirty_ok(struct pglist_data *pgdat)
 {
-	int z;
 	unsigned long limit = node_dirty_limit(pgdat);
 	unsigned long nr_pages = 0;
 
-	for (z = 0; z < MAX_NR_ZONES; z++) {
-		struct zone *zone = pgdat->node_zones + z;
-
-		if (!populated_zone(zone))
-			continue;
-
-		nr_pages += zone_page_state(zone, NR_FILE_DIRTY);
-		nr_pages += zone_page_state(zone, NR_UNSTABLE_NFS);
-		nr_pages += zone_page_state(zone, NR_WRITEBACK);
-	}
+	nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
+	nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
+	nr_pages += node_page_state(pgdat, NR_WRITEBACK);
 
 	return nr_pages <= limit;
 }
@@ -1572,10 +1564,10 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
+		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
+					global_node_page_state(NR_UNSTABLE_NFS);
 		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
 
 		domain_dirty_limits(gdtc);
 
@@ -1912,8 +1904,8 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	 * as we're trying to decide whether to put more under writeback.
 	 */
 	gdtc->avail = global_dirtyable_memory();
-	gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
-		      global_page_state(NR_UNSTABLE_NFS);
+	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
+		      global_node_page_state(NR_UNSTABLE_NFS);
 	domain_dirty_limits(gdtc);
 
 	if (gdtc->dirty > gdtc->bg_thresh)
@@ -1955,8 +1947,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
+                if (global_node_page_state(NR_UNSTABLE_NFS) +
+			global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
@@ -1984,8 +1976,8 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void laptop_mode_timer_fn(unsigned long data)
 {
 	struct request_queue *q = (struct request_queue *)data;
-	int nr_pages = global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS);
+	int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS);
 	struct bdi_writeback *wb;
 
 	/*
@@ -2436,7 +2428,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		wb = inode_to_wb(inode);
 
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-		__inc_zone_page_state(page, NR_FILE_DIRTY);
+		__inc_node_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
@@ -2457,7 +2449,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
 {
 	if (mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-		dec_zone_page_state(page, NR_FILE_DIRTY);
+		dec_node_page_state(page, NR_FILE_DIRTY);
 		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	}
@@ -2712,7 +2704,7 @@ int clear_page_dirty_for_io(struct page *page)
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_node_page_state(page, NR_FILE_DIRTY);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
@@ -2753,7 +2745,7 @@ int test_clear_page_writeback(struct page *page)
 	}
 	if (ret) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
-		dec_zone_page_state(page, NR_WRITEBACK);
+		dec_node_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
@@ -2794,7 +2786,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	}
 	if (!ret) {
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
-		inc_zone_page_state(page, NR_WRITEBACK);
+		inc_node_page_state(page, NR_WRITEBACK);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 710c1bc26847..4c618662b8dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3817,7 +3817,7 @@ static inline void show_node(struct zone *zone)
 void si_meminfo(struct sysinfo *val)
 {
 	val->totalram = totalram_pages;
-	val->sharedram = global_page_state(NR_SHMEM);
+	val->sharedram = global_node_page_state(NR_SHMEM);
 	val->freeram = global_page_state(NR_FREE_PAGES);
 	val->bufferram = nr_blockdev_pages();
 	val->totalhigh = totalhigh_pages;
@@ -3837,7 +3837,7 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
 		managed_pages += pgdat->node_zones[zone_type].managed_pages;
 	val->totalram = managed_pages;
-	val->sharedram = sum_zone_node_page_state(nid, NR_SHMEM);
+	val->sharedram = node_page_state(pgdat, NR_SHMEM);
 	val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
 #ifdef CONFIG_HIGHMEM
 	val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].managed_pages;
@@ -3937,13 +3937,13 @@ void show_free_areas(unsigned int filter)
 		global_node_page_state(NR_INACTIVE_FILE),
 		global_node_page_state(NR_ISOLATED_FILE),
 		global_node_page_state(NR_UNEVICTABLE),
-		global_page_state(NR_FILE_DIRTY),
-		global_page_state(NR_WRITEBACK),
-		global_page_state(NR_UNSTABLE_NFS),
+		global_node_page_state(NR_FILE_DIRTY),
+		global_node_page_state(NR_WRITEBACK),
+		global_node_page_state(NR_UNSTABLE_NFS),
 		global_page_state(NR_SLAB_RECLAIMABLE),
 		global_page_state(NR_SLAB_UNRECLAIMABLE),
 		global_node_page_state(NR_FILE_MAPPED),
-		global_page_state(NR_SHMEM),
+		global_node_page_state(NR_SHMEM),
 		global_page_state(NR_PAGETABLE),
 		global_page_state(NR_BOUNCE),
 		global_page_state(NR_FREE_PAGES),
@@ -3960,6 +3960,11 @@ void show_free_areas(unsigned int filter)
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
 			" mapped:%lukB"
+			" dirty:%lukB"
+			" writeback:%lukB"
+			" shmem:%lukB"
+			" writeback_tmp:%lukB"
+			" unstable:%lukB"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -3971,6 +3976,11 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
+			K(node_page_state(pgdat, NR_FILE_DIRTY)),
+			K(node_page_state(pgdat, NR_WRITEBACK)),
+			K(node_page_state(pgdat, NR_SHMEM)),
+			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
+			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 			!pgdat_reclaimable(pgdat) ? "yes" : "no");
 	}
 
@@ -3993,19 +4003,14 @@ void show_free_areas(unsigned int filter)
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
-			" dirty:%lukB"
-			" writeback:%lukB"
-			" shmem:%lukB"
 			" slab_reclaimable:%lukB"
 			" slab_unreclaimable:%lukB"
 			" kernel_stack:%lukB"
 			" pagetables:%lukB"
-			" unstable:%lukB"
 			" bounce:%lukB"
 			" free_pcp:%lukB"
 			" local_pcp:%ukB"
 			" free_cma:%lukB"
-			" writeback_tmp:%lukB"
 			" node_pages_scanned:%lu"
 			"\n",
 			zone->name,
@@ -4016,20 +4021,15 @@ void show_free_areas(unsigned int filter)
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
-			K(zone_page_state(zone, NR_FILE_DIRTY)),
-			K(zone_page_state(zone, NR_WRITEBACK)),
-			K(zone_page_state(zone, NR_SHMEM)),
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
 			zone_page_state(zone, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
 			K(zone_page_state(zone, NR_PAGETABLE)),
-			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
 			K(zone_page_state(zone, NR_BOUNCE)),
 			K(free_pcp),
 			K(this_cpu_read(zone->pageset->pcp.count)),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
-			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
 			K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
@@ -4072,7 +4072,7 @@ void show_free_areas(unsigned int filter)
 
 	hugetlb_show_meminfo();
 
-	printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
+	printk("%ld total pagecache pages\n", global_node_page_state(NR_FILE_PAGES));
 
 	show_swap_cache_info();
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index b8e8369676be..2d9c0579e398 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -312,8 +312,8 @@ static int shmem_add_to_page_cache(struct page *page,
 								 page);
 	if (!error) {
 		mapping->nrpages++;
-		__inc_zone_page_state(page, NR_FILE_PAGES);
-		__inc_zone_page_state(page, NR_SHMEM);
+		__inc_node_page_state(page, NR_FILE_PAGES);
+		__inc_node_page_state(page, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
 	} else {
 		page->mapping = NULL;
@@ -335,8 +335,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 	error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
 	page->mapping = NULL;
 	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
-	__dec_zone_page_state(page, NR_SHMEM);
+	__dec_node_page_state(page, NR_FILE_PAGES);
+	__dec_node_page_state(page, NR_SHMEM);
 	spin_unlock_irq(&mapping->tree_lock);
 	page_cache_release(page);
 	BUG_ON(error);
@@ -1098,8 +1098,8 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 	error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
 								   newpage);
 	if (!error) {
-		__inc_zone_page_state(newpage, NR_FILE_PAGES);
-		__dec_zone_page_state(oldpage, NR_FILE_PAGES);
+		__inc_node_page_state(newpage, NR_FILE_PAGES);
+		__dec_node_page_state(oldpage, NR_FILE_PAGES);
 	}
 	spin_unlock_irq(&swap_mapping->tree_lock);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 69cb2464e7dc..ca49c3e41443 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -95,7 +95,7 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 					entry.val, page);
 	if (likely(!error)) {
 		address_space->nrpages++;
-		__inc_zone_page_state(page, NR_FILE_PAGES);
+		__inc_node_page_state(page, NR_FILE_PAGES);
 		INC_CACHE_INFO(add_total);
 	}
 	spin_unlock_irq(&address_space->tree_lock);
@@ -147,7 +147,7 @@ void __delete_from_swap_cache(struct page *page)
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
 	address_space->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+	__dec_node_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 20497c5c6bb9..faa724af85d5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3576,11 +3576,11 @@ int sysctl_min_unmapped_ratio = 1;
  */
 int sysctl_min_slab_ratio = 5;
 
-static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
+static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
 {
-	unsigned long file_mapped = node_page_state(zone->zone_pgdat, NR_FILE_MAPPED);
-	unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
-		node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
+	unsigned long file_mapped = node_page_state(pgdat, NR_FILE_MAPPED);
+	unsigned long file_lru = node_page_state(pgdat, NR_INACTIVE_FILE) +
+		node_page_state(pgdat, NR_ACTIVE_FILE);
 
 	/*
 	 * It's possible for there to be more file mapped pages than
@@ -3599,17 +3599,17 @@ static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 	/*
 	 * If RECLAIM_UNMAP is set, then all file pages are considered
 	 * potentially reclaimable. Otherwise, we have to worry about
-	 * pages like swapcache and zone_unmapped_file_pages() provides
+	 * pages like swapcache and node_unmapped_file_pages() provides
 	 * a better estimate
 	 */
 	if (zone_reclaim_mode & RECLAIM_UNMAP)
-		nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
+		nr_pagecache_reclaimable = node_page_state(zone->zone_pgdat, NR_FILE_PAGES);
 	else
-		nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
+		nr_pagecache_reclaimable = node_unmapped_file_pages(zone->zone_pgdat);
 
 	/* If we can't clean pages, remove dirty pages from consideration */
 	if (!(zone_reclaim_mode & RECLAIM_WRITE))
-		delta += zone_page_state(zone, NR_FILE_DIRTY);
+		delta += node_page_state(zone->zone_pgdat, NR_FILE_DIRTY);
 
 	/* Watch for any possible underflows due to delta */
 	if (unlikely(delta > nr_pagecache_reclaimable))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c60df3090ed7..07a25fdbcf28 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -944,19 +944,13 @@ const char * const vmstat_text[] = {
 	"nr_free_pages",
 	"nr_alloc_batch",
 	"nr_mlock",
-	"nr_file_pages",
-	"nr_dirty",
-	"nr_writeback",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
 	"nr_page_table_pages",
 	"nr_kernel_stack",
-	"nr_unstable",
 	"nr_bounce",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
-	"nr_writeback_temp",
-	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
 
@@ -985,6 +979,12 @@ const char * const vmstat_text[] = {
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
+	"nr_file_pages",
+	"nr_dirty",
+	"nr_writeback",
+	"nr_writeback_temp",
+	"nr_shmem",
+	"nr_unstable",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 20/27] mm: Move vmscan writes and file write accounting to the node
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (18 preceding siblings ...)
  2016-02-23 15:18 ` [PATCH 19/27] mm: Move most file-based accounting to the node Mel Gorman
@ 2016-02-23 15:19 ` Mel Gorman
  2016-02-23 15:19 ` [PATCH 21/27] mm, vmscan: Update classzone_idx if buffer_heads_over_limit Mel Gorman
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:19 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

As reclaim is now node-based, it follows that page write activity
due to page reclaim should also be accounted for on the node. For
consistency, also account page writes and page dirtying on a per-node
basis.

After this patch, there are a few remaining zone counters that may
appear strange but are fine. NUMA stats are still per-zone as this is a
user-space interface that tools consume. NR_MLOCK, NR_SLAB_*, NR_PAGETABLE,
NR_KERNEL_STACK and NR_BOUNCE are all allocations that potentially pin
low memory and cannot trivially be reclaimed on demand. This information
is still useful for debugging a page allocation failure warning.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h           | 8 ++++----
 include/trace/events/writeback.h | 4 ++--
 mm/page-writeback.c              | 6 +++---
 mm/vmscan.c                      | 4 ++--
 mm/vmstat.c                      | 8 ++++----
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e65e2abd1bb..26c166113450 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -125,10 +125,6 @@ enum zone_stat_item {
 	NR_KERNEL_STACK,
 	/* Second 128 byte cacheline */
 	NR_BOUNCE,
-	NR_VMSCAN_WRITE,
-	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
-	NR_DIRTIED,		/* page dirtyings since bootup */
-	NR_WRITTEN,		/* page writings since bootup */
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
@@ -163,6 +159,10 @@ enum node_stat_item {
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
+	NR_VMSCAN_WRITE,
+	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
+	NR_DIRTIED,		/* page dirtyings since bootup */
+	NR_WRITTEN,		/* page writings since bootup */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index af584b2931d5..15ae1438df20 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -445,8 +445,8 @@ TRACE_EVENT(global_dirty_state,
 		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
 		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
 		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
-		__entry->nr_dirtied	= global_page_state(NR_DIRTIED);
-		__entry->nr_written	= global_page_state(NR_WRITTEN);
+		__entry->nr_dirtied	= global_node_page_state(NR_DIRTIED);
+		__entry->nr_written	= global_node_page_state(NR_WRITTEN);
 		__entry->background_thresh = background_thresh;
 		__entry->dirty_thresh	= dirty_thresh;
 		__entry->dirty_limit	= global_wb_domain.dirty_limit;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f778baeff78e..e16956cd7642 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2429,7 +2429,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		__inc_node_page_state(page, NR_FILE_DIRTY);
-		__inc_zone_page_state(page, NR_DIRTIED);
+		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
 		task_io_account_write(PAGE_CACHE_SIZE);
@@ -2517,7 +2517,7 @@ void account_page_redirty(struct page *page)
 
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		current->nr_dirtied--;
-		dec_zone_page_state(page, NR_DIRTIED);
+		dec_node_page_state(page, NR_DIRTIED);
 		dec_wb_stat(wb, WB_DIRTIED);
 		unlocked_inode_to_wb_end(inode, locked);
 	}
@@ -2746,7 +2746,7 @@ int test_clear_page_writeback(struct page *page)
 	if (ret) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK);
-		inc_zone_page_state(page, NR_WRITTEN);
+		inc_node_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index faa724af85d5..df1c834f3c1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -596,7 +596,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page);
-		inc_zone_page_state(page, NR_VMSCAN_WRITE);
+		inc_node_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
 
@@ -1095,7 +1095,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 * except we already have the page isolated
 				 * and know it's dirty
 				 */
-				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
+				inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
 				SetPageReclaim(page);
 
 				goto keep_locked;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 07a25fdbcf28..8562ebe2d311 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -949,10 +949,6 @@ const char * const vmstat_text[] = {
 	"nr_page_table_pages",
 	"nr_kernel_stack",
 	"nr_bounce",
-	"nr_vmscan_write",
-	"nr_vmscan_immediate_reclaim",
-	"nr_dirtied",
-	"nr_written",
 
 #ifdef CONFIG_NUMA
 	"numa_hit",
@@ -985,6 +981,10 @@ const char * const vmstat_text[] = {
 	"nr_writeback_temp",
 	"nr_shmem",
 	"nr_unstable",
+	"nr_vmscan_write",
+	"nr_vmscan_immediate_reclaim",
+	"nr_dirtied",
+	"nr_written",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 21/27] mm, vmscan: Update classzone_idx if buffer_heads_over_limit
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (19 preceding siblings ...)
  2016-02-23 15:19 ` [PATCH 20/27] mm: Move vmscan writes and file write " Mel Gorman
@ 2016-02-23 15:19 ` Mel Gorman
  2016-02-23 15:19 ` [PATCH 22/27] mm, vmscan: Only wakeup kswapd once per node for the requested classzone Mel Gorman
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:19 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

If buffer heads are over the limit then the direct reclaim gfp_mask
is promoted to __GFP_HIGHMEM so that lowmem is indirectly freed. With
node-based reclaim, it is also required that the classzone_idx be updated
or the pages will be skipped.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index df1c834f3c1e..9f1492b077be 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2574,8 +2574,10 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc,
 	 * highmem pages could be pinning lowmem pages storing buffer_heads
 	 */
 	orig_mask = sc->gfp_mask;
-	if (buffer_heads_over_limit)
+	if (buffer_heads_over_limit) {
 		sc->gfp_mask |= __GFP_HIGHMEM;
+		classzone_idx = gfp_zone(sc->gfp_mask);
+	}
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					classzone_idx, sc->nodemask) {
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 22/27] mm, vmscan: Only wakeup kswapd once per node for the requested classzone
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (20 preceding siblings ...)
  2016-02-23 15:19 ` [PATCH 21/27] mm, vmscan: Update classzone_idx if buffer_heads_over_limit Mel Gorman
@ 2016-02-23 15:19 ` Mel Gorman
  2016-02-23 15:20 ` [PATCH 23/27] mm, vmscan: Account in vmstat for pages skipped during reclaim Mel Gorman
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:19 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

kswapd is woken when zones are below the low watermark but the wakeup
decision is not taking the classzone into account.  Now that reclaim is
node-based, it is only required to wake kswapd once per node and only if
all zones are unbalanced for the requested classzone.

Note that one node might be checked multiple times but there is no cheap
way of tracking what nodes have already been visited for zoneslists that
be ordered by either zone or node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f1492b077be..e92765eb0a1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3411,6 +3411,7 @@ static int kswapd(void *p)
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
+	int z;
 
 	if (!populated_zone(zone))
 		return;
@@ -3424,8 +3425,16 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
-		return;
+
+	/* Only wake kswapd if all zones are unbalanced */
+	for (z = 0; z <= zone_idx(zone); z++) {
+		zone = pgdat->node_zones + z;
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_balanced(zone, order, 0, classzone_idx))
+			return;
+	}
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 23/27] mm, vmscan: Account in vmstat for pages skipped during reclaim
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (21 preceding siblings ...)
  2016-02-23 15:19 ` [PATCH 22/27] mm, vmscan: Only wakeup kswapd once per node for the requested classzone Mel Gorman
@ 2016-02-23 15:20 ` Mel Gorman
  2016-02-23 15:20 ` [PATCH 24/27] mm: Convert zone_reclaim to node_reclaim Mel Gorman
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:20 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

Low reclaim efficiency occurs when many pages are scanned that cannot
be reclaimed. This occurs for example when pages are dirty or under
writeback. Node-based LRU reclaim introduces a new source as reclaim
for allocation requests requiring lower zones will skip pages belonging
to higher zones. This patch adds vmstat counters to count pages that
were skipped because the calling context could not use pages from that
zone. It will help distinguish one source of low reclaim efficiency.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/vm_event_item.h | 1 +
 mm/vmscan.c                   | 1 +
 mm/vmstat.c                   | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8dcb5a813163..cadaa0f05f67 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -26,6 +26,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
+		FOR_ALL_ZONES(PGSCAN_SKIP),
 		PGREFILL,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e92765eb0a1e..a5302b86c032 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1386,6 +1386,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		if (page_zonenum(page) > sc->reclaim_idx) {
 			list_move(&page->lru, &pages_skipped);
+			__count_zone_vm_events(PGSCAN_SKIP, page_zone(page), 1);
 			continue;
 		}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8562ebe2d311..4d8617b02032 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1007,6 +1007,8 @@ const char * const vmstat_text[] = {
 	"pgmajfault",
 	"pglazyfreed",
 
+	TEXTS_FOR_ZONES("pgskip")
+
 	"pgrefill",
 	"pgsteal_kswapd",
 	"pgsteal_direct",
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 24/27] mm: Convert zone_reclaim to node_reclaim
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (22 preceding siblings ...)
  2016-02-23 15:20 ` [PATCH 23/27] mm, vmscan: Account in vmstat for pages skipped during reclaim Mel Gorman
@ 2016-02-23 15:20 ` Mel Gorman
  2016-02-23 15:20 ` [PATCH 25/27] mm, vmscan: Add classzone information to tracepoints Mel Gorman
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:20 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

As reclaim is now per-node based, convert zone_reclaim to be node_reclaim.
It is possible that a node will be reclaimed multiple times if it has
multiple zones but this is unavoidable without caching all nodes traversed
so far.  The documentation and interface to userspace is the same from
a configuration perspective and will will be similar in behaviour unless
the node-local allocation requests were also limited to lower zones.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h   | 18 ++++++------
 include/linux/swap.h     |  9 +++---
 include/linux/topology.h |  2 +-
 kernel/sysctl.c          |  4 +--
 mm/huge_memory.c         |  4 +--
 mm/internal.h            |  8 +++---
 mm/page_alloc.c          | 24 ++++++++++------
 mm/vmscan.c              | 75 ++++++++++++++++++++++++------------------------
 8 files changed, 76 insertions(+), 68 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 26c166113450..7372d59c5e3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -370,14 +370,6 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
-#ifdef CONFIG_NUMA
-	/*
-	 * zone reclaim becomes active if more unmapped pages exist.
-	 */
-	unsigned long		min_unmapped_pages;
-	unsigned long		min_slab_pages;
-#endif /* CONFIG_NUMA */
-
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
@@ -522,7 +514,6 @@ struct zone {
 } ____cacheline_internodealigned_in_smp;
 
 enum zone_flags {
-	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 	ZONE_OOM_LOCKED,		/* zone is in OOM killer zonelist */
 	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 };
@@ -538,6 +529,7 @@ enum pgdat_flags {
 	PGDAT_WRITEBACK,		/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -686,6 +678,14 @@ typedef struct pglist_data {
 	 */
 	unsigned long		totalreserve_pages;
 
+#ifdef CONFIG_NUMA
+	/*
+	 * zone reclaim becomes active if more unmapped pages exist.
+	 */
+	unsigned long		min_unmapped_pages;
+	unsigned long		min_slab_pages;
+#endif /* CONFIG_NUMA */
+
 	/* Write-intensive fields used from the page allocator */
 	ZONE_PADDING(_pad1_)
 	spinlock_t		lru_lock;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4b1a95798560..4a46c5e20808 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -334,13 +334,14 @@ extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
 #ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
+extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
 #else
-#define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+#define node_reclaim_mode 0
+static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
+				unsigned int order)
 {
 	return 0;
 }
diff --git a/include/linux/topology.h b/include/linux/topology.h
index afce69296ac0..cb0775e1ee4b 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -54,7 +54,7 @@ int arch_update_cpu_topology(void);
 /*
  * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
  * (in whatever arch specific measurement units returned by node_distance())
- * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim()
+ * and node_reclaim_mode is enabled then the VM will only call node_reclaim()
  * on nodes within this distance.
  */
 #define RECLAIM_DISTANCE 30
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d479707b5ffc..278f1a6b1cb3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1455,8 +1455,8 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "zone_reclaim_mode",
-		.data		= &zone_reclaim_mode,
-		.maxlen		= sizeof(zone_reclaim_mode),
+		.data		= &node_reclaim_mode,
+		.maxlen		= sizeof(node_reclaim_mode),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 		.extra1		= &zero,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 709797510e10..7d1689a78f29 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2186,10 +2186,10 @@ static bool khugepaged_scan_abort(int nid)
 	int i;
 
 	/*
-	 * If zone_reclaim_mode is disabled, then no extra effort is made to
+	 * If node_reclaim_mode is disabled, then no extra effort is made to
 	 * allocate memory locally.
 	 */
-	if (!zone_reclaim_mode)
+	if (!node_reclaim_mode)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
diff --git a/mm/internal.h b/mm/internal.h
index 08c996d7b8fc..1498e5c850f1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -435,10 +435,10 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 }
 #endif /* CONFIG_SPARSEMEM */
 
-#define ZONE_RECLAIM_NOSCAN	-2
-#define ZONE_RECLAIM_FULL	-1
-#define ZONE_RECLAIM_SOME	0
-#define ZONE_RECLAIM_SUCCESS	1
+#define NODE_RECLAIM_NOSCAN	-2
+#define NODE_RECLAIM_FULL	-1
+#define NODE_RECLAIM_SOME	0
+#define NODE_RECLAIM_SUCCESS	1
 
 extern int hwpoison_filter(struct page *p);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c618662b8dc..e0cb0a6fe6c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2678,16 +2678,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			if (alloc_flags & ALLOC_NO_WATERMARKS)
 				goto try_this_zone;
 
-			if (zone_reclaim_mode == 0 ||
+			if (node_reclaim_mode == 0 ||
 			    !zone_allows_reclaim(ac->preferred_zone, zone))
 				continue;
 
-			ret = zone_reclaim(zone, gfp_mask, order);
+			ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
 			switch (ret) {
-			case ZONE_RECLAIM_NOSCAN:
+			case NODE_RECLAIM_NOSCAN:
 				/* did not scan */
 				continue;
-			case ZONE_RECLAIM_FULL:
+			case NODE_RECLAIM_FULL:
 				/* scanned but unreclaimable */
 				continue;
 			default:
@@ -5572,9 +5572,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
 #ifdef CONFIG_NUMA
 		zone->node = nid;
-		zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
+		pgdat->min_unmapped_pages += (freesize*sysctl_min_unmapped_ratio)
 						/ 100;
-		zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
+		pgdat->min_slab_pages += (freesize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
 		zone->zone_pgdat = pgdat;
@@ -6579,6 +6579,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int rc;
 
@@ -6586,8 +6587,11 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 	if (rc)
 		return rc;
 
+	for_each_online_pgdat(pgdat)
+		pgdat->min_slab_pages = 0;
+
 	for_each_zone(zone)
-		zone->min_unmapped_pages = (zone->managed_pages *
+		zone->zone_pgdat->min_unmapped_pages += (zone->managed_pages *
 				sysctl_min_unmapped_ratio) / 100;
 	return 0;
 }
@@ -6595,6 +6599,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int rc;
 
@@ -6602,8 +6607,11 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
 	if (rc)
 		return rc;
 
+	for_each_online_pgdat(pgdat)
+		pgdat->min_slab_pages = 0;
+
 	for_each_zone(zone)
-		zone->min_slab_pages = (zone->managed_pages *
+		zone->zone_pgdat->min_slab_pages += (zone->managed_pages *
 				sysctl_min_slab_ratio) / 100;
 	return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5302b86c032..3c6c766f1535 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3557,12 +3557,12 @@ module_init(kswapd_init)
 
 #ifdef CONFIG_NUMA
 /*
- * Zone reclaim mode
+ * Node reclaim mode
  *
- * If non-zero call zone_reclaim when the number of free pages falls below
+ * If non-zero call node_reclaim when the number of free pages falls below
  * the watermarks.
  */
-int zone_reclaim_mode __read_mostly;
+int node_reclaim_mode __read_mostly;
 
 #define RECLAIM_OFF 0
 #define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
@@ -3570,14 +3570,14 @@ int zone_reclaim_mode __read_mostly;
 #define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
 
 /*
- * Priority for ZONE_RECLAIM. This determines the fraction of pages
+ * Priority for NODE_RECLAIM. This determines the fraction of pages
  * of a node considered for each zone_reclaim. 4 scans 1/16th of
  * a zone.
  */
-#define ZONE_RECLAIM_PRIORITY 4
+#define NODE_RECLAIM_PRIORITY 4
 
 /*
- * Percentage of pages in a zone that must be unmapped for zone_reclaim to
+ * Percentage of pages in a zone that must be unmapped for node_reclaim to
  * occur.
  */
 int sysctl_min_unmapped_ratio = 1;
@@ -3603,7 +3603,7 @@ static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
 }
 
 /* Work out how many page cache pages we can reclaim in this reclaim_mode */
-static unsigned long zone_pagecache_reclaimable(struct zone *zone)
+static unsigned long zone_pagecache_reclaimable(struct pglist_data *pgdat)
 {
 	unsigned long nr_pagecache_reclaimable;
 	unsigned long delta = 0;
@@ -3614,14 +3614,14 @@ static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 	 * pages like swapcache and node_unmapped_file_pages() provides
 	 * a better estimate
 	 */
-	if (zone_reclaim_mode & RECLAIM_UNMAP)
-		nr_pagecache_reclaimable = node_page_state(zone->zone_pgdat, NR_FILE_PAGES);
+	if (node_reclaim_mode & RECLAIM_UNMAP)
+		nr_pagecache_reclaimable = node_page_state(pgdat, NR_FILE_PAGES);
 	else
-		nr_pagecache_reclaimable = node_unmapped_file_pages(zone->zone_pgdat);
+		nr_pagecache_reclaimable = node_unmapped_file_pages(pgdat);
 
 	/* If we can't clean pages, remove dirty pages from consideration */
-	if (!(zone_reclaim_mode & RECLAIM_WRITE))
-		delta += node_page_state(zone->zone_pgdat, NR_FILE_DIRTY);
+	if (!(node_reclaim_mode & RECLAIM_WRITE))
+		delta += node_page_state(pgdat, NR_FILE_DIRTY);
 
 	/* Watch for any possible underflows due to delta */
 	if (unlikely(delta > nr_pagecache_reclaimable))
@@ -3631,21 +3631,22 @@ static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
- * Try to free up some pages from this zone through reclaim.
+ * Try to free up some pages from this node through reclaim.
  */
-static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
 	/* Minimum pages needed in order to stay on node */
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
+	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
 		.order = order,
-		.priority = ZONE_RECLAIM_PRIORITY,
-		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(zone_reclaim_mode & RECLAIM_UNMAP),
+		.priority = NODE_RECLAIM_PRIORITY,
+		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
+		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
 	};
 
@@ -3660,13 +3661,13 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
+	if (zone_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
 		/*
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_node(zone->zone_pgdat, &sc, zone_idx(zone), zone_idx(zone));
+			shrink_node(pgdat, &sc, classzone_idx, classzone_idx);
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
@@ -3676,49 +3677,47 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	return sc.nr_reclaimed >= nr_pages;
 }
 
-int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
-	int node_id;
 	int ret;
 
 	/*
-	 * Zone reclaim reclaims unmapped file backed pages and
+	 * Node reclaim reclaims unmapped file backed pages and
 	 * slab pages if we are over the defined limits.
 	 *
 	 * A small portion of unmapped file backed pages is needed for
 	 * file I/O otherwise pages read by file I/O will be immediately
-	 * thrown out if the zone is overallocated. So we do not reclaim
-	 * if less than a specified percentage of the zone is used by
+	 * thrown out if the node is overallocated. So we do not reclaim
+	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
 	 */
-	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
-	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
-		return ZONE_RECLAIM_FULL;
+	if (zone_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
+	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+		return NODE_RECLAIM_FULL;
 
-	if (!pgdat_reclaimable(zone->zone_pgdat))
-		return ZONE_RECLAIM_FULL;
+	if (!pgdat_reclaimable(pgdat))
+		return NODE_RECLAIM_FULL;
 
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
 	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
-		return ZONE_RECLAIM_NOSCAN;
+		return NODE_RECLAIM_NOSCAN;
 
 	/*
-	 * Only run zone reclaim on the local zone or on zones that do not
+	 * Only run node reclaim on the local node or on nodes that do not
 	 * have associated processors. This will favor the local processor
 	 * over remote processors and spread off node memory allocations
 	 * as wide as possible.
 	 */
-	node_id = zone_to_nid(zone);
-	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
-		return ZONE_RECLAIM_NOSCAN;
+	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
+		return NODE_RECLAIM_NOSCAN;
 
-	if (test_and_set_bit(ZONE_RECLAIM_LOCKED, &zone->flags))
-		return ZONE_RECLAIM_NOSCAN;
+	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
+		return NODE_RECLAIM_NOSCAN;
 
-	ret = __zone_reclaim(zone, gfp_mask, order);
-	clear_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
+	ret = __node_reclaim(pgdat, gfp_mask, order);
+	clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 
 	if (!ret)
 		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 25/27] mm, vmscan: Add classzone information to tracepoints
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (23 preceding siblings ...)
  2016-02-23 15:20 ` [PATCH 24/27] mm: Convert zone_reclaim to node_reclaim Mel Gorman
@ 2016-02-23 15:20 ` Mel Gorman
  2016-02-23 15:21 ` [PATCH 26/27] mm, page_alloc: Remove fair zone allocation policy Mel Gorman
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:20 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

This is convenient when tracking down why the skip count is high because it'll
show what classzone kswapd woke up at and what zones are being isolated.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/trace/events/vmscan.h | 28 ++++++++++++++++++----------
 mm/vmscan.c                   |  4 ++--
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 897f1aa1ee5f..3d242fb8910a 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -55,21 +55,23 @@ TRACE_EVENT(mm_vmscan_kswapd_sleep,
 
 TRACE_EVENT(mm_vmscan_kswapd_wake,
 
-	TP_PROTO(int nid, int order),
+	TP_PROTO(int nid, int zid, int order),
 
-	TP_ARGS(nid, order),
+	TP_ARGS(nid, zid, order),
 
 	TP_STRUCT__entry(
 		__field(	int,	nid	)
+		__field(	int,	zid	)
 		__field(	int,	order	)
 	),
 
 	TP_fast_assign(
 		__entry->nid	= nid;
+		__entry->zid    = zid;
 		__entry->order	= order;
 	),
 
-	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+	TP_printk("nid=%d zid=%d order=%d", __entry->nid, __entry->zid, __entry->order)
 );
 
 TRACE_EVENT(mm_vmscan_wakeup_kswapd,
@@ -266,16 +268,18 @@ TRACE_EVENT(mm_shrink_slab_end,
 
 DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
-	TP_PROTO(int order,
+	TP_PROTO(int classzone_idx,
+		int order,
 		unsigned long nr_requested,
 		unsigned long nr_scanned,
 		unsigned long nr_taken,
 		isolate_mode_t isolate_mode,
 		int file),
 
-	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
+	TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
 
 	TP_STRUCT__entry(
+		__field(int, classzone_idx)
 		__field(int, order)
 		__field(unsigned long, nr_requested)
 		__field(unsigned long, nr_scanned)
@@ -285,6 +289,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 	),
 
 	TP_fast_assign(
+		__entry->classzone_idx = classzone_idx;
 		__entry->order = order;
 		__entry->nr_requested = nr_requested;
 		__entry->nr_scanned = nr_scanned;
@@ -293,8 +298,9 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 		__entry->file = file;
 	),
 
-	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
+	TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
 		__entry->isolate_mode,
+		__entry->classzone_idx,
 		__entry->order,
 		__entry->nr_requested,
 		__entry->nr_scanned,
@@ -304,27 +310,29 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
 DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
 
-	TP_PROTO(int order,
+	TP_PROTO(int classzone_idx,
+		int order,
 		unsigned long nr_requested,
 		unsigned long nr_scanned,
 		unsigned long nr_taken,
 		isolate_mode_t isolate_mode,
 		int file),
 
-	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+	TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
 
 );
 
 DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
 
-	TP_PROTO(int order,
+	TP_PROTO(int classzone_idx,
+		int order,
 		unsigned long nr_requested,
 		unsigned long nr_scanned,
 		unsigned long nr_taken,
 		isolate_mode_t isolate_mode,
 		int file),
 
-	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+	TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
 
 );
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3c6c766f1535..9950081c383e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1418,7 +1418,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	if (!list_empty(&pages_skipped))
 		list_splice(&pages_skipped, src);
 	*nr_scanned = scan;
-	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
+	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
 	return nr_taken;
 }
@@ -3391,7 +3391,7 @@ static int kswapd(void *p)
 		 * Try reclaim the requested order but if that fails
 		 * then try sleeping on the basis of the order reclaimed.
 		 */
-		trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
+		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx, order);
 		if (balance_pgdat(pgdat, order, classzone_idx) < order)
 			goto kswapd_try_sleep;
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 26/27] mm, page_alloc: Remove fair zone allocation policy
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (24 preceding siblings ...)
  2016-02-23 15:20 ` [PATCH 25/27] mm, vmscan: Add classzone information to tracepoints Mel Gorman
@ 2016-02-23 15:21 ` Mel Gorman
  2016-02-23 15:21 ` [PATCH 27/27] mm: page_alloc: Cache the last node whose dirty limit is reached Mel Gorman
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:21 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

The fair zone allocation policy interleaves allocation requests between
zones to avoid an age inversion problem whereby new pages are reclaimed
to balance a zone. Reclaim is now node-based so this should no longer be
an issue and the fair zone allocation policy is not free. This patch
removes it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  2 --
 mm/internal.h          |  1 -
 mm/page_alloc.c        | 76 +-------------------------------------------------
 mm/vmstat.c            |  1 -
 4 files changed, 1 insertion(+), 79 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7372d59c5e3b..ccba10bd3241 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -117,7 +117,6 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_ALLOC_BATCH,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
@@ -515,7 +514,6 @@ struct zone {
 
 enum zone_flags {
 	ZONE_OOM_LOCKED,		/* zone is in OOM killer zonelist */
-	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 };
 
 enum pgdat_flags {
diff --git a/mm/internal.h b/mm/internal.h
index 1498e5c850f1..ad623eb6caa3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -469,7 +469,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
-#define ALLOC_FAIR		0x100 /* fair zone allocation */
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0cb0a6fe6c4..4fcd6298b9a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2372,11 +2372,6 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 					  get_pcppage_migratetype(page));
 	}
 
-	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
-	if (atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]) <= 0 &&
-	    !test_bit(ZONE_FAIR_DEPLETED, &zone->flags))
-		set_bit(ZONE_FAIR_DEPLETED, &zone->flags);
-
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
@@ -2561,40 +2556,18 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 }
 
 #ifdef CONFIG_NUMA
-static bool zone_local(struct zone *local_zone, struct zone *zone)
-{
-	return local_zone->node == zone->node;
-}
-
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
 				RECLAIM_DISTANCE;
 }
 #else	/* CONFIG_NUMA */
-static bool zone_local(struct zone *local_zone, struct zone *zone)
-{
-	return true;
-}
-
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return true;
 }
 #endif	/* CONFIG_NUMA */
 
-static void reset_alloc_batches(struct zone *preferred_zone)
-{
-	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
-
-	do {
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-		clear_bit(ZONE_FAIR_DEPLETED, &zone->flags);
-	} while (zone++ != preferred_zone);
-}
-
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -2607,11 +2580,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z;
 	struct page *page = NULL;
 	struct zone *zone;
-	int nr_fair_skipped = 0;
-	bool zonelist_rescan;
-
-zonelist_scan:
-	zonelist_rescan = false;
 
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2626,20 +2594,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			!cpuset_zone_allowed(zone, gfp_mask))
 				continue;
 		/*
-		 * Distribute pages in proportion to the individual
-		 * zone size to ensure fair page aging.  The zone a
-		 * page was allocated in should have no effect on the
-		 * time the page has in memory before being reclaimed.
-		 */
-		if (alloc_flags & ALLOC_FAIR) {
-			if (!zone_local(ac->preferred_zone, zone))
-				break;
-			if (test_bit(ZONE_FAIR_DEPLETED, &zone->flags)) {
-				nr_fair_skipped++;
-				continue;
-			}
-		}
-		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
 		 * limit, such that no single zone holds more than its
@@ -2718,27 +2672,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}
 
-	/*
-	 * The first pass makes sure allocations are spread fairly within the
-	 * local node.  However, the local node might have free pages left
-	 * after the fairness batches are exhausted, and remote zones haven't
-	 * even been considered yet.  Try once more without fairness, and
-	 * include remote zones now, before entering the slowpath and waking
-	 * kswapd: prefer spilling to a remote zone over swapping locally.
-	 */
-	if (alloc_flags & ALLOC_FAIR) {
-		alloc_flags &= ~ALLOC_FAIR;
-		if (nr_fair_skipped) {
-			zonelist_rescan = true;
-			reset_alloc_batches(ac->preferred_zone);
-		}
-		if (nr_online_nodes > 1)
-			zonelist_rescan = true;
-	}
-
-	if (zonelist_rescan)
-		goto zonelist_scan;
-
 	return NULL;
 }
 
@@ -3404,7 +3337,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
 	unsigned int cpuset_mems_cookie;
-	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
+	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = {
 		.high_zoneidx = gfp_zone(gfp_mask),
@@ -5583,9 +5516,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone_seqlock_init(zone);
 		zone_pcp_init(zone);
 
-		/* For bootup, initialized properly in watermark setup */
-		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
-
 		if (!size)
 			continue;
 
@@ -6425,10 +6355,6 @@ static void __setup_per_zone_wmarks(void)
 		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
 		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
 
-		__mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d8617b02032..34fe40824213 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -942,7 +942,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
-	"nr_alloc_batch",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4


-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 27/27] mm: page_alloc: Cache the last node whose dirty limit is reached
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (25 preceding siblings ...)
  2016-02-23 15:21 ` [PATCH 26/27] mm, page_alloc: Remove fair zone allocation policy Mel Gorman
@ 2016-02-23 15:21 ` Mel Gorman
  2016-02-23 17:15 ` [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Christoph Lameter
  2016-02-23 20:04 ` Johannes Weiner
  28 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 15:21 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

If a page is about to be dirtied then the page allocator attempts to limit
the total number of dirty pages that exists in any given zone. The call
to node_dirty_ok is expensive so this patch records if the last pgdat
examined hit the dirty limits. In some cases, this reduces the number
of calls to node_dirty_ok().

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4fcd6298b9a1..b20713c42bd1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2580,6 +2580,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z;
 	struct page *page = NULL;
 	struct zone *zone;
+	struct pglist_data *last_pgdat_dirty_limit = NULL;
 
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2619,8 +2620,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (ac->spread_dirty_pages && !node_dirty_ok(zone->zone_pgdat))
-			continue;
+		if (ac->spread_dirty_pages) {
+			if (last_pgdat_dirty_limit == zone->zone_pgdat)
+				continue;
+
+			if (!node_dirty_ok(zone->zone_pgdat)) {
+				last_pgdat_dirty_limit = zone->zone_pgdat;
+				continue;
+			}
+		}
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_ok(zone, order, mark,
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (26 preceding siblings ...)
  2016-02-23 15:21 ` [PATCH 27/27] mm: page_alloc: Cache the last node whose dirty limit is reached Mel Gorman
@ 2016-02-23 17:15 ` Christoph Lameter
  2016-02-23 20:04 ` Johannes Weiner
  28 siblings, 0 replies; 58+ messages in thread
From: Christoph Lameter @ 2016-02-23 17:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML,
	Andi Kleen

On Tue, 23 Feb 2016, Mel Gorman wrote:

> Conceptually, moving to node LRUs should be easier to understand. The
> page allocator plays fewer tricks to game reclaim and reclaim behaves
> similarly on all nodes.

I think this is a good way to simplify reclaim and limit the nasty effects
of zones a bit. It probably would allow us to long term rely less on
zones.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone)
  2016-02-23 15:04 ` [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone) Mel Gorman
@ 2016-02-23 18:04   ` Johannes Weiner
  2016-03-03 10:37   ` Vlastimil Babka
  1 sibling, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-23 18:04 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:24PM +0000, Mel Gorman wrote:
> ac->classzone_idx is determined by the index of the preferred zone and cached
> to avoid repeated calculations. wake_all_kswapds() should use it instead of
> using zone_idx() within a loop.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim
  2016-02-23 15:04 ` [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim Mel Gorman
@ 2016-02-23 18:06   ` Johannes Weiner
  2016-03-03 11:31   ` Vlastimil Babka
  1 sibling, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-23 18:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:25PM +0000, Mel Gorman wrote:
> Direct reclaim obeys cpusets but misses the cpusets_enabled() check.
> The overhead is unlikely to be measurable in the direct reclaim
> path which is expensive but there is no harm is doing it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 03/27] mm, vmstat: Add infrastructure for per-node vmstats
  2016-02-23 15:04 ` [PATCH 03/27] mm, vmstat: Add infrastructure for per-node vmstats Mel Gorman
@ 2016-02-23 18:13   ` Johannes Weiner
  2016-02-24  9:19     ` Mel Gorman
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2016-02-23 18:13 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:26PM +0000, Mel Gorman wrote:
> VM statistic counters for reclaim decisions are zone-based. If the kernel
> is to reclaim on a per-node basis then we need to track per-node statistics
> but there is no infrastructure for that. The most notable change is that
> the old node_page_state is renamed to sum_zone_node_page_state.  The new
> node_page_state takes a pglist_data and uses per-node stats but none exist
> yet. There is some renaming such as vm_stat to vm_zone_stat and the addition
> of vm_node_stat and the renaming of mod_state to mod_zone_state. Otherwise,
> this is mostly a mechanical patch with no functional change. There is a
> lot of similarity between the node and zone helpers which is unfortunate
> but there was no obvious way of reusing the code and maintaining type safety.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Hopefully we can eventually ditch /proc/zoneinfo in favor of a
/proc/nodeinfo and get rid of the per-zone stats accounting.

In general, this patch looks good to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Only one thing I noticed:

> @@ -349,12 +349,14 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  	shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
>  	local_irq_enable();
>  
> -	if (memcg_kmem_enabled())
> +	if (memcg_kmem_enabled()) {
>  		pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>  						     LRU_ALL_FILE);
> -	else
> -		pages = node_page_state(sc->nid, NR_ACTIVE_FILE) +
> -			node_page_state(sc->nid, NR_INACTIVE_FILE);
> +	} else {
> +		pg_data_t *pgdat = NODE_DATA(sc->nid);
> +		pages = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +			node_page_state(pgdat, NR_INACTIVE_FILE);
> +	}

That should also be sum_zone_node_page_state, right? These are not
valid node items (yet).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 04/27] mm, vmscan: Move lru_lock to the node
  2016-02-23 15:04 ` [PATCH 04/27] mm, vmscan: Move lru_lock to the node Mel Gorman
@ 2016-02-23 18:40   ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-23 18:40 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:27PM +0000, Mel Gorman wrote:
> Node-based reclaim requires node-based LRUs and locking. This is a
> preparation patch that just moves the lru_lock to the node so later patches
> are easier to review. It is a mechanical change but note this patch makes
> contention worse because the LRU lock is hotter and direct reclaim and kswapd
> can contend on the same lock even when reclaiming from different zones.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewing mechanical patches like these is error prone, but nothing
obviously broken stands out to me here.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/27] mm, vmscan: Move LRU lists to node
  2016-02-23 15:04 ` [PATCH 05/27] mm, vmscan: Move LRU lists to node Mel Gorman
@ 2016-02-23 18:42   ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-23 18:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:28PM +0000, Mel Gorman wrote:
> This moves the LRU lists from the zone to the node and all related data
> such as counters, tracing, congestion tracking and writeback tracking.
> This is mostly a mechanical patch but note that it introduces a number
> of anomalies. For example, the scans are per-zone but using per-node
> counters. We also mark a node as congested when a zone is congested. This
> causes weird problems that are fixed later but is easier to review.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Same rules apply as in the previous patch. Looks good, FWIW.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/27] mm, vmscan: Begin reclaiming pages on a per-node basis
  2016-02-23 15:04 ` [PATCH 06/27] mm, vmscan: Begin reclaiming pages on a per-node basis Mel Gorman
@ 2016-02-23 18:57   ` Johannes Weiner
  2016-02-23 19:03     ` Johannes Weiner
  2016-02-24 10:21     ` Mel Gorman
  0 siblings, 2 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-23 18:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:29PM +0000, Mel Gorman wrote:
> @@ -2428,10 +2448,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  			reclaimed = sc->nr_reclaimed;
>  			scanned = sc->nr_scanned;
>  
> +			sc->reclaim_idx = reclaim_idx;
>  			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
>  			zone_lru_pages += lru_pages;

The setting of sc->reclaim_idx is unexpected here. Why not set it in
the caller and eliminate the reclaim_idx parameter?

> @@ -2558,16 +2579,12 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		sc->gfp_mask |= __GFP_HIGHMEM;
>  
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> -					requested_highidx, sc->nodemask) {

It's unfortunate that we start with the lowest zone here. For Normal
allocations, the most common allocations, this will always have two
full shrink_node() rounds that skip over everything >DMA in the first,
then over everything >DMA32 in the second, even though all pages on
the node are valid reclaim candidates for that allocation.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/27] mm, vmscan: Begin reclaiming pages on a per-node basis
  2016-02-23 18:57   ` Johannes Weiner
@ 2016-02-23 19:03     ` Johannes Weiner
  2016-02-24 10:21     ` Mel Gorman
  1 sibling, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-23 19:03 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 10:57:22AM -0800, Johannes Weiner wrote:
> > @@ -2558,16 +2579,12 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> >  		sc->gfp_mask |= __GFP_HIGHMEM;
> >  
> >  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > -					requested_highidx, sc->nodemask) {
> 
> It's unfortunate that we start with the lowest zone here. For Normal
> allocations, the most common allocations, this will always have two
> full shrink_node() rounds that skip over everything >DMA in the first,
> then over everything >DMA32 in the second, even though all pages on
> the node are valid reclaim candidates for that allocation.

Nevermind, I'm confused about zonelist ordering. It starts with the
highest indexed zones.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2
  2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
                   ` (27 preceding siblings ...)
  2016-02-23 17:15 ` [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Christoph Lameter
@ 2016-02-23 20:04 ` Johannes Weiner
  2016-02-23 20:19   ` Mel Gorman
  28 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2016-02-23 20:04 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:23PM +0000, Mel Gorman wrote:
> In many benchmarks, there is an obvious difference in the number of
> allocations from each zone as the fair zone allocation policy is removed
> towards the end of the series. For example, this is the allocation stats
> when running blogbench that showed no difference in headling performance
> 
>                           mmotm-20160209   nodelru-v2
> DMA allocs                           0           0
> DMA32 allocs                   7218763      608067
> Normal allocs                 12701806    18821286
> Movable allocs                       0           0

According to the mmotm numbers, your DMA32 zone is over a third of
available memory, yet in the nodelru-v2 kernel sees only 3% of the
allocations. That's an insanely high level of aging inversion, where
the lifetime of a cache entry is again highly dependent on placement.

The fact that this doesn't make a performance difference in the
specific benchmarks you ran only proves just that: these specific
benchmarks don't care. IMO, benchmarking is not enough here. If this
is truly supposed to be unproblematic, then I think we need a reasoned
explanation. I can't imagine how it possibly could be, though.

If reclaim can't guarantee a balanced zone utilization then the
allocator has to keep doing it. :( As far as I'm concerned, the
original reason for the fair zone allocator still applies.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2
  2016-02-23 20:04 ` Johannes Weiner
@ 2016-02-23 20:19   ` Mel Gorman
  2016-02-23 20:59     ` Johannes Weiner
  0 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 20:19 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 12:04:16PM -0800, Johannes Weiner wrote:
> On Tue, Feb 23, 2016 at 03:04:23PM +0000, Mel Gorman wrote:
> > In many benchmarks, there is an obvious difference in the number of
> > allocations from each zone as the fair zone allocation policy is removed
> > towards the end of the series. For example, this is the allocation stats
> > when running blogbench that showed no difference in headling performance
> > 
> >                           mmotm-20160209   nodelru-v2
> > DMA allocs                           0           0
> > DMA32 allocs                   7218763      608067
> > Normal allocs                 12701806    18821286
> > Movable allocs                       0           0
> 
> According to the mmotm numbers, your DMA32 zone is over a third of
> available memory, yet in the nodelru-v2 kernel sees only 3% of the
> allocations.

In this case yes but blogbench is not scaled to memory size and is not
reclaim intensive. If you look, you'll see the total number of overall
allocations is very similar. During that test, there is a small amount of
kswapd scan activity (but not reclaim which is odd) at the start of the
test for nodelru but that's about it.

> That's an insanely high level of aging inversion, where
> the lifetime of a cache entry is again highly dependent on placement.
> 

The aging is now indepdant of what zone the page was allocated from because
it's node-based LRU reclaim. That may mean that the occupancy of individual
zones is now different but it should only matter if there is a large number
of address-limited requests.

> The fact that this doesn't make a performance difference in the
> specific benchmarks you ran only proves just that: these specific
> benchmarks don't care. IMO, benchmarking is not enough here. If this
> is truly supposed to be unproblematic, then I think we need a reasoned
> explanation. I can't imagine how it possibly could be, though.
> 

The basic explanation is that reclaim is on a per-node basis and we
no longer balance all zones, just one that is necessary to satisfy the
original request that wokeup kswapd.

> If reclaim can't guarantee a balanced zone utilization then the
> allocator has to keep doing it. :(

That's the key issue - the main reason balanced zone utilisation is
necessary is because we reclaim on a per-zone basis and we must avoid
page aging anomalies. If we balance such that one eligible zone is above
the watermark then it's less of a concern.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2
  2016-02-23 20:19   ` Mel Gorman
@ 2016-02-23 20:59     ` Johannes Weiner
  2016-02-23 21:58       ` Mel Gorman
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2016-02-23 20:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 08:19:32PM +0000, Mel Gorman wrote:
> On Tue, Feb 23, 2016 at 12:04:16PM -0800, Johannes Weiner wrote:
> > On Tue, Feb 23, 2016 at 03:04:23PM +0000, Mel Gorman wrote:
> > > In many benchmarks, there is an obvious difference in the number of
> > > allocations from each zone as the fair zone allocation policy is removed
> > > towards the end of the series. For example, this is the allocation stats
> > > when running blogbench that showed no difference in headling performance
> > > 
> > >                           mmotm-20160209   nodelru-v2
> > > DMA allocs                           0           0
> > > DMA32 allocs                   7218763      608067
> > > Normal allocs                 12701806    18821286
> > > Movable allocs                       0           0
> > 
> > According to the mmotm numbers, your DMA32 zone is over a third of
> > available memory, yet in the nodelru-v2 kernel sees only 3% of the
> > allocations.
> 
> In this case yes but blogbench is not scaled to memory size and is not
> reclaim intensive. If you look, you'll see the total number of overall
> allocations is very similar. During that test, there is a small amount of
> kswapd scan activity (but not reclaim which is odd) at the start of the
> test for nodelru but that's about it.

Yes, if fairness enforcement is now done by reclaim, then workloads
without reclaim will show skewed placement as the Normal zone is again
filled up first before moving on to the next zone.

That is fine. But what about the balance in reclaiming workloads?

> > That's an insanely high level of aging inversion, where
> > the lifetime of a cache entry is again highly dependent on placement.
> > 
> 
> The aging is now indepdant of what zone the page was allocated from because
> it's node-based LRU reclaim. That may mean that the occupancy of individual
> zones is now different but it should only matter if there is a large number
> of address-limited requests.

The problem is that kswapd will stay awake and continuously draw
subsequent allocations into a single zone, thus utilizing only a
fraction of available memory. A DMA32-limited kswapd wakeups can
reclaim cache in DMA32 continuously if the allocator continously
places new cache pages in that zone. It looks like that is what
happened in the stutter benchmark.

Sure, it doesn't matter in that benchmark, because the pages are used
only once. But if it had an actual cache workingset bigger than DMA32
but smaller than DMA32+Normal, it would be thrashing unnecessarily.

If kswapd were truly balancing the pages in a node equally, regardless
of zone placement, then in the long run we should see zone allocations
converge to a share that is in proportion to each zone's size. As far
as I can see, that is not quite happening yet.

> > The fact that this doesn't make a performance difference in the
> > specific benchmarks you ran only proves just that: these specific
> > benchmarks don't care. IMO, benchmarking is not enough here. If this
> > is truly supposed to be unproblematic, then I think we need a reasoned
> > explanation. I can't imagine how it possibly could be, though.
> > 
> 
> The basic explanation is that reclaim is on a per-node basis and we
> no longer balance all zones, just one that is necessary to satisfy the
> original request that wokeup kswapd.
> 
> > If reclaim can't guarantee a balanced zone utilization then the
> > allocator has to keep doing it. :(
> 
> That's the key issue - the main reason balanced zone utilisation is
> necessary is because we reclaim on a per-zone basis and we must avoid
> page aging anomalies. If we balance such that one eligible zone is above
> the watermark then it's less of a concern.

Yes, but only if there can't be extended reclaim stretches that prefer
the pages of a single zone. Yet it looks like this is still possible.

I wonder if that were fixed by dropping patch 7/27? Potentially it
would need a bit more work than that. I.e. could we make kswapd
balance only for the highest classzone in the system, and thus make
address-limited allocations fend for themselves in direct reclaim?

This way, we would avoid that pathological interaction between kswapd
and the allocator, and kswapd would be guaranteed to balance fairly.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2
  2016-02-23 20:59     ` Johannes Weiner
@ 2016-02-23 21:58       ` Mel Gorman
  2016-02-24  0:12         ` Johannes Weiner
  0 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 21:58 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 12:59:15PM -0800, Johannes Weiner wrote:
> On Tue, Feb 23, 2016 at 08:19:32PM +0000, Mel Gorman wrote:
> > On Tue, Feb 23, 2016 at 12:04:16PM -0800, Johannes Weiner wrote:
> > > On Tue, Feb 23, 2016 at 03:04:23PM +0000, Mel Gorman wrote:
> > > > In many benchmarks, there is an obvious difference in the number of
> > > > allocations from each zone as the fair zone allocation policy is removed
> > > > towards the end of the series. For example, this is the allocation stats
> > > > when running blogbench that showed no difference in headling performance
> > > > 
> > > >                           mmotm-20160209   nodelru-v2
> > > > DMA allocs                           0           0
> > > > DMA32 allocs                   7218763      608067
> > > > Normal allocs                 12701806    18821286
> > > > Movable allocs                       0           0
> > > 
> > > According to the mmotm numbers, your DMA32 zone is over a third of
> > > available memory, yet in the nodelru-v2 kernel sees only 3% of the
> > > allocations.
> > 
> > In this case yes but blogbench is not scaled to memory size and is not
> > reclaim intensive. If you look, you'll see the total number of overall
> > allocations is very similar. During that test, there is a small amount of
> > kswapd scan activity (but not reclaim which is odd) at the start of the
> > test for nodelru but that's about it.
> 
> Yes, if fairness enforcement is now done by reclaim, then workloads
> without reclaim will show skewed placement as the Normal zone is again
> filled up first before moving on to the next zone.
> 
> That is fine. But what about the balance in reclaiming workloads?
> 

That is the key question -- whether node LRU reclaim renders it
unnecessary.

> > > That's an insanely high level of aging inversion, where
> > > the lifetime of a cache entry is again highly dependent on placement.
> > > 
> > 
> > The aging is now indepdant of what zone the page was allocated from because
> > it's node-based LRU reclaim. That may mean that the occupancy of individual
> > zones is now different but it should only matter if there is a large number
> > of address-limited requests.
> 
> The problem is that kswapd will stay awake and continuously draw
> subsequent allocations into a single zone, thus utilizing only a
> fraction of available memory.

Not quite. Look at prepare_kswapd_sleep() in the full series and it has this


        for (i = 0; i <= classzone_idx; i++) {
                struct zone *zone = pgdat->node_zones + i;

                if (!populated_zone(zone))
                        continue;

                if (zone_balanced(zone, order, 0, classzone_idx))
                        return true;
        }

and balance_pgdat has this

                /* Only reclaim if there are no eligible zones */
                for (i = classzone_idx; i >= 0; i--) {
                        zone = pgdat->node_zones + i;
                        if (!populated_zone(zone))
                                continue;

                        if (!zone_balanced(zone, order, 0, classzone_idx)) {
                                classzone_idx = i;
                                break;
                        }
                }

kswapd only stays awake until *one* balanced zone is available. That is
a key difference with the existing kswapd which balances all zones.

> A DMA32-limited kswapd wakeups can
> reclaim cache in DMA32 continuously if the allocator continously
> places new cache pages in that zone. It looks like that is what
> happened in the stutter benchmark.
> 

There may be corner cases where we artifically wake kswapd at DMA32
instead of a higher zone. If that happens, it should be addressed so
that only GFP_DMA32 wakes and reclaims that zone.

> Sure, it doesn't matter in that benchmark, because the pages are used
> only once. But if it had an actual cache workingset bigger than DMA32
> but smaller than DMA32+Normal, it would be thrashing unnecessarily.
> 
> If kswapd were truly balancing the pages in a node equally, regardless
> of zone placement, then in the long run we should see zone allocations
> converge to a share that is in proportion to each zone's size. As far
> as I can see, that is not quite happening yet.
> 

Not quite either. The order kswapd reclaims is in related to the age of
all pages in the node. Early in the lifetime of the system, that may be
ZONE_NORMAL initially until the other zones are populated. Ultimately
the balance of zones will be related to the age of the pages.

> > > The fact that this doesn't make a performance difference in the
> > > specific benchmarks you ran only proves just that: these specific
> > > benchmarks don't care. IMO, benchmarking is not enough here. If this
> > > is truly supposed to be unproblematic, then I think we need a reasoned
> > > explanation. I can't imagine how it possibly could be, though.
> > > 
> > 
> > The basic explanation is that reclaim is on a per-node basis and we
> > no longer balance all zones, just one that is necessary to satisfy the
> > original request that wokeup kswapd.
> > 
> > > If reclaim can't guarantee a balanced zone utilization then the
> > > allocator has to keep doing it. :(
> > 
> > That's the key issue - the main reason balanced zone utilisation is
> > necessary is because we reclaim on a per-zone basis and we must avoid
> > page aging anomalies. If we balance such that one eligible zone is above
> > the watermark then it's less of a concern.
> 
> Yes, but only if there can't be extended reclaim stretches that prefer
> the pages of a single zone. Yet it looks like this is still possible.
> 

And that is a problem if a workload is dominated by allocations
requiring the lower zones. If that is the common case then it's a bust
and fair zone allocation policy is still required. That removes one
motivation from the series as it leaves some fatness in the page
allocator paths.

> I wonder if that were fixed by dropping patch 7/27?

Potentially yes although it would be preferred to avoid unnecessarily
waking kswapd for a lower zone. That could be enforced by modifying
wake_all_kswapd() to always wake based on the highest available zone in
a pgdat that is below the zone required by the allocation request.

> Potentially it
> would need a bit more work than that. I.e. could we make kswapd
> balance only for the highest classzone in the system, and thus make
> address-limited allocations fend for themselves in direct reclaim?
> 

That would be a side-effect of modifying wake_all_kswapd. Would shoving
that in alleviate your concerns?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2
  2016-02-23 21:58       ` Mel Gorman
@ 2016-02-24  0:12         ` Johannes Weiner
  2016-02-24 10:46           ` Mel Gorman
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2016-02-24  0:12 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 09:58:59PM +0000, Mel Gorman wrote:
> On Tue, Feb 23, 2016 at 12:59:15PM -0800, Johannes Weiner wrote:
> > The problem is that kswapd will stay awake and continuously draw
> > subsequent allocations into a single zone, thus utilizing only a
> > fraction of available memory.
> 
> Not quite. Look at prepare_kswapd_sleep() in the full series and it has this
> 
> 
>         for (i = 0; i <= classzone_idx; i++) {
>                 struct zone *zone = pgdat->node_zones + i;
> 
>                 if (!populated_zone(zone))
>                         continue;
> 
>                 if (zone_balanced(zone, order, 0, classzone_idx))
>                         return true;
>         }
> 
> and balance_pgdat has this
> 
>                 /* Only reclaim if there are no eligible zones */
>                 for (i = classzone_idx; i >= 0; i--) {
>                         zone = pgdat->node_zones + i;
>                         if (!populated_zone(zone))
>                                 continue;
> 
>                         if (!zone_balanced(zone, order, 0, classzone_idx)) {
>                                 classzone_idx = i;
>                                 break;
>                         }
>                 }
> 
> kswapd only stays awake until *one* balanced zone is available. That is
> a key difference with the existing kswapd which balances all zones.

Thanks for clarifying, that is a good point. I applied the full series
now locally and the final code is indeed much easier to understand.

> > Sure, it doesn't matter in that benchmark, because the pages are used
> > only once. But if it had an actual cache workingset bigger than DMA32
> > but smaller than DMA32+Normal, it would be thrashing unnecessarily.
> > 
> > If kswapd were truly balancing the pages in a node equally, regardless
> > of zone placement, then in the long run we should see zone allocations
> > converge to a share that is in proportion to each zone's size. As far
> > as I can see, that is not quite happening yet.
> > 
> 
> Not quite either. The order kswapd reclaims is in related to the age of
> all pages in the node. Early in the lifetime of the system, that may be
> ZONE_NORMAL initially until the other zones are populated. Ultimately
> the balance of zones will be related to the age of the pages.

Thanks again. Yes, the picture is finally clicking into place for me.

> > > > If reclaim can't guarantee a balanced zone utilization then the
> > > > allocator has to keep doing it. :(
> > > 
> > > That's the key issue - the main reason balanced zone utilisation is
> > > necessary is because we reclaim on a per-zone basis and we must avoid
> > > page aging anomalies. If we balance such that one eligible zone is above
> > > the watermark then it's less of a concern.
> > 
> > Yes, but only if there can't be extended reclaim stretches that prefer
> > the pages of a single zone. Yet it looks like this is still possible.
> 
> And that is a problem if a workload is dominated by allocations
> requiring the lower zones. If that is the common case then it's a bust
> and fair zone allocation policy is still required. That removes one
> motivation from the series as it leaves some fatness in the page
> allocator paths.

With your above explanations, I'm now much more confident this series
is doing the right thing. Thanks.

The uncertainty over low-zone allocation floods is real, but what is
also unsettling is that, where the fair zone code used to shield us
from kswapd changes, we now open ourselves up to subtle aging bugs,
which are no longer detectable via the zone placement statistics. And
we have changed kswapd around quite extensively in the recent past.

A good metric for aging distortion might be able to mitigate both
these things. Something to keep an eye on when making changes to
kswapd, or when analyzing performance problems with a workload.

What I have in mind is per-classzone counters of reclaim work. If we
had exact numbers on how much zone-restricted reclaim is being done
relative to unrestricted scans, we could know how severely the aging
process is being distorted under any given workload. That would allow
us to validate these changes here, future kswapd and allocator
changes, and help us identify problematic workloads.

And maybe we can change the now useless pgalloc_ stats from counting
zone placement to counting allocation requests by classzone. We could
then again correlate the number of requests to the amount of work
done. A high amount of restricted reclaim on behalf of mostly Normal
allocation requests would detect the bug I described above, e.g. And
we could generally tell how expensive restricted allocations are in
the new node-LRUs.

What do you think?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 03/27] mm, vmstat: Add infrastructure for per-node vmstats
  2016-02-23 18:13   ` Johannes Weiner
@ 2016-02-24  9:19     ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-24  9:19 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 10:13:18AM -0800, Johannes Weiner wrote:
> On Tue, Feb 23, 2016 at 03:04:26PM +0000, Mel Gorman wrote:
> > VM statistic counters for reclaim decisions are zone-based. If the kernel
> > is to reclaim on a per-node basis then we need to track per-node statistics
> > but there is no infrastructure for that. The most notable change is that
> > the old node_page_state is renamed to sum_zone_node_page_state.  The new
> > node_page_state takes a pglist_data and uses per-node stats but none exist
> > yet. There is some renaming such as vm_stat to vm_zone_stat and the addition
> > of vm_node_stat and the renaming of mod_state to mod_zone_state. Otherwise,
> > this is mostly a mechanical patch with no functional change. There is a
> > lot of similarity between the node and zone helpers which is unfortunate
> > but there was no obvious way of reusing the code and maintaining type safety.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Hopefully we can eventually ditch /proc/zoneinfo in favor of a
> /proc/nodeinfo and get rid of the per-zone stats accounting.
> 

It may not be possible to ditch /proc/zoneinfo entirely but a /proc/nodeinfo
would make sense. It may interfere with userspace that's aware of kernel
internals but that may be manageable.

> In general, this patch looks good to me.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 

Thanks.

> Only one thing I noticed:
> 
> > @@ -349,12 +349,14 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> >  	shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
> >  	local_irq_enable();
> >  
> > -	if (memcg_kmem_enabled())
> > +	if (memcg_kmem_enabled()) {
> >  		pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
> >  						     LRU_ALL_FILE);
> > -	else
> > -		pages = node_page_state(sc->nid, NR_ACTIVE_FILE) +
> > -			node_page_state(sc->nid, NR_INACTIVE_FILE);
> > +	} else {
> > +		pg_data_t *pgdat = NODE_DATA(sc->nid);
> > +		pages = node_page_state(pgdat, NR_ACTIVE_FILE) +
> > +			node_page_state(pgdat, NR_INACTIVE_FILE);
> > +	}
> 
> That should also be sum_zone_node_page_state, right? These are not
> valid node items (yet).

Yep, not for another two patches. Fixed now.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 06/27] mm, vmscan: Begin reclaiming pages on a per-node basis
  2016-02-23 18:57   ` Johannes Weiner
  2016-02-23 19:03     ` Johannes Weiner
@ 2016-02-24 10:21     ` Mel Gorman
  1 sibling, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-24 10:21 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 10:57:22AM -0800, Johannes Weiner wrote:
> On Tue, Feb 23, 2016 at 03:04:29PM +0000, Mel Gorman wrote:
> > @@ -2428,10 +2448,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> >  			reclaimed = sc->nr_reclaimed;
> >  			scanned = sc->nr_scanned;
> >  
> > +			sc->reclaim_idx = reclaim_idx;
> >  			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
> >  			zone_lru_pages += lru_pages;
> 
> The setting of sc->reclaim_idx is unexpected here. Why not set it in
> the caller and eliminate the reclaim_idx parameter?
> 

Initially because it was easier to develop the patch for but it's good
either way. I updated this patch and handled the conflicts. It's now set
in the callers.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2
  2016-02-24  0:12         ` Johannes Weiner
@ 2016-02-24 10:46           ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-24 10:46 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 04:12:01PM -0800, Johannes Weiner wrote:
> > > > > If reclaim can't guarantee a balanced zone utilization then the
> > > > > allocator has to keep doing it. :(
> > > > 
> > > > That's the key issue - the main reason balanced zone utilisation is
> > > > necessary is because we reclaim on a per-zone basis and we must avoid
> > > > page aging anomalies. If we balance such that one eligible zone is above
> > > > the watermark then it's less of a concern.
> > > 
> > > Yes, but only if there can't be extended reclaim stretches that prefer
> > > the pages of a single zone. Yet it looks like this is still possible.
> > 
> > And that is a problem if a workload is dominated by allocations
> > requiring the lower zones. If that is the common case then it's a bust
> > and fair zone allocation policy is still required. That removes one
> > motivation from the series as it leaves some fatness in the page
> > allocator paths.
> 
> With your above explanations, I'm now much more confident this series
> is doing the right thing. Thanks.
> 
> The uncertainty over low-zone allocation floods is real, but what is
> also unsettling is that, where the fair zone code used to shield us
> from kswapd changes, we now open ourselves up to subtle aging bugs,
> which are no longer detectable via the zone placement statistics. And
> we have changed kswapd around quite extensively in the recent past.
> 
> A good metric for aging distortion might be able to mitigate both
> these things. Something to keep an eye on when making changes to
> kswapd, or when analyzing performance problems with a workload.
> 
> What I have in mind is per-classzone counters of reclaim work. If we
> had exact numbers on how much zone-restricted reclaim is being done
> relative to unrestricted scans, we could know how severely the aging
> process is being distorted under any given workload. That would allow
> us to validate these changes here, future kswapd and allocator
> changes, and help us identify problematic workloads.
> 

Ok, that makes me think that I should keep the per-zone pgscan figures
even if they are based on node LRU reclaim because we'll know what the
per-zone scan activity is. We already know how many pages get skipped
when reclaiming for lower zones.

> And maybe we can change the now useless pgalloc_ stats from counting
> zone placement to counting allocation requests by classzone.

I can't convince myself about this one way or the other.

> We could
> then again correlate the number of requests to the amount of work
> done. A high amount of restricted reclaim on behalf of mostly Normal
> allocation requests would detect the bug I described above, e.g. And
> we could generally tell how expensive restricted allocations are in
> the new node-LRUs.
> 

I keep thinking the skip statistics gives us similar data -- it does
not tell us how many restricted allocations that resulted in reclaim was
but we do get an idea of the amount of work caused.

I'll think about it some more and see what I come up with.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 07/27] mm, vmscan: Have kswapd only scan based on the highest requested zone
  2016-02-23 15:04 ` [PATCH 07/27] mm, vmscan: Have kswapd only scan based on the highest requested zone Mel Gorman
@ 2016-02-25 22:17   ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-25 22:17 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:30PM +0000, Mel Gorman wrote:
> kswapd checks all eligible zones to see if they need balancing even if it was
> woken for a lower zone. This made sense when we reclaimed on a per-zone basis
> because we wanted to shrink zones fairly so avoid age-inversion problems.
> Ideally this is completely unnecessary when reclaiming on a per-node basis.
> In theory, there may still be anomalies when all requests are for lower
> zones and very old pages are preserved in higher zones but this should be
> the exceptional case.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

As I mentioned in the other subthread, this should probably be fine,
although it would be good, generally, to have some sort of statistics
on how much of the age-distorting reclaim is happening, and then maybe
mention its existence in the changelog of this patch. Just an idea.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-02-23 15:04 ` [PATCH 08/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
@ 2016-02-28 16:16   ` Johannes Weiner
  2016-03-03 13:46   ` Vlastimil Babka
  1 sibling, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-28 16:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:31PM +0000, Mel Gorman wrote:
> @@ -3299,20 +3191,37 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			break;
>  
>  		/*
> +		 * Stop reclaiming if any eligible zone is balanced and clear
> +		 * node writeback or congested.
> +		 */
> +		for (i = 0; i <= classzone_idx; i++) {
> +			zone = pgdat->node_zones + i;
> +			if (!populated_zone(zone))
> +				continue;
> +
> +			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
> +				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> +				clear_bit(PGDAT_DIRTY, &pgdat->flags);
> +				goto out;
> +			}
> +		}
> +
> +		/*
>  		 * Raise priority if scanning rate is too low or there was no
>  		 * progress in reclaiming pages
>  		 */
>  		if (raise_priority || !sc.nr_reclaimed)
>  			sc.priority--;
> -	} while (sc.priority >= 1 &&
> -			!pgdat_balanced(pgdat, order, classzone_idx));
> +	} while (sc.priority >= 1);
>  
>  out:
>  	/*
> -	 * Return the highest zone idx we were reclaiming at so
> -	 * prepare_kswapd_sleep() makes the same decisions as here.
> +	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
> +	 * makes a decision on the order we were last reclaiming at. However,
> +	 * if another caller entered the allocator slow path while kswapd
> +	 * was awake, order will remain at the higher level
>  	 */
> -	return end_zone;
> +	return order;

It's sc.order that's updated based on fragmentation, not order.

There is also a now-stale comment above the function saying it returns
the highest reclaimed zone index.

Otherwise, the patch looks good to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 09/27] mm, vmscan: Simplify the logic deciding whether kswapd sleeps
  2016-02-23 15:04 ` [PATCH 09/27] mm, vmscan: Simplify the logic deciding whether kswapd sleeps Mel Gorman
@ 2016-02-28 16:16   ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-28 16:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:32PM +0000, Mel Gorman wrote:
> kswapd goes through some complex steps trying to figure out if it
> should stay awake based on the classzone_idx and the requested order.
> It is unnecessarily complex and passes in an invalid classzone_idx to
> balance_pgdat().  What matters most of all is whether a larger order has
> been requsted and whether kswapd successfully reclaimed at the previous
> order. This patch irons out the logic to check just that and the end result
> is less headache inducing.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 10/27] mm, vmscan: By default have direct reclaim only shrink once per node
  2016-02-23 15:04 ` [PATCH 10/27] mm, vmscan: By default have direct reclaim only shrink once per node Mel Gorman
@ 2016-02-28 16:17   ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-28 16:17 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:33PM +0000, Mel Gorman wrote:
> Direct reclaim iterates over all zones in the zonelist and shrinking them
> but this is in conflict with node-based reclaim. In the default case,
> only shrink once per node.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 15/27] mm, workingset: Make working set detection node-aware
  2016-02-23 15:04 ` [PATCH 15/27] mm, workingset: Make working set detection node-aware Mel Gorman
@ 2016-02-28 16:17   ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-28 16:17 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:04:38PM +0000, Mel Gorman wrote:
> @@ -167,33 +167,30 @@
>   */
>  static unsigned int bucket_order __read_mostly;
>  
> -static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
> +static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
>  {
>  	eviction >>= bucket_order;
>  	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
> -	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
> -	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> +	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
>  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

You need to remove ZONES_SHIFT from the EVICTION_SHIFT as well.

Other than that it looks good to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 16/27] mm, page_alloc: Consider dirtyable memory in terms of nodes
  2016-02-23 15:17 ` [PATCH 16/27] mm, page_alloc: Consider dirtyable memory in terms of nodes Mel Gorman
@ 2016-02-28 16:17   ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2016-02-28 16:17 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Vlastimil Babka, LKML

On Tue, Feb 23, 2016 at 03:17:55PM +0000, Mel Gorman wrote:
> @@ -686,6 +680,12 @@ typedef struct pglist_data {
>  	/* Number of pages migrated during the rate limiting time interval */
>  	unsigned long numabalancing_migrate_nr_pages;
>  #endif
> +	/*
> +	 * This is a per-zone reserve of pages that are not available
> +	 * to userspace allocations.
> +	 */
> +	unsigned long		totalreserve_pages;

"per-node reserve"

> @@ -297,22 +306,11 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>  	int node;
>  	unsigned long x = 0;
>  
> -	for_each_node_state(node, N_HIGH_MEMORY) {
> -		struct zone *z = &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> -
> -		x += zone_dirtyable_memory(z);
> -	}
>  	/*
> -	 * Unreclaimable memory (kernel memory or anonymous memory
> -	 * without swap) can bring down the dirtyable pages below
> -	 * the zone's dirty balance reserve and the above calculation
> -	 * will underflow.  However we still want to add in nodes
> -	 * which are below threshold (negative values) to get a more
> -	 * accurate calculation but make sure that the total never
> -	 * underflows.
> +	 * LRU lists are per-node so there is accurate way of accurately
> +	 * calculating dirtyable memory of just the high zone

"no accurate way of calculating"

> @@ -2665,7 +2665,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		 * will require awareness of zones in the
>  		 * dirty-throttling and the flusher threads.
>  		 */
> -		if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
> +		if (ac->spread_dirty_pages && !node_dirty_ok(zone->zone_pgdat))
>  			continue;

The comment above this branch can be updated. I'm attaching a diff
below, feel free to use it.

>  		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
> @@ -6333,7 +6333,7 @@ static void calculate_totalreserve_pages(void)
>  			if (max > zone->managed_pages)
>  				max = zone->managed_pages;
>  
> -			zone->totalreserve_pages = max;
> +			pgdat->totalreserve_pages += max;

calculate_totalreserve_pages() can be called repeatedly. It needs to
be set freshly in this function, not added to.

---

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c461a94..fedd0b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2596,28 +2596,21 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				continue;
 		/*
 		 * When allocating a page cache page for writing, we
-		 * want to get it from a zone that is within its dirty
-		 * limit, such that no single zone holds more than its
+		 * want to get it from a node that is within its dirty
+		 * limit, such that no node zone holds more than its
 		 * proportional share of globally allowed dirty pages.
-		 * The dirty limits take into account the zone's
+		 * The dirty limits take into account the node's
 		 * lowmem reserves and high watermark so that kswapd
 		 * should be able to balance it without having to
 		 * write pages from its LRU list.
 		 *
-		 * This may look like it could increase pressure on
-		 * lower zones by failing allocations in higher zones
-		 * before they are full.  But the pages that do spill
-		 * over are limited as the lower zones are protected
-		 * by this very same mechanism.  It should not become
-		 * a practical burden to them.
-		 *
 		 * XXX: For now, allow allocations to potentially
-		 * exceed the per-zone dirty limit in the slowpath
+		 * exceed the per-node dirty limit in the slowpath
 		 * (spread_dirty_pages unset) before going into reclaim,
 		 * which is important when on a NUMA setup the allowed
-		 * zones are together not big enough to reach the
+		 * nodes are together not big enough to reach the
 		 * global limit.  The proper fix for these situations
-		 * will require awareness of zones in the
+		 * will require awareness of nodes in the
 		 * dirty-throttling and the flusher threads.
 		 */
 		if (ac->spread_dirty_pages) {

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone)
  2016-02-23 15:04 ` [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone) Mel Gorman
  2016-02-23 18:04   ` Johannes Weiner
@ 2016-03-03 10:37   ` Vlastimil Babka
  1 sibling, 0 replies; 58+ messages in thread
From: Vlastimil Babka @ 2016-03-03 10:37 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM; +Cc: Rik van Riel, Johannes Weiner, LKML

On 02/23/2016 04:04 PM, Mel Gorman wrote:
> ac->classzone_idx is determined by the index of the preferred zone and cached
> to avoid repeated calculations. wake_all_kswapds() should use it instead of
> using zone_idx() within a loop.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim
  2016-02-23 15:04 ` [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim Mel Gorman
  2016-02-23 18:06   ` Johannes Weiner
@ 2016-03-03 11:31   ` Vlastimil Babka
  2016-03-09 11:59     ` Mel Gorman
  1 sibling, 1 reply; 58+ messages in thread
From: Vlastimil Babka @ 2016-03-03 11:31 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, LKML, Peter Zijlstra, Li Zefan, cgroups

On 02/23/2016 04:04 PM, Mel Gorman wrote:
> Direct reclaim obeys cpusets but misses the cpusets_enabled() check.
> The overhead is unlikely to be measurable in the direct reclaim
> path which is expensive but there is no harm is doing it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 86eb21491867..de8d6226e026 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2566,7 +2566,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		 * to global LRU.
>  		 */
>  		if (global_reclaim(sc)) {
> -			if (!cpuset_zone_allowed(zone,
> +			if (cpusets_enabled() && !cpuset_zone_allowed(zone,
>  						 GFP_KERNEL | __GFP_HARDWALL))
>  				continue;

Hmm, wouldn't it be nicer if cpuset_zone_allowed() itself did the right
thing, and not each caller?

How about the patch below? (+CC)

----8<----
From: Vlastimil Babka <vbabka@suse.cz>
Date: Thu, 3 Mar 2016 12:12:09 +0100
Subject: [PATCH] cpuset: use static key better and convert to new API

An important function for cpusets is cpuset_node_allowed(), which acknowledges
that if there's a single root CPU set, it must be trivially allowed. But the
check "nr_cpusets() <= 1" doesn't use the cpusets_enabled_key static key in a
proper way where static keys can reduce the overhead.

This patch converts it so that static key is used properly. It's also switched
to the new static key API and the checking functions are converted to return
bool instead of int. We also provide a new variant __cpuset_zone_allowed()
which expects that the static key check was already done and was true. This is
needed for get_page_from_freelist() where we want to avoid the relatively
slower check if ALLOC_CPUSET is not set in alloc_flags.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/cpuset.h | 29 +++++++++++++++++++----------
 kernel/cpuset.c        | 14 +++++++-------
 mm/page_alloc.c        |  2 +-
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index fea160ee5803..d5f5c7f87e36 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -16,26 +16,26 @@
 
 #ifdef CONFIG_CPUSETS
 
-extern struct static_key cpusets_enabled_key;
+extern struct static_key_false cpusets_enabled_key;
 static inline bool cpusets_enabled(void)
 {
-	return static_key_false(&cpusets_enabled_key);
+	return static_branch_unlikely(&cpusets_enabled_key);
 }
 
 static inline int nr_cpusets(void)
 {
 	/* jump label reference count + the top-level cpuset */
-	return static_key_count(&cpusets_enabled_key) + 1;
+	return static_key_count(&cpusets_enabled_key.key) + 1;
 }
 
 static inline void cpuset_inc(void)
 {
-	static_key_slow_inc(&cpusets_enabled_key);
+	static_branch_inc(&cpusets_enabled_key);
 }
 
 static inline void cpuset_dec(void)
 {
-	static_key_slow_dec(&cpusets_enabled_key);
+	static_branch_dec(&cpusets_enabled_key);
 }
 
 extern int cpuset_init(void);
@@ -48,16 +48,25 @@ extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 void cpuset_init_current_mems_allowed(void);
 int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
-extern int __cpuset_node_allowed(int node, gfp_t gfp_mask);
+extern bool __cpuset_node_allowed(int node, gfp_t gfp_mask);
 
-static inline int cpuset_node_allowed(int node, gfp_t gfp_mask)
+static inline bool cpuset_node_allowed(int node, gfp_t gfp_mask)
 {
-	return nr_cpusets() <= 1 || __cpuset_node_allowed(node, gfp_mask);
+	if (cpusets_enabled())
+		return __cpuset_node_allowed(node, gfp_mask);
+	return true;
 }
 
-static inline int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
+static inline bool __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
+{
+	return __cpuset_node_allowed(zone_to_nid(z), gfp_mask);
+}
+
+static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
 {
-	return cpuset_node_allowed(zone_to_nid(z), gfp_mask);
+	if (cpusets_enabled())
+		return __cpuset_zone_allowed(z, gfp_mask);
+	return true;
 }
 
 extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 00ab5c2b7c5b..37a0b44d101f 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -62,7 +62,7 @@
 #include <linux/cgroup.h>
 #include <linux/wait.h>
 
-struct static_key cpusets_enabled_key __read_mostly = STATIC_KEY_INIT_FALSE;
+DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
 
 /* See "Frequency meter" comments, below. */
 
@@ -2528,27 +2528,27 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  *	GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
  */
-int __cpuset_node_allowed(int node, gfp_t gfp_mask)
+bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 {
 	struct cpuset *cs;		/* current cpuset ancestors */
 	int allowed;			/* is allocation in zone z allowed? */
 	unsigned long flags;
 
 	if (in_interrupt())
-		return 1;
+		return true;
 	if (node_isset(node, current->mems_allowed))
-		return 1;
+		return true;
 	/*
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
 	if (unlikely(test_thread_flag(TIF_MEMDIE)))
-		return 1;
+		return true;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
-		return 0;
+		return false;
 
 	if (current->flags & PF_EXITING) /* Let dying task have memory */
-		return 1;
+		return true;
 
 	/* Not hardwall and node outside mems_allowed: scan up cpusets */
 	spin_lock_irqsave(&callback_lock, flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..b857e9f8bc58 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2628,7 +2628,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 		if (cpusets_enabled() &&
 			(alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed(zone, gfp_mask))
+			!__cpuset_zone_allowed(zone, gfp_mask))
 				continue;
 		/*
 		 * Distribute pages in proportion to the individual
-- 
2.7.2

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-02-23 15:04 ` [PATCH 08/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
  2016-02-28 16:16   ` Johannes Weiner
@ 2016-03-03 13:46   ` Vlastimil Babka
  2016-03-09 14:45     ` Mel Gorman
  1 sibling, 1 reply; 58+ messages in thread
From: Vlastimil Babka @ 2016-03-03 13:46 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM; +Cc: Rik van Riel, Johannes Weiner, LKML

On 02/23/2016 04:04 PM, Mel Gorman wrote:
> -static bool zone_balanced(struct zone *zone, int order, bool highorder,
> +static bool zone_balanced(struct zone *zone, int order,
>  			unsigned long balance_gap, int classzone_idx)
>  {
>  	unsigned long mark = high_wmark_pages(zone) + balance_gap;
>  
> -	/*
> -	 * When checking from pgdat_balanced(), kswapd should stop and sleep
> -	 * when it reaches the high order-0 watermark and let kcompactd take
> -	 * over. Other callers such as wakeup_kswapd() want to determine the
> -	 * true high-order watermark.
> -	 */
> -	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
> -		mark += (1UL << order);
> -		order = 0;
> -	}
> -
>  	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);

Did you really intend to remove this or was it due to rebasing on top of
kcompactd?
My intention was that kswapd will consider zone balanced just by having
enough base pages. Maybe the next patches will reintroduce this differently?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim
  2016-03-03 11:31   ` Vlastimil Babka
@ 2016-03-09 11:59     ` Mel Gorman
  2016-03-09 12:30       ` Vlastimil Babka
  0 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2016-03-09 11:59 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linux-MM, Rik van Riel, Johannes Weiner, LKML, Peter Zijlstra,
	Li Zefan, cgroups

On Thu, Mar 03, 2016 at 12:31:40PM +0100, Vlastimil Babka wrote:
> On 02/23/2016 04:04 PM, Mel Gorman wrote:
> > Direct reclaim obeys cpusets but misses the cpusets_enabled() check.
> > The overhead is unlikely to be measurable in the direct reclaim
> > path which is expensive but there is no harm is doing it.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  mm/vmscan.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 86eb21491867..de8d6226e026 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2566,7 +2566,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> >  		 * to global LRU.
> >  		 */
> >  		if (global_reclaim(sc)) {
> > -			if (!cpuset_zone_allowed(zone,
> > +			if (cpusets_enabled() && !cpuset_zone_allowed(zone,
> >  						 GFP_KERNEL | __GFP_HARDWALL))
> >  				continue;
> 
> Hmm, wouldn't it be nicer if cpuset_zone_allowed() itself did the right
> thing, and not each caller?
> 
> How about the patch below? (+CC)
> 

The patch appears to be layer upon the entire series but that in itself
is ok. This part is a problem

> An important function for cpusets is cpuset_node_allowed(), which acknowledges
> that if there's a single root CPU set, it must be trivially allowed. But the
> check "nr_cpusets() <= 1" doesn't use the cpusets_enabled_key static key in a
> proper way where static keys can reduce the overhead.


There is one check for the static key and a second for the count to see
if it's likely a valid cpuset that matters has been configured. The
point of that check was that it was lighter than __cpuset_zone_allowed
in the case where no cpuset is configured.

The patches are not equivalent.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim
  2016-03-09 11:59     ` Mel Gorman
@ 2016-03-09 12:30       ` Vlastimil Babka
  0 siblings, 0 replies; 58+ messages in thread
From: Vlastimil Babka @ 2016-03-09 12:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Rik van Riel, Johannes Weiner, LKML, Peter Zijlstra,
	Li Zefan, cgroups

On 03/09/2016 12:59 PM, Mel Gorman wrote:
> On Thu, Mar 03, 2016 at 12:31:40PM +0100, Vlastimil Babka wrote:
>> On 02/23/2016 04:04 PM, Mel Gorman wrote:
>>> Direct reclaim obeys cpusets but misses the cpusets_enabled() check.
>>> The overhead is unlikely to be measurable in the direct reclaim
>>> path which is expensive but there is no harm is doing it.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>>> ---
>>>  mm/vmscan.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 86eb21491867..de8d6226e026 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -2566,7 +2566,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>>>  		 * to global LRU.
>>>  		 */
>>>  		if (global_reclaim(sc)) {
>>> -			if (!cpuset_zone_allowed(zone,
>>> +			if (cpusets_enabled() && !cpuset_zone_allowed(zone,
>>>  						 GFP_KERNEL | __GFP_HARDWALL))
>>>  				continue;
>>
>> Hmm, wouldn't it be nicer if cpuset_zone_allowed() itself did the right
>> thing, and not each caller?
>>
>> How about the patch below? (+CC)
>>
> 
> The patch appears to be layer upon the entire series but that in itself

It could be also completely separate, witch your 02/27 dropped as it's
not tied to the rework anyway? Or did I miss something else cpuset
related in later patches?

> is ok. This part is a problem
> 
>> An important function for cpusets is cpuset_node_allowed(), which acknowledges
>> that if there's a single root CPU set, it must be trivially allowed. But the
>> check "nr_cpusets() <= 1" doesn't use the cpusets_enabled_key static key in a
>> proper way where static keys can reduce the overhead.
> 
> 
> There is one check for the static key and a second for the count to see
> if it's likely a valid cpuset that matters has been configured.

The point is that these should be equivalent, as the static key becomes
enabled only when there's more than one (root) cpuset. So checking
"nr_cpusets() <= 1" does the same as "!cpusets_enabled()", but without
taking advantage of the static key code patching.

> The
> point of that check was that it was lighter than __cpuset_zone_allowed
> in the case where no cpuset is configured.

But shrink_zones() (which you were patching) uses cpuset_zone_allowed(),
not __cpuset_zone_allowed(). The latter is provided only for
get_page_from_freelist(), which inserts extra fast check between
cpusets_enabled() and the slow cpuset allowed checking.

> The patches are not equivalent.
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 08/27] mm, vmscan: Make kswapd reclaim in terms of nodes
  2016-03-03 13:46   ` Vlastimil Babka
@ 2016-03-09 14:45     ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-03-09 14:45 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Linux-MM, Rik van Riel, Johannes Weiner, LKML

On Thu, Mar 03, 2016 at 02:46:14PM +0100, Vlastimil Babka wrote:
> On 02/23/2016 04:04 PM, Mel Gorman wrote:
> > -static bool zone_balanced(struct zone *zone, int order, bool highorder,
> > +static bool zone_balanced(struct zone *zone, int order,
> >  			unsigned long balance_gap, int classzone_idx)
> >  {
> >  	unsigned long mark = high_wmark_pages(zone) + balance_gap;
> >  
> > -	/*
> > -	 * When checking from pgdat_balanced(), kswapd should stop and sleep
> > -	 * when it reaches the high order-0 watermark and let kcompactd take
> > -	 * over. Other callers such as wakeup_kswapd() want to determine the
> > -	 * true high-order watermark.
> > -	 */
> > -	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
> > -		mark += (1UL << order);
> > -		order = 0;
> > -	}
> > -
> >  	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
> 
> Did you really intend to remove this or was it due to rebasing on top of
> kcompactd?

It was intentional because kswapd_shrink_node() sets sc.order to 0 when
pages have been reclaimed. There is no guarantee it'll be enough to actually
compact but it's conservative in terms of reclaim. The end result should
be that this check is no longer necessary.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone)
  2016-04-06 11:19 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v3 Mel Gorman
@ 2016-04-06 11:20 ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-04-06 11:20 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

ac->classzone_idx is determined by the index of the preferred zone and cached
to avoid repeated calculations. wake_all_kswapds() should use it instead of
using zone_idx() within a loop.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59de90d5d3a3..2643d10dee98 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3049,7 +3049,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 						ac->high_zoneidx, ac->nodemask)
-		wakeup_kswapd(zone, order, zone_idx(ac->preferred_zone));
+		wakeup_kswapd(zone, order, ac->classzone_idx);
 }
 
 static inline int
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone)
  2016-02-23 13:44 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
@ 2016-02-23 13:44 ` Mel Gorman
  0 siblings, 0 replies; 58+ messages in thread
From: Mel Gorman @ 2016-02-23 13:44 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

ac->classzone_idx is determined by the index of the preferred zone and cached
to avoid repeated calculations. wake_all_kswapds() should use it instead of
using zone_idx() within a loop.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c849a202b6f1..46ecbfa5b228 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3022,7 +3022,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 						ac->high_zoneidx, ac->nodemask)
-		wakeup_kswapd(zone, order, zone_idx(ac->preferred_zone));
+		wakeup_kswapd(zone, order, ac->classzone_idx);
 }
 
 static inline int
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2016-04-06 11:29 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-23 15:04 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
2016-02-23 15:04 ` [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone) Mel Gorman
2016-02-23 18:04   ` Johannes Weiner
2016-03-03 10:37   ` Vlastimil Babka
2016-02-23 15:04 ` [PATCH 02/27] mm, vmscan: Check if cpusets are enabled during direct reclaim Mel Gorman
2016-02-23 18:06   ` Johannes Weiner
2016-03-03 11:31   ` Vlastimil Babka
2016-03-09 11:59     ` Mel Gorman
2016-03-09 12:30       ` Vlastimil Babka
2016-02-23 15:04 ` [PATCH 03/27] mm, vmstat: Add infrastructure for per-node vmstats Mel Gorman
2016-02-23 18:13   ` Johannes Weiner
2016-02-24  9:19     ` Mel Gorman
2016-02-23 15:04 ` [PATCH 04/27] mm, vmscan: Move lru_lock to the node Mel Gorman
2016-02-23 18:40   ` Johannes Weiner
2016-02-23 15:04 ` [PATCH 05/27] mm, vmscan: Move LRU lists to node Mel Gorman
2016-02-23 18:42   ` Johannes Weiner
2016-02-23 15:04 ` [PATCH 06/27] mm, vmscan: Begin reclaiming pages on a per-node basis Mel Gorman
2016-02-23 18:57   ` Johannes Weiner
2016-02-23 19:03     ` Johannes Weiner
2016-02-24 10:21     ` Mel Gorman
2016-02-23 15:04 ` [PATCH 07/27] mm, vmscan: Have kswapd only scan based on the highest requested zone Mel Gorman
2016-02-25 22:17   ` Johannes Weiner
2016-02-23 15:04 ` [PATCH 08/27] mm, vmscan: Make kswapd reclaim in terms of nodes Mel Gorman
2016-02-28 16:16   ` Johannes Weiner
2016-03-03 13:46   ` Vlastimil Babka
2016-03-09 14:45     ` Mel Gorman
2016-02-23 15:04 ` [PATCH 09/27] mm, vmscan: Simplify the logic deciding whether kswapd sleeps Mel Gorman
2016-02-28 16:16   ` Johannes Weiner
2016-02-23 15:04 ` [PATCH 10/27] mm, vmscan: By default have direct reclaim only shrink once per node Mel Gorman
2016-02-28 16:17   ` Johannes Weiner
2016-02-23 15:04 ` [PATCH 11/27] mm, vmscan: Clear congestion, dirty and need for compaction on a per-node basis Mel Gorman
2016-02-23 15:04 ` [PATCH 12/27] mm: vmscan: Do not reclaim from kswapd if there is any eligible zone Mel Gorman
2016-02-23 15:04 ` [PATCH 13/27] mm, vmscan: Make shrink_node decisions more node-centric Mel Gorman
2016-02-23 15:04 ` [PATCH 14/27] mm, memcg: Move memcg limit enforcement from zones to nodes Mel Gorman
2016-02-23 15:04 ` [PATCH 15/27] mm, workingset: Make working set detection node-aware Mel Gorman
2016-02-28 16:17   ` Johannes Weiner
2016-02-23 15:17 ` [PATCH 16/27] mm, page_alloc: Consider dirtyable memory in terms of nodes Mel Gorman
2016-02-28 16:17   ` Johannes Weiner
2016-02-23 15:18 ` [PATCH 17/27] mm: Move page mapped accounting to the node Mel Gorman
2016-02-23 15:18 ` [PATCH 18/27] mm: Rename NR_ANON_PAGES to NR_ANON_MAPPED Mel Gorman
2016-02-23 15:18 ` [PATCH 19/27] mm: Move most file-based accounting to the node Mel Gorman
2016-02-23 15:19 ` [PATCH 20/27] mm: Move vmscan writes and file write " Mel Gorman
2016-02-23 15:19 ` [PATCH 21/27] mm, vmscan: Update classzone_idx if buffer_heads_over_limit Mel Gorman
2016-02-23 15:19 ` [PATCH 22/27] mm, vmscan: Only wakeup kswapd once per node for the requested classzone Mel Gorman
2016-02-23 15:20 ` [PATCH 23/27] mm, vmscan: Account in vmstat for pages skipped during reclaim Mel Gorman
2016-02-23 15:20 ` [PATCH 24/27] mm: Convert zone_reclaim to node_reclaim Mel Gorman
2016-02-23 15:20 ` [PATCH 25/27] mm, vmscan: Add classzone information to tracepoints Mel Gorman
2016-02-23 15:21 ` [PATCH 26/27] mm, page_alloc: Remove fair zone allocation policy Mel Gorman
2016-02-23 15:21 ` [PATCH 27/27] mm: page_alloc: Cache the last node whose dirty limit is reached Mel Gorman
2016-02-23 17:15 ` [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Christoph Lameter
2016-02-23 20:04 ` Johannes Weiner
2016-02-23 20:19   ` Mel Gorman
2016-02-23 20:59     ` Johannes Weiner
2016-02-23 21:58       ` Mel Gorman
2016-02-24  0:12         ` Johannes Weiner
2016-02-24 10:46           ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2016-04-06 11:19 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v3 Mel Gorman
2016-04-06 11:20 ` [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone) Mel Gorman
2016-02-23 13:44 [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2 Mel Gorman
2016-02-23 13:44 ` [PATCH 01/27] mm, page_alloc: Use ac->classzone_idx instead of zone_idx(preferred_zone) Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).