All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
@ 2016-07-08  9:34 ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Minor changes this time

Changelog since v8
o Cosmetic cleanups to comments
o Calculate node vmstat threshold based on the largest zone in the node
o Align retry checks with decisions made by the OOM killer
o Avoid tricks with -1 and kswapd_classzone_idx
o More consistent handling of buffer_heads_over_limit

Changelog since v7
o Rebase onto current mmots
o Avoid double accounting of stats in node and zone
o Kswapd will avoid more reclaim if an eligible zone is available
o Remove some duplications of sc->reclaim_idx and classzone_idx
o Print per-node stats in zoneinfo

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node that is based upon 4.7-rc4 with Andrew's tree applied. While this
is a current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details. Some
of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from.  This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in distribution of file/anon pages due
   to when they were allocated results can result in a difference in 
   again. While the fair zone allocation policy mitigates some of the
   problems here, the page reclaim results on a multi-zone node will
   always be different to a single-zone node.
   it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other but it's sensitive to timing.  This
   mitigates the page allocator using pages that were allocated very recently
   in the ideal case but it's sensitive to timing. When kswapd is allocating
   from lower zones then it's great but during the rebalancing of the highest
   zone, the page allocator and kswapd interfere with each other. It's worse
   if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes. 

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

                                           4.7.0-rc4                  4.7.0-rc4
                                      mmotm-20160623                 nodelru-v9
Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623nodelru-v8
User          189.19      191.80
System       2604.45     2533.56
Elapsed      2855.30     2786.39

The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;


                             4.7.0-rc3   4.7.0-rc3
                         mmotm-20160623 nodelru-v8
DMA32 allocs               28794729769           0
Normal allocs              48432501431 77227309877
Movable allocs                       0           0

tiobench on ext4
----------------

tiobench is a benchmark that artifically benefits if old pages remain resident
while new pages get reclaimed. The fair zone allocation policy mitigates this
problem so pages age fairly. While the benchmark has problems, it is important
that tiobench performance remains constant as it implies that page aging
problems that the fair zone allocation policy fixes are not re-introduced.

                                         4.7.0-rc4             4.7.0-rc4
                                    mmotm-20160623            nodelru-v9
Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 approx-v9
User          645.72      525.90
System        403.85      331.75
Elapsed      6795.36     6783.67

This shows that the series has little or not impact on tiobench which is
desirable and a reduction in system CPU usage. It indicates that the fair
zone allocation policy was removed in a manner that didn't reintroduce
one class of page aging bug. There were only minor differences in overall
reclaim activity

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Minor Faults                    645838      647465
Major Faults                       573         640
Swap Ins                             0           0
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                  46041453    44190646
Normal allocs                 78053072    79887245
Movable allocs                       0           0
Allocation stalls                   24          67
Stall zone DMA                       0           0
Stall zone DMA32                     0           0
Stall zone Normal                    0           2
Stall zone HighMem                   0           0
Stall zone Movable                   0          65
Direct pages scanned             10969       30609
Kswapd pages scanned          93375144    93492094
Kswapd pages reclaimed        93372243    93489370
Direct pages reclaimed           10969       30609
Kswapd efficiency                  99%         99%
Kswapd velocity              13741.015   13781.934
Direct efficiency                 100%        100%
Direct velocity                  1.614       4.512
Percentage direct scans             0%          0%

kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 4 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe

pgbench Transactions
                        4.7.0-rc4             4.7.0-rc4
                   mmotm-20160623            nodelru-v8
Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts of
data from disk.

                               4.7.0-rc3             4.7.0-rc3
                          mmotm-20160623            nodelru-v9
Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 nodelru-v9
User         1548.01     1552.44
System       8609.71     8515.08
Elapsed      3587.10     3594.54

There is little or no change in performance but some drop in system CPU usage.

                             4.7.0-rc3   4.7.0-rc3
                        mmotm-20160623  nodelru-v9
Minor Faults                    362662      367360
Major Faults                      1204        1143
Swap Ins                            22           0
Swap Outs                         2855        1029
DMA allocs                           0           0
DMA32 allocs                  31409797    28837521
Normal allocs                 46611853    49231282
Movable allocs                       0           0
Direct pages scanned                 0           0
Kswapd pages scanned          40845270    40869088
Kswapd pages reclaimed        40830976    40855294
Direct pages reclaimed               0           0
Kswapd efficiency                  99%         99%
Kswapd velocity              11386.711   11369.769
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Page writes by reclaim            2855        1029
Page writes file                     0           0
Page writes anon                  2855        1029
Page reclaim immediate             771        1628
Sector Reads                 293312636   293536360
Sector Writes                 18213568    18186480
Page rescued immediate               0           0
Slabs scanned                   128257      132747
Direct inode steals                181          56
Kswapd inode steals                 59        1131

It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in numbers of inodes reclaimed but the workload has few active inodes and
is likely a timing artifact.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                             4.7.0-rc4             4.7.0-rc4
                        mmotm-20160623            nodelru-v8
Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)

This shows a number of improvements with the worst-case outlier greatly
improved.

Some of the vmstats are interesting

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Swap Ins                           163         502
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                 618719206  1381662383
Normal allocs                891235743   564138421
Movable allocs                       0           0
Allocation stalls                 2603           1
Direct pages scanned            216787           2
Kswapd pages scanned          50719775    41778378
Kswapd pages reclaimed        41541765    41777639
Direct pages reclaimed          209159           0
Kswapd efficiency                  81%         99%
Kswapd velocity              16859.554   14329.059
Direct efficiency                  96%          0%
Direct velocity                 72.061       0.001
Percentage direct scans             0%          0%
Page writes by reclaim         6215049           0
Page writes file               6215049           0
Page writes anon                     0           0
Page reclaim immediate           70673          90
Sector Reads                  81940800    81680456
Sector Writes                100158984    98816036
Page rescued immediate               0           0
Slabs scanned                  1366954       22683

While this is not guaranteed in all cases, this particular test showed
a large reduction in direct reclaim activity. It's also worth noting
that no page writes were issued from reclaim context.

This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
that area.

1. Reclaim/compaction is going to be affected because the amount of reclaim is
   no longer targetted at a specific zone. Compaction works on a per-zone basis
   so there is no guarantee that reclaiming a few THP's worth page pages will
   have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
   are called is now different. This may or may not be a problem but if it
   is, it'll be because shrinkers are not called enough and some balancing
   is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
   distributed between zones and the fair zone allocation policy used to do
   something very similar for anon. The distribution is now different but not
   necessarily in any way that matters but it's still worth bearing in mind.

 Documentation/cgroup-v1/memcg_test.txt        |   4 +-
 Documentation/cgroup-v1/memory.txt            |   4 +-
 arch/s390/appldata/appldata_mem.c             |   2 +-
 arch/tile/mm/pgtable.c                        |  18 +-
 drivers/base/node.c                           |  77 ++-
 drivers/staging/android/lowmemorykiller.c     |  12 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c |   6 +-
 fs/fs-writeback.c                             |   4 +-
 fs/fuse/file.c                                |   8 +-
 fs/nfs/internal.h                             |   2 +-
 fs/nfs/write.c                                |   2 +-
 fs/proc/meminfo.c                             |  20 +-
 include/linux/backing-dev.h                   |   2 +-
 include/linux/memcontrol.h                    |  63 +-
 include/linux/mm.h                            |   5 +
 include/linux/mm_inline.h                     |  39 +-
 include/linux/mm_types.h                      |   2 +-
 include/linux/mmzone.h                        | 161 +++--
 include/linux/swap.h                          |  24 +-
 include/linux/topology.h                      |   2 +-
 include/linux/vm_event_item.h                 |  14 +-
 include/linux/vmstat.h                        | 111 +++-
 include/linux/writeback.h                     |   2 +-
 include/trace/events/vmscan.h                 |  63 +-
 include/trace/events/writeback.h              |  10 +-
 kernel/power/snapshot.c                       |  10 +-
 kernel/sysctl.c                               |   4 +-
 mm/backing-dev.c                              |  15 +-
 mm/compaction.c                               |  48 +-
 mm/filemap.c                                  |  16 +-
 mm/huge_memory.c                              |  12 +-
 mm/internal.h                                 |  11 +-
 mm/khugepaged.c                               |  14 +-
 mm/memcontrol.c                               | 215 +++----
 mm/memory-failure.c                           |   4 +-
 mm/memory_hotplug.c                           |   7 +-
 mm/mempolicy.c                                |   2 +-
 mm/migrate.c                                  |  35 +-
 mm/mlock.c                                    |  12 +-
 mm/page-writeback.c                           | 123 ++--
 mm/page_alloc.c                               | 349 +++++-----
 mm/page_idle.c                                |   4 +-
 mm/rmap.c                                     |  26 +-
 mm/shmem.c                                    |  14 +-
 mm/swap.c                                     |  64 +-
 mm/swap_state.c                               |   4 +-
 mm/util.c                                     |   4 +-
 mm/vmscan.c                                   | 893 +++++++++++++-------------
 mm/vmstat.c                                   | 411 +++++++++---
 mm/workingset.c                               |  54 +-
 50 files changed, 1690 insertions(+), 1318 deletions(-)

-- 
2.6.4

^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
@ 2016-07-08  9:34 ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Minor changes this time

Changelog since v8
o Cosmetic cleanups to comments
o Calculate node vmstat threshold based on the largest zone in the node
o Align retry checks with decisions made by the OOM killer
o Avoid tricks with -1 and kswapd_classzone_idx
o More consistent handling of buffer_heads_over_limit

Changelog since v7
o Rebase onto current mmots
o Avoid double accounting of stats in node and zone
o Kswapd will avoid more reclaim if an eligible zone is available
o Remove some duplications of sc->reclaim_idx and classzone_idx
o Print per-node stats in zoneinfo

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node that is based upon 4.7-rc4 with Andrew's tree applied. While this
is a current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details. Some
of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from.  This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in distribution of file/anon pages due
   to when they were allocated results can result in a difference in 
   again. While the fair zone allocation policy mitigates some of the
   problems here, the page reclaim results on a multi-zone node will
   always be different to a single-zone node.
   it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other but it's sensitive to timing.  This
   mitigates the page allocator using pages that were allocated very recently
   in the ideal case but it's sensitive to timing. When kswapd is allocating
   from lower zones then it's great but during the rebalancing of the highest
   zone, the page allocator and kswapd interfere with each other. It's worse
   if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes. 

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

                                           4.7.0-rc4                  4.7.0-rc4
                                      mmotm-20160623                 nodelru-v9
Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623nodelru-v8
User          189.19      191.80
System       2604.45     2533.56
Elapsed      2855.30     2786.39

The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;


                             4.7.0-rc3   4.7.0-rc3
                         mmotm-20160623 nodelru-v8
DMA32 allocs               28794729769           0
Normal allocs              48432501431 77227309877
Movable allocs                       0           0

tiobench on ext4
----------------

tiobench is a benchmark that artifically benefits if old pages remain resident
while new pages get reclaimed. The fair zone allocation policy mitigates this
problem so pages age fairly. While the benchmark has problems, it is important
that tiobench performance remains constant as it implies that page aging
problems that the fair zone allocation policy fixes are not re-introduced.

                                         4.7.0-rc4             4.7.0-rc4
                                    mmotm-20160623            nodelru-v9
Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 approx-v9
User          645.72      525.90
System        403.85      331.75
Elapsed      6795.36     6783.67

This shows that the series has little or not impact on tiobench which is
desirable and a reduction in system CPU usage. It indicates that the fair
zone allocation policy was removed in a manner that didn't reintroduce
one class of page aging bug. There were only minor differences in overall
reclaim activity

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Minor Faults                    645838      647465
Major Faults                       573         640
Swap Ins                             0           0
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                  46041453    44190646
Normal allocs                 78053072    79887245
Movable allocs                       0           0
Allocation stalls                   24          67
Stall zone DMA                       0           0
Stall zone DMA32                     0           0
Stall zone Normal                    0           2
Stall zone HighMem                   0           0
Stall zone Movable                   0          65
Direct pages scanned             10969       30609
Kswapd pages scanned          93375144    93492094
Kswapd pages reclaimed        93372243    93489370
Direct pages reclaimed           10969       30609
Kswapd efficiency                  99%         99%
Kswapd velocity              13741.015   13781.934
Direct efficiency                 100%        100%
Direct velocity                  1.614       4.512
Percentage direct scans             0%          0%

kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 4 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe

pgbench Transactions
                        4.7.0-rc4             4.7.0-rc4
                   mmotm-20160623            nodelru-v8
Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts of
data from disk.

                               4.7.0-rc3             4.7.0-rc3
                          mmotm-20160623            nodelru-v9
Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 nodelru-v9
User         1548.01     1552.44
System       8609.71     8515.08
Elapsed      3587.10     3594.54

There is little or no change in performance but some drop in system CPU usage.

                             4.7.0-rc3   4.7.0-rc3
                        mmotm-20160623  nodelru-v9
Minor Faults                    362662      367360
Major Faults                      1204        1143
Swap Ins                            22           0
Swap Outs                         2855        1029
DMA allocs                           0           0
DMA32 allocs                  31409797    28837521
Normal allocs                 46611853    49231282
Movable allocs                       0           0
Direct pages scanned                 0           0
Kswapd pages scanned          40845270    40869088
Kswapd pages reclaimed        40830976    40855294
Direct pages reclaimed               0           0
Kswapd efficiency                  99%         99%
Kswapd velocity              11386.711   11369.769
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Page writes by reclaim            2855        1029
Page writes file                     0           0
Page writes anon                  2855        1029
Page reclaim immediate             771        1628
Sector Reads                 293312636   293536360
Sector Writes                 18213568    18186480
Page rescued immediate               0           0
Slabs scanned                   128257      132747
Direct inode steals                181          56
Kswapd inode steals                 59        1131

It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in numbers of inodes reclaimed but the workload has few active inodes and
is likely a timing artifact.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                             4.7.0-rc4             4.7.0-rc4
                        mmotm-20160623            nodelru-v8
Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)

This shows a number of improvements with the worst-case outlier greatly
improved.

Some of the vmstats are interesting

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Swap Ins                           163         502
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                 618719206  1381662383
Normal allocs                891235743   564138421
Movable allocs                       0           0
Allocation stalls                 2603           1
Direct pages scanned            216787           2
Kswapd pages scanned          50719775    41778378
Kswapd pages reclaimed        41541765    41777639
Direct pages reclaimed          209159           0
Kswapd efficiency                  81%         99%
Kswapd velocity              16859.554   14329.059
Direct efficiency                  96%          0%
Direct velocity                 72.061       0.001
Percentage direct scans             0%          0%
Page writes by reclaim         6215049           0
Page writes file               6215049           0
Page writes anon                     0           0
Page reclaim immediate           70673          90
Sector Reads                  81940800    81680456
Sector Writes                100158984    98816036
Page rescued immediate               0           0
Slabs scanned                  1366954       22683

While this is not guaranteed in all cases, this particular test showed
a large reduction in direct reclaim activity. It's also worth noting
that no page writes were issued from reclaim context.

This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
that area.

1. Reclaim/compaction is going to be affected because the amount of reclaim is
   no longer targetted at a specific zone. Compaction works on a per-zone basis
   so there is no guarantee that reclaiming a few THP's worth page pages will
   have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
   are called is now different. This may or may not be a problem but if it
   is, it'll be because shrinkers are not called enough and some balancing
   is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
   distributed between zones and the fair zone allocation policy used to do
   something very similar for anon. The distribution is now different but not
   necessarily in any way that matters but it's still worth bearing in mind.

 Documentation/cgroup-v1/memcg_test.txt        |   4 +-
 Documentation/cgroup-v1/memory.txt            |   4 +-
 arch/s390/appldata/appldata_mem.c             |   2 +-
 arch/tile/mm/pgtable.c                        |  18 +-
 drivers/base/node.c                           |  77 ++-
 drivers/staging/android/lowmemorykiller.c     |  12 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c |   6 +-
 fs/fs-writeback.c                             |   4 +-
 fs/fuse/file.c                                |   8 +-
 fs/nfs/internal.h                             |   2 +-
 fs/nfs/write.c                                |   2 +-
 fs/proc/meminfo.c                             |  20 +-
 include/linux/backing-dev.h                   |   2 +-
 include/linux/memcontrol.h                    |  63 +-
 include/linux/mm.h                            |   5 +
 include/linux/mm_inline.h                     |  39 +-
 include/linux/mm_types.h                      |   2 +-
 include/linux/mmzone.h                        | 161 +++--
 include/linux/swap.h                          |  24 +-
 include/linux/topology.h                      |   2 +-
 include/linux/vm_event_item.h                 |  14 +-
 include/linux/vmstat.h                        | 111 +++-
 include/linux/writeback.h                     |   2 +-
 include/trace/events/vmscan.h                 |  63 +-
 include/trace/events/writeback.h              |  10 +-
 kernel/power/snapshot.c                       |  10 +-
 kernel/sysctl.c                               |   4 +-
 mm/backing-dev.c                              |  15 +-
 mm/compaction.c                               |  48 +-
 mm/filemap.c                                  |  16 +-
 mm/huge_memory.c                              |  12 +-
 mm/internal.h                                 |  11 +-
 mm/khugepaged.c                               |  14 +-
 mm/memcontrol.c                               | 215 +++----
 mm/memory-failure.c                           |   4 +-
 mm/memory_hotplug.c                           |   7 +-
 mm/mempolicy.c                                |   2 +-
 mm/migrate.c                                  |  35 +-
 mm/mlock.c                                    |  12 +-
 mm/page-writeback.c                           | 123 ++--
 mm/page_alloc.c                               | 349 +++++-----
 mm/page_idle.c                                |   4 +-
 mm/rmap.c                                     |  26 +-
 mm/shmem.c                                    |  14 +-
 mm/swap.c                                     |  64 +-
 mm/swap_state.c                               |   4 +-
 mm/util.c                                     |   4 +-
 mm/vmscan.c                                   | 893 +++++++++++++-------------
 mm/vmstat.c                                   | 411 +++++++++---
 mm/workingset.c                               |  54 +-
 50 files changed, 1690 insertions(+), 1318 deletions(-)

-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 01/34] mm, vmstat: add infrastructure for per-node vmstats
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

VM statistic counters for reclaim decisions are zone-based.  If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that.  The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet.  There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
patch with no functional change.  There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/base/node.c    |  76 +++++++------
 include/linux/mm.h     |   5 +
 include/linux/mmzone.h |  13 +++
 include/linux/vmstat.h |  92 +++++++++++++--
 mm/page_alloc.c        |  10 +-
 mm/vmstat.c            | 295 +++++++++++++++++++++++++++++++++++++++++++++----
 mm/workingset.c        |   9 +-
 7 files changed, 424 insertions(+), 76 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ed0ef0f69489..92d8e090c5b3 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -74,16 +74,16 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, K(node_page_state(nid, NR_ACTIVE_ANON) +
-				node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_INACTIVE_ANON) +
-				node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_ACTIVE_ANON)),
-		       nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
-		       nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_UNEVICTABLE)),
-		       nid, K(node_page_state(nid, NR_MLOCK)));
+		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
+				sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
+				sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
+		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
 	n += sprintf(buf + n,
@@ -117,31 +117,31 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d ShmemPmdMapped: %8lu kB\n"
 #endif
 			,
-		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
-		       nid, K(node_page_state(nid, NR_WRITEBACK)),
-		       nid, K(node_page_state(nid, NR_FILE_PAGES)),
-		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
-		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
+		       nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
+		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
+		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
+		       nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
-		       nid, node_page_state(nid, NR_KERNEL_STACK) *
+		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
-		       nid, K(node_page_state(nid, NR_PAGETABLE)),
-		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
-		       nid, K(node_page_state(nid, NR_BOUNCE)),
-		       nid, K(node_page_state(nid, NR_WRITEBACK_TEMP)),
-		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
-				node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
-		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_UNSTABLE_NFS)),
+		       nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK_TEMP)),
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) +
+				sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
-		       nid, K(node_page_state(nid, NR_ANON_THPS) *
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ANON_THPS) *
 				       HPAGE_PMD_NR),
-		       nid, K(node_page_state(nid, NR_SHMEM_THPS) *
+		       nid, K(sum_zone_node_page_state(nid, NR_SHMEM_THPS) *
 				       HPAGE_PMD_NR),
-		       nid, K(node_page_state(nid, NR_SHMEM_PMDMAPPED) *
+		       nid, K(sum_zone_node_page_state(nid, NR_SHMEM_PMDMAPPED) *
 				       HPAGE_PMD_NR));
 #else
-		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
 #endif
 	n += hugetlb_report_node_meminfo(nid, buf + n);
 	return n;
@@ -160,12 +160,12 @@ static ssize_t node_read_numastat(struct device *dev,
 		       "interleave_hit %lu\n"
 		       "local_node %lu\n"
 		       "other_node %lu\n",
-		       node_page_state(dev->id, NUMA_HIT),
-		       node_page_state(dev->id, NUMA_MISS),
-		       node_page_state(dev->id, NUMA_FOREIGN),
-		       node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
-		       node_page_state(dev->id, NUMA_LOCAL),
-		       node_page_state(dev->id, NUMA_OTHER));
+		       sum_zone_node_page_state(dev->id, NUMA_HIT),
+		       sum_zone_node_page_state(dev->id, NUMA_MISS),
+		       sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
+		       sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
+		       sum_zone_node_page_state(dev->id, NUMA_LOCAL),
+		       sum_zone_node_page_state(dev->id, NUMA_OTHER));
 }
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
@@ -173,12 +173,18 @@ static ssize_t node_read_vmstat(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
 	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	int i;
 	int n = 0;
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
-			     node_page_state(nid, i));
+			     sum_zone_node_page_state(nid, i));
+
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		n += sprintf(buf+n, "%s %lu\n",
+			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+			     node_page_state(pgdat, i));
 
 	return n;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b21e5f30378e..dd79aa2800a3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -933,6 +933,11 @@ static inline struct zone *page_zone(const struct page *page)
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
 }
 
+static inline pg_data_t *page_pgdat(const struct page *page)
+{
+	return NODE_DATA(page_to_nid(page));
+}
+
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 19425e988bdc..078ecb81e209 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -160,6 +160,10 @@ enum zone_stat_item {
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
+enum node_stat_item {
+	NR_VM_NODE_STAT_ITEMS
+};
+
 /*
  * We do arithmetic on the LRU lists in various places in the code,
  * so it is important to keep the active lists LRU_ACTIVE higher in
@@ -267,6 +271,11 @@ struct per_cpu_pageset {
 #endif
 };
 
+struct per_cpu_nodestat {
+	s8 stat_threshold;
+	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+};
+
 #endif /* !__GENERATING_BOUNDS.H */
 
 enum zone_type {
@@ -695,6 +704,10 @@ typedef struct pglist_data {
 	struct list_head split_queue;
 	unsigned long split_queue_len;
 #endif
+
+	/* Per-node vmstats */
+	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
+	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d2da8e053210..d1744aa3ab9c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -106,20 +106,38 @@ static inline void vm_events_fold_cpu(int cpu)
 		zone_idx(zone), delta)
 
 /*
- * Zone based page accounting with per cpu differentials.
+ * Zone and node-based page accounting with per cpu differentials.
  */
-extern atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
 
 static inline void zone_page_state_add(long x, struct zone *zone,
 				 enum zone_stat_item item)
 {
 	atomic_long_add(x, &zone->vm_stat[item]);
-	atomic_long_add(x, &vm_stat[item]);
+	atomic_long_add(x, &vm_zone_stat[item]);
+}
+
+static inline void node_page_state_add(long x, struct pglist_data *pgdat,
+				 enum node_stat_item item)
+{
+	atomic_long_add(x, &pgdat->vm_stat[item]);
+	atomic_long_add(x, &vm_node_stat[item]);
 }
 
 static inline unsigned long global_page_state(enum zone_stat_item item)
 {
-	long x = atomic_long_read(&vm_stat[item]);
+	long x = atomic_long_read(&vm_zone_stat[item]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
+static inline unsigned long global_node_page_state(enum node_stat_item item)
+{
+	long x = atomic_long_read(&vm_node_stat[item]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -161,31 +179,44 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 }
 
 #ifdef CONFIG_NUMA
-
-extern unsigned long node_page_state(int node, enum zone_stat_item item);
-
+extern unsigned long sum_zone_node_page_state(int node,
+						enum zone_stat_item item);
+extern unsigned long node_page_state(struct pglist_data *pgdat,
+						enum node_stat_item item);
 #else
-
-#define node_page_state(node, item) global_page_state(item)
-
+#define sum_zone_node_page_state(node, item) global_page_state(item)
+#define node_page_state(node, item) global_node_page_state(item)
 #endif /* CONFIG_NUMA */
 
 #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
 #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))
+#define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d)
+#define sub_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, -(__d))
 
 #ifdef CONFIG_SMP
 void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
 void __inc_zone_page_state(struct page *, enum zone_stat_item);
 void __dec_zone_page_state(struct page *, enum zone_stat_item);
 
+void __mod_node_page_state(struct pglist_data *, enum node_stat_item item, long);
+void __inc_node_page_state(struct page *, enum node_stat_item);
+void __dec_node_page_state(struct page *, enum node_stat_item);
+
 void mod_zone_page_state(struct zone *, enum zone_stat_item, long);
 void inc_zone_page_state(struct page *, enum zone_stat_item);
 void dec_zone_page_state(struct page *, enum zone_stat_item);
 
+void mod_node_page_state(struct pglist_data *, enum node_stat_item, long);
+void inc_node_page_state(struct page *, enum node_stat_item);
+void dec_node_page_state(struct page *, enum node_stat_item);
+
 extern void inc_zone_state(struct zone *, enum zone_stat_item);
+extern void inc_node_state(struct pglist_data *, enum node_stat_item);
 extern void __inc_zone_state(struct zone *, enum zone_stat_item);
+extern void __inc_node_state(struct pglist_data *, enum node_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
+extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
@@ -213,16 +244,34 @@ static inline void __mod_zone_page_state(struct zone *zone,
 	zone_page_state_add(delta, zone, item);
 }
 
+static inline void __mod_node_page_state(struct pglist_data *pgdat,
+			enum node_stat_item item, int delta)
+{
+	node_page_state_add(delta, pgdat, item);
+}
+
 static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	atomic_long_inc(&zone->vm_stat[item]);
-	atomic_long_inc(&vm_stat[item]);
+	atomic_long_inc(&vm_zone_stat[item]);
+}
+
+static inline void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	atomic_long_inc(&pgdat->vm_stat[item]);
+	atomic_long_inc(&vm_node_stat[item]);
 }
 
 static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	atomic_long_dec(&zone->vm_stat[item]);
-	atomic_long_dec(&vm_stat[item]);
+	atomic_long_dec(&vm_zone_stat[item]);
+}
+
+static inline void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	atomic_long_dec(&pgdat->vm_stat[item]);
+	atomic_long_dec(&vm_node_stat[item]);
 }
 
 static inline void __inc_zone_page_state(struct page *page,
@@ -231,12 +280,26 @@ static inline void __inc_zone_page_state(struct page *page,
 	__inc_zone_state(page_zone(page), item);
 }
 
+static inline void __inc_node_page_state(struct page *page,
+			enum node_stat_item item)
+{
+	__inc_node_state(page_pgdat(page), item);
+}
+
+
 static inline void __dec_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
 	__dec_zone_state(page_zone(page), item);
 }
 
+static inline void __dec_node_page_state(struct page *page,
+			enum node_stat_item item)
+{
+	__dec_node_state(page_pgdat(page), item);
+}
+
+
 /*
  * We only use atomic operations to update counters. So there is no need to
  * disable interrupts.
@@ -245,7 +308,12 @@ static inline void __dec_zone_page_state(struct page *page,
 #define dec_zone_page_state __dec_zone_page_state
 #define mod_zone_page_state __mod_zone_page_state
 
+#define inc_node_page_state __inc_node_page_state
+#define dec_node_page_state __dec_node_page_state
+#define mod_node_page_state __mod_node_page_state
+
 #define inc_zone_state __inc_zone_state
+#define inc_node_state __inc_node_state
 #define dec_zone_state __dec_zone_state
 
 #define set_pgdat_percpu_threshold(pgdat, callback) { }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 403c5dcd24da..34e46c02a406 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4247,8 +4247,8 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
 		managed_pages += pgdat->node_zones[zone_type].managed_pages;
 	val->totalram = managed_pages;
-	val->sharedram = node_page_state(nid, NR_SHMEM);
-	val->freeram = node_page_state(nid, NR_FREE_PAGES);
+	val->sharedram = sum_zone_node_page_state(nid, NR_SHMEM);
+	val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
 #ifdef CONFIG_HIGHMEM
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
 		struct zone *zone = &pgdat->node_zones[zone_type];
@@ -5373,6 +5373,11 @@ static void __meminit setup_zone_pageset(struct zone *zone)
 	zone->pageset = alloc_percpu(struct per_cpu_pageset);
 	for_each_possible_cpu(cpu)
 		zone_pageset_init(zone, cpu);
+
+	if (!zone->zone_pgdat->per_cpu_nodestats) {
+		zone->zone_pgdat->per_cpu_nodestats =
+			alloc_percpu(struct per_cpu_nodestat);
+	}
 }
 
 /*
@@ -6078,6 +6083,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	reset_deferred_meminit(pgdat);
 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
+	pgdat->per_cpu_nodestats = NULL;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7997f52935c9..3345d396a99b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -86,8 +86,10 @@ void vm_events_fold_cpu(int cpu)
  *
  * vm_stat contains the global counters
  */
-atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
-EXPORT_SYMBOL(vm_stat);
+atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
+atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
+EXPORT_SYMBOL(vm_zone_stat);
+EXPORT_SYMBOL(vm_node_stat);
 
 #ifdef CONFIG_SMP
 
@@ -167,19 +169,36 @@ int calculate_normal_threshold(struct zone *zone)
  */
 void refresh_zone_stat_thresholds(void)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int cpu;
 	int threshold;
 
+	/* Zero current pgdat thresholds */
+	for_each_online_pgdat(pgdat) {
+		for_each_online_cpu(cpu) {
+			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold = 0;
+		}
+	}
+
 	for_each_populated_zone(zone) {
+		struct pglist_data *pgdat = zone->zone_pgdat;
 		unsigned long max_drift, tolerate_drift;
 
 		threshold = calculate_normal_threshold(zone);
 
-		for_each_online_cpu(cpu)
+		for_each_online_cpu(cpu) {
+			int pgdat_threshold;
+
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
 
+			/* Base nodestat threshold on the largest populated zone. */
+			pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
+			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold
+				= max(threshold, pgdat_threshold);
+		}
+
 		/*
 		 * Only set percpu_drift_mark if there is a danger that
 		 * NR_FREE_PAGES reports the low watermark is ok when in fact
@@ -238,6 +257,26 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+				long delta)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long x;
+	long t;
+
+	x = delta + __this_cpu_read(*p);
+
+	t = __this_cpu_read(pcp->stat_threshold);
+
+	if (unlikely(x > t || x < -t)) {
+		node_page_state_add(x, pgdat, item);
+		x = 0;
+	}
+	__this_cpu_write(*p, x);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
 /*
  * Optimized increment and decrement functions.
  *
@@ -277,12 +316,34 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 	}
 }
 
+void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s8 v, t;
+
+	v = __this_cpu_inc_return(*p);
+	t = __this_cpu_read(pcp->stat_threshold);
+	if (unlikely(v > t)) {
+		s8 overstep = t >> 1;
+
+		node_page_state_add(v + overstep, pgdat, item);
+		__this_cpu_write(*p, -overstep);
+	}
+}
+
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	__inc_zone_state(page_zone(page), item);
 }
 EXPORT_SYMBOL(__inc_zone_page_state);
 
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	__inc_node_state(page_pgdat(page), item);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
@@ -299,12 +360,34 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 	}
 }
 
+void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s8 v, t;
+
+	v = __this_cpu_dec_return(*p);
+	t = __this_cpu_read(pcp->stat_threshold);
+	if (unlikely(v < - t)) {
+		s8 overstep = t >> 1;
+
+		node_page_state_add(v - overstep, pgdat, item);
+		__this_cpu_write(*p, overstep);
+	}
+}
+
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	__dec_zone_state(page_zone(page), item);
 }
 EXPORT_SYMBOL(__dec_zone_page_state);
 
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	__dec_node_state(page_pgdat(page), item);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+
 #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
 /*
  * If we have cmpxchg_local support then we do not need to incur the overhead
@@ -318,8 +401,8 @@ EXPORT_SYMBOL(__dec_zone_page_state);
  *     1       Overstepping half of threshold
  *     -1      Overstepping minus half of threshold
 */
-static inline void mod_state(struct zone *zone, enum zone_stat_item item,
-			     long delta, int overstep_mode)
+static inline void mod_zone_state(struct zone *zone,
+       enum zone_stat_item item, long delta, int overstep_mode)
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
@@ -359,26 +442,88 @@ static inline void mod_state(struct zone *zone, enum zone_stat_item item,
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 			 long delta)
 {
-	mod_state(zone, item, delta, 0);
+	mod_zone_state(zone, item, delta, 0);
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
 void inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	mod_state(zone, item, 1, 1);
+	mod_zone_state(zone, item, 1, 1);
 }
 
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	mod_state(page_zone(page), item, 1, 1);
+	mod_zone_state(page_zone(page), item, 1, 1);
 }
 EXPORT_SYMBOL(inc_zone_page_state);
 
 void dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	mod_state(page_zone(page), item, -1, -1);
+	mod_zone_state(page_zone(page), item, -1, -1);
 }
 EXPORT_SYMBOL(dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+       enum node_stat_item item, int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to node counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a node.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (n > t || n < -t) {
+			int os = overstep_mode * (t >> 1) ;
+
+			/* Overflow must be added to node counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
 #else
 /*
  * Use interrupt disable to serialize counter updates
@@ -424,21 +569,69 @@ void dec_zone_page_state(struct page *page, enum zone_stat_item item)
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(dec_zone_page_state);
-#endif
 
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_node_state(pgdat, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_node_state);
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_node_page_state(pgdat, item, delta);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	unsigned long flags;
+	struct pglist_data *pgdat;
+
+	pgdat = page_pgdat(page);
+	local_irq_save(flags);
+	__inc_node_state(pgdat, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_node_page_state(page, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+#endif
 
 /*
  * Fold a differential into the global counters.
  * Returns the number of counters updated.
  */
-static int fold_diff(int *diff)
+static int fold_diff(int *zone_diff, int *node_diff)
 {
 	int i;
 	int changes = 0;
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (diff[i]) {
-			atomic_long_add(diff[i], &vm_stat[i]);
+		if (zone_diff[i]) {
+			atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
+			changes++;
+	}
+
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		if (node_diff[i]) {
+			atomic_long_add(node_diff[i], &vm_node_stat[i]);
 			changes++;
 	}
 	return changes;
@@ -462,9 +655,11 @@ static int fold_diff(int *diff)
  */
 static int refresh_cpu_vm_stats(bool do_pagesets)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int i;
-	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
 	for_each_populated_zone(zone) {
@@ -477,7 +672,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			if (v) {
 
 				atomic_long_add(v, &zone->vm_stat[i]);
-				global_diff[i] += v;
+				global_zone_diff[i] += v;
 #ifdef CONFIG_NUMA
 				/* 3 seconds idle till flush */
 				__this_cpu_write(p->expire, 3);
@@ -516,7 +711,22 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		}
 #endif
 	}
-	changes += fold_diff(global_diff);
+
+	for_each_online_pgdat(pgdat) {
+		struct per_cpu_nodestat __percpu *p = pgdat->per_cpu_nodestats;
+
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+			int v;
+
+			v = this_cpu_xchg(p->vm_node_stat_diff[i], 0);
+			if (v) {
+				atomic_long_add(v, &pgdat->vm_stat[i]);
+				global_node_diff[i] += v;
+			}
+		}
+	}
+
+	changes += fold_diff(global_zone_diff, global_node_diff);
 	return changes;
 }
 
@@ -527,9 +737,11 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
  */
 void cpu_vm_stats_fold(int cpu)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int i;
-	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
@@ -543,11 +755,27 @@ void cpu_vm_stats_fold(int cpu)
 				v = p->vm_stat_diff[i];
 				p->vm_stat_diff[i] = 0;
 				atomic_long_add(v, &zone->vm_stat[i]);
-				global_diff[i] += v;
+				global_zone_diff[i] += v;
 			}
 	}
 
-	fold_diff(global_diff);
+	for_each_online_pgdat(pgdat) {
+		struct per_cpu_nodestat *p;
+
+		p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
+
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+			if (p->vm_node_stat_diff[i]) {
+				int v;
+
+				v = p->vm_node_stat_diff[i];
+				p->vm_node_stat_diff[i] = 0;
+				atomic_long_add(v, &pgdat->vm_stat[i]);
+				global_node_diff[i] += v;
+			}
+	}
+
+	fold_diff(global_zone_diff, global_node_diff);
 }
 
 /*
@@ -563,16 +791,19 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 			int v = pset->vm_stat_diff[i];
 			pset->vm_stat_diff[i] = 0;
 			atomic_long_add(v, &zone->vm_stat[i]);
-			atomic_long_add(v, &vm_stat[i]);
+			atomic_long_add(v, &vm_zone_stat[i]);
 		}
 }
 #endif
 
 #ifdef CONFIG_NUMA
 /*
- * Determine the per node value of a stat item.
+ * Determine the per node value of a stat item. This function
+ * is called frequently in a NUMA machine, so try to be as
+ * frugal as possible.
  */
-unsigned long node_page_state(int node, enum zone_stat_item item)
+unsigned long sum_zone_node_page_state(int node,
+				 enum zone_stat_item item)
 {
 	struct zone *zones = NODE_DATA(node)->node_zones;
 	int i;
@@ -584,6 +815,19 @@ unsigned long node_page_state(int node, enum zone_stat_item item)
 	return count;
 }
 
+/*
+ * Determine the per node value of a stat item.
+ */
+unsigned long node_page_state(struct pglist_data *pgdat,
+				enum node_stat_item item)
+{
+	long x = atomic_long_read(&pgdat->vm_stat[item]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
 #endif
 
 #ifdef CONFIG_COMPACTION
@@ -1287,6 +1531,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	if (*pos >= ARRAY_SIZE(vmstat_text))
 		return NULL;
 	stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
+			  NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
 			  NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
@@ -1301,6 +1546,10 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		v[i] = global_page_state(i);
 	v += NR_VM_ZONE_STAT_ITEMS;
 
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		v[i] = global_node_page_state(i);
+	v += NR_VM_NODE_STAT_ITEMS;
+
 	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
 			    v + NR_DIRTY_THRESHOLD);
 	v += NR_VM_WRITEBACK_STAT_ITEMS;
@@ -1390,7 +1639,7 @@ int vmstat_refresh(struct ctl_table *table, int write,
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
-		val = atomic_long_read(&vm_stat[i]);
+		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
 			switch (i) {
 			case NR_ALLOC_BATCH:
diff --git a/mm/workingset.c b/mm/workingset.c
index 8252de4566e9..ba972ac2dfdd 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -351,12 +351,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
 	local_irq_enable();
 
-	if (memcg_kmem_enabled())
+	if (memcg_kmem_enabled()) {
 		pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
 						     LRU_ALL_FILE);
-	else
-		pages = node_page_state(sc->nid, NR_ACTIVE_FILE) +
-			node_page_state(sc->nid, NR_INACTIVE_FILE);
+	} else {
+		pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
+			sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
+	}
 
 	/*
 	 * Active cache pages are limited to 50% of memory, and shadow
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 01/34] mm, vmstat: add infrastructure for per-node vmstats
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

VM statistic counters for reclaim decisions are zone-based.  If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that.  The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet.  There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
patch with no functional change.  There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/base/node.c    |  76 +++++++------
 include/linux/mm.h     |   5 +
 include/linux/mmzone.h |  13 +++
 include/linux/vmstat.h |  92 +++++++++++++--
 mm/page_alloc.c        |  10 +-
 mm/vmstat.c            | 295 +++++++++++++++++++++++++++++++++++++++++++++----
 mm/workingset.c        |   9 +-
 7 files changed, 424 insertions(+), 76 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ed0ef0f69489..92d8e090c5b3 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -74,16 +74,16 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, K(node_page_state(nid, NR_ACTIVE_ANON) +
-				node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_INACTIVE_ANON) +
-				node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_ACTIVE_ANON)),
-		       nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
-		       nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(node_page_state(nid, NR_UNEVICTABLE)),
-		       nid, K(node_page_state(nid, NR_MLOCK)));
+		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
+				sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
+				sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
+		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
 	n += sprintf(buf + n,
@@ -117,31 +117,31 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d ShmemPmdMapped: %8lu kB\n"
 #endif
 			,
-		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
-		       nid, K(node_page_state(nid, NR_WRITEBACK)),
-		       nid, K(node_page_state(nid, NR_FILE_PAGES)),
-		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
-		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
+		       nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
+		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
+		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
+		       nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
-		       nid, node_page_state(nid, NR_KERNEL_STACK) *
+		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
-		       nid, K(node_page_state(nid, NR_PAGETABLE)),
-		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
-		       nid, K(node_page_state(nid, NR_BOUNCE)),
-		       nid, K(node_page_state(nid, NR_WRITEBACK_TEMP)),
-		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
-				node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
-		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_UNSTABLE_NFS)),
+		       nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK_TEMP)),
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) +
+				sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
-		       nid, K(node_page_state(nid, NR_ANON_THPS) *
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+		       nid, K(sum_zone_node_page_state(nid, NR_ANON_THPS) *
 				       HPAGE_PMD_NR),
-		       nid, K(node_page_state(nid, NR_SHMEM_THPS) *
+		       nid, K(sum_zone_node_page_state(nid, NR_SHMEM_THPS) *
 				       HPAGE_PMD_NR),
-		       nid, K(node_page_state(nid, NR_SHMEM_PMDMAPPED) *
+		       nid, K(sum_zone_node_page_state(nid, NR_SHMEM_PMDMAPPED) *
 				       HPAGE_PMD_NR));
 #else
-		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
+		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
 #endif
 	n += hugetlb_report_node_meminfo(nid, buf + n);
 	return n;
@@ -160,12 +160,12 @@ static ssize_t node_read_numastat(struct device *dev,
 		       "interleave_hit %lu\n"
 		       "local_node %lu\n"
 		       "other_node %lu\n",
-		       node_page_state(dev->id, NUMA_HIT),
-		       node_page_state(dev->id, NUMA_MISS),
-		       node_page_state(dev->id, NUMA_FOREIGN),
-		       node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
-		       node_page_state(dev->id, NUMA_LOCAL),
-		       node_page_state(dev->id, NUMA_OTHER));
+		       sum_zone_node_page_state(dev->id, NUMA_HIT),
+		       sum_zone_node_page_state(dev->id, NUMA_MISS),
+		       sum_zone_node_page_state(dev->id, NUMA_FOREIGN),
+		       sum_zone_node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
+		       sum_zone_node_page_state(dev->id, NUMA_LOCAL),
+		       sum_zone_node_page_state(dev->id, NUMA_OTHER));
 }
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
@@ -173,12 +173,18 @@ static ssize_t node_read_vmstat(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
 	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	int i;
 	int n = 0;
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		n += sprintf(buf+n, "%s %lu\n", vmstat_text[i],
-			     node_page_state(nid, i));
+			     sum_zone_node_page_state(nid, i));
+
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		n += sprintf(buf+n, "%s %lu\n",
+			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+			     node_page_state(pgdat, i));
 
 	return n;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b21e5f30378e..dd79aa2800a3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -933,6 +933,11 @@ static inline struct zone *page_zone(const struct page *page)
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
 }
 
+static inline pg_data_t *page_pgdat(const struct page *page)
+{
+	return NODE_DATA(page_to_nid(page));
+}
+
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 19425e988bdc..078ecb81e209 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -160,6 +160,10 @@ enum zone_stat_item {
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
+enum node_stat_item {
+	NR_VM_NODE_STAT_ITEMS
+};
+
 /*
  * We do arithmetic on the LRU lists in various places in the code,
  * so it is important to keep the active lists LRU_ACTIVE higher in
@@ -267,6 +271,11 @@ struct per_cpu_pageset {
 #endif
 };
 
+struct per_cpu_nodestat {
+	s8 stat_threshold;
+	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+};
+
 #endif /* !__GENERATING_BOUNDS.H */
 
 enum zone_type {
@@ -695,6 +704,10 @@ typedef struct pglist_data {
 	struct list_head split_queue;
 	unsigned long split_queue_len;
 #endif
+
+	/* Per-node vmstats */
+	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
+	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d2da8e053210..d1744aa3ab9c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -106,20 +106,38 @@ static inline void vm_events_fold_cpu(int cpu)
 		zone_idx(zone), delta)
 
 /*
- * Zone based page accounting with per cpu differentials.
+ * Zone and node-based page accounting with per cpu differentials.
  */
-extern atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
+extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
 
 static inline void zone_page_state_add(long x, struct zone *zone,
 				 enum zone_stat_item item)
 {
 	atomic_long_add(x, &zone->vm_stat[item]);
-	atomic_long_add(x, &vm_stat[item]);
+	atomic_long_add(x, &vm_zone_stat[item]);
+}
+
+static inline void node_page_state_add(long x, struct pglist_data *pgdat,
+				 enum node_stat_item item)
+{
+	atomic_long_add(x, &pgdat->vm_stat[item]);
+	atomic_long_add(x, &vm_node_stat[item]);
 }
 
 static inline unsigned long global_page_state(enum zone_stat_item item)
 {
-	long x = atomic_long_read(&vm_stat[item]);
+	long x = atomic_long_read(&vm_zone_stat[item]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
+static inline unsigned long global_node_page_state(enum node_stat_item item)
+{
+	long x = atomic_long_read(&vm_node_stat[item]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -161,31 +179,44 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 }
 
 #ifdef CONFIG_NUMA
-
-extern unsigned long node_page_state(int node, enum zone_stat_item item);
-
+extern unsigned long sum_zone_node_page_state(int node,
+						enum zone_stat_item item);
+extern unsigned long node_page_state(struct pglist_data *pgdat,
+						enum node_stat_item item);
 #else
-
-#define node_page_state(node, item) global_page_state(item)
-
+#define sum_zone_node_page_state(node, item) global_page_state(item)
+#define node_page_state(node, item) global_node_page_state(item)
 #endif /* CONFIG_NUMA */
 
 #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
 #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))
+#define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d)
+#define sub_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, -(__d))
 
 #ifdef CONFIG_SMP
 void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
 void __inc_zone_page_state(struct page *, enum zone_stat_item);
 void __dec_zone_page_state(struct page *, enum zone_stat_item);
 
+void __mod_node_page_state(struct pglist_data *, enum node_stat_item item, long);
+void __inc_node_page_state(struct page *, enum node_stat_item);
+void __dec_node_page_state(struct page *, enum node_stat_item);
+
 void mod_zone_page_state(struct zone *, enum zone_stat_item, long);
 void inc_zone_page_state(struct page *, enum zone_stat_item);
 void dec_zone_page_state(struct page *, enum zone_stat_item);
 
+void mod_node_page_state(struct pglist_data *, enum node_stat_item, long);
+void inc_node_page_state(struct page *, enum node_stat_item);
+void dec_node_page_state(struct page *, enum node_stat_item);
+
 extern void inc_zone_state(struct zone *, enum zone_stat_item);
+extern void inc_node_state(struct pglist_data *, enum node_stat_item);
 extern void __inc_zone_state(struct zone *, enum zone_stat_item);
+extern void __inc_node_state(struct pglist_data *, enum node_stat_item);
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
+extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
@@ -213,16 +244,34 @@ static inline void __mod_zone_page_state(struct zone *zone,
 	zone_page_state_add(delta, zone, item);
 }
 
+static inline void __mod_node_page_state(struct pglist_data *pgdat,
+			enum node_stat_item item, int delta)
+{
+	node_page_state_add(delta, pgdat, item);
+}
+
 static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	atomic_long_inc(&zone->vm_stat[item]);
-	atomic_long_inc(&vm_stat[item]);
+	atomic_long_inc(&vm_zone_stat[item]);
+}
+
+static inline void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	atomic_long_inc(&pgdat->vm_stat[item]);
+	atomic_long_inc(&vm_node_stat[item]);
 }
 
 static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	atomic_long_dec(&zone->vm_stat[item]);
-	atomic_long_dec(&vm_stat[item]);
+	atomic_long_dec(&vm_zone_stat[item]);
+}
+
+static inline void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	atomic_long_dec(&pgdat->vm_stat[item]);
+	atomic_long_dec(&vm_node_stat[item]);
 }
 
 static inline void __inc_zone_page_state(struct page *page,
@@ -231,12 +280,26 @@ static inline void __inc_zone_page_state(struct page *page,
 	__inc_zone_state(page_zone(page), item);
 }
 
+static inline void __inc_node_page_state(struct page *page,
+			enum node_stat_item item)
+{
+	__inc_node_state(page_pgdat(page), item);
+}
+
+
 static inline void __dec_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
 	__dec_zone_state(page_zone(page), item);
 }
 
+static inline void __dec_node_page_state(struct page *page,
+			enum node_stat_item item)
+{
+	__dec_node_state(page_pgdat(page), item);
+}
+
+
 /*
  * We only use atomic operations to update counters. So there is no need to
  * disable interrupts.
@@ -245,7 +308,12 @@ static inline void __dec_zone_page_state(struct page *page,
 #define dec_zone_page_state __dec_zone_page_state
 #define mod_zone_page_state __mod_zone_page_state
 
+#define inc_node_page_state __inc_node_page_state
+#define dec_node_page_state __dec_node_page_state
+#define mod_node_page_state __mod_node_page_state
+
 #define inc_zone_state __inc_zone_state
+#define inc_node_state __inc_node_state
 #define dec_zone_state __dec_zone_state
 
 #define set_pgdat_percpu_threshold(pgdat, callback) { }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 403c5dcd24da..34e46c02a406 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4247,8 +4247,8 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
 		managed_pages += pgdat->node_zones[zone_type].managed_pages;
 	val->totalram = managed_pages;
-	val->sharedram = node_page_state(nid, NR_SHMEM);
-	val->freeram = node_page_state(nid, NR_FREE_PAGES);
+	val->sharedram = sum_zone_node_page_state(nid, NR_SHMEM);
+	val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
 #ifdef CONFIG_HIGHMEM
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
 		struct zone *zone = &pgdat->node_zones[zone_type];
@@ -5373,6 +5373,11 @@ static void __meminit setup_zone_pageset(struct zone *zone)
 	zone->pageset = alloc_percpu(struct per_cpu_pageset);
 	for_each_possible_cpu(cpu)
 		zone_pageset_init(zone, cpu);
+
+	if (!zone->zone_pgdat->per_cpu_nodestats) {
+		zone->zone_pgdat->per_cpu_nodestats =
+			alloc_percpu(struct per_cpu_nodestat);
+	}
 }
 
 /*
@@ -6078,6 +6083,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	reset_deferred_meminit(pgdat);
 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
+	pgdat->per_cpu_nodestats = NULL;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7997f52935c9..3345d396a99b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -86,8 +86,10 @@ void vm_events_fold_cpu(int cpu)
  *
  * vm_stat contains the global counters
  */
-atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
-EXPORT_SYMBOL(vm_stat);
+atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;
+atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp;
+EXPORT_SYMBOL(vm_zone_stat);
+EXPORT_SYMBOL(vm_node_stat);
 
 #ifdef CONFIG_SMP
 
@@ -167,19 +169,36 @@ int calculate_normal_threshold(struct zone *zone)
  */
 void refresh_zone_stat_thresholds(void)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int cpu;
 	int threshold;
 
+	/* Zero current pgdat thresholds */
+	for_each_online_pgdat(pgdat) {
+		for_each_online_cpu(cpu) {
+			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold = 0;
+		}
+	}
+
 	for_each_populated_zone(zone) {
+		struct pglist_data *pgdat = zone->zone_pgdat;
 		unsigned long max_drift, tolerate_drift;
 
 		threshold = calculate_normal_threshold(zone);
 
-		for_each_online_cpu(cpu)
+		for_each_online_cpu(cpu) {
+			int pgdat_threshold;
+
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
 
+			/* Base nodestat threshold on the largest populated zone. */
+			pgdat_threshold = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold;
+			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold
+				= max(threshold, pgdat_threshold);
+		}
+
 		/*
 		 * Only set percpu_drift_mark if there is a danger that
 		 * NR_FREE_PAGES reports the low watermark is ok when in fact
@@ -238,6 +257,26 @@ void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+				long delta)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long x;
+	long t;
+
+	x = delta + __this_cpu_read(*p);
+
+	t = __this_cpu_read(pcp->stat_threshold);
+
+	if (unlikely(x > t || x < -t)) {
+		node_page_state_add(x, pgdat, item);
+		x = 0;
+	}
+	__this_cpu_write(*p, x);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
 /*
  * Optimized increment and decrement functions.
  *
@@ -277,12 +316,34 @@ void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 	}
 }
 
+void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s8 v, t;
+
+	v = __this_cpu_inc_return(*p);
+	t = __this_cpu_read(pcp->stat_threshold);
+	if (unlikely(v > t)) {
+		s8 overstep = t >> 1;
+
+		node_page_state_add(v + overstep, pgdat, item);
+		__this_cpu_write(*p, -overstep);
+	}
+}
+
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	__inc_zone_state(page_zone(page), item);
 }
 EXPORT_SYMBOL(__inc_zone_page_state);
 
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	__inc_node_state(page_pgdat(page), item);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
@@ -299,12 +360,34 @@ void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 	}
 }
 
+void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s8 v, t;
+
+	v = __this_cpu_dec_return(*p);
+	t = __this_cpu_read(pcp->stat_threshold);
+	if (unlikely(v < - t)) {
+		s8 overstep = t >> 1;
+
+		node_page_state_add(v - overstep, pgdat, item);
+		__this_cpu_write(*p, overstep);
+	}
+}
+
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	__dec_zone_state(page_zone(page), item);
 }
 EXPORT_SYMBOL(__dec_zone_page_state);
 
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	__dec_node_state(page_pgdat(page), item);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+
 #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
 /*
  * If we have cmpxchg_local support then we do not need to incur the overhead
@@ -318,8 +401,8 @@ EXPORT_SYMBOL(__dec_zone_page_state);
  *     1       Overstepping half of threshold
  *     -1      Overstepping minus half of threshold
 */
-static inline void mod_state(struct zone *zone, enum zone_stat_item item,
-			     long delta, int overstep_mode)
+static inline void mod_zone_state(struct zone *zone,
+       enum zone_stat_item item, long delta, int overstep_mode)
 {
 	struct per_cpu_pageset __percpu *pcp = zone->pageset;
 	s8 __percpu *p = pcp->vm_stat_diff + item;
@@ -359,26 +442,88 @@ static inline void mod_state(struct zone *zone, enum zone_stat_item item,
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 			 long delta)
 {
-	mod_state(zone, item, delta, 0);
+	mod_zone_state(zone, item, delta, 0);
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
 void inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	mod_state(zone, item, 1, 1);
+	mod_zone_state(zone, item, 1, 1);
 }
 
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	mod_state(page_zone(page), item, 1, 1);
+	mod_zone_state(page_zone(page), item, 1, 1);
 }
 EXPORT_SYMBOL(inc_zone_page_state);
 
 void dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	mod_state(page_zone(page), item, -1, -1);
+	mod_zone_state(page_zone(page), item, -1, -1);
 }
 EXPORT_SYMBOL(dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+       enum node_stat_item item, int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to node counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a node.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (n > t || n < -t) {
+			int os = overstep_mode * (t >> 1) ;
+
+			/* Overflow must be added to node counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
 #else
 /*
  * Use interrupt disable to serialize counter updates
@@ -424,21 +569,69 @@ void dec_zone_page_state(struct page *page, enum zone_stat_item item)
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(dec_zone_page_state);
-#endif
 
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_node_state(pgdat, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_node_state);
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_node_page_state(pgdat, item, delta);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	unsigned long flags;
+	struct pglist_data *pgdat;
+
+	pgdat = page_pgdat(page);
+	local_irq_save(flags);
+	__inc_node_state(pgdat, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_node_page_state(page, item);
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+#endif
 
 /*
  * Fold a differential into the global counters.
  * Returns the number of counters updated.
  */
-static int fold_diff(int *diff)
+static int fold_diff(int *zone_diff, int *node_diff)
 {
 	int i;
 	int changes = 0;
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (diff[i]) {
-			atomic_long_add(diff[i], &vm_stat[i]);
+		if (zone_diff[i]) {
+			atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
+			changes++;
+	}
+
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		if (node_diff[i]) {
+			atomic_long_add(node_diff[i], &vm_node_stat[i]);
 			changes++;
 	}
 	return changes;
@@ -462,9 +655,11 @@ static int fold_diff(int *diff)
  */
 static int refresh_cpu_vm_stats(bool do_pagesets)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int i;
-	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
 	for_each_populated_zone(zone) {
@@ -477,7 +672,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			if (v) {
 
 				atomic_long_add(v, &zone->vm_stat[i]);
-				global_diff[i] += v;
+				global_zone_diff[i] += v;
 #ifdef CONFIG_NUMA
 				/* 3 seconds idle till flush */
 				__this_cpu_write(p->expire, 3);
@@ -516,7 +711,22 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		}
 #endif
 	}
-	changes += fold_diff(global_diff);
+
+	for_each_online_pgdat(pgdat) {
+		struct per_cpu_nodestat __percpu *p = pgdat->per_cpu_nodestats;
+
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+			int v;
+
+			v = this_cpu_xchg(p->vm_node_stat_diff[i], 0);
+			if (v) {
+				atomic_long_add(v, &pgdat->vm_stat[i]);
+				global_node_diff[i] += v;
+			}
+		}
+	}
+
+	changes += fold_diff(global_zone_diff, global_node_diff);
 	return changes;
 }
 
@@ -527,9 +737,11 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
  */
 void cpu_vm_stats_fold(int cpu)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int i;
-	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
@@ -543,11 +755,27 @@ void cpu_vm_stats_fold(int cpu)
 				v = p->vm_stat_diff[i];
 				p->vm_stat_diff[i] = 0;
 				atomic_long_add(v, &zone->vm_stat[i]);
-				global_diff[i] += v;
+				global_zone_diff[i] += v;
 			}
 	}
 
-	fold_diff(global_diff);
+	for_each_online_pgdat(pgdat) {
+		struct per_cpu_nodestat *p;
+
+		p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
+
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+			if (p->vm_node_stat_diff[i]) {
+				int v;
+
+				v = p->vm_node_stat_diff[i];
+				p->vm_node_stat_diff[i] = 0;
+				atomic_long_add(v, &pgdat->vm_stat[i]);
+				global_node_diff[i] += v;
+			}
+	}
+
+	fold_diff(global_zone_diff, global_node_diff);
 }
 
 /*
@@ -563,16 +791,19 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 			int v = pset->vm_stat_diff[i];
 			pset->vm_stat_diff[i] = 0;
 			atomic_long_add(v, &zone->vm_stat[i]);
-			atomic_long_add(v, &vm_stat[i]);
+			atomic_long_add(v, &vm_zone_stat[i]);
 		}
 }
 #endif
 
 #ifdef CONFIG_NUMA
 /*
- * Determine the per node value of a stat item.
+ * Determine the per node value of a stat item. This function
+ * is called frequently in a NUMA machine, so try to be as
+ * frugal as possible.
  */
-unsigned long node_page_state(int node, enum zone_stat_item item)
+unsigned long sum_zone_node_page_state(int node,
+				 enum zone_stat_item item)
 {
 	struct zone *zones = NODE_DATA(node)->node_zones;
 	int i;
@@ -584,6 +815,19 @@ unsigned long node_page_state(int node, enum zone_stat_item item)
 	return count;
 }
 
+/*
+ * Determine the per node value of a stat item.
+ */
+unsigned long node_page_state(struct pglist_data *pgdat,
+				enum node_stat_item item)
+{
+	long x = atomic_long_read(&pgdat->vm_stat[item]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
 #endif
 
 #ifdef CONFIG_COMPACTION
@@ -1287,6 +1531,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	if (*pos >= ARRAY_SIZE(vmstat_text))
 		return NULL;
 	stat_items_size = NR_VM_ZONE_STAT_ITEMS * sizeof(unsigned long) +
+			  NR_VM_NODE_STAT_ITEMS * sizeof(unsigned long) +
 			  NR_VM_WRITEBACK_STAT_ITEMS * sizeof(unsigned long);
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
@@ -1301,6 +1546,10 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		v[i] = global_page_state(i);
 	v += NR_VM_ZONE_STAT_ITEMS;
 
+	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+		v[i] = global_node_page_state(i);
+	v += NR_VM_NODE_STAT_ITEMS;
+
 	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
 			    v + NR_DIRTY_THRESHOLD);
 	v += NR_VM_WRITEBACK_STAT_ITEMS;
@@ -1390,7 +1639,7 @@ int vmstat_refresh(struct ctl_table *table, int write,
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
-		val = atomic_long_read(&vm_stat[i]);
+		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
 			switch (i) {
 			case NR_ALLOC_BATCH:
diff --git a/mm/workingset.c b/mm/workingset.c
index 8252de4566e9..ba972ac2dfdd 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -351,12 +351,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
 	local_irq_enable();
 
-	if (memcg_kmem_enabled())
+	if (memcg_kmem_enabled()) {
 		pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
 						     LRU_ALL_FILE);
-	else
-		pages = node_page_state(sc->nid, NR_ACTIVE_FILE) +
-			node_page_state(sc->nid, NR_INACTIVE_FILE);
+	} else {
+		pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
+			sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
+	}
 
 	/*
 	 * Active cache pages are limited to 50% of memory, and shadow
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 02/34] mm, vmscan: move lru_lock to the node
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Node-based reclaim requires node-based LRUs and locking.  This is a
preparation patch that just moves the lru_lock to the node so later
patches are easier to review.  It is a mechanical change but note this
patch makes contention worse because the LRU lock is hotter and direct
reclaim and kswapd can contend on the same lock even when reclaiming from
different zones.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/cgroup-v1/memcg_test.txt |  4 +--
 Documentation/cgroup-v1/memory.txt     |  4 +--
 include/linux/mm_types.h               |  2 +-
 include/linux/mmzone.h                 | 10 +++++--
 mm/compaction.c                        | 10 +++----
 mm/filemap.c                           |  4 +--
 mm/huge_memory.c                       |  6 ++---
 mm/memcontrol.c                        |  6 ++---
 mm/mlock.c                             | 10 +++----
 mm/page_alloc.c                        |  4 +--
 mm/page_idle.c                         |  4 +--
 mm/rmap.c                              |  2 +-
 mm/swap.c                              | 30 ++++++++++-----------
 mm/vmscan.c                            | 48 +++++++++++++++++-----------------
 14 files changed, 75 insertions(+), 69 deletions(-)

diff --git a/Documentation/cgroup-v1/memcg_test.txt b/Documentation/cgroup-v1/memcg_test.txt
index 8870b0212150..78a8c2963b38 100644
--- a/Documentation/cgroup-v1/memcg_test.txt
+++ b/Documentation/cgroup-v1/memcg_test.txt
@@ -107,9 +107,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global zone->lru_lock).
+	VM's control (means that it's handled under global zone_lru_lock).
 	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under zone->lru_lock().
+	list management functions under zone_lru_lock().
 
 	A special function is mem_cgroup_isolate_pages(). This scans
 	memcg's private LRU and call __isolate_lru_page() to extract a page
diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index b14abf217239..946e69103cdd 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
    Other lock order is following:
    PG_locked.
    mm->page_table_lock
-       zone->lru_lock
+       zone_lru_lock
 	  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  zone->lru_lock, it has no lock of its own.
+  zone_lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e093e1d3285b..ca2ed9a6c8d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -118,7 +118,7 @@ struct page {
 	 */
 	union {
 		struct list_head lru;	/* Pageout list, eg. active_list
-					 * protected by zone->lru_lock !
+					 * protected by zone_lru_lock !
 					 * Can be used as a generic list
 					 * by the page owner.
 					 */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 078ecb81e209..cfa870107abe 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -93,7 +93,7 @@ struct free_area {
 struct pglist_data;
 
 /*
- * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
@@ -496,7 +496,6 @@ struct zone {
 	/* Write-intensive fields used by page reclaim */
 
 	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
 	/*
@@ -690,6 +689,9 @@ typedef struct pglist_data {
 	/* Number of pages migrated during the rate limiting time interval */
 	unsigned long numabalancing_migrate_nr_pages;
 #endif
+	/* Write-intensive fields used by page reclaim */
+	ZONE_PADDING(_pad1_)
+	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
@@ -721,6 +723,10 @@ typedef struct pglist_data {
 
 #define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
+static inline spinlock_t *zone_lru_lock(struct zone *zone)
+{
+	return &zone->zone_pgdat->lru_lock;
+}
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
 {
diff --git a/mm/compaction.c b/mm/compaction.c
index 0bd53fb05162..7607efb7bee2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -752,7 +752,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 * if contended.
 		 */
 		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&zone->lru_lock, flags,
+		    && compact_unlock_should_abort(zone_lru_lock(zone), flags,
 								&locked, cc))
 			break;
 
@@ -813,7 +813,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&zone->lru_lock,
+					spin_unlock_irqrestore(zone_lru_lock(zone),
 									flags);
 					locked = false;
 				}
@@ -836,7 +836,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
-			locked = compact_trylock_irqsave(&zone->lru_lock,
+			locked = compact_trylock_irqsave(zone_lru_lock(zone),
 								&flags, cc);
 			if (!locked)
 				break;
@@ -899,7 +899,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&zone->lru_lock,	flags);
+				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 				locked = false;
 			}
 			acct_isolated(zone, cc);
@@ -927,7 +927,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		low_pfn = end_pfn;
 
 	if (locked)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 
 	/*
 	 * Update the pageblock-skip information and cached scanner pfn,
diff --git a/mm/filemap.c b/mm/filemap.c
index e90c1543ec2d..7ec50bd6f88c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -95,8 +95,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->tree_lock		(try_to_unmap_one)
- *    ->zone.lru_lock		(follow_page->mark_page_accessed)
- *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->zone_lru_lock(zone)	(follow_page->mark_page_accessed)
+ *    ->zone_lru_lock(zone)	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 848c16caf8f8..2f997328ae64 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1860,7 +1860,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		spin_unlock(&head->mapping->tree_lock);
 	}
 
-	spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
+	spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
 
 	unfreeze_page(head);
 
@@ -2046,7 +2046,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		lru_add_drain();
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&page_zone(head)->lru_lock, flags);
+	spin_lock_irqsave(zone_lru_lock(page_zone(head)), flags);
 
 	if (mapping) {
 		void **pslot;
@@ -2089,7 +2089,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&pgdata->split_queue_lock);
 fail:		if (mapping)
 			spin_unlock(&mapping->tree_lock);
-		spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
 		unfreeze_page(head);
 		ret = -EBUSY;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40dfca3ef4bb..9b70f9ca8ddf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2065,7 +2065,7 @@ static void lock_page_lru(struct page *page, int *isolated)
 {
 	struct zone *zone = page_zone(page);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
@@ -2089,7 +2089,7 @@ static void unlock_page_lru(struct page *page, int isolated)
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
@@ -2389,7 +2389,7 @@ void memcg_kmem_uncharge(struct page *page, int order)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * zone->lru_lock and migration entries setup in all page mappings.
+ * zone_lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index ef8dc9f395c4..997f63082ff5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -188,7 +188,7 @@ unsigned int munlock_vma_page(struct page *page)
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	nr_pages = hpage_nr_pages(page);
 	if (!TestClearPageMlocked(page))
@@ -197,14 +197,14 @@ unsigned int munlock_vma_page(struct page *page)
 	__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
 
 	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irq(zone_lru_lock(zone));
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 out:
 	return nr_pages - 1;
@@ -289,7 +289,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	pagevec_init(&pvec_putback, 0);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -315,7 +315,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	}
 	delta_munlocked = -nr + pagevec_count(&pvec_putback);
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34e46c02a406..48b5414009ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5947,6 +5947,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->kcompactd_wait);
 #endif
 	pgdat_page_ext_init(pgdat);
+	spin_lock_init(&pgdat->lru_lock);
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
@@ -6001,10 +6002,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
+		zone->zone_pgdat = pgdat;
 		spin_lock_init(&zone->lock);
-		spin_lock_init(&zone->lru_lock);
 		zone_seqlock_init(zone);
-		zone->zone_pgdat = pgdat;
 		zone_pcp_init(zone);
 
 		/* For bootup, initialized properly in watermark setup */
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 4ea9c4ef5146..ae11aa914e55 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -41,12 +41,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
 		return NULL;
 
 	zone = page_zone(page);
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 	return page;
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 256e585c67ef..573253efb645 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
  *         mapping->i_mmap_rwsem
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               zone->lru_lock (in mark_page_accessed, isolate_lru_page)
+ *               zone_lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/swap.c b/mm/swap.c
index 616df4ddd870..bf37e5cfae81 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,12 +62,12 @@ static void __page_cache_release(struct page *page)
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&zone->lru_lock, flags);
+		spin_lock_irqsave(zone_lru_lock(zone), flags);
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 	}
 	mem_cgroup_uncharge(page);
 }
@@ -189,16 +189,16 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 		if (pagezone != zone) {
 			if (zone)
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 			zone = pagezone;
-			spin_lock_irqsave(&zone->lru_lock, flags);
+			spin_lock_irqsave(zone_lru_lock(zone), flags);
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		(*move_fn)(page, lruvec, arg);
 	}
 	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 	release_pages(pvec->pages, pvec->nr, pvec->cold);
 	pagevec_reinit(pvec);
 }
@@ -318,9 +318,9 @@ void activate_page(struct page *page)
 	struct zone *zone = page_zone(page);
 
 	page = compound_head(page);
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 }
 #endif
 
@@ -448,13 +448,13 @@ void add_page_to_unevictable_list(struct page *page)
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	lruvec = mem_cgroup_page_lruvec(page, zone);
 	ClearPageActive(page);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 }
 
 /**
@@ -744,7 +744,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 		 * same zone. The lock is held only if zone != NULL.
 		 */
 		if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 			zone = NULL;
 		}
 
@@ -759,7 +759,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 
 		if (PageCompound(page)) {
 			if (zone) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 				zone = NULL;
 			}
 			__put_compound_page(page);
@@ -771,11 +771,11 @@ void release_pages(struct page **pages, int nr, bool cold)
 
 			if (pagezone != zone) {
 				if (zone)
-					spin_unlock_irqrestore(&zone->lru_lock,
+					spin_unlock_irqrestore(zone_lru_lock(zone),
 									flags);
 				lock_batch = 0;
 				zone = pagezone;
-				spin_lock_irqsave(&zone->lru_lock, flags);
+				spin_lock_irqsave(zone_lru_lock(zone), flags);
 			}
 
 			lruvec = mem_cgroup_page_lruvec(page, zone);
@@ -790,7 +790,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_hot_cold_page_list(&pages_to_free, cold);
@@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
 	VM_BUG_ON(NR_CPUS != 1 &&
-		  !spin_is_locked(&lruvec_zone(lruvec)->lru_lock));
+		  !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
 
 	if (!list)
 		SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 21d417ccff69..e7ffcd259cc4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1349,7 +1349,7 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 }
 
 /*
- * zone->lru_lock is heavily contended.  Some of the functions that
+ * zone_lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
@@ -1444,7 +1444,7 @@ int isolate_lru_page(struct page *page)
 		struct zone *zone = page_zone(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&zone->lru_lock);
+		spin_lock_irq(zone_lru_lock(zone));
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
@@ -1453,7 +1453,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irq(zone_lru_lock(zone));
 	}
 	return ret;
 }
@@ -1512,9 +1512,9 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&zone->lru_lock);
+			spin_unlock_irq(zone_lru_lock(zone));
 			putback_lru_page(page);
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irq(zone_lru_lock(zone));
 			continue;
 		}
 
@@ -1535,10 +1535,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irq(zone_lru_lock(zone));
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				spin_lock_irq(zone_lru_lock(zone));
 			} else
 				list_add(&page->lru, &pages_to_free);
 		}
@@ -1600,7 +1600,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, lru);
@@ -1616,7 +1616,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		else
 			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	if (nr_taken == 0)
 		return 0;
@@ -1626,7 +1626,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 				&nr_writeback, &nr_immediate,
 				false);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	if (global_reclaim(sc)) {
 		if (current_is_kswapd())
@@ -1641,7 +1641,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_hot_cold_page_list(&page_list, true);
@@ -1715,9 +1715,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone->lru_lock across the whole operation.  But if
+ * appropriate to hold zone_lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone->lru_lock around each page.  It's impossible to balance
+ * should drop zone_lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
@@ -1754,10 +1754,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irq(zone_lru_lock(zone));
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				spin_lock_irq(zone_lru_lock(zone));
 			} else
 				list_add(&page->lru, pages_to_free);
 		}
@@ -1792,7 +1792,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
@@ -1805,7 +1805,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
 	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
 
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -1850,7 +1850,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -1862,7 +1862,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	mem_cgroup_uncharge_list(&l_hold);
 	free_hot_cold_page_list(&l_hold, true);
@@ -2077,7 +2077,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2098,7 +2098,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -3791,9 +3791,9 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 		pagezone = page_zone(page);
 		if (pagezone != zone) {
 			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irq(zone_lru_lock(zone));
 			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irq(zone_lru_lock(zone));
 		}
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 
@@ -3814,7 +3814,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 	if (zone) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irq(zone_lru_lock(zone));
 	}
 }
 #endif /* CONFIG_SHMEM */
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 02/34] mm, vmscan: move lru_lock to the node
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Node-based reclaim requires node-based LRUs and locking.  This is a
preparation patch that just moves the lru_lock to the node so later
patches are easier to review.  It is a mechanical change but note this
patch makes contention worse because the LRU lock is hotter and direct
reclaim and kswapd can contend on the same lock even when reclaiming from
different zones.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/cgroup-v1/memcg_test.txt |  4 +--
 Documentation/cgroup-v1/memory.txt     |  4 +--
 include/linux/mm_types.h               |  2 +-
 include/linux/mmzone.h                 | 10 +++++--
 mm/compaction.c                        | 10 +++----
 mm/filemap.c                           |  4 +--
 mm/huge_memory.c                       |  6 ++---
 mm/memcontrol.c                        |  6 ++---
 mm/mlock.c                             | 10 +++----
 mm/page_alloc.c                        |  4 +--
 mm/page_idle.c                         |  4 +--
 mm/rmap.c                              |  2 +-
 mm/swap.c                              | 30 ++++++++++-----------
 mm/vmscan.c                            | 48 +++++++++++++++++-----------------
 14 files changed, 75 insertions(+), 69 deletions(-)

diff --git a/Documentation/cgroup-v1/memcg_test.txt b/Documentation/cgroup-v1/memcg_test.txt
index 8870b0212150..78a8c2963b38 100644
--- a/Documentation/cgroup-v1/memcg_test.txt
+++ b/Documentation/cgroup-v1/memcg_test.txt
@@ -107,9 +107,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global zone->lru_lock).
+	VM's control (means that it's handled under global zone_lru_lock).
 	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under zone->lru_lock().
+	list management functions under zone_lru_lock().
 
 	A special function is mem_cgroup_isolate_pages(). This scans
 	memcg's private LRU and call __isolate_lru_page() to extract a page
diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index b14abf217239..946e69103cdd 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
    Other lock order is following:
    PG_locked.
    mm->page_table_lock
-       zone->lru_lock
+       zone_lru_lock
 	  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  zone->lru_lock, it has no lock of its own.
+  zone_lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e093e1d3285b..ca2ed9a6c8d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -118,7 +118,7 @@ struct page {
 	 */
 	union {
 		struct list_head lru;	/* Pageout list, eg. active_list
-					 * protected by zone->lru_lock !
+					 * protected by zone_lru_lock !
 					 * Can be used as a generic list
 					 * by the page owner.
 					 */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 078ecb81e209..cfa870107abe 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -93,7 +93,7 @@ struct free_area {
 struct pglist_data;
 
 /*
- * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
@@ -496,7 +496,6 @@ struct zone {
 	/* Write-intensive fields used by page reclaim */
 
 	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
 	/*
@@ -690,6 +689,9 @@ typedef struct pglist_data {
 	/* Number of pages migrated during the rate limiting time interval */
 	unsigned long numabalancing_migrate_nr_pages;
 #endif
+	/* Write-intensive fields used by page reclaim */
+	ZONE_PADDING(_pad1_)
+	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
@@ -721,6 +723,10 @@ typedef struct pglist_data {
 
 #define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
+static inline spinlock_t *zone_lru_lock(struct zone *zone)
+{
+	return &zone->zone_pgdat->lru_lock;
+}
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
 {
diff --git a/mm/compaction.c b/mm/compaction.c
index 0bd53fb05162..7607efb7bee2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -752,7 +752,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 * if contended.
 		 */
 		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&zone->lru_lock, flags,
+		    && compact_unlock_should_abort(zone_lru_lock(zone), flags,
 								&locked, cc))
 			break;
 
@@ -813,7 +813,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&zone->lru_lock,
+					spin_unlock_irqrestore(zone_lru_lock(zone),
 									flags);
 					locked = false;
 				}
@@ -836,7 +836,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
-			locked = compact_trylock_irqsave(&zone->lru_lock,
+			locked = compact_trylock_irqsave(zone_lru_lock(zone),
 								&flags, cc);
 			if (!locked)
 				break;
@@ -899,7 +899,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&zone->lru_lock,	flags);
+				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 				locked = false;
 			}
 			acct_isolated(zone, cc);
@@ -927,7 +927,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		low_pfn = end_pfn;
 
 	if (locked)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 
 	/*
 	 * Update the pageblock-skip information and cached scanner pfn,
diff --git a/mm/filemap.c b/mm/filemap.c
index e90c1543ec2d..7ec50bd6f88c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -95,8 +95,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->tree_lock		(try_to_unmap_one)
- *    ->zone.lru_lock		(follow_page->mark_page_accessed)
- *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->zone_lru_lock(zone)	(follow_page->mark_page_accessed)
+ *    ->zone_lru_lock(zone)	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 848c16caf8f8..2f997328ae64 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1860,7 +1860,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		spin_unlock(&head->mapping->tree_lock);
 	}
 
-	spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
+	spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
 
 	unfreeze_page(head);
 
@@ -2046,7 +2046,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		lru_add_drain();
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&page_zone(head)->lru_lock, flags);
+	spin_lock_irqsave(zone_lru_lock(page_zone(head)), flags);
 
 	if (mapping) {
 		void **pslot;
@@ -2089,7 +2089,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&pgdata->split_queue_lock);
 fail:		if (mapping)
 			spin_unlock(&mapping->tree_lock);
-		spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
 		unfreeze_page(head);
 		ret = -EBUSY;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40dfca3ef4bb..9b70f9ca8ddf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2065,7 +2065,7 @@ static void lock_page_lru(struct page *page, int *isolated)
 {
 	struct zone *zone = page_zone(page);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
@@ -2089,7 +2089,7 @@ static void unlock_page_lru(struct page *page, int isolated)
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
@@ -2389,7 +2389,7 @@ void memcg_kmem_uncharge(struct page *page, int order)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * zone->lru_lock and migration entries setup in all page mappings.
+ * zone_lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index ef8dc9f395c4..997f63082ff5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -188,7 +188,7 @@ unsigned int munlock_vma_page(struct page *page)
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	nr_pages = hpage_nr_pages(page);
 	if (!TestClearPageMlocked(page))
@@ -197,14 +197,14 @@ unsigned int munlock_vma_page(struct page *page)
 	__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
 
 	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irq(zone_lru_lock(zone));
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 out:
 	return nr_pages - 1;
@@ -289,7 +289,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	pagevec_init(&pvec_putback, 0);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -315,7 +315,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	}
 	delta_munlocked = -nr + pagevec_count(&pvec_putback);
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34e46c02a406..48b5414009ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5947,6 +5947,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->kcompactd_wait);
 #endif
 	pgdat_page_ext_init(pgdat);
+	spin_lock_init(&pgdat->lru_lock);
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
@@ -6001,10 +6002,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
+		zone->zone_pgdat = pgdat;
 		spin_lock_init(&zone->lock);
-		spin_lock_init(&zone->lru_lock);
 		zone_seqlock_init(zone);
-		zone->zone_pgdat = pgdat;
 		zone_pcp_init(zone);
 
 		/* For bootup, initialized properly in watermark setup */
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 4ea9c4ef5146..ae11aa914e55 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -41,12 +41,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
 		return NULL;
 
 	zone = page_zone(page);
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 	return page;
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 256e585c67ef..573253efb645 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
  *         mapping->i_mmap_rwsem
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               zone->lru_lock (in mark_page_accessed, isolate_lru_page)
+ *               zone_lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/swap.c b/mm/swap.c
index 616df4ddd870..bf37e5cfae81 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,12 +62,12 @@ static void __page_cache_release(struct page *page)
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&zone->lru_lock, flags);
+		spin_lock_irqsave(zone_lru_lock(zone), flags);
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 	}
 	mem_cgroup_uncharge(page);
 }
@@ -189,16 +189,16 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 		if (pagezone != zone) {
 			if (zone)
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 			zone = pagezone;
-			spin_lock_irqsave(&zone->lru_lock, flags);
+			spin_lock_irqsave(zone_lru_lock(zone), flags);
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		(*move_fn)(page, lruvec, arg);
 	}
 	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 	release_pages(pvec->pages, pvec->nr, pvec->cold);
 	pagevec_reinit(pvec);
 }
@@ -318,9 +318,9 @@ void activate_page(struct page *page)
 	struct zone *zone = page_zone(page);
 
 	page = compound_head(page);
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 }
 #endif
 
@@ -448,13 +448,13 @@ void add_page_to_unevictable_list(struct page *page)
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	lruvec = mem_cgroup_page_lruvec(page, zone);
 	ClearPageActive(page);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 }
 
 /**
@@ -744,7 +744,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 		 * same zone. The lock is held only if zone != NULL.
 		 */
 		if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
+			spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 			zone = NULL;
 		}
 
@@ -759,7 +759,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 
 		if (PageCompound(page)) {
 			if (zone) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 				zone = NULL;
 			}
 			__put_compound_page(page);
@@ -771,11 +771,11 @@ void release_pages(struct page **pages, int nr, bool cold)
 
 			if (pagezone != zone) {
 				if (zone)
-					spin_unlock_irqrestore(&zone->lru_lock,
+					spin_unlock_irqrestore(zone_lru_lock(zone),
 									flags);
 				lock_batch = 0;
 				zone = pagezone;
-				spin_lock_irqsave(&zone->lru_lock, flags);
+				spin_lock_irqsave(zone_lru_lock(zone), flags);
 			}
 
 			lruvec = mem_cgroup_page_lruvec(page, zone);
@@ -790,7 +790,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_hot_cold_page_list(&pages_to_free, cold);
@@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
 	VM_BUG_ON(NR_CPUS != 1 &&
-		  !spin_is_locked(&lruvec_zone(lruvec)->lru_lock));
+		  !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
 
 	if (!list)
 		SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 21d417ccff69..e7ffcd259cc4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1349,7 +1349,7 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 }
 
 /*
- * zone->lru_lock is heavily contended.  Some of the functions that
+ * zone_lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
@@ -1444,7 +1444,7 @@ int isolate_lru_page(struct page *page)
 		struct zone *zone = page_zone(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&zone->lru_lock);
+		spin_lock_irq(zone_lru_lock(zone));
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
@@ -1453,7 +1453,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irq(zone_lru_lock(zone));
 	}
 	return ret;
 }
@@ -1512,9 +1512,9 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&zone->lru_lock);
+			spin_unlock_irq(zone_lru_lock(zone));
 			putback_lru_page(page);
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irq(zone_lru_lock(zone));
 			continue;
 		}
 
@@ -1535,10 +1535,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irq(zone_lru_lock(zone));
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				spin_lock_irq(zone_lru_lock(zone));
 			} else
 				list_add(&page->lru, &pages_to_free);
 		}
@@ -1600,7 +1600,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, lru);
@@ -1616,7 +1616,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		else
 			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
 	}
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	if (nr_taken == 0)
 		return 0;
@@ -1626,7 +1626,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 				&nr_writeback, &nr_immediate,
 				false);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	if (global_reclaim(sc)) {
 		if (current_is_kswapd())
@@ -1641,7 +1641,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_hot_cold_page_list(&page_list, true);
@@ -1715,9 +1715,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone->lru_lock across the whole operation.  But if
+ * appropriate to hold zone_lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone->lru_lock around each page.  It's impossible to balance
+ * should drop zone_lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
@@ -1754,10 +1754,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irq(zone_lru_lock(zone));
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&zone->lru_lock);
+				spin_lock_irq(zone_lru_lock(zone));
 			} else
 				list_add(&page->lru, pages_to_free);
 		}
@@ -1792,7 +1792,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
@@ -1805,7 +1805,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
 	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
 
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -1850,7 +1850,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -1862,7 +1862,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	mem_cgroup_uncharge_list(&l_hold);
 	free_hot_cold_page_list(&l_hold, true);
@@ -2077,7 +2077,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(zone_lru_lock(zone));
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2098,7 +2098,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(zone_lru_lock(zone));
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -3791,9 +3791,9 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 		pagezone = page_zone(page);
 		if (pagezone != zone) {
 			if (zone)
-				spin_unlock_irq(&zone->lru_lock);
+				spin_unlock_irq(zone_lru_lock(zone));
 			zone = pagezone;
-			spin_lock_irq(&zone->lru_lock);
+			spin_lock_irq(zone_lru_lock(zone));
 		}
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 
@@ -3814,7 +3814,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 	if (zone) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&zone->lru_lock);
+		spin_unlock_irq(zone_lru_lock(zone));
 	}
 }
 #endif /* CONFIG_SHMEM */
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 03/34] mm, vmscan: move LRU lists to node
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

This moves the LRU lists from the zone to the node and related data
such as counters, tracing, congestion tracking and writeback tracking.
Unfortunately, due to reclaim and compaction retry logic, it is necessary
to account for the number of LRU pages on both zone and node logic.
Most reclaim logic is based on the node counters but the retry logic uses
the zone counters which do not distinguish inactive and active sizes.
It would be possible to leave the LRU counters on a per-zone basis but
it's a heavier calculation across multiple cache lines that is much more
frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but using per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed later
but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU. It potentially
   could be implemented similar to the UNEVICTABLE list.

   That would reduce the skip rate with the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/tile/mm/pgtable.c                    |   8 +-
 drivers/base/node.c                       |  19 +--
 drivers/staging/android/lowmemorykiller.c |   8 +-
 include/linux/backing-dev.h               |   2 +-
 include/linux/memcontrol.h                |  18 +--
 include/linux/mm_inline.h                 |  21 ++-
 include/linux/mmzone.h                    |  68 +++++----
 include/linux/swap.h                      |   1 +
 include/linux/vm_event_item.h             |  10 +-
 include/linux/vmstat.h                    |  17 +++
 include/trace/events/vmscan.h             |  12 +-
 kernel/power/snapshot.c                   |  10 +-
 mm/backing-dev.c                          |  15 +-
 mm/compaction.c                           |  18 +--
 mm/huge_memory.c                          |   2 +-
 mm/internal.h                             |   2 +-
 mm/khugepaged.c                           |   4 +-
 mm/memcontrol.c                           |  17 +--
 mm/memory-failure.c                       |   4 +-
 mm/memory_hotplug.c                       |   2 +-
 mm/mempolicy.c                            |   2 +-
 mm/migrate.c                              |  21 +--
 mm/mlock.c                                |   2 +-
 mm/page-writeback.c                       |   8 +-
 mm/page_alloc.c                           |  68 +++++----
 mm/swap.c                                 |  50 +++----
 mm/vmscan.c                               | 226 +++++++++++++++++-------------
 mm/vmstat.c                               |  47 ++++---
 mm/workingset.c                           |   4 +-
 29 files changed, 386 insertions(+), 300 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index c4d5bf841a7f..9e389213580d 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
 	struct zone *zone;
 
 	pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
-	       (global_page_state(NR_ACTIVE_ANON) +
-		global_page_state(NR_ACTIVE_FILE)),
-	       (global_page_state(NR_INACTIVE_ANON) +
-		global_page_state(NR_INACTIVE_FILE)),
+	       (global_node_page_state(NR_ACTIVE_ANON) +
+		global_node_page_state(NR_ACTIVE_FILE)),
+	       (global_node_page_state(NR_INACTIVE_ANON) +
+		global_node_page_state(NR_INACTIVE_FILE)),
 	       global_page_state(NR_FILE_DIRTY),
 	       global_page_state(NR_WRITEBACK),
 	       global_page_state(NR_UNSTABLE_NFS),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 92d8e090c5b3..b7f01a4a642d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 {
 	int n;
 	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct sysinfo i;
 
 	si_meminfo_node(&i, nid);
@@ -74,15 +75,15 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
-				sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
-				sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
-		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
+		       nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
+				node_page_state(pgdat, NR_ACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
+				node_page_state(pgdat, NR_INACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
+		       nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
+		       nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
 		       nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 24d2745e9437..93dbcc38eb0f 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -72,10 +72,10 @@ static unsigned long lowmem_deathpending_timeout;
 static unsigned long lowmem_count(struct shrinker *s,
 				  struct shrink_control *sc)
 {
-	return global_page_state(NR_ACTIVE_ANON) +
-		global_page_state(NR_ACTIVE_FILE) +
-		global_page_state(NR_INACTIVE_ANON) +
-		global_page_state(NR_INACTIVE_FILE);
+	return global_node_page_state(NR_ACTIVE_ANON) +
+		global_node_page_state(NR_ACTIVE_FILE) +
+		global_node_page_state(NR_INACTIVE_ANON) +
+		global_node_page_state(NR_INACTIVE_FILE);
 }
 
 static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c82794f20110..491a91717788 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -197,7 +197,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 }
 
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(struct zone *zone, int sync, long timeout);
+long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 104efa6874db..68f1121c8fe7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -340,7 +340,7 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = &zone->lruvec;
+		lruvec = zone_lruvec(zone);
 		goto out;
 	}
 
@@ -349,15 +349,15 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 out:
 	/*
 	 * Since a node can be onlined after the mem_cgroup was created,
-	 * we have to be prepared to initialize lruvec->zone here;
+	 * we have to be prepared to initialize lruvec->pgdat here;
 	 * and if offlined then reonlined, we need to reinitialize it.
 	 */
-	if (unlikely(lruvec->zone != zone))
-		lruvec->zone = zone;
+	if (unlikely(lruvec->pgdat != zone->zone_pgdat))
+		lruvec->pgdat = zone->zone_pgdat;
 	return lruvec;
 }
 
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -438,7 +438,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
 
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
-		int nr_pages);
+		enum zone_type zid, int nr_pages);
 
 unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 					   int nid, unsigned int lru_mask);
@@ -613,13 +613,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 						    struct mem_cgroup *memcg)
 {
-	return &zone->lruvec;
+	return zone_lruvec(zone);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
-						    struct zone *zone)
+						    struct pglist_data *pgdat)
 {
-	return &zone->lruvec;
+	return &pgdat->lruvec;
 }
 
 static inline bool mm_match_cgroup(struct mm_struct *mm,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 5bd29ba4f174..9aadcc781857 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -23,25 +23,32 @@ static inline int page_is_file_cache(struct page *page)
 }
 
 static __always_inline void __update_lru_size(struct lruvec *lruvec,
-				enum lru_list lru, int nr_pages)
+				enum lru_list lru, enum zone_type zid,
+				int nr_pages)
 {
-	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
+	__mod_zone_page_state(&pgdat->node_zones[zid],
+		NR_ZONE_LRU_BASE + !!is_file_lru(lru),
+		nr_pages);
 }
 
 static __always_inline void update_lru_size(struct lruvec *lruvec,
-				enum lru_list lru, int nr_pages)
+				enum lru_list lru, enum zone_type zid,
+				int nr_pages)
 {
 #ifdef CONFIG_MEMCG
-	mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+	mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
 #else
-	__update_lru_size(lruvec, lru, nr_pages);
+	__update_lru_size(lruvec, lru, zid, nr_pages);
 #endif
 }
 
 static __always_inline void add_page_to_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	update_lru_size(lruvec, lru, hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
@@ -49,7 +56,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
 	list_del(&page->lru);
-	update_lru_size(lruvec, lru, -hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
 }
 
 /**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cfa870107abe..d4f5cac0a8c3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -111,12 +111,9 @@ enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
 	NR_ALLOC_BATCH,
-	NR_LRU_BASE,
-	NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
-	NR_ACTIVE_ANON,		/*  "     "     "   "       "         */
-	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
-	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
-	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
+	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
+	NR_ZONE_LRU_FILE,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
@@ -134,12 +131,9 @@ enum zone_stat_item {
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
-	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
-	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
-	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	NR_ZSPAGES,		/* allocated in zsmalloc */
 #endif
@@ -161,6 +155,15 @@ enum zone_stat_item {
 	NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
+	NR_LRU_BASE,
+	NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
+	NR_ACTIVE_ANON,		/*  "     "     "   "       "         */
+	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
+	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
+	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
+	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
+	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 	NR_VM_NODE_STAT_ITEMS
 };
 
@@ -219,7 +222,7 @@ struct lruvec {
 	/* Evictions & activations on the inactive file list */
 	atomic_long_t			inactive_age;
 #ifdef CONFIG_MEMCG
-	struct zone			*zone;
+	struct pglist_data *pgdat;
 #endif
 };
 
@@ -357,13 +360,6 @@ struct zone {
 #ifdef CONFIG_NUMA
 	int node;
 #endif
-
-	/*
-	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
-	 * this zone's LRU.  Maintained by the pageout code.
-	 */
-	unsigned int inactive_ratio;
-
 	struct pglist_data	*zone_pgdat;
 	struct per_cpu_pageset __percpu *pageset;
 
@@ -495,9 +491,6 @@ struct zone {
 
 	/* Write-intensive fields used by page reclaim */
 
-	/* Fields commonly accessed by the page reclaim scanner */
-	struct lruvec		lruvec;
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
@@ -537,17 +530,20 @@ struct zone {
 
 enum zone_flags {
 	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
-	ZONE_CONGESTED,			/* zone has many dirty pages backed by
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
+};
+
+enum pgdat_flags {
+	PGDAT_CONGESTED,		/* pgdat has many dirty pages backed by
 					 * a congested BDI
 					 */
-	ZONE_DIRTY,			/* reclaim scanning has recently found
+	PGDAT_DIRTY,			/* reclaim scanning has recently found
 					 * many dirty file pages at the tail
 					 * of the LRU.
 					 */
-	ZONE_WRITEBACK,			/* reclaim scanning has recently found
+	PGDAT_WRITEBACK,		/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
-	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -707,6 +703,19 @@ typedef struct pglist_data {
 	unsigned long split_queue_len;
 #endif
 
+	/* Fields commonly accessed by the page reclaim scanner */
+	struct lruvec		lruvec;
+
+	/*
+	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
+	 * this node's LRU.  Maintained by the pageout code.
+	 */
+	unsigned int inactive_ratio;
+
+	unsigned long		flags;
+
+	ZONE_PADDING(_pad2_)
+
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
@@ -728,6 +737,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
 	return &zone->zone_pgdat->lru_lock;
 }
 
+static inline struct lruvec *zone_lruvec(struct zone *zone)
+{
+	return &zone->zone_pgdat->lruvec;
+}
+
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
 {
 	return pgdat->node_start_pfn + pgdat->node_spanned_pages;
@@ -779,12 +793,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
 
 extern void lruvec_init(struct lruvec *lruvec);
 
-static inline struct zone *lruvec_zone(struct lruvec *lruvec)
+static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
 {
 #ifdef CONFIG_MEMCG
-	return lruvec->zone;
+	return lruvec->pgdat;
 #else
-	return container_of(lruvec, struct zone, lruvec);
+	return container_of(lruvec, struct pglist_data, lruvec);
 #endif
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0af2bb2028fd..c82f916008b7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
+extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 42604173f122..1798ff542517 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -26,11 +26,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
-		FOR_ALL_ZONES(PGREFILL),
-		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
-		FOR_ALL_ZONES(PGSTEAL_DIRECT),
-		FOR_ALL_ZONES(PGSCAN_KSWAPD),
-		FOR_ALL_ZONES(PGSCAN_DIRECT),
+		PGREFILL,
+		PGSTEAL_KSWAPD,
+		PGSTEAL_DIRECT,
+		PGSCAN_KSWAPD,
+		PGSCAN_DIRECT,
 		PGSCAN_DIRECT_THROTTLE,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d1744aa3ab9c..fee321c98550 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -178,6 +178,23 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 	return x;
 }
 
+static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
+					enum node_stat_item item)
+{
+	long x = atomic_long_read(&pgdat->vm_stat[item]);
+
+#ifdef CONFIG_SMP
+	int cpu;
+	for_each_online_cpu(cpu)
+		x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item];
+
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
+
 #ifdef CONFIG_NUMA
 extern unsigned long sum_zone_node_page_state(int node,
 						enum zone_stat_item item);
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 0101ef37f1ee..897f1aa1ee5f 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -352,15 +352,14 @@ TRACE_EVENT(mm_vmscan_writepage,
 
 TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 
-	TP_PROTO(struct zone *zone,
+	TP_PROTO(int nid,
 		unsigned long nr_scanned, unsigned long nr_reclaimed,
 		int priority, int file),
 
-	TP_ARGS(zone, nr_scanned, nr_reclaimed, priority, file),
+	TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
 
 	TP_STRUCT__entry(
 		__field(int, nid)
-		__field(int, zid)
 		__field(unsigned long, nr_scanned)
 		__field(unsigned long, nr_reclaimed)
 		__field(int, priority)
@@ -368,16 +367,15 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 	),
 
 	TP_fast_assign(
-		__entry->nid = zone_to_nid(zone);
-		__entry->zid = zone_idx(zone);
+		__entry->nid = nid;
 		__entry->nr_scanned = nr_scanned;
 		__entry->nr_reclaimed = nr_reclaimed;
 		__entry->priority = priority;
 		__entry->reclaim_flags = trace_shrink_flags(file);
 	),
 
-	TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
-		__entry->nid, __entry->zid,
+	TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+		__entry->nid,
 		__entry->nr_scanned, __entry->nr_reclaimed,
 		__entry->priority,
 		show_reclaim_flags(__entry->reclaim_flags))
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 3a970604308f..24a06bc23f85 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1525,11 +1525,11 @@ static unsigned long minimum_image_size(unsigned long saveable)
 	unsigned long size;
 
 	size = global_page_state(NR_SLAB_RECLAIMABLE)
-		+ global_page_state(NR_ACTIVE_ANON)
-		+ global_page_state(NR_INACTIVE_ANON)
-		+ global_page_state(NR_ACTIVE_FILE)
-		+ global_page_state(NR_INACTIVE_FILE)
-		- global_page_state(NR_FILE_MAPPED);
+		+ global_node_page_state(NR_ACTIVE_ANON)
+		+ global_node_page_state(NR_INACTIVE_ANON)
+		+ global_node_page_state(NR_ACTIVE_FILE)
+		+ global_node_page_state(NR_INACTIVE_FILE)
+		- global_node_page_state(NR_FILE_MAPPED);
 
 	return saveable <= size ? 0 : saveable - size;
 }
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index f53b23ab7ed7..a8c3af46bd3d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -982,24 +982,24 @@ long congestion_wait(int sync, long timeout)
 EXPORT_SYMBOL(congestion_wait);
 
 /**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
- * @zone: A zone to check if it is heavily congested
+ * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
+ * @pgdat: A pgdat to check if it is heavily congested
  * @sync: SYNC or ASYNC IO
  * @timeout: timeout in jiffies
  *
  * In the event of a congested backing_dev (any backing_dev) and the given
- * @zone has experienced recent congestion, this waits for up to @timeout
+ * @pgdat has experienced recent congestion, this waits for up to @timeout
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
+ * In the absence of pgdat congestion, cond_resched() is called to yield
  * the processor if necessary but otherwise does not sleep.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
  * returned. return_value == timeout implies the function did not sleep.
  */
-long wait_iff_congested(struct zone *zone, int sync, long timeout)
+long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
@@ -1008,12 +1008,13 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 
 	/*
 	 * If there is no congestion, or heavy congestion is not being
-	 * encountered in the current zone, yield if necessary instead
+	 * encountered in the current pgdat, yield if necessary instead
 	 * of sleeping on the congestion queue
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
-	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
+	    !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
 		cond_resched();
+
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
 		if (ret < 0)
diff --git a/mm/compaction.c b/mm/compaction.c
index 7607efb7bee2..a0bd85712516 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -646,8 +646,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
 	list_for_each_entry(page, &cc->migratepages, lru)
 		count[!!page_is_file_cache(page)]++;
 
-	mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
-	mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON, count[0]);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, count[1]);
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
@@ -655,12 +655,12 @@ static bool too_many_isolated(struct zone *zone)
 {
 	unsigned long active, inactive, isolated;
 
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
-					zone_page_state(zone, NR_INACTIVE_ANON);
-	active = zone_page_state(zone, NR_ACTIVE_FILE) +
-					zone_page_state(zone, NR_ACTIVE_ANON);
-	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
-					zone_page_state(zone, NR_ISOLATED_ANON);
+	inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
+			node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
+	active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
+			node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
+	isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
+			node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
 
 	return isolated > (inactive + active) / 2;
 }
@@ -856,7 +856,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			}
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, isolate_mode) != 0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f997328ae64..5d5b2207cfd2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1830,7 +1830,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	pgoff_t end = -1;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, zone);
+	lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
 
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
diff --git a/mm/internal.h b/mm/internal.h
index 9b6a6c43ac39..2f80d0343c56 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -78,7 +78,7 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
-extern bool zone_reclaimable(struct zone *zone);
+extern bool pgdat_reclaimable(struct pglist_data *pgdat);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 93d5f87c00d5..d7a49f665f04 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -480,7 +480,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 static void release_pte_page(struct page *page)
 {
 	/* 0 stands for page_is_file_cache(page) == false */
-	dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+	dec_node_page_state(page, NR_ISOLATED_ANON + 0);
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -576,7 +576,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		/* 0 stands for page_is_file_cache(page) == false */
-		inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+		inc_node_page_state(page, NR_ISOLATED_ANON + 0);
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9b70f9ca8ddf..50c86ad121bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -943,14 +943,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * and putback protocol: the LRU lock must be held, and the page must
  * either be PageLRU() or the caller must have isolated/allocated it.
  */
-struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
+struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = &zone->lruvec;
+		lruvec = &pgdat->lruvec;
 		goto out;
 	}
 
@@ -970,8 +970,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
 	 * we have to be prepared to initialize lruvec->zone here;
 	 * and if offlined then reonlined, we need to reinitialize it.
 	 */
-	if (unlikely(lruvec->zone != zone))
-		lruvec->zone = zone;
+	if (unlikely(lruvec->pgdat != pgdat))
+		lruvec->pgdat = pgdat;
 	return lruvec;
 }
 
@@ -979,6 +979,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
  * @lru: index of lru list the page is sitting on
+ * @zid: Zone ID of the zone pages have been added to
  * @nr_pages: positive when adding or negative when removing
  *
  * This function must be called under lru_lock, just before a page is added
@@ -986,14 +987,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
  * so as to allow it to check that lru_size 0 is consistent with list_empty).
  */
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
-				int nr_pages)
+				enum zone_type zid, int nr_pages)
 {
 	struct mem_cgroup_per_zone *mz;
 	unsigned long *lru_size;
 	long size;
 	bool empty;
 
-	__update_lru_size(lruvec, lru, nr_pages);
+	__update_lru_size(lruvec, lru, zid, nr_pages);
 
 	if (mem_cgroup_disabled())
 		return;
@@ -2069,7 +2070,7 @@ static void lock_page_lru(struct page *page, int *isolated)
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		*isolated = 1;
@@ -2084,7 +2085,7 @@ static void unlock_page_lru(struct page *page, int isolated)
 	if (isolated) {
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2fcca6b0e005..11de752ccaf5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1663,7 +1663,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	put_hwpoison_page(page);
 	if (!ret) {
 		LIST_HEAD(pagelist);
-		inc_zone_page_state(page, NR_ISOLATED_ANON +
+		inc_node_page_state(page, NR_ISOLATED_ANON +
 					page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
@@ -1671,7 +1671,7 @@ static int __soft_offline_page(struct page *page, int flags)
 		if (ret) {
 			if (!list_empty(&pagelist)) {
 				list_del(&page->lru);
-				dec_zone_page_state(page, NR_ISOLATED_ANON +
+				dec_node_page_state(page, NR_ISOLATED_ANON +
 						page_is_file_cache(page));
 				putback_lru_page(page);
 			}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 82d0b98d27f8..c5278360ca66 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1586,7 +1586,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			put_page(page);
 			list_add_tail(&page->lru, &source);
 			move_pages--;
-			inc_zone_page_state(page, NR_ISOLATED_ANON +
+			inc_node_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
 
 		} else {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 53e40d3f3933..d8c4e38fb5f4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -962,7 +962,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
 		if (!isolate_lru_page(page)) {
 			list_add_tail(&page->lru, pagelist);
-			inc_zone_page_state(page, NR_ISOLATED_ANON +
+			inc_node_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
 		}
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 2232f6923cc7..3033dae33a0a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -168,7 +168,7 @@ void putback_movable_pages(struct list_head *l)
 			continue;
 		}
 		list_del(&page->lru);
-		dec_zone_page_state(page, NR_ISOLATED_ANON +
+		dec_node_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
 		/*
 		 * We isolated non-lru movable page so here we can use
@@ -1119,7 +1119,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		 * restored.
 		 */
 		list_del(&page->lru);
-		dec_zone_page_state(page, NR_ISOLATED_ANON +
+		dec_node_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
 	}
 
@@ -1460,7 +1460,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		err = isolate_lru_page(page);
 		if (!err) {
 			list_add_tail(&page->lru, &pagelist);
-			inc_zone_page_state(page, NR_ISOLATED_ANON +
+			inc_node_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
 		}
 put_and_set:
@@ -1726,15 +1726,16 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 				   unsigned long nr_migrate_pages)
 {
 	int z;
+
+	if (!pgdat_reclaimable(pgdat))
+		return false;
+
 	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
 		struct zone *zone = pgdat->node_zones + z;
 
 		if (!populated_zone(zone))
 			continue;
 
-		if (!zone_reclaimable(zone))
-			continue;
-
 		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
 		if (!zone_watermark_ok(zone, 0,
 				       high_wmark_pages(zone) +
@@ -1828,7 +1829,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	}
 
 	page_lru = page_is_file_cache(page);
-	mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
+	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
 				hpage_nr_pages(page));
 
 	/*
@@ -1886,7 +1887,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
-			dec_zone_page_state(page, NR_ISOLATED_ANON +
+			dec_node_page_state(page, NR_ISOLATED_ANON +
 					page_is_file_cache(page));
 			putback_lru_page(page);
 		}
@@ -1979,7 +1980,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		/* Retake the callers reference and putback on LRU */
 		get_page(page);
 		putback_lru_page(page);
-		mod_zone_page_state(page_zone(page),
+		mod_node_page_state(page_pgdat(page),
 			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
 
 		goto out_unlock;
@@ -2030,7 +2031,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
 	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
 
-	mod_zone_page_state(page_zone(page),
+	mod_node_page_state(page_pgdat(page),
 			NR_ISOLATED_ANON + page_lru,
 			-HPAGE_PMD_NR);
 	return isolated;
diff --git a/mm/mlock.c b/mm/mlock.c
index 997f63082ff5..14645be06e30 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -103,7 +103,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
 		ClearPageLRU(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d578d2a56b19..0ada2b2954b0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -285,8 +285,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
 	 */
 	nr_pages -= min(nr_pages, zone->totalreserve_pages);
 
-	nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
-	nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
+	nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
+	nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
 
 	return nr_pages;
 }
@@ -348,8 +348,8 @@ static unsigned long global_dirtyable_memory(void)
 	 */
 	x -= min(x, totalreserve_pages);
 
-	x += global_page_state(NR_INACTIVE_FILE);
-	x += global_page_state(NR_ACTIVE_FILE);
+	x += global_node_page_state(NR_INACTIVE_FILE);
+	x += global_node_page_state(NR_ACTIVE_FILE);
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48b5414009ac..b84b85ae54ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1090,9 +1090,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
-	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
 
 	while (count) {
 		struct page *page;
@@ -1147,9 +1147,9 @@ static void free_one_page(struct zone *zone,
 {
 	unsigned long nr_scanned;
 	spin_lock(&zone->lock);
-	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
 
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
@@ -4331,6 +4331,7 @@ void show_free_areas(unsigned int filter)
 	unsigned long free_pcp = 0;
 	int cpu;
 	struct zone *zone;
+	pg_data_t *pgdat;
 
 	for_each_populated_zone(zone) {
 		if (skip_free_areas_node(filter, zone_to_nid(zone)))
@@ -4349,13 +4350,13 @@ void show_free_areas(unsigned int filter)
 		" anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
 #endif
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
-		global_page_state(NR_ACTIVE_ANON),
-		global_page_state(NR_INACTIVE_ANON),
-		global_page_state(NR_ISOLATED_ANON),
-		global_page_state(NR_ACTIVE_FILE),
-		global_page_state(NR_INACTIVE_FILE),
-		global_page_state(NR_ISOLATED_FILE),
-		global_page_state(NR_UNEVICTABLE),
+		global_node_page_state(NR_ACTIVE_ANON),
+		global_node_page_state(NR_INACTIVE_ANON),
+		global_node_page_state(NR_ISOLATED_ANON),
+		global_node_page_state(NR_ACTIVE_FILE),
+		global_node_page_state(NR_INACTIVE_FILE),
+		global_node_page_state(NR_ISOLATED_FILE),
+		global_node_page_state(NR_UNEVICTABLE),
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -4374,6 +4375,28 @@ void show_free_areas(unsigned int filter)
 		free_pcp,
 		global_page_state(NR_FREE_CMA_PAGES));
 
+	for_each_online_pgdat(pgdat) {
+		printk("Node %d"
+			" active_anon:%lukB"
+			" inactive_anon:%lukB"
+			" active_file:%lukB"
+			" inactive_file:%lukB"
+			" unevictable:%lukB"
+			" isolated(anon):%lukB"
+			" isolated(file):%lukB"
+			" all_unreclaimable? %s"
+			"\n",
+			pgdat->node_id,
+			K(node_page_state(pgdat, NR_ACTIVE_ANON)),
+			K(node_page_state(pgdat, NR_INACTIVE_ANON)),
+			K(node_page_state(pgdat, NR_ACTIVE_FILE)),
+			K(node_page_state(pgdat, NR_INACTIVE_FILE)),
+			K(node_page_state(pgdat, NR_UNEVICTABLE)),
+			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
+			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+			!pgdat_reclaimable(pgdat) ? "yes" : "no");
+	}
+
 	for_each_populated_zone(zone) {
 		int i;
 
@@ -4390,13 +4413,6 @@ void show_free_areas(unsigned int filter)
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
-			" active_anon:%lukB"
-			" inactive_anon:%lukB"
-			" active_file:%lukB"
-			" inactive_file:%lukB"
-			" unevictable:%lukB"
-			" isolated(anon):%lukB"
-			" isolated(file):%lukB"
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
@@ -4419,21 +4435,13 @@ void show_free_areas(unsigned int filter)
 			" local_pcp:%ukB"
 			" free_cma:%lukB"
 			" writeback_tmp:%lukB"
-			" pages_scanned:%lu"
-			" all_unreclaimable? %s"
+			" node_pages_scanned:%lu"
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
-			K(zone_page_state(zone, NR_ACTIVE_ANON)),
-			K(zone_page_state(zone, NR_INACTIVE_ANON)),
-			K(zone_page_state(zone, NR_ACTIVE_FILE)),
-			K(zone_page_state(zone, NR_INACTIVE_FILE)),
-			K(zone_page_state(zone, NR_UNEVICTABLE)),
-			K(zone_page_state(zone, NR_ISOLATED_ANON)),
-			K(zone_page_state(zone, NR_ISOLATED_FILE)),
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
@@ -4458,9 +4466,7 @@ void show_free_areas(unsigned int filter)
 			K(this_cpu_read(zone->pageset->pcp.count)),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
-			K(zone_page_state(zone, NR_PAGES_SCANNED)),
-			(!zone_reclaimable(zone) ? "yes" : "no")
-			);
+			K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
 			printk(" %ld", zone->lowmem_reserve[i]);
@@ -6010,7 +6016,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		/* For bootup, initialized properly in watermark setup */
 		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
 
-		lruvec_init(&zone->lruvec);
+		lruvec_init(zone_lruvec(zone));
 		if (!size)
 			continue;
 
diff --git a/mm/swap.c b/mm/swap.c
index bf37e5cfae81..77af473635fe 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
 		unsigned long flags;
 
 		spin_lock_irqsave(zone_lru_lock(zone), flags);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -194,7 +194,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(zone_lru_lock(zone), flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		(*move_fn)(page, lruvec, arg);
 	}
 	if (zone)
@@ -319,7 +319,7 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(zone_lru_lock(zone));
-	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
+	__activate_page(page, mem_cgroup_page_lruvec(page, zone->zone_pgdat), NULL);
 	spin_unlock_irq(zone_lru_lock(zone));
 }
 #endif
@@ -445,16 +445,16 @@ void lru_cache_add(struct page *page)
  */
 void add_page_to_unevictable_list(struct page *page)
 {
-	struct zone *zone = page_zone(page);
+	struct pglist_data *pgdat = page_pgdat(page);
 	struct lruvec *lruvec;
 
-	spin_lock_irq(zone_lru_lock(zone));
-	lruvec = mem_cgroup_page_lruvec(page, zone);
+	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
 	ClearPageActive(page);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 }
 
 /**
@@ -730,7 +730,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct zone *zone = NULL;
+	struct pglist_data *locked_pgdat = NULL;
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 	unsigned int uninitialized_var(lock_batch);
@@ -741,11 +741,11 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same zone. The lock is held only if zone != NULL.
+		 * same pgdat. The lock is held only if pgdat != NULL.
 		 */
-		if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(zone_lru_lock(zone), flags);
-			zone = NULL;
+		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
+			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+			locked_pgdat = NULL;
 		}
 
 		if (is_huge_zero_page(page)) {
@@ -758,27 +758,27 @@ void release_pages(struct page **pages, int nr, bool cold)
 			continue;
 
 		if (PageCompound(page)) {
-			if (zone) {
-				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
-				zone = NULL;
+			if (locked_pgdat) {
+				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+				locked_pgdat = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct zone *pagezone = page_zone(page);
+			struct pglist_data *pgdat = page_pgdat(page);
 
-			if (pagezone != zone) {
-				if (zone)
-					spin_unlock_irqrestore(zone_lru_lock(zone),
+			if (pgdat != locked_pgdat) {
+				if (locked_pgdat)
+					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
 									flags);
 				lock_batch = 0;
-				zone = pagezone;
-				spin_lock_irqsave(zone_lru_lock(zone), flags);
+				locked_pgdat = pgdat;
+				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, zone);
+			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -789,8 +789,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (zone)
-		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
+	if (locked_pgdat)
+		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_hot_cold_page_list(&pages_to_free, cold);
@@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
 	VM_BUG_ON(NR_CPUS != 1 &&
-		  !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
+		  !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
 
 	if (!list)
 		SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e7ffcd259cc4..86a523a761c9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -191,26 +191,42 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
+/*
+ * This misses isolated pages which are not accounted for to save counters.
+ * As the data only determines if reclaim or compaction continues, it is
+ * not expected that isolated pages will be a dominating factor.
+ */
 unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
-	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
-	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
-	     zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
+	nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
+	if (get_nr_swap_pages() > 0)
+		nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
+
+	return nr;
+}
+
+unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
+{
+	unsigned long nr;
+
+	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
+	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
+	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
 
 	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
-		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
-		      zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
+		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
+		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
+		      node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
 
 	return nr;
 }
 
-bool zone_reclaimable(struct zone *zone)
+bool pgdat_reclaimable(struct pglist_data *pgdat)
 {
-	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
-		zone_reclaimable_pages(zone) * 6;
+	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
+		pgdat_reclaimable_pages(pgdat) * 6;
 }
 
 unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -218,7 +234,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_lru_size(lruvec, lru);
 
-	return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
+	return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
 }
 
 /*
@@ -877,7 +893,7 @@ static void page_check_dirty_writeback(struct page *page,
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-				      struct zone *zone,
+				      struct pglist_data *pgdat,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
 				      unsigned long *ret_nr_dirty,
@@ -917,7 +933,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep;
 
 		VM_BUG_ON_PAGE(PageActive(page), page);
-		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		sc->nr_scanned++;
 
@@ -996,7 +1011,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			/* Case 1 above */
 			if (current_is_kswapd() &&
 			    PageReclaim(page) &&
-			    test_bit(ZONE_WRITEBACK, &zone->flags)) {
+			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
 				nr_immediate++;
 				goto keep_locked;
 
@@ -1092,7 +1107,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 */
 			if (page_is_file_cache(page) &&
 					(!current_is_kswapd() ||
-					 !test_bit(ZONE_DIRTY, &zone->flags))) {
+					 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -1266,11 +1281,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 		}
 	}
 
-	ret = shrink_page_list(&clean_pages, zone, &sc,
+	ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
 			TTU_UNMAP|TTU_IGNORE_ACCESS,
 			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
 	list_splice(&clean_pages, page_list);
-	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
 	return ret;
 }
 
@@ -1375,7 +1390,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 {
 	struct list_head *src = &lruvec->lists[lru];
 	unsigned long nr_taken = 0;
-	unsigned long scan;
+	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
+	unsigned long scan, nr_pages;
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
 					!list_empty(src); scan++) {
@@ -1388,7 +1404,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
-			nr_taken += hpage_nr_pages(page);
+			nr_pages = hpage_nr_pages(page);
+			nr_taken += nr_pages;
+			nr_zone_taken[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
 			break;
 
@@ -1405,6 +1423,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
+	for (scan = 0; scan < MAX_NR_ZONES; scan++) {
+		nr_pages = nr_zone_taken[scan];
+		if (!nr_pages)
+			continue;
+
+		update_lru_size(lruvec, lru, scan, -nr_pages);
+	}
 	return nr_taken;
 }
 
@@ -1445,7 +1470,7 @@ int isolate_lru_page(struct page *page)
 		struct lruvec *lruvec;
 
 		spin_lock_irq(zone_lru_lock(zone));
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
 			get_page(page);
@@ -1465,7 +1490,7 @@ int isolate_lru_page(struct page *page)
  * the LRU list will go small and be scanned faster than necessary, leading to
  * unnecessary swapping, thrashing and OOM.
  */
-static int too_many_isolated(struct zone *zone, int file,
+static int too_many_isolated(struct pglist_data *pgdat, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
@@ -1477,11 +1502,11 @@ static int too_many_isolated(struct zone *zone, int file,
 		return 0;
 
 	if (file) {
-		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
+		isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
 	} else {
-		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
+		isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
 	}
 
 	/*
@@ -1499,7 +1524,7 @@ static noinline_for_stack void
 putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	LIST_HEAD(pages_to_free);
 
 	/*
@@ -1512,13 +1537,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(zone_lru_lock(zone));
+			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(zone_lru_lock(zone));
+			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		SetPageLRU(page);
 		lru = page_lru(page);
@@ -1535,10 +1560,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(zone_lru_lock(zone));
+				spin_unlock_irq(&pgdat->lru_lock);
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(zone_lru_lock(zone));
+				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		}
@@ -1582,10 +1607,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_immediate = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
+	while (unlikely(too_many_isolated(pgdat, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
@@ -1600,48 +1625,45 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	update_lru_size(lruvec, lru, -nr_taken);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc)) {
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
+		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
 		if (current_is_kswapd())
-			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
+			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
 		else
-			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
+			__count_vm_events(PGSCAN_DIRECT, nr_scanned);
 	}
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
 				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
 				&nr_writeback, &nr_immediate,
 				false);
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 
 	if (global_reclaim(sc)) {
 		if (current_is_kswapd())
-			__count_zone_vm_events(PGSTEAL_KSWAPD, zone,
-					       nr_reclaimed);
+			__count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
 		else
-			__count_zone_vm_events(PGSTEAL_DIRECT, zone,
-					       nr_reclaimed);
+			__count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
 	}
 
 	putback_inactive_pages(lruvec, &page_list);
 
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_hot_cold_page_list(&page_list, true);
@@ -1661,7 +1683,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 * are encountered in the nr_immediate check below.
 	 */
 	if (nr_writeback && nr_writeback == nr_taken)
-		set_bit(ZONE_WRITEBACK, &zone->flags);
+		set_bit(PGDAT_WRITEBACK, &pgdat->flags);
 
 	/*
 	 * Legacy memcg will stall in page writeback so avoid forcibly
@@ -1673,16 +1695,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		 * backed by a congested BDI and wait_iff_congested will stall.
 		 */
 		if (nr_dirty && nr_dirty == nr_congested)
-			set_bit(ZONE_CONGESTED, &zone->flags);
+			set_bit(PGDAT_CONGESTED, &pgdat->flags);
 
 		/*
 		 * If dirty pages are scanned that are not queued for IO, it
 		 * implies that flushers are not keeping up. In this case, flag
-		 * the zone ZONE_DIRTY and kswapd will start writing pages from
+		 * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
 		 * reclaim context.
 		 */
 		if (nr_unqueued_dirty == nr_taken)
-			set_bit(ZONE_DIRTY, &zone->flags);
+			set_bit(PGDAT_DIRTY, &pgdat->flags);
 
 		/*
 		 * If kswapd scans pages marked marked for immediate
@@ -1701,9 +1723,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 */
 	if (!sc->hibernation_mode && !current_is_kswapd() &&
 	    current_may_throttle())
-		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
 
-	trace_mm_vmscan_lru_shrink_inactive(zone, nr_scanned, nr_reclaimed,
+	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
+			nr_scanned, nr_reclaimed,
 			sc->priority, file);
 	return nr_reclaimed;
 }
@@ -1731,20 +1754,20 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 				     struct list_head *pages_to_free,
 				     enum lru_list lru)
 {
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long pgmoved = 0;
 	struct page *page;
 	int nr_pages;
 
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		SetPageLRU(page);
 
 		nr_pages = hpage_nr_pages(page);
-		update_lru_size(lruvec, lru, nr_pages);
+		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
 		list_move(&page->lru, &lruvec->lists[lru]);
 		pgmoved += nr_pages;
 
@@ -1754,10 +1777,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(zone_lru_lock(zone));
+				spin_unlock_irq(&pgdat->lru_lock);
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(zone_lru_lock(zone));
+				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, pages_to_free);
 		}
@@ -1783,7 +1806,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	unsigned long nr_rotated = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	lru_add_drain();
 
@@ -1792,20 +1815,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	update_lru_size(lruvec, lru, -nr_taken);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc))
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
-	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
+		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
+	__count_vm_events(PGREFILL, nr_scanned);
 
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -1850,7 +1872,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -1861,8 +1883,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(zone_lru_lock(zone));
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_hold);
 	free_hot_cold_page_list(&l_hold, true);
@@ -1956,7 +1978,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
 	unsigned long anon, file;
@@ -1977,7 +1999,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * well.
 	 */
 	if (current_is_kswapd()) {
-		if (!zone_reclaimable(zone))
+		if (!pgdat_reclaimable(pgdat))
 			force_scan = true;
 		if (!mem_cgroup_online(memcg))
 			force_scan = true;
@@ -2023,14 +2045,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * anon pages.  Try to detect this based on file LRU size.
 	 */
 	if (global_reclaim(sc)) {
-		unsigned long zonefile;
-		unsigned long zonefree;
+		unsigned long pgdatfile;
+		unsigned long pgdatfree;
+		int z;
+		unsigned long total_high_wmark = 0;
 
-		zonefree = zone_page_state(zone, NR_FREE_PAGES);
-		zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
-			   zone_page_state(zone, NR_INACTIVE_FILE);
+		pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			   node_page_state(pgdat, NR_INACTIVE_FILE);
+
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+			if (!populated_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
 
-		if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
+		if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
 			scan_balance = SCAN_ANON;
 			goto out;
 		}
@@ -2077,7 +2109,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2098,7 +2130,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -2352,9 +2384,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
+	inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+		inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
@@ -2554,7 +2586,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 				continue;
 
 			if (sc->priority != DEF_PRIORITY &&
-			    !zone_reclaimable(zone))
+			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;	/* Let kswapd poll it */
 
 			/*
@@ -2692,7 +2724,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
 		if (!populated_zone(zone) ||
-		    zone_reclaimable_pages(zone) == 0)
+		    pgdat_reclaimable_pages(pgdat) == 0)
 			continue;
 
 		pfmemalloc_reserve += min_wmark_pages(zone);
@@ -3000,7 +3032,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 		 * DEF_PRIORITY. Effectively, it considers them balanced so
 		 * they must be considered balanced here as well!
 		 */
-		if (!zone_reclaimable(zone)) {
+		if (!pgdat_reclaimable(zone->zone_pgdat)) {
 			balanced_pages += zone->managed_pages;
 			continue;
 		}
@@ -3063,6 +3095,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 {
 	unsigned long balance_gap;
 	bool lowmem_pressure;
+	struct pglist_data *pgdat = zone->zone_pgdat;
 
 	/* Reclaim above the high watermark. */
 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
@@ -3087,7 +3120,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
 
 	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
 
-	clear_bit(ZONE_WRITEBACK, &zone->flags);
+	/* TODO: ANOMALY */
+	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 
 	/*
 	 * If a zone reaches its high watermark, consider it to be no longer
@@ -3095,10 +3129,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * BDIs but as pressure is relieved, speculatively avoid congestion
 	 * waits.
 	 */
-	if (zone_reclaimable(zone) &&
+	if (pgdat_reclaimable(zone->zone_pgdat) &&
 	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
-		clear_bit(ZONE_CONGESTED, &zone->flags);
-		clear_bit(ZONE_DIRTY, &zone->flags);
+		clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+		clear_bit(PGDAT_DIRTY, &pgdat->flags);
 	}
 
 	return sc->nr_scanned >= sc->nr_to_reclaim;
@@ -3157,7 +3191,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				continue;
 
 			if (sc.priority != DEF_PRIORITY &&
-			    !zone_reclaimable(zone))
+			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;
 
 			/*
@@ -3184,9 +3218,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				/*
 				 * If balanced, clear the dirty and congested
 				 * flags
+				 *
+				 * TODO: ANOMALY
 				 */
-				clear_bit(ZONE_CONGESTED, &zone->flags);
-				clear_bit(ZONE_DIRTY, &zone->flags);
+				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 			}
 		}
 
@@ -3216,7 +3252,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				continue;
 
 			if (sc.priority != DEF_PRIORITY &&
-			    !zone_reclaimable(zone))
+			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;
 
 			sc.nr_scanned = 0;
@@ -3612,8 +3648,8 @@ int sysctl_min_slab_ratio = 5;
 static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
 {
 	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
-	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
-		zone_page_state(zone, NR_ACTIVE_FILE);
+	unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
+		node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
 
 	/*
 	 * It's possible for there to be more file mapped pages than
@@ -3716,7 +3752,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
 		return ZONE_RECLAIM_FULL;
 
-	if (!zone_reclaimable(zone))
+	if (!pgdat_reclaimable(zone->zone_pgdat))
 		return ZONE_RECLAIM_FULL;
 
 	/*
@@ -3795,7 +3831,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 			zone = pagezone;
 			spin_lock_irq(zone_lru_lock(zone));
 		}
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3345d396a99b..de0c17076270 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -936,11 +936,8 @@ const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
 	"nr_alloc_batch",
-	"nr_inactive_anon",
-	"nr_active_anon",
-	"nr_inactive_file",
-	"nr_active_file",
-	"nr_unevictable",
+	"nr_zone_anon_lru",
+	"nr_zone_file_lru",
 	"nr_mlock",
 	"nr_anon_pages",
 	"nr_mapped",
@@ -956,12 +953,9 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
 	"nr_writeback_temp",
-	"nr_isolated_anon",
-	"nr_isolated_file",
 	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
-	"nr_pages_scanned",
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	"nr_zspages",
 #endif
@@ -981,6 +975,16 @@ const char * const vmstat_text[] = {
 	"nr_shmem_pmdmapped",
 	"nr_free_cma",
 
+	/* Node-based counters */
+	"nr_inactive_anon",
+	"nr_active_anon",
+	"nr_inactive_file",
+	"nr_active_file",
+	"nr_unevictable",
+	"nr_isolated_anon",
+	"nr_isolated_file",
+	"nr_pages_scanned",
+
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
@@ -1002,11 +1006,11 @@ const char * const vmstat_text[] = {
 	"pgmajfault",
 	"pglazyfreed",
 
-	TEXTS_FOR_ZONES("pgrefill")
-	TEXTS_FOR_ZONES("pgsteal_kswapd")
-	TEXTS_FOR_ZONES("pgsteal_direct")
-	TEXTS_FOR_ZONES("pgscan_kswapd")
-	TEXTS_FOR_ZONES("pgscan_direct")
+	"pgrefill",
+	"pgsteal_kswapd",
+	"pgsteal_direct",
+	"pgscan_kswapd",
+	"pgscan_direct",
 	"pgscan_direct_throttle",
 
 #ifdef CONFIG_NUMA
@@ -1434,7 +1438,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n        scanned  %lu"
+		   "\n   node_scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu",
@@ -1442,13 +1446,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-		   zone_page_state(zone, NR_PAGES_SCANNED),
+		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone->managed_pages);
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
+		seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
 				zone_page_state(zone, i));
 
 	seq_printf(m,
@@ -1478,12 +1482,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 #endif
 	}
 	seq_printf(m,
-		   "\n  all_unreclaimable: %u"
-		   "\n  start_pfn:         %lu"
-		   "\n  inactive_ratio:    %u",
-		   !zone_reclaimable(zone),
+		   "\n  node_unreclaimable:  %u"
+		   "\n  start_pfn:           %lu"
+		   "\n  node_inactive_ratio: %u",
+		   !pgdat_reclaimable(zone->zone_pgdat),
 		   zone->zone_start_pfn,
-		   zone->inactive_ratio);
+		   zone->zone_pgdat->inactive_ratio);
 	seq_putc(m, '\n');
 }
 
@@ -1574,7 +1578,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
 {
 	unsigned long *l = arg;
 	unsigned long off = l - (unsigned long *)m->private;
-
 	seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
 	return 0;
 }
diff --git a/mm/workingset.c b/mm/workingset.c
index ba972ac2dfdd..ebe14445809a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -355,8 +355,8 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
 						     LRU_ALL_FILE);
 	} else {
-		pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
-			sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
+		pages = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
+			node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
 	}
 
 	/*
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

This moves the LRU lists from the zone to the node and related data
such as counters, tracing, congestion tracking and writeback tracking.
Unfortunately, due to reclaim and compaction retry logic, it is necessary
to account for the number of LRU pages on both zone and node logic.
Most reclaim logic is based on the node counters but the retry logic uses
the zone counters which do not distinguish inactive and active sizes.
It would be possible to leave the LRU counters on a per-zone basis but
it's a heavier calculation across multiple cache lines that is much more
frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but using per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed later
but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU. It potentially
   could be implemented similar to the UNEVICTABLE list.

   That would reduce the skip rate with the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/tile/mm/pgtable.c                    |   8 +-
 drivers/base/node.c                       |  19 +--
 drivers/staging/android/lowmemorykiller.c |   8 +-
 include/linux/backing-dev.h               |   2 +-
 include/linux/memcontrol.h                |  18 +--
 include/linux/mm_inline.h                 |  21 ++-
 include/linux/mmzone.h                    |  68 +++++----
 include/linux/swap.h                      |   1 +
 include/linux/vm_event_item.h             |  10 +-
 include/linux/vmstat.h                    |  17 +++
 include/trace/events/vmscan.h             |  12 +-
 kernel/power/snapshot.c                   |  10 +-
 mm/backing-dev.c                          |  15 +-
 mm/compaction.c                           |  18 +--
 mm/huge_memory.c                          |   2 +-
 mm/internal.h                             |   2 +-
 mm/khugepaged.c                           |   4 +-
 mm/memcontrol.c                           |  17 +--
 mm/memory-failure.c                       |   4 +-
 mm/memory_hotplug.c                       |   2 +-
 mm/mempolicy.c                            |   2 +-
 mm/migrate.c                              |  21 +--
 mm/mlock.c                                |   2 +-
 mm/page-writeback.c                       |   8 +-
 mm/page_alloc.c                           |  68 +++++----
 mm/swap.c                                 |  50 +++----
 mm/vmscan.c                               | 226 +++++++++++++++++-------------
 mm/vmstat.c                               |  47 ++++---
 mm/workingset.c                           |   4 +-
 29 files changed, 386 insertions(+), 300 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index c4d5bf841a7f..9e389213580d 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
 	struct zone *zone;
 
 	pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
-	       (global_page_state(NR_ACTIVE_ANON) +
-		global_page_state(NR_ACTIVE_FILE)),
-	       (global_page_state(NR_INACTIVE_ANON) +
-		global_page_state(NR_INACTIVE_FILE)),
+	       (global_node_page_state(NR_ACTIVE_ANON) +
+		global_node_page_state(NR_ACTIVE_FILE)),
+	       (global_node_page_state(NR_INACTIVE_ANON) +
+		global_node_page_state(NR_INACTIVE_FILE)),
 	       global_page_state(NR_FILE_DIRTY),
 	       global_page_state(NR_WRITEBACK),
 	       global_page_state(NR_UNSTABLE_NFS),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 92d8e090c5b3..b7f01a4a642d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 {
 	int n;
 	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct sysinfo i;
 
 	si_meminfo_node(&i, nid);
@@ -74,15 +75,15 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
-				sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
-				sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
-		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
+		       nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
+				node_page_state(pgdat, NR_ACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
+				node_page_state(pgdat, NR_INACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
+		       nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
+		       nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
+		       nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
 		       nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 24d2745e9437..93dbcc38eb0f 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -72,10 +72,10 @@ static unsigned long lowmem_deathpending_timeout;
 static unsigned long lowmem_count(struct shrinker *s,
 				  struct shrink_control *sc)
 {
-	return global_page_state(NR_ACTIVE_ANON) +
-		global_page_state(NR_ACTIVE_FILE) +
-		global_page_state(NR_INACTIVE_ANON) +
-		global_page_state(NR_INACTIVE_FILE);
+	return global_node_page_state(NR_ACTIVE_ANON) +
+		global_node_page_state(NR_ACTIVE_FILE) +
+		global_node_page_state(NR_INACTIVE_ANON) +
+		global_node_page_state(NR_INACTIVE_FILE);
 }
 
 static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c82794f20110..491a91717788 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -197,7 +197,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
 }
 
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(struct zone *zone, int sync, long timeout);
+long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 104efa6874db..68f1121c8fe7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -340,7 +340,7 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = &zone->lruvec;
+		lruvec = zone_lruvec(zone);
 		goto out;
 	}
 
@@ -349,15 +349,15 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 out:
 	/*
 	 * Since a node can be onlined after the mem_cgroup was created,
-	 * we have to be prepared to initialize lruvec->zone here;
+	 * we have to be prepared to initialize lruvec->pgdat here;
 	 * and if offlined then reonlined, we need to reinitialize it.
 	 */
-	if (unlikely(lruvec->zone != zone))
-		lruvec->zone = zone;
+	if (unlikely(lruvec->pgdat != zone->zone_pgdat))
+		lruvec->pgdat = zone->zone_pgdat;
 	return lruvec;
 }
 
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
@@ -438,7 +438,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
 
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
-		int nr_pages);
+		enum zone_type zid, int nr_pages);
 
 unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 					   int nid, unsigned int lru_mask);
@@ -613,13 +613,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 						    struct mem_cgroup *memcg)
 {
-	return &zone->lruvec;
+	return zone_lruvec(zone);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
-						    struct zone *zone)
+						    struct pglist_data *pgdat)
 {
-	return &zone->lruvec;
+	return &pgdat->lruvec;
 }
 
 static inline bool mm_match_cgroup(struct mm_struct *mm,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 5bd29ba4f174..9aadcc781857 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -23,25 +23,32 @@ static inline int page_is_file_cache(struct page *page)
 }
 
 static __always_inline void __update_lru_size(struct lruvec *lruvec,
-				enum lru_list lru, int nr_pages)
+				enum lru_list lru, enum zone_type zid,
+				int nr_pages)
 {
-	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
+	__mod_zone_page_state(&pgdat->node_zones[zid],
+		NR_ZONE_LRU_BASE + !!is_file_lru(lru),
+		nr_pages);
 }
 
 static __always_inline void update_lru_size(struct lruvec *lruvec,
-				enum lru_list lru, int nr_pages)
+				enum lru_list lru, enum zone_type zid,
+				int nr_pages)
 {
 #ifdef CONFIG_MEMCG
-	mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+	mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
 #else
-	__update_lru_size(lruvec, lru, nr_pages);
+	__update_lru_size(lruvec, lru, zid, nr_pages);
 #endif
 }
 
 static __always_inline void add_page_to_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	update_lru_size(lruvec, lru, hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
@@ -49,7 +56,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
 	list_del(&page->lru);
-	update_lru_size(lruvec, lru, -hpage_nr_pages(page));
+	update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
 }
 
 /**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cfa870107abe..d4f5cac0a8c3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -111,12 +111,9 @@ enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
 	NR_ALLOC_BATCH,
-	NR_LRU_BASE,
-	NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
-	NR_ACTIVE_ANON,		/*  "     "     "   "       "         */
-	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
-	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
-	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
+	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
+	NR_ZONE_LRU_FILE,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
@@ -134,12 +131,9 @@ enum zone_stat_item {
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
-	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
-	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
-	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	NR_ZSPAGES,		/* allocated in zsmalloc */
 #endif
@@ -161,6 +155,15 @@ enum zone_stat_item {
 	NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
+	NR_LRU_BASE,
+	NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
+	NR_ACTIVE_ANON,		/*  "     "     "   "       "         */
+	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
+	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
+	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
+	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
+	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 	NR_VM_NODE_STAT_ITEMS
 };
 
@@ -219,7 +222,7 @@ struct lruvec {
 	/* Evictions & activations on the inactive file list */
 	atomic_long_t			inactive_age;
 #ifdef CONFIG_MEMCG
-	struct zone			*zone;
+	struct pglist_data *pgdat;
 #endif
 };
 
@@ -357,13 +360,6 @@ struct zone {
 #ifdef CONFIG_NUMA
 	int node;
 #endif
-
-	/*
-	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
-	 * this zone's LRU.  Maintained by the pageout code.
-	 */
-	unsigned int inactive_ratio;
-
 	struct pglist_data	*zone_pgdat;
 	struct per_cpu_pageset __percpu *pageset;
 
@@ -495,9 +491,6 @@ struct zone {
 
 	/* Write-intensive fields used by page reclaim */
 
-	/* Fields commonly accessed by the page reclaim scanner */
-	struct lruvec		lruvec;
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
@@ -537,17 +530,20 @@ struct zone {
 
 enum zone_flags {
 	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
-	ZONE_CONGESTED,			/* zone has many dirty pages backed by
+	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
+};
+
+enum pgdat_flags {
+	PGDAT_CONGESTED,		/* pgdat has many dirty pages backed by
 					 * a congested BDI
 					 */
-	ZONE_DIRTY,			/* reclaim scanning has recently found
+	PGDAT_DIRTY,			/* reclaim scanning has recently found
 					 * many dirty file pages at the tail
 					 * of the LRU.
 					 */
-	ZONE_WRITEBACK,			/* reclaim scanning has recently found
+	PGDAT_WRITEBACK,		/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
-	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -707,6 +703,19 @@ typedef struct pglist_data {
 	unsigned long split_queue_len;
 #endif
 
+	/* Fields commonly accessed by the page reclaim scanner */
+	struct lruvec		lruvec;
+
+	/*
+	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
+	 * this node's LRU.  Maintained by the pageout code.
+	 */
+	unsigned int inactive_ratio;
+
+	unsigned long		flags;
+
+	ZONE_PADDING(_pad2_)
+
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
@@ -728,6 +737,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
 	return &zone->zone_pgdat->lru_lock;
 }
 
+static inline struct lruvec *zone_lruvec(struct zone *zone)
+{
+	return &zone->zone_pgdat->lruvec;
+}
+
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
 {
 	return pgdat->node_start_pfn + pgdat->node_spanned_pages;
@@ -779,12 +793,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
 
 extern void lruvec_init(struct lruvec *lruvec);
 
-static inline struct zone *lruvec_zone(struct lruvec *lruvec)
+static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
 {
 #ifdef CONFIG_MEMCG
-	return lruvec->zone;
+	return lruvec->pgdat;
 #else
-	return container_of(lruvec, struct zone, lruvec);
+	return container_of(lruvec, struct pglist_data, lruvec);
 #endif
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0af2bb2028fd..c82f916008b7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
+extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 42604173f122..1798ff542517 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -26,11 +26,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
-		FOR_ALL_ZONES(PGREFILL),
-		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
-		FOR_ALL_ZONES(PGSTEAL_DIRECT),
-		FOR_ALL_ZONES(PGSCAN_KSWAPD),
-		FOR_ALL_ZONES(PGSCAN_DIRECT),
+		PGREFILL,
+		PGSTEAL_KSWAPD,
+		PGSTEAL_DIRECT,
+		PGSCAN_KSWAPD,
+		PGSCAN_DIRECT,
 		PGSCAN_DIRECT_THROTTLE,
 #ifdef CONFIG_NUMA
 		PGSCAN_ZONE_RECLAIM_FAILED,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d1744aa3ab9c..fee321c98550 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -178,6 +178,23 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 	return x;
 }
 
+static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
+					enum node_stat_item item)
+{
+	long x = atomic_long_read(&pgdat->vm_stat[item]);
+
+#ifdef CONFIG_SMP
+	int cpu;
+	for_each_online_cpu(cpu)
+		x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item];
+
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
+}
+
+
 #ifdef CONFIG_NUMA
 extern unsigned long sum_zone_node_page_state(int node,
 						enum zone_stat_item item);
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 0101ef37f1ee..897f1aa1ee5f 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -352,15 +352,14 @@ TRACE_EVENT(mm_vmscan_writepage,
 
 TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 
-	TP_PROTO(struct zone *zone,
+	TP_PROTO(int nid,
 		unsigned long nr_scanned, unsigned long nr_reclaimed,
 		int priority, int file),
 
-	TP_ARGS(zone, nr_scanned, nr_reclaimed, priority, file),
+	TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
 
 	TP_STRUCT__entry(
 		__field(int, nid)
-		__field(int, zid)
 		__field(unsigned long, nr_scanned)
 		__field(unsigned long, nr_reclaimed)
 		__field(int, priority)
@@ -368,16 +367,15 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 	),
 
 	TP_fast_assign(
-		__entry->nid = zone_to_nid(zone);
-		__entry->zid = zone_idx(zone);
+		__entry->nid = nid;
 		__entry->nr_scanned = nr_scanned;
 		__entry->nr_reclaimed = nr_reclaimed;
 		__entry->priority = priority;
 		__entry->reclaim_flags = trace_shrink_flags(file);
 	),
 
-	TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
-		__entry->nid, __entry->zid,
+	TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+		__entry->nid,
 		__entry->nr_scanned, __entry->nr_reclaimed,
 		__entry->priority,
 		show_reclaim_flags(__entry->reclaim_flags))
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 3a970604308f..24a06bc23f85 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1525,11 +1525,11 @@ static unsigned long minimum_image_size(unsigned long saveable)
 	unsigned long size;
 
 	size = global_page_state(NR_SLAB_RECLAIMABLE)
-		+ global_page_state(NR_ACTIVE_ANON)
-		+ global_page_state(NR_INACTIVE_ANON)
-		+ global_page_state(NR_ACTIVE_FILE)
-		+ global_page_state(NR_INACTIVE_FILE)
-		- global_page_state(NR_FILE_MAPPED);
+		+ global_node_page_state(NR_ACTIVE_ANON)
+		+ global_node_page_state(NR_INACTIVE_ANON)
+		+ global_node_page_state(NR_ACTIVE_FILE)
+		+ global_node_page_state(NR_INACTIVE_FILE)
+		- global_node_page_state(NR_FILE_MAPPED);
 
 	return saveable <= size ? 0 : saveable - size;
 }
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index f53b23ab7ed7..a8c3af46bd3d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -982,24 +982,24 @@ long congestion_wait(int sync, long timeout)
 EXPORT_SYMBOL(congestion_wait);
 
 /**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
- * @zone: A zone to check if it is heavily congested
+ * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
+ * @pgdat: A pgdat to check if it is heavily congested
  * @sync: SYNC or ASYNC IO
  * @timeout: timeout in jiffies
  *
  * In the event of a congested backing_dev (any backing_dev) and the given
- * @zone has experienced recent congestion, this waits for up to @timeout
+ * @pgdat has experienced recent congestion, this waits for up to @timeout
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
+ * In the absence of pgdat congestion, cond_resched() is called to yield
  * the processor if necessary but otherwise does not sleep.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
  * returned. return_value == timeout implies the function did not sleep.
  */
-long wait_iff_congested(struct zone *zone, int sync, long timeout)
+long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
@@ -1008,12 +1008,13 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 
 	/*
 	 * If there is no congestion, or heavy congestion is not being
-	 * encountered in the current zone, yield if necessary instead
+	 * encountered in the current pgdat, yield if necessary instead
 	 * of sleeping on the congestion queue
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
-	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
+	    !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
 		cond_resched();
+
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
 		if (ret < 0)
diff --git a/mm/compaction.c b/mm/compaction.c
index 7607efb7bee2..a0bd85712516 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -646,8 +646,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
 	list_for_each_entry(page, &cc->migratepages, lru)
 		count[!!page_is_file_cache(page)]++;
 
-	mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
-	mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON, count[0]);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, count[1]);
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
@@ -655,12 +655,12 @@ static bool too_many_isolated(struct zone *zone)
 {
 	unsigned long active, inactive, isolated;
 
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
-					zone_page_state(zone, NR_INACTIVE_ANON);
-	active = zone_page_state(zone, NR_ACTIVE_FILE) +
-					zone_page_state(zone, NR_ACTIVE_ANON);
-	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
-					zone_page_state(zone, NR_ISOLATED_ANON);
+	inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
+			node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
+	active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
+			node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
+	isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
+			node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
 
 	return isolated > (inactive + active) / 2;
 }
@@ -856,7 +856,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			}
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, isolate_mode) != 0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f997328ae64..5d5b2207cfd2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1830,7 +1830,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	pgoff_t end = -1;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, zone);
+	lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
 
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
diff --git a/mm/internal.h b/mm/internal.h
index 9b6a6c43ac39..2f80d0343c56 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -78,7 +78,7 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
-extern bool zone_reclaimable(struct zone *zone);
+extern bool pgdat_reclaimable(struct pglist_data *pgdat);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 93d5f87c00d5..d7a49f665f04 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -480,7 +480,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 static void release_pte_page(struct page *page)
 {
 	/* 0 stands for page_is_file_cache(page) == false */
-	dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+	dec_node_page_state(page, NR_ISOLATED_ANON + 0);
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -576,7 +576,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		/* 0 stands for page_is_file_cache(page) == false */
-		inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+		inc_node_page_state(page, NR_ISOLATED_ANON + 0);
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9b70f9ca8ddf..50c86ad121bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -943,14 +943,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * and putback protocol: the LRU lock must be held, and the page must
  * either be PageLRU() or the caller must have isolated/allocated it.
  */
-struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
+struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = &zone->lruvec;
+		lruvec = &pgdat->lruvec;
 		goto out;
 	}
 
@@ -970,8 +970,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
 	 * we have to be prepared to initialize lruvec->zone here;
 	 * and if offlined then reonlined, we need to reinitialize it.
 	 */
-	if (unlikely(lruvec->zone != zone))
-		lruvec->zone = zone;
+	if (unlikely(lruvec->pgdat != pgdat))
+		lruvec->pgdat = pgdat;
 	return lruvec;
 }
 
@@ -979,6 +979,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
  * @lru: index of lru list the page is sitting on
+ * @zid: Zone ID of the zone pages have been added to
  * @nr_pages: positive when adding or negative when removing
  *
  * This function must be called under lru_lock, just before a page is added
@@ -986,14 +987,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
  * so as to allow it to check that lru_size 0 is consistent with list_empty).
  */
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
-				int nr_pages)
+				enum zone_type zid, int nr_pages)
 {
 	struct mem_cgroup_per_zone *mz;
 	unsigned long *lru_size;
 	long size;
 	bool empty;
 
-	__update_lru_size(lruvec, lru, nr_pages);
+	__update_lru_size(lruvec, lru, zid, nr_pages);
 
 	if (mem_cgroup_disabled())
 		return;
@@ -2069,7 +2070,7 @@ static void lock_page_lru(struct page *page, int *isolated)
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		*isolated = 1;
@@ -2084,7 +2085,7 @@ static void unlock_page_lru(struct page *page, int isolated)
 	if (isolated) {
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2fcca6b0e005..11de752ccaf5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1663,7 +1663,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	put_hwpoison_page(page);
 	if (!ret) {
 		LIST_HEAD(pagelist);
-		inc_zone_page_state(page, NR_ISOLATED_ANON +
+		inc_node_page_state(page, NR_ISOLATED_ANON +
 					page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
@@ -1671,7 +1671,7 @@ static int __soft_offline_page(struct page *page, int flags)
 		if (ret) {
 			if (!list_empty(&pagelist)) {
 				list_del(&page->lru);
-				dec_zone_page_state(page, NR_ISOLATED_ANON +
+				dec_node_page_state(page, NR_ISOLATED_ANON +
 						page_is_file_cache(page));
 				putback_lru_page(page);
 			}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 82d0b98d27f8..c5278360ca66 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1586,7 +1586,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			put_page(page);
 			list_add_tail(&page->lru, &source);
 			move_pages--;
-			inc_zone_page_state(page, NR_ISOLATED_ANON +
+			inc_node_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
 
 		} else {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 53e40d3f3933..d8c4e38fb5f4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -962,7 +962,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
 	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
 		if (!isolate_lru_page(page)) {
 			list_add_tail(&page->lru, pagelist);
-			inc_zone_page_state(page, NR_ISOLATED_ANON +
+			inc_node_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
 		}
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index 2232f6923cc7..3033dae33a0a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -168,7 +168,7 @@ void putback_movable_pages(struct list_head *l)
 			continue;
 		}
 		list_del(&page->lru);
-		dec_zone_page_state(page, NR_ISOLATED_ANON +
+		dec_node_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
 		/*
 		 * We isolated non-lru movable page so here we can use
@@ -1119,7 +1119,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		 * restored.
 		 */
 		list_del(&page->lru);
-		dec_zone_page_state(page, NR_ISOLATED_ANON +
+		dec_node_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
 	}
 
@@ -1460,7 +1460,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		err = isolate_lru_page(page);
 		if (!err) {
 			list_add_tail(&page->lru, &pagelist);
-			inc_zone_page_state(page, NR_ISOLATED_ANON +
+			inc_node_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
 		}
 put_and_set:
@@ -1726,15 +1726,16 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 				   unsigned long nr_migrate_pages)
 {
 	int z;
+
+	if (!pgdat_reclaimable(pgdat))
+		return false;
+
 	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
 		struct zone *zone = pgdat->node_zones + z;
 
 		if (!populated_zone(zone))
 			continue;
 
-		if (!zone_reclaimable(zone))
-			continue;
-
 		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
 		if (!zone_watermark_ok(zone, 0,
 				       high_wmark_pages(zone) +
@@ -1828,7 +1829,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	}
 
 	page_lru = page_is_file_cache(page);
-	mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
+	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
 				hpage_nr_pages(page));
 
 	/*
@@ -1886,7 +1887,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
-			dec_zone_page_state(page, NR_ISOLATED_ANON +
+			dec_node_page_state(page, NR_ISOLATED_ANON +
 					page_is_file_cache(page));
 			putback_lru_page(page);
 		}
@@ -1979,7 +1980,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		/* Retake the callers reference and putback on LRU */
 		get_page(page);
 		putback_lru_page(page);
-		mod_zone_page_state(page_zone(page),
+		mod_node_page_state(page_pgdat(page),
 			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
 
 		goto out_unlock;
@@ -2030,7 +2031,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
 	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
 
-	mod_zone_page_state(page_zone(page),
+	mod_node_page_state(page_pgdat(page),
 			NR_ISOLATED_ANON + page_lru,
 			-HPAGE_PMD_NR);
 	return isolated;
diff --git a/mm/mlock.c b/mm/mlock.c
index 997f63082ff5..14645be06e30 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -103,7 +103,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
 		ClearPageLRU(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d578d2a56b19..0ada2b2954b0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -285,8 +285,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
 	 */
 	nr_pages -= min(nr_pages, zone->totalreserve_pages);
 
-	nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
-	nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
+	nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
+	nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
 
 	return nr_pages;
 }
@@ -348,8 +348,8 @@ static unsigned long global_dirtyable_memory(void)
 	 */
 	x -= min(x, totalreserve_pages);
 
-	x += global_page_state(NR_INACTIVE_FILE);
-	x += global_page_state(NR_ACTIVE_FILE);
+	x += global_node_page_state(NR_INACTIVE_FILE);
+	x += global_node_page_state(NR_ACTIVE_FILE);
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48b5414009ac..b84b85ae54ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1090,9 +1090,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
-	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
 
 	while (count) {
 		struct page *page;
@@ -1147,9 +1147,9 @@ static void free_one_page(struct zone *zone,
 {
 	unsigned long nr_scanned;
 	spin_lock(&zone->lock);
-	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
+	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
 	if (nr_scanned)
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
+		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
 
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
@@ -4331,6 +4331,7 @@ void show_free_areas(unsigned int filter)
 	unsigned long free_pcp = 0;
 	int cpu;
 	struct zone *zone;
+	pg_data_t *pgdat;
 
 	for_each_populated_zone(zone) {
 		if (skip_free_areas_node(filter, zone_to_nid(zone)))
@@ -4349,13 +4350,13 @@ void show_free_areas(unsigned int filter)
 		" anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
 #endif
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
-		global_page_state(NR_ACTIVE_ANON),
-		global_page_state(NR_INACTIVE_ANON),
-		global_page_state(NR_ISOLATED_ANON),
-		global_page_state(NR_ACTIVE_FILE),
-		global_page_state(NR_INACTIVE_FILE),
-		global_page_state(NR_ISOLATED_FILE),
-		global_page_state(NR_UNEVICTABLE),
+		global_node_page_state(NR_ACTIVE_ANON),
+		global_node_page_state(NR_INACTIVE_ANON),
+		global_node_page_state(NR_ISOLATED_ANON),
+		global_node_page_state(NR_ACTIVE_FILE),
+		global_node_page_state(NR_INACTIVE_FILE),
+		global_node_page_state(NR_ISOLATED_FILE),
+		global_node_page_state(NR_UNEVICTABLE),
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -4374,6 +4375,28 @@ void show_free_areas(unsigned int filter)
 		free_pcp,
 		global_page_state(NR_FREE_CMA_PAGES));
 
+	for_each_online_pgdat(pgdat) {
+		printk("Node %d"
+			" active_anon:%lukB"
+			" inactive_anon:%lukB"
+			" active_file:%lukB"
+			" inactive_file:%lukB"
+			" unevictable:%lukB"
+			" isolated(anon):%lukB"
+			" isolated(file):%lukB"
+			" all_unreclaimable? %s"
+			"\n",
+			pgdat->node_id,
+			K(node_page_state(pgdat, NR_ACTIVE_ANON)),
+			K(node_page_state(pgdat, NR_INACTIVE_ANON)),
+			K(node_page_state(pgdat, NR_ACTIVE_FILE)),
+			K(node_page_state(pgdat, NR_INACTIVE_FILE)),
+			K(node_page_state(pgdat, NR_UNEVICTABLE)),
+			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
+			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+			!pgdat_reclaimable(pgdat) ? "yes" : "no");
+	}
+
 	for_each_populated_zone(zone) {
 		int i;
 
@@ -4390,13 +4413,6 @@ void show_free_areas(unsigned int filter)
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
-			" active_anon:%lukB"
-			" inactive_anon:%lukB"
-			" active_file:%lukB"
-			" inactive_file:%lukB"
-			" unevictable:%lukB"
-			" isolated(anon):%lukB"
-			" isolated(file):%lukB"
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
@@ -4419,21 +4435,13 @@ void show_free_areas(unsigned int filter)
 			" local_pcp:%ukB"
 			" free_cma:%lukB"
 			" writeback_tmp:%lukB"
-			" pages_scanned:%lu"
-			" all_unreclaimable? %s"
+			" node_pages_scanned:%lu"
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
-			K(zone_page_state(zone, NR_ACTIVE_ANON)),
-			K(zone_page_state(zone, NR_INACTIVE_ANON)),
-			K(zone_page_state(zone, NR_ACTIVE_FILE)),
-			K(zone_page_state(zone, NR_INACTIVE_FILE)),
-			K(zone_page_state(zone, NR_UNEVICTABLE)),
-			K(zone_page_state(zone, NR_ISOLATED_ANON)),
-			K(zone_page_state(zone, NR_ISOLATED_FILE)),
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
@@ -4458,9 +4466,7 @@ void show_free_areas(unsigned int filter)
 			K(this_cpu_read(zone->pageset->pcp.count)),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
 			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
-			K(zone_page_state(zone, NR_PAGES_SCANNED)),
-			(!zone_reclaimable(zone) ? "yes" : "no")
-			);
+			K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
 			printk(" %ld", zone->lowmem_reserve[i]);
@@ -6010,7 +6016,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		/* For bootup, initialized properly in watermark setup */
 		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
 
-		lruvec_init(&zone->lruvec);
+		lruvec_init(zone_lruvec(zone));
 		if (!size)
 			continue;
 
diff --git a/mm/swap.c b/mm/swap.c
index bf37e5cfae81..77af473635fe 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
 		unsigned long flags;
 
 		spin_lock_irqsave(zone_lru_lock(zone), flags);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -194,7 +194,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(zone_lru_lock(zone), flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		(*move_fn)(page, lruvec, arg);
 	}
 	if (zone)
@@ -319,7 +319,7 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(zone_lru_lock(zone));
-	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
+	__activate_page(page, mem_cgroup_page_lruvec(page, zone->zone_pgdat), NULL);
 	spin_unlock_irq(zone_lru_lock(zone));
 }
 #endif
@@ -445,16 +445,16 @@ void lru_cache_add(struct page *page)
  */
 void add_page_to_unevictable_list(struct page *page)
 {
-	struct zone *zone = page_zone(page);
+	struct pglist_data *pgdat = page_pgdat(page);
 	struct lruvec *lruvec;
 
-	spin_lock_irq(zone_lru_lock(zone));
-	lruvec = mem_cgroup_page_lruvec(page, zone);
+	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
 	ClearPageActive(page);
 	SetPageUnevictable(page);
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 }
 
 /**
@@ -730,7 +730,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct zone *zone = NULL;
+	struct pglist_data *locked_pgdat = NULL;
 	struct lruvec *lruvec;
 	unsigned long uninitialized_var(flags);
 	unsigned int uninitialized_var(lock_batch);
@@ -741,11 +741,11 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same zone. The lock is held only if zone != NULL.
+		 * same pgdat. The lock is held only if pgdat != NULL.
 		 */
-		if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(zone_lru_lock(zone), flags);
-			zone = NULL;
+		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
+			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+			locked_pgdat = NULL;
 		}
 
 		if (is_huge_zero_page(page)) {
@@ -758,27 +758,27 @@ void release_pages(struct page **pages, int nr, bool cold)
 			continue;
 
 		if (PageCompound(page)) {
-			if (zone) {
-				spin_unlock_irqrestore(zone_lru_lock(zone), flags);
-				zone = NULL;
+			if (locked_pgdat) {
+				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+				locked_pgdat = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct zone *pagezone = page_zone(page);
+			struct pglist_data *pgdat = page_pgdat(page);
 
-			if (pagezone != zone) {
-				if (zone)
-					spin_unlock_irqrestore(zone_lru_lock(zone),
+			if (pgdat != locked_pgdat) {
+				if (locked_pgdat)
+					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
 									flags);
 				lock_batch = 0;
-				zone = pagezone;
-				spin_lock_irqsave(zone_lru_lock(zone), flags);
+				locked_pgdat = pgdat;
+				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, zone);
+			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -789,8 +789,8 @@ void release_pages(struct page **pages, int nr, bool cold)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (zone)
-		spin_unlock_irqrestore(zone_lru_lock(zone), flags);
+	if (locked_pgdat)
+		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_hot_cold_page_list(&pages_to_free, cold);
@@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
 	VM_BUG_ON(NR_CPUS != 1 &&
-		  !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
+		  !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
 
 	if (!list)
 		SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e7ffcd259cc4..86a523a761c9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -191,26 +191,42 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
+/*
+ * This misses isolated pages which are not accounted for to save counters.
+ * As the data only determines if reclaim or compaction continues, it is
+ * not expected that isolated pages will be a dominating factor.
+ */
 unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
-	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
-	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
-	     zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
+	nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
+	if (get_nr_swap_pages() > 0)
+		nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
+
+	return nr;
+}
+
+unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
+{
+	unsigned long nr;
+
+	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
+	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
+	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
 
 	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
-		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
-		      zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
+		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
+		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
+		      node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
 
 	return nr;
 }
 
-bool zone_reclaimable(struct zone *zone)
+bool pgdat_reclaimable(struct pglist_data *pgdat)
 {
-	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
-		zone_reclaimable_pages(zone) * 6;
+	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
+		pgdat_reclaimable_pages(pgdat) * 6;
 }
 
 unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -218,7 +234,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_get_lru_size(lruvec, lru);
 
-	return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
+	return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
 }
 
 /*
@@ -877,7 +893,7 @@ static void page_check_dirty_writeback(struct page *page,
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-				      struct zone *zone,
+				      struct pglist_data *pgdat,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
 				      unsigned long *ret_nr_dirty,
@@ -917,7 +933,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep;
 
 		VM_BUG_ON_PAGE(PageActive(page), page);
-		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		sc->nr_scanned++;
 
@@ -996,7 +1011,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			/* Case 1 above */
 			if (current_is_kswapd() &&
 			    PageReclaim(page) &&
-			    test_bit(ZONE_WRITEBACK, &zone->flags)) {
+			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
 				nr_immediate++;
 				goto keep_locked;
 
@@ -1092,7 +1107,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 */
 			if (page_is_file_cache(page) &&
 					(!current_is_kswapd() ||
-					 !test_bit(ZONE_DIRTY, &zone->flags))) {
+					 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -1266,11 +1281,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 		}
 	}
 
-	ret = shrink_page_list(&clean_pages, zone, &sc,
+	ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
 			TTU_UNMAP|TTU_IGNORE_ACCESS,
 			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
 	list_splice(&clean_pages, page_list);
-	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
+	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
 	return ret;
 }
 
@@ -1375,7 +1390,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 {
 	struct list_head *src = &lruvec->lists[lru];
 	unsigned long nr_taken = 0;
-	unsigned long scan;
+	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
+	unsigned long scan, nr_pages;
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
 					!list_empty(src); scan++) {
@@ -1388,7 +1404,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
-			nr_taken += hpage_nr_pages(page);
+			nr_pages = hpage_nr_pages(page);
+			nr_taken += nr_pages;
+			nr_zone_taken[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
 			break;
 
@@ -1405,6 +1423,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
+	for (scan = 0; scan < MAX_NR_ZONES; scan++) {
+		nr_pages = nr_zone_taken[scan];
+		if (!nr_pages)
+			continue;
+
+		update_lru_size(lruvec, lru, scan, -nr_pages);
+	}
 	return nr_taken;
 }
 
@@ -1445,7 +1470,7 @@ int isolate_lru_page(struct page *page)
 		struct lruvec *lruvec;
 
 		spin_lock_irq(zone_lru_lock(zone));
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
 			get_page(page);
@@ -1465,7 +1490,7 @@ int isolate_lru_page(struct page *page)
  * the LRU list will go small and be scanned faster than necessary, leading to
  * unnecessary swapping, thrashing and OOM.
  */
-static int too_many_isolated(struct zone *zone, int file,
+static int too_many_isolated(struct pglist_data *pgdat, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
@@ -1477,11 +1502,11 @@ static int too_many_isolated(struct zone *zone, int file,
 		return 0;
 
 	if (file) {
-		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
+		isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
 	} else {
-		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
+		isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
 	}
 
 	/*
@@ -1499,7 +1524,7 @@ static noinline_for_stack void
 putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	LIST_HEAD(pages_to_free);
 
 	/*
@@ -1512,13 +1537,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(zone_lru_lock(zone));
+			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(zone_lru_lock(zone));
+			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		SetPageLRU(page);
 		lru = page_lru(page);
@@ -1535,10 +1560,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(zone_lru_lock(zone));
+				spin_unlock_irq(&pgdat->lru_lock);
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(zone_lru_lock(zone));
+				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		}
@@ -1582,10 +1607,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_immediate = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
+	while (unlikely(too_many_isolated(pgdat, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
@@ -1600,48 +1625,45 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	update_lru_size(lruvec, lru, -nr_taken);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc)) {
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
+		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
 		if (current_is_kswapd())
-			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
+			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
 		else
-			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
+			__count_vm_events(PGSCAN_DIRECT, nr_scanned);
 	}
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
 				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
 				&nr_writeback, &nr_immediate,
 				false);
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 
 	if (global_reclaim(sc)) {
 		if (current_is_kswapd())
-			__count_zone_vm_events(PGSTEAL_KSWAPD, zone,
-					       nr_reclaimed);
+			__count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
 		else
-			__count_zone_vm_events(PGSTEAL_DIRECT, zone,
-					       nr_reclaimed);
+			__count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
 	}
 
 	putback_inactive_pages(lruvec, &page_list);
 
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_hot_cold_page_list(&page_list, true);
@@ -1661,7 +1683,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 * are encountered in the nr_immediate check below.
 	 */
 	if (nr_writeback && nr_writeback == nr_taken)
-		set_bit(ZONE_WRITEBACK, &zone->flags);
+		set_bit(PGDAT_WRITEBACK, &pgdat->flags);
 
 	/*
 	 * Legacy memcg will stall in page writeback so avoid forcibly
@@ -1673,16 +1695,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		 * backed by a congested BDI and wait_iff_congested will stall.
 		 */
 		if (nr_dirty && nr_dirty == nr_congested)
-			set_bit(ZONE_CONGESTED, &zone->flags);
+			set_bit(PGDAT_CONGESTED, &pgdat->flags);
 
 		/*
 		 * If dirty pages are scanned that are not queued for IO, it
 		 * implies that flushers are not keeping up. In this case, flag
-		 * the zone ZONE_DIRTY and kswapd will start writing pages from
+		 * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
 		 * reclaim context.
 		 */
 		if (nr_unqueued_dirty == nr_taken)
-			set_bit(ZONE_DIRTY, &zone->flags);
+			set_bit(PGDAT_DIRTY, &pgdat->flags);
 
 		/*
 		 * If kswapd scans pages marked marked for immediate
@@ -1701,9 +1723,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 */
 	if (!sc->hibernation_mode && !current_is_kswapd() &&
 	    current_may_throttle())
-		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
 
-	trace_mm_vmscan_lru_shrink_inactive(zone, nr_scanned, nr_reclaimed,
+	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
+			nr_scanned, nr_reclaimed,
 			sc->priority, file);
 	return nr_reclaimed;
 }
@@ -1731,20 +1754,20 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 				     struct list_head *pages_to_free,
 				     enum lru_list lru)
 {
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long pgmoved = 0;
 	struct page *page;
 	int nr_pages;
 
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		SetPageLRU(page);
 
 		nr_pages = hpage_nr_pages(page);
-		update_lru_size(lruvec, lru, nr_pages);
+		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
 		list_move(&page->lru, &lruvec->lists[lru]);
 		pgmoved += nr_pages;
 
@@ -1754,10 +1777,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(zone_lru_lock(zone));
+				spin_unlock_irq(&pgdat->lru_lock);
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(zone_lru_lock(zone));
+				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, pages_to_free);
 		}
@@ -1783,7 +1806,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	unsigned long nr_rotated = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	lru_add_drain();
 
@@ -1792,20 +1815,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	if (!sc->may_writepage)
 		isolate_mode |= ISOLATE_CLEAN;
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	update_lru_size(lruvec, lru, -nr_taken);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc))
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
-	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
+		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
+	__count_vm_events(PGREFILL, nr_scanned);
 
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -1850,7 +1872,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -1861,8 +1883,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(zone_lru_lock(zone));
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_hold);
 	free_hot_cold_page_list(&l_hold, true);
@@ -1956,7 +1978,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
-	struct zone *zone = lruvec_zone(lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
 	unsigned long anon, file;
@@ -1977,7 +1999,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * well.
 	 */
 	if (current_is_kswapd()) {
-		if (!zone_reclaimable(zone))
+		if (!pgdat_reclaimable(pgdat))
 			force_scan = true;
 		if (!mem_cgroup_online(memcg))
 			force_scan = true;
@@ -2023,14 +2045,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * anon pages.  Try to detect this based on file LRU size.
 	 */
 	if (global_reclaim(sc)) {
-		unsigned long zonefile;
-		unsigned long zonefree;
+		unsigned long pgdatfile;
+		unsigned long pgdatfree;
+		int z;
+		unsigned long total_high_wmark = 0;
 
-		zonefree = zone_page_state(zone, NR_FREE_PAGES);
-		zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
-			   zone_page_state(zone, NR_INACTIVE_FILE);
+		pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
+		pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
+			   node_page_state(pgdat, NR_INACTIVE_FILE);
+
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+			if (!populated_zone(zone))
+				continue;
+
+			total_high_wmark += high_wmark_pages(zone);
+		}
 
-		if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
+		if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
 			scan_balance = SCAN_ANON;
 			goto out;
 		}
@@ -2077,7 +2109,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
-	spin_lock_irq(zone_lru_lock(zone));
+	spin_lock_irq(&pgdat->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2098,7 +2130,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(zone_lru_lock(zone));
+	spin_unlock_irq(&pgdat->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -2352,9 +2384,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
+	inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+		inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
@@ -2554,7 +2586,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 				continue;
 
 			if (sc->priority != DEF_PRIORITY &&
-			    !zone_reclaimable(zone))
+			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;	/* Let kswapd poll it */
 
 			/*
@@ -2692,7 +2724,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
 		if (!populated_zone(zone) ||
-		    zone_reclaimable_pages(zone) == 0)
+		    pgdat_reclaimable_pages(pgdat) == 0)
 			continue;
 
 		pfmemalloc_reserve += min_wmark_pages(zone);
@@ -3000,7 +3032,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 		 * DEF_PRIORITY. Effectively, it considers them balanced so
 		 * they must be considered balanced here as well!
 		 */
-		if (!zone_reclaimable(zone)) {
+		if (!pgdat_reclaimable(zone->zone_pgdat)) {
 			balanced_pages += zone->managed_pages;
 			continue;
 		}
@@ -3063,6 +3095,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 {
 	unsigned long balance_gap;
 	bool lowmem_pressure;
+	struct pglist_data *pgdat = zone->zone_pgdat;
 
 	/* Reclaim above the high watermark. */
 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
@@ -3087,7 +3120,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
 
 	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
 
-	clear_bit(ZONE_WRITEBACK, &zone->flags);
+	/* TODO: ANOMALY */
+	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 
 	/*
 	 * If a zone reaches its high watermark, consider it to be no longer
@@ -3095,10 +3129,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * BDIs but as pressure is relieved, speculatively avoid congestion
 	 * waits.
 	 */
-	if (zone_reclaimable(zone) &&
+	if (pgdat_reclaimable(zone->zone_pgdat) &&
 	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
-		clear_bit(ZONE_CONGESTED, &zone->flags);
-		clear_bit(ZONE_DIRTY, &zone->flags);
+		clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+		clear_bit(PGDAT_DIRTY, &pgdat->flags);
 	}
 
 	return sc->nr_scanned >= sc->nr_to_reclaim;
@@ -3157,7 +3191,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				continue;
 
 			if (sc.priority != DEF_PRIORITY &&
-			    !zone_reclaimable(zone))
+			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;
 
 			/*
@@ -3184,9 +3218,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				/*
 				 * If balanced, clear the dirty and congested
 				 * flags
+				 *
+				 * TODO: ANOMALY
 				 */
-				clear_bit(ZONE_CONGESTED, &zone->flags);
-				clear_bit(ZONE_DIRTY, &zone->flags);
+				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 			}
 		}
 
@@ -3216,7 +3252,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				continue;
 
 			if (sc.priority != DEF_PRIORITY &&
-			    !zone_reclaimable(zone))
+			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;
 
 			sc.nr_scanned = 0;
@@ -3612,8 +3648,8 @@ int sysctl_min_slab_ratio = 5;
 static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
 {
 	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
-	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
-		zone_page_state(zone, NR_ACTIVE_FILE);
+	unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
+		node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
 
 	/*
 	 * It's possible for there to be more file mapped pages than
@@ -3716,7 +3752,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
 		return ZONE_RECLAIM_FULL;
 
-	if (!zone_reclaimable(zone))
+	if (!pgdat_reclaimable(zone->zone_pgdat))
 		return ZONE_RECLAIM_FULL;
 
 	/*
@@ -3795,7 +3831,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
 			zone = pagezone;
 			spin_lock_irq(zone_lru_lock(zone));
 		}
-		lruvec = mem_cgroup_page_lruvec(page, zone);
+		lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3345d396a99b..de0c17076270 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -936,11 +936,8 @@ const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
 	"nr_alloc_batch",
-	"nr_inactive_anon",
-	"nr_active_anon",
-	"nr_inactive_file",
-	"nr_active_file",
-	"nr_unevictable",
+	"nr_zone_anon_lru",
+	"nr_zone_file_lru",
 	"nr_mlock",
 	"nr_anon_pages",
 	"nr_mapped",
@@ -956,12 +953,9 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
 	"nr_writeback_temp",
-	"nr_isolated_anon",
-	"nr_isolated_file",
 	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
-	"nr_pages_scanned",
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	"nr_zspages",
 #endif
@@ -981,6 +975,16 @@ const char * const vmstat_text[] = {
 	"nr_shmem_pmdmapped",
 	"nr_free_cma",
 
+	/* Node-based counters */
+	"nr_inactive_anon",
+	"nr_active_anon",
+	"nr_inactive_file",
+	"nr_active_file",
+	"nr_unevictable",
+	"nr_isolated_anon",
+	"nr_isolated_file",
+	"nr_pages_scanned",
+
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
@@ -1002,11 +1006,11 @@ const char * const vmstat_text[] = {
 	"pgmajfault",
 	"pglazyfreed",
 
-	TEXTS_FOR_ZONES("pgrefill")
-	TEXTS_FOR_ZONES("pgsteal_kswapd")
-	TEXTS_FOR_ZONES("pgsteal_direct")
-	TEXTS_FOR_ZONES("pgscan_kswapd")
-	TEXTS_FOR_ZONES("pgscan_direct")
+	"pgrefill",
+	"pgsteal_kswapd",
+	"pgsteal_direct",
+	"pgscan_kswapd",
+	"pgscan_direct",
 	"pgscan_direct_throttle",
 
 #ifdef CONFIG_NUMA
@@ -1434,7 +1438,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n        scanned  %lu"
+		   "\n   node_scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu",
@@ -1442,13 +1446,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-		   zone_page_state(zone, NR_PAGES_SCANNED),
+		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone->managed_pages);
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
+		seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
 				zone_page_state(zone, i));
 
 	seq_printf(m,
@@ -1478,12 +1482,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 #endif
 	}
 	seq_printf(m,
-		   "\n  all_unreclaimable: %u"
-		   "\n  start_pfn:         %lu"
-		   "\n  inactive_ratio:    %u",
-		   !zone_reclaimable(zone),
+		   "\n  node_unreclaimable:  %u"
+		   "\n  start_pfn:           %lu"
+		   "\n  node_inactive_ratio: %u",
+		   !pgdat_reclaimable(zone->zone_pgdat),
 		   zone->zone_start_pfn,
-		   zone->inactive_ratio);
+		   zone->zone_pgdat->inactive_ratio);
 	seq_putc(m, '\n');
 }
 
@@ -1574,7 +1578,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
 {
 	unsigned long *l = arg;
 	unsigned long off = l - (unsigned long *)m->private;
-
 	seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
 	return 0;
 }
diff --git a/mm/workingset.c b/mm/workingset.c
index ba972ac2dfdd..ebe14445809a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -355,8 +355,8 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
 						     LRU_ALL_FILE);
 	} else {
-		pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
-			sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
+		pages = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
+			node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
 	}
 
 	/*
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 04/34] mm, mmzone: clarify the usage of zone padding
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Zone padding separates write-intensive fields used by page allocation,
compaction and vmstats but the comments are a little misleading and
need clarification.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4f5cac0a8c3..edafdaf62e90 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -477,20 +477,21 @@ struct zone {
 	unsigned long		wait_table_hash_nr_entries;
 	unsigned long		wait_table_bits;
 
+	/* Write-intensive fields used from the page allocator */
 	ZONE_PADDING(_pad1_)
+
 	/* free areas of different sizes */
 	struct free_area	free_area[MAX_ORDER];
 
 	/* zone flags, see below */
 	unsigned long		flags;
 
-	/* Write-intensive fields used from the page allocator */
+	/* Primarily protects free_area */
 	spinlock_t		lock;
 
+	/* Write-intensive fields used by compaction and vmstats. */
 	ZONE_PADDING(_pad2_)
 
-	/* Write-intensive fields used by page reclaim */
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 04/34] mm, mmzone: clarify the usage of zone padding
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Zone padding separates write-intensive fields used by page allocation,
compaction and vmstats but the comments are a little misleading and
need clarification.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4f5cac0a8c3..edafdaf62e90 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -477,20 +477,21 @@ struct zone {
 	unsigned long		wait_table_hash_nr_entries;
 	unsigned long		wait_table_bits;
 
+	/* Write-intensive fields used from the page allocator */
 	ZONE_PADDING(_pad1_)
+
 	/* free areas of different sizes */
 	struct free_area	free_area[MAX_ORDER];
 
 	/* zone flags, see below */
 	unsigned long		flags;
 
-	/* Write-intensive fields used from the page allocator */
+	/* Primarily protects free_area */
 	spinlock_t		lock;
 
+	/* Write-intensive fields used by compaction and vmstats. */
 	ZONE_PADDING(_pad2_)
 
-	/* Write-intensive fields used by page reclaim */
-
 	/*
 	 * When free pages are below this point, additional steps are taken
 	 * when reading the number of free pages to avoid per-cpu counter
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 05/34] mm, vmscan: begin reclaiming pages on a per-node basis
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

This patch makes reclaim decisions on a per-node basis.  A reclaimer knows
what zone is required by the allocation request and skips pages from
higher zones.  In many cases this will be ok because it's a GFP_HIGHMEM
request of some description.  On 64-bit, ZONE_DMA32 requests will cause
some problems but 32-bit devices on 64-bit platforms are increasingly
rare.  Historically it would have been a major problem on 32-bit with big
Highmem:Lowmem ratios but such configurations are also now rare and even
where they exist, they are not encouraged.  If it really becomes a
problem, it'll manifest as very low reclaim efficiencies.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmscan.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 55 insertions(+), 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86a523a761c9..766b36bec829 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -84,6 +84,9 @@ struct scan_control {
 	/* Scan (total_size >> priority) pages at once */
 	int priority;
 
+	/* The highest zone to isolate pages for reclaim from */
+	enum zone_type reclaim_idx;
+
 	unsigned int may_writepage:1;
 
 	/* Can mapped pages be reclaimed? */
@@ -1392,6 +1395,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long nr_taken = 0;
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
 	unsigned long scan, nr_pages;
+	LIST_HEAD(pages_skipped);
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
 					!list_empty(src); scan++) {
@@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 
+		if (page_zonenum(page) > sc->reclaim_idx) {
+			list_move(&page->lru, &pages_skipped);
+			continue;
+		}
+
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			nr_pages = hpage_nr_pages(page);
@@ -1420,6 +1429,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		}
 	}
 
+	/*
+	 * Splice any skipped pages to the start of the LRU list. Note that
+	 * this disrupts the LRU order when reclaiming for lower zones but
+	 * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
+	 * scanning would soon rescan the same pages to skip and put the
+	 * system at risk of premature OOM.
+	 */
+	if (!list_empty(&pages_skipped))
+		list_splice(&pages_skipped, src);
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
@@ -1589,7 +1607,7 @@ static int current_may_throttle(void)
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
+ * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
@@ -2401,12 +2419,13 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static bool shrink_zone(struct zone *zone, struct scan_control *sc,
-			bool is_classzone)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
+			enum zone_type classzone_idx)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	struct zone *zone = &pgdat->node_zones[classzone_idx];
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2438,7 +2457,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
 			zone_lru_pages += lru_pages;
 
-			if (memcg && is_classzone)
+			if (!global_reclaim(sc))
 				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
@@ -2469,7 +2488,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 		 * Shrink the slab caches in the same proportion that
 		 * the eligible LRU pages were scanned.
 		 */
-		if (global_reclaim(sc) && is_classzone)
+		if (global_reclaim(sc))
 			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
 				    sc->nr_scanned - nr_scanned,
 				    zone_lru_pages);
@@ -2553,7 +2572,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
-	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
+	enum zone_type classzone_idx;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2561,17 +2580,23 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	 * highmem pages could be pinning lowmem pages storing buffer_heads
 	 */
 	orig_mask = sc->gfp_mask;
-	if (buffer_heads_over_limit)
+	if (buffer_heads_over_limit) {
 		sc->gfp_mask |= __GFP_HIGHMEM;
+		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);
+	}
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
-					gfp_zone(sc->gfp_mask), sc->nodemask) {
-		enum zone_type classzone_idx;
-
+					sc->reclaim_idx, sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 
-		classzone_idx = requested_highidx;
+		/*
+		 * Note that reclaim_idx does not change as it is the highest
+		 * zone reclaimed from which for empty zones is a no-op but
+		 * classzone_idx is used by shrink_node to test if the slabs
+		 * should be shrunk on a given node.
+		 */
+		classzone_idx = sc->reclaim_idx;
 		while (!populated_zone(zone->zone_pgdat->node_zones +
 							classzone_idx))
 			classzone_idx--;
@@ -2600,8 +2625,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			 */
 			if (IS_ENABLED(CONFIG_COMPACTION) &&
 			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-			    zonelist_zone_idx(z) <= requested_highidx &&
-			    compaction_ready(zone, sc->order, requested_highidx)) {
+			    zonelist_zone_idx(z) <= classzone_idx &&
+			    compaction_ready(zone, sc->order, classzone_idx)) {
 				sc->compaction_ready = true;
 				continue;
 			}
@@ -2621,7 +2646,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+		shrink_node(zone->zone_pgdat, sc, classzone_idx);
 	}
 
 	/*
@@ -2847,6 +2872,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -2886,6 +2912,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 		.target_mem_cgroup = memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
 	};
 	unsigned long lru_pages;
@@ -2924,6 +2951,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
@@ -3118,7 +3146,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 						balance_gap, classzone_idx))
 		return true;
 
-	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+	shrink_node(zone->zone_pgdat, sc, classzone_idx);
 
 	/* TODO: ANOMALY */
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
@@ -3167,6 +3195,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.order = order,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
@@ -3237,15 +3266,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.may_writepage = 1;
 
 		/*
-		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
+		 * Continue scanning in the highmem->dma direction stopping at
+		 * the last zone which needs scanning. This may reclaim lowmem
+		 * pages that are not necessary for zone balancing but it
+		 * preserves LRU ordering. It is assumed that the bulk of
+		 * allocation requests can use arbitrary zones with the
+		 * possible exception of big highmem:lowmem configurations.
 		 */
-		for (i = 0; i <= end_zone; i++) {
+		for (i = end_zone; i >= 0; i--) {
 			struct zone *zone = pgdat->node_zones + i;
 
 			if (!populated_zone(zone))
@@ -3256,6 +3284,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				continue;
 
 			sc.nr_scanned = 0;
+			sc.reclaim_idx = i;
 
 			nr_soft_scanned = 0;
 			/*
@@ -3513,6 +3542,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 	struct scan_control sc = {
 		.nr_to_reclaim = nr_to_reclaim,
 		.gfp_mask = GFP_HIGHUSER_MOVABLE,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.priority = DEF_PRIORITY,
 		.may_writepage = 1,
 		.may_unmap = 1,
@@ -3704,6 +3734,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
+		.reclaim_idx = zone_idx(zone),
 	};
 
 	cond_resched();
@@ -3723,7 +3754,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_zone(zone, &sc, true);
+			shrink_node(zone->zone_pgdat, &sc, zone_idx(zone));
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 05/34] mm, vmscan: begin reclaiming pages on a per-node basis
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

This patch makes reclaim decisions on a per-node basis.  A reclaimer knows
what zone is required by the allocation request and skips pages from
higher zones.  In many cases this will be ok because it's a GFP_HIGHMEM
request of some description.  On 64-bit, ZONE_DMA32 requests will cause
some problems but 32-bit devices on 64-bit platforms are increasingly
rare.  Historically it would have been a major problem on 32-bit with big
Highmem:Lowmem ratios but such configurations are also now rare and even
where they exist, they are not encouraged.  If it really becomes a
problem, it'll manifest as very low reclaim efficiencies.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmscan.c | 79 ++++++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 55 insertions(+), 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86a523a761c9..766b36bec829 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -84,6 +84,9 @@ struct scan_control {
 	/* Scan (total_size >> priority) pages at once */
 	int priority;
 
+	/* The highest zone to isolate pages for reclaim from */
+	enum zone_type reclaim_idx;
+
 	unsigned int may_writepage:1;
 
 	/* Can mapped pages be reclaimed? */
@@ -1392,6 +1395,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long nr_taken = 0;
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
 	unsigned long scan, nr_pages;
+	LIST_HEAD(pages_skipped);
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
 					!list_empty(src); scan++) {
@@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 
+		if (page_zonenum(page) > sc->reclaim_idx) {
+			list_move(&page->lru, &pages_skipped);
+			continue;
+		}
+
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			nr_pages = hpage_nr_pages(page);
@@ -1420,6 +1429,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		}
 	}
 
+	/*
+	 * Splice any skipped pages to the start of the LRU list. Note that
+	 * this disrupts the LRU order when reclaiming for lower zones but
+	 * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
+	 * scanning would soon rescan the same pages to skip and put the
+	 * system at risk of premature OOM.
+	 */
+	if (!list_empty(&pages_skipped))
+		list_splice(&pages_skipped, src);
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
@@ -1589,7 +1607,7 @@ static int current_may_throttle(void)
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
+ * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
@@ -2401,12 +2419,13 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static bool shrink_zone(struct zone *zone, struct scan_control *sc,
-			bool is_classzone)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
+			enum zone_type classzone_idx)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	struct zone *zone = &pgdat->node_zones[classzone_idx];
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2438,7 +2457,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
 			zone_lru_pages += lru_pages;
 
-			if (memcg && is_classzone)
+			if (!global_reclaim(sc))
 				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
@@ -2469,7 +2488,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 		 * Shrink the slab caches in the same proportion that
 		 * the eligible LRU pages were scanned.
 		 */
-		if (global_reclaim(sc) && is_classzone)
+		if (global_reclaim(sc))
 			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
 				    sc->nr_scanned - nr_scanned,
 				    zone_lru_pages);
@@ -2553,7 +2572,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
-	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
+	enum zone_type classzone_idx;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2561,17 +2580,23 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	 * highmem pages could be pinning lowmem pages storing buffer_heads
 	 */
 	orig_mask = sc->gfp_mask;
-	if (buffer_heads_over_limit)
+	if (buffer_heads_over_limit) {
 		sc->gfp_mask |= __GFP_HIGHMEM;
+		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);
+	}
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
-					gfp_zone(sc->gfp_mask), sc->nodemask) {
-		enum zone_type classzone_idx;
-
+					sc->reclaim_idx, sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 
-		classzone_idx = requested_highidx;
+		/*
+		 * Note that reclaim_idx does not change as it is the highest
+		 * zone reclaimed from which for empty zones is a no-op but
+		 * classzone_idx is used by shrink_node to test if the slabs
+		 * should be shrunk on a given node.
+		 */
+		classzone_idx = sc->reclaim_idx;
 		while (!populated_zone(zone->zone_pgdat->node_zones +
 							classzone_idx))
 			classzone_idx--;
@@ -2600,8 +2625,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			 */
 			if (IS_ENABLED(CONFIG_COMPACTION) &&
 			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-			    zonelist_zone_idx(z) <= requested_highidx &&
-			    compaction_ready(zone, sc->order, requested_highidx)) {
+			    zonelist_zone_idx(z) <= classzone_idx &&
+			    compaction_ready(zone, sc->order, classzone_idx)) {
 				sc->compaction_ready = true;
 				continue;
 			}
@@ -2621,7 +2646,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+		shrink_node(zone->zone_pgdat, sc, classzone_idx);
 	}
 
 	/*
@@ -2847,6 +2872,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -2886,6 +2912,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 		.target_mem_cgroup = memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
 	};
 	unsigned long lru_pages;
@@ -2924,6 +2951,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
@@ -3118,7 +3146,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 						balance_gap, classzone_idx))
 		return true;
 
-	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+	shrink_node(zone->zone_pgdat, sc, classzone_idx);
 
 	/* TODO: ANOMALY */
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
@@ -3167,6 +3195,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long nr_soft_scanned;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.order = order,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
@@ -3237,15 +3266,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.may_writepage = 1;
 
 		/*
-		 * Now scan the zone in the dma->highmem direction, stopping
-		 * at the last zone which needs scanning.
-		 *
-		 * We do this because the page allocator works in the opposite
-		 * direction.  This prevents the page allocator from allocating
-		 * pages behind kswapd's direction of progress, which would
-		 * cause too much scanning of the lower zones.
+		 * Continue scanning in the highmem->dma direction stopping at
+		 * the last zone which needs scanning. This may reclaim lowmem
+		 * pages that are not necessary for zone balancing but it
+		 * preserves LRU ordering. It is assumed that the bulk of
+		 * allocation requests can use arbitrary zones with the
+		 * possible exception of big highmem:lowmem configurations.
 		 */
-		for (i = 0; i <= end_zone; i++) {
+		for (i = end_zone; i >= 0; i--) {
 			struct zone *zone = pgdat->node_zones + i;
 
 			if (!populated_zone(zone))
@@ -3256,6 +3284,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				continue;
 
 			sc.nr_scanned = 0;
+			sc.reclaim_idx = i;
 
 			nr_soft_scanned = 0;
 			/*
@@ -3513,6 +3542,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 	struct scan_control sc = {
 		.nr_to_reclaim = nr_to_reclaim,
 		.gfp_mask = GFP_HIGHUSER_MOVABLE,
+		.reclaim_idx = MAX_NR_ZONES - 1,
 		.priority = DEF_PRIORITY,
 		.may_writepage = 1,
 		.may_unmap = 1,
@@ -3704,6 +3734,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
+		.reclaim_idx = zone_idx(zone),
 	};
 
 	cond_resched();
@@ -3723,7 +3754,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_zone(zone, &sc, true);
+			shrink_node(zone->zone_pgdat, &sc, zone_idx(zone));
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 06/34] mm, vmscan: have kswapd only scan based on the highest requested zone
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

kswapd checks all eligible zones to see if they need balancing even if it
was woken for a lower zone.  This made sense when we reclaimed on a
per-zone basis because we wanted to shrink zones fairly so avoid
age-inversion problems.  Ideally this is completely unnecessary when
reclaiming on a per-node basis.  In theory, there may still be anomalies
when all requests are for lower zones and very old pages are preserved in
higher zones but this should be the exceptional case.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 766b36bec829..c6e61dae382b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3209,11 +3209,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		sc.nr_reclaimed = 0;
 
-		/*
-		 * Scan in the highmem->dma direction for the highest
-		 * zone which needs scanning
-		 */
-		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+		/* Scan from the highest requested zone to dma */
+		for (i = classzone_idx; i >= 0; i--) {
 			struct zone *zone = pgdat->node_zones + i;
 
 			if (!populated_zone(zone))
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 06/34] mm, vmscan: have kswapd only scan based on the highest requested zone
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

kswapd checks all eligible zones to see if they need balancing even if it
was woken for a lower zone.  This made sense when we reclaimed on a
per-zone basis because we wanted to shrink zones fairly so avoid
age-inversion problems.  Ideally this is completely unnecessary when
reclaiming on a per-node basis.  In theory, there may still be anomalies
when all requests are for lower zones and very old pages are preserved in
higher zones but this should be the exceptional case.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 766b36bec829..c6e61dae382b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3209,11 +3209,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		sc.nr_reclaimed = 0;
 
-		/*
-		 * Scan in the highmem->dma direction for the highest
-		 * zone which needs scanning
-		 */
-		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+		/* Scan from the highest requested zone to dma */
+		for (i = classzone_idx; i >= 0; i--) {
 			struct zone *zone = pgdat->node_zones + i;
 
 			if (!populated_zone(zone))
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
patch gets rid of many of the node-based versus zone-based decisions.

o A node is considered balanced when any eligible lower zone is balanced.
  This eliminates one class of age-inversion problem because we avoid
  reclaiming a newer page just because it's in the wrong zone
o pgdat_balanced disappears because we now only care about one zone being
  balanced.
o Some anomalies related to writeback and congestion tracking being based on
  zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
  that the page allocator uses.
o Most importantly of all, reclaim from node 0 with multiple zones will
  have similar aging and reclaiming characteristics as every
  other node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 292 +++++++++++++++++++++---------------------------------------
 1 file changed, 101 insertions(+), 191 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6e61dae382b..7b382b90b145 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2980,7 +2980,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 }
 #endif
 
-static void age_active_anon(struct zone *zone, struct scan_control *sc)
+static void age_active_anon(struct pglist_data *pgdat,
+				struct zone *zone, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
@@ -2999,85 +3000,15 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, bool highorder,
+static bool zone_balanced(struct zone *zone, int order,
 			unsigned long balance_gap, int classzone_idx)
 {
 	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	/*
-	 * When checking from pgdat_balanced(), kswapd should stop and sleep
-	 * when it reaches the high order-0 watermark and let kcompactd take
-	 * over. Other callers such as wakeup_kswapd() want to determine the
-	 * true high-order watermark.
-	 */
-	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
-		mark += (1UL << order);
-		order = 0;
-	}
-
 	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
- * pgdat_balanced() is used when checking if a node is balanced.
- *
- * For order-0, all zones must be balanced!
- *
- * For high-order allocations only zones that meet watermarks and are in a
- * zone allowed by the callers classzone_idx are added to balanced_pages. The
- * total of balanced pages must be at least 25% of the zones allowed by
- * classzone_idx for the node to be considered balanced. Forcing all zones to
- * be balanced for high orders can cause excessive reclaim when there are
- * imbalanced zones.
- * The choice of 25% is due to
- *   o a 16M DMA zone that is balanced will not balance a zone on any
- *     reasonable sized machine
- *   o On all other machines, the top zone must be at least a reasonable
- *     percentage of the middle zones. For example, on 32-bit x86, highmem
- *     would need to be at least 256M for it to be balance a whole node.
- *     Similarly, on x86-64 the Normal zone would need to be at least 1G
- *     to balance a node on its own. These seemed like reasonable ratios.
- */
-static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
-{
-	unsigned long managed_pages = 0;
-	unsigned long balanced_pages = 0;
-	int i;
-
-	/* Check the watermark levels */
-	for (i = 0; i <= classzone_idx; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		if (!populated_zone(zone))
-			continue;
-
-		managed_pages += zone->managed_pages;
-
-		/*
-		 * A special case here:
-		 *
-		 * balance_pgdat() skips over all_unreclaimable after
-		 * DEF_PRIORITY. Effectively, it considers them balanced so
-		 * they must be considered balanced here as well!
-		 */
-		if (!pgdat_reclaimable(zone->zone_pgdat)) {
-			balanced_pages += zone->managed_pages;
-			continue;
-		}
-
-		if (zone_balanced(zone, order, false, 0, i))
-			balanced_pages += zone->managed_pages;
-		else if (!order)
-			return false;
-	}
-
-	if (order)
-		return balanced_pages >= (managed_pages >> 2);
-	else
-		return true;
-}
-
-/*
  * Prepare kswapd for sleeping. This verifies that there are no processes
  * waiting in throttle_direct_reclaim() and that watermarks have been met.
  *
@@ -3086,6 +3017,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
 {
+	int i;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return false;
@@ -3106,101 +3039,90 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
-	return pgdat_balanced(pgdat, order, classzone_idx);
+	for (i = 0; i <= classzone_idx; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_balanced(zone, order, 0, classzone_idx))
+			return true;
+	}
+
+	return false;
 }
 
 /*
- * kswapd shrinks the zone by the number of pages required to reach
- * the high watermark.
+ * kswapd shrinks a node of pages that are at or below the highest usable
+ * zone that is currently unbalanced.
  *
  * Returns true if kswapd scanned at least the requested number of pages to
  * reclaim or if the lack of progress was due to pages under writeback.
  * This is used to determine if the scanning priority needs to be raised.
  */
-static bool kswapd_shrink_zone(struct zone *zone,
+static bool kswapd_shrink_node(pg_data_t *pgdat,
 			       int classzone_idx,
 			       struct scan_control *sc)
 {
-	unsigned long balance_gap;
-	bool lowmem_pressure;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct zone *zone;
+	int z;
 
-	/* Reclaim above the high watermark. */
-	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+	/* Reclaim a number of pages proportional to the number of zones */
+	sc->nr_to_reclaim = 0;
+	for (z = 0; z <= classzone_idx; z++) {
+		zone = pgdat->node_zones + z;
+		if (!populated_zone(zone))
+			continue;
 
-	/*
-	 * We put equal pressure on every zone, unless one zone has way too
-	 * many pages free already. The "too many pages" is defined as the
-	 * high wmark plus a "gap" where the gap is either the low
-	 * watermark or 1% of the zone, whichever is smaller.
-	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
+		sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
+	}
 
 	/*
-	 * If there is no low memory pressure or the zone is balanced then no
-	 * reclaim is necessary
+	 * Historically care was taken to put equal pressure on all zones but
+	 * now pressure is applied based on node LRU order.
 	 */
-	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, sc->order, false,
-						balance_gap, classzone_idx))
-		return true;
-
-	shrink_node(zone->zone_pgdat, sc, classzone_idx);
-
-	/* TODO: ANOMALY */
-	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+	shrink_node(pgdat, sc, classzone_idx);
 
 	/*
-	 * If a zone reaches its high watermark, consider it to be no longer
-	 * congested. It's possible there are dirty pages backed by congested
-	 * BDIs but as pressure is relieved, speculatively avoid congestion
-	 * waits.
+	 * Fragmentation may mean that the system cannot be rebalanced for
+	 * high-order allocations. If twice the allocation size has been
+	 * reclaimed then recheck watermarks only at order-0 to prevent
+	 * excessive reclaim. Assume that a process requested a high-order
+	 * can direct reclaim/compact.
 	 */
-	if (pgdat_reclaimable(zone->zone_pgdat) &&
-	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
-		clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-		clear_bit(PGDAT_DIRTY, &pgdat->flags);
-	}
+	if (sc->order && sc->nr_reclaimed >= 2UL << sc->order)
+		sc->order = 0;
 
 	return sc->nr_scanned >= sc->nr_to_reclaim;
 }
 
 /*
- * For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at high_wmark_pages(zone).
- *
- * Returns the highest zone idx kswapd was reclaiming at
+ * For kswapd, balance_pgdat() will reclaim pages across a node from zones
+ * that are eligible for use by the caller until at least one zone is
+ * balanced.
  *
- * There is special handling here for zones which are full of pinned pages.
- * This can happen if the pages are all mlocked, or if they are all used by
- * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
- * What we do is to detect the case where all pages in the zone have been
- * scanned twice and there has been zero successful reclaim.  Mark the zone as
- * dead and from now on, only perform a short scan.  Basically we're polling
- * the zone for when the problem goes away.
+ * Returns the order kswapd finished reclaiming at.
  *
  * kswapd scans the zones in the highmem->normal->dma direction.  It skips
  * zones which have free_pages > high_wmark_pages(zone), but once a zone is
- * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
- * lower zones regardless of the number of free pages in the lower zones. This
- * interoperates with the page allocator fallback scheme to ensure that aging
- * of pages is balanced across the zones.
+ * found to have free_pages <= high_wmark_pages(zone), any page is that zone
+ * or lower is eligible for reclaim until at least one usable zone is
+ * balanced.
  */
 static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
-	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
-		.reclaim_idx = MAX_NR_ZONES - 1,
 		.order = order,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.reclaim_idx = classzone_idx,
 	};
 	count_vm_event(PAGEOUTRUN);
 
@@ -3211,21 +3133,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		/* Scan from the highest requested zone to dma */
 		for (i = classzone_idx; i >= 0; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
+			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			/*
-			 * Do some background aging of the anon list, to give
-			 * pages a chance to be referenced before reclaiming.
-			 */
-			age_active_anon(zone, &sc);
-
 			/*
 			 * If the number of buffer_heads in the machine
 			 * exceeds the maximum allowed level and this node
@@ -3233,19 +3144,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			 * it to relieve lowmem pressure.
 			 */
 			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				end_zone = i;
+				classzone_idx = i;
 				break;
 			}
 
-			if (!zone_balanced(zone, order, false, 0, 0)) {
-				end_zone = i;
+			if (!zone_balanced(zone, order, 0, 0)) {
+				classzone_idx = i;
 				break;
 			} else {
 				/*
-				 * If balanced, clear the dirty and congested
-				 * flags
-				 *
-				 * TODO: ANOMALY
+				 * If any eligible zone is balanced then the
+				 * node is not considered congested or dirty.
 				 */
 				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
 				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
@@ -3256,51 +3165,34 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			goto out;
 
 		/*
+		 * Do some background aging of the anon list, to give
+		 * pages a chance to be referenced before reclaiming. All
+		 * pages are rotated regardless of classzone as this is
+		 * about consistent aging.
+		 */
+		age_active_anon(pgdat, &pgdat->node_zones[MAX_NR_ZONES - 1], &sc);
+
+		/*
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
 		 */
-		if (sc.priority < DEF_PRIORITY - 2)
+		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
 			sc.may_writepage = 1;
 
+		/* Call soft limit reclaim before calling shrink_node. */
+		sc.nr_scanned = 0;
+		nr_soft_scanned = 0;
+		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc.order,
+						sc.gfp_mask, &nr_soft_scanned);
+		sc.nr_reclaimed += nr_soft_reclaimed;
+
 		/*
-		 * Continue scanning in the highmem->dma direction stopping at
-		 * the last zone which needs scanning. This may reclaim lowmem
-		 * pages that are not necessary for zone balancing but it
-		 * preserves LRU ordering. It is assumed that the bulk of
-		 * allocation requests can use arbitrary zones with the
-		 * possible exception of big highmem:lowmem configurations.
+		 * There should be no need to raise the scanning priority if
+		 * enough pages are already being scanned that that high
+		 * watermark would be met at 100% efficiency.
 		 */
-		for (i = end_zone; i >= 0; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			if (!populated_zone(zone))
-				continue;
-
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			sc.nr_scanned = 0;
-			sc.reclaim_idx = i;
-
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-
-			/*
-			 * There should be no need to raise the scanning
-			 * priority if enough pages are already being scanned
-			 * that that high watermark would be met at 100%
-			 * efficiency.
-			 */
-			if (kswapd_shrink_zone(zone, end_zone, &sc))
-				raise_priority = false;
-		}
+		if (kswapd_shrink_node(pgdat, classzone_idx, &sc))
+			raise_priority = false;
 
 		/*
 		 * If the low watermark is met there is no need for processes
@@ -3316,20 +3208,37 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			break;
 
 		/*
+		 * Stop reclaiming if any eligible zone is balanced and clear
+		 * node writeback or congested.
+		 */
+		for (i = 0; i <= classzone_idx; i++) {
+			zone = pgdat->node_zones + i;
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
+				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+				clear_bit(PGDAT_DIRTY, &pgdat->flags);
+				goto out;
+			}
+		}
+
+		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
 		if (raise_priority || !sc.nr_reclaimed)
 			sc.priority--;
-	} while (sc.priority >= 1 &&
-			!pgdat_balanced(pgdat, order, classzone_idx));
+	} while (sc.priority >= 1);
 
 out:
 	/*
-	 * Return the highest zone idx we were reclaiming at so
-	 * prepare_kswapd_sleep() makes the same decisions as here.
+	 * Return the order kswapd stopped reclaiming at as
+	 * prepare_kswapd_sleep() takes it into account. If another caller
+	 * entered the allocator slow path while kswapd was awake, order will
+	 * remain at the higher level.
 	 */
-	return end_zone;
+	return sc.order;
 }
 
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
@@ -3486,8 +3395,9 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balanced_classzone_idx = balance_pgdat(pgdat, order,
-								classzone_idx);
+
+			/* return value ignored until next patch */
+			balance_pgdat(pgdat, order, classzone_idx);
 		}
 	}
 
@@ -3517,7 +3427,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, true, 0, 0))
+	if (zone_balanced(zone, order, 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
patch gets rid of many of the node-based versus zone-based decisions.

o A node is considered balanced when any eligible lower zone is balanced.
  This eliminates one class of age-inversion problem because we avoid
  reclaiming a newer page just because it's in the wrong zone
o pgdat_balanced disappears because we now only care about one zone being
  balanced.
o Some anomalies related to writeback and congestion tracking being based on
  zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
  that the page allocator uses.
o Most importantly of all, reclaim from node 0 with multiple zones will
  have similar aging and reclaiming characteristics as every
  other node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 292 +++++++++++++++++++++---------------------------------------
 1 file changed, 101 insertions(+), 191 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6e61dae382b..7b382b90b145 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2980,7 +2980,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 }
 #endif
 
-static void age_active_anon(struct zone *zone, struct scan_control *sc)
+static void age_active_anon(struct pglist_data *pgdat,
+				struct zone *zone, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
@@ -2999,85 +3000,15 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, bool highorder,
+static bool zone_balanced(struct zone *zone, int order,
 			unsigned long balance_gap, int classzone_idx)
 {
 	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	/*
-	 * When checking from pgdat_balanced(), kswapd should stop and sleep
-	 * when it reaches the high order-0 watermark and let kcompactd take
-	 * over. Other callers such as wakeup_kswapd() want to determine the
-	 * true high-order watermark.
-	 */
-	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
-		mark += (1UL << order);
-		order = 0;
-	}
-
 	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
- * pgdat_balanced() is used when checking if a node is balanced.
- *
- * For order-0, all zones must be balanced!
- *
- * For high-order allocations only zones that meet watermarks and are in a
- * zone allowed by the callers classzone_idx are added to balanced_pages. The
- * total of balanced pages must be at least 25% of the zones allowed by
- * classzone_idx for the node to be considered balanced. Forcing all zones to
- * be balanced for high orders can cause excessive reclaim when there are
- * imbalanced zones.
- * The choice of 25% is due to
- *   o a 16M DMA zone that is balanced will not balance a zone on any
- *     reasonable sized machine
- *   o On all other machines, the top zone must be at least a reasonable
- *     percentage of the middle zones. For example, on 32-bit x86, highmem
- *     would need to be at least 256M for it to be balance a whole node.
- *     Similarly, on x86-64 the Normal zone would need to be at least 1G
- *     to balance a node on its own. These seemed like reasonable ratios.
- */
-static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
-{
-	unsigned long managed_pages = 0;
-	unsigned long balanced_pages = 0;
-	int i;
-
-	/* Check the watermark levels */
-	for (i = 0; i <= classzone_idx; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		if (!populated_zone(zone))
-			continue;
-
-		managed_pages += zone->managed_pages;
-
-		/*
-		 * A special case here:
-		 *
-		 * balance_pgdat() skips over all_unreclaimable after
-		 * DEF_PRIORITY. Effectively, it considers them balanced so
-		 * they must be considered balanced here as well!
-		 */
-		if (!pgdat_reclaimable(zone->zone_pgdat)) {
-			balanced_pages += zone->managed_pages;
-			continue;
-		}
-
-		if (zone_balanced(zone, order, false, 0, i))
-			balanced_pages += zone->managed_pages;
-		else if (!order)
-			return false;
-	}
-
-	if (order)
-		return balanced_pages >= (managed_pages >> 2);
-	else
-		return true;
-}
-
-/*
  * Prepare kswapd for sleeping. This verifies that there are no processes
  * waiting in throttle_direct_reclaim() and that watermarks have been met.
  *
@@ -3086,6 +3017,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 					int classzone_idx)
 {
+	int i;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return false;
@@ -3106,101 +3039,90 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
-	return pgdat_balanced(pgdat, order, classzone_idx);
+	for (i = 0; i <= classzone_idx; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_balanced(zone, order, 0, classzone_idx))
+			return true;
+	}
+
+	return false;
 }
 
 /*
- * kswapd shrinks the zone by the number of pages required to reach
- * the high watermark.
+ * kswapd shrinks a node of pages that are at or below the highest usable
+ * zone that is currently unbalanced.
  *
  * Returns true if kswapd scanned at least the requested number of pages to
  * reclaim or if the lack of progress was due to pages under writeback.
  * This is used to determine if the scanning priority needs to be raised.
  */
-static bool kswapd_shrink_zone(struct zone *zone,
+static bool kswapd_shrink_node(pg_data_t *pgdat,
 			       int classzone_idx,
 			       struct scan_control *sc)
 {
-	unsigned long balance_gap;
-	bool lowmem_pressure;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct zone *zone;
+	int z;
 
-	/* Reclaim above the high watermark. */
-	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+	/* Reclaim a number of pages proportional to the number of zones */
+	sc->nr_to_reclaim = 0;
+	for (z = 0; z <= classzone_idx; z++) {
+		zone = pgdat->node_zones + z;
+		if (!populated_zone(zone))
+			continue;
 
-	/*
-	 * We put equal pressure on every zone, unless one zone has way too
-	 * many pages free already. The "too many pages" is defined as the
-	 * high wmark plus a "gap" where the gap is either the low
-	 * watermark or 1% of the zone, whichever is smaller.
-	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
+		sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
+	}
 
 	/*
-	 * If there is no low memory pressure or the zone is balanced then no
-	 * reclaim is necessary
+	 * Historically care was taken to put equal pressure on all zones but
+	 * now pressure is applied based on node LRU order.
 	 */
-	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, sc->order, false,
-						balance_gap, classzone_idx))
-		return true;
-
-	shrink_node(zone->zone_pgdat, sc, classzone_idx);
-
-	/* TODO: ANOMALY */
-	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+	shrink_node(pgdat, sc, classzone_idx);
 
 	/*
-	 * If a zone reaches its high watermark, consider it to be no longer
-	 * congested. It's possible there are dirty pages backed by congested
-	 * BDIs but as pressure is relieved, speculatively avoid congestion
-	 * waits.
+	 * Fragmentation may mean that the system cannot be rebalanced for
+	 * high-order allocations. If twice the allocation size has been
+	 * reclaimed then recheck watermarks only at order-0 to prevent
+	 * excessive reclaim. Assume that a process requested a high-order
+	 * can direct reclaim/compact.
 	 */
-	if (pgdat_reclaimable(zone->zone_pgdat) &&
-	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
-		clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-		clear_bit(PGDAT_DIRTY, &pgdat->flags);
-	}
+	if (sc->order && sc->nr_reclaimed >= 2UL << sc->order)
+		sc->order = 0;
 
 	return sc->nr_scanned >= sc->nr_to_reclaim;
 }
 
 /*
- * For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at high_wmark_pages(zone).
- *
- * Returns the highest zone idx kswapd was reclaiming at
+ * For kswapd, balance_pgdat() will reclaim pages across a node from zones
+ * that are eligible for use by the caller until at least one zone is
+ * balanced.
  *
- * There is special handling here for zones which are full of pinned pages.
- * This can happen if the pages are all mlocked, or if they are all used by
- * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
- * What we do is to detect the case where all pages in the zone have been
- * scanned twice and there has been zero successful reclaim.  Mark the zone as
- * dead and from now on, only perform a short scan.  Basically we're polling
- * the zone for when the problem goes away.
+ * Returns the order kswapd finished reclaiming at.
  *
  * kswapd scans the zones in the highmem->normal->dma direction.  It skips
  * zones which have free_pages > high_wmark_pages(zone), but once a zone is
- * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
- * lower zones regardless of the number of free pages in the lower zones. This
- * interoperates with the page allocator fallback scheme to ensure that aging
- * of pages is balanced across the zones.
+ * found to have free_pages <= high_wmark_pages(zone), any page is that zone
+ * or lower is eligible for reclaim until at least one usable zone is
+ * balanced.
  */
 static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
-	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
-		.reclaim_idx = MAX_NR_ZONES - 1,
 		.order = order,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.reclaim_idx = classzone_idx,
 	};
 	count_vm_event(PAGEOUTRUN);
 
@@ -3211,21 +3133,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		/* Scan from the highest requested zone to dma */
 		for (i = classzone_idx; i >= 0; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
+			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			/*
-			 * Do some background aging of the anon list, to give
-			 * pages a chance to be referenced before reclaiming.
-			 */
-			age_active_anon(zone, &sc);
-
 			/*
 			 * If the number of buffer_heads in the machine
 			 * exceeds the maximum allowed level and this node
@@ -3233,19 +3144,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			 * it to relieve lowmem pressure.
 			 */
 			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				end_zone = i;
+				classzone_idx = i;
 				break;
 			}
 
-			if (!zone_balanced(zone, order, false, 0, 0)) {
-				end_zone = i;
+			if (!zone_balanced(zone, order, 0, 0)) {
+				classzone_idx = i;
 				break;
 			} else {
 				/*
-				 * If balanced, clear the dirty and congested
-				 * flags
-				 *
-				 * TODO: ANOMALY
+				 * If any eligible zone is balanced then the
+				 * node is not considered congested or dirty.
 				 */
 				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
 				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
@@ -3256,51 +3165,34 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			goto out;
 
 		/*
+		 * Do some background aging of the anon list, to give
+		 * pages a chance to be referenced before reclaiming. All
+		 * pages are rotated regardless of classzone as this is
+		 * about consistent aging.
+		 */
+		age_active_anon(pgdat, &pgdat->node_zones[MAX_NR_ZONES - 1], &sc);
+
+		/*
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
 		 */
-		if (sc.priority < DEF_PRIORITY - 2)
+		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
 			sc.may_writepage = 1;
 
+		/* Call soft limit reclaim before calling shrink_node. */
+		sc.nr_scanned = 0;
+		nr_soft_scanned = 0;
+		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc.order,
+						sc.gfp_mask, &nr_soft_scanned);
+		sc.nr_reclaimed += nr_soft_reclaimed;
+
 		/*
-		 * Continue scanning in the highmem->dma direction stopping at
-		 * the last zone which needs scanning. This may reclaim lowmem
-		 * pages that are not necessary for zone balancing but it
-		 * preserves LRU ordering. It is assumed that the bulk of
-		 * allocation requests can use arbitrary zones with the
-		 * possible exception of big highmem:lowmem configurations.
+		 * There should be no need to raise the scanning priority if
+		 * enough pages are already being scanned that that high
+		 * watermark would be met at 100% efficiency.
 		 */
-		for (i = end_zone; i >= 0; i--) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			if (!populated_zone(zone))
-				continue;
-
-			if (sc.priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;
-
-			sc.nr_scanned = 0;
-			sc.reclaim_idx = i;
-
-			nr_soft_scanned = 0;
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 */
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-							order, sc.gfp_mask,
-							&nr_soft_scanned);
-			sc.nr_reclaimed += nr_soft_reclaimed;
-
-			/*
-			 * There should be no need to raise the scanning
-			 * priority if enough pages are already being scanned
-			 * that that high watermark would be met at 100%
-			 * efficiency.
-			 */
-			if (kswapd_shrink_zone(zone, end_zone, &sc))
-				raise_priority = false;
-		}
+		if (kswapd_shrink_node(pgdat, classzone_idx, &sc))
+			raise_priority = false;
 
 		/*
 		 * If the low watermark is met there is no need for processes
@@ -3316,20 +3208,37 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			break;
 
 		/*
+		 * Stop reclaiming if any eligible zone is balanced and clear
+		 * node writeback or congested.
+		 */
+		for (i = 0; i <= classzone_idx; i++) {
+			zone = pgdat->node_zones + i;
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
+				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+				clear_bit(PGDAT_DIRTY, &pgdat->flags);
+				goto out;
+			}
+		}
+
+		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
 		if (raise_priority || !sc.nr_reclaimed)
 			sc.priority--;
-	} while (sc.priority >= 1 &&
-			!pgdat_balanced(pgdat, order, classzone_idx));
+	} while (sc.priority >= 1);
 
 out:
 	/*
-	 * Return the highest zone idx we were reclaiming at so
-	 * prepare_kswapd_sleep() makes the same decisions as here.
+	 * Return the order kswapd stopped reclaiming at as
+	 * prepare_kswapd_sleep() takes it into account. If another caller
+	 * entered the allocator slow path while kswapd was awake, order will
+	 * remain at the higher level.
 	 */
-	return end_zone;
+	return sc.order;
 }
 
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
@@ -3486,8 +3395,9 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balanced_classzone_idx = balance_pgdat(pgdat, order,
-								classzone_idx);
+
+			/* return value ignored until next patch */
+			balance_pgdat(pgdat, order, classzone_idx);
 		}
 	}
 
@@ -3517,7 +3427,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, true, 0, 0))
+	if (zone_balanced(zone, order, 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 08/34] mm, vmscan: remove balance gap
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The balance gap was introduced to apply equal pressure to all zones when
reclaiming for a higher zone.  With node-based LRU, the need for the
balance gap is removed and the code is dead so remove it.

[vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/swap.h |  9 ---------
 mm/vmscan.c          | 19 ++++++++-----------
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c82f916008b7..916e2eddecd6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -157,15 +157,6 @@ enum {
 #define SWAP_CLUSTER_MAX 32UL
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
-/*
- * Ratio between zone->managed_pages and the "gap" that above the per-zone
- * "high_wmark". While balancing nodes, We allow kswapd to shrink zones that
- * do not meet the (high_wmark + gap) watermark, even which already met the
- * high_wmark, in order to provide better per-zone lru behavior. We are ok to
- * spend not more than 1% of the memory for this zone balancing "gap".
- */
-#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
-
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b382b90b145..a52167eabc96 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2518,7 +2518,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
  */
 static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
 {
-	unsigned long balance_gap, watermark;
+	unsigned long watermark;
 	bool watermark_ok;
 
 	/*
@@ -2527,9 +2527,7 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
 	 * there is a buffer of free pages available to give compaction
 	 * a reasonable chance of completing and allocating the page
 	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-	watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
+	watermark = high_wmark_pages(zone) + (2UL << order);
 	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
 
 	/*
@@ -3000,10 +2998,9 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
-			unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 {
-	unsigned long mark = high_wmark_pages(zone) + balance_gap;
+	unsigned long mark = high_wmark_pages(zone);
 
 	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
@@ -3045,7 +3042,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone_balanced(zone, order, 0, classzone_idx))
+		if (zone_balanced(zone, order, classzone_idx))
 			return true;
 	}
 
@@ -3148,7 +3145,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				break;
 			}
 
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order, 0)) {
 				classzone_idx = i;
 				break;
 			} else {
@@ -3216,7 +3213,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
+			if (zone_balanced(zone, sc.order, classzone_idx)) {
 				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
 				clear_bit(PGDAT_DIRTY, &pgdat->flags);
 				goto out;
@@ -3427,7 +3424,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
+	if (zone_balanced(zone, order, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 08/34] mm, vmscan: remove balance gap
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The balance gap was introduced to apply equal pressure to all zones when
reclaiming for a higher zone.  With node-based LRU, the need for the
balance gap is removed and the code is dead so remove it.

[vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/swap.h |  9 ---------
 mm/vmscan.c          | 19 ++++++++-----------
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c82f916008b7..916e2eddecd6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -157,15 +157,6 @@ enum {
 #define SWAP_CLUSTER_MAX 32UL
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
-/*
- * Ratio between zone->managed_pages and the "gap" that above the per-zone
- * "high_wmark". While balancing nodes, We allow kswapd to shrink zones that
- * do not meet the (high_wmark + gap) watermark, even which already met the
- * high_wmark, in order to provide better per-zone lru behavior. We are ok to
- * spend not more than 1% of the memory for this zone balancing "gap".
- */
-#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
-
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b382b90b145..a52167eabc96 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2518,7 +2518,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
  */
 static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
 {
-	unsigned long balance_gap, watermark;
+	unsigned long watermark;
 	bool watermark_ok;
 
 	/*
@@ -2527,9 +2527,7 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
 	 * there is a buffer of free pages available to give compaction
 	 * a reasonable chance of completing and allocating the page
 	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-	watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
+	watermark = high_wmark_pages(zone) + (2UL << order);
 	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
 
 	/*
@@ -3000,10 +2998,9 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
-			unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 {
-	unsigned long mark = high_wmark_pages(zone) + balance_gap;
+	unsigned long mark = high_wmark_pages(zone);
 
 	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
@@ -3045,7 +3042,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone_balanced(zone, order, 0, classzone_idx))
+		if (zone_balanced(zone, order, classzone_idx))
 			return true;
 	}
 
@@ -3148,7 +3145,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				break;
 			}
 
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order, 0)) {
 				classzone_idx = i;
 				break;
 			} else {
@@ -3216,7 +3213,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
+			if (zone_balanced(zone, sc.order, classzone_idx)) {
 				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
 				clear_bit(PGDAT_DIRTY, &pgdat->flags);
 				goto out;
@@ -3427,7 +3424,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
+	if (zone_balanced(zone, order, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 09/34] mm, vmscan: simplify the logic deciding whether kswapd sleeps
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order.  It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat().  What matters most of all is whether a larger order has
been requsted and whether kswapd successfully reclaimed at the previous
order.  This patch irons out the logic to check just that and the end
result is less headache inducing.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |   5 ++-
 mm/memory_hotplug.c    |   5 ++-
 mm/page_alloc.c        |   2 +-
 mm/vmscan.c            | 101 ++++++++++++++++++++++++-------------------------
 4 files changed, 57 insertions(+), 56 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index edafdaf62e90..4062fa74526f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -668,8 +668,9 @@ typedef struct pglist_data {
 	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
-	int kswapd_max_order;
-	enum zone_type classzone_idx;
+	int kswapd_order;
+	enum zone_type kswapd_classzone_idx;
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c5278360ca66..065140ecd081 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 
 		arch_refresh_nodedata(nid, pgdat);
 	} else {
-		/* Reset the nr_zones and classzone_idx to 0 before reuse */
+		/* Reset the nr_zones, order and classzone_idx before reuse */
 		pgdat->nr_zones = 0;
-		pgdat->classzone_idx = 0;
+		pgdat->kswapd_order = 0;
+		pgdat->kswapd_classzone_idx = 0;
 	}
 
 	/* we can use NODE_DATA(nid) from here */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b84b85ae54ff..d25dc24f65f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6084,7 +6084,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	unsigned long end_pfn = 0;
 
 	/* pg_data_t should be reset to zero when it's allocated */
-	WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
+	WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
 
 	reset_deferred_meminit(pgdat);
 	pgdat->node_id = nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a52167eabc96..905c60473126 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2762,7 +2762,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
 	/* kswapd must be awake if processes are being throttled */
 	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
-		pgdat->classzone_idx = min(pgdat->classzone_idx,
+		pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
 						(enum zone_type)ZONE_NORMAL);
 		wake_up_interruptible(&pgdat->kswapd_wait);
 	}
@@ -3042,11 +3042,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone_balanced(zone, order, classzone_idx))
-			return true;
+		if (!zone_balanced(zone, order, classzone_idx))
+			return false;
 	}
 
-	return false;
+	return true;
 }
 
 /*
@@ -3238,8 +3238,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	return sc.order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
-				int classzone_idx, int balanced_classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
+				unsigned int classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
@@ -3250,8 +3250,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (prepare_kswapd_sleep(pgdat, order, remaining,
-						balanced_classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
 		/*
 		 * Compaction records what page blocks it recently failed to
 		 * isolate pages from and skips them in the future scanning.
@@ -3264,9 +3263,20 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
 		 * We have freed the memory, now we should compact it to make
 		 * allocation of the requested order possible.
 		 */
-		wakeup_kcompactd(pgdat, order, classzone_idx);
+		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
 
 		remaining = schedule_timeout(HZ/10);
+
+		/*
+		 * If woken prematurely then reset kswapd_classzone_idx and
+		 * order. The values will either be from a wakeup request or
+		 * the previous request that slept prematurely.
+		 */
+		if (remaining) {
+			pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+			pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
+		}
+
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 	}
@@ -3275,8 +3285,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (prepare_kswapd_sleep(pgdat, order, remaining,
-						balanced_classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -3317,9 +3326,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
  */
 static int kswapd(void *p)
 {
-	unsigned long order, new_order;
-	int classzone_idx, new_classzone_idx;
-	int balanced_classzone_idx;
+	unsigned int alloc_order, reclaim_order, classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 
@@ -3349,38 +3356,20 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
-	order = new_order = 0;
-	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
-	balanced_classzone_idx = classzone_idx;
+	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
+	pgdat->kswapd_classzone_idx = classzone_idx = 0;
 	for ( ; ; ) {
 		bool ret;
 
-		/*
-		 * While we were reclaiming, there might have been another
-		 * wakeup, so check the values.
-		 */
-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order =  0;
-		pgdat->classzone_idx = pgdat->nr_zones - 1;
+kswapd_try_sleep:
+		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
+					classzone_idx);
 
-		if (order < new_order || classzone_idx > new_classzone_idx) {
-			/*
-			 * Don't sleep if someone wants a larger 'order'
-			 * allocation or has tigher zone constraints
-			 */
-			order = new_order;
-			classzone_idx = new_classzone_idx;
-		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx,
-						balanced_classzone_idx);
-			order = pgdat->kswapd_max_order;
-			classzone_idx = pgdat->classzone_idx;
-			new_order = order;
-			new_classzone_idx = classzone_idx;
-			pgdat->kswapd_max_order = 0;
-			pgdat->classzone_idx = pgdat->nr_zones - 1;
-		}
+		/* Read the new order and classzone_idx */
+		alloc_order = reclaim_order = pgdat->kswapd_order;
+		classzone_idx = pgdat->kswapd_classzone_idx;
+		pgdat->kswapd_order = 0;
+		pgdat->kswapd_classzone_idx = 0;
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -3390,12 +3379,24 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret) {
-			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
+		if (ret)
+			continue;
+
+		/*
+		 * Reclaim begins at the requested order but if a high-order
+		 * reclaim fails then kswapd falls back to reclaiming for
+		 * order-0. If that happens, kswapd will consider sleeping
+		 * for the order it finished reclaiming at (reclaim_order)
+		 * but kcompactd is woken to compact for the original
+		 * request (alloc_order).
+		 */
+		trace_mm_vmscan_kswapd_wake(pgdat->node_id, alloc_order);
+		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+		if (reclaim_order < alloc_order)
+			goto kswapd_try_sleep;
 
-			/* return value ignored until next patch */
-			balance_pgdat(pgdat, order, classzone_idx);
-		}
+		alloc_order = reclaim_order = pgdat->kswapd_order;
+		classzone_idx = pgdat->kswapd_classzone_idx;
 	}
 
 	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
@@ -3418,10 +3419,8 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
 		return;
 	pgdat = zone->zone_pgdat;
-	if (pgdat->kswapd_max_order < order) {
-		pgdat->kswapd_max_order = order;
-		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
-	}
+	pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 	if (zone_balanced(zone, order, 0))
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 09/34] mm, vmscan: simplify the logic deciding whether kswapd sleeps
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order.  It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat().  What matters most of all is whether a larger order has
been requsted and whether kswapd successfully reclaimed at the previous
order.  This patch irons out the logic to check just that and the end
result is less headache inducing.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |   5 ++-
 mm/memory_hotplug.c    |   5 ++-
 mm/page_alloc.c        |   2 +-
 mm/vmscan.c            | 101 ++++++++++++++++++++++++-------------------------
 4 files changed, 57 insertions(+), 56 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index edafdaf62e90..4062fa74526f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -668,8 +668,9 @@ typedef struct pglist_data {
 	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
-	int kswapd_max_order;
-	enum zone_type classzone_idx;
+	int kswapd_order;
+	enum zone_type kswapd_classzone_idx;
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c5278360ca66..065140ecd081 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 
 		arch_refresh_nodedata(nid, pgdat);
 	} else {
-		/* Reset the nr_zones and classzone_idx to 0 before reuse */
+		/* Reset the nr_zones, order and classzone_idx before reuse */
 		pgdat->nr_zones = 0;
-		pgdat->classzone_idx = 0;
+		pgdat->kswapd_order = 0;
+		pgdat->kswapd_classzone_idx = 0;
 	}
 
 	/* we can use NODE_DATA(nid) from here */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b84b85ae54ff..d25dc24f65f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6084,7 +6084,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	unsigned long end_pfn = 0;
 
 	/* pg_data_t should be reset to zero when it's allocated */
-	WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
+	WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
 
 	reset_deferred_meminit(pgdat);
 	pgdat->node_id = nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a52167eabc96..905c60473126 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2762,7 +2762,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
 	/* kswapd must be awake if processes are being throttled */
 	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
-		pgdat->classzone_idx = min(pgdat->classzone_idx,
+		pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
 						(enum zone_type)ZONE_NORMAL);
 		wake_up_interruptible(&pgdat->kswapd_wait);
 	}
@@ -3042,11 +3042,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone_balanced(zone, order, classzone_idx))
-			return true;
+		if (!zone_balanced(zone, order, classzone_idx))
+			return false;
 	}
 
-	return false;
+	return true;
 }
 
 /*
@@ -3238,8 +3238,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	return sc.order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
-				int classzone_idx, int balanced_classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
+				unsigned int classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
@@ -3250,8 +3250,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (prepare_kswapd_sleep(pgdat, order, remaining,
-						balanced_classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
 		/*
 		 * Compaction records what page blocks it recently failed to
 		 * isolate pages from and skips them in the future scanning.
@@ -3264,9 +3263,20 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
 		 * We have freed the memory, now we should compact it to make
 		 * allocation of the requested order possible.
 		 */
-		wakeup_kcompactd(pgdat, order, classzone_idx);
+		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
 
 		remaining = schedule_timeout(HZ/10);
+
+		/*
+		 * If woken prematurely then reset kswapd_classzone_idx and
+		 * order. The values will either be from a wakeup request or
+		 * the previous request that slept prematurely.
+		 */
+		if (remaining) {
+			pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+			pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
+		}
+
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 	}
@@ -3275,8 +3285,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (prepare_kswapd_sleep(pgdat, order, remaining,
-						balanced_classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -3317,9 +3326,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
  */
 static int kswapd(void *p)
 {
-	unsigned long order, new_order;
-	int classzone_idx, new_classzone_idx;
-	int balanced_classzone_idx;
+	unsigned int alloc_order, reclaim_order, classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 
@@ -3349,38 +3356,20 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
-	order = new_order = 0;
-	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
-	balanced_classzone_idx = classzone_idx;
+	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
+	pgdat->kswapd_classzone_idx = classzone_idx = 0;
 	for ( ; ; ) {
 		bool ret;
 
-		/*
-		 * While we were reclaiming, there might have been another
-		 * wakeup, so check the values.
-		 */
-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order =  0;
-		pgdat->classzone_idx = pgdat->nr_zones - 1;
+kswapd_try_sleep:
+		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
+					classzone_idx);
 
-		if (order < new_order || classzone_idx > new_classzone_idx) {
-			/*
-			 * Don't sleep if someone wants a larger 'order'
-			 * allocation or has tigher zone constraints
-			 */
-			order = new_order;
-			classzone_idx = new_classzone_idx;
-		} else {
-			kswapd_try_to_sleep(pgdat, order, classzone_idx,
-						balanced_classzone_idx);
-			order = pgdat->kswapd_max_order;
-			classzone_idx = pgdat->classzone_idx;
-			new_order = order;
-			new_classzone_idx = classzone_idx;
-			pgdat->kswapd_max_order = 0;
-			pgdat->classzone_idx = pgdat->nr_zones - 1;
-		}
+		/* Read the new order and classzone_idx */
+		alloc_order = reclaim_order = pgdat->kswapd_order;
+		classzone_idx = pgdat->kswapd_classzone_idx;
+		pgdat->kswapd_order = 0;
+		pgdat->kswapd_classzone_idx = 0;
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -3390,12 +3379,24 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret) {
-			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
+		if (ret)
+			continue;
+
+		/*
+		 * Reclaim begins at the requested order but if a high-order
+		 * reclaim fails then kswapd falls back to reclaiming for
+		 * order-0. If that happens, kswapd will consider sleeping
+		 * for the order it finished reclaiming at (reclaim_order)
+		 * but kcompactd is woken to compact for the original
+		 * request (alloc_order).
+		 */
+		trace_mm_vmscan_kswapd_wake(pgdat->node_id, alloc_order);
+		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+		if (reclaim_order < alloc_order)
+			goto kswapd_try_sleep;
 
-			/* return value ignored until next patch */
-			balance_pgdat(pgdat, order, classzone_idx);
-		}
+		alloc_order = reclaim_order = pgdat->kswapd_order;
+		classzone_idx = pgdat->kswapd_classzone_idx;
 	}
 
 	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
@@ -3418,10 +3419,8 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
 		return;
 	pgdat = zone->zone_pgdat;
-	if (pgdat->kswapd_max_order < order) {
-		pgdat->kswapd_max_order = order;
-		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
-	}
+	pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 	if (zone_balanced(zone, order, 0))
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 10/34] mm, vmscan: by default have direct reclaim only shrink once per node
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Direct reclaim iterates over all zones in the zonelist and shrinking them
but this is in conflict with node-based reclaim.  In the default case,
only shrink once per node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 905c60473126..01fe4708e404 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2552,14 +2552,6 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
  * try to reclaim pages from zones which will satisfy the caller's allocation
  * request.
  *
- * We reclaim from a zone even if that zone is over high_wmark_pages(zone).
- * Because:
- * a) The caller may be trying to free *extra* pages to satisfy a higher-order
- *    allocation or
- * b) The target zone may be at high_wmark_pages(zone) but the lower zones
- *    must go *over* high_wmark_pages(zone) to satisfy the `incremental min'
- *    zone defense algorithm.
- *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
@@ -2571,6 +2563,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	enum zone_type classzone_idx;
+	pg_data_t *last_pgdat = NULL;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2630,6 +2623,15 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			}
 
 			/*
+			 * Shrink each node in the zonelist once. If the
+			 * zonelist is ordered by zone (not the default) then a
+			 * node may be shrunk multiple times but in that case
+			 * the user prefers lower zones being preserved.
+			 */
+			if (zone->zone_pgdat == last_pgdat)
+				continue;
+
+			/*
 			 * This steals pages from memory cgroups over softlimit
 			 * and returns the number of reclaimed pages and
 			 * scanned pages. This works for global memory pressure
@@ -2644,6 +2646,10 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
+		/* See comment about same check for global reclaim above */
+		if (zone->zone_pgdat == last_pgdat)
+			continue;
+		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc, classzone_idx);
 	}
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 10/34] mm, vmscan: by default have direct reclaim only shrink once per node
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Direct reclaim iterates over all zones in the zonelist and shrinking them
but this is in conflict with node-based reclaim.  In the default case,
only shrink once per node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 905c60473126..01fe4708e404 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2552,14 +2552,6 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
  * try to reclaim pages from zones which will satisfy the caller's allocation
  * request.
  *
- * We reclaim from a zone even if that zone is over high_wmark_pages(zone).
- * Because:
- * a) The caller may be trying to free *extra* pages to satisfy a higher-order
- *    allocation or
- * b) The target zone may be at high_wmark_pages(zone) but the lower zones
- *    must go *over* high_wmark_pages(zone) to satisfy the `incremental min'
- *    zone defense algorithm.
- *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
@@ -2571,6 +2563,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	enum zone_type classzone_idx;
+	pg_data_t *last_pgdat = NULL;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2630,6 +2623,15 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			}
 
 			/*
+			 * Shrink each node in the zonelist once. If the
+			 * zonelist is ordered by zone (not the default) then a
+			 * node may be shrunk multiple times but in that case
+			 * the user prefers lower zones being preserved.
+			 */
+			if (zone->zone_pgdat == last_pgdat)
+				continue;
+
+			/*
 			 * This steals pages from memory cgroups over softlimit
 			 * and returns the number of reclaimed pages and
 			 * scanned pages. This works for global memory pressure
@@ -2644,6 +2646,10 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
+		/* See comment about same check for global reclaim above */
+		if (zone->zone_pgdat == last_pgdat)
+			continue;
+		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc, classzone_idx);
 	}
 
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Reclaim may stall if there is too much dirty or congested data on a node.
This was previously based on zone flags and the logic for clearing the
flags is in two places.  As congestion/dirty tracking is now tracked on a
per-node basis, we can remove some duplicate logic.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmscan.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 01fe4708e404..8b39b903bd14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3008,7 +3008,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 {
 	unsigned long mark = high_wmark_pages(zone);
 
-	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
+	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
+		return false;
+
+	/*
+	 * If any eligible zone is balanced then the node is not considered
+	 * to be congested or dirty
+	 */
+	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+
+	return true;
 }
 
 /*
@@ -3154,13 +3164,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			if (!zone_balanced(zone, order, 0)) {
 				classzone_idx = i;
 				break;
-			} else {
-				/*
-				 * If any eligible zone is balanced then the
-				 * node is not considered congested or dirty.
-				 */
-				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
-				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 			}
 		}
 
@@ -3219,11 +3222,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, classzone_idx)) {
-				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-				clear_bit(PGDAT_DIRTY, &pgdat->flags);
+			if (zone_balanced(zone, sc.order, classzone_idx))
 				goto out;
-			}
 		}
 
 		/*
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Reclaim may stall if there is too much dirty or congested data on a node.
This was previously based on zone flags and the logic for clearing the
flags is in two places.  As congestion/dirty tracking is now tracked on a
per-node basis, we can remove some duplicate logic.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmscan.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 01fe4708e404..8b39b903bd14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3008,7 +3008,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 {
 	unsigned long mark = high_wmark_pages(zone);
 
-	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
+	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
+		return false;
+
+	/*
+	 * If any eligible zone is balanced then the node is not considered
+	 * to be congested or dirty
+	 */
+	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+
+	return true;
 }
 
 /*
@@ -3154,13 +3164,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			if (!zone_balanced(zone, order, 0)) {
 				classzone_idx = i;
 				break;
-			} else {
-				/*
-				 * If any eligible zone is balanced then the
-				 * node is not considered congested or dirty.
-				 */
-				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
-				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 			}
 		}
 
@@ -3219,11 +3222,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, classzone_idx)) {
-				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-				clear_bit(PGDAT_DIRTY, &pgdat->flags);
+			if (zone_balanced(zone, sc.order, classzone_idx))
 				goto out;
-			}
 		}
 
 		/*
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

kswapd scans from highest to lowest for a zone that requires balancing.
This was necessary when reclaim was per-zone to fairly age pages on lower
zones.  Now that we are reclaiming on a per-node basis, any eligible zone
can be used and pages will still be aged fairly.  This patch avoids
reclaiming excessively unless buffer_heads are over the limit and it's
necessary to reclaim from a higher zone than requested by the waker of
kswapd to relieve low memory pressure.

[hillf.zj@alibaba-inc.com: Force kswapd reclaim no more than needed]
Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 59 +++++++++++++++++++++++++++--------------------------------
 1 file changed, 27 insertions(+), 32 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b39b903bd14..b7a276f4b1b0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3144,31 +3144,39 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		sc.nr_reclaimed = 0;
 
-		/* Scan from the highest requested zone to dma */
-		for (i = classzone_idx; i >= 0; i--) {
-			zone = pgdat->node_zones + i;
-			if (!populated_zone(zone))
-				continue;
-
-			/*
-			 * If the number of buffer_heads in the machine
-			 * exceeds the maximum allowed level and this node
-			 * has a highmem zone, force kswapd to reclaim from
-			 * it to relieve lowmem pressure.
-			 */
-			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				classzone_idx = i;
-				break;
-			}
+		/*
+		 * If the number of buffer_heads in the machine exceeds the
+		 * maximum allowed level then reclaim from all zones. This is
+		 * not specific to highmem as highmem may not exist but it is
+		 * it is expected that buffer_heads are stripped in writeback.
+		 */
+		if (buffer_heads_over_limit) {
+			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
+				zone = pgdat->node_zones + i;
+				if (!populated_zone(zone))
+					continue;
 
-			if (!zone_balanced(zone, order, 0)) {
 				classzone_idx = i;
 				break;
 			}
 		}
 
-		if (i < 0)
-			goto out;
+		/*
+		 * Only reclaim if there are no eligible zones. Check from
+		 * high to low zone as allocations prefer higher zones.
+		 * Scanning from low to high zone would allow congestion to be
+		 * cleared during a very small window when a small low
+		 * zone was balanced even under extreme pressure when the
+		 * overall node may be congested.
+		 */
+		for (i = classzone_idx; i >= 0; i--) {
+			zone = pgdat->node_zones + i;
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone_balanced(zone, sc.order, classzone_idx))
+				goto out;
+		}
 
 		/*
 		 * Do some background aging of the anon list, to give
@@ -3214,19 +3222,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			break;
 
 		/*
-		 * Stop reclaiming if any eligible zone is balanced and clear
-		 * node writeback or congested.
-		 */
-		for (i = 0; i <= classzone_idx; i++) {
-			zone = pgdat->node_zones + i;
-			if (!populated_zone(zone))
-				continue;
-
-			if (zone_balanced(zone, sc.order, classzone_idx))
-				goto out;
-		}
-
-		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

kswapd scans from highest to lowest for a zone that requires balancing.
This was necessary when reclaim was per-zone to fairly age pages on lower
zones.  Now that we are reclaiming on a per-node basis, any eligible zone
can be used and pages will still be aged fairly.  This patch avoids
reclaiming excessively unless buffer_heads are over the limit and it's
necessary to reclaim from a higher zone than requested by the waker of
kswapd to relieve low memory pressure.

[hillf.zj@alibaba-inc.com: Force kswapd reclaim no more than needed]
Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 59 +++++++++++++++++++++++++++--------------------------------
 1 file changed, 27 insertions(+), 32 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b39b903bd14..b7a276f4b1b0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3144,31 +3144,39 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		sc.nr_reclaimed = 0;
 
-		/* Scan from the highest requested zone to dma */
-		for (i = classzone_idx; i >= 0; i--) {
-			zone = pgdat->node_zones + i;
-			if (!populated_zone(zone))
-				continue;
-
-			/*
-			 * If the number of buffer_heads in the machine
-			 * exceeds the maximum allowed level and this node
-			 * has a highmem zone, force kswapd to reclaim from
-			 * it to relieve lowmem pressure.
-			 */
-			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				classzone_idx = i;
-				break;
-			}
+		/*
+		 * If the number of buffer_heads in the machine exceeds the
+		 * maximum allowed level then reclaim from all zones. This is
+		 * not specific to highmem as highmem may not exist but it is
+		 * it is expected that buffer_heads are stripped in writeback.
+		 */
+		if (buffer_heads_over_limit) {
+			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
+				zone = pgdat->node_zones + i;
+				if (!populated_zone(zone))
+					continue;
 
-			if (!zone_balanced(zone, order, 0)) {
 				classzone_idx = i;
 				break;
 			}
 		}
 
-		if (i < 0)
-			goto out;
+		/*
+		 * Only reclaim if there are no eligible zones. Check from
+		 * high to low zone as allocations prefer higher zones.
+		 * Scanning from low to high zone would allow congestion to be
+		 * cleared during a very small window when a small low
+		 * zone was balanced even under extreme pressure when the
+		 * overall node may be congested.
+		 */
+		for (i = classzone_idx; i >= 0; i--) {
+			zone = pgdat->node_zones + i;
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone_balanced(zone, sc.order, classzone_idx))
+				goto out;
+		}
 
 		/*
 		 * Do some background aging of the anon list, to give
@@ -3214,19 +3222,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			break;
 
 		/*
-		 * Stop reclaiming if any eligible zone is balanced and clear
-		 * node writeback or congested.
-		 */
-		for (i = 0; i <= classzone_idx; i++) {
-			zone = pgdat->node_zones + i;
-			if (!populated_zone(zone))
-				continue;
-
-			if (zone_balanced(zone, sc.order, classzone_idx))
-				goto out;
-		}
-
-		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 13/34] mm, vmscan: make shrink_node decisions more node-centric
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Earlier patches focused on having direct reclaim and kswapd use data that
is node-centric for reclaiming but shrink_node() itself still uses too
much zone information.  This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not.  Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/memcontrol.h | 19 ++++++++-------
 include/linux/mmzone.h     |  4 ++--
 include/linux/swap.h       |  2 +-
 mm/memcontrol.c            |  4 ++--
 mm/page_alloc.c            |  2 +-
 mm/vmscan.c                | 59 ++++++++++++++++++++++++++--------------------
 mm/workingset.c            |  6 ++---
 7 files changed, 52 insertions(+), 44 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 68f1121c8fe7..c13227d018f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -325,22 +325,23 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
 }
 
 /**
- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
+ * @node: node of the wanted lruvec
  * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
- * Returns the lru list vector holding pages for the given @zone and
- * @mem.  This can be the global zone lruvec, if the memory controller
+ * Returns the lru list vector holding pages for a given @node or a given
+ * @memcg and @zone. This can be the node lruvec, if the memory controller
  * is disabled.
  */
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-						    struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+				struct zone *zone, struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = zone_lruvec(zone);
+		lruvec = node_lruvec(pgdat);
 		goto out;
 	}
 
@@ -610,10 +611,10 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 {
 }
 
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-						    struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+				struct zone *zone, struct mem_cgroup *memcg)
 {
-	return zone_lruvec(zone);
+	return node_lruvec(pgdat);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4062fa74526f..895c365e3259 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -739,9 +739,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
 	return &zone->zone_pgdat->lru_lock;
 }
 
-static inline struct lruvec *zone_lruvec(struct zone *zone)
+static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
-	return &zone->zone_pgdat->lruvec;
+	return &pgdat->lruvec;
 }
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 916e2eddecd6..0ad616d7c381 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,7 +316,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
 						  bool may_swap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
 						unsigned long *nr_scanned);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50c86ad121bc..c9ebec98e92a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1432,8 +1432,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 			}
 			continue;
 		}
-		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-						     zone, &nr_scanned);
+		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+					zone, &nr_scanned);
 		*total_scanned += nr_scanned;
 		if (!soft_limit_excess(root_memcg))
 			break;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d25dc24f65f2..8215c51d5b23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5954,6 +5954,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 #endif
 	pgdat_page_ext_init(pgdat);
 	spin_lock_init(&pgdat->lru_lock);
+	lruvec_init(node_lruvec(pgdat));
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
@@ -6016,7 +6017,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		/* For bootup, initialized properly in watermark setup */
 		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
 
-		lruvec_init(zone_lruvec(zone));
 		if (!size)
 			continue;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7a276f4b1b0..f0bea68b8780 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2224,12 +2224,13 @@ static inline void init_tlb_ubc(void)
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 /*
- * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
+ * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_zone_memcg(struct zone *zone, struct mem_cgroup *memcg,
+static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
 			      struct scan_control *sc, unsigned long *lru_pages)
 {
-	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	struct zone *zone = &pgdat->node_zones[sc->reclaim_idx];
+	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long targets[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -2362,13 +2363,14 @@ static bool in_reclaim_compaction(struct scan_control *sc)
  * calls try_to_compact_zone() that it will have enough free pages to succeed.
  * It will give up earlier than that if there is difficulty reclaiming pages.
  */
-static inline bool should_continue_reclaim(struct zone *zone,
+static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 					unsigned long nr_reclaimed,
 					unsigned long nr_scanned,
 					struct scan_control *sc)
 {
 	unsigned long pages_for_compaction;
 	unsigned long inactive_lru_pages;
+	int z;
 
 	/* If not in reclaim/compaction mode, stop */
 	if (!in_reclaim_compaction(sc))
@@ -2402,21 +2404,27 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
+	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
+		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
-	switch (compaction_suitable(zone, sc->order, 0, 0)) {
-	case COMPACT_PARTIAL:
-	case COMPACT_CONTINUE:
-		return false;
-	default:
-		return true;
+	for (z = 0; z <= sc->reclaim_idx; z++) {
+		struct zone *zone = &pgdat->node_zones[z];
+
+		switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
+		case COMPACT_PARTIAL:
+		case COMPACT_CONTINUE:
+			return false;
+		default:
+			/* check next zone */
+			;
+		}
 	}
+	return true;
 }
 
 static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
@@ -2425,15 +2433,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
-	struct zone *zone = &pgdat->node_zones[classzone_idx];
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
 		struct mem_cgroup_reclaim_cookie reclaim = {
-			.zone = zone,
+			.zone = &pgdat->node_zones[classzone_idx],
 			.priority = sc->priority,
 		};
-		unsigned long zone_lru_pages = 0;
+		unsigned long node_lru_pages = 0;
 		struct mem_cgroup *memcg;
 
 		nr_reclaimed = sc->nr_reclaimed;
@@ -2454,11 +2461,11 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 			reclaimed = sc->nr_reclaimed;
 			scanned = sc->nr_scanned;
 
-			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
-			zone_lru_pages += lru_pages;
+			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
+			node_lru_pages += lru_pages;
 
 			if (!global_reclaim(sc))
-				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
+				shrink_slab(sc->gfp_mask, pgdat->node_id,
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
 
@@ -2470,7 +2477,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 			/*
 			 * Direct reclaim and kswapd have to scan all memory
 			 * cgroups to fulfill the overall scan target for the
-			 * zone.
+			 * node.
 			 *
 			 * Limit reclaim, on the other hand, only cares about
 			 * nr_to_reclaim pages to be reclaimed and it will
@@ -2489,9 +2496,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 		 * the eligible LRU pages were scanned.
 		 */
 		if (global_reclaim(sc))
-			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
+			shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
 				    sc->nr_scanned - nr_scanned,
-				    zone_lru_pages);
+				    node_lru_pages);
 
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2506,7 +2513,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 		if (sc->nr_reclaimed - nr_reclaimed)
 			reclaimable = true;
 
-	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
+	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
 	return reclaimable;
@@ -2906,7 +2913,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 #ifdef CONFIG_MEMCG
 
-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
+unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
 						unsigned long *nr_scanned)
@@ -2931,11 +2938,11 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 	/*
 	 * NOTE: Although we can get the priority field, using it
 	 * here is not a good idea, since it limits the pages we can scan.
-	 * if we don't reclaim here, the shrink_zone from balance_pgdat
+	 * if we don't reclaim here, the shrink_node from balance_pgdat
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone_memcg(zone, memcg, &sc, &lru_pages);
+	shrink_node_memcg(zone->zone_pgdat, memcg, &sc, &lru_pages);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2994,7 +3001,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
 
 		if (inactive_list_is_low(lruvec, false))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
diff --git a/mm/workingset.c b/mm/workingset.c
index ebe14445809a..de68ad681585 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -218,7 +218,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
 	return pack_shadow(memcgid, zone, eviction);
 }
@@ -267,7 +267,7 @@ bool workingset_refault(void *shadow)
 		rcu_read_unlock();
 		return false;
 	}
-	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 	rcu_read_unlock();
@@ -319,7 +319,7 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), page_zone(page), memcg);
 	atomic_long_inc(&lruvec->inactive_age);
 out:
 	rcu_read_unlock();
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 13/34] mm, vmscan: make shrink_node decisions more node-centric
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Earlier patches focused on having direct reclaim and kswapd use data that
is node-centric for reclaiming but shrink_node() itself still uses too
much zone information.  This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not.  Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/memcontrol.h | 19 ++++++++-------
 include/linux/mmzone.h     |  4 ++--
 include/linux/swap.h       |  2 +-
 mm/memcontrol.c            |  4 ++--
 mm/page_alloc.c            |  2 +-
 mm/vmscan.c                | 59 ++++++++++++++++++++++++++--------------------
 mm/workingset.c            |  6 ++---
 7 files changed, 52 insertions(+), 44 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 68f1121c8fe7..c13227d018f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -325,22 +325,23 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
 }
 
 /**
- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
+ * @node: node of the wanted lruvec
  * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
- * Returns the lru list vector holding pages for the given @zone and
- * @mem.  This can be the global zone lruvec, if the memory controller
+ * Returns the lru list vector holding pages for a given @node or a given
+ * @memcg and @zone. This can be the node lruvec, if the memory controller
  * is disabled.
  */
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-						    struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+				struct zone *zone, struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = zone_lruvec(zone);
+		lruvec = node_lruvec(pgdat);
 		goto out;
 	}
 
@@ -610,10 +611,10 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 {
 }
 
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-						    struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+				struct zone *zone, struct mem_cgroup *memcg)
 {
-	return zone_lruvec(zone);
+	return node_lruvec(pgdat);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4062fa74526f..895c365e3259 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -739,9 +739,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
 	return &zone->zone_pgdat->lru_lock;
 }
 
-static inline struct lruvec *zone_lruvec(struct zone *zone)
+static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
-	return &zone->zone_pgdat->lruvec;
+	return &pgdat->lruvec;
 }
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 916e2eddecd6..0ad616d7c381 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,7 +316,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
 						  bool may_swap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
 						unsigned long *nr_scanned);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50c86ad121bc..c9ebec98e92a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1432,8 +1432,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 			}
 			continue;
 		}
-		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-						     zone, &nr_scanned);
+		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+					zone, &nr_scanned);
 		*total_scanned += nr_scanned;
 		if (!soft_limit_excess(root_memcg))
 			break;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d25dc24f65f2..8215c51d5b23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5954,6 +5954,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 #endif
 	pgdat_page_ext_init(pgdat);
 	spin_lock_init(&pgdat->lru_lock);
+	lruvec_init(node_lruvec(pgdat));
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
@@ -6016,7 +6017,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		/* For bootup, initialized properly in watermark setup */
 		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
 
-		lruvec_init(zone_lruvec(zone));
 		if (!size)
 			continue;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7a276f4b1b0..f0bea68b8780 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2224,12 +2224,13 @@ static inline void init_tlb_ubc(void)
 #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 
 /*
- * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
+ * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_zone_memcg(struct zone *zone, struct mem_cgroup *memcg,
+static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
 			      struct scan_control *sc, unsigned long *lru_pages)
 {
-	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	struct zone *zone = &pgdat->node_zones[sc->reclaim_idx];
+	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long targets[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -2362,13 +2363,14 @@ static bool in_reclaim_compaction(struct scan_control *sc)
  * calls try_to_compact_zone() that it will have enough free pages to succeed.
  * It will give up earlier than that if there is difficulty reclaiming pages.
  */
-static inline bool should_continue_reclaim(struct zone *zone,
+static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 					unsigned long nr_reclaimed,
 					unsigned long nr_scanned,
 					struct scan_control *sc)
 {
 	unsigned long pages_for_compaction;
 	unsigned long inactive_lru_pages;
+	int z;
 
 	/* If not in reclaim/compaction mode, stop */
 	if (!in_reclaim_compaction(sc))
@@ -2402,21 +2404,27 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
+	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
+		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
-	switch (compaction_suitable(zone, sc->order, 0, 0)) {
-	case COMPACT_PARTIAL:
-	case COMPACT_CONTINUE:
-		return false;
-	default:
-		return true;
+	for (z = 0; z <= sc->reclaim_idx; z++) {
+		struct zone *zone = &pgdat->node_zones[z];
+
+		switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
+		case COMPACT_PARTIAL:
+		case COMPACT_CONTINUE:
+			return false;
+		default:
+			/* check next zone */
+			;
+		}
 	}
+	return true;
 }
 
 static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
@@ -2425,15 +2433,14 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
-	struct zone *zone = &pgdat->node_zones[classzone_idx];
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
 		struct mem_cgroup_reclaim_cookie reclaim = {
-			.zone = zone,
+			.zone = &pgdat->node_zones[classzone_idx],
 			.priority = sc->priority,
 		};
-		unsigned long zone_lru_pages = 0;
+		unsigned long node_lru_pages = 0;
 		struct mem_cgroup *memcg;
 
 		nr_reclaimed = sc->nr_reclaimed;
@@ -2454,11 +2461,11 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 			reclaimed = sc->nr_reclaimed;
 			scanned = sc->nr_scanned;
 
-			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
-			zone_lru_pages += lru_pages;
+			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
+			node_lru_pages += lru_pages;
 
 			if (!global_reclaim(sc))
-				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
+				shrink_slab(sc->gfp_mask, pgdat->node_id,
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
 
@@ -2470,7 +2477,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 			/*
 			 * Direct reclaim and kswapd have to scan all memory
 			 * cgroups to fulfill the overall scan target for the
-			 * zone.
+			 * node.
 			 *
 			 * Limit reclaim, on the other hand, only cares about
 			 * nr_to_reclaim pages to be reclaimed and it will
@@ -2489,9 +2496,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 		 * the eligible LRU pages were scanned.
 		 */
 		if (global_reclaim(sc))
-			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
+			shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
 				    sc->nr_scanned - nr_scanned,
-				    zone_lru_pages);
+				    node_lru_pages);
 
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2506,7 +2513,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 		if (sc->nr_reclaimed - nr_reclaimed)
 			reclaimable = true;
 
-	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
+	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
 	return reclaimable;
@@ -2906,7 +2913,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 #ifdef CONFIG_MEMCG
 
-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
+unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
 						unsigned long *nr_scanned)
@@ -2931,11 +2938,11 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 	/*
 	 * NOTE: Although we can get the priority field, using it
 	 * here is not a good idea, since it limits the pages we can scan.
-	 * if we don't reclaim here, the shrink_zone from balance_pgdat
+	 * if we don't reclaim here, the shrink_node from balance_pgdat
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone_memcg(zone, memcg, &sc, &lru_pages);
+	shrink_node_memcg(zone->zone_pgdat, memcg, &sc, &lru_pages);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2994,7 +3001,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
 
 		if (inactive_list_is_low(lruvec, false))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
diff --git a/mm/workingset.c b/mm/workingset.c
index ebe14445809a..de68ad681585 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -218,7 +218,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
 	return pack_shadow(memcgid, zone, eviction);
 }
@@ -267,7 +267,7 @@ bool workingset_refault(void *shadow)
 		rcu_read_unlock();
 		return false;
 	}
-	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 	rcu_read_unlock();
@@ -319,7 +319,7 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), page_zone(page), memcg);
 	atomic_long_inc(&lruvec->inactive_age);
 out:
 	rcu_read_unlock();
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 14/34] mm, memcg: move memcg limit enforcement from zones to nodes
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Memcg needs adjustment after moving LRUs to the node. Limits are tracked
per memcg but the soft-limit excess is tracked per zone. As global page
reclaim is based on the node, it is easy to imagine a situation where
a zone soft limit is exceeded even though the memcg limit is fine.

This patch moves the soft limit tree the node.  Technically, all the variable
names should also change but people are already familiar by the meaning of
"mz" even if "mn" would be a more appropriate name now.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h |  38 ++++-----
 include/linux/swap.h       |   2 +-
 mm/memcontrol.c            | 190 ++++++++++++++++++++-------------------------
 mm/vmscan.c                |  19 +++--
 mm/workingset.c            |   6 +-
 5 files changed, 111 insertions(+), 144 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c13227d018f2..80bf8458148a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -61,7 +61,7 @@ enum mem_cgroup_stat_index {
 };
 
 struct mem_cgroup_reclaim_cookie {
-	struct zone *zone;
+	pg_data_t *pgdat;
 	int priority;
 	unsigned int generation;
 };
@@ -119,7 +119,7 @@ struct mem_cgroup_reclaim_iter {
 /*
  * per-zone information in memory controller.
  */
-struct mem_cgroup_per_zone {
+struct mem_cgroup_per_node {
 	struct lruvec		lruvec;
 	unsigned long		lru_size[NR_LRU_LISTS];
 
@@ -133,10 +133,6 @@ struct mem_cgroup_per_zone {
 						/* use container_of	   */
 };
 
-struct mem_cgroup_per_node {
-	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
-};
-
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
 	unsigned long threshold;
@@ -315,19 +311,15 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
-static inline struct mem_cgroup_per_zone *
-mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
+static struct mem_cgroup_per_node *
+mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
 {
-	int nid = zone_to_nid(zone);
-	int zid = zone_idx(zone);
-
-	return &memcg->nodeinfo[nid]->zoneinfo[zid];
+	return memcg->nodeinfo[nid];
 }
 
 /**
  * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
  * @node: node of the wanted lruvec
- * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
  * Returns the lru list vector holding pages for a given @node or a given
@@ -335,9 +327,9 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
  * is disabled.
  */
 static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-				struct zone *zone, struct mem_cgroup *memcg)
+				struct mem_cgroup *memcg)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
@@ -345,7 +337,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 		goto out;
 	}
 
-	mz = mem_cgroup_zone_zoneinfo(memcg, zone);
+	mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
 	lruvec = &mz->lruvec;
 out:
 	/*
@@ -353,8 +345,8 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 	 * we have to be prepared to initialize lruvec->pgdat here;
 	 * and if offlined then reonlined, we need to reinitialize it.
 	 */
-	if (unlikely(lruvec->pgdat != zone->zone_pgdat))
-		lruvec->pgdat = zone->zone_pgdat;
+	if (unlikely(lruvec->pgdat != pgdat))
+		lruvec->pgdat = pgdat;
 	return lruvec;
 }
 
@@ -447,9 +439,9 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 static inline
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 
-	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	return mz->lru_size[lru];
 }
 
@@ -520,7 +512,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
 
@@ -612,7 +604,7 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 }
 
 static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-				struct zone *zone, struct mem_cgroup *memcg)
+				struct mem_cgroup *memcg)
 {
 	return node_lruvec(pgdat);
 }
@@ -724,7 +716,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
 					    unsigned long *total_scanned)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ad616d7c381..2a23ddc96edd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,7 +318,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  bool may_swap);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
-						struct zone *zone,
+						pg_data_t *pgdat,
 						unsigned long *nr_scanned);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c9ebec98e92a..9cbd40ebccd1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -132,15 +132,11 @@ static const char * const mem_cgroup_lru_names[] = {
  * their hierarchy representation
  */
 
-struct mem_cgroup_tree_per_zone {
+struct mem_cgroup_tree_per_node {
 	struct rb_root rb_root;
 	spinlock_t lock;
 };
 
-struct mem_cgroup_tree_per_node {
-	struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
 struct mem_cgroup_tree {
 	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
 };
@@ -374,37 +370,35 @@ ino_t page_cgroup_ino(struct page *page)
 	return ino;
 }
 
-static struct mem_cgroup_per_zone *
-mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
+static struct mem_cgroup_per_node *
+mem_cgroup_page_nodeinfo(struct mem_cgroup *memcg, struct page *page)
 {
 	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
 
-	return &memcg->nodeinfo[nid]->zoneinfo[zid];
+	return memcg->nodeinfo[nid];
 }
 
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
+static struct mem_cgroup_tree_per_node *
+soft_limit_tree_node(int nid)
 {
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+	return soft_limit_tree.rb_tree_per_node[nid];
 }
 
-static struct mem_cgroup_tree_per_zone *
+static struct mem_cgroup_tree_per_node *
 soft_limit_tree_from_page(struct page *page)
 {
 	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
 
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+	return soft_limit_tree.rb_tree_per_node[nid];
 }
 
-static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
-					 struct mem_cgroup_tree_per_zone *mctz,
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
+					 struct mem_cgroup_tree_per_node *mctz,
 					 unsigned long new_usage_in_excess)
 {
 	struct rb_node **p = &mctz->rb_root.rb_node;
 	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_zone *mz_node;
+	struct mem_cgroup_per_node *mz_node;
 
 	if (mz->on_tree)
 		return;
@@ -414,7 +408,7 @@ static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
 		return;
 	while (*p) {
 		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
+		mz_node = rb_entry(parent, struct mem_cgroup_per_node,
 					tree_node);
 		if (mz->usage_in_excess < mz_node->usage_in_excess)
 			p = &(*p)->rb_left;
@@ -430,8 +424,8 @@ static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
 	mz->on_tree = true;
 }
 
-static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
-					 struct mem_cgroup_tree_per_zone *mctz)
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+					 struct mem_cgroup_tree_per_node *mctz)
 {
 	if (!mz->on_tree)
 		return;
@@ -439,8 +433,8 @@ static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 	mz->on_tree = false;
 }
 
-static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
-				       struct mem_cgroup_tree_per_zone *mctz)
+static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+				       struct mem_cgroup_tree_per_node *mctz)
 {
 	unsigned long flags;
 
@@ -464,8 +458,8 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
 static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 {
 	unsigned long excess;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
+	struct mem_cgroup_per_node *mz;
+	struct mem_cgroup_tree_per_node *mctz;
 
 	mctz = soft_limit_tree_from_page(page);
 	/*
@@ -473,7 +467,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	 * because their event counter is not touched.
 	 */
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-		mz = mem_cgroup_page_zoneinfo(memcg, page);
+		mz = mem_cgroup_page_nodeinfo(memcg, page);
 		excess = soft_limit_excess(memcg);
 		/*
 		 * We have to update the tree if mz is on RB-tree or
@@ -498,24 +492,22 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 
 static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup_tree_per_zone *mctz;
-	struct mem_cgroup_per_zone *mz;
-	int nid, zid;
+	struct mem_cgroup_tree_per_node *mctz;
+	struct mem_cgroup_per_node *mz;
+	int nid;
 
 	for_each_node(nid) {
-		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-			mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-			mctz = soft_limit_tree_node_zone(nid, zid);
-			mem_cgroup_remove_exceeded(mz, mctz);
-		}
+		mz = mem_cgroup_nodeinfo(memcg, nid);
+		mctz = soft_limit_tree_node(nid);
+		mem_cgroup_remove_exceeded(mz, mctz);
 	}
 }
 
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+static struct mem_cgroup_per_node *
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 {
 	struct rb_node *rightmost = NULL;
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 
 retry:
 	mz = NULL;
@@ -523,7 +515,7 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	if (!rightmost)
 		goto done;		/* Nothing to reclaim from */
 
-	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
+	mz = rb_entry(rightmost, struct mem_cgroup_per_node, tree_node);
 	/*
 	 * Remove the node now but someone else can add it back,
 	 * we will to add it back at the end of reclaim to its correct
@@ -537,10 +529,10 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+static struct mem_cgroup_per_node *
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 
 	spin_lock_irq(&mctz->lock);
 	mz = __mem_cgroup_largest_soft_limit_node(mctz);
@@ -634,20 +626,16 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 					   int nid, unsigned int lru_mask)
 {
 	unsigned long nr = 0;
-	int zid;
+	struct mem_cgroup_per_node *mz;
+	enum lru_list lru;
 
 	VM_BUG_ON((unsigned)nid >= nr_node_ids);
 
-	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-		struct mem_cgroup_per_zone *mz;
-		enum lru_list lru;
-
-		for_each_lru(lru) {
-			if (!(BIT(lru) & lru_mask))
-				continue;
-			mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-			nr += mz->lru_size[lru];
-		}
+	for_each_lru(lru) {
+		if (!(BIT(lru) & lru_mask))
+			continue;
+		mz = mem_cgroup_nodeinfo(memcg, nid);
+		nr += mz->lru_size[lru];
 	}
 	return nr;
 }
@@ -800,9 +788,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	rcu_read_lock();
 
 	if (reclaim) {
-		struct mem_cgroup_per_zone *mz;
+		struct mem_cgroup_per_node *mz;
 
-		mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone);
+		mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
 		iter = &mz->iter[reclaim->priority];
 
 		if (prev && reclaim->generation != iter->generation)
@@ -901,19 +889,17 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
 {
 	struct mem_cgroup *memcg = dead_memcg;
 	struct mem_cgroup_reclaim_iter *iter;
-	struct mem_cgroup_per_zone *mz;
-	int nid, zid;
+	struct mem_cgroup_per_node *mz;
+	int nid;
 	int i;
 
 	while ((memcg = parent_mem_cgroup(memcg))) {
 		for_each_node(nid) {
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-				mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-				for (i = 0; i <= DEF_PRIORITY; i++) {
-					iter = &mz->iter[i];
-					cmpxchg(&iter->position,
-						dead_memcg, NULL);
-				}
+			mz = mem_cgroup_nodeinfo(memcg, nid);
+			for (i = 0; i <= DEF_PRIORITY; i++) {
+				iter = &mz->iter[i];
+				cmpxchg(&iter->position,
+					dead_memcg, NULL);
 			}
 		}
 	}
@@ -945,7 +931,7 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  */
 struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
@@ -962,7 +948,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	if (!memcg)
 		memcg = root_mem_cgroup;
 
-	mz = mem_cgroup_page_zoneinfo(memcg, page);
+	mz = mem_cgroup_page_nodeinfo(memcg, page);
 	lruvec = &mz->lruvec;
 out:
 	/*
@@ -989,7 +975,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 				enum zone_type zid, int nr_pages)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 	unsigned long *lru_size;
 	long size;
 	bool empty;
@@ -999,7 +985,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 	if (mem_cgroup_disabled())
 		return;
 
-	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	lru_size = mz->lru_size + lru;
 	empty = list_empty(lruvec->lists + lru);
 
@@ -1392,7 +1378,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 #endif
 
 static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
-				   struct zone *zone,
+				   pg_data_t *pgdat,
 				   gfp_t gfp_mask,
 				   unsigned long *total_scanned)
 {
@@ -1402,7 +1388,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 	unsigned long excess;
 	unsigned long nr_scanned;
 	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
+		.pgdat = pgdat,
 		.priority = 0,
 	};
 
@@ -1433,7 +1419,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 			continue;
 		}
 		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
-					zone, &nr_scanned);
+					pgdat, &nr_scanned);
 		*total_scanned += nr_scanned;
 		if (!soft_limit_excess(root_memcg))
 			break;
@@ -2560,22 +2546,22 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
 					    unsigned long *total_scanned)
 {
 	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
+	struct mem_cgroup_per_node *mz, *next_mz = NULL;
 	unsigned long reclaimed;
 	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
+	struct mem_cgroup_tree_per_node *mctz;
 	unsigned long excess;
 	unsigned long nr_scanned;
 
 	if (order > 0)
 		return 0;
 
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
+	mctz = soft_limit_tree_node(pgdat->node_id);
 	/*
 	 * This loop can run a while, specially if mem_cgroup's continuously
 	 * keep exceeding their soft limit and putting the system under
@@ -2590,7 +2576,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 			break;
 
 		nr_scanned = 0;
-		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone,
+		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
 						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
@@ -3211,22 +3197,21 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 
 #ifdef CONFIG_DEBUG_VM
 	{
-		int nid, zid;
-		struct mem_cgroup_per_zone *mz;
+		pg_data_t *pgdat;
+		struct mem_cgroup_per_node *mz;
 		struct zone_reclaim_stat *rstat;
 		unsigned long recent_rotated[2] = {0, 0};
 		unsigned long recent_scanned[2] = {0, 0};
 
-		for_each_online_node(nid)
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-				mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-				rstat = &mz->lruvec.reclaim_stat;
+		for_each_online_pgdat(pgdat) {
+			mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+			rstat = &mz->lruvec.reclaim_stat;
 
-				recent_rotated[0] += rstat->recent_rotated[0];
-				recent_rotated[1] += rstat->recent_rotated[1];
-				recent_scanned[0] += rstat->recent_scanned[0];
-				recent_scanned[1] += rstat->recent_scanned[1];
-			}
+			recent_rotated[0] += rstat->recent_rotated[0];
+			recent_rotated[1] += rstat->recent_rotated[1];
+			recent_scanned[0] += rstat->recent_scanned[0];
+			recent_scanned[1] += rstat->recent_scanned[1];
+		}
 		seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
 		seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
 		seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
@@ -4106,11 +4091,10 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 	return idr_find(&mem_cgroup_idr, id);
 }
 
-static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
+static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
-	struct mem_cgroup_per_zone *mz;
-	int zone, tmp = node;
+	int tmp = node;
 	/*
 	 * This routine is called against possible nodes.
 	 * But it's BUG to call kmalloc() against offline node.
@@ -4125,18 +4109,16 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-		mz = &pn->zoneinfo[zone];
-		lruvec_init(&mz->lruvec);
-		mz->usage_in_excess = 0;
-		mz->on_tree = false;
-		mz->memcg = memcg;
-	}
+	lruvec_init(&pn->lruvec);
+	pn->usage_in_excess = 0;
+	pn->on_tree = false;
+	pn->memcg = memcg;
+
 	memcg->nodeinfo[node] = pn;
 	return 0;
 }
 
-static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
+static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	kfree(memcg->nodeinfo[node]);
 }
@@ -4147,7 +4129,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 
 	memcg_wb_domain_exit(memcg);
 	for_each_node(node)
-		free_mem_cgroup_per_zone_info(memcg, node);
+		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->stat);
 	kfree(memcg);
 }
@@ -4176,7 +4158,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 
 	for_each_node(node)
-		if (alloc_mem_cgroup_per_zone_info(memcg, node))
+		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
 
 	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
@@ -5779,18 +5761,12 @@ static int __init mem_cgroup_init(void)
 
 	for_each_node(node) {
 		struct mem_cgroup_tree_per_node *rtpn;
-		int zone;
 
 		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
 				    node_online(node) ? node : NUMA_NO_NODE);
 
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			struct mem_cgroup_tree_per_zone *rtpz;
-
-			rtpz = &rtpn->rb_tree_per_zone[zone];
-			rtpz->rb_root = RB_ROOT;
-			spin_lock_init(&rtpz->lock);
-		}
+		rtpn->rb_root = RB_ROOT;
+		spin_lock_init(&rtpn->lock);
 		soft_limit_tree.rb_tree_per_node[node] = rtpn;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f0bea68b8780..8d2555dd3ef3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2229,8 +2229,7 @@ static inline void init_tlb_ubc(void)
 static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
 			      struct scan_control *sc, unsigned long *lru_pages)
 {
-	struct zone *zone = &pgdat->node_zones[sc->reclaim_idx];
-	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
+	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long targets[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -2437,7 +2436,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
 		struct mem_cgroup_reclaim_cookie reclaim = {
-			.zone = &pgdat->node_zones[classzone_idx],
+			.pgdat = pgdat,
 			.priority = sc->priority,
 		};
 		unsigned long node_lru_pages = 0;
@@ -2645,7 +2644,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			 * and balancing, not for a memcg's limit.
 			 */
 			nr_soft_scanned = 0;
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
+			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
 						sc->order, sc->gfp_mask,
 						&nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
@@ -2915,7 +2914,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 						gfp_t gfp_mask, bool noswap,
-						struct zone *zone,
+						pg_data_t *pgdat,
 						unsigned long *nr_scanned)
 {
 	struct scan_control sc = {
@@ -2942,7 +2941,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_node_memcg(zone->zone_pgdat, memcg, &sc, &lru_pages);
+	shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2992,7 +2991,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 #endif
 
 static void age_active_anon(struct pglist_data *pgdat,
-				struct zone *zone, struct scan_control *sc)
+				struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
@@ -3001,7 +3000,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
+		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 
 		if (inactive_list_is_low(lruvec, false))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
@@ -3191,7 +3190,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * pages are rotated regardless of classzone as this is
 		 * about consistent aging.
 		 */
-		age_active_anon(pgdat, &pgdat->node_zones[MAX_NR_ZONES - 1], &sc);
+		age_active_anon(pgdat, &sc);
 
 		/*
 		 * If we're getting trouble reclaiming, start doing writepage
@@ -3203,7 +3202,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		/* Call soft limit reclaim before calling shrink_node. */
 		sc.nr_scanned = 0;
 		nr_soft_scanned = 0;
-		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc.order,
+		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
 						sc.gfp_mask, &nr_soft_scanned);
 		sc.nr_reclaimed += nr_soft_reclaimed;
 
diff --git a/mm/workingset.c b/mm/workingset.c
index de68ad681585..9a1016f5d500 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -218,7 +218,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
 	return pack_shadow(memcgid, zone, eviction);
 }
@@ -267,7 +267,7 @@ bool workingset_refault(void *shadow)
 		rcu_read_unlock();
 		return false;
 	}
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 	rcu_read_unlock();
@@ -319,7 +319,7 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	lruvec = mem_cgroup_lruvec(page_pgdat(page), page_zone(page), memcg);
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
 	atomic_long_inc(&lruvec->inactive_age);
 out:
 	rcu_read_unlock();
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 14/34] mm, memcg: move memcg limit enforcement from zones to nodes
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Memcg needs adjustment after moving LRUs to the node. Limits are tracked
per memcg but the soft-limit excess is tracked per zone. As global page
reclaim is based on the node, it is easy to imagine a situation where
a zone soft limit is exceeded even though the memcg limit is fine.

This patch moves the soft limit tree the node.  Technically, all the variable
names should also change but people are already familiar by the meaning of
"mz" even if "mn" would be a more appropriate name now.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h |  38 ++++-----
 include/linux/swap.h       |   2 +-
 mm/memcontrol.c            | 190 ++++++++++++++++++++-------------------------
 mm/vmscan.c                |  19 +++--
 mm/workingset.c            |   6 +-
 5 files changed, 111 insertions(+), 144 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c13227d018f2..80bf8458148a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -61,7 +61,7 @@ enum mem_cgroup_stat_index {
 };
 
 struct mem_cgroup_reclaim_cookie {
-	struct zone *zone;
+	pg_data_t *pgdat;
 	int priority;
 	unsigned int generation;
 };
@@ -119,7 +119,7 @@ struct mem_cgroup_reclaim_iter {
 /*
  * per-zone information in memory controller.
  */
-struct mem_cgroup_per_zone {
+struct mem_cgroup_per_node {
 	struct lruvec		lruvec;
 	unsigned long		lru_size[NR_LRU_LISTS];
 
@@ -133,10 +133,6 @@ struct mem_cgroup_per_zone {
 						/* use container_of	   */
 };
 
-struct mem_cgroup_per_node {
-	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
-};
-
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
 	unsigned long threshold;
@@ -315,19 +311,15 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
-static inline struct mem_cgroup_per_zone *
-mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
+static struct mem_cgroup_per_node *
+mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
 {
-	int nid = zone_to_nid(zone);
-	int zid = zone_idx(zone);
-
-	return &memcg->nodeinfo[nid]->zoneinfo[zid];
+	return memcg->nodeinfo[nid];
 }
 
 /**
  * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
  * @node: node of the wanted lruvec
- * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
  * Returns the lru list vector holding pages for a given @node or a given
@@ -335,9 +327,9 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
  * is disabled.
  */
 static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-				struct zone *zone, struct mem_cgroup *memcg)
+				struct mem_cgroup *memcg)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
@@ -345,7 +337,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 		goto out;
 	}
 
-	mz = mem_cgroup_zone_zoneinfo(memcg, zone);
+	mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
 	lruvec = &mz->lruvec;
 out:
 	/*
@@ -353,8 +345,8 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 	 * we have to be prepared to initialize lruvec->pgdat here;
 	 * and if offlined then reonlined, we need to reinitialize it.
 	 */
-	if (unlikely(lruvec->pgdat != zone->zone_pgdat))
-		lruvec->pgdat = zone->zone_pgdat;
+	if (unlikely(lruvec->pgdat != pgdat))
+		lruvec->pgdat = pgdat;
 	return lruvec;
 }
 
@@ -447,9 +439,9 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 static inline
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 
-	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	return mz->lru_size[lru];
 }
 
@@ -520,7 +512,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
 
@@ -612,7 +604,7 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 }
 
 static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-				struct zone *zone, struct mem_cgroup *memcg)
+				struct mem_cgroup *memcg)
 {
 	return node_lruvec(pgdat);
 }
@@ -724,7 +716,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
 					    unsigned long *total_scanned)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ad616d7c381..2a23ddc96edd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,7 +318,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  bool may_swap);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
-						struct zone *zone,
+						pg_data_t *pgdat,
 						unsigned long *nr_scanned);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c9ebec98e92a..9cbd40ebccd1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -132,15 +132,11 @@ static const char * const mem_cgroup_lru_names[] = {
  * their hierarchy representation
  */
 
-struct mem_cgroup_tree_per_zone {
+struct mem_cgroup_tree_per_node {
 	struct rb_root rb_root;
 	spinlock_t lock;
 };
 
-struct mem_cgroup_tree_per_node {
-	struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
 struct mem_cgroup_tree {
 	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
 };
@@ -374,37 +370,35 @@ ino_t page_cgroup_ino(struct page *page)
 	return ino;
 }
 
-static struct mem_cgroup_per_zone *
-mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
+static struct mem_cgroup_per_node *
+mem_cgroup_page_nodeinfo(struct mem_cgroup *memcg, struct page *page)
 {
 	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
 
-	return &memcg->nodeinfo[nid]->zoneinfo[zid];
+	return memcg->nodeinfo[nid];
 }
 
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
+static struct mem_cgroup_tree_per_node *
+soft_limit_tree_node(int nid)
 {
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+	return soft_limit_tree.rb_tree_per_node[nid];
 }
 
-static struct mem_cgroup_tree_per_zone *
+static struct mem_cgroup_tree_per_node *
 soft_limit_tree_from_page(struct page *page)
 {
 	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
 
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+	return soft_limit_tree.rb_tree_per_node[nid];
 }
 
-static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
-					 struct mem_cgroup_tree_per_zone *mctz,
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
+					 struct mem_cgroup_tree_per_node *mctz,
 					 unsigned long new_usage_in_excess)
 {
 	struct rb_node **p = &mctz->rb_root.rb_node;
 	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_zone *mz_node;
+	struct mem_cgroup_per_node *mz_node;
 
 	if (mz->on_tree)
 		return;
@@ -414,7 +408,7 @@ static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
 		return;
 	while (*p) {
 		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
+		mz_node = rb_entry(parent, struct mem_cgroup_per_node,
 					tree_node);
 		if (mz->usage_in_excess < mz_node->usage_in_excess)
 			p = &(*p)->rb_left;
@@ -430,8 +424,8 @@ static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz,
 	mz->on_tree = true;
 }
 
-static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
-					 struct mem_cgroup_tree_per_zone *mctz)
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+					 struct mem_cgroup_tree_per_node *mctz)
 {
 	if (!mz->on_tree)
 		return;
@@ -439,8 +433,8 @@ static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 	mz->on_tree = false;
 }
 
-static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
-				       struct mem_cgroup_tree_per_zone *mctz)
+static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+				       struct mem_cgroup_tree_per_node *mctz)
 {
 	unsigned long flags;
 
@@ -464,8 +458,8 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
 static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 {
 	unsigned long excess;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
+	struct mem_cgroup_per_node *mz;
+	struct mem_cgroup_tree_per_node *mctz;
 
 	mctz = soft_limit_tree_from_page(page);
 	/*
@@ -473,7 +467,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 	 * because their event counter is not touched.
 	 */
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-		mz = mem_cgroup_page_zoneinfo(memcg, page);
+		mz = mem_cgroup_page_nodeinfo(memcg, page);
 		excess = soft_limit_excess(memcg);
 		/*
 		 * We have to update the tree if mz is on RB-tree or
@@ -498,24 +492,22 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 
 static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
 {
-	struct mem_cgroup_tree_per_zone *mctz;
-	struct mem_cgroup_per_zone *mz;
-	int nid, zid;
+	struct mem_cgroup_tree_per_node *mctz;
+	struct mem_cgroup_per_node *mz;
+	int nid;
 
 	for_each_node(nid) {
-		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-			mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-			mctz = soft_limit_tree_node_zone(nid, zid);
-			mem_cgroup_remove_exceeded(mz, mctz);
-		}
+		mz = mem_cgroup_nodeinfo(memcg, nid);
+		mctz = soft_limit_tree_node(nid);
+		mem_cgroup_remove_exceeded(mz, mctz);
 	}
 }
 
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+static struct mem_cgroup_per_node *
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 {
 	struct rb_node *rightmost = NULL;
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 
 retry:
 	mz = NULL;
@@ -523,7 +515,7 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	if (!rightmost)
 		goto done;		/* Nothing to reclaim from */
 
-	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
+	mz = rb_entry(rightmost, struct mem_cgroup_per_node, tree_node);
 	/*
 	 * Remove the node now but someone else can add it back,
 	 * we will to add it back at the end of reclaim to its correct
@@ -537,10 +529,10 @@ __mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 	return mz;
 }
 
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+static struct mem_cgroup_per_node *
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 
 	spin_lock_irq(&mctz->lock);
 	mz = __mem_cgroup_largest_soft_limit_node(mctz);
@@ -634,20 +626,16 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 					   int nid, unsigned int lru_mask)
 {
 	unsigned long nr = 0;
-	int zid;
+	struct mem_cgroup_per_node *mz;
+	enum lru_list lru;
 
 	VM_BUG_ON((unsigned)nid >= nr_node_ids);
 
-	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-		struct mem_cgroup_per_zone *mz;
-		enum lru_list lru;
-
-		for_each_lru(lru) {
-			if (!(BIT(lru) & lru_mask))
-				continue;
-			mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-			nr += mz->lru_size[lru];
-		}
+	for_each_lru(lru) {
+		if (!(BIT(lru) & lru_mask))
+			continue;
+		mz = mem_cgroup_nodeinfo(memcg, nid);
+		nr += mz->lru_size[lru];
 	}
 	return nr;
 }
@@ -800,9 +788,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	rcu_read_lock();
 
 	if (reclaim) {
-		struct mem_cgroup_per_zone *mz;
+		struct mem_cgroup_per_node *mz;
 
-		mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone);
+		mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
 		iter = &mz->iter[reclaim->priority];
 
 		if (prev && reclaim->generation != iter->generation)
@@ -901,19 +889,17 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
 {
 	struct mem_cgroup *memcg = dead_memcg;
 	struct mem_cgroup_reclaim_iter *iter;
-	struct mem_cgroup_per_zone *mz;
-	int nid, zid;
+	struct mem_cgroup_per_node *mz;
+	int nid;
 	int i;
 
 	while ((memcg = parent_mem_cgroup(memcg))) {
 		for_each_node(nid) {
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-				mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-				for (i = 0; i <= DEF_PRIORITY; i++) {
-					iter = &mz->iter[i];
-					cmpxchg(&iter->position,
-						dead_memcg, NULL);
-				}
+			mz = mem_cgroup_nodeinfo(memcg, nid);
+			for (i = 0; i <= DEF_PRIORITY; i++) {
+				iter = &mz->iter[i];
+				cmpxchg(&iter->position,
+					dead_memcg, NULL);
 			}
 		}
 	}
@@ -945,7 +931,7 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  */
 struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
@@ -962,7 +948,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	if (!memcg)
 		memcg = root_mem_cgroup;
 
-	mz = mem_cgroup_page_zoneinfo(memcg, page);
+	mz = mem_cgroup_page_nodeinfo(memcg, page);
 	lruvec = &mz->lruvec;
 out:
 	/*
@@ -989,7 +975,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 				enum zone_type zid, int nr_pages)
 {
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_per_node *mz;
 	unsigned long *lru_size;
 	long size;
 	bool empty;
@@ -999,7 +985,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 	if (mem_cgroup_disabled())
 		return;
 
-	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	lru_size = mz->lru_size + lru;
 	empty = list_empty(lruvec->lists + lru);
 
@@ -1392,7 +1378,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 #endif
 
 static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
-				   struct zone *zone,
+				   pg_data_t *pgdat,
 				   gfp_t gfp_mask,
 				   unsigned long *total_scanned)
 {
@@ -1402,7 +1388,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 	unsigned long excess;
 	unsigned long nr_scanned;
 	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
+		.pgdat = pgdat,
 		.priority = 0,
 	};
 
@@ -1433,7 +1419,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 			continue;
 		}
 		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
-					zone, &nr_scanned);
+					pgdat, &nr_scanned);
 		*total_scanned += nr_scanned;
 		if (!soft_limit_excess(root_memcg))
 			break;
@@ -2560,22 +2546,22 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
 					    unsigned long *total_scanned)
 {
 	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
+	struct mem_cgroup_per_node *mz, *next_mz = NULL;
 	unsigned long reclaimed;
 	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
+	struct mem_cgroup_tree_per_node *mctz;
 	unsigned long excess;
 	unsigned long nr_scanned;
 
 	if (order > 0)
 		return 0;
 
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
+	mctz = soft_limit_tree_node(pgdat->node_id);
 	/*
 	 * This loop can run a while, specially if mem_cgroup's continuously
 	 * keep exceeding their soft limit and putting the system under
@@ -2590,7 +2576,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 			break;
 
 		nr_scanned = 0;
-		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone,
+		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
 						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
@@ -3211,22 +3197,21 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 
 #ifdef CONFIG_DEBUG_VM
 	{
-		int nid, zid;
-		struct mem_cgroup_per_zone *mz;
+		pg_data_t *pgdat;
+		struct mem_cgroup_per_node *mz;
 		struct zone_reclaim_stat *rstat;
 		unsigned long recent_rotated[2] = {0, 0};
 		unsigned long recent_scanned[2] = {0, 0};
 
-		for_each_online_node(nid)
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-				mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
-				rstat = &mz->lruvec.reclaim_stat;
+		for_each_online_pgdat(pgdat) {
+			mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
+			rstat = &mz->lruvec.reclaim_stat;
 
-				recent_rotated[0] += rstat->recent_rotated[0];
-				recent_rotated[1] += rstat->recent_rotated[1];
-				recent_scanned[0] += rstat->recent_scanned[0];
-				recent_scanned[1] += rstat->recent_scanned[1];
-			}
+			recent_rotated[0] += rstat->recent_rotated[0];
+			recent_rotated[1] += rstat->recent_rotated[1];
+			recent_scanned[0] += rstat->recent_scanned[0];
+			recent_scanned[1] += rstat->recent_scanned[1];
+		}
 		seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
 		seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
 		seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
@@ -4106,11 +4091,10 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 	return idr_find(&mem_cgroup_idr, id);
 }
 
-static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
+static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
-	struct mem_cgroup_per_zone *mz;
-	int zone, tmp = node;
+	int tmp = node;
 	/*
 	 * This routine is called against possible nodes.
 	 * But it's BUG to call kmalloc() against offline node.
@@ -4125,18 +4109,16 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-		mz = &pn->zoneinfo[zone];
-		lruvec_init(&mz->lruvec);
-		mz->usage_in_excess = 0;
-		mz->on_tree = false;
-		mz->memcg = memcg;
-	}
+	lruvec_init(&pn->lruvec);
+	pn->usage_in_excess = 0;
+	pn->on_tree = false;
+	pn->memcg = memcg;
+
 	memcg->nodeinfo[node] = pn;
 	return 0;
 }
 
-static void free_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
+static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	kfree(memcg->nodeinfo[node]);
 }
@@ -4147,7 +4129,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 
 	memcg_wb_domain_exit(memcg);
 	for_each_node(node)
-		free_mem_cgroup_per_zone_info(memcg, node);
+		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->stat);
 	kfree(memcg);
 }
@@ -4176,7 +4158,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 
 	for_each_node(node)
-		if (alloc_mem_cgroup_per_zone_info(memcg, node))
+		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
 
 	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
@@ -5779,18 +5761,12 @@ static int __init mem_cgroup_init(void)
 
 	for_each_node(node) {
 		struct mem_cgroup_tree_per_node *rtpn;
-		int zone;
 
 		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
 				    node_online(node) ? node : NUMA_NO_NODE);
 
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			struct mem_cgroup_tree_per_zone *rtpz;
-
-			rtpz = &rtpn->rb_tree_per_zone[zone];
-			rtpz->rb_root = RB_ROOT;
-			spin_lock_init(&rtpz->lock);
-		}
+		rtpn->rb_root = RB_ROOT;
+		spin_lock_init(&rtpn->lock);
 		soft_limit_tree.rb_tree_per_node[node] = rtpn;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f0bea68b8780..8d2555dd3ef3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2229,8 +2229,7 @@ static inline void init_tlb_ubc(void)
 static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
 			      struct scan_control *sc, unsigned long *lru_pages)
 {
-	struct zone *zone = &pgdat->node_zones[sc->reclaim_idx];
-	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
+	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long targets[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -2437,7 +2436,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
 		struct mem_cgroup_reclaim_cookie reclaim = {
-			.zone = &pgdat->node_zones[classzone_idx],
+			.pgdat = pgdat,
 			.priority = sc->priority,
 		};
 		unsigned long node_lru_pages = 0;
@@ -2645,7 +2644,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			 * and balancing, not for a memcg's limit.
 			 */
 			nr_soft_scanned = 0;
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
+			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
 						sc->order, sc->gfp_mask,
 						&nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
@@ -2915,7 +2914,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 						gfp_t gfp_mask, bool noswap,
-						struct zone *zone,
+						pg_data_t *pgdat,
 						unsigned long *nr_scanned)
 {
 	struct scan_control sc = {
@@ -2942,7 +2941,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_node_memcg(zone->zone_pgdat, memcg, &sc, &lru_pages);
+	shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
@@ -2992,7 +2991,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 #endif
 
 static void age_active_anon(struct pglist_data *pgdat,
-				struct zone *zone, struct scan_control *sc)
+				struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
@@ -3001,7 +3000,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, zone, memcg);
+		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
 
 		if (inactive_list_is_low(lruvec, false))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
@@ -3191,7 +3190,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * pages are rotated regardless of classzone as this is
 		 * about consistent aging.
 		 */
-		age_active_anon(pgdat, &pgdat->node_zones[MAX_NR_ZONES - 1], &sc);
+		age_active_anon(pgdat, &sc);
 
 		/*
 		 * If we're getting trouble reclaiming, start doing writepage
@@ -3203,7 +3202,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		/* Call soft limit reclaim before calling shrink_node. */
 		sc.nr_scanned = 0;
 		nr_soft_scanned = 0;
-		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc.order,
+		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
 						sc.gfp_mask, &nr_soft_scanned);
 		sc.nr_reclaimed += nr_soft_reclaimed;
 
diff --git a/mm/workingset.c b/mm/workingset.c
index de68ad681585..9a1016f5d500 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -218,7 +218,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
 	return pack_shadow(memcgid, zone, eviction);
 }
@@ -267,7 +267,7 @@ bool workingset_refault(void *shadow)
 		rcu_read_unlock();
 		return false;
 	}
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, zone, memcg);
+	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 	rcu_read_unlock();
@@ -319,7 +319,7 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	lruvec = mem_cgroup_lruvec(page_pgdat(page), page_zone(page), memcg);
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
 	atomic_long_inc(&lruvec->inactive_age);
 out:
 	rcu_read_unlock();
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 15/34] mm, workingset: make working set detection node-aware
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Working set and refault detection is still zone-based, fix it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  6 +++---
 include/linux/vmstat.h |  1 -
 mm/vmstat.c            | 20 +++-----------------
 mm/workingset.c        | 43 ++++++++++++++++++++-----------------------
 4 files changed, 26 insertions(+), 44 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 895c365e3259..62f477d6cfe8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -145,9 +145,6 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
-	WORKINGSET_REFAULT,
-	WORKINGSET_ACTIVATE,
-	WORKINGSET_NODERECLAIM,
 	NR_ANON_THPS,
 	NR_SHMEM_THPS,
 	NR_SHMEM_PMDMAPPED,
@@ -164,6 +161,9 @@ enum node_stat_item {
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
+	WORKINGSET_REFAULT,
+	WORKINGSET_ACTIVATE,
+	WORKINGSET_NODERECLAIM,
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fee321c98550..6b7975cd98aa 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -227,7 +227,6 @@ void mod_node_page_state(struct pglist_data *, enum node_stat_item, long);
 void inc_node_page_state(struct page *, enum node_stat_item);
 void dec_node_page_state(struct page *, enum node_stat_item);
 
-extern void inc_zone_state(struct zone *, enum zone_stat_item);
 extern void inc_node_state(struct pglist_data *, enum node_stat_item);
 extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void __inc_node_state(struct pglist_data *, enum node_stat_item);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index de0c17076270..d17d66e85def 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -446,11 +446,6 @@ void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
-void inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
-	mod_zone_state(zone, item, 1, 1);
-}
-
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	mod_zone_state(page_zone(page), item, 1, 1);
@@ -539,15 +534,6 @@ void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
-void inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__inc_zone_state(zone, item);
-	local_irq_restore(flags);
-}
-
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	unsigned long flags;
@@ -967,9 +953,6 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
-	"workingset_refault",
-	"workingset_activate",
-	"workingset_nodereclaim",
 	"nr_anon_transparent_hugepages",
 	"nr_shmem_hugepages",
 	"nr_shmem_pmdmapped",
@@ -984,6 +967,9 @@ const char * const vmstat_text[] = {
 	"nr_isolated_anon",
 	"nr_isolated_file",
 	"nr_pages_scanned",
+	"workingset_refault",
+	"workingset_activate",
+	"workingset_nodereclaim",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 9a1016f5d500..56334e7d6924 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -16,7 +16,7 @@
 /*
  *		Double CLOCK lists
  *
- * Per zone, two clock lists are maintained for file pages: the
+ * Per node, two clock lists are maintained for file pages: the
  * inactive and the active list.  Freshly faulted pages start out at
  * the head of the inactive list and page reclaim scans pages from the
  * tail.  Pages that are accessed multiple times on the inactive list
@@ -141,11 +141,11 @@
  *
  *		Implementation
  *
- * For each zone's file LRU lists, a counter for inactive evictions
- * and activations is maintained (zone->inactive_age).
+ * For each node's file LRU lists, a counter for inactive evictions
+ * and activations is maintained (node->inactive_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
- * identify the zone) is stored in the now empty page cache radix tree
+ * identify the node) is stored in the now empty page cache radix tree
  * slot of the evicted page.  This is called a shadow entry.
  *
  * On cache misses for which there are shadow entries, an eligible
@@ -153,7 +153,7 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 ZONES_SHIFT + NODES_SHIFT +	\
+			 NODES_SHIFT +	\
 			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
@@ -167,33 +167,30 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
-	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
-	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
+	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
+static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 			  unsigned long *evictionp)
 {
 	unsigned long entry = (unsigned long)shadow;
-	int memcgid, nid, zid;
+	int memcgid, nid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
-	zid = entry & ((1UL << ZONES_SHIFT) - 1);
-	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
 	entry >>= MEM_CGROUP_ID_SHIFT;
 
 	*memcgidp = memcgid;
-	*zonep = NODE_DATA(nid)->node_zones + zid;
+	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
 }
 
@@ -208,7 +205,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
 	struct mem_cgroup *memcg = page_memcg(page);
-	struct zone *zone = page_zone(page);
+	struct pglist_data *pgdat = page_pgdat(page);
 	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -218,9 +215,9 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
+	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, zone, eviction);
+	return pack_shadow(memcgid, pgdat, eviction);
 }
 
 /**
@@ -228,7 +225,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
- * evicted page in the context of the zone it was allocated in.
+ * evicted page in the context of the node it was allocated in.
  *
  * Returns %true if the page should be activated, %false otherwise.
  */
@@ -240,10 +237,10 @@ bool workingset_refault(void *shadow)
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct zone *zone;
+	struct pglist_data *pgdat;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &zone, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
 
 	rcu_read_lock();
 	/*
@@ -267,7 +264,7 @@ bool workingset_refault(void *shadow)
 		rcu_read_unlock();
 		return false;
 	}
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
+	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 	rcu_read_unlock();
@@ -290,10 +287,10 @@ bool workingset_refault(void *shadow)
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
-	inc_zone_state(zone, WORKINGSET_REFAULT);
+	inc_node_state(pgdat, WORKINGSET_REFAULT);
 
 	if (refault_distance <= active_file) {
-		inc_zone_state(zone, WORKINGSET_ACTIVATE);
+		inc_node_state(pgdat, WORKINGSET_ACTIVATE);
 		return true;
 	}
 	return false;
@@ -436,7 +433,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 		}
 	}
 	BUG_ON(node->count);
-	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
 	if (!__radix_tree_delete_node(&mapping->page_tree, node))
 		BUG();
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 15/34] mm, workingset: make working set detection node-aware
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Working set and refault detection is still zone-based, fix it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  6 +++---
 include/linux/vmstat.h |  1 -
 mm/vmstat.c            | 20 +++-----------------
 mm/workingset.c        | 43 ++++++++++++++++++++-----------------------
 4 files changed, 26 insertions(+), 44 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 895c365e3259..62f477d6cfe8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -145,9 +145,6 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
-	WORKINGSET_REFAULT,
-	WORKINGSET_ACTIVATE,
-	WORKINGSET_NODERECLAIM,
 	NR_ANON_THPS,
 	NR_SHMEM_THPS,
 	NR_SHMEM_PMDMAPPED,
@@ -164,6 +161,9 @@ enum node_stat_item {
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
+	WORKINGSET_REFAULT,
+	WORKINGSET_ACTIVATE,
+	WORKINGSET_NODERECLAIM,
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fee321c98550..6b7975cd98aa 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -227,7 +227,6 @@ void mod_node_page_state(struct pglist_data *, enum node_stat_item, long);
 void inc_node_page_state(struct page *, enum node_stat_item);
 void dec_node_page_state(struct page *, enum node_stat_item);
 
-extern void inc_zone_state(struct zone *, enum zone_stat_item);
 extern void inc_node_state(struct pglist_data *, enum node_stat_item);
 extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void __inc_node_state(struct pglist_data *, enum node_stat_item);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index de0c17076270..d17d66e85def 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -446,11 +446,6 @@ void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
-void inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
-	mod_zone_state(zone, item, 1, 1);
-}
-
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	mod_zone_state(page_zone(page), item, 1, 1);
@@ -539,15 +534,6 @@ void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
-void inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__inc_zone_state(zone, item);
-	local_irq_restore(flags);
-}
-
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	unsigned long flags;
@@ -967,9 +953,6 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
-	"workingset_refault",
-	"workingset_activate",
-	"workingset_nodereclaim",
 	"nr_anon_transparent_hugepages",
 	"nr_shmem_hugepages",
 	"nr_shmem_pmdmapped",
@@ -984,6 +967,9 @@ const char * const vmstat_text[] = {
 	"nr_isolated_anon",
 	"nr_isolated_file",
 	"nr_pages_scanned",
+	"workingset_refault",
+	"workingset_activate",
+	"workingset_nodereclaim",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 9a1016f5d500..56334e7d6924 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -16,7 +16,7 @@
 /*
  *		Double CLOCK lists
  *
- * Per zone, two clock lists are maintained for file pages: the
+ * Per node, two clock lists are maintained for file pages: the
  * inactive and the active list.  Freshly faulted pages start out at
  * the head of the inactive list and page reclaim scans pages from the
  * tail.  Pages that are accessed multiple times on the inactive list
@@ -141,11 +141,11 @@
  *
  *		Implementation
  *
- * For each zone's file LRU lists, a counter for inactive evictions
- * and activations is maintained (zone->inactive_age).
+ * For each node's file LRU lists, a counter for inactive evictions
+ * and activations is maintained (node->inactive_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
- * identify the zone) is stored in the now empty page cache radix tree
+ * identify the node) is stored in the now empty page cache radix tree
  * slot of the evicted page.  This is called a shadow entry.
  *
  * On cache misses for which there are shadow entries, an eligible
@@ -153,7 +153,7 @@
  */
 
 #define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 ZONES_SHIFT + NODES_SHIFT +	\
+			 NODES_SHIFT +	\
 			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
@@ -167,33 +167,30 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
+static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
-	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
-	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
+	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
+static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 			  unsigned long *evictionp)
 {
 	unsigned long entry = (unsigned long)shadow;
-	int memcgid, nid, zid;
+	int memcgid, nid;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
-	zid = entry & ((1UL << ZONES_SHIFT) - 1);
-	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
 	entry >>= MEM_CGROUP_ID_SHIFT;
 
 	*memcgidp = memcgid;
-	*zonep = NODE_DATA(nid)->node_zones + zid;
+	*pgdat = NODE_DATA(nid);
 	*evictionp = entry << bucket_order;
 }
 
@@ -208,7 +205,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
 	struct mem_cgroup *memcg = page_memcg(page);
-	struct zone *zone = page_zone(page);
+	struct pglist_data *pgdat = page_pgdat(page);
 	int memcgid = mem_cgroup_id(memcg);
 	unsigned long eviction;
 	struct lruvec *lruvec;
@@ -218,9 +215,9 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
+	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, zone, eviction);
+	return pack_shadow(memcgid, pgdat, eviction);
 }
 
 /**
@@ -228,7 +225,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
- * evicted page in the context of the zone it was allocated in.
+ * evicted page in the context of the node it was allocated in.
  *
  * Returns %true if the page should be activated, %false otherwise.
  */
@@ -240,10 +237,10 @@ bool workingset_refault(void *shadow)
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
-	struct zone *zone;
+	struct pglist_data *pgdat;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &zone, &eviction);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
 
 	rcu_read_lock();
 	/*
@@ -267,7 +264,7 @@ bool workingset_refault(void *shadow)
 		rcu_read_unlock();
 		return false;
 	}
-	lruvec = mem_cgroup_lruvec(zone->zone_pgdat, memcg);
+	lruvec = mem_cgroup_lruvec(pgdat, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
 	rcu_read_unlock();
@@ -290,10 +287,10 @@ bool workingset_refault(void *shadow)
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
-	inc_zone_state(zone, WORKINGSET_REFAULT);
+	inc_node_state(pgdat, WORKINGSET_REFAULT);
 
 	if (refault_distance <= active_file) {
-		inc_zone_state(zone, WORKINGSET_ACTIVATE);
+		inc_node_state(pgdat, WORKINGSET_ACTIVATE);
 		return true;
 	}
 	return false;
@@ -436,7 +433,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 		}
 	}
 	BUG_ON(node->count);
-	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
 	if (!__radix_tree_delete_node(&mapping->page_tree, node))
 		BUG();
 
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 16/34] mm, page_alloc: consider dirtyable memory in terms of nodes
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Historically dirty pages were spread among zones but now that LRUs are
per-node it is more appropriate to consider dirty pages in a node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h    | 12 +++----
 include/linux/writeback.h |  2 +-
 mm/page-writeback.c       | 91 +++++++++++++++++++++++++++++++----------------
 mm/page_alloc.c           | 26 ++++++--------
 4 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 62f477d6cfe8..fae2fe3c6942 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -363,12 +363,6 @@ struct zone {
 	struct pglist_data	*zone_pgdat;
 	struct per_cpu_pageset __percpu *pageset;
 
-	/*
-	 * This is a per-zone reserve of pages that are not available
-	 * to userspace allocations.
-	 */
-	unsigned long		totalreserve_pages;
-
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
@@ -687,6 +681,12 @@ typedef struct pglist_data {
 	/* Number of pages migrated during the rate limiting time interval */
 	unsigned long numabalancing_migrate_nr_pages;
 #endif
+	/*
+	 * This is a per-node reserve of pages that are not available
+	 * to userspace allocations.
+	 */
+	unsigned long		totalreserve_pages;
+
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
 	spinlock_t		lru_lock;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 717e6149e753..fc1e16c25a29 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -320,7 +320,7 @@ void laptop_mode_timer_fn(unsigned long data);
 static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
-bool zone_dirty_ok(struct zone *zone);
+bool node_dirty_ok(struct pglist_data *pgdat);
 int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
 #ifdef CONFIG_CGROUP_WRITEBACK
 void wb_domain_exit(struct wb_domain *dom);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0ada2b2954b0..f7c0fb993fb9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -267,26 +267,35 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
  */
 
 /**
- * zone_dirtyable_memory - number of dirtyable pages in a zone
- * @zone: the zone
+ * node_dirtyable_memory - number of dirtyable pages in a node
+ * @pgdat: the node
  *
- * Returns the zone's number of pages potentially available for dirty
- * page cache.  This is the base value for the per-zone dirty limits.
+ * Returns the node's number of pages potentially available for dirty
+ * page cache.  This is the base value for the per-node dirty limits.
  */
-static unsigned long zone_dirtyable_memory(struct zone *zone)
+static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 {
-	unsigned long nr_pages;
+	unsigned long nr_pages = 0;
+	int z;
+
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		nr_pages += zone_page_state(zone, NR_FREE_PAGES);
+	}
 
-	nr_pages = zone_page_state(zone, NR_FREE_PAGES);
 	/*
 	 * Pages reserved for the kernel should not be considered
 	 * dirtyable, to prevent a situation where reclaim has to
 	 * clean pages in order to balance the zones.
 	 */
-	nr_pages -= min(nr_pages, zone->totalreserve_pages);
+	nr_pages -= min(nr_pages, pgdat->totalreserve_pages);
 
-	nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
-	nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
+	nr_pages += node_page_state(pgdat, NR_INACTIVE_FILE);
+	nr_pages += node_page_state(pgdat, NR_ACTIVE_FILE);
 
 	return nr_pages;
 }
@@ -299,13 +308,24 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 	int i;
 
 	for_each_node_state(node, N_HIGH_MEMORY) {
-		for (i = 0; i < MAX_NR_ZONES; i++) {
-			struct zone *z = &NODE_DATA(node)->node_zones[i];
+		for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
+			struct zone *z;
+			unsigned long dirtyable;
+
+			if (!is_highmem_idx(i))
+				continue;
+
+			z = &NODE_DATA(node)->node_zones[i];
+			dirtyable = zone_page_state(z, NR_FREE_PAGES) +
+				zone_page_state(z, NR_ZONE_LRU_FILE);
 
-			if (is_highmem(z))
-				x += zone_dirtyable_memory(z);
+			/* watch for underflows */
+			dirtyable -= min(dirtyable, high_wmark_pages(z));
+
+			x += dirtyable;
 		}
 	}
+
 	/*
 	 * Unreclaimable memory (kernel memory or anonymous memory
 	 * without swap) can bring down the dirtyable pages below
@@ -445,23 +465,23 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 }
 
 /**
- * zone_dirty_limit - maximum number of dirty pages allowed in a zone
- * @zone: the zone
+ * node_dirty_limit - maximum number of dirty pages allowed in a node
+ * @pgdat: the node
  *
- * Returns the maximum number of dirty pages allowed in a zone, based
- * on the zone's dirtyable memory.
+ * Returns the maximum number of dirty pages allowed in a node, based
+ * on the node's dirtyable memory.
  */
-static unsigned long zone_dirty_limit(struct zone *zone)
+static unsigned long node_dirty_limit(struct pglist_data *pgdat)
 {
-	unsigned long zone_memory = zone_dirtyable_memory(zone);
+	unsigned long node_memory = node_dirtyable_memory(pgdat);
 	struct task_struct *tsk = current;
 	unsigned long dirty;
 
 	if (vm_dirty_bytes)
 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
-			zone_memory / global_dirtyable_memory();
+			node_memory / global_dirtyable_memory();
 	else
-		dirty = vm_dirty_ratio * zone_memory / 100;
+		dirty = vm_dirty_ratio * node_memory / 100;
 
 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
 		dirty += dirty / 4;
@@ -470,19 +490,30 @@ static unsigned long zone_dirty_limit(struct zone *zone)
 }
 
 /**
- * zone_dirty_ok - tells whether a zone is within its dirty limits
- * @zone: the zone to check
+ * node_dirty_ok - tells whether a node is within its dirty limits
+ * @pgdat: the node to check
  *
- * Returns %true when the dirty pages in @zone are within the zone's
+ * Returns %true when the dirty pages in @pgdat are within the node's
  * dirty limit, %false if the limit is exceeded.
  */
-bool zone_dirty_ok(struct zone *zone)
+bool node_dirty_ok(struct pglist_data *pgdat)
 {
-	unsigned long limit = zone_dirty_limit(zone);
+	int z;
+	unsigned long limit = node_dirty_limit(pgdat);
+	unsigned long nr_pages = 0;
+
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		nr_pages += zone_page_state(zone, NR_FILE_DIRTY);
+		nr_pages += zone_page_state(zone, NR_UNSTABLE_NFS);
+		nr_pages += zone_page_state(zone, NR_WRITEBACK);
+	}
 
-	return zone_page_state(zone, NR_FILE_DIRTY) +
-	       zone_page_state(zone, NR_UNSTABLE_NFS) +
-	       zone_page_state(zone, NR_WRITEBACK) <= limit;
+	return nr_pages <= limit;
 }
 
 int dirty_background_ratio_handler(struct ctl_table *table, int write,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8215c51d5b23..9e113a6ff9a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2955,31 +2955,24 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 		/*
 		 * When allocating a page cache page for writing, we
-		 * want to get it from a zone that is within its dirty
-		 * limit, such that no single zone holds more than its
+		 * want to get it from a node that is within its dirty
+		 * limit, such that no single node holds more than its
 		 * proportional share of globally allowed dirty pages.
-		 * The dirty limits take into account the zone's
+		 * The dirty limits take into account the node's
 		 * lowmem reserves and high watermark so that kswapd
 		 * should be able to balance it without having to
 		 * write pages from its LRU list.
 		 *
-		 * This may look like it could increase pressure on
-		 * lower zones by failing allocations in higher zones
-		 * before they are full.  But the pages that do spill
-		 * over are limited as the lower zones are protected
-		 * by this very same mechanism.  It should not become
-		 * a practical burden to them.
-		 *
 		 * XXX: For now, allow allocations to potentially
-		 * exceed the per-zone dirty limit in the slowpath
+		 * exceed the per-node dirty limit in the slowpath
 		 * (spread_dirty_pages unset) before going into reclaim,
 		 * which is important when on a NUMA setup the allowed
-		 * zones are together not big enough to reach the
+		 * nodes are together not big enough to reach the
 		 * global limit.  The proper fix for these situations
-		 * will require awareness of zones in the
+		 * will require awareness of nodes in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
+		if (ac->spread_dirty_pages && !node_dirty_ok(zone->zone_pgdat))
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
@@ -6749,6 +6742,9 @@ static void calculate_totalreserve_pages(void)
 	enum zone_type i, j;
 
 	for_each_online_pgdat(pgdat) {
+
+		pgdat->totalreserve_pages = 0;
+
 		for (i = 0; i < MAX_NR_ZONES; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 			long max = 0;
@@ -6765,7 +6761,7 @@ static void calculate_totalreserve_pages(void)
 			if (max > zone->managed_pages)
 				max = zone->managed_pages;
 
-			zone->totalreserve_pages = max;
+			pgdat->totalreserve_pages += max;
 
 			reserve_pages += max;
 		}
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 16/34] mm, page_alloc: consider dirtyable memory in terms of nodes
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Historically dirty pages were spread among zones but now that LRUs are
per-node it is more appropriate to consider dirty pages in a node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h    | 12 +++----
 include/linux/writeback.h |  2 +-
 mm/page-writeback.c       | 91 +++++++++++++++++++++++++++++++----------------
 mm/page_alloc.c           | 26 ++++++--------
 4 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 62f477d6cfe8..fae2fe3c6942 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -363,12 +363,6 @@ struct zone {
 	struct pglist_data	*zone_pgdat;
 	struct per_cpu_pageset __percpu *pageset;
 
-	/*
-	 * This is a per-zone reserve of pages that are not available
-	 * to userspace allocations.
-	 */
-	unsigned long		totalreserve_pages;
-
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
@@ -687,6 +681,12 @@ typedef struct pglist_data {
 	/* Number of pages migrated during the rate limiting time interval */
 	unsigned long numabalancing_migrate_nr_pages;
 #endif
+	/*
+	 * This is a per-node reserve of pages that are not available
+	 * to userspace allocations.
+	 */
+	unsigned long		totalreserve_pages;
+
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
 	spinlock_t		lru_lock;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 717e6149e753..fc1e16c25a29 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -320,7 +320,7 @@ void laptop_mode_timer_fn(unsigned long data);
 static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
-bool zone_dirty_ok(struct zone *zone);
+bool node_dirty_ok(struct pglist_data *pgdat);
 int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
 #ifdef CONFIG_CGROUP_WRITEBACK
 void wb_domain_exit(struct wb_domain *dom);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0ada2b2954b0..f7c0fb993fb9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -267,26 +267,35 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
  */
 
 /**
- * zone_dirtyable_memory - number of dirtyable pages in a zone
- * @zone: the zone
+ * node_dirtyable_memory - number of dirtyable pages in a node
+ * @pgdat: the node
  *
- * Returns the zone's number of pages potentially available for dirty
- * page cache.  This is the base value for the per-zone dirty limits.
+ * Returns the node's number of pages potentially available for dirty
+ * page cache.  This is the base value for the per-node dirty limits.
  */
-static unsigned long zone_dirtyable_memory(struct zone *zone)
+static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 {
-	unsigned long nr_pages;
+	unsigned long nr_pages = 0;
+	int z;
+
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		nr_pages += zone_page_state(zone, NR_FREE_PAGES);
+	}
 
-	nr_pages = zone_page_state(zone, NR_FREE_PAGES);
 	/*
 	 * Pages reserved for the kernel should not be considered
 	 * dirtyable, to prevent a situation where reclaim has to
 	 * clean pages in order to balance the zones.
 	 */
-	nr_pages -= min(nr_pages, zone->totalreserve_pages);
+	nr_pages -= min(nr_pages, pgdat->totalreserve_pages);
 
-	nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
-	nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
+	nr_pages += node_page_state(pgdat, NR_INACTIVE_FILE);
+	nr_pages += node_page_state(pgdat, NR_ACTIVE_FILE);
 
 	return nr_pages;
 }
@@ -299,13 +308,24 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 	int i;
 
 	for_each_node_state(node, N_HIGH_MEMORY) {
-		for (i = 0; i < MAX_NR_ZONES; i++) {
-			struct zone *z = &NODE_DATA(node)->node_zones[i];
+		for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
+			struct zone *z;
+			unsigned long dirtyable;
+
+			if (!is_highmem_idx(i))
+				continue;
+
+			z = &NODE_DATA(node)->node_zones[i];
+			dirtyable = zone_page_state(z, NR_FREE_PAGES) +
+				zone_page_state(z, NR_ZONE_LRU_FILE);
 
-			if (is_highmem(z))
-				x += zone_dirtyable_memory(z);
+			/* watch for underflows */
+			dirtyable -= min(dirtyable, high_wmark_pages(z));
+
+			x += dirtyable;
 		}
 	}
+
 	/*
 	 * Unreclaimable memory (kernel memory or anonymous memory
 	 * without swap) can bring down the dirtyable pages below
@@ -445,23 +465,23 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 }
 
 /**
- * zone_dirty_limit - maximum number of dirty pages allowed in a zone
- * @zone: the zone
+ * node_dirty_limit - maximum number of dirty pages allowed in a node
+ * @pgdat: the node
  *
- * Returns the maximum number of dirty pages allowed in a zone, based
- * on the zone's dirtyable memory.
+ * Returns the maximum number of dirty pages allowed in a node, based
+ * on the node's dirtyable memory.
  */
-static unsigned long zone_dirty_limit(struct zone *zone)
+static unsigned long node_dirty_limit(struct pglist_data *pgdat)
 {
-	unsigned long zone_memory = zone_dirtyable_memory(zone);
+	unsigned long node_memory = node_dirtyable_memory(pgdat);
 	struct task_struct *tsk = current;
 	unsigned long dirty;
 
 	if (vm_dirty_bytes)
 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
-			zone_memory / global_dirtyable_memory();
+			node_memory / global_dirtyable_memory();
 	else
-		dirty = vm_dirty_ratio * zone_memory / 100;
+		dirty = vm_dirty_ratio * node_memory / 100;
 
 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
 		dirty += dirty / 4;
@@ -470,19 +490,30 @@ static unsigned long zone_dirty_limit(struct zone *zone)
 }
 
 /**
- * zone_dirty_ok - tells whether a zone is within its dirty limits
- * @zone: the zone to check
+ * node_dirty_ok - tells whether a node is within its dirty limits
+ * @pgdat: the node to check
  *
- * Returns %true when the dirty pages in @zone are within the zone's
+ * Returns %true when the dirty pages in @pgdat are within the node's
  * dirty limit, %false if the limit is exceeded.
  */
-bool zone_dirty_ok(struct zone *zone)
+bool node_dirty_ok(struct pglist_data *pgdat)
 {
-	unsigned long limit = zone_dirty_limit(zone);
+	int z;
+	unsigned long limit = node_dirty_limit(pgdat);
+	unsigned long nr_pages = 0;
+
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		nr_pages += zone_page_state(zone, NR_FILE_DIRTY);
+		nr_pages += zone_page_state(zone, NR_UNSTABLE_NFS);
+		nr_pages += zone_page_state(zone, NR_WRITEBACK);
+	}
 
-	return zone_page_state(zone, NR_FILE_DIRTY) +
-	       zone_page_state(zone, NR_UNSTABLE_NFS) +
-	       zone_page_state(zone, NR_WRITEBACK) <= limit;
+	return nr_pages <= limit;
 }
 
 int dirty_background_ratio_handler(struct ctl_table *table, int write,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8215c51d5b23..9e113a6ff9a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2955,31 +2955,24 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 		/*
 		 * When allocating a page cache page for writing, we
-		 * want to get it from a zone that is within its dirty
-		 * limit, such that no single zone holds more than its
+		 * want to get it from a node that is within its dirty
+		 * limit, such that no single node holds more than its
 		 * proportional share of globally allowed dirty pages.
-		 * The dirty limits take into account the zone's
+		 * The dirty limits take into account the node's
 		 * lowmem reserves and high watermark so that kswapd
 		 * should be able to balance it without having to
 		 * write pages from its LRU list.
 		 *
-		 * This may look like it could increase pressure on
-		 * lower zones by failing allocations in higher zones
-		 * before they are full.  But the pages that do spill
-		 * over are limited as the lower zones are protected
-		 * by this very same mechanism.  It should not become
-		 * a practical burden to them.
-		 *
 		 * XXX: For now, allow allocations to potentially
-		 * exceed the per-zone dirty limit in the slowpath
+		 * exceed the per-node dirty limit in the slowpath
 		 * (spread_dirty_pages unset) before going into reclaim,
 		 * which is important when on a NUMA setup the allowed
-		 * zones are together not big enough to reach the
+		 * nodes are together not big enough to reach the
 		 * global limit.  The proper fix for these situations
-		 * will require awareness of zones in the
+		 * will require awareness of nodes in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
+		if (ac->spread_dirty_pages && !node_dirty_ok(zone->zone_pgdat))
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
@@ -6749,6 +6742,9 @@ static void calculate_totalreserve_pages(void)
 	enum zone_type i, j;
 
 	for_each_online_pgdat(pgdat) {
+
+		pgdat->totalreserve_pages = 0;
+
 		for (i = 0; i < MAX_NR_ZONES; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 			long max = 0;
@@ -6765,7 +6761,7 @@ static void calculate_totalreserve_pages(void)
 			if (max > zone->managed_pages)
 				max = zone->managed_pages;
 
-			zone->totalreserve_pages = max;
+			pgdat->totalreserve_pages += max;
 
 			reserve_pages += max;
 		}
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 17/34] mm: move page mapped accounting to the node
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Reclaim makes decisions based on the number of pages that are mapped but
it's mixing node and zone information.  Account NR_FILE_MAPPED and
NR_ANON_PAGES pages on the node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/tile/mm/pgtable.c |  2 +-
 drivers/base/node.c    |  4 ++--
 fs/proc/meminfo.c      |  4 ++--
 include/linux/mmzone.h |  6 +++---
 mm/page_alloc.c        |  6 +++---
 mm/rmap.c              | 14 +++++++-------
 mm/vmscan.c            |  2 +-
 mm/vmstat.c            |  4 ++--
 8 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 9e389213580d..c606b0ef2f7e 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -55,7 +55,7 @@ void show_mem(unsigned int filter)
 	       global_page_state(NR_FREE_PAGES),
 	       (global_page_state(NR_SLAB_RECLAIMABLE) +
 		global_page_state(NR_SLAB_UNRECLAIMABLE)),
-	       global_page_state(NR_FILE_MAPPED),
+	       global_node_page_state(NR_FILE_MAPPED),
 	       global_page_state(NR_PAGETABLE),
 	       global_page_state(NR_BOUNCE),
 	       global_page_state(NR_FILE_PAGES),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index b7f01a4a642d..acca09536ad9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -121,8 +121,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
 		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
 		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
-		       nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
+		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
+		       nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index cf301a9ef512..b8d52aa2f19a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -140,8 +140,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.freeswap),
 		K(global_page_state(NR_FILE_DIRTY)),
 		K(global_page_state(NR_WRITEBACK)),
-		K(global_page_state(NR_ANON_PAGES)),
-		K(global_page_state(NR_FILE_MAPPED)),
+		K(global_node_page_state(NR_ANON_PAGES)),
+		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
 				global_page_state(NR_SLAB_UNRECLAIMABLE)),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fae2fe3c6942..95d34d1e1fb5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,9 +115,6 @@ enum zone_stat_item {
 	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
 	NR_ZONE_LRU_FILE,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
-	NR_ANON_PAGES,	/* Mapped anonymous pages */
-	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
-			   only modified from process context */
 	NR_FILE_PAGES,
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
@@ -164,6 +161,9 @@ enum node_stat_item {
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
+	NR_ANON_PAGES,	/* Mapped anonymous pages */
+	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
+			   only modified from process context */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9e113a6ff9a0..78338b51819b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4355,7 +4355,7 @@ void show_free_areas(unsigned int filter)
 		global_page_state(NR_UNSTABLE_NFS),
 		global_page_state(NR_SLAB_RECLAIMABLE),
 		global_page_state(NR_SLAB_UNRECLAIMABLE),
-		global_page_state(NR_FILE_MAPPED),
+		global_node_page_state(NR_FILE_MAPPED),
 		global_page_state(NR_SHMEM),
 		global_page_state(NR_PAGETABLE),
 		global_page_state(NR_BOUNCE),
@@ -4377,6 +4377,7 @@ void show_free_areas(unsigned int filter)
 			" unevictable:%lukB"
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
+			" mapped:%lukB"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -4387,6 +4388,7 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+			K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			!pgdat_reclaimable(pgdat) ? "yes" : "no");
 	}
 
@@ -4411,7 +4413,6 @@ void show_free_areas(unsigned int filter)
 			" mlocked:%lukB"
 			" dirty:%lukB"
 			" writeback:%lukB"
-			" mapped:%lukB"
 			" shmem:%lukB"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			" shmem_thp: %lukB"
@@ -4440,7 +4441,6 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_MLOCK)),
 			K(zone_page_state(zone, NR_FILE_DIRTY)),
 			K(zone_page_state(zone, NR_WRITEBACK)),
-			K(zone_page_state(zone, NR_FILE_MAPPED)),
 			K(zone_page_state(zone, NR_SHMEM)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			K(zone_page_state(zone, NR_SHMEM_THPS) * HPAGE_PMD_NR),
diff --git a/mm/rmap.c b/mm/rmap.c
index 573253efb645..17876517f5fa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1217,7 +1217,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 */
 		if (compound)
 			__inc_zone_page_state(page, NR_ANON_THPS);
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1261,7 +1261,7 @@ void page_add_new_anon_rmap(struct page *page,
 		/* increment count (starts at -1) */
 		atomic_set(&page->_mapcount, 0);
 	}
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
+	__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1296,7 +1296,7 @@ void page_add_file_rmap(struct page *page, bool compound)
 		if (!atomic_inc_and_test(&page->_mapcount))
 			goto out;
 	}
-	__mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, nr);
+	__mod_node_page_state(page_pgdat(page), NR_FILE_MAPPED, nr);
 	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 out:
 	unlock_page_memcg(page);
@@ -1332,11 +1332,11 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 	}
 
 	/*
-	 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
+	 * We use the irq-unsafe __{inc|mod}_zone_page_state because
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, -nr);
+	__mod_node_page_state(page_pgdat(page), NR_FILE_MAPPED, -nr);
 	mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 
 	if (unlikely(PageMlocked(page)))
@@ -1378,7 +1378,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		clear_page_mlock(page);
 
 	if (nr) {
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, -nr);
 		deferred_split_huge_page(page);
 	}
 }
@@ -1407,7 +1407,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_zone_page_state(page, NR_ANON_PAGES);
+	__dec_node_page_state(page, NR_ANON_PAGES);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8d2555dd3ef3..2e8d72d7e268 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3587,7 +3587,7 @@ int sysctl_min_slab_ratio = 5;
 
 static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
 {
-	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+	unsigned long file_mapped = node_page_state(zone->zone_pgdat, NR_FILE_MAPPED);
 	unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
 		node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d17d66e85def..02e7406e8fcd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -925,8 +925,6 @@ const char * const vmstat_text[] = {
 	"nr_zone_anon_lru",
 	"nr_zone_file_lru",
 	"nr_mlock",
-	"nr_anon_pages",
-	"nr_mapped",
 	"nr_file_pages",
 	"nr_dirty",
 	"nr_writeback",
@@ -970,6 +968,8 @@ const char * const vmstat_text[] = {
 	"workingset_refault",
 	"workingset_activate",
 	"workingset_nodereclaim",
+	"nr_anon_pages",
+	"nr_mapped",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 17/34] mm: move page mapped accounting to the node
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

Reclaim makes decisions based on the number of pages that are mapped but
it's mixing node and zone information.  Account NR_FILE_MAPPED and
NR_ANON_PAGES pages on the node.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/tile/mm/pgtable.c |  2 +-
 drivers/base/node.c    |  4 ++--
 fs/proc/meminfo.c      |  4 ++--
 include/linux/mmzone.h |  6 +++---
 mm/page_alloc.c        |  6 +++---
 mm/rmap.c              | 14 +++++++-------
 mm/vmscan.c            |  2 +-
 mm/vmstat.c            |  4 ++--
 8 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 9e389213580d..c606b0ef2f7e 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -55,7 +55,7 @@ void show_mem(unsigned int filter)
 	       global_page_state(NR_FREE_PAGES),
 	       (global_page_state(NR_SLAB_RECLAIMABLE) +
 		global_page_state(NR_SLAB_UNRECLAIMABLE)),
-	       global_page_state(NR_FILE_MAPPED),
+	       global_node_page_state(NR_FILE_MAPPED),
 	       global_page_state(NR_PAGETABLE),
 	       global_page_state(NR_BOUNCE),
 	       global_page_state(NR_FILE_PAGES),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index b7f01a4a642d..acca09536ad9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -121,8 +121,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
 		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
 		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
-		       nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
+		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
+		       nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index cf301a9ef512..b8d52aa2f19a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -140,8 +140,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.freeswap),
 		K(global_page_state(NR_FILE_DIRTY)),
 		K(global_page_state(NR_WRITEBACK)),
-		K(global_page_state(NR_ANON_PAGES)),
-		K(global_page_state(NR_FILE_MAPPED)),
+		K(global_node_page_state(NR_ANON_PAGES)),
+		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
 				global_page_state(NR_SLAB_UNRECLAIMABLE)),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fae2fe3c6942..95d34d1e1fb5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,9 +115,6 @@ enum zone_stat_item {
 	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
 	NR_ZONE_LRU_FILE,
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
-	NR_ANON_PAGES,	/* Mapped anonymous pages */
-	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
-			   only modified from process context */
 	NR_FILE_PAGES,
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
@@ -164,6 +161,9 @@ enum node_stat_item {
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
+	NR_ANON_PAGES,	/* Mapped anonymous pages */
+	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
+			   only modified from process context */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9e113a6ff9a0..78338b51819b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4355,7 +4355,7 @@ void show_free_areas(unsigned int filter)
 		global_page_state(NR_UNSTABLE_NFS),
 		global_page_state(NR_SLAB_RECLAIMABLE),
 		global_page_state(NR_SLAB_UNRECLAIMABLE),
-		global_page_state(NR_FILE_MAPPED),
+		global_node_page_state(NR_FILE_MAPPED),
 		global_page_state(NR_SHMEM),
 		global_page_state(NR_PAGETABLE),
 		global_page_state(NR_BOUNCE),
@@ -4377,6 +4377,7 @@ void show_free_areas(unsigned int filter)
 			" unevictable:%lukB"
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
+			" mapped:%lukB"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -4387,6 +4388,7 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
+			K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			!pgdat_reclaimable(pgdat) ? "yes" : "no");
 	}
 
@@ -4411,7 +4413,6 @@ void show_free_areas(unsigned int filter)
 			" mlocked:%lukB"
 			" dirty:%lukB"
 			" writeback:%lukB"
-			" mapped:%lukB"
 			" shmem:%lukB"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			" shmem_thp: %lukB"
@@ -4440,7 +4441,6 @@ void show_free_areas(unsigned int filter)
 			K(zone_page_state(zone, NR_MLOCK)),
 			K(zone_page_state(zone, NR_FILE_DIRTY)),
 			K(zone_page_state(zone, NR_WRITEBACK)),
-			K(zone_page_state(zone, NR_FILE_MAPPED)),
 			K(zone_page_state(zone, NR_SHMEM)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 			K(zone_page_state(zone, NR_SHMEM_THPS) * HPAGE_PMD_NR),
diff --git a/mm/rmap.c b/mm/rmap.c
index 573253efb645..17876517f5fa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1217,7 +1217,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 */
 		if (compound)
 			__inc_zone_page_state(page, NR_ANON_THPS);
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1261,7 +1261,7 @@ void page_add_new_anon_rmap(struct page *page,
 		/* increment count (starts at -1) */
 		atomic_set(&page->_mapcount, 0);
 	}
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
+	__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1296,7 +1296,7 @@ void page_add_file_rmap(struct page *page, bool compound)
 		if (!atomic_inc_and_test(&page->_mapcount))
 			goto out;
 	}
-	__mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, nr);
+	__mod_node_page_state(page_pgdat(page), NR_FILE_MAPPED, nr);
 	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 out:
 	unlock_page_memcg(page);
@@ -1332,11 +1332,11 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 	}
 
 	/*
-	 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
+	 * We use the irq-unsafe __{inc|mod}_zone_page_state because
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, -nr);
+	__mod_node_page_state(page_pgdat(page), NR_FILE_MAPPED, -nr);
 	mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 
 	if (unlikely(PageMlocked(page)))
@@ -1378,7 +1378,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		clear_page_mlock(page);
 
 	if (nr) {
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, -nr);
 		deferred_split_huge_page(page);
 	}
 }
@@ -1407,7 +1407,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_zone_page_state(page, NR_ANON_PAGES);
+	__dec_node_page_state(page, NR_ANON_PAGES);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8d2555dd3ef3..2e8d72d7e268 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3587,7 +3587,7 @@ int sysctl_min_slab_ratio = 5;
 
 static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
 {
-	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+	unsigned long file_mapped = node_page_state(zone->zone_pgdat, NR_FILE_MAPPED);
 	unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
 		node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d17d66e85def..02e7406e8fcd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -925,8 +925,6 @@ const char * const vmstat_text[] = {
 	"nr_zone_anon_lru",
 	"nr_zone_file_lru",
 	"nr_mlock",
-	"nr_anon_pages",
-	"nr_mapped",
 	"nr_file_pages",
 	"nr_dirty",
 	"nr_writeback",
@@ -970,6 +968,8 @@ const char * const vmstat_text[] = {
 	"workingset_refault",
 	"workingset_activate",
 	"workingset_nodereclaim",
+	"nr_anon_pages",
+	"nr_mapped",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES  is the number of mapped anon pages.

This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
have

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/base/node.c    | 2 +-
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 2 +-
 mm/migrate.c           | 2 +-
 mm/rmap.c              | 8 ++++----
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index acca09536ad9..ac69a7215bcc 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -122,7 +122,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
 		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
-		       nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
+		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b8d52aa2f19a..40f108783d59 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -140,7 +140,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.freeswap),
 		K(global_page_state(NR_FILE_DIRTY)),
 		K(global_page_state(NR_WRITEBACK)),
-		K(global_node_page_state(NR_ANON_PAGES)),
+		K(global_node_page_state(NR_ANON_MAPPED)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 95d34d1e1fb5..2d4a8804eafa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -161,7 +161,7 @@ enum node_stat_item {
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
-	NR_ANON_PAGES,	/* Mapped anonymous pages */
+	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_VM_NODE_STAT_ITEMS
diff --git a/mm/migrate.c b/mm/migrate.c
index 3033dae33a0a..fba770c54d84 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -501,7 +501,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 * new page and drop references to the old page.
 	 *
 	 * Note that anonymous pages are accounted for
-	 * via NR_FILE_PAGES and NR_ANON_PAGES if they
+	 * via NR_FILE_PAGES and NR_ANON_MAPPED if they
 	 * are mapped to swap space.
 	 */
 	if (newzone != oldzone) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 17876517f5fa..a66f80bc8703 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1217,7 +1217,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 */
 		if (compound)
 			__inc_zone_page_state(page, NR_ANON_THPS);
-		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1261,7 +1261,7 @@ void page_add_new_anon_rmap(struct page *page,
 		/* increment count (starts at -1) */
 		atomic_set(&page->_mapcount, 0);
 	}
-	__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
+	__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1378,7 +1378,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		clear_page_mlock(page);
 
 	if (nr) {
-		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, -nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
 		deferred_split_huge_page(page);
 	}
 }
@@ -1407,7 +1407,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_node_page_state(page, NR_ANON_PAGES);
+	__dec_node_page_state(page, NR_ANON_MAPPED);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES  is the number of mapped anon pages.

This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
have

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/base/node.c    | 2 +-
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 2 +-
 mm/migrate.c           | 2 +-
 mm/rmap.c              | 8 ++++----
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index acca09536ad9..ac69a7215bcc 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -122,7 +122,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
 		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
-		       nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
+		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b8d52aa2f19a..40f108783d59 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -140,7 +140,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.freeswap),
 		K(global_page_state(NR_FILE_DIRTY)),
 		K(global_page_state(NR_WRITEBACK)),
-		K(global_node_page_state(NR_ANON_PAGES)),
+		K(global_node_page_state(NR_ANON_MAPPED)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 95d34d1e1fb5..2d4a8804eafa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -161,7 +161,7 @@ enum node_stat_item {
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
-	NR_ANON_PAGES,	/* Mapped anonymous pages */
+	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_VM_NODE_STAT_ITEMS
diff --git a/mm/migrate.c b/mm/migrate.c
index 3033dae33a0a..fba770c54d84 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -501,7 +501,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 * new page and drop references to the old page.
 	 *
 	 * Note that anonymous pages are accounted for
-	 * via NR_FILE_PAGES and NR_ANON_PAGES if they
+	 * via NR_FILE_PAGES and NR_ANON_MAPPED if they
 	 * are mapped to swap space.
 	 */
 	if (newzone != oldzone) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 17876517f5fa..a66f80bc8703 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1217,7 +1217,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 */
 		if (compound)
 			__inc_zone_page_state(page, NR_ANON_THPS);
-		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1261,7 +1261,7 @@ void page_add_new_anon_rmap(struct page *page,
 		/* increment count (starts at -1) */
 		atomic_set(&page->_mapcount, 0);
 	}
-	__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
+	__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1378,7 +1378,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		clear_page_mlock(page);
 
 	if (nr) {
-		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, -nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
 		deferred_split_huge_page(page);
 	}
 }
@@ -1407,7 +1407,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_node_page_state(page, NR_ANON_PAGES);
+	__dec_node_page_state(page, NR_ANON_MAPPED);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 19/34] mm: move most file-based accounting to the node
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone. This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted. Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/s390/appldata/appldata_mem.c             |  2 +-
 arch/tile/mm/pgtable.c                        |  8 +--
 drivers/base/node.c                           | 16 +++---
 drivers/staging/android/lowmemorykiller.c     |  4 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c |  6 ++-
 fs/fs-writeback.c                             |  4 +-
 fs/fuse/file.c                                |  8 +--
 fs/nfs/internal.h                             |  2 +-
 fs/nfs/write.c                                |  2 +-
 fs/proc/meminfo.c                             | 16 +++---
 include/linux/mmzone.h                        | 19 +++----
 include/trace/events/writeback.h              |  6 +--
 mm/filemap.c                                  | 12 ++---
 mm/huge_memory.c                              |  4 +-
 mm/khugepaged.c                               |  6 +--
 mm/migrate.c                                  | 14 ++---
 mm/page-writeback.c                           | 47 ++++++++---------
 mm/page_alloc.c                               | 74 ++++++++++++---------------
 mm/rmap.c                                     | 10 ++--
 mm/shmem.c                                    | 14 ++---
 mm/swap_state.c                               |  4 +-
 mm/util.c                                     |  4 +-
 mm/vmscan.c                                   | 16 +++---
 mm/vmstat.c                                   | 19 +++----
 24 files changed, 155 insertions(+), 162 deletions(-)

diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
index edcf2a706942..598df5708501 100644
--- a/arch/s390/appldata/appldata_mem.c
+++ b/arch/s390/appldata/appldata_mem.c
@@ -102,7 +102,7 @@ static void appldata_get_mem_data(void *data)
 	mem_data->totalhigh = P2K(val.totalhigh);
 	mem_data->freehigh  = P2K(val.freehigh);
 	mem_data->bufferram = P2K(val.bufferram);
-	mem_data->cached    = P2K(global_page_state(NR_FILE_PAGES)
+	mem_data->cached    = P2K(global_node_page_state(NR_FILE_PAGES)
 				- val.bufferram);
 
 	si_swapinfo(&val);
diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index c606b0ef2f7e..7cc6ee7f1a58 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -49,16 +49,16 @@ void show_mem(unsigned int filter)
 		global_node_page_state(NR_ACTIVE_FILE)),
 	       (global_node_page_state(NR_INACTIVE_ANON) +
 		global_node_page_state(NR_INACTIVE_FILE)),
-	       global_page_state(NR_FILE_DIRTY),
-	       global_page_state(NR_WRITEBACK),
-	       global_page_state(NR_UNSTABLE_NFS),
+	       global_node_page_state(NR_FILE_DIRTY),
+	       global_node_page_state(NR_WRITEBACK),
+	       global_node_page_state(NR_UNSTABLE_NFS),
 	       global_page_state(NR_FREE_PAGES),
 	       (global_page_state(NR_SLAB_RECLAIMABLE) +
 		global_page_state(NR_SLAB_UNRECLAIMABLE)),
 	       global_node_page_state(NR_FILE_MAPPED),
 	       global_page_state(NR_PAGETABLE),
 	       global_page_state(NR_BOUNCE),
-	       global_page_state(NR_FILE_PAGES),
+	       global_node_page_state(NR_FILE_PAGES),
 	       get_nr_swap_pages());
 
 	for_each_zone(zone) {
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ac69a7215bcc..89e4f96e0834 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -118,28 +118,28 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d ShmemPmdMapped: %8lu kB\n"
 #endif
 			,
-		       nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
-		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
-		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
+		       nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+		       nid, K(node_page_state(pgdat, NR_WRITEBACK)),
+		       nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
 		       nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_UNSTABLE_NFS)),
+		       nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 		       nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK_TEMP)),
+		       nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) +
 				sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ANON_THPS) *
+		       nid, K(node_page_state(pgdat, NR_ANON_THPS) *
 				       HPAGE_PMD_NR),
-		       nid, K(sum_zone_node_page_state(nid, NR_SHMEM_THPS) *
+		       nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
 				       HPAGE_PMD_NR),
-		       nid, K(sum_zone_node_page_state(nid, NR_SHMEM_PMDMAPPED) *
+		       nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
 				       HPAGE_PMD_NR));
 #else
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 93dbcc38eb0f..45a1b4ec4ca3 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -91,8 +91,8 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 	short selected_oom_score_adj;
 	int array_size = ARRAY_SIZE(lowmem_adj);
 	int other_free = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
-	int other_file = global_page_state(NR_FILE_PAGES) -
-						global_page_state(NR_SHMEM) -
+	int other_file = global_node_page_state(NR_FILE_PAGES) -
+						global_node_page_state(NR_SHMEM) -
 						total_swapcache_pages();
 
 	if (lowmem_adj_size < array_size)
diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
index d1a7d6beee60..d011135802d5 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cache.c
+++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
@@ -1864,7 +1864,8 @@ void osc_dec_unstable_pages(struct ptlrpc_request *req)
 	LASSERT(page_count >= 0);
 
 	for (i = 0; i < page_count; i++)
-		dec_zone_page_state(desc->bd_iov[i].kiov_page, NR_UNSTABLE_NFS);
+		dec_node_page_state(desc->bd_iov[i].kiov_page,
+							NR_UNSTABLE_NFS);
 
 	atomic_sub(page_count, &cli->cl_cache->ccc_unstable_nr);
 	LASSERT(atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0);
@@ -1898,7 +1899,8 @@ void osc_inc_unstable_pages(struct ptlrpc_request *req)
 	LASSERT(page_count >= 0);
 
 	for (i = 0; i < page_count; i++)
-		inc_zone_page_state(desc->bd_iov[i].kiov_page, NR_UNSTABLE_NFS);
+		inc_node_page_state(desc->bd_iov[i].kiov_page,
+							NR_UNSTABLE_NFS);
 
 	LASSERT(atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0);
 	atomic_add(page_count, &cli->cl_cache->ccc_unstable_nr);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e21d20bc8a54..a6ca1cb2831b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1807,8 +1807,8 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
  */
 static unsigned long get_nr_dirty_pages(void)
 {
-	return global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS) +
+	return global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS) +
 		get_nr_dirty_inodes();
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7270e89880b5..1b96fa4a966f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1451,7 +1451,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	list_del(&req->writepages_entry);
 	for (i = 0; i < req->num_pages; i++) {
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+		dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
 		wb_writeout_inc(&bdi->wb);
 	}
 	wake_up(&fi->page_waitq);
@@ -1641,7 +1641,7 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+	inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	spin_lock(&fc->lock);
 	list_add(&req->writepages_entry, &fi->writepages);
@@ -1755,7 +1755,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 		spin_unlock(&fc->lock);
 
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-		dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+		dec_node_page_state(page, NR_WRITEBACK_TEMP);
 		wb_writeout_inc(&bdi->wb);
 		fuse_writepage_free(fc, new_req);
 		fuse_request_free(new_req);
@@ -1854,7 +1854,7 @@ static int fuse_writepages_fill(struct page *page,
 	req->page_descs[req->num_pages].length = PAGE_SIZE;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+	inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	err = 0;
 	if (is_writeback && fuse_writepage_in_flight(req, page)) {
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 898e66cc5089..514f096b3bdb 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -655,7 +655,7 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
 	if (!cinfo->dreq) {
 		struct inode *inode = page_file_mapping(page)->host;
 
-		inc_zone_page_state(page, NR_UNSTABLE_NFS);
+		inc_node_page_state(page, NR_UNSTABLE_NFS);
 		inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
 		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 	}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 3087fb6f1983..4715549be0c3 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -887,7 +887,7 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg,
 static void
 nfs_clear_page_commit(struct page *page)
 {
-	dec_zone_page_state(page, NR_UNSTABLE_NFS);
+	dec_node_page_state(page, NR_UNSTABLE_NFS);
 	dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
 		    WB_RECLAIMABLE);
 }
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 40f108783d59..c1fdcc1a907a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -40,7 +40,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	si_swapinfo(&i);
 	committed = percpu_counter_read_positive(&vm_committed_as);
 
-	cached = global_page_state(NR_FILE_PAGES) -
+	cached = global_node_page_state(NR_FILE_PAGES) -
 			total_swapcache_pages() - i.bufferram;
 	if (cached < 0)
 		cached = 0;
@@ -138,8 +138,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 		K(i.totalswap),
 		K(i.freeswap),
-		K(global_page_state(NR_FILE_DIRTY)),
-		K(global_page_state(NR_WRITEBACK)),
+		K(global_node_page_state(NR_FILE_DIRTY)),
+		K(global_node_page_state(NR_WRITEBACK)),
 		K(global_node_page_state(NR_ANON_MAPPED)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
@@ -152,9 +152,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_QUICKLIST
 		K(quicklist_total_size()),
 #endif
-		K(global_page_state(NR_UNSTABLE_NFS)),
+		K(global_node_page_state(NR_UNSTABLE_NFS)),
 		K(global_page_state(NR_BOUNCE)),
-		K(global_page_state(NR_WRITEBACK_TEMP)),
+		K(global_node_page_state(NR_WRITEBACK_TEMP)),
 		K(vm_commit_limit()),
 		K(committed),
 		(unsigned long)VMALLOC_TOTAL >> 10,
@@ -164,9 +164,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		, atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		, K(global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
-		, K(global_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR)
-		, K(global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
+		, K(global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
+		, K(global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR)
+		, K(global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
 #endif
 #ifdef CONFIG_CMA
 		, K(totalcma_pages)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2d4a8804eafa..acd4665c3025 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -114,21 +114,16 @@ enum zone_stat_item {
 	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
 	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
 	NR_ZONE_LRU_FILE,
+	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
-	NR_FILE_PAGES,
-	NR_FILE_DIRTY,
-	NR_WRITEBACK,
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
 	NR_KERNEL_STACK,
 	/* Second 128 byte cacheline */
-	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
-	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
-	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 #if IS_ENABLED(CONFIG_ZSMALLOC)
@@ -142,9 +137,6 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
-	NR_ANON_THPS,
-	NR_SHMEM_THPS,
-	NR_SHMEM_PMDMAPPED,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
@@ -164,6 +156,15 @@ enum node_stat_item {
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
+	NR_FILE_PAGES,
+	NR_FILE_DIRTY,
+	NR_WRITEBACK,
+	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
+	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
+	NR_SHMEM_THPS,
+	NR_SHMEM_PMDMAPPED,
+	NR_ANON_THPS,
+	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 531f5811ff6b..ad20f2d2b1f9 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -412,9 +412,9 @@ TRACE_EVENT(global_dirty_state,
 	),
 
 	TP_fast_assign(
-		__entry->nr_dirty	= global_page_state(NR_FILE_DIRTY);
-		__entry->nr_writeback	= global_page_state(NR_WRITEBACK);
-		__entry->nr_unstable	= global_page_state(NR_UNSTABLE_NFS);
+		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
+		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
+		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
 		__entry->nr_dirtied	= global_page_state(NR_DIRTIED);
 		__entry->nr_written	= global_page_state(NR_WRITTEN);
 		__entry->background_thresh = background_thresh;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7ec50bd6f88c..c5f5e46c6f7f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -218,11 +218,11 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 	/* hugetlb pages do not participate in page cache accounting. */
 	if (!PageHuge(page))
-		__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page)) {
-		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
+		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
 		if (PageTransHuge(page))
-			__dec_zone_page_state(page, NR_SHMEM_THPS);
+			__dec_node_page_state(page, NR_SHMEM_THPS);
 	} else {
 		VM_BUG_ON_PAGE(PageTransHuge(page) && !PageHuge(page), page);
 	}
@@ -568,9 +568,9 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		 * hugetlb pages do not participate in page cache accounting.
 		 */
 		if (!PageHuge(new))
-			__inc_zone_page_state(new, NR_FILE_PAGES);
+			__inc_node_page_state(new, NR_FILE_PAGES);
 		if (PageSwapBacked(new))
-			__inc_zone_page_state(new, NR_SHMEM);
+			__inc_node_page_state(new, NR_SHMEM);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		mem_cgroup_migrate(old, new);
 		radix_tree_preload_end();
@@ -677,7 +677,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
 	/* hugetlb pages do not participate in page cache accounting. */
 	if (!huge)
-		__inc_zone_page_state(page, NR_FILE_PAGES);
+		__inc_node_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d5b2207cfd2..8ec69736de18 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1591,7 +1591,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
 		/* Last compound_mapcount is gone. */
-		__dec_zone_page_state(page, NR_ANON_THPS);
+		__dec_node_page_state(page, NR_ANON_THPS);
 		if (TestClearPageDoubleMap(page)) {
 			/* No need in mapcount reference anymore */
 			for (i = 0; i < HPAGE_PMD_NR; i++)
@@ -2073,7 +2073,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			list_del(page_deferred_list(head));
 		}
 		if (mapping)
-			__dec_zone_page_state(page, NR_SHMEM_THPS);
+			__dec_node_page_state(page, NR_SHMEM_THPS);
 		spin_unlock(&pgdata->split_queue_lock);
 		__split_huge_page(page, list, flags);
 		ret = 0;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d7a49f665f04..d907cdc3dc28 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1474,10 +1474,10 @@ static void collapse_shmem(struct mm_struct *mm,
 		}
 
 		local_irq_save(flags);
-		__inc_zone_page_state(new_page, NR_SHMEM_THPS);
+		__inc_node_page_state(new_page, NR_SHMEM_THPS);
 		if (nr_none) {
-			__mod_zone_page_state(zone, NR_FILE_PAGES, nr_none);
-			__mod_zone_page_state(zone, NR_SHMEM, nr_none);
+			__mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
+			__mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
 		}
 		local_irq_restore(flags);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index fba770c54d84..c77997dc6ed7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -505,15 +505,17 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 * are mapped to swap space.
 	 */
 	if (newzone != oldzone) {
-		__dec_zone_state(oldzone, NR_FILE_PAGES);
-		__inc_zone_state(newzone, NR_FILE_PAGES);
+		__dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
+		__inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
 		if (PageSwapBacked(page) && !PageSwapCache(page)) {
-			__dec_zone_state(oldzone, NR_SHMEM);
-			__inc_zone_state(newzone, NR_SHMEM);
+			__dec_node_state(oldzone->zone_pgdat, NR_SHMEM);
+			__inc_node_state(newzone->zone_pgdat, NR_SHMEM);
 		}
 		if (dirty && mapping_cap_account_dirty(mapping)) {
-			__dec_zone_state(oldzone, NR_FILE_DIRTY);
-			__inc_zone_state(newzone, NR_FILE_DIRTY);
+			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
+			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
+			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
+			__dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);
 		}
 	}
 	local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f7c0fb993fb9..f97591d9fa00 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -498,20 +498,12 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
  */
 bool node_dirty_ok(struct pglist_data *pgdat)
 {
-	int z;
 	unsigned long limit = node_dirty_limit(pgdat);
 	unsigned long nr_pages = 0;
 
-	for (z = 0; z < MAX_NR_ZONES; z++) {
-		struct zone *zone = pgdat->node_zones + z;
-
-		if (!populated_zone(zone))
-			continue;
-
-		nr_pages += zone_page_state(zone, NR_FILE_DIRTY);
-		nr_pages += zone_page_state(zone, NR_UNSTABLE_NFS);
-		nr_pages += zone_page_state(zone, NR_WRITEBACK);
-	}
+	nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
+	nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
+	nr_pages += node_page_state(pgdat, NR_WRITEBACK);
 
 	return nr_pages <= limit;
 }
@@ -1601,10 +1593,10 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
+		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
+					global_node_page_state(NR_UNSTABLE_NFS);
 		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
 
 		domain_dirty_limits(gdtc);
 
@@ -1941,8 +1933,8 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	 * as we're trying to decide whether to put more under writeback.
 	 */
 	gdtc->avail = global_dirtyable_memory();
-	gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
-		      global_page_state(NR_UNSTABLE_NFS);
+	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
+		      global_node_page_state(NR_UNSTABLE_NFS);
 	domain_dirty_limits(gdtc);
 
 	if (gdtc->dirty > gdtc->bg_thresh)
@@ -1986,8 +1978,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
+                if (global_node_page_state(NR_UNSTABLE_NFS) +
+			global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
@@ -2015,8 +2007,8 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void laptop_mode_timer_fn(unsigned long data)
 {
 	struct request_queue *q = (struct request_queue *)data;
-	int nr_pages = global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS);
+	int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS);
 	struct bdi_writeback *wb;
 
 	/*
@@ -2467,7 +2459,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		wb = inode_to_wb(inode);
 
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-		__inc_zone_page_state(page, NR_FILE_DIRTY);
+		__inc_node_page_state(page, NR_FILE_DIRTY);
+		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
@@ -2488,7 +2481,8 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
 {
 	if (mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-		dec_zone_page_state(page, NR_FILE_DIRTY);
+		dec_node_page_state(page, NR_FILE_DIRTY);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_SIZE);
 	}
@@ -2744,7 +2738,8 @@ int clear_page_dirty_for_io(struct page *page)
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_node_page_state(page, NR_FILE_DIRTY);
+			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
@@ -2790,7 +2785,8 @@ int test_clear_page_writeback(struct page *page)
 	}
 	if (ret) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
-		dec_zone_page_state(page, NR_WRITEBACK);
+		dec_node_page_state(page, NR_WRITEBACK);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
@@ -2844,7 +2840,8 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	}
 	if (!ret) {
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
-		inc_zone_page_state(page, NR_WRITEBACK);
+		inc_node_page_state(page, NR_WRITEBACK);
+		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78338b51819b..fd34b305c8ee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3535,14 +3535,12 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 			 * prevent from pre mature OOM
 			 */
 			if (!did_some_progress) {
-				unsigned long writeback;
-				unsigned long dirty;
+				unsigned long write_pending;
 
-				writeback = zone_page_state_snapshot(zone,
-								     NR_WRITEBACK);
-				dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+				write_pending = zone_page_state_snapshot(zone,
+							NR_ZONE_WRITE_PENDING);
 
-				if (2*(writeback + dirty) > reclaimable) {
+				if (2 * write_pending > reclaimable) {
 					congestion_wait(BLK_RW_ASYNC, HZ/10);
 					return true;
 				}
@@ -4218,7 +4216,7 @@ EXPORT_SYMBOL_GPL(si_mem_available);
 void si_meminfo(struct sysinfo *val)
 {
 	val->totalram = totalram_pages;
-	val->sharedram = global_page_state(NR_SHMEM);
+	val->sharedram = global_node_page_state(NR_SHMEM);
 	val->freeram = global_page_state(NR_FREE_PAGES);
 	val->bufferram = nr_blockdev_pages();
 	val->totalhigh = totalhigh_pages;
@@ -4240,7 +4238,7 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
 		managed_pages += pgdat->node_zones[zone_type].managed_pages;
 	val->totalram = managed_pages;
-	val->sharedram = sum_zone_node_page_state(nid, NR_SHMEM);
+	val->sharedram = node_page_state(pgdat, NR_SHMEM);
 	val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
 #ifdef CONFIG_HIGHMEM
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
@@ -4339,9 +4337,6 @@ void show_free_areas(unsigned int filter)
 		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		" anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
-#endif
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
 		global_node_page_state(NR_INACTIVE_ANON),
@@ -4350,20 +4345,15 @@ void show_free_areas(unsigned int filter)
 		global_node_page_state(NR_INACTIVE_FILE),
 		global_node_page_state(NR_ISOLATED_FILE),
 		global_node_page_state(NR_UNEVICTABLE),
-		global_page_state(NR_FILE_DIRTY),
-		global_page_state(NR_WRITEBACK),
-		global_page_state(NR_UNSTABLE_NFS),
+		global_node_page_state(NR_FILE_DIRTY),
+		global_node_page_state(NR_WRITEBACK),
+		global_node_page_state(NR_UNSTABLE_NFS),
 		global_page_state(NR_SLAB_RECLAIMABLE),
 		global_page_state(NR_SLAB_UNRECLAIMABLE),
 		global_node_page_state(NR_FILE_MAPPED),
-		global_page_state(NR_SHMEM),
+		global_node_page_state(NR_SHMEM),
 		global_page_state(NR_PAGETABLE),
 		global_page_state(NR_BOUNCE),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR,
-		global_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR,
-		global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR,
-#endif
 		global_page_state(NR_FREE_PAGES),
 		free_pcp,
 		global_page_state(NR_FREE_CMA_PAGES));
@@ -4378,6 +4368,16 @@ void show_free_areas(unsigned int filter)
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
 			" mapped:%lukB"
+			" dirty:%lukB"
+			" writeback:%lukB"
+			" shmem:%lukB"
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			" shmem_thp: %lukB"
+			" shmem_pmdmapped: %lukB"
+			" anon_thp: %lukB"
+#endif
+			" writeback_tmp:%lukB"
+			" unstable:%lukB"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -4389,6 +4389,17 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
+			K(node_page_state(pgdat, NR_FILE_DIRTY)),
+			K(node_page_state(pgdat, NR_WRITEBACK)),
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
+					* HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
+#endif
+			K(node_page_state(pgdat, NR_SHMEM)),
+			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
+			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 			!pgdat_reclaimable(pgdat) ? "yes" : "no");
 	}
 
@@ -4411,24 +4422,14 @@ void show_free_areas(unsigned int filter)
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
-			" dirty:%lukB"
-			" writeback:%lukB"
-			" shmem:%lukB"
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-			" shmem_thp: %lukB"
-			" shmem_pmdmapped: %lukB"
-			" anon_thp: %lukB"
-#endif
 			" slab_reclaimable:%lukB"
 			" slab_unreclaimable:%lukB"
 			" kernel_stack:%lukB"
 			" pagetables:%lukB"
-			" unstable:%lukB"
 			" bounce:%lukB"
 			" free_pcp:%lukB"
 			" local_pcp:%ukB"
 			" free_cma:%lukB"
-			" writeback_tmp:%lukB"
 			" node_pages_scanned:%lu"
 			"\n",
 			zone->name,
@@ -4439,26 +4440,15 @@ void show_free_areas(unsigned int filter)
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
-			K(zone_page_state(zone, NR_FILE_DIRTY)),
-			K(zone_page_state(zone, NR_WRITEBACK)),
-			K(zone_page_state(zone, NR_SHMEM)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-			K(zone_page_state(zone, NR_SHMEM_THPS) * HPAGE_PMD_NR),
-			K(zone_page_state(zone, NR_SHMEM_PMDMAPPED)
-					* HPAGE_PMD_NR),
-			K(zone_page_state(zone, NR_ANON_THPS) * HPAGE_PMD_NR),
-#endif
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
 			zone_page_state(zone, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
 			K(zone_page_state(zone, NR_PAGETABLE)),
-			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
 			K(zone_page_state(zone, NR_BOUNCE)),
 			K(free_pcp),
 			K(this_cpu_read(zone->pageset->pcp.count)),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
-			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
 			K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
@@ -4501,7 +4491,7 @@ void show_free_areas(unsigned int filter)
 
 	hugetlb_show_meminfo();
 
-	printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
+	printk("%ld total pagecache pages\n", global_node_page_state(NR_FILE_PAGES));
 
 	show_swap_cache_info();
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index a66f80bc8703..5b6dc9e33f7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1216,7 +1216,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound)
-			__inc_zone_page_state(page, NR_ANON_THPS);
+			__inc_node_page_state(page, NR_ANON_THPS);
 		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
 	}
 	if (unlikely(PageKsm(page)))
@@ -1254,7 +1254,7 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
-		__inc_zone_page_state(page, NR_ANON_THPS);
+		__inc_node_page_state(page, NR_ANON_THPS);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1285,7 +1285,7 @@ void page_add_file_rmap(struct page *page, bool compound)
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
-		__inc_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
 		if (PageTransCompound(page)) {
 			VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1325,7 +1325,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 			goto out;
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
-		__dec_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
 		if (!atomic_add_negative(-1, &page->_mapcount))
 			goto out;
@@ -1359,7 +1359,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	__dec_zone_page_state(page, NR_ANON_THPS);
+	__dec_node_page_state(page, NR_ANON_THPS);
 
 	if (TestClearPageDoubleMap(page)) {
 		/*
diff --git a/mm/shmem.c b/mm/shmem.c
index bfaa007ccb58..8975df09ec26 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -575,9 +575,9 @@ static int shmem_add_to_page_cache(struct page *page,
 	if (!error) {
 		mapping->nrpages += nr;
 		if (PageTransHuge(page))
-			__inc_zone_page_state(page, NR_SHMEM_THPS);
-		__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
-		__mod_zone_page_state(page_zone(page), NR_SHMEM, nr);
+			__inc_node_page_state(page, NR_SHMEM_THPS);
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
+		__mod_node_page_state(page_pgdat(page), NR_SHMEM, nr);
 		spin_unlock_irq(&mapping->tree_lock);
 	} else {
 		page->mapping = NULL;
@@ -601,8 +601,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 	error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
 	page->mapping = NULL;
 	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
-	__dec_zone_page_state(page, NR_SHMEM);
+	__dec_node_page_state(page, NR_FILE_PAGES);
+	__dec_node_page_state(page, NR_SHMEM);
 	spin_unlock_irq(&mapping->tree_lock);
 	put_page(page);
 	BUG_ON(error);
@@ -1493,8 +1493,8 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 	error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
 								   newpage);
 	if (!error) {
-		__inc_zone_page_state(newpage, NR_FILE_PAGES);
-		__dec_zone_page_state(oldpage, NR_FILE_PAGES);
+		__inc_node_page_state(newpage, NR_FILE_PAGES);
+		__dec_node_page_state(oldpage, NR_FILE_PAGES);
 	}
 	spin_unlock_irq(&swap_mapping->tree_lock);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c99463ac02fb..c8310a37be3a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -95,7 +95,7 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 					entry.val, page);
 	if (likely(!error)) {
 		address_space->nrpages++;
-		__inc_zone_page_state(page, NR_FILE_PAGES);
+		__inc_node_page_state(page, NR_FILE_PAGES);
 		INC_CACHE_INFO(add_total);
 	}
 	spin_unlock_irq(&address_space->tree_lock);
@@ -147,7 +147,7 @@ void __delete_from_swap_cache(struct page *page)
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
 	address_space->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+	__dec_node_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
 }
 
diff --git a/mm/util.c b/mm/util.c
index 8d010ef2ce1c..662cddf914af 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -528,7 +528,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 
 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
 		free = global_page_state(NR_FREE_PAGES);
-		free += global_page_state(NR_FILE_PAGES);
+		free += global_node_page_state(NR_FILE_PAGES);
 
 		/*
 		 * shmem pages shouldn't be counted as free in this
@@ -536,7 +536,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		 * that won't affect the overall amount of available
 		 * memory in the system.
 		 */
-		free -= global_page_state(NR_SHMEM);
+		free -= global_node_page_state(NR_SHMEM);
 
 		free += get_nr_swap_pages();
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e8d72d7e268..aef2a6245657 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3585,11 +3585,11 @@ int sysctl_min_unmapped_ratio = 1;
  */
 int sysctl_min_slab_ratio = 5;
 
-static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
+static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
 {
-	unsigned long file_mapped = node_page_state(zone->zone_pgdat, NR_FILE_MAPPED);
-	unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
-		node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
+	unsigned long file_mapped = node_page_state(pgdat, NR_FILE_MAPPED);
+	unsigned long file_lru = node_page_state(pgdat, NR_INACTIVE_FILE) +
+		node_page_state(pgdat, NR_ACTIVE_FILE);
 
 	/*
 	 * It's possible for there to be more file mapped pages than
@@ -3608,17 +3608,17 @@ static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 	/*
 	 * If RECLAIM_UNMAP is set, then all file pages are considered
 	 * potentially reclaimable. Otherwise, we have to worry about
-	 * pages like swapcache and zone_unmapped_file_pages() provides
+	 * pages like swapcache and node_unmapped_file_pages() provides
 	 * a better estimate
 	 */
 	if (zone_reclaim_mode & RECLAIM_UNMAP)
-		nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
+		nr_pagecache_reclaimable = node_page_state(zone->zone_pgdat, NR_FILE_PAGES);
 	else
-		nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
+		nr_pagecache_reclaimable = node_unmapped_file_pages(zone->zone_pgdat);
 
 	/* If we can't clean pages, remove dirty pages from consideration */
 	if (!(zone_reclaim_mode & RECLAIM_WRITE))
-		delta += zone_page_state(zone, NR_FILE_DIRTY);
+		delta += node_page_state(zone->zone_pgdat, NR_FILE_DIRTY);
 
 	/* Watch for any possible underflows due to delta */
 	if (unlikely(delta > nr_pagecache_reclaimable))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 02e7406e8fcd..455392158062 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -924,20 +924,15 @@ const char * const vmstat_text[] = {
 	"nr_alloc_batch",
 	"nr_zone_anon_lru",
 	"nr_zone_file_lru",
+	"nr_zone_write_pending",
 	"nr_mlock",
-	"nr_file_pages",
-	"nr_dirty",
-	"nr_writeback",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
 	"nr_page_table_pages",
 	"nr_kernel_stack",
-	"nr_unstable",
 	"nr_bounce",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
-	"nr_writeback_temp",
-	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
 #if IS_ENABLED(CONFIG_ZSMALLOC)
@@ -951,9 +946,6 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
-	"nr_anon_transparent_hugepages",
-	"nr_shmem_hugepages",
-	"nr_shmem_pmdmapped",
 	"nr_free_cma",
 
 	/* Node-based counters */
@@ -970,6 +962,15 @@ const char * const vmstat_text[] = {
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
+	"nr_file_pages",
+	"nr_dirty",
+	"nr_writeback",
+	"nr_writeback_temp",
+	"nr_shmem",
+	"nr_shmem_hugepages",
+	"nr_shmem_pmdmapped",
+	"nr_anon_transparent_hugepages",
+	"nr_unstable",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 19/34] mm: move most file-based accounting to the node
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone. This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted. Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/s390/appldata/appldata_mem.c             |  2 +-
 arch/tile/mm/pgtable.c                        |  8 +--
 drivers/base/node.c                           | 16 +++---
 drivers/staging/android/lowmemorykiller.c     |  4 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c |  6 ++-
 fs/fs-writeback.c                             |  4 +-
 fs/fuse/file.c                                |  8 +--
 fs/nfs/internal.h                             |  2 +-
 fs/nfs/write.c                                |  2 +-
 fs/proc/meminfo.c                             | 16 +++---
 include/linux/mmzone.h                        | 19 +++----
 include/trace/events/writeback.h              |  6 +--
 mm/filemap.c                                  | 12 ++---
 mm/huge_memory.c                              |  4 +-
 mm/khugepaged.c                               |  6 +--
 mm/migrate.c                                  | 14 ++---
 mm/page-writeback.c                           | 47 ++++++++---------
 mm/page_alloc.c                               | 74 ++++++++++++---------------
 mm/rmap.c                                     | 10 ++--
 mm/shmem.c                                    | 14 ++---
 mm/swap_state.c                               |  4 +-
 mm/util.c                                     |  4 +-
 mm/vmscan.c                                   | 16 +++---
 mm/vmstat.c                                   | 19 +++----
 24 files changed, 155 insertions(+), 162 deletions(-)

diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
index edcf2a706942..598df5708501 100644
--- a/arch/s390/appldata/appldata_mem.c
+++ b/arch/s390/appldata/appldata_mem.c
@@ -102,7 +102,7 @@ static void appldata_get_mem_data(void *data)
 	mem_data->totalhigh = P2K(val.totalhigh);
 	mem_data->freehigh  = P2K(val.freehigh);
 	mem_data->bufferram = P2K(val.bufferram);
-	mem_data->cached    = P2K(global_page_state(NR_FILE_PAGES)
+	mem_data->cached    = P2K(global_node_page_state(NR_FILE_PAGES)
 				- val.bufferram);
 
 	si_swapinfo(&val);
diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index c606b0ef2f7e..7cc6ee7f1a58 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -49,16 +49,16 @@ void show_mem(unsigned int filter)
 		global_node_page_state(NR_ACTIVE_FILE)),
 	       (global_node_page_state(NR_INACTIVE_ANON) +
 		global_node_page_state(NR_INACTIVE_FILE)),
-	       global_page_state(NR_FILE_DIRTY),
-	       global_page_state(NR_WRITEBACK),
-	       global_page_state(NR_UNSTABLE_NFS),
+	       global_node_page_state(NR_FILE_DIRTY),
+	       global_node_page_state(NR_WRITEBACK),
+	       global_node_page_state(NR_UNSTABLE_NFS),
 	       global_page_state(NR_FREE_PAGES),
 	       (global_page_state(NR_SLAB_RECLAIMABLE) +
 		global_page_state(NR_SLAB_UNRECLAIMABLE)),
 	       global_node_page_state(NR_FILE_MAPPED),
 	       global_page_state(NR_PAGETABLE),
 	       global_page_state(NR_BOUNCE),
-	       global_page_state(NR_FILE_PAGES),
+	       global_node_page_state(NR_FILE_PAGES),
 	       get_nr_swap_pages());
 
 	for_each_zone(zone) {
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ac69a7215bcc..89e4f96e0834 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -118,28 +118,28 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d ShmemPmdMapped: %8lu kB\n"
 #endif
 			,
-		       nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
-		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
-		       nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
+		       nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+		       nid, K(node_page_state(pgdat, NR_WRITEBACK)),
+		       nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
 		       nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_UNSTABLE_NFS)),
+		       nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 		       nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK_TEMP)),
+		       nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) +
 				sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
-		       nid, K(sum_zone_node_page_state(nid, NR_ANON_THPS) *
+		       nid, K(node_page_state(pgdat, NR_ANON_THPS) *
 				       HPAGE_PMD_NR),
-		       nid, K(sum_zone_node_page_state(nid, NR_SHMEM_THPS) *
+		       nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
 				       HPAGE_PMD_NR),
-		       nid, K(sum_zone_node_page_state(nid, NR_SHMEM_PMDMAPPED) *
+		       nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
 				       HPAGE_PMD_NR));
 #else
 		       nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 93dbcc38eb0f..45a1b4ec4ca3 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -91,8 +91,8 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 	short selected_oom_score_adj;
 	int array_size = ARRAY_SIZE(lowmem_adj);
 	int other_free = global_page_state(NR_FREE_PAGES) - totalreserve_pages;
-	int other_file = global_page_state(NR_FILE_PAGES) -
-						global_page_state(NR_SHMEM) -
+	int other_file = global_node_page_state(NR_FILE_PAGES) -
+						global_node_page_state(NR_SHMEM) -
 						total_swapcache_pages();
 
 	if (lowmem_adj_size < array_size)
diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
index d1a7d6beee60..d011135802d5 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cache.c
+++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
@@ -1864,7 +1864,8 @@ void osc_dec_unstable_pages(struct ptlrpc_request *req)
 	LASSERT(page_count >= 0);
 
 	for (i = 0; i < page_count; i++)
-		dec_zone_page_state(desc->bd_iov[i].kiov_page, NR_UNSTABLE_NFS);
+		dec_node_page_state(desc->bd_iov[i].kiov_page,
+							NR_UNSTABLE_NFS);
 
 	atomic_sub(page_count, &cli->cl_cache->ccc_unstable_nr);
 	LASSERT(atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0);
@@ -1898,7 +1899,8 @@ void osc_inc_unstable_pages(struct ptlrpc_request *req)
 	LASSERT(page_count >= 0);
 
 	for (i = 0; i < page_count; i++)
-		inc_zone_page_state(desc->bd_iov[i].kiov_page, NR_UNSTABLE_NFS);
+		inc_node_page_state(desc->bd_iov[i].kiov_page,
+							NR_UNSTABLE_NFS);
 
 	LASSERT(atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0);
 	atomic_add(page_count, &cli->cl_cache->ccc_unstable_nr);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e21d20bc8a54..a6ca1cb2831b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1807,8 +1807,8 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
  */
 static unsigned long get_nr_dirty_pages(void)
 {
-	return global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS) +
+	return global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS) +
 		get_nr_dirty_inodes();
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7270e89880b5..1b96fa4a966f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1451,7 +1451,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	list_del(&req->writepages_entry);
 	for (i = 0; i < req->num_pages; i++) {
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+		dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
 		wb_writeout_inc(&bdi->wb);
 	}
 	wake_up(&fi->page_waitq);
@@ -1641,7 +1641,7 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+	inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	spin_lock(&fc->lock);
 	list_add(&req->writepages_entry, &fi->writepages);
@@ -1755,7 +1755,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 		spin_unlock(&fc->lock);
 
 		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
-		dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+		dec_node_page_state(page, NR_WRITEBACK_TEMP);
 		wb_writeout_inc(&bdi->wb);
 		fuse_writepage_free(fc, new_req);
 		fuse_request_free(new_req);
@@ -1854,7 +1854,7 @@ static int fuse_writepages_fill(struct page *page,
 	req->page_descs[req->num_pages].length = PAGE_SIZE;
 
 	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
-	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+	inc_node_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	err = 0;
 	if (is_writeback && fuse_writepage_in_flight(req, page)) {
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 898e66cc5089..514f096b3bdb 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -655,7 +655,7 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
 	if (!cinfo->dreq) {
 		struct inode *inode = page_file_mapping(page)->host;
 
-		inc_zone_page_state(page, NR_UNSTABLE_NFS);
+		inc_node_page_state(page, NR_UNSTABLE_NFS);
 		inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
 		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 	}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 3087fb6f1983..4715549be0c3 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -887,7 +887,7 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg,
 static void
 nfs_clear_page_commit(struct page *page)
 {
-	dec_zone_page_state(page, NR_UNSTABLE_NFS);
+	dec_node_page_state(page, NR_UNSTABLE_NFS);
 	dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
 		    WB_RECLAIMABLE);
 }
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 40f108783d59..c1fdcc1a907a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -40,7 +40,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	si_swapinfo(&i);
 	committed = percpu_counter_read_positive(&vm_committed_as);
 
-	cached = global_page_state(NR_FILE_PAGES) -
+	cached = global_node_page_state(NR_FILE_PAGES) -
 			total_swapcache_pages() - i.bufferram;
 	if (cached < 0)
 		cached = 0;
@@ -138,8 +138,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 		K(i.totalswap),
 		K(i.freeswap),
-		K(global_page_state(NR_FILE_DIRTY)),
-		K(global_page_state(NR_WRITEBACK)),
+		K(global_node_page_state(NR_FILE_DIRTY)),
+		K(global_node_page_state(NR_WRITEBACK)),
 		K(global_node_page_state(NR_ANON_MAPPED)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
@@ -152,9 +152,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_QUICKLIST
 		K(quicklist_total_size()),
 #endif
-		K(global_page_state(NR_UNSTABLE_NFS)),
+		K(global_node_page_state(NR_UNSTABLE_NFS)),
 		K(global_page_state(NR_BOUNCE)),
-		K(global_page_state(NR_WRITEBACK_TEMP)),
+		K(global_node_page_state(NR_WRITEBACK_TEMP)),
 		K(vm_commit_limit()),
 		K(committed),
 		(unsigned long)VMALLOC_TOTAL >> 10,
@@ -164,9 +164,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		, atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		, K(global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
-		, K(global_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR)
-		, K(global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
+		, K(global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
+		, K(global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR)
+		, K(global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
 #endif
 #ifdef CONFIG_CMA
 		, K(totalcma_pages)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2d4a8804eafa..acd4665c3025 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -114,21 +114,16 @@ enum zone_stat_item {
 	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
 	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
 	NR_ZONE_LRU_FILE,
+	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
-	NR_FILE_PAGES,
-	NR_FILE_DIRTY,
-	NR_WRITEBACK,
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
 	NR_KERNEL_STACK,
 	/* Second 128 byte cacheline */
-	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
-	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
-	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 #if IS_ENABLED(CONFIG_ZSMALLOC)
@@ -142,9 +137,6 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
-	NR_ANON_THPS,
-	NR_SHMEM_THPS,
-	NR_SHMEM_PMDMAPPED,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
@@ -164,6 +156,15 @@ enum node_stat_item {
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
+	NR_FILE_PAGES,
+	NR_FILE_DIRTY,
+	NR_WRITEBACK,
+	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
+	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
+	NR_SHMEM_THPS,
+	NR_SHMEM_PMDMAPPED,
+	NR_ANON_THPS,
+	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 531f5811ff6b..ad20f2d2b1f9 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -412,9 +412,9 @@ TRACE_EVENT(global_dirty_state,
 	),
 
 	TP_fast_assign(
-		__entry->nr_dirty	= global_page_state(NR_FILE_DIRTY);
-		__entry->nr_writeback	= global_page_state(NR_WRITEBACK);
-		__entry->nr_unstable	= global_page_state(NR_UNSTABLE_NFS);
+		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
+		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
+		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
 		__entry->nr_dirtied	= global_page_state(NR_DIRTIED);
 		__entry->nr_written	= global_page_state(NR_WRITTEN);
 		__entry->background_thresh = background_thresh;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7ec50bd6f88c..c5f5e46c6f7f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -218,11 +218,11 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 	/* hugetlb pages do not participate in page cache accounting. */
 	if (!PageHuge(page))
-		__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page)) {
-		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
+		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
 		if (PageTransHuge(page))
-			__dec_zone_page_state(page, NR_SHMEM_THPS);
+			__dec_node_page_state(page, NR_SHMEM_THPS);
 	} else {
 		VM_BUG_ON_PAGE(PageTransHuge(page) && !PageHuge(page), page);
 	}
@@ -568,9 +568,9 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		 * hugetlb pages do not participate in page cache accounting.
 		 */
 		if (!PageHuge(new))
-			__inc_zone_page_state(new, NR_FILE_PAGES);
+			__inc_node_page_state(new, NR_FILE_PAGES);
 		if (PageSwapBacked(new))
-			__inc_zone_page_state(new, NR_SHMEM);
+			__inc_node_page_state(new, NR_SHMEM);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		mem_cgroup_migrate(old, new);
 		radix_tree_preload_end();
@@ -677,7 +677,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
 	/* hugetlb pages do not participate in page cache accounting. */
 	if (!huge)
-		__inc_zone_page_state(page, NR_FILE_PAGES);
+		__inc_node_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d5b2207cfd2..8ec69736de18 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1591,7 +1591,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
 		/* Last compound_mapcount is gone. */
-		__dec_zone_page_state(page, NR_ANON_THPS);
+		__dec_node_page_state(page, NR_ANON_THPS);
 		if (TestClearPageDoubleMap(page)) {
 			/* No need in mapcount reference anymore */
 			for (i = 0; i < HPAGE_PMD_NR; i++)
@@ -2073,7 +2073,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			list_del(page_deferred_list(head));
 		}
 		if (mapping)
-			__dec_zone_page_state(page, NR_SHMEM_THPS);
+			__dec_node_page_state(page, NR_SHMEM_THPS);
 		spin_unlock(&pgdata->split_queue_lock);
 		__split_huge_page(page, list, flags);
 		ret = 0;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d7a49f665f04..d907cdc3dc28 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1474,10 +1474,10 @@ static void collapse_shmem(struct mm_struct *mm,
 		}
 
 		local_irq_save(flags);
-		__inc_zone_page_state(new_page, NR_SHMEM_THPS);
+		__inc_node_page_state(new_page, NR_SHMEM_THPS);
 		if (nr_none) {
-			__mod_zone_page_state(zone, NR_FILE_PAGES, nr_none);
-			__mod_zone_page_state(zone, NR_SHMEM, nr_none);
+			__mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
+			__mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
 		}
 		local_irq_restore(flags);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index fba770c54d84..c77997dc6ed7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -505,15 +505,17 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 * are mapped to swap space.
 	 */
 	if (newzone != oldzone) {
-		__dec_zone_state(oldzone, NR_FILE_PAGES);
-		__inc_zone_state(newzone, NR_FILE_PAGES);
+		__dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
+		__inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
 		if (PageSwapBacked(page) && !PageSwapCache(page)) {
-			__dec_zone_state(oldzone, NR_SHMEM);
-			__inc_zone_state(newzone, NR_SHMEM);
+			__dec_node_state(oldzone->zone_pgdat, NR_SHMEM);
+			__inc_node_state(newzone->zone_pgdat, NR_SHMEM);
 		}
 		if (dirty && mapping_cap_account_dirty(mapping)) {
-			__dec_zone_state(oldzone, NR_FILE_DIRTY);
-			__inc_zone_state(newzone, NR_FILE_DIRTY);
+			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
+			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
+			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
+			__dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);
 		}
 	}
 	local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f7c0fb993fb9..f97591d9fa00 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -498,20 +498,12 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
  */
 bool node_dirty_ok(struct pglist_data *pgdat)
 {
-	int z;
 	unsigned long limit = node_dirty_limit(pgdat);
 	unsigned long nr_pages = 0;
 
-	for (z = 0; z < MAX_NR_ZONES; z++) {
-		struct zone *zone = pgdat->node_zones + z;
-
-		if (!populated_zone(zone))
-			continue;
-
-		nr_pages += zone_page_state(zone, NR_FILE_DIRTY);
-		nr_pages += zone_page_state(zone, NR_UNSTABLE_NFS);
-		nr_pages += zone_page_state(zone, NR_WRITEBACK);
-	}
+	nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
+	nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
+	nr_pages += node_page_state(pgdat, NR_WRITEBACK);
 
 	return nr_pages <= limit;
 }
@@ -1601,10 +1593,10 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
+		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
+					global_node_page_state(NR_UNSTABLE_NFS);
 		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
 
 		domain_dirty_limits(gdtc);
 
@@ -1941,8 +1933,8 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	 * as we're trying to decide whether to put more under writeback.
 	 */
 	gdtc->avail = global_dirtyable_memory();
-	gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
-		      global_page_state(NR_UNSTABLE_NFS);
+	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
+		      global_node_page_state(NR_UNSTABLE_NFS);
 	domain_dirty_limits(gdtc);
 
 	if (gdtc->dirty > gdtc->bg_thresh)
@@ -1986,8 +1978,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
+                if (global_node_page_state(NR_UNSTABLE_NFS) +
+			global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
                         	break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
@@ -2015,8 +2007,8 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
 void laptop_mode_timer_fn(unsigned long data)
 {
 	struct request_queue *q = (struct request_queue *)data;
-	int nr_pages = global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS);
+	int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
+		global_node_page_state(NR_UNSTABLE_NFS);
 	struct bdi_writeback *wb;
 
 	/*
@@ -2467,7 +2459,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		wb = inode_to_wb(inode);
 
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-		__inc_zone_page_state(page, NR_FILE_DIRTY);
+		__inc_node_page_state(page, NR_FILE_DIRTY);
+		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
@@ -2488,7 +2481,8 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
 {
 	if (mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-		dec_zone_page_state(page, NR_FILE_DIRTY);
+		dec_node_page_state(page, NR_FILE_DIRTY);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_SIZE);
 	}
@@ -2744,7 +2738,8 @@ int clear_page_dirty_for_io(struct page *page)
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_node_page_state(page, NR_FILE_DIRTY);
+			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
@@ -2790,7 +2785,8 @@ int test_clear_page_writeback(struct page *page)
 	}
 	if (ret) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
-		dec_zone_page_state(page, NR_WRITEBACK);
+		dec_node_page_state(page, NR_WRITEBACK);
+		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
@@ -2844,7 +2840,8 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	}
 	if (!ret) {
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
-		inc_zone_page_state(page, NR_WRITEBACK);
+		inc_node_page_state(page, NR_WRITEBACK);
+		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78338b51819b..fd34b305c8ee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3535,14 +3535,12 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 			 * prevent from pre mature OOM
 			 */
 			if (!did_some_progress) {
-				unsigned long writeback;
-				unsigned long dirty;
+				unsigned long write_pending;
 
-				writeback = zone_page_state_snapshot(zone,
-								     NR_WRITEBACK);
-				dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+				write_pending = zone_page_state_snapshot(zone,
+							NR_ZONE_WRITE_PENDING);
 
-				if (2*(writeback + dirty) > reclaimable) {
+				if (2 * write_pending > reclaimable) {
 					congestion_wait(BLK_RW_ASYNC, HZ/10);
 					return true;
 				}
@@ -4218,7 +4216,7 @@ EXPORT_SYMBOL_GPL(si_mem_available);
 void si_meminfo(struct sysinfo *val)
 {
 	val->totalram = totalram_pages;
-	val->sharedram = global_page_state(NR_SHMEM);
+	val->sharedram = global_node_page_state(NR_SHMEM);
 	val->freeram = global_page_state(NR_FREE_PAGES);
 	val->bufferram = nr_blockdev_pages();
 	val->totalhigh = totalhigh_pages;
@@ -4240,7 +4238,7 @@ void si_meminfo_node(struct sysinfo *val, int nid)
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
 		managed_pages += pgdat->node_zones[zone_type].managed_pages;
 	val->totalram = managed_pages;
-	val->sharedram = sum_zone_node_page_state(nid, NR_SHMEM);
+	val->sharedram = node_page_state(pgdat, NR_SHMEM);
 	val->freeram = sum_zone_node_page_state(nid, NR_FREE_PAGES);
 #ifdef CONFIG_HIGHMEM
 	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
@@ -4339,9 +4337,6 @@ void show_free_areas(unsigned int filter)
 		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		" anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
-#endif
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
 		global_node_page_state(NR_INACTIVE_ANON),
@@ -4350,20 +4345,15 @@ void show_free_areas(unsigned int filter)
 		global_node_page_state(NR_INACTIVE_FILE),
 		global_node_page_state(NR_ISOLATED_FILE),
 		global_node_page_state(NR_UNEVICTABLE),
-		global_page_state(NR_FILE_DIRTY),
-		global_page_state(NR_WRITEBACK),
-		global_page_state(NR_UNSTABLE_NFS),
+		global_node_page_state(NR_FILE_DIRTY),
+		global_node_page_state(NR_WRITEBACK),
+		global_node_page_state(NR_UNSTABLE_NFS),
 		global_page_state(NR_SLAB_RECLAIMABLE),
 		global_page_state(NR_SLAB_UNRECLAIMABLE),
 		global_node_page_state(NR_FILE_MAPPED),
-		global_page_state(NR_SHMEM),
+		global_node_page_state(NR_SHMEM),
 		global_page_state(NR_PAGETABLE),
 		global_page_state(NR_BOUNCE),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR,
-		global_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR,
-		global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR,
-#endif
 		global_page_state(NR_FREE_PAGES),
 		free_pcp,
 		global_page_state(NR_FREE_CMA_PAGES));
@@ -4378,6 +4368,16 @@ void show_free_areas(unsigned int filter)
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
 			" mapped:%lukB"
+			" dirty:%lukB"
+			" writeback:%lukB"
+			" shmem:%lukB"
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			" shmem_thp: %lukB"
+			" shmem_pmdmapped: %lukB"
+			" anon_thp: %lukB"
+#endif
+			" writeback_tmp:%lukB"
+			" unstable:%lukB"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -4389,6 +4389,17 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
+			K(node_page_state(pgdat, NR_FILE_DIRTY)),
+			K(node_page_state(pgdat, NR_WRITEBACK)),
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
+					* HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
+#endif
+			K(node_page_state(pgdat, NR_SHMEM)),
+			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
+			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 			!pgdat_reclaimable(pgdat) ? "yes" : "no");
 	}
 
@@ -4411,24 +4422,14 @@ void show_free_areas(unsigned int filter)
 			" present:%lukB"
 			" managed:%lukB"
 			" mlocked:%lukB"
-			" dirty:%lukB"
-			" writeback:%lukB"
-			" shmem:%lukB"
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-			" shmem_thp: %lukB"
-			" shmem_pmdmapped: %lukB"
-			" anon_thp: %lukB"
-#endif
 			" slab_reclaimable:%lukB"
 			" slab_unreclaimable:%lukB"
 			" kernel_stack:%lukB"
 			" pagetables:%lukB"
-			" unstable:%lukB"
 			" bounce:%lukB"
 			" free_pcp:%lukB"
 			" local_pcp:%ukB"
 			" free_cma:%lukB"
-			" writeback_tmp:%lukB"
 			" node_pages_scanned:%lu"
 			"\n",
 			zone->name,
@@ -4439,26 +4440,15 @@ void show_free_areas(unsigned int filter)
 			K(zone->present_pages),
 			K(zone->managed_pages),
 			K(zone_page_state(zone, NR_MLOCK)),
-			K(zone_page_state(zone, NR_FILE_DIRTY)),
-			K(zone_page_state(zone, NR_WRITEBACK)),
-			K(zone_page_state(zone, NR_SHMEM)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-			K(zone_page_state(zone, NR_SHMEM_THPS) * HPAGE_PMD_NR),
-			K(zone_page_state(zone, NR_SHMEM_PMDMAPPED)
-					* HPAGE_PMD_NR),
-			K(zone_page_state(zone, NR_ANON_THPS) * HPAGE_PMD_NR),
-#endif
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
 			zone_page_state(zone, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
 			K(zone_page_state(zone, NR_PAGETABLE)),
-			K(zone_page_state(zone, NR_UNSTABLE_NFS)),
 			K(zone_page_state(zone, NR_BOUNCE)),
 			K(free_pcp),
 			K(this_cpu_read(zone->pageset->pcp.count)),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
-			K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
 			K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
@@ -4501,7 +4491,7 @@ void show_free_areas(unsigned int filter)
 
 	hugetlb_show_meminfo();
 
-	printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
+	printk("%ld total pagecache pages\n", global_node_page_state(NR_FILE_PAGES));
 
 	show_swap_cache_info();
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index a66f80bc8703..5b6dc9e33f7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1216,7 +1216,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound)
-			__inc_zone_page_state(page, NR_ANON_THPS);
+			__inc_node_page_state(page, NR_ANON_THPS);
 		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
 	}
 	if (unlikely(PageKsm(page)))
@@ -1254,7 +1254,7 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
-		__inc_zone_page_state(page, NR_ANON_THPS);
+		__inc_node_page_state(page, NR_ANON_THPS);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1285,7 +1285,7 @@ void page_add_file_rmap(struct page *page, bool compound)
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
-		__inc_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
 		if (PageTransCompound(page)) {
 			VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1325,7 +1325,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 			goto out;
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
-		__dec_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
 		if (!atomic_add_negative(-1, &page->_mapcount))
 			goto out;
@@ -1359,7 +1359,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	__dec_zone_page_state(page, NR_ANON_THPS);
+	__dec_node_page_state(page, NR_ANON_THPS);
 
 	if (TestClearPageDoubleMap(page)) {
 		/*
diff --git a/mm/shmem.c b/mm/shmem.c
index bfaa007ccb58..8975df09ec26 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -575,9 +575,9 @@ static int shmem_add_to_page_cache(struct page *page,
 	if (!error) {
 		mapping->nrpages += nr;
 		if (PageTransHuge(page))
-			__inc_zone_page_state(page, NR_SHMEM_THPS);
-		__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
-		__mod_zone_page_state(page_zone(page), NR_SHMEM, nr);
+			__inc_node_page_state(page, NR_SHMEM_THPS);
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
+		__mod_node_page_state(page_pgdat(page), NR_SHMEM, nr);
 		spin_unlock_irq(&mapping->tree_lock);
 	} else {
 		page->mapping = NULL;
@@ -601,8 +601,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 	error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
 	page->mapping = NULL;
 	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
-	__dec_zone_page_state(page, NR_SHMEM);
+	__dec_node_page_state(page, NR_FILE_PAGES);
+	__dec_node_page_state(page, NR_SHMEM);
 	spin_unlock_irq(&mapping->tree_lock);
 	put_page(page);
 	BUG_ON(error);
@@ -1493,8 +1493,8 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 	error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
 								   newpage);
 	if (!error) {
-		__inc_zone_page_state(newpage, NR_FILE_PAGES);
-		__dec_zone_page_state(oldpage, NR_FILE_PAGES);
+		__inc_node_page_state(newpage, NR_FILE_PAGES);
+		__dec_node_page_state(oldpage, NR_FILE_PAGES);
 	}
 	spin_unlock_irq(&swap_mapping->tree_lock);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c99463ac02fb..c8310a37be3a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -95,7 +95,7 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 					entry.val, page);
 	if (likely(!error)) {
 		address_space->nrpages++;
-		__inc_zone_page_state(page, NR_FILE_PAGES);
+		__inc_node_page_state(page, NR_FILE_PAGES);
 		INC_CACHE_INFO(add_total);
 	}
 	spin_unlock_irq(&address_space->tree_lock);
@@ -147,7 +147,7 @@ void __delete_from_swap_cache(struct page *page)
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
 	address_space->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+	__dec_node_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
 }
 
diff --git a/mm/util.c b/mm/util.c
index 8d010ef2ce1c..662cddf914af 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -528,7 +528,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 
 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
 		free = global_page_state(NR_FREE_PAGES);
-		free += global_page_state(NR_FILE_PAGES);
+		free += global_node_page_state(NR_FILE_PAGES);
 
 		/*
 		 * shmem pages shouldn't be counted as free in this
@@ -536,7 +536,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		 * that won't affect the overall amount of available
 		 * memory in the system.
 		 */
-		free -= global_page_state(NR_SHMEM);
+		free -= global_node_page_state(NR_SHMEM);
 
 		free += get_nr_swap_pages();
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e8d72d7e268..aef2a6245657 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3585,11 +3585,11 @@ int sysctl_min_unmapped_ratio = 1;
  */
 int sysctl_min_slab_ratio = 5;
 
-static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
+static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
 {
-	unsigned long file_mapped = node_page_state(zone->zone_pgdat, NR_FILE_MAPPED);
-	unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
-		node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
+	unsigned long file_mapped = node_page_state(pgdat, NR_FILE_MAPPED);
+	unsigned long file_lru = node_page_state(pgdat, NR_INACTIVE_FILE) +
+		node_page_state(pgdat, NR_ACTIVE_FILE);
 
 	/*
 	 * It's possible for there to be more file mapped pages than
@@ -3608,17 +3608,17 @@ static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 	/*
 	 * If RECLAIM_UNMAP is set, then all file pages are considered
 	 * potentially reclaimable. Otherwise, we have to worry about
-	 * pages like swapcache and zone_unmapped_file_pages() provides
+	 * pages like swapcache and node_unmapped_file_pages() provides
 	 * a better estimate
 	 */
 	if (zone_reclaim_mode & RECLAIM_UNMAP)
-		nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
+		nr_pagecache_reclaimable = node_page_state(zone->zone_pgdat, NR_FILE_PAGES);
 	else
-		nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
+		nr_pagecache_reclaimable = node_unmapped_file_pages(zone->zone_pgdat);
 
 	/* If we can't clean pages, remove dirty pages from consideration */
 	if (!(zone_reclaim_mode & RECLAIM_WRITE))
-		delta += zone_page_state(zone, NR_FILE_DIRTY);
+		delta += node_page_state(zone->zone_pgdat, NR_FILE_DIRTY);
 
 	/* Watch for any possible underflows due to delta */
 	if (unlikely(delta > nr_pagecache_reclaimable))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 02e7406e8fcd..455392158062 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -924,20 +924,15 @@ const char * const vmstat_text[] = {
 	"nr_alloc_batch",
 	"nr_zone_anon_lru",
 	"nr_zone_file_lru",
+	"nr_zone_write_pending",
 	"nr_mlock",
-	"nr_file_pages",
-	"nr_dirty",
-	"nr_writeback",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
 	"nr_page_table_pages",
 	"nr_kernel_stack",
-	"nr_unstable",
 	"nr_bounce",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
-	"nr_writeback_temp",
-	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
 #if IS_ENABLED(CONFIG_ZSMALLOC)
@@ -951,9 +946,6 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
-	"nr_anon_transparent_hugepages",
-	"nr_shmem_hugepages",
-	"nr_shmem_pmdmapped",
 	"nr_free_cma",
 
 	/* Node-based counters */
@@ -970,6 +962,15 @@ const char * const vmstat_text[] = {
 	"workingset_nodereclaim",
 	"nr_anon_pages",
 	"nr_mapped",
+	"nr_file_pages",
+	"nr_dirty",
+	"nr_writeback",
+	"nr_writeback_temp",
+	"nr_shmem",
+	"nr_shmem_hugepages",
+	"nr_shmem_pmdmapped",
+	"nr_anon_transparent_hugepages",
+	"nr_unstable",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 20/34] mm: move vmscan writes and file write accounting to the node
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

As reclaim is now node-based, it follows that page write activity due to
page reclaim should also be accounted for on the node.  For consistency,
also account page writes and page dirtying on a per-node basis.

After this patch, there are a few remaining zone counters that may appear
strange but are fine.  NUMA stats are still per-zone as this is a
user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
potentially pin low memory and cannot trivially be reclaimed on demand.
This information is still useful for debugging a page allocation failure
warning.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h           | 8 ++++----
 include/trace/events/writeback.h | 4 ++--
 mm/page-writeback.c              | 6 +++---
 mm/vmscan.c                      | 4 ++--
 mm/vmstat.c                      | 8 ++++----
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index acd4665c3025..e3d6d42722a0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -122,10 +122,6 @@ enum zone_stat_item {
 	NR_KERNEL_STACK,
 	/* Second 128 byte cacheline */
 	NR_BOUNCE,
-	NR_VMSCAN_WRITE,
-	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
-	NR_DIRTIED,		/* page dirtyings since bootup */
-	NR_WRITTEN,		/* page writings since bootup */
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	NR_ZSPAGES,		/* allocated in zsmalloc */
 #endif
@@ -165,6 +161,10 @@ enum node_stat_item {
 	NR_SHMEM_PMDMAPPED,
 	NR_ANON_THPS,
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
+	NR_VMSCAN_WRITE,
+	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
+	NR_DIRTIED,		/* page dirtyings since bootup */
+	NR_WRITTEN,		/* page writings since bootup */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index ad20f2d2b1f9..2ccd9ccbf9ef 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -415,8 +415,8 @@ TRACE_EVENT(global_dirty_state,
 		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
 		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
 		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
-		__entry->nr_dirtied	= global_page_state(NR_DIRTIED);
-		__entry->nr_written	= global_page_state(NR_WRITTEN);
+		__entry->nr_dirtied	= global_node_page_state(NR_DIRTIED);
+		__entry->nr_written	= global_node_page_state(NR_WRITTEN);
 		__entry->background_thresh = background_thresh;
 		__entry->dirty_thresh	= dirty_thresh;
 		__entry->dirty_limit	= global_wb_domain.dirty_limit;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f97591d9fa00..3c02aa603f5a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2461,7 +2461,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		__inc_node_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-		__inc_zone_page_state(page, NR_DIRTIED);
+		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
 		task_io_account_write(PAGE_SIZE);
@@ -2550,7 +2550,7 @@ void account_page_redirty(struct page *page)
 
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		current->nr_dirtied--;
-		dec_zone_page_state(page, NR_DIRTIED);
+		dec_node_page_state(page, NR_DIRTIED);
 		dec_wb_stat(wb, WB_DIRTIED);
 		unlocked_inode_to_wb_end(inode, locked);
 	}
@@ -2787,7 +2787,7 @@ int test_clear_page_writeback(struct page *page)
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK);
 		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-		inc_zone_page_state(page, NR_WRITTEN);
+		inc_node_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aef2a6245657..5ad670881d8d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -612,7 +612,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page);
-		inc_zone_page_state(page, NR_VMSCAN_WRITE);
+		inc_node_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
 
@@ -1117,7 +1117,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 * except we already have the page isolated
 				 * and know it's dirty
 				 */
-				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
+				inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
 				SetPageReclaim(page);
 
 				goto keep_locked;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 455392158062..bc94968400d0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -931,10 +931,6 @@ const char * const vmstat_text[] = {
 	"nr_page_table_pages",
 	"nr_kernel_stack",
 	"nr_bounce",
-	"nr_vmscan_write",
-	"nr_vmscan_immediate_reclaim",
-	"nr_dirtied",
-	"nr_written",
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	"nr_zspages",
 #endif
@@ -971,6 +967,10 @@ const char * const vmstat_text[] = {
 	"nr_shmem_pmdmapped",
 	"nr_anon_transparent_hugepages",
 	"nr_unstable",
+	"nr_vmscan_write",
+	"nr_vmscan_immediate_reclaim",
+	"nr_dirtied",
+	"nr_written",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 20/34] mm: move vmscan writes and file write accounting to the node
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

As reclaim is now node-based, it follows that page write activity due to
page reclaim should also be accounted for on the node.  For consistency,
also account page writes and page dirtying on a per-node basis.

After this patch, there are a few remaining zone counters that may appear
strange but are fine.  NUMA stats are still per-zone as this is a
user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
potentially pin low memory and cannot trivially be reclaimed on demand.
This information is still useful for debugging a page allocation failure
warning.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h           | 8 ++++----
 include/trace/events/writeback.h | 4 ++--
 mm/page-writeback.c              | 6 +++---
 mm/vmscan.c                      | 4 ++--
 mm/vmstat.c                      | 8 ++++----
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index acd4665c3025..e3d6d42722a0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -122,10 +122,6 @@ enum zone_stat_item {
 	NR_KERNEL_STACK,
 	/* Second 128 byte cacheline */
 	NR_BOUNCE,
-	NR_VMSCAN_WRITE,
-	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
-	NR_DIRTIED,		/* page dirtyings since bootup */
-	NR_WRITTEN,		/* page writings since bootup */
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	NR_ZSPAGES,		/* allocated in zsmalloc */
 #endif
@@ -165,6 +161,10 @@ enum node_stat_item {
 	NR_SHMEM_PMDMAPPED,
 	NR_ANON_THPS,
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
+	NR_VMSCAN_WRITE,
+	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
+	NR_DIRTIED,		/* page dirtyings since bootup */
+	NR_WRITTEN,		/* page writings since bootup */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index ad20f2d2b1f9..2ccd9ccbf9ef 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -415,8 +415,8 @@ TRACE_EVENT(global_dirty_state,
 		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
 		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
 		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
-		__entry->nr_dirtied	= global_page_state(NR_DIRTIED);
-		__entry->nr_written	= global_page_state(NR_WRITTEN);
+		__entry->nr_dirtied	= global_node_page_state(NR_DIRTIED);
+		__entry->nr_written	= global_node_page_state(NR_WRITTEN);
 		__entry->background_thresh = background_thresh;
 		__entry->dirty_thresh	= dirty_thresh;
 		__entry->dirty_limit	= global_wb_domain.dirty_limit;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f97591d9fa00..3c02aa603f5a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2461,7 +2461,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		__inc_node_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-		__inc_zone_page_state(page, NR_DIRTIED);
+		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
 		task_io_account_write(PAGE_SIZE);
@@ -2550,7 +2550,7 @@ void account_page_redirty(struct page *page)
 
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		current->nr_dirtied--;
-		dec_zone_page_state(page, NR_DIRTIED);
+		dec_node_page_state(page, NR_DIRTIED);
 		dec_wb_stat(wb, WB_DIRTIED);
 		unlocked_inode_to_wb_end(inode, locked);
 	}
@@ -2787,7 +2787,7 @@ int test_clear_page_writeback(struct page *page)
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK);
 		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-		inc_zone_page_state(page, NR_WRITTEN);
+		inc_node_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aef2a6245657..5ad670881d8d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -612,7 +612,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page);
-		inc_zone_page_state(page, NR_VMSCAN_WRITE);
+		inc_node_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
 
@@ -1117,7 +1117,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 * except we already have the page isolated
 				 * and know it's dirty
 				 */
-				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
+				inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
 				SetPageReclaim(page);
 
 				goto keep_locked;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 455392158062..bc94968400d0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -931,10 +931,6 @@ const char * const vmstat_text[] = {
 	"nr_page_table_pages",
 	"nr_kernel_stack",
 	"nr_bounce",
-	"nr_vmscan_write",
-	"nr_vmscan_immediate_reclaim",
-	"nr_dirtied",
-	"nr_written",
 #if IS_ENABLED(CONFIG_ZSMALLOC)
 	"nr_zspages",
 #endif
@@ -971,6 +967,10 @@ const char * const vmstat_text[] = {
 	"nr_shmem_pmdmapped",
 	"nr_anon_transparent_hugepages",
 	"nr_unstable",
+	"nr_vmscan_write",
+	"nr_vmscan_immediate_reclaim",
+	"nr_dirtied",
+	"nr_written",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 21/34] mm, vmscan: only wakeup kswapd once per node for the requested classzone
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

kswapd is woken when zones are below the low watermark but the wakeup
decision is not taking the classzone into account.  Now that reclaim is
node-based, it is only required to wake kswapd once per node and only if
all zones are unbalanced for the requested classzone.

Note that one node might be checked multiple times if the zonelist is
ordered by node because there is no cheap way of tracking what nodes have
already been visited.  For zone-ordering, each node should be checked only
once.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c |  8 ++++++--
 mm/vmscan.c     | 13 +++++++++++--
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd34b305c8ee..bb261885c121 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3410,10 +3410,14 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 {
 	struct zoneref *z;
 	struct zone *zone;
+	pg_data_t *last_pgdat = NULL;
 
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
-						ac->high_zoneidx, ac->nodemask)
-		wakeup_kswapd(zone, order, ac_classzone_idx(ac));
+					ac->high_zoneidx, ac->nodemask) {
+		if (last_pgdat != zone->zone_pgdat)
+			wakeup_kswapd(zone, order, ac_classzone_idx(ac));
+		last_pgdat = zone->zone_pgdat;
+	}
 }
 
 static inline unsigned int
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ad670881d8d..cc820bbe9c01 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3419,6 +3419,7 @@ static int kswapd(void *p)
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
+	int z;
 
 	if (!populated_zone(zone))
 		return;
@@ -3430,8 +3431,16 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0))
-		return;
+
+	/* Only wake kswapd if all zones are unbalanced */
+	for (z = 0; z <= classzone_idx; z++) {
+		zone = pgdat->node_zones + z;
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_balanced(zone, order, classzone_idx))
+			return;
+	}
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 21/34] mm, vmscan: only wakeup kswapd once per node for the requested classzone
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

kswapd is woken when zones are below the low watermark but the wakeup
decision is not taking the classzone into account.  Now that reclaim is
node-based, it is only required to wake kswapd once per node and only if
all zones are unbalanced for the requested classzone.

Note that one node might be checked multiple times if the zonelist is
ordered by node because there is no cheap way of tracking what nodes have
already been visited.  For zone-ordering, each node should be checked only
once.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c |  8 ++++++--
 mm/vmscan.c     | 13 +++++++++++--
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd34b305c8ee..bb261885c121 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3410,10 +3410,14 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 {
 	struct zoneref *z;
 	struct zone *zone;
+	pg_data_t *last_pgdat = NULL;
 
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
-						ac->high_zoneidx, ac->nodemask)
-		wakeup_kswapd(zone, order, ac_classzone_idx(ac));
+					ac->high_zoneidx, ac->nodemask) {
+		if (last_pgdat != zone->zone_pgdat)
+			wakeup_kswapd(zone, order, ac_classzone_idx(ac));
+		last_pgdat = zone->zone_pgdat;
+	}
 }
 
 static inline unsigned int
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ad670881d8d..cc820bbe9c01 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3419,6 +3419,7 @@ static int kswapd(void *p)
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
+	int z;
 
 	if (!populated_zone(zone))
 		return;
@@ -3430,8 +3431,16 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0))
-		return;
+
+	/* Only wake kswapd if all zones are unbalanced */
+	for (z = 0; z <= classzone_idx; z++) {
+		zone = pgdat->node_zones + z;
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_balanced(zone, order, classzone_idx))
+			return;
+	}
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 22/34] mm, page_alloc: wake kswapd based on the highest eligible zone
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The ac_classzone_idx is used as the basis for waking kswapd and that is based
on the preferred zoneref. If the preferred zoneref's first zone is lower
than what is available on other nodes, it's possible that kswapd is woken
on a zone with only higher, but still eligible, zones. As classzone_idx
is strictly adhered to now, it causes a problem because eligible pages
are skipped.

For example, node 0 has only DMA32 and node 1 has only NORMAL. An allocating
context running on node 0 may wake kswapd on node 1 telling it to skip
all NORMAL pages.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb261885c121..e6ee52f1c15f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3415,7 +3415,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 					ac->high_zoneidx, ac->nodemask) {
 		if (last_pgdat != zone->zone_pgdat)
-			wakeup_kswapd(zone, order, ac_classzone_idx(ac));
+			wakeup_kswapd(zone, order, ac->high_zoneidx);
 		last_pgdat = zone->zone_pgdat;
 	}
 }
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 22/34] mm, page_alloc: wake kswapd based on the highest eligible zone
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The ac_classzone_idx is used as the basis for waking kswapd and that is based
on the preferred zoneref. If the preferred zoneref's first zone is lower
than what is available on other nodes, it's possible that kswapd is woken
on a zone with only higher, but still eligible, zones. As classzone_idx
is strictly adhered to now, it causes a problem because eligible pages
are skipped.

For example, node 0 has only DMA32 and node 1 has only NORMAL. An allocating
context running on node 0 may wake kswapd on node 1 telling it to skip
all NORMAL pages.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb261885c121..e6ee52f1c15f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3415,7 +3415,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 					ac->high_zoneidx, ac->nodemask) {
 		if (last_pgdat != zone->zone_pgdat)
-			wakeup_kswapd(zone, order, ac_classzone_idx(ac));
+			wakeup_kswapd(zone, order, ac->high_zoneidx);
 		last_pgdat = zone->zone_pgdat;
 	}
 }
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 23/34] mm: convert zone_reclaim to node_reclaim
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:34   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

As reclaim is now per-node based, convert zone_reclaim to be node_reclaim.
It is possible that a node will be reclaimed multiple times if it has
multiple zones but this is unavoidable without caching all nodes traversed
so far.  The documentation and interface to userspace is the same from a
configuration perspective and will will be similar in behaviour unless the
node-local allocation requests were also limited to lower zones.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h   | 18 +++++------
 include/linux/swap.h     |  9 +++---
 include/linux/topology.h |  2 +-
 kernel/sysctl.c          |  4 +--
 mm/internal.h            |  8 ++---
 mm/khugepaged.c          |  4 +--
 mm/page_alloc.c          | 24 ++++++++++-----
 mm/vmscan.c              | 77 ++++++++++++++++++++++++------------------------
 8 files changed, 77 insertions(+), 69 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e3d6d42722a0..e19c081c794e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -372,14 +372,6 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
-#ifdef CONFIG_NUMA
-	/*
-	 * zone reclaim becomes active if more unmapped pages exist.
-	 */
-	unsigned long		min_unmapped_pages;
-	unsigned long		min_slab_pages;
-#endif /* CONFIG_NUMA */
-
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
@@ -525,7 +517,6 @@ struct zone {
 } ____cacheline_internodealigned_in_smp;
 
 enum zone_flags {
-	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 };
 
@@ -540,6 +531,7 @@ enum pgdat_flags {
 	PGDAT_WRITEBACK,		/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -688,6 +680,14 @@ typedef struct pglist_data {
 	 */
 	unsigned long		totalreserve_pages;
 
+#ifdef CONFIG_NUMA
+	/*
+	 * zone reclaim becomes active if more unmapped pages exist.
+	 */
+	unsigned long		min_unmapped_pages;
+	unsigned long		min_slab_pages;
+#endif /* CONFIG_NUMA */
+
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
 	spinlock_t		lru_lock;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2a23ddc96edd..b17cc4830fa6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -326,13 +326,14 @@ extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
 #ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
+extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
 #else
-#define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+#define node_reclaim_mode 0
+static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
+				unsigned int order)
 {
 	return 0;
 }
diff --git a/include/linux/topology.h b/include/linux/topology.h
index afce69296ac0..cb0775e1ee4b 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -54,7 +54,7 @@ int arch_update_cpu_topology(void);
 /*
  * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
  * (in whatever arch specific measurement units returned by node_distance())
- * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim()
+ * and node_reclaim_mode is enabled then the VM will only call node_reclaim()
  * on nodes within this distance.
  */
 #define RECLAIM_DISTANCE 30
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index de331c3858e5..6e47ebe5384e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1498,8 +1498,8 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "zone_reclaim_mode",
-		.data		= &zone_reclaim_mode,
-		.maxlen		= sizeof(zone_reclaim_mode),
+		.data		= &node_reclaim_mode,
+		.maxlen		= sizeof(node_reclaim_mode),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 		.extra1		= &zero,
diff --git a/mm/internal.h b/mm/internal.h
index 2f80d0343c56..1e21b2d3838d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -433,10 +433,10 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 }
 #endif /* CONFIG_SPARSEMEM */
 
-#define ZONE_RECLAIM_NOSCAN	-2
-#define ZONE_RECLAIM_FULL	-1
-#define ZONE_RECLAIM_SOME	0
-#define ZONE_RECLAIM_SUCCESS	1
+#define NODE_RECLAIM_NOSCAN	-2
+#define NODE_RECLAIM_FULL	-1
+#define NODE_RECLAIM_SOME	0
+#define NODE_RECLAIM_SUCCESS	1
 
 extern int hwpoison_filter(struct page *p);
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d907cdc3dc28..bb49bd1d2d9f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -672,10 +672,10 @@ static bool khugepaged_scan_abort(int nid)
 	int i;
 
 	/*
-	 * If zone_reclaim_mode is disabled, then no extra effort is made to
+	 * If node_reclaim_mode is disabled, then no extra effort is made to
 	 * allocate memory locally.
 	 */
-	if (!zone_reclaim_mode)
+	if (!node_reclaim_mode)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e6ee52f1c15f..de825b07e233 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2985,16 +2985,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			if (alloc_flags & ALLOC_NO_WATERMARKS)
 				goto try_this_zone;
 
-			if (zone_reclaim_mode == 0 ||
+			if (node_reclaim_mode == 0 ||
 			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
 				continue;
 
-			ret = zone_reclaim(zone, gfp_mask, order);
+			ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
 			switch (ret) {
-			case ZONE_RECLAIM_NOSCAN:
+			case NODE_RECLAIM_NOSCAN:
 				/* did not scan */
 				continue;
-			case ZONE_RECLAIM_FULL:
+			case NODE_RECLAIM_FULL:
 				/* scanned but unreclaimable */
 				continue;
 			default:
@@ -5991,9 +5991,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
 #ifdef CONFIG_NUMA
 		zone->node = nid;
-		zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
+		pgdat->min_unmapped_pages += (freesize*sysctl_min_unmapped_ratio)
 						/ 100;
-		zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
+		pgdat->min_slab_pages += (freesize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
 		zone->zone_pgdat = pgdat;
@@ -6970,6 +6970,7 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int rc;
 
@@ -6977,8 +6978,11 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 	if (rc)
 		return rc;
 
+	for_each_online_pgdat(pgdat)
+		pgdat->min_slab_pages = 0;
+
 	for_each_zone(zone)
-		zone->min_unmapped_pages = (zone->managed_pages *
+		zone->zone_pgdat->min_unmapped_pages += (zone->managed_pages *
 				sysctl_min_unmapped_ratio) / 100;
 	return 0;
 }
@@ -6986,6 +6990,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int rc;
 
@@ -6993,8 +6998,11 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
 	if (rc)
 		return rc;
 
+	for_each_online_pgdat(pgdat)
+		pgdat->min_slab_pages = 0;
+
 	for_each_zone(zone)
-		zone->min_slab_pages = (zone->managed_pages *
+		zone->zone_pgdat->min_slab_pages += (zone->managed_pages *
 				sysctl_min_slab_ratio) / 100;
 	return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc820bbe9c01..e12b0fd2044c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3563,12 +3563,12 @@ module_init(kswapd_init)
 
 #ifdef CONFIG_NUMA
 /*
- * Zone reclaim mode
+ * Node reclaim mode
  *
- * If non-zero call zone_reclaim when the number of free pages falls below
+ * If non-zero call node_reclaim when the number of free pages falls below
  * the watermarks.
  */
-int zone_reclaim_mode __read_mostly;
+int node_reclaim_mode __read_mostly;
 
 #define RECLAIM_OFF 0
 #define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
@@ -3576,14 +3576,14 @@ int zone_reclaim_mode __read_mostly;
 #define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
 
 /*
- * Priority for ZONE_RECLAIM. This determines the fraction of pages
+ * Priority for NODE_RECLAIM. This determines the fraction of pages
  * of a node considered for each zone_reclaim. 4 scans 1/16th of
  * a zone.
  */
-#define ZONE_RECLAIM_PRIORITY 4
+#define NODE_RECLAIM_PRIORITY 4
 
 /*
- * Percentage of pages in a zone that must be unmapped for zone_reclaim to
+ * Percentage of pages in a zone that must be unmapped for node_reclaim to
  * occur.
  */
 int sysctl_min_unmapped_ratio = 1;
@@ -3609,7 +3609,7 @@ static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
 }
 
 /* Work out how many page cache pages we can reclaim in this reclaim_mode */
-static unsigned long zone_pagecache_reclaimable(struct zone *zone)
+static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 {
 	unsigned long nr_pagecache_reclaimable;
 	unsigned long delta = 0;
@@ -3620,14 +3620,14 @@ static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 	 * pages like swapcache and node_unmapped_file_pages() provides
 	 * a better estimate
 	 */
-	if (zone_reclaim_mode & RECLAIM_UNMAP)
-		nr_pagecache_reclaimable = node_page_state(zone->zone_pgdat, NR_FILE_PAGES);
+	if (node_reclaim_mode & RECLAIM_UNMAP)
+		nr_pagecache_reclaimable = node_page_state(pgdat, NR_FILE_PAGES);
 	else
-		nr_pagecache_reclaimable = node_unmapped_file_pages(zone->zone_pgdat);
+		nr_pagecache_reclaimable = node_unmapped_file_pages(pgdat);
 
 	/* If we can't clean pages, remove dirty pages from consideration */
-	if (!(zone_reclaim_mode & RECLAIM_WRITE))
-		delta += node_page_state(zone->zone_pgdat, NR_FILE_DIRTY);
+	if (!(node_reclaim_mode & RECLAIM_WRITE))
+		delta += node_page_state(pgdat, NR_FILE_DIRTY);
 
 	/* Watch for any possible underflows due to delta */
 	if (unlikely(delta > nr_pagecache_reclaimable))
@@ -3637,23 +3637,24 @@ static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
- * Try to free up some pages from this zone through reclaim.
+ * Try to free up some pages from this node through reclaim.
  */
-static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
 	/* Minimum pages needed in order to stay on node */
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
+	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
 		.order = order,
-		.priority = ZONE_RECLAIM_PRIORITY,
-		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(zone_reclaim_mode & RECLAIM_UNMAP),
+		.priority = NODE_RECLAIM_PRIORITY,
+		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
+		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
-		.reclaim_idx = zone_idx(zone),
+		.reclaim_idx = classzone_idx,
 	};
 
 	cond_resched();
@@ -3667,13 +3668,13 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
+	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
 		/*
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_node(zone->zone_pgdat, &sc, zone_idx(zone));
+			shrink_node(pgdat, &sc, classzone_idx);
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
@@ -3683,49 +3684,47 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	return sc.nr_reclaimed >= nr_pages;
 }
 
-int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
-	int node_id;
 	int ret;
 
 	/*
-	 * Zone reclaim reclaims unmapped file backed pages and
+	 * Node reclaim reclaims unmapped file backed pages and
 	 * slab pages if we are over the defined limits.
 	 *
 	 * A small portion of unmapped file backed pages is needed for
 	 * file I/O otherwise pages read by file I/O will be immediately
-	 * thrown out if the zone is overallocated. So we do not reclaim
-	 * if less than a specified percentage of the zone is used by
+	 * thrown out if the node is overallocated. So we do not reclaim
+	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
 	 */
-	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
-	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
-		return ZONE_RECLAIM_FULL;
+	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
+	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+		return NODE_RECLAIM_FULL;
 
-	if (!pgdat_reclaimable(zone->zone_pgdat))
-		return ZONE_RECLAIM_FULL;
+	if (!pgdat_reclaimable(pgdat))
+		return NODE_RECLAIM_FULL;
 
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
 	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
-		return ZONE_RECLAIM_NOSCAN;
+		return NODE_RECLAIM_NOSCAN;
 
 	/*
-	 * Only run zone reclaim on the local zone or on zones that do not
+	 * Only run node reclaim on the local node or on nodes that do not
 	 * have associated processors. This will favor the local processor
 	 * over remote processors and spread off node memory allocations
 	 * as wide as possible.
 	 */
-	node_id = zone_to_nid(zone);
-	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
-		return ZONE_RECLAIM_NOSCAN;
+	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
+		return NODE_RECLAIM_NOSCAN;
 
-	if (test_and_set_bit(ZONE_RECLAIM_LOCKED, &zone->flags))
-		return ZONE_RECLAIM_NOSCAN;
+	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
+		return NODE_RECLAIM_NOSCAN;
 
-	ret = __zone_reclaim(zone, gfp_mask, order);
-	clear_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
+	ret = __node_reclaim(pgdat, gfp_mask, order);
+	clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 
 	if (!ret)
 		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 23/34] mm: convert zone_reclaim to node_reclaim
@ 2016-07-08  9:34   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:34 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

As reclaim is now per-node based, convert zone_reclaim to be node_reclaim.
It is possible that a node will be reclaimed multiple times if it has
multiple zones but this is unavoidable without caching all nodes traversed
so far.  The documentation and interface to userspace is the same from a
configuration perspective and will will be similar in behaviour unless the
node-local allocation requests were also limited to lower zones.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h   | 18 +++++------
 include/linux/swap.h     |  9 +++---
 include/linux/topology.h |  2 +-
 kernel/sysctl.c          |  4 +--
 mm/internal.h            |  8 ++---
 mm/khugepaged.c          |  4 +--
 mm/page_alloc.c          | 24 ++++++++++-----
 mm/vmscan.c              | 77 ++++++++++++++++++++++++------------------------
 8 files changed, 77 insertions(+), 69 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e3d6d42722a0..e19c081c794e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -372,14 +372,6 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
-#ifdef CONFIG_NUMA
-	/*
-	 * zone reclaim becomes active if more unmapped pages exist.
-	 */
-	unsigned long		min_unmapped_pages;
-	unsigned long		min_slab_pages;
-#endif /* CONFIG_NUMA */
-
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
@@ -525,7 +517,6 @@ struct zone {
 } ____cacheline_internodealigned_in_smp;
 
 enum zone_flags {
-	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
 };
 
@@ -540,6 +531,7 @@ enum pgdat_flags {
 	PGDAT_WRITEBACK,		/* reclaim scanning has recently found
 					 * many pages under writeback
 					 */
+	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -688,6 +680,14 @@ typedef struct pglist_data {
 	 */
 	unsigned long		totalreserve_pages;
 
+#ifdef CONFIG_NUMA
+	/*
+	 * zone reclaim becomes active if more unmapped pages exist.
+	 */
+	unsigned long		min_unmapped_pages;
+	unsigned long		min_slab_pages;
+#endif /* CONFIG_NUMA */
+
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
 	spinlock_t		lru_lock;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2a23ddc96edd..b17cc4830fa6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -326,13 +326,14 @@ extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
 #ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
+extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
 #else
-#define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+#define node_reclaim_mode 0
+static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
+				unsigned int order)
 {
 	return 0;
 }
diff --git a/include/linux/topology.h b/include/linux/topology.h
index afce69296ac0..cb0775e1ee4b 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -54,7 +54,7 @@ int arch_update_cpu_topology(void);
 /*
  * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
  * (in whatever arch specific measurement units returned by node_distance())
- * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim()
+ * and node_reclaim_mode is enabled then the VM will only call node_reclaim()
  * on nodes within this distance.
  */
 #define RECLAIM_DISTANCE 30
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index de331c3858e5..6e47ebe5384e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1498,8 +1498,8 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "zone_reclaim_mode",
-		.data		= &zone_reclaim_mode,
-		.maxlen		= sizeof(zone_reclaim_mode),
+		.data		= &node_reclaim_mode,
+		.maxlen		= sizeof(node_reclaim_mode),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 		.extra1		= &zero,
diff --git a/mm/internal.h b/mm/internal.h
index 2f80d0343c56..1e21b2d3838d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -433,10 +433,10 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 }
 #endif /* CONFIG_SPARSEMEM */
 
-#define ZONE_RECLAIM_NOSCAN	-2
-#define ZONE_RECLAIM_FULL	-1
-#define ZONE_RECLAIM_SOME	0
-#define ZONE_RECLAIM_SUCCESS	1
+#define NODE_RECLAIM_NOSCAN	-2
+#define NODE_RECLAIM_FULL	-1
+#define NODE_RECLAIM_SOME	0
+#define NODE_RECLAIM_SUCCESS	1
 
 extern int hwpoison_filter(struct page *p);
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d907cdc3dc28..bb49bd1d2d9f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -672,10 +672,10 @@ static bool khugepaged_scan_abort(int nid)
 	int i;
 
 	/*
-	 * If zone_reclaim_mode is disabled, then no extra effort is made to
+	 * If node_reclaim_mode is disabled, then no extra effort is made to
 	 * allocate memory locally.
 	 */
-	if (!zone_reclaim_mode)
+	if (!node_reclaim_mode)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e6ee52f1c15f..de825b07e233 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2985,16 +2985,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			if (alloc_flags & ALLOC_NO_WATERMARKS)
 				goto try_this_zone;
 
-			if (zone_reclaim_mode == 0 ||
+			if (node_reclaim_mode == 0 ||
 			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
 				continue;
 
-			ret = zone_reclaim(zone, gfp_mask, order);
+			ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
 			switch (ret) {
-			case ZONE_RECLAIM_NOSCAN:
+			case NODE_RECLAIM_NOSCAN:
 				/* did not scan */
 				continue;
-			case ZONE_RECLAIM_FULL:
+			case NODE_RECLAIM_FULL:
 				/* scanned but unreclaimable */
 				continue;
 			default:
@@ -5991,9 +5991,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
 #ifdef CONFIG_NUMA
 		zone->node = nid;
-		zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
+		pgdat->min_unmapped_pages += (freesize*sysctl_min_unmapped_ratio)
 						/ 100;
-		zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
+		pgdat->min_slab_pages += (freesize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
 		zone->zone_pgdat = pgdat;
@@ -6970,6 +6970,7 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int rc;
 
@@ -6977,8 +6978,11 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 	if (rc)
 		return rc;
 
+	for_each_online_pgdat(pgdat)
+		pgdat->min_slab_pages = 0;
+
 	for_each_zone(zone)
-		zone->min_unmapped_pages = (zone->managed_pages *
+		zone->zone_pgdat->min_unmapped_pages += (zone->managed_pages *
 				sysctl_min_unmapped_ratio) / 100;
 	return 0;
 }
@@ -6986,6 +6990,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *table, int write,
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
+	struct pglist_data *pgdat;
 	struct zone *zone;
 	int rc;
 
@@ -6993,8 +6998,11 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *table, int write,
 	if (rc)
 		return rc;
 
+	for_each_online_pgdat(pgdat)
+		pgdat->min_slab_pages = 0;
+
 	for_each_zone(zone)
-		zone->min_slab_pages = (zone->managed_pages *
+		zone->zone_pgdat->min_slab_pages += (zone->managed_pages *
 				sysctl_min_slab_ratio) / 100;
 	return 0;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc820bbe9c01..e12b0fd2044c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3563,12 +3563,12 @@ module_init(kswapd_init)
 
 #ifdef CONFIG_NUMA
 /*
- * Zone reclaim mode
+ * Node reclaim mode
  *
- * If non-zero call zone_reclaim when the number of free pages falls below
+ * If non-zero call node_reclaim when the number of free pages falls below
  * the watermarks.
  */
-int zone_reclaim_mode __read_mostly;
+int node_reclaim_mode __read_mostly;
 
 #define RECLAIM_OFF 0
 #define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
@@ -3576,14 +3576,14 @@ int zone_reclaim_mode __read_mostly;
 #define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
 
 /*
- * Priority for ZONE_RECLAIM. This determines the fraction of pages
+ * Priority for NODE_RECLAIM. This determines the fraction of pages
  * of a node considered for each zone_reclaim. 4 scans 1/16th of
  * a zone.
  */
-#define ZONE_RECLAIM_PRIORITY 4
+#define NODE_RECLAIM_PRIORITY 4
 
 /*
- * Percentage of pages in a zone that must be unmapped for zone_reclaim to
+ * Percentage of pages in a zone that must be unmapped for node_reclaim to
  * occur.
  */
 int sysctl_min_unmapped_ratio = 1;
@@ -3609,7 +3609,7 @@ static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
 }
 
 /* Work out how many page cache pages we can reclaim in this reclaim_mode */
-static unsigned long zone_pagecache_reclaimable(struct zone *zone)
+static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 {
 	unsigned long nr_pagecache_reclaimable;
 	unsigned long delta = 0;
@@ -3620,14 +3620,14 @@ static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 	 * pages like swapcache and node_unmapped_file_pages() provides
 	 * a better estimate
 	 */
-	if (zone_reclaim_mode & RECLAIM_UNMAP)
-		nr_pagecache_reclaimable = node_page_state(zone->zone_pgdat, NR_FILE_PAGES);
+	if (node_reclaim_mode & RECLAIM_UNMAP)
+		nr_pagecache_reclaimable = node_page_state(pgdat, NR_FILE_PAGES);
 	else
-		nr_pagecache_reclaimable = node_unmapped_file_pages(zone->zone_pgdat);
+		nr_pagecache_reclaimable = node_unmapped_file_pages(pgdat);
 
 	/* If we can't clean pages, remove dirty pages from consideration */
-	if (!(zone_reclaim_mode & RECLAIM_WRITE))
-		delta += node_page_state(zone->zone_pgdat, NR_FILE_DIRTY);
+	if (!(node_reclaim_mode & RECLAIM_WRITE))
+		delta += node_page_state(pgdat, NR_FILE_DIRTY);
 
 	/* Watch for any possible underflows due to delta */
 	if (unlikely(delta > nr_pagecache_reclaimable))
@@ -3637,23 +3637,24 @@ static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
- * Try to free up some pages from this zone through reclaim.
+ * Try to free up some pages from this node through reclaim.
  */
-static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
 	/* Minimum pages needed in order to stay on node */
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
+	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
 		.order = order,
-		.priority = ZONE_RECLAIM_PRIORITY,
-		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(zone_reclaim_mode & RECLAIM_UNMAP),
+		.priority = NODE_RECLAIM_PRIORITY,
+		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
+		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
-		.reclaim_idx = zone_idx(zone),
+		.reclaim_idx = classzone_idx,
 	};
 
 	cond_resched();
@@ -3667,13 +3668,13 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
+	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
 		/*
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_node(zone->zone_pgdat, &sc, zone_idx(zone));
+			shrink_node(pgdat, &sc, classzone_idx);
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
@@ -3683,49 +3684,47 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	return sc.nr_reclaimed >= nr_pages;
 }
 
-int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
-	int node_id;
 	int ret;
 
 	/*
-	 * Zone reclaim reclaims unmapped file backed pages and
+	 * Node reclaim reclaims unmapped file backed pages and
 	 * slab pages if we are over the defined limits.
 	 *
 	 * A small portion of unmapped file backed pages is needed for
 	 * file I/O otherwise pages read by file I/O will be immediately
-	 * thrown out if the zone is overallocated. So we do not reclaim
-	 * if less than a specified percentage of the zone is used by
+	 * thrown out if the node is overallocated. So we do not reclaim
+	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
 	 */
-	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
-	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
-		return ZONE_RECLAIM_FULL;
+	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
+	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+		return NODE_RECLAIM_FULL;
 
-	if (!pgdat_reclaimable(zone->zone_pgdat))
-		return ZONE_RECLAIM_FULL;
+	if (!pgdat_reclaimable(pgdat))
+		return NODE_RECLAIM_FULL;
 
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
 	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
-		return ZONE_RECLAIM_NOSCAN;
+		return NODE_RECLAIM_NOSCAN;
 
 	/*
-	 * Only run zone reclaim on the local zone or on zones that do not
+	 * Only run node reclaim on the local node or on nodes that do not
 	 * have associated processors. This will favor the local processor
 	 * over remote processors and spread off node memory allocations
 	 * as wide as possible.
 	 */
-	node_id = zone_to_nid(zone);
-	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
-		return ZONE_RECLAIM_NOSCAN;
+	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
+		return NODE_RECLAIM_NOSCAN;
 
-	if (test_and_set_bit(ZONE_RECLAIM_LOCKED, &zone->flags))
-		return ZONE_RECLAIM_NOSCAN;
+	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
+		return NODE_RECLAIM_NOSCAN;
 
-	ret = __zone_reclaim(zone, gfp_mask, order);
-	clear_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
+	ret = __node_reclaim(pgdat, gfp_mask, order);
+	clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 
 	if (!ret)
 		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 24/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

shrink_node receives all information it needs about classzone_idx
from sc->reclaim_idx so remove the aliases.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmscan.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e12b0fd2044c..bba71b6c9a4c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2426,8 +2426,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	return true;
 }
 
-static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
-			enum zone_type classzone_idx)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
@@ -2656,7 +2655,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
 		last_pgdat = zone->zone_pgdat;
-		shrink_node(zone->zone_pgdat, sc, classzone_idx);
+		shrink_node(zone->zone_pgdat, sc);
 	}
 
 	/*
@@ -3080,7 +3079,6 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
  * This is used to determine if the scanning priority needs to be raised.
  */
 static bool kswapd_shrink_node(pg_data_t *pgdat,
-			       int classzone_idx,
 			       struct scan_control *sc)
 {
 	struct zone *zone;
@@ -3088,7 +3086,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 
 	/* Reclaim a number of pages proportional to the number of zones */
 	sc->nr_to_reclaim = 0;
-	for (z = 0; z <= classzone_idx; z++) {
+	for (z = 0; z <= sc->reclaim_idx; z++) {
 		zone = pgdat->node_zones + z;
 		if (!populated_zone(zone))
 			continue;
@@ -3100,7 +3098,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 	 * Historically care was taken to put equal pressure on all zones but
 	 * now pressure is applied based on node LRU order.
 	 */
-	shrink_node(pgdat, sc, classzone_idx);
+	shrink_node(pgdat, sc);
 
 	/*
 	 * Fragmentation may mean that the system cannot be rebalanced for
@@ -3162,7 +3160,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				if (!populated_zone(zone))
 					continue;
 
-				classzone_idx = i;
+				sc.reclaim_idx = i;
 				break;
 			}
 		}
@@ -3175,12 +3173,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * zone was balanced even under extreme pressure when the
 		 * overall node may be congested.
 		 */
-		for (i = classzone_idx; i >= 0; i--) {
+		for (i = sc.reclaim_idx; i >= 0; i--) {
 			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, classzone_idx))
+			if (zone_balanced(zone, sc.order, sc.reclaim_idx))
 				goto out;
 		}
 
@@ -3211,7 +3209,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * enough pages are already being scanned that that high
 		 * watermark would be met at 100% efficiency.
 		 */
-		if (kswapd_shrink_node(pgdat, classzone_idx, &sc))
+		if (kswapd_shrink_node(pgdat, &sc))
 			raise_priority = false;
 
 		/*
@@ -3674,7 +3672,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_node(pgdat, &sc, classzone_idx);
+			shrink_node(pgdat, &sc);
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 24/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

shrink_node receives all information it needs about classzone_idx
from sc->reclaim_idx so remove the aliases.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmscan.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e12b0fd2044c..bba71b6c9a4c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2426,8 +2426,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	return true;
 }
 
-static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
-			enum zone_type classzone_idx)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
@@ -2656,7 +2655,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
 		last_pgdat = zone->zone_pgdat;
-		shrink_node(zone->zone_pgdat, sc, classzone_idx);
+		shrink_node(zone->zone_pgdat, sc);
 	}
 
 	/*
@@ -3080,7 +3079,6 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
  * This is used to determine if the scanning priority needs to be raised.
  */
 static bool kswapd_shrink_node(pg_data_t *pgdat,
-			       int classzone_idx,
 			       struct scan_control *sc)
 {
 	struct zone *zone;
@@ -3088,7 +3086,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 
 	/* Reclaim a number of pages proportional to the number of zones */
 	sc->nr_to_reclaim = 0;
-	for (z = 0; z <= classzone_idx; z++) {
+	for (z = 0; z <= sc->reclaim_idx; z++) {
 		zone = pgdat->node_zones + z;
 		if (!populated_zone(zone))
 			continue;
@@ -3100,7 +3098,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 	 * Historically care was taken to put equal pressure on all zones but
 	 * now pressure is applied based on node LRU order.
 	 */
-	shrink_node(pgdat, sc, classzone_idx);
+	shrink_node(pgdat, sc);
 
 	/*
 	 * Fragmentation may mean that the system cannot be rebalanced for
@@ -3162,7 +3160,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				if (!populated_zone(zone))
 					continue;
 
-				classzone_idx = i;
+				sc.reclaim_idx = i;
 				break;
 			}
 		}
@@ -3175,12 +3173,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * zone was balanced even under extreme pressure when the
 		 * overall node may be congested.
 		 */
-		for (i = classzone_idx; i >= 0; i--) {
+		for (i = sc.reclaim_idx; i >= 0; i--) {
 			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, classzone_idx))
+			if (zone_balanced(zone, sc.order, sc.reclaim_idx))
 				goto out;
 		}
 
@@ -3211,7 +3209,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * enough pages are already being scanned that that high
 		 * watermark would be met at 100% efficiency.
 		 */
-		if (kswapd_shrink_node(pgdat, classzone_idx, &sc))
+		if (kswapd_shrink_node(pgdat, &sc))
 			raise_priority = false;
 
 		/*
@@ -3674,7 +3672,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_node(pgdat, &sc, classzone_idx);
+			shrink_node(pgdat, &sc);
 		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 	}
 
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 25/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The scan_control structure has enough information available for
compaction_ready() to make a decision. The classzone_idx manipulations in
shrink_zones() are no longer necessary as the highest populated zone is
no longer used to determine if shrink_slab should be called or not.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmscan.c | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bba71b6c9a4c..6d5c78e2312b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2521,7 +2521,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
  * Returns true if compaction should go ahead for a high-order request, or
  * the high-order allocation would succeed without compaction.
  */
-static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
+static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 {
 	unsigned long watermark;
 	bool watermark_ok;
@@ -2532,21 +2532,21 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
 	 * there is a buffer of free pages available to give compaction
 	 * a reasonable chance of completing and allocating the page
 	 */
-	watermark = high_wmark_pages(zone) + (2UL << order);
-	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
+	watermark = high_wmark_pages(zone) + (2UL << sc->order);
+	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
 
 	/*
 	 * If compaction is deferred, reclaim up to a point where
 	 * compaction will have a chance of success when re-enabled
 	 */
-	if (compaction_deferred(zone, order))
+	if (compaction_deferred(zone, sc->order))
 		return watermark_ok;
 
 	/*
 	 * If compaction is not ready to start and allocation is not likely
 	 * to succeed without it, then keep reclaiming.
 	 */
-	if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
+	if (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx) == COMPACT_SKIPPED)
 		return false;
 
 	return watermark_ok;
@@ -2567,7 +2567,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
-	enum zone_type classzone_idx;
 	pg_data_t *last_pgdat = NULL;
 
 	/*
@@ -2578,7 +2577,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	orig_mask = sc->gfp_mask;
 	if (buffer_heads_over_limit) {
 		sc->gfp_mask |= __GFP_HIGHMEM;
-		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);
+		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
 	}
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
@@ -2587,17 +2586,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			continue;
 
 		/*
-		 * Note that reclaim_idx does not change as it is the highest
-		 * zone reclaimed from which for empty zones is a no-op but
-		 * classzone_idx is used by shrink_node to test if the slabs
-		 * should be shrunk on a given node.
-		 */
-		classzone_idx = sc->reclaim_idx;
-		while (!populated_zone(zone->zone_pgdat->node_zones +
-							classzone_idx))
-			classzone_idx--;
-
-		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
@@ -2621,8 +2609,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			 */
 			if (IS_ENABLED(CONFIG_COMPACTION) &&
 			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-			    zonelist_zone_idx(z) <= classzone_idx &&
-			    compaction_ready(zone, sc->order, classzone_idx)) {
+			    zonelist_zone_idx(z) <= sc->reclaim_idx &&
+			    compaction_ready(zone, sc)) {
 				sc->compaction_ready = true;
 				continue;
 			}
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 25/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The scan_control structure has enough information available for
compaction_ready() to make a decision. The classzone_idx manipulations in
shrink_zones() are no longer necessary as the highest populated zone is
no longer used to determine if shrink_slab should be called or not.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmscan.c | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bba71b6c9a4c..6d5c78e2312b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2521,7 +2521,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
  * Returns true if compaction should go ahead for a high-order request, or
  * the high-order allocation would succeed without compaction.
  */
-static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
+static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 {
 	unsigned long watermark;
 	bool watermark_ok;
@@ -2532,21 +2532,21 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
 	 * there is a buffer of free pages available to give compaction
 	 * a reasonable chance of completing and allocating the page
 	 */
-	watermark = high_wmark_pages(zone) + (2UL << order);
-	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
+	watermark = high_wmark_pages(zone) + (2UL << sc->order);
+	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
 
 	/*
 	 * If compaction is deferred, reclaim up to a point where
 	 * compaction will have a chance of success when re-enabled
 	 */
-	if (compaction_deferred(zone, order))
+	if (compaction_deferred(zone, sc->order))
 		return watermark_ok;
 
 	/*
 	 * If compaction is not ready to start and allocation is not likely
 	 * to succeed without it, then keep reclaiming.
 	 */
-	if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
+	if (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx) == COMPACT_SKIPPED)
 		return false;
 
 	return watermark_ok;
@@ -2567,7 +2567,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
-	enum zone_type classzone_idx;
 	pg_data_t *last_pgdat = NULL;
 
 	/*
@@ -2578,7 +2577,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	orig_mask = sc->gfp_mask;
 	if (buffer_heads_over_limit) {
 		sc->gfp_mask |= __GFP_HIGHMEM;
-		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);
+		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
 	}
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
@@ -2587,17 +2586,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			continue;
 
 		/*
-		 * Note that reclaim_idx does not change as it is the highest
-		 * zone reclaimed from which for empty zones is a no-op but
-		 * classzone_idx is used by shrink_node to test if the slabs
-		 * should be shrunk on a given node.
-		 */
-		classzone_idx = sc->reclaim_idx;
-		while (!populated_zone(zone->zone_pgdat->node_zones +
-							classzone_idx))
-			classzone_idx--;
-
-		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
@@ -2621,8 +2609,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			 */
 			if (IS_ENABLED(CONFIG_COMPACTION) &&
 			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-			    zonelist_zone_idx(z) <= classzone_idx &&
-			    compaction_ready(zone, sc->order, classzone_idx)) {
+			    zonelist_zone_idx(z) <= sc->reclaim_idx &&
+			    compaction_ready(zone, sc)) {
 				sc->compaction_ready = true;
 				continue;
 			}
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 26/34] mm, vmscan: avoid passing in remaining unnecessarily to prepare_kswapd_sleep
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep
always passes in 0 for remaining and the second call can trivially
check the parameter in advance.

Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d5c78e2312b..8a67aa53aa7b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3020,15 +3020,10 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
  *
  * Returns true if kswapd is ready to sleep
  */
-static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
-					int classzone_idx)
+static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
 
-	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
-	if (remaining)
-		return false;
-
 	/*
 	 * The throttled processes are normally woken up in balance_pgdat() as
 	 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
@@ -3243,7 +3238,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
 		/*
 		 * Compaction records what page blocks it recently failed to
 		 * isolate pages from and skips them in the future scanning.
@@ -3278,7 +3273,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
+	if (!remaining &&
+	    prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 26/34] mm, vmscan: avoid passing in remaining unnecessarily to prepare_kswapd_sleep
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep
always passes in 0 for remaining and the second call can trivially
check the parameter in advance.

Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d5c78e2312b..8a67aa53aa7b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3020,15 +3020,10 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
  *
  * Returns true if kswapd is ready to sleep
  */
-static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
-					int classzone_idx)
+static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
 
-	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
-	if (remaining)
-		return false;
-
 	/*
 	 * The throttled processes are normally woken up in balance_pgdat() as
 	 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
@@ -3243,7 +3238,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
 		/*
 		 * Compaction records what page blocks it recently failed to
 		 * isolate pages from and skips them in the future scanning.
@@ -3278,7 +3273,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) {
+	if (!remaining &&
+	    prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 27/34] mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The buffer_heads_over_limit limit in kswapd is inconsistent with direct
reclaim behaviour. It may force an an attempt to reclaim from all zones and
then not reclaim at all because higher zones were balanced than required
by the original request.

This patch will causes kswapd to consider reclaiming from all zones if
buffer_heads_over_limit.  However, if there are eligible zones for the
allocation request that woke kswapd then no reclaim will occur even if
buffer_heads_over_limit. This avoids kswapd over-reclaiming just because
buffer_heads_over_limit.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8a67aa53aa7b..97d0f7997fe7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3122,7 +3122,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
-		.reclaim_idx = classzone_idx,
 	};
 	count_vm_event(PAGEOUTRUN);
 
@@ -3130,12 +3129,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		bool raise_priority = true;
 
 		sc.nr_reclaimed = 0;
+		sc.reclaim_idx = classzone_idx;
 
 		/*
-		 * If the number of buffer_heads in the machine exceeds the
-		 * maximum allowed level then reclaim from all zones. This is
-		 * not specific to highmem as highmem may not exist but it is
-		 * it is expected that buffer_heads are stripped in writeback.
+		 * If the number of buffer_heads exceeds the maximum allowed
+		 * then consider reclaiming from all zones. This is not
+		 * specific to highmem which may not exist but it is it is
+		 * expected that buffer_heads are stripped in writeback.
+		 * Reclaim may still not go ahead if all eligible zones
+		 * for the original allocation request are balanced to
+		 * avoid excessive reclaim from kswapd.
 		 */
 		if (buffer_heads_over_limit) {
 			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
@@ -3154,14 +3157,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * Scanning from low to high zone would allow congestion to be
 		 * cleared during a very small window when a small low
 		 * zone was balanced even under extreme pressure when the
-		 * overall node may be congested.
+		 * overall node may be congested. Note that sc.reclaim_idx
+		 * is not used as buffer_heads_over_limit may have adjusted
+		 * it.
 		 */
-		for (i = sc.reclaim_idx; i >= 0; i--) {
+		for (i = classzone_idx; i >= 0; i--) {
 			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, sc.reclaim_idx))
+			if (zone_balanced(zone, sc.order, classzone_idx))
 				goto out;
 		}
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 27/34] mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The buffer_heads_over_limit limit in kswapd is inconsistent with direct
reclaim behaviour. It may force an an attempt to reclaim from all zones and
then not reclaim at all because higher zones were balanced than required
by the original request.

This patch will causes kswapd to consider reclaiming from all zones if
buffer_heads_over_limit.  However, if there are eligible zones for the
allocation request that woke kswapd then no reclaim will occur even if
buffer_heads_over_limit. This avoids kswapd over-reclaiming just because
buffer_heads_over_limit.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8a67aa53aa7b..97d0f7997fe7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3122,7 +3122,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
-		.reclaim_idx = classzone_idx,
 	};
 	count_vm_event(PAGEOUTRUN);
 
@@ -3130,12 +3129,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		bool raise_priority = true;
 
 		sc.nr_reclaimed = 0;
+		sc.reclaim_idx = classzone_idx;
 
 		/*
-		 * If the number of buffer_heads in the machine exceeds the
-		 * maximum allowed level then reclaim from all zones. This is
-		 * not specific to highmem as highmem may not exist but it is
-		 * it is expected that buffer_heads are stripped in writeback.
+		 * If the number of buffer_heads exceeds the maximum allowed
+		 * then consider reclaiming from all zones. This is not
+		 * specific to highmem which may not exist but it is it is
+		 * expected that buffer_heads are stripped in writeback.
+		 * Reclaim may still not go ahead if all eligible zones
+		 * for the original allocation request are balanced to
+		 * avoid excessive reclaim from kswapd.
 		 */
 		if (buffer_heads_over_limit) {
 			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
@@ -3154,14 +3157,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * Scanning from low to high zone would allow congestion to be
 		 * cleared during a very small window when a small low
 		 * zone was balanced even under extreme pressure when the
-		 * overall node may be congested.
+		 * overall node may be congested. Note that sc.reclaim_idx
+		 * is not used as buffer_heads_over_limit may have adjusted
+		 * it.
 		 */
-		for (i = sc.reclaim_idx; i >= 0; i--) {
+		for (i = classzone_idx; i >= 0; i--) {
 			zone = pgdat->node_zones + i;
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, sc.reclaim_idx))
+			if (zone_balanced(zone, sc.order, classzone_idx))
 				goto out;
 		}
 
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 28/34] mm, vmscan: add classzone information to tracepoints
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

This is convenient when tracking down why the skip count is high because
it'll show what classzone kswapd woke up at and what zones are being
isolated.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/trace/events/vmscan.h | 51 ++++++++++++++++++++++++++-----------------
 mm/vmscan.c                   | 14 +++++++-----
 2 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 897f1aa1ee5f..c88fd0934e7e 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -55,21 +55,23 @@ TRACE_EVENT(mm_vmscan_kswapd_sleep,
 
 TRACE_EVENT(mm_vmscan_kswapd_wake,
 
-	TP_PROTO(int nid, int order),
+	TP_PROTO(int nid, int zid, int order),
 
-	TP_ARGS(nid, order),
+	TP_ARGS(nid, zid, order),
 
 	TP_STRUCT__entry(
 		__field(	int,	nid	)
+		__field(	int,	zid	)
 		__field(	int,	order	)
 	),
 
 	TP_fast_assign(
 		__entry->nid	= nid;
+		__entry->zid    = zid;
 		__entry->order	= order;
 	),
 
-	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+	TP_printk("nid=%d zid=%d order=%d", __entry->nid, __entry->zid, __entry->order)
 );
 
 TRACE_EVENT(mm_vmscan_wakeup_kswapd,
@@ -98,47 +100,50 @@ TRACE_EVENT(mm_vmscan_wakeup_kswapd,
 
 DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
 
-	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int classzone_idx),
 
-	TP_ARGS(order, may_writepage, gfp_flags),
+	TP_ARGS(order, may_writepage, gfp_flags, classzone_idx),
 
 	TP_STRUCT__entry(
 		__field(	int,	order		)
 		__field(	int,	may_writepage	)
 		__field(	gfp_t,	gfp_flags	)
+		__field(	int,	classzone_idx	)
 	),
 
 	TP_fast_assign(
 		__entry->order		= order;
 		__entry->may_writepage	= may_writepage;
 		__entry->gfp_flags	= gfp_flags;
+		__entry->classzone_idx	= classzone_idx;
 	),
 
-	TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+	TP_printk("order=%d may_writepage=%d gfp_flags=%s classzone_idx=%d",
 		__entry->order,
 		__entry->may_writepage,
-		show_gfp_flags(__entry->gfp_flags))
+		show_gfp_flags(__entry->gfp_flags),
+		__entry->classzone_idx)
 );
 
 DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_begin,
 
-	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int classzone_idx),
 
-	TP_ARGS(order, may_writepage, gfp_flags)
+	TP_ARGS(order, may_writepage, gfp_flags, classzone_idx)
 );
 
 DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,
 
-	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int classzone_idx),
 
-	TP_ARGS(order, may_writepage, gfp_flags)
+	TP_ARGS(order, may_writepage, gfp_flags, classzone_idx)
 );
 
 DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin,
 
-	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int classzone_idx),
 
-	TP_ARGS(order, may_writepage, gfp_flags)
+	TP_ARGS(order, may_writepage, gfp_flags, classzone_idx)
 );
 
 DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
@@ -266,16 +271,18 @@ TRACE_EVENT(mm_shrink_slab_end,
 
 DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
-	TP_PROTO(int order,
+	TP_PROTO(int classzone_idx,
+		int order,
 		unsigned long nr_requested,
 		unsigned long nr_scanned,
 		unsigned long nr_taken,
 		isolate_mode_t isolate_mode,
 		int file),
 
-	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
+	TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
 
 	TP_STRUCT__entry(
+		__field(int, classzone_idx)
 		__field(int, order)
 		__field(unsigned long, nr_requested)
 		__field(unsigned long, nr_scanned)
@@ -285,6 +292,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 	),
 
 	TP_fast_assign(
+		__entry->classzone_idx = classzone_idx;
 		__entry->order = order;
 		__entry->nr_requested = nr_requested;
 		__entry->nr_scanned = nr_scanned;
@@ -293,8 +301,9 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 		__entry->file = file;
 	),
 
-	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
+	TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
 		__entry->isolate_mode,
+		__entry->classzone_idx,
 		__entry->order,
 		__entry->nr_requested,
 		__entry->nr_scanned,
@@ -304,27 +313,29 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
 DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
 
-	TP_PROTO(int order,
+	TP_PROTO(int classzone_idx,
+		int order,
 		unsigned long nr_requested,
 		unsigned long nr_scanned,
 		unsigned long nr_taken,
 		isolate_mode_t isolate_mode,
 		int file),
 
-	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+	TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
 
 );
 
 DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
 
-	TP_PROTO(int order,
+	TP_PROTO(int classzone_idx,
+		int order,
 		unsigned long nr_requested,
 		unsigned long nr_scanned,
 		unsigned long nr_taken,
 		isolate_mode_t isolate_mode,
 		int file),
 
-	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+	TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
 
 );
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 97d0f7997fe7..bbd77fa4ec6a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1439,7 +1439,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	if (!list_empty(&pages_skipped))
 		list_splice(&pages_skipped, src);
 	*nr_scanned = scan;
-	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
+	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
 	for (scan = 0; scan < MAX_NR_ZONES; scan++) {
 		nr_pages = nr_zone_taken[scan];
@@ -2888,7 +2888,8 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
-				gfp_mask);
+				gfp_mask,
+				sc.reclaim_idx);
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
@@ -2919,7 +2920,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order,
 						      sc.may_writepage,
-						      sc.gfp_mask);
+						      sc.gfp_mask,
+						      sc.reclaim_idx);
 
 	/*
 	 * NOTE: Although we can get the priority field, using it
@@ -2967,7 +2969,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
-					    sc.gfp_mask);
+					    sc.gfp_mask,
+					    sc.reclaim_idx);
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
@@ -3384,7 +3387,8 @@ static int kswapd(void *p)
 		 * but kcompactd is woken to compact for the original
 		 * request (alloc_order).
 		 */
-		trace_mm_vmscan_kswapd_wake(pgdat->node_id, alloc_order);
+		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
+						alloc_order);
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 28/34] mm, vmscan: add classzone information to tracepoints
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

This is convenient when tracking down why the skip count is high because
it'll show what classzone kswapd woke up at and what zones are being
isolated.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/trace/events/vmscan.h | 51 ++++++++++++++++++++++++++-----------------
 mm/vmscan.c                   | 14 +++++++-----
 2 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 897f1aa1ee5f..c88fd0934e7e 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -55,21 +55,23 @@ TRACE_EVENT(mm_vmscan_kswapd_sleep,
 
 TRACE_EVENT(mm_vmscan_kswapd_wake,
 
-	TP_PROTO(int nid, int order),
+	TP_PROTO(int nid, int zid, int order),
 
-	TP_ARGS(nid, order),
+	TP_ARGS(nid, zid, order),
 
 	TP_STRUCT__entry(
 		__field(	int,	nid	)
+		__field(	int,	zid	)
 		__field(	int,	order	)
 	),
 
 	TP_fast_assign(
 		__entry->nid	= nid;
+		__entry->zid    = zid;
 		__entry->order	= order;
 	),
 
-	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+	TP_printk("nid=%d zid=%d order=%d", __entry->nid, __entry->zid, __entry->order)
 );
 
 TRACE_EVENT(mm_vmscan_wakeup_kswapd,
@@ -98,47 +100,50 @@ TRACE_EVENT(mm_vmscan_wakeup_kswapd,
 
 DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
 
-	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int classzone_idx),
 
-	TP_ARGS(order, may_writepage, gfp_flags),
+	TP_ARGS(order, may_writepage, gfp_flags, classzone_idx),
 
 	TP_STRUCT__entry(
 		__field(	int,	order		)
 		__field(	int,	may_writepage	)
 		__field(	gfp_t,	gfp_flags	)
+		__field(	int,	classzone_idx	)
 	),
 
 	TP_fast_assign(
 		__entry->order		= order;
 		__entry->may_writepage	= may_writepage;
 		__entry->gfp_flags	= gfp_flags;
+		__entry->classzone_idx	= classzone_idx;
 	),
 
-	TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+	TP_printk("order=%d may_writepage=%d gfp_flags=%s classzone_idx=%d",
 		__entry->order,
 		__entry->may_writepage,
-		show_gfp_flags(__entry->gfp_flags))
+		show_gfp_flags(__entry->gfp_flags),
+		__entry->classzone_idx)
 );
 
 DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_begin,
 
-	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int classzone_idx),
 
-	TP_ARGS(order, may_writepage, gfp_flags)
+	TP_ARGS(order, may_writepage, gfp_flags, classzone_idx)
 );
 
 DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,
 
-	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int classzone_idx),
 
-	TP_ARGS(order, may_writepage, gfp_flags)
+	TP_ARGS(order, may_writepage, gfp_flags, classzone_idx)
 );
 
 DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin,
 
-	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int classzone_idx),
 
-	TP_ARGS(order, may_writepage, gfp_flags)
+	TP_ARGS(order, may_writepage, gfp_flags, classzone_idx)
 );
 
 DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
@@ -266,16 +271,18 @@ TRACE_EVENT(mm_shrink_slab_end,
 
 DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
-	TP_PROTO(int order,
+	TP_PROTO(int classzone_idx,
+		int order,
 		unsigned long nr_requested,
 		unsigned long nr_scanned,
 		unsigned long nr_taken,
 		isolate_mode_t isolate_mode,
 		int file),
 
-	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
+	TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
 
 	TP_STRUCT__entry(
+		__field(int, classzone_idx)
 		__field(int, order)
 		__field(unsigned long, nr_requested)
 		__field(unsigned long, nr_scanned)
@@ -285,6 +292,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 	),
 
 	TP_fast_assign(
+		__entry->classzone_idx = classzone_idx;
 		__entry->order = order;
 		__entry->nr_requested = nr_requested;
 		__entry->nr_scanned = nr_scanned;
@@ -293,8 +301,9 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 		__entry->file = file;
 	),
 
-	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
+	TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
 		__entry->isolate_mode,
+		__entry->classzone_idx,
 		__entry->order,
 		__entry->nr_requested,
 		__entry->nr_scanned,
@@ -304,27 +313,29 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
 DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
 
-	TP_PROTO(int order,
+	TP_PROTO(int classzone_idx,
+		int order,
 		unsigned long nr_requested,
 		unsigned long nr_scanned,
 		unsigned long nr_taken,
 		isolate_mode_t isolate_mode,
 		int file),
 
-	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+	TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
 
 );
 
 DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
 
-	TP_PROTO(int order,
+	TP_PROTO(int classzone_idx,
+		int order,
 		unsigned long nr_requested,
 		unsigned long nr_scanned,
 		unsigned long nr_taken,
 		isolate_mode_t isolate_mode,
 		int file),
 
-	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+	TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
 
 );
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 97d0f7997fe7..bbd77fa4ec6a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1439,7 +1439,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	if (!list_empty(&pages_skipped))
 		list_splice(&pages_skipped, src);
 	*nr_scanned = scan;
-	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
+	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
 	for (scan = 0; scan < MAX_NR_ZONES; scan++) {
 		nr_pages = nr_zone_taken[scan];
@@ -2888,7 +2888,8 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
-				gfp_mask);
+				gfp_mask,
+				sc.reclaim_idx);
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
@@ -2919,7 +2920,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order,
 						      sc.may_writepage,
-						      sc.gfp_mask);
+						      sc.gfp_mask,
+						      sc.reclaim_idx);
 
 	/*
 	 * NOTE: Although we can get the priority field, using it
@@ -2967,7 +2969,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
-					    sc.gfp_mask);
+					    sc.gfp_mask,
+					    sc.reclaim_idx);
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
@@ -3384,7 +3387,8 @@ static int kswapd(void *p)
 		 * but kcompactd is woken to compact for the original
 		 * request (alloc_order).
 		 */
-		trace_mm_vmscan_kswapd_wake(pgdat->node_id, alloc_order);
+		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
+						alloc_order);
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 29/34] mm, page_alloc: remove fair zone allocation policy
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The fair zone allocation policy interleaves allocation requests between
zones to avoid an age inversion problem whereby new pages are reclaimed to
balance a zone.  Reclaim is now node-based so this should no longer be an
issue and the fair zone allocation policy is not free.  This patch removes
it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  5 ----
 mm/internal.h          |  1 -
 mm/page_alloc.c        | 75 +-------------------------------------------------
 mm/vmstat.c            |  4 +--
 4 files changed, 2 insertions(+), 83 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e19c081c794e..bd33e6f1bed0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,7 +110,6 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_ALLOC_BATCH,
 	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
 	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
 	NR_ZONE_LRU_FILE,
@@ -516,10 +515,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 } ____cacheline_internodealigned_in_smp;
 
-enum zone_flags {
-	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
-};
-
 enum pgdat_flags {
 	PGDAT_CONGESTED,		/* pgdat has many dirty pages backed by
 					 * a congested BDI
diff --git a/mm/internal.h b/mm/internal.h
index 1e21b2d3838d..28932cd6a195 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -467,7 +467,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
-#define ALLOC_FAIR		0x100 /* fair zone allocation */
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index de825b07e233..565f08832853 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2630,7 +2630,6 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 			else
 				page = list_first_entry(list, struct page, lru);
 
-			__dec_zone_state(zone, NR_ALLOC_BATCH);
 			list_del(&page->lru);
 			pcp->count--;
 
@@ -2656,15 +2655,10 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
-		__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
 		__mod_zone_freepage_state(zone, -(1 << order),
 					  get_pcppage_migratetype(page));
 	}
 
-	if (atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]) <= 0 &&
-	    !test_bit(ZONE_FAIR_DEPLETED, &zone->flags))
-		set_bit(ZONE_FAIR_DEPLETED, &zone->flags);
-
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
@@ -2875,40 +2869,18 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 }
 
 #ifdef CONFIG_NUMA
-static bool zone_local(struct zone *local_zone, struct zone *zone)
-{
-	return local_zone->node == zone->node;
-}
-
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
 				RECLAIM_DISTANCE;
 }
 #else	/* CONFIG_NUMA */
-static bool zone_local(struct zone *local_zone, struct zone *zone)
-{
-	return true;
-}
-
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return true;
 }
 #endif	/* CONFIG_NUMA */
 
-static void reset_alloc_batches(struct zone *preferred_zone)
-{
-	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
-
-	do {
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-		clear_bit(ZONE_FAIR_DEPLETED, &zone->flags);
-	} while (zone++ != preferred_zone);
-}
-
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -2919,10 +2891,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 {
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
-	bool fair_skipped = false;
-	bool apply_fair = (alloc_flags & ALLOC_FAIR);
-
-zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
@@ -2937,23 +2905,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			!__cpuset_zone_allowed(zone, gfp_mask))
 				continue;
 		/*
-		 * Distribute pages in proportion to the individual
-		 * zone size to ensure fair page aging.  The zone a
-		 * page was allocated in should have no effect on the
-		 * time the page has in memory before being reclaimed.
-		 */
-		if (apply_fair) {
-			if (test_bit(ZONE_FAIR_DEPLETED, &zone->flags)) {
-				fair_skipped = true;
-				continue;
-			}
-			if (!zone_local(ac->preferred_zoneref->zone, zone)) {
-				if (fair_skipped)
-					goto reset_fair;
-				apply_fair = false;
-			}
-		}
-		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a node that is within its dirty
 		 * limit, such that no single node holds more than its
@@ -3024,23 +2975,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}
 
-	/*
-	 * The first pass makes sure allocations are spread fairly within the
-	 * local node.  However, the local node might have free pages left
-	 * after the fairness batches are exhausted, and remote zones haven't
-	 * even been considered yet.  Try once more without fairness, and
-	 * include remote zones now, before entering the slowpath and waking
-	 * kswapd: prefer spilling to a remote zone over swapping locally.
-	 */
-	if (fair_skipped) {
-reset_fair:
-		apply_fair = false;
-		fair_skipped = false;
-		reset_alloc_batches(ac->preferred_zoneref->zone);
-		z = ac->preferred_zoneref;
-		goto zonelist_scan;
-	}
-
 	return NULL;
 }
 
@@ -3789,7 +3723,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
-	unsigned int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
+	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = {
 		.high_zoneidx = gfp_zone(gfp_mask),
@@ -6001,9 +5935,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone_seqlock_init(zone);
 		zone_pcp_init(zone);
 
-		/* For bootup, initialized properly in watermark setup */
-		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
-
 		if (!size)
 			continue;
 
@@ -6856,10 +6787,6 @@ static void __setup_per_zone_wmarks(void)
 		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
-		__mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bc94968400d0..ab7f78995c89 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,7 +921,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
-	"nr_alloc_batch",
 	"nr_zone_anon_lru",
 	"nr_zone_file_lru",
 	"nr_zone_write_pending",
@@ -1632,10 +1631,9 @@ int vmstat_refresh(struct ctl_table *table, int write,
 		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
 			switch (i) {
-			case NR_ALLOC_BATCH:
 			case NR_PAGES_SCANNED:
 				/*
-				 * These are often seen to go negative in
+				 * This is often seen to go negative in
 				 * recent kernels, but not to go permanently
 				 * negative.  Whilst it would be nicer not to
 				 * have exceptions, rooting them out would be
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 29/34] mm, page_alloc: remove fair zone allocation policy
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The fair zone allocation policy interleaves allocation requests between
zones to avoid an age inversion problem whereby new pages are reclaimed to
balance a zone.  Reclaim is now node-based so this should no longer be an
issue and the fair zone allocation policy is not free.  This patch removes
it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  5 ----
 mm/internal.h          |  1 -
 mm/page_alloc.c        | 75 +-------------------------------------------------
 mm/vmstat.c            |  4 +--
 4 files changed, 2 insertions(+), 83 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e19c081c794e..bd33e6f1bed0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,7 +110,6 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_ALLOC_BATCH,
 	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
 	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
 	NR_ZONE_LRU_FILE,
@@ -516,10 +515,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 } ____cacheline_internodealigned_in_smp;
 
-enum zone_flags {
-	ZONE_FAIR_DEPLETED,		/* fair zone policy batch depleted */
-};
-
 enum pgdat_flags {
 	PGDAT_CONGESTED,		/* pgdat has many dirty pages backed by
 					 * a congested BDI
diff --git a/mm/internal.h b/mm/internal.h
index 1e21b2d3838d..28932cd6a195 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -467,7 +467,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
-#define ALLOC_FAIR		0x100 /* fair zone allocation */
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index de825b07e233..565f08832853 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2630,7 +2630,6 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 			else
 				page = list_first_entry(list, struct page, lru);
 
-			__dec_zone_state(zone, NR_ALLOC_BATCH);
 			list_del(&page->lru);
 			pcp->count--;
 
@@ -2656,15 +2655,10 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
-		__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
 		__mod_zone_freepage_state(zone, -(1 << order),
 					  get_pcppage_migratetype(page));
 	}
 
-	if (atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]) <= 0 &&
-	    !test_bit(ZONE_FAIR_DEPLETED, &zone->flags))
-		set_bit(ZONE_FAIR_DEPLETED, &zone->flags);
-
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
@@ -2875,40 +2869,18 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 }
 
 #ifdef CONFIG_NUMA
-static bool zone_local(struct zone *local_zone, struct zone *zone)
-{
-	return local_zone->node == zone->node;
-}
-
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
 				RECLAIM_DISTANCE;
 }
 #else	/* CONFIG_NUMA */
-static bool zone_local(struct zone *local_zone, struct zone *zone)
-{
-	return true;
-}
-
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return true;
 }
 #endif	/* CONFIG_NUMA */
 
-static void reset_alloc_batches(struct zone *preferred_zone)
-{
-	struct zone *zone = preferred_zone->zone_pgdat->node_zones;
-
-	do {
-		mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-		clear_bit(ZONE_FAIR_DEPLETED, &zone->flags);
-	} while (zone++ != preferred_zone);
-}
-
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -2919,10 +2891,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 {
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
-	bool fair_skipped = false;
-	bool apply_fair = (alloc_flags & ALLOC_FAIR);
-
-zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
@@ -2937,23 +2905,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			!__cpuset_zone_allowed(zone, gfp_mask))
 				continue;
 		/*
-		 * Distribute pages in proportion to the individual
-		 * zone size to ensure fair page aging.  The zone a
-		 * page was allocated in should have no effect on the
-		 * time the page has in memory before being reclaimed.
-		 */
-		if (apply_fair) {
-			if (test_bit(ZONE_FAIR_DEPLETED, &zone->flags)) {
-				fair_skipped = true;
-				continue;
-			}
-			if (!zone_local(ac->preferred_zoneref->zone, zone)) {
-				if (fair_skipped)
-					goto reset_fair;
-				apply_fair = false;
-			}
-		}
-		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a node that is within its dirty
 		 * limit, such that no single node holds more than its
@@ -3024,23 +2975,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}
 
-	/*
-	 * The first pass makes sure allocations are spread fairly within the
-	 * local node.  However, the local node might have free pages left
-	 * after the fairness batches are exhausted, and remote zones haven't
-	 * even been considered yet.  Try once more without fairness, and
-	 * include remote zones now, before entering the slowpath and waking
-	 * kswapd: prefer spilling to a remote zone over swapping locally.
-	 */
-	if (fair_skipped) {
-reset_fair:
-		apply_fair = false;
-		fair_skipped = false;
-		reset_alloc_batches(ac->preferred_zoneref->zone);
-		z = ac->preferred_zoneref;
-		goto zonelist_scan;
-	}
-
 	return NULL;
 }
 
@@ -3789,7 +3723,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
-	unsigned int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
+	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = {
 		.high_zoneidx = gfp_zone(gfp_mask),
@@ -6001,9 +5935,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone_seqlock_init(zone);
 		zone_pcp_init(zone);
 
-		/* For bootup, initialized properly in watermark setup */
-		mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
-
 		if (!size)
 			continue;
 
@@ -6856,10 +6787,6 @@ static void __setup_per_zone_wmarks(void)
 		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
-		__mod_zone_page_state(zone, NR_ALLOC_BATCH,
-			high_wmark_pages(zone) - low_wmark_pages(zone) -
-			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
-
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bc94968400d0..ab7f78995c89 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,7 +921,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
-	"nr_alloc_batch",
 	"nr_zone_anon_lru",
 	"nr_zone_file_lru",
 	"nr_zone_write_pending",
@@ -1632,10 +1631,9 @@ int vmstat_refresh(struct ctl_table *table, int write,
 		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
 			switch (i) {
-			case NR_ALLOC_BATCH:
 			case NR_PAGES_SCANNED:
 				/*
-				 * These are often seen to go negative in
+				 * This is often seen to go negative in
 				 * recent kernels, but not to go permanently
 				 * negative.  Whilst it would be nicer not to
 				 * have exceptions, rooting them out would be
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 30/34] mm: page_alloc: cache the last node whose dirty limit is reached
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

If a page is about to be dirtied then the page allocator attempts to limit
the total number of dirty pages that exists in any given zone.  The call
to node_dirty_ok is expensive so this patch records if the last pgdat
examined hit the dirty limits.  In some cases, this reduces the number of
calls to node_dirty_ok().

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 565f08832853..958424fc64be 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2891,6 +2891,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 {
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
+	struct pglist_data *last_pgdat_dirty_limit = NULL;
+
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
@@ -2923,8 +2925,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		 * will require awareness of nodes in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (ac->spread_dirty_pages && !node_dirty_ok(zone->zone_pgdat))
-			continue;
+		if (ac->spread_dirty_pages) {
+			if (last_pgdat_dirty_limit == zone->zone_pgdat)
+				continue;
+
+			if (!node_dirty_ok(zone->zone_pgdat)) {
+				last_pgdat_dirty_limit = zone->zone_pgdat;
+				continue;
+			}
+		}
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_fast(zone, order, mark,
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 30/34] mm: page_alloc: cache the last node whose dirty limit is reached
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

If a page is about to be dirtied then the page allocator attempts to limit
the total number of dirty pages that exists in any given zone.  The call
to node_dirty_ok is expensive so this patch records if the last pgdat
examined hit the dirty limits.  In some cases, this reduces the number of
calls to node_dirty_ok().

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 565f08832853..958424fc64be 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2891,6 +2891,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 {
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
+	struct pglist_data *last_pgdat_dirty_limit = NULL;
+
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
@@ -2923,8 +2925,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		 * will require awareness of nodes in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (ac->spread_dirty_pages && !node_dirty_ok(zone->zone_pgdat))
-			continue;
+		if (ac->spread_dirty_pages) {
+			if (last_pgdat_dirty_limit == zone->zone_pgdat)
+				continue;
+
+			if (!node_dirty_ok(zone->zone_pgdat)) {
+				last_pgdat_dirty_limit = zone->zone_pgdat;
+				continue;
+			}
+		}
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_fast(zone, order, mark,
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 31/34] mm: vmstat: replace __count_zone_vm_events with a zone id equivalent
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

This is partially a preparation patch for more vmstat work but it also has
the slight advantage that __count_zid_vm_events is cheaper to calculate
than __count_zone_vm_events().

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/vmstat.h | 5 ++---
 mm/page_alloc.c        | 2 +-
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 6b7975cd98aa..613771909b6e 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -101,9 +101,8 @@ static inline void vm_events_fold_cpu(int cpu)
 #define count_vm_vmacache_event(x) do {} while (0)
 #endif
 
-#define __count_zone_vm_events(item, zone, delta) \
-		__count_vm_events(item##_NORMAL - ZONE_NORMAL + \
-		zone_idx(zone), delta)
+#define __count_zid_vm_events(item, zid, delta) \
+	__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
 
 /*
  * Zone and node-based page accounting with per cpu differentials.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 958424fc64be..030114f55b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2659,7 +2659,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 					  get_pcppage_migratetype(page));
 	}
 
-	__count_zone_vm_events(PGALLOC, zone, 1 << order);
+	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 31/34] mm: vmstat: replace __count_zone_vm_events with a zone id equivalent
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

This is partially a preparation patch for more vmstat work but it also has
the slight advantage that __count_zid_vm_events is cheaper to calculate
than __count_zone_vm_events().

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/vmstat.h | 5 ++---
 mm/page_alloc.c        | 2 +-
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 6b7975cd98aa..613771909b6e 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -101,9 +101,8 @@ static inline void vm_events_fold_cpu(int cpu)
 #define count_vm_vmacache_event(x) do {} while (0)
 #endif
 
-#define __count_zone_vm_events(item, zone, delta) \
-		__count_vm_events(item##_NORMAL - ZONE_NORMAL + \
-		zone_idx(zone), delta)
+#define __count_zid_vm_events(item, zid, delta) \
+	__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
 
 /*
  * Zone and node-based page accounting with per cpu differentials.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 958424fc64be..030114f55b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2659,7 +2659,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 					  get_pcppage_migratetype(page));
 	}
 
-	__count_zone_vm_events(PGALLOC, zone, 1 << order);
+	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
 
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 32/34] mm: vmstat: account per-zone stalls and pages skipped during reclaim
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The vmstat allocstall was fairly useful in the general sense but
node-based LRUs change that.  It's important to know if a stall was for an
address-limited allocation request as this will require skipping pages
from other zones.  This patch adds pgstall_* counters to replace
allocstall.  The sum of the counters will equal the old allocstall so it
can be trivially recalculated.  A high number of address-limited
allocation requests may result in a lot of useless LRU scanning for
suitable pages.

As address-limited allocations require pages to be skipped, it's important
to know how much useless LRU scanning took place so this patch adds
pgskip* counters.  This yields the following model

1. The number of address-space limited stalls can be accounted for (pgstall)
2. The amount of useless work required to reclaim the data is accounted (pgskip)
3. The total number of scans is available from pgscan_kswapd and pgscan_direct
   so from that the ratio of useful to useless scans can be calculated.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/vm_event_item.h |  4 +++-
 mm/vmscan.c                   | 15 +++++++++++++--
 mm/vmstat.c                   |  3 ++-
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1798ff542517..6d47f66f0e9c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -23,6 +23,8 @@
 
 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
+		FOR_ALL_ZONES(PGSTALL),
+		FOR_ALL_ZONES(PGSCAN_SKIP),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
@@ -37,7 +39,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
-		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		PAGEOUTRUN, PGROTATED,
 		DROP_PAGECACHE, DROP_SLAB,
 #ifdef CONFIG_NUMA_BALANCING
 		NUMA_PTE_UPDATES,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bbd77fa4ec6a..a8b432ff0d14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1394,6 +1394,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	struct list_head *src = &lruvec->lists[lru];
 	unsigned long nr_taken = 0;
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
+	unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
 	unsigned long scan, nr_pages;
 	LIST_HEAD(pages_skipped);
 
@@ -1408,6 +1409,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		if (page_zonenum(page) > sc->reclaim_idx) {
 			list_move(&page->lru, &pages_skipped);
+			nr_skipped[page_zonenum(page)]++;
 			continue;
 		}
 
@@ -1436,8 +1438,17 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	 * scanning would soon rescan the same pages to skip and put the
 	 * system at risk of premature OOM.
 	 */
-	if (!list_empty(&pages_skipped))
+	if (!list_empty(&pages_skipped)) {
+		int zid;
+
 		list_splice(&pages_skipped, src);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+			if (!nr_skipped[zid])
+				continue;
+
+			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
+		}
+	}
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
@@ -2679,7 +2690,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	delayacct_freepages_start();
 
 	if (global_reclaim(sc))
-		count_vm_event(ALLOCSTALL);
+		__count_zid_vm_events(PGSTALL, sc->reclaim_idx, 1);
 
 	do {
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ab7f78995c89..af070cf5e762 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -983,6 +983,8 @@ const char * const vmstat_text[] = {
 	"pswpout",
 
 	TEXTS_FOR_ZONES("pgalloc")
+	TEXTS_FOR_ZONES("pgstall")
+	TEXTS_FOR_ZONES("pgskip")
 
 	"pgfree",
 	"pgactivate",
@@ -1008,7 +1010,6 @@ const char * const vmstat_text[] = {
 	"kswapd_low_wmark_hit_quickly",
 	"kswapd_high_wmark_hit_quickly",
 	"pageoutrun",
-	"allocstall",
 
 	"pgrotated",
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 32/34] mm: vmstat: account per-zone stalls and pages skipped during reclaim
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The vmstat allocstall was fairly useful in the general sense but
node-based LRUs change that.  It's important to know if a stall was for an
address-limited allocation request as this will require skipping pages
from other zones.  This patch adds pgstall_* counters to replace
allocstall.  The sum of the counters will equal the old allocstall so it
can be trivially recalculated.  A high number of address-limited
allocation requests may result in a lot of useless LRU scanning for
suitable pages.

As address-limited allocations require pages to be skipped, it's important
to know how much useless LRU scanning took place so this patch adds
pgskip* counters.  This yields the following model

1. The number of address-space limited stalls can be accounted for (pgstall)
2. The amount of useless work required to reclaim the data is accounted (pgskip)
3. The total number of scans is available from pgscan_kswapd and pgscan_direct
   so from that the ratio of useful to useless scans can be calculated.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/vm_event_item.h |  4 +++-
 mm/vmscan.c                   | 15 +++++++++++++--
 mm/vmstat.c                   |  3 ++-
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1798ff542517..6d47f66f0e9c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -23,6 +23,8 @@
 
 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
+		FOR_ALL_ZONES(PGSTALL),
+		FOR_ALL_ZONES(PGSCAN_SKIP),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
@@ -37,7 +39,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
-		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		PAGEOUTRUN, PGROTATED,
 		DROP_PAGECACHE, DROP_SLAB,
 #ifdef CONFIG_NUMA_BALANCING
 		NUMA_PTE_UPDATES,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bbd77fa4ec6a..a8b432ff0d14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1394,6 +1394,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	struct list_head *src = &lruvec->lists[lru];
 	unsigned long nr_taken = 0;
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
+	unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
 	unsigned long scan, nr_pages;
 	LIST_HEAD(pages_skipped);
 
@@ -1408,6 +1409,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		if (page_zonenum(page) > sc->reclaim_idx) {
 			list_move(&page->lru, &pages_skipped);
+			nr_skipped[page_zonenum(page)]++;
 			continue;
 		}
 
@@ -1436,8 +1438,17 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	 * scanning would soon rescan the same pages to skip and put the
 	 * system at risk of premature OOM.
 	 */
-	if (!list_empty(&pages_skipped))
+	if (!list_empty(&pages_skipped)) {
+		int zid;
+
 		list_splice(&pages_skipped, src);
+		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+			if (!nr_skipped[zid])
+				continue;
+
+			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
+		}
+	}
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
@@ -2679,7 +2690,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	delayacct_freepages_start();
 
 	if (global_reclaim(sc))
-		count_vm_event(ALLOCSTALL);
+		__count_zid_vm_events(PGSTALL, sc->reclaim_idx, 1);
 
 	do {
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ab7f78995c89..af070cf5e762 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -983,6 +983,8 @@ const char * const vmstat_text[] = {
 	"pswpout",
 
 	TEXTS_FOR_ZONES("pgalloc")
+	TEXTS_FOR_ZONES("pgstall")
+	TEXTS_FOR_ZONES("pgskip")
 
 	"pgfree",
 	"pgactivate",
@@ -1008,7 +1010,6 @@ const char * const vmstat_text[] = {
 	"kswapd_low_wmark_hit_quickly",
 	"kswapd_high_wmark_hit_quickly",
 	"pageoutrun",
-	"allocstall",
 
 	"pgrotated",
 
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 33/34] mm, vmstat: print node-based stats in zoneinfo file
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

There are a number of stats that were previously accessible via zoneinfo
that are now invisible. While it is possible to create a new file for the
node stats, this may be missed by users. Instead this patch prints the
stats under the first populated zone in /proc/zoneinfo.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmstat.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index af070cf5e762..60372f31fee3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1415,11 +1415,35 @@ static const struct file_operations pagetypeinfo_file_ops = {
 	.release	= seq_release,
 };
 
+static bool is_zone_first_populated(pg_data_t *pgdat, struct zone *zone)
+{
+	int zid;
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		struct zone *compare = &pgdat->node_zones[zid];
+
+		if (populated_zone(compare))
+			return zone == compare;
+	}
+
+	/* The zone must be somewhere! */
+	WARN_ON_ONCE(1);
+	return false;
+}
+
 static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 							struct zone *zone)
 {
 	int i;
 	seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+	if (is_zone_first_populated(pgdat, zone)) {
+		seq_printf(m, "\n  per-node stats");
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+			seq_printf(m, "\n      %-12s %lu",
+				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+				node_page_state(pgdat, i));
+		}
+	}
 	seq_printf(m,
 		   "\n  pages free     %lu"
 		   "\n        min      %lu"
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 33/34] mm, vmstat: print node-based stats in zoneinfo file
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

There are a number of stats that were previously accessible via zoneinfo
that are now invisible. While it is possible to create a new file for the
node stats, this may be missed by users. Instead this patch prints the
stats under the first populated zone in /proc/zoneinfo.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmstat.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index af070cf5e762..60372f31fee3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1415,11 +1415,35 @@ static const struct file_operations pagetypeinfo_file_ops = {
 	.release	= seq_release,
 };
 
+static bool is_zone_first_populated(pg_data_t *pgdat, struct zone *zone)
+{
+	int zid;
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		struct zone *compare = &pgdat->node_zones[zid];
+
+		if (populated_zone(compare))
+			return zone == compare;
+	}
+
+	/* The zone must be somewhere! */
+	WARN_ON_ONCE(1);
+	return false;
+}
+
 static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 							struct zone *zone)
 {
 	int i;
 	seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+	if (is_zone_first_populated(pgdat, zone)) {
+		seq_printf(m, "\n  per-node stats");
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+			seq_printf(m, "\n      %-12s %lu",
+				vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
+				node_page_state(pgdat, i));
+		}
+	}
 	seq_printf(m,
 		   "\n  pages free     %lu"
 		   "\n        min      %lu"
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries
  2016-07-08  9:34 ` Mel Gorman
@ 2016-07-08  9:35   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The number of LRU pages, dirty pages and writeback pages must be accounted
for on both zones and nodes because of the reclaim retry logic, compaction
retry logic and highmem calculations all depending on per-zone stats.

Many lowmem allocations are immune from OOM kill due to a check in
__alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The exception
is costly high-order allocations or allocations that cannot fail. If the
__alloc_pages_may_oom avoids OOM-kill for low-order lowmem allocations
then it would fall through to __alloc_pages_direct_compact.

This patch will blindly retry reclaim for zone-constrained allocations
in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal but
without per-zone stats there are not many alternatives. The impact it that
zone-constrained allocations may delay before considering the OOM killer.

As there is no guarantee enough memory can ever be freed to satisfy
compaction, this patch avoids retrying compaction for zone-contrained
allocations.

In combination, that means that the per-node stats can be used when deciding
whether to continue reclaim using a rough approximation.  While it is
possible this will make the wrong decision on occasion, it will not infinite
loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES.

The final step is calculating the number of dirtyable highmem pages. As
those calculations only care about the global count of file pages in
highmem. This patch uses a global counter used instead of per-zone stats
as it is sufficient.

In combination, this allows the per-zone LRU and dirty state counters to
be removed.

Suggested by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 include/linux/mm_inline.h | 24 +++++++++++++++++++++---
 include/linux/mmzone.h    |  4 ----
 include/linux/swap.h      |  1 -
 mm/compaction.c           | 20 +++++++++++++++++++-
 mm/migrate.c              |  2 --
 mm/page-writeback.c       | 13 +++++--------
 mm/page_alloc.c           | 47 +++++++++++++++++++++++++++++++++++++++--------
 mm/vmscan.c               | 16 ----------------
 mm/vmstat.c               |  3 ---
 9 files changed, 84 insertions(+), 46 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9aadcc781857..ccd40e357b56 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,26 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>
 
+#ifdef CONFIG_HIGHMEM
+extern atomic_t highmem_file_pages;
+
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+							int nr_pages)
+{
+	if (is_highmem_idx(zid) && is_file_lru(lru)) {
+		if (nr_pages > 0)
+			atomic_add(nr_pages, &highmem_file_pages);
+		else
+			atomic_sub(nr_pages, &highmem_file_pages);
+	}
+}
+#else
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+							int nr_pages)
+{
+}
+#endif
+
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -29,9 +49,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
-	__mod_zone_page_state(&pgdat->node_zones[zid],
-		NR_ZONE_LRU_BASE + !!is_file_lru(lru),
-		nr_pages);
+	acct_highmem_file_pages(zid, lru, nr_pages);
 }
 
 static __always_inline void update_lru_size(struct lruvec *lruvec,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd33e6f1bed0..a3b7f45aac56 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,10 +110,6 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
-	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
-	NR_ZONE_LRU_FILE,
-	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..cc753c639e3d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/compaction.c b/mm/compaction.c
index a0bd85712516..640532831b94 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1446,6 +1446,11 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 {
 	struct zone *zone;
 	struct zoneref *z;
+	pg_data_t *last_pgdat = NULL;
+
+	/* Do not retry compaction for zone-constrained allocations */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		return false;
 
 	/*
 	 * Make sure at least one zone would pass __compaction_suitable if we continue
@@ -1456,14 +1461,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		unsigned long available;
 		enum compact_result compact_result;
 
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+
+		/*
+		 * This over-estimates the number of pages available for
+		 * reclaim/compaction but walking the LRU would take too
+		 * long. The consequences are that compaction may retry
+		 * longer than it should for a zone-constrained allocation
+		 * request.
+		 */
+		last_pgdat = zone->zone_pgdat;
+		available = pgdat_reclaimable_pages(zone->zone_pgdat) / order;
+
 		/*
 		 * Do not consider all the reclaimable memory because we do not
 		 * want to trash just for a single high order allocation which
 		 * is even not guaranteed to appear even if __compaction_suitable
 		 * is happy about the watermark check.
 		 */
-		available = zone_reclaimable_pages(zone) / order;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+		available = min(zone->managed_pages, available);
 		compact_result = __compaction_suitable(zone, order, alloc_flags,
 				ac_classzone_idx(ac), available);
 		if (compact_result != COMPACT_SKIPPED &&
diff --git a/mm/migrate.c b/mm/migrate.c
index c77997dc6ed7..ed2f85e61de1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 		}
 		if (dirty && mapping_cap_account_dirty(mapping)) {
 			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
-			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
 			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
-			__dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);
 		}
 	}
 	local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3c02aa603f5a..0bca2376bd42 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 
 	return nr_pages;
 }
+#ifdef CONFIG_HIGHMEM
+atomic_t highmem_file_pages;
+#endif
 
 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
@@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 	int node;
 	unsigned long x = 0;
 	int i;
+	unsigned long dirtyable = atomic_read(&highmem_file_pages);
 
 	for_each_node_state(node, N_HIGH_MEMORY) {
 		for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
 			struct zone *z;
-			unsigned long dirtyable;
 
 			if (!is_highmem_idx(i))
 				continue;
 
 			z = &NODE_DATA(node)->node_zones[i];
-			dirtyable = zone_page_state(z, NR_FREE_PAGES) +
-				zone_page_state(z, NR_ZONE_LRU_FILE);
+			dirtyable += zone_page_state(z, NR_FREE_PAGES);
 
 			/* watch for underflows */
 			dirtyable -= min(dirtyable, high_wmark_pages(z));
@@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		__inc_node_page_state(page, NR_FILE_DIRTY);
-		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
@@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
 	if (mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		dec_node_page_state(page, NR_FILE_DIRTY);
-		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_SIZE);
 	}
@@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page)
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 			dec_node_page_state(page, NR_FILE_DIRTY);
-			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
@@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page)
 	if (ret) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK);
-		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		inc_node_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
@@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	if (!ret) {
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		inc_node_page_state(page, NR_WRITEBACK);
-		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 030114f55b0e..a702c31d2df4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
+	pg_data_t *current_pgdat = NULL;
 
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
@@ -3454,6 +3455,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/*
+	 * Blindly retry lowmem allocation requests that are often ignored by
+	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
+	 * and fast means of calculating reclaimable, dirty and writeback pages
+	 * in eligible zones.
+	 */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		goto out;
+
+	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
 	 * If none of the target zones can satisfy our allocation request even
 	 * if all reclaimable pages are considered then we are screwed and have
@@ -3463,18 +3473,38 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
+		int zid;
 
-		available = reclaimable = zone_reclaimable_pages(zone);
+		if (current_pgdat == zone->zone_pgdat)
+			continue;
+
+		current_pgdat = zone->zone_pgdat;
+		available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
-		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+		/* Account for all free pages on eligible zones */
+		for (zid = 0; zid <= zone_idx(zone); zid++) {
+			struct zone *acct_zone = &current_pgdat->node_zones[zid];
+
+			available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
+		}
 
 		/*
 		 * Would the allocation succeed if we reclaimed the whole
-		 * available?
+		 * available? This is approximate because there is no
+		 * accurate count of reclaimable pages per zone.
 		 */
-		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
-				ac_classzone_idx(ac), alloc_flags, available)) {
+		for (zid = 0; zid <= zone_idx(zone); zid++) {
+			struct zone *check_zone = &current_pgdat->node_zones[zid];
+			unsigned long estimate;
+
+			estimate = min(check_zone->managed_pages, available);
+			if (!__zone_watermark_ok(check_zone, order,
+					min_wmark_pages(check_zone), ac_classzone_idx(ac),
+					alloc_flags, estimate))
+				continue;
+
 			/*
 			 * If we didn't make any progress and have a lot of
 			 * dirty + writeback pages then we should wait for
@@ -3484,15 +3514,16 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 			if (!did_some_progress) {
 				unsigned long write_pending;
 
-				write_pending = zone_page_state_snapshot(zone,
-							NR_ZONE_WRITE_PENDING);
+				write_pending =
+					node_page_state(current_pgdat, NR_WRITEBACK) +
+					node_page_state(current_pgdat, NR_FILE_DIRTY);
 
 				if (2 * write_pending > reclaimable) {
 					congestion_wait(BLK_RW_ASYNC, HZ/10);
 					return true;
 				}
 			}
-
+out:
 			/*
 			 * Memory allocation/reclaim might be called from a WQ
 			 * context and the current implementation of the WQ
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a8b432ff0d14..d079210d46ee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-/*
- * This misses isolated pages which are not accounted for to save counters.
- * As the data only determines if reclaim or compaction continues, it is
- * not expected that isolated pages will be a dominating factor.
- */
-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-	unsigned long nr;
-
-	nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
-	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
-
-	return nr;
-}
-
 unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 60372f31fee3..7415775faf08 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,9 +921,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
-	"nr_zone_anon_lru",
-	"nr_zone_file_lru",
-	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries
@ 2016-07-08  9:35   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-08  9:35 UTC (permalink / raw)
  To: Andrew Morton, Linux-MM
  Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, Minchan Kim,
	Joonsoo Kim, LKML, Mel Gorman

The number of LRU pages, dirty pages and writeback pages must be accounted
for on both zones and nodes because of the reclaim retry logic, compaction
retry logic and highmem calculations all depending on per-zone stats.

Many lowmem allocations are immune from OOM kill due to a check in
__alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The exception
is costly high-order allocations or allocations that cannot fail. If the
__alloc_pages_may_oom avoids OOM-kill for low-order lowmem allocations
then it would fall through to __alloc_pages_direct_compact.

This patch will blindly retry reclaim for zone-constrained allocations
in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal but
without per-zone stats there are not many alternatives. The impact it that
zone-constrained allocations may delay before considering the OOM killer.

As there is no guarantee enough memory can ever be freed to satisfy
compaction, this patch avoids retrying compaction for zone-contrained
allocations.

In combination, that means that the per-node stats can be used when deciding
whether to continue reclaim using a rough approximation.  While it is
possible this will make the wrong decision on occasion, it will not infinite
loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES.

The final step is calculating the number of dirtyable highmem pages. As
those calculations only care about the global count of file pages in
highmem. This patch uses a global counter used instead of per-zone stats
as it is sufficient.

In combination, this allows the per-zone LRU and dirty state counters to
be removed.

Suggested by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 include/linux/mm_inline.h | 24 +++++++++++++++++++++---
 include/linux/mmzone.h    |  4 ----
 include/linux/swap.h      |  1 -
 mm/compaction.c           | 20 +++++++++++++++++++-
 mm/migrate.c              |  2 --
 mm/page-writeback.c       | 13 +++++--------
 mm/page_alloc.c           | 47 +++++++++++++++++++++++++++++++++++++++--------
 mm/vmscan.c               | 16 ----------------
 mm/vmstat.c               |  3 ---
 9 files changed, 84 insertions(+), 46 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9aadcc781857..ccd40e357b56 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,26 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>
 
+#ifdef CONFIG_HIGHMEM
+extern atomic_t highmem_file_pages;
+
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+							int nr_pages)
+{
+	if (is_highmem_idx(zid) && is_file_lru(lru)) {
+		if (nr_pages > 0)
+			atomic_add(nr_pages, &highmem_file_pages);
+		else
+			atomic_sub(nr_pages, &highmem_file_pages);
+	}
+}
+#else
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+							int nr_pages)
+{
+}
+#endif
+
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -29,9 +49,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	__mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
-	__mod_zone_page_state(&pgdat->node_zones[zid],
-		NR_ZONE_LRU_BASE + !!is_file_lru(lru),
-		nr_pages);
+	acct_highmem_file_pages(zid, lru, nr_pages);
 }
 
 static __always_inline void update_lru_size(struct lruvec *lruvec,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd33e6f1bed0..a3b7f45aac56 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,10 +110,6 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
-	NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
-	NR_ZONE_LRU_FILE,
-	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..cc753c639e3d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/compaction.c b/mm/compaction.c
index a0bd85712516..640532831b94 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1446,6 +1446,11 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 {
 	struct zone *zone;
 	struct zoneref *z;
+	pg_data_t *last_pgdat = NULL;
+
+	/* Do not retry compaction for zone-constrained allocations */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		return false;
 
 	/*
 	 * Make sure at least one zone would pass __compaction_suitable if we continue
@@ -1456,14 +1461,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		unsigned long available;
 		enum compact_result compact_result;
 
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+
+		/*
+		 * This over-estimates the number of pages available for
+		 * reclaim/compaction but walking the LRU would take too
+		 * long. The consequences are that compaction may retry
+		 * longer than it should for a zone-constrained allocation
+		 * request.
+		 */
+		last_pgdat = zone->zone_pgdat;
+		available = pgdat_reclaimable_pages(zone->zone_pgdat) / order;
+
 		/*
 		 * Do not consider all the reclaimable memory because we do not
 		 * want to trash just for a single high order allocation which
 		 * is even not guaranteed to appear even if __compaction_suitable
 		 * is happy about the watermark check.
 		 */
-		available = zone_reclaimable_pages(zone) / order;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+		available = min(zone->managed_pages, available);
 		compact_result = __compaction_suitable(zone, order, alloc_flags,
 				ac_classzone_idx(ac), available);
 		if (compact_result != COMPACT_SKIPPED &&
diff --git a/mm/migrate.c b/mm/migrate.c
index c77997dc6ed7..ed2f85e61de1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 		}
 		if (dirty && mapping_cap_account_dirty(mapping)) {
 			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
-			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
 			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
-			__dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);
 		}
 	}
 	local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3c02aa603f5a..0bca2376bd42 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 
 	return nr_pages;
 }
+#ifdef CONFIG_HIGHMEM
+atomic_t highmem_file_pages;
+#endif
 
 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
@@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 	int node;
 	unsigned long x = 0;
 	int i;
+	unsigned long dirtyable = atomic_read(&highmem_file_pages);
 
 	for_each_node_state(node, N_HIGH_MEMORY) {
 		for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
 			struct zone *z;
-			unsigned long dirtyable;
 
 			if (!is_highmem_idx(i))
 				continue;
 
 			z = &NODE_DATA(node)->node_zones[i];
-			dirtyable = zone_page_state(z, NR_FREE_PAGES) +
-				zone_page_state(z, NR_ZONE_LRU_FILE);
+			dirtyable += zone_page_state(z, NR_FREE_PAGES);
 
 			/* watch for underflows */
 			dirtyable -= min(dirtyable, high_wmark_pages(z));
@@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		__inc_node_page_state(page, NR_FILE_DIRTY);
-		__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		__inc_node_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
 		__inc_wb_stat(wb, WB_DIRTIED);
@@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
 	if (mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 		dec_node_page_state(page, NR_FILE_DIRTY);
-		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_SIZE);
 	}
@@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page)
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
 			dec_node_page_state(page, NR_FILE_DIRTY);
-			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
@@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page)
 	if (ret) {
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		dec_node_page_state(page, NR_WRITEBACK);
-		dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 		inc_node_page_state(page, NR_WRITTEN);
 	}
 	unlock_page_memcg(page);
@@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	if (!ret) {
 		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
 		inc_node_page_state(page, NR_WRITEBACK);
-		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 030114f55b0e..a702c31d2df4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
+	pg_data_t *current_pgdat = NULL;
 
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
@@ -3454,6 +3455,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/*
+	 * Blindly retry lowmem allocation requests that are often ignored by
+	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
+	 * and fast means of calculating reclaimable, dirty and writeback pages
+	 * in eligible zones.
+	 */
+	if (ac->high_zoneidx < ZONE_NORMAL)
+		goto out;
+
+	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
 	 * If none of the target zones can satisfy our allocation request even
 	 * if all reclaimable pages are considered then we are screwed and have
@@ -3463,18 +3473,38 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
+		int zid;
 
-		available = reclaimable = zone_reclaimable_pages(zone);
+		if (current_pgdat == zone->zone_pgdat)
+			continue;
+
+		current_pgdat = zone->zone_pgdat;
+		available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
-		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+		/* Account for all free pages on eligible zones */
+		for (zid = 0; zid <= zone_idx(zone); zid++) {
+			struct zone *acct_zone = &current_pgdat->node_zones[zid];
+
+			available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
+		}
 
 		/*
 		 * Would the allocation succeed if we reclaimed the whole
-		 * available?
+		 * available? This is approximate because there is no
+		 * accurate count of reclaimable pages per zone.
 		 */
-		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
-				ac_classzone_idx(ac), alloc_flags, available)) {
+		for (zid = 0; zid <= zone_idx(zone); zid++) {
+			struct zone *check_zone = &current_pgdat->node_zones[zid];
+			unsigned long estimate;
+
+			estimate = min(check_zone->managed_pages, available);
+			if (!__zone_watermark_ok(check_zone, order,
+					min_wmark_pages(check_zone), ac_classzone_idx(ac),
+					alloc_flags, estimate))
+				continue;
+
 			/*
 			 * If we didn't make any progress and have a lot of
 			 * dirty + writeback pages then we should wait for
@@ -3484,15 +3514,16 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 			if (!did_some_progress) {
 				unsigned long write_pending;
 
-				write_pending = zone_page_state_snapshot(zone,
-							NR_ZONE_WRITE_PENDING);
+				write_pending =
+					node_page_state(current_pgdat, NR_WRITEBACK) +
+					node_page_state(current_pgdat, NR_FILE_DIRTY);
 
 				if (2 * write_pending > reclaimable) {
 					congestion_wait(BLK_RW_ASYNC, HZ/10);
 					return true;
 				}
 			}
-
+out:
 			/*
 			 * Memory allocation/reclaim might be called from a WQ
 			 * context and the current implementation of the WQ
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a8b432ff0d14..d079210d46ee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-/*
- * This misses isolated pages which are not accounted for to save counters.
- * As the data only determines if reclaim or compaction continues, it is
- * not expected that isolated pages will be a dominating factor.
- */
-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-	unsigned long nr;
-
-	nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
-	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
-
-	return nr;
-}
-
 unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 60372f31fee3..7415775faf08 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,9 +921,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
-	"nr_zone_anon_lru",
-	"nr_zone_file_lru",
-	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/34] mm, vmscan: move lru_lock to the node
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 11:06     ` Balbir Singh
  -1 siblings, 0 replies; 220+ messages in thread
From: Balbir Singh @ 2016-07-12 11:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:38AM +0100, Mel Gorman wrote:
> Node-based reclaim requires node-based LRUs and locking.  This is a
> preparation patch that just moves the lru_lock to the node so later
> patches are easier to review.  It is a mechanical change but note this
> patch makes contention worse because the LRU lock is hotter and direct
> reclaim and kswapd can contend on the same lock even when reclaiming from
> different zones.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Minchan Kim <minchan@kernel.org>
> ---
>  Documentation/cgroup-v1/memcg_test.txt |  4 +--
>  Documentation/cgroup-v1/memory.txt     |  4 +--
>  include/linux/mm_types.h               |  2 +-
>  include/linux/mmzone.h                 | 10 +++++--
>  mm/compaction.c                        | 10 +++----
>  mm/filemap.c                           |  4 +--
>  mm/huge_memory.c                       |  6 ++---
>  mm/memcontrol.c                        |  6 ++---
>  mm/mlock.c                             | 10 +++----
>  mm/page_alloc.c                        |  4 +--
>  mm/page_idle.c                         |  4 +--
>  mm/rmap.c                              |  2 +-
>  mm/swap.c                              | 30 ++++++++++-----------
>  mm/vmscan.c                            | 48 +++++++++++++++++-----------------
>  14 files changed, 75 insertions(+), 69 deletions(-)
> 
> diff --git a/Documentation/cgroup-v1/memcg_test.txt b/Documentation/cgroup-v1/memcg_test.txt
> index 8870b0212150..78a8c2963b38 100644
> --- a/Documentation/cgroup-v1/memcg_test.txt
> +++ b/Documentation/cgroup-v1/memcg_test.txt
> @@ -107,9 +107,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
>  
>  8. LRU
>          Each memcg has its own private LRU. Now, its handling is under global
> -	VM's control (means that it's handled under global zone->lru_lock).
> +	VM's control (means that it's handled under global zone_lru_lock).
>  	Almost all routines around memcg's LRU is called by global LRU's
> -	list management functions under zone->lru_lock().
> +	list management functions under zone_lru_lock().
>  
>  	A special function is mem_cgroup_isolate_pages(). This scans
>  	memcg's private LRU and call __isolate_lru_page() to extract a page
> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> index b14abf217239..946e69103cdd 100644
> --- a/Documentation/cgroup-v1/memory.txt
> +++ b/Documentation/cgroup-v1/memory.txt
> @@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
>     Other lock order is following:
>     PG_locked.
>     mm->page_table_lock
> -       zone->lru_lock
> +       zone_lru_lock

zone_lru_lock is a little confusing, can't we just call it
node_lru_lock?

>  	  lock_page_cgroup.
>    In many cases, just lock_page_cgroup() is called.
>    per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
> -  zone->lru_lock, it has no lock of its own.
> +  zone_lru_lock, it has no lock of its own.
>  
>  2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e093e1d3285b..ca2ed9a6c8d8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -118,7 +118,7 @@ struct page {
>  	 */
>  	union {
>  		struct list_head lru;	/* Pageout list, eg. active_list
> -					 * protected by zone->lru_lock !
> +					 * protected by zone_lru_lock !
>  					 * Can be used as a generic list
>  					 * by the page owner.
>  					 */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 078ecb81e209..cfa870107abe 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -93,7 +93,7 @@ struct free_area {
>  struct pglist_data;
>  
>  /*
> - * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
> + * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
>   * So add a wild amount of padding here to ensure that they fall into separate
>   * cachelines.  There are very few zone structures in the machine, so space
>   * consumption is not a concern here.
> @@ -496,7 +496,6 @@ struct zone {
>  	/* Write-intensive fields used by page reclaim */
>  
>  	/* Fields commonly accessed by the page reclaim scanner */
> -	spinlock_t		lru_lock;
>  	struct lruvec		lruvec;
>  
>  	/*
> @@ -690,6 +689,9 @@ typedef struct pglist_data {
>  	/* Number of pages migrated during the rate limiting time interval */
>  	unsigned long numabalancing_migrate_nr_pages;
>  #endif
> +	/* Write-intensive fields used by page reclaim */
> +	ZONE_PADDING(_pad1_)a

I thought this was to have zone->lock and zone->lru_lock in different
cachelines, do we still need the padding here?

> +	spinlock_t		lru_lock;
>

Balbir Singh.  

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/34] mm, vmscan: move lru_lock to the node
@ 2016-07-12 11:06     ` Balbir Singh
  0 siblings, 0 replies; 220+ messages in thread
From: Balbir Singh @ 2016-07-12 11:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:38AM +0100, Mel Gorman wrote:
> Node-based reclaim requires node-based LRUs and locking.  This is a
> preparation patch that just moves the lru_lock to the node so later
> patches are easier to review.  It is a mechanical change but note this
> patch makes contention worse because the LRU lock is hotter and direct
> reclaim and kswapd can contend on the same lock even when reclaiming from
> different zones.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Minchan Kim <minchan@kernel.org>
> ---
>  Documentation/cgroup-v1/memcg_test.txt |  4 +--
>  Documentation/cgroup-v1/memory.txt     |  4 +--
>  include/linux/mm_types.h               |  2 +-
>  include/linux/mmzone.h                 | 10 +++++--
>  mm/compaction.c                        | 10 +++----
>  mm/filemap.c                           |  4 +--
>  mm/huge_memory.c                       |  6 ++---
>  mm/memcontrol.c                        |  6 ++---
>  mm/mlock.c                             | 10 +++----
>  mm/page_alloc.c                        |  4 +--
>  mm/page_idle.c                         |  4 +--
>  mm/rmap.c                              |  2 +-
>  mm/swap.c                              | 30 ++++++++++-----------
>  mm/vmscan.c                            | 48 +++++++++++++++++-----------------
>  14 files changed, 75 insertions(+), 69 deletions(-)
> 
> diff --git a/Documentation/cgroup-v1/memcg_test.txt b/Documentation/cgroup-v1/memcg_test.txt
> index 8870b0212150..78a8c2963b38 100644
> --- a/Documentation/cgroup-v1/memcg_test.txt
> +++ b/Documentation/cgroup-v1/memcg_test.txt
> @@ -107,9 +107,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
>  
>  8. LRU
>          Each memcg has its own private LRU. Now, its handling is under global
> -	VM's control (means that it's handled under global zone->lru_lock).
> +	VM's control (means that it's handled under global zone_lru_lock).
>  	Almost all routines around memcg's LRU is called by global LRU's
> -	list management functions under zone->lru_lock().
> +	list management functions under zone_lru_lock().
>  
>  	A special function is mem_cgroup_isolate_pages(). This scans
>  	memcg's private LRU and call __isolate_lru_page() to extract a page
> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> index b14abf217239..946e69103cdd 100644
> --- a/Documentation/cgroup-v1/memory.txt
> +++ b/Documentation/cgroup-v1/memory.txt
> @@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
>     Other lock order is following:
>     PG_locked.
>     mm->page_table_lock
> -       zone->lru_lock
> +       zone_lru_lock

zone_lru_lock is a little confusing, can't we just call it
node_lru_lock?

>  	  lock_page_cgroup.
>    In many cases, just lock_page_cgroup() is called.
>    per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
> -  zone->lru_lock, it has no lock of its own.
> +  zone_lru_lock, it has no lock of its own.
>  
>  2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e093e1d3285b..ca2ed9a6c8d8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -118,7 +118,7 @@ struct page {
>  	 */
>  	union {
>  		struct list_head lru;	/* Pageout list, eg. active_list
> -					 * protected by zone->lru_lock !
> +					 * protected by zone_lru_lock !
>  					 * Can be used as a generic list
>  					 * by the page owner.
>  					 */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 078ecb81e209..cfa870107abe 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -93,7 +93,7 @@ struct free_area {
>  struct pglist_data;
>  
>  /*
> - * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
> + * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
>   * So add a wild amount of padding here to ensure that they fall into separate
>   * cachelines.  There are very few zone structures in the machine, so space
>   * consumption is not a concern here.
> @@ -496,7 +496,6 @@ struct zone {
>  	/* Write-intensive fields used by page reclaim */
>  
>  	/* Fields commonly accessed by the page reclaim scanner */
> -	spinlock_t		lru_lock;
>  	struct lruvec		lruvec;
>  
>  	/*
> @@ -690,6 +689,9 @@ typedef struct pglist_data {
>  	/* Number of pages migrated during the rate limiting time interval */
>  	unsigned long numabalancing_migrate_nr_pages;
>  #endif
> +	/* Write-intensive fields used by page reclaim */
> +	ZONE_PADDING(_pad1_)a

I thought this was to have zone->lock and zone->lru_lock in different
cachelines, do we still need the padding here?

> +	spinlock_t		lru_lock;
>

Balbir Singh.  

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/34] mm, vmscan: move lru_lock to the node
  2016-07-12 11:06     ` Balbir Singh
@ 2016-07-12 11:18       ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-12 11:18 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 09:06:04PM +1000, Balbir Singh wrote:
> > diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> > index b14abf217239..946e69103cdd 100644
> > --- a/Documentation/cgroup-v1/memory.txt
> > +++ b/Documentation/cgroup-v1/memory.txt
> > @@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
> >     Other lock order is following:
> >     PG_locked.
> >     mm->page_table_lock
> > -       zone->lru_lock
> > +       zone_lru_lock
> 
> zone_lru_lock is a little confusing, can't we just call it
> node_lru_lock?
> 

It's a matter of perspective. People familiar with the VM already expect
a zone lock so will be looking for it. I can do a rename if you insist
but it may not actually help.

> > @@ -496,7 +496,6 @@ struct zone {
> >  	/* Write-intensive fields used by page reclaim */
> >  
> >  	/* Fields commonly accessed by the page reclaim scanner */
> > -	spinlock_t		lru_lock;
> >  	struct lruvec		lruvec;
> >  
> >  	/*
> > @@ -690,6 +689,9 @@ typedef struct pglist_data {
> >  	/* Number of pages migrated during the rate limiting time interval */
> >  	unsigned long numabalancing_migrate_nr_pages;
> >  #endif
> > +	/* Write-intensive fields used by page reclaim */
> > +	ZONE_PADDING(_pad1_)a
> 
> I thought this was to have zone->lock and zone->lru_lock in different
> cachelines, do we still need the padding here?
> 

The zone padding current keeps the page lock wait tables, page allocator
lists, compaction and vmstats on separate cache lines. They're still
fine.

The node padding may not be necessary. It currently ensures that zonelists
and numa balancing are separate from the LRU lock but there is no guarantee
the current arrangement is optimal. It would depend on both the kernel
config and the workload but it may be necessary in the future to split
node into read-mostly sections and then different write-intensive sections
similar to what has happened to struct zone in the past.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/34] mm, vmscan: move lru_lock to the node
@ 2016-07-12 11:18       ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-12 11:18 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 09:06:04PM +1000, Balbir Singh wrote:
> > diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> > index b14abf217239..946e69103cdd 100644
> > --- a/Documentation/cgroup-v1/memory.txt
> > +++ b/Documentation/cgroup-v1/memory.txt
> > @@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
> >     Other lock order is following:
> >     PG_locked.
> >     mm->page_table_lock
> > -       zone->lru_lock
> > +       zone_lru_lock
> 
> zone_lru_lock is a little confusing, can't we just call it
> node_lru_lock?
> 

It's a matter of perspective. People familiar with the VM already expect
a zone lock so will be looking for it. I can do a rename if you insist
but it may not actually help.

> > @@ -496,7 +496,6 @@ struct zone {
> >  	/* Write-intensive fields used by page reclaim */
> >  
> >  	/* Fields commonly accessed by the page reclaim scanner */
> > -	spinlock_t		lru_lock;
> >  	struct lruvec		lruvec;
> >  
> >  	/*
> > @@ -690,6 +689,9 @@ typedef struct pglist_data {
> >  	/* Number of pages migrated during the rate limiting time interval */
> >  	unsigned long numabalancing_migrate_nr_pages;
> >  #endif
> > +	/* Write-intensive fields used by page reclaim */
> > +	ZONE_PADDING(_pad1_)a
> 
> I thought this was to have zone->lock and zone->lru_lock in different
> cachelines, do we still need the padding here?
> 

The zone padding current keeps the page lock wait tables, page allocator
lists, compaction and vmstats on separate cache lines. They're still
fine.

The node padding may not be necessary. It currently ensures that zonelists
and numa balancing are separate from the LRU lock but there is no guarantee
the current arrangement is optimal. It would depend on both the kernel
config and the workload but it may be necessary in the future to split
node into read-mostly sections and then different write-intensive sections
similar to what has happened to struct zone in the past.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 04/34] mm, mmzone: clarify the usage of zone padding
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 13:49     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 13:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:40AM +0100, Mel Gorman wrote:
> Zone padding separates write-intensive fields used by page allocation,
> compaction and vmstats but the comments are a little misleading and
> need clarification.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 04/34] mm, mmzone: clarify the usage of zone padding
@ 2016-07-12 13:49     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 13:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:40AM +0100, Mel Gorman wrote:
> Zone padding separates write-intensive fields used by page allocation,
> compaction and vmstats but the comments are a little misleading and
> need clarification.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 05/34] mm, vmscan: begin reclaiming pages on a per-node basis
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 13:54     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 13:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:41AM +0100, Mel Gorman wrote:
> This patch makes reclaim decisions on a per-node basis.  A reclaimer knows
> what zone is required by the allocation request and skips pages from
> higher zones.  In many cases this will be ok because it's a GFP_HIGHMEM
> request of some description.  On 64-bit, ZONE_DMA32 requests will cause
> some problems but 32-bit devices on 64-bit platforms are increasingly
> rare.  Historically it would have been a major problem on 32-bit with big
> Highmem:Lowmem ratios but such configurations are also now rare and even
> where they exist, they are not encouraged.  If it really becomes a
> problem, it'll manifest as very low reclaim efficiencies.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 05/34] mm, vmscan: begin reclaiming pages on a per-node basis
@ 2016-07-12 13:54     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 13:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:41AM +0100, Mel Gorman wrote:
> This patch makes reclaim decisions on a per-node basis.  A reclaimer knows
> what zone is required by the allocation request and skips pages from
> higher zones.  In many cases this will be ok because it's a GFP_HIGHMEM
> request of some description.  On 64-bit, ZONE_DMA32 requests will cause
> some problems but 32-bit devices on 64-bit platforms are increasingly
> rare.  Historically it would have been a major problem on 32-bit with big
> Highmem:Lowmem ratios but such configurations are also now rare and even
> where they exist, they are not encouraged.  If it really becomes a
> problem, it'll manifest as very low reclaim efficiencies.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/34] mm, vmscan: have kswapd only scan based on the highest requested zone
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 14:05     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:42AM +0100, Mel Gorman wrote:
> kswapd checks all eligible zones to see if they need balancing even if it
> was woken for a lower zone.  This made sense when we reclaimed on a
> per-zone basis because we wanted to shrink zones fairly so avoid
> age-inversion problems.  Ideally this is completely unnecessary when
> reclaiming on a per-node basis.  In theory, there may still be anomalies
> when all requests are for lower zones and very old pages are preserved in
> higher zones but this should be the exceptional case.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

I wasn't quite sure at first what the rationale is for this patch,
since it probably won't make much difference in pratice. But I do
agree that the code is cleaner to have kswapd check exactly what it
was asked to check, rather than some do-the-"right"-thing magic.

A hypothetical onslaught of low-zone allocations will wreak havoc to
the page age in higher zones anyway, right? So I don't think that case
matters all that much.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/34] mm, vmscan: have kswapd only scan based on the highest requested zone
@ 2016-07-12 14:05     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:42AM +0100, Mel Gorman wrote:
> kswapd checks all eligible zones to see if they need balancing even if it
> was woken for a lower zone.  This made sense when we reclaimed on a
> per-zone basis because we wanted to shrink zones fairly so avoid
> age-inversion problems.  Ideally this is completely unnecessary when
> reclaiming on a per-node basis.  In theory, there may still be anomalies
> when all requests are for lower zones and very old pages are preserved in
> higher zones but this should be the exceptional case.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

I wasn't quite sure at first what the rationale is for this patch,
since it probably won't make much difference in pratice. But I do
agree that the code is cleaner to have kswapd check exactly what it
was asked to check, rather than some do-the-"right"-thing magic.

A hypothetical onslaught of low-zone allocations will wreak havoc to
the page age in higher zones anyway, right? So I don't think that case
matters all that much.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 08/34] mm, vmscan: remove balance gap
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 14:06     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:44AM +0100, Mel Gorman wrote:
> The balance gap was introduced to apply equal pressure to all zones when
> reclaiming for a higher zone.  With node-based LRU, the need for the
> balance gap is removed and the code is dead so remove it.
> 
> [vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 08/34] mm, vmscan: remove balance gap
@ 2016-07-12 14:06     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:44AM +0100, Mel Gorman wrote:
> The balance gap was introduced to apply equal pressure to all zones when
> reclaiming for a higher zone.  With node-based LRU, the need for the
> balance gap is removed and the code is dead so remove it.
> 
> [vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 14:22     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:47AM +0100, Mel Gorman wrote:
> @@ -3008,7 +3008,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
>  {
>  	unsigned long mark = high_wmark_pages(zone);
>  
> -	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
> +	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
> +		return false;
> +
> +	/*
> +	 * If any eligible zone is balanced then the node is not considered
> +	 * to be congested or dirty
> +	 */
> +	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);

Predicate functions that secretly modify internal state give me the
willies... The diffstat is flat, too. Is this really an improvement?

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
@ 2016-07-12 14:22     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:47AM +0100, Mel Gorman wrote:
> @@ -3008,7 +3008,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
>  {
>  	unsigned long mark = high_wmark_pages(zone);
>  
> -	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
> +	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
> +		return false;
> +
> +	/*
> +	 * If any eligible zone is balanced then the node is not considered
> +	 * to be congested or dirty
> +	 */
> +	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);

Predicate functions that secretly modify internal state give me the
willies... The diffstat is flat, too. Is this really an improvement?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 14:29     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:48AM +0100, Mel Gorman wrote:
> kswapd scans from highest to lowest for a zone that requires balancing.
> This was necessary when reclaim was per-zone to fairly age pages on lower
> zones.  Now that we are reclaiming on a per-node basis, any eligible zone
> can be used and pages will still be aged fairly.  This patch avoids
> reclaiming excessively unless buffer_heads are over the limit and it's
> necessary to reclaim from a higher zone than requested by the waker of
> kswapd to relieve low memory pressure.
> 
> [hillf.zj@alibaba-inc.com: Force kswapd reclaim no more than needed]
> Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgorman@techsingularity.net
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Hillf Danton <hillf.zj@alibaba-inc.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

One and a half observations:

> @@ -3144,31 +3144,39 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  
>  		sc.nr_reclaimed = 0;
>  
> -		/* Scan from the highest requested zone to dma */
> -		for (i = classzone_idx; i >= 0; i--) {
> -			zone = pgdat->node_zones + i;
> -			if (!populated_zone(zone))
> -				continue;
> -
> -			/*
> -			 * If the number of buffer_heads in the machine
> -			 * exceeds the maximum allowed level and this node
> -			 * has a highmem zone, force kswapd to reclaim from
> -			 * it to relieve lowmem pressure.
> -			 */
> -			if (buffer_heads_over_limit && is_highmem_idx(i)) {
> -				classzone_idx = i;
> -				break;
> -			}
> +		/*
> +		 * If the number of buffer_heads in the machine exceeds the
> +		 * maximum allowed level then reclaim from all zones. This is
> +		 * not specific to highmem as highmem may not exist but it is
> +		 * it is expected that buffer_heads are stripped in writeback.

The mention of highmem in this comment make only sense within the
context of this diff; it'll be pretty confusing in the standalone
code.

Also, double "it is" :)

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone
@ 2016-07-12 14:29     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:48AM +0100, Mel Gorman wrote:
> kswapd scans from highest to lowest for a zone that requires balancing.
> This was necessary when reclaim was per-zone to fairly age pages on lower
> zones.  Now that we are reclaiming on a per-node basis, any eligible zone
> can be used and pages will still be aged fairly.  This patch avoids
> reclaiming excessively unless buffer_heads are over the limit and it's
> necessary to reclaim from a higher zone than requested by the waker of
> kswapd to relieve low memory pressure.
> 
> [hillf.zj@alibaba-inc.com: Force kswapd reclaim no more than needed]
> Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgorman@techsingularity.net
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Hillf Danton <hillf.zj@alibaba-inc.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

One and a half observations:

> @@ -3144,31 +3144,39 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  
>  		sc.nr_reclaimed = 0;
>  
> -		/* Scan from the highest requested zone to dma */
> -		for (i = classzone_idx; i >= 0; i--) {
> -			zone = pgdat->node_zones + i;
> -			if (!populated_zone(zone))
> -				continue;
> -
> -			/*
> -			 * If the number of buffer_heads in the machine
> -			 * exceeds the maximum allowed level and this node
> -			 * has a highmem zone, force kswapd to reclaim from
> -			 * it to relieve lowmem pressure.
> -			 */
> -			if (buffer_heads_over_limit && is_highmem_idx(i)) {
> -				classzone_idx = i;
> -				break;
> -			}
> +		/*
> +		 * If the number of buffer_heads in the machine exceeds the
> +		 * maximum allowed level then reclaim from all zones. This is
> +		 * not specific to highmem as highmem may not exist but it is
> +		 * it is expected that buffer_heads are stripped in writeback.

The mention of highmem in this comment make only sense within the
context of this diff; it'll be pretty confusing in the standalone
code.

Also, double "it is" :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 13/34] mm, vmscan: make shrink_node decisions more node-centric
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 14:32     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:49AM +0100, Mel Gorman wrote:
> Earlier patches focused on having direct reclaim and kswapd use data that
> is node-centric for reclaiming but shrink_node() itself still uses too
> much zone information.  This patch removes unnecessary zone-based
> information with the most important decision being whether to continue
> reclaim or not.  Some memcg APIs are adjusted as a result even though
> memcg itself still uses some zone information.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Second half of the memcg conversion is in the next patch. Ok.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 13/34] mm, vmscan: make shrink_node decisions more node-centric
@ 2016-07-12 14:32     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:49AM +0100, Mel Gorman wrote:
> Earlier patches focused on having direct reclaim and kswapd use data that
> is node-centric for reclaiming but shrink_node() itself still uses too
> much zone information.  This patch removes unnecessary zone-based
> information with the most important decision being whether to continue
> reclaim or not.  Some memcg APIs are adjusted as a result even though
> memcg itself still uses some zone information.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Second half of the memcg conversion is in the next patch. Ok.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 14/34] mm, memcg: move memcg limit enforcement from zones to nodes
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 14:38     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:50AM +0100, Mel Gorman wrote:
> Memcg needs adjustment after moving LRUs to the node. Limits are tracked
> per memcg but the soft-limit excess is tracked per zone. As global page
> reclaim is based on the node, it is easy to imagine a situation where
> a zone soft limit is exceeded even though the memcg limit is fine.
> 
> This patch moves the soft limit tree the node.  Technically, all the variable
> names should also change but people are already familiar by the meaning of
> "mz" even if "mn" would be a more appropriate name now.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>

Yep, the soft limit tracking scope needs to match the reclaim scope.

Nice patch :)

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 14/34] mm, memcg: move memcg limit enforcement from zones to nodes
@ 2016-07-12 14:38     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:50AM +0100, Mel Gorman wrote:
> Memcg needs adjustment after moving LRUs to the node. Limits are tracked
> per memcg but the soft-limit excess is tracked per zone. As global page
> reclaim is based on the node, it is easy to imagine a situation where
> a zone soft limit is exceeded even though the memcg limit is fine.
> 
> This patch moves the soft limit tree the node.  Technically, all the variable
> names should also change but people are already familiar by the meaning of
> "mz" even if "mn" would be a more appropriate name now.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>

Yep, the soft limit tracking scope needs to match the reclaim scope.

Nice patch :)

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 17/34] mm: move page mapped accounting to the node
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 14:42     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:53AM +0100, Mel Gorman wrote:
> Reclaim makes decisions based on the number of pages that are mapped but
> it's mixing node and zone information.  Account NR_FILE_MAPPED and
> NR_ANON_PAGES pages on the node.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 17/34] mm: move page mapped accounting to the node
@ 2016-07-12 14:42     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:53AM +0100, Mel Gorman wrote:
> Reclaim makes decisions based on the number of pages that are mapped but
> it's mixing node and zone information.  Account NR_FILE_MAPPED and
> NR_ANON_PAGES pages on the node.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 14:58     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:54AM +0100, Mel Gorman wrote:
> NR_FILE_PAGES  is the number of        file pages.
> NR_FILE_MAPPED is the number of mapped file pages.
> NR_ANON_PAGES  is the number of mapped anon pages.
> 
> This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
> NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
> have
> 
> NR_FILE_PAGES  is the number of        file pages.
> NR_FILE_MAPPED is the number of mapped file pages.
> NR_ANON_MAPPED is the number of mapped anon pages.

That looks wrong to me. The symmetry is between NR_FILE_PAGES and
NR_ANON_PAGES. NR_FILE_MAPPED is merely elaborating on the mapped
subset of NR_FILE_PAGES, something which isn't necessary for anon
pages as they're always mapped.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-12 14:58     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 14:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:54AM +0100, Mel Gorman wrote:
> NR_FILE_PAGES  is the number of        file pages.
> NR_FILE_MAPPED is the number of mapped file pages.
> NR_ANON_PAGES  is the number of mapped anon pages.
> 
> This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
> NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
> have
> 
> NR_FILE_PAGES  is the number of        file pages.
> NR_FILE_MAPPED is the number of mapped file pages.
> NR_ANON_MAPPED is the number of mapped anon pages.

That looks wrong to me. The symmetry is between NR_FILE_PAGES and
NR_ANON_PAGES. NR_FILE_MAPPED is merely elaborating on the mapped
subset of NR_FILE_PAGES, something which isn't necessary for anon
pages as they're always mapped.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 19/34] mm: move most file-based accounting to the node
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 15:11     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 15:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:55AM +0100, Mel Gorman wrote:
> There are now a number of accounting oddities such as mapped file pages
> being accounted for on the node while the total number of file pages are
> accounted on the zone. This can be coped with to some extent but it's
> confusing so this patch moves the relevant file-based accounted. Due to
> throttling logic in the page allocator for reliable OOM detection, it is
> still necessary to track dirty and writeback pages on a per-zone basis.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>

The straight conversion bits are mind-numbing to review, so I focussed
mostly on the NR_ZONE_WRITE_PENDING sites. They look good to me except
for the migration one:

> @@ -505,15 +505,17 @@ int migrate_page_move_mapping(struct address_space *mapping,
>  	 * are mapped to swap space.
>  	 */
>  	if (newzone != oldzone) {
> -		__dec_zone_state(oldzone, NR_FILE_PAGES);
> -		__inc_zone_state(newzone, NR_FILE_PAGES);
> +		__dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
> +		__inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
>  		if (PageSwapBacked(page) && !PageSwapCache(page)) {
> -			__dec_zone_state(oldzone, NR_SHMEM);
> -			__inc_zone_state(newzone, NR_SHMEM);
> +			__dec_node_state(oldzone->zone_pgdat, NR_SHMEM);
> +			__inc_node_state(newzone->zone_pgdat, NR_SHMEM);
>  		}
>  		if (dirty && mapping_cap_account_dirty(mapping)) {
> -			__dec_zone_state(oldzone, NR_FILE_DIRTY);
> -			__inc_zone_state(newzone, NR_FILE_DIRTY);
> +			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
> +			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
> +			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
> +			__dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);

That double dec of NR_ZONE_WRITE_PENDING should be dec(old) -> inc(new).

Otherwise, the patch looks good to me.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 19/34] mm: move most file-based accounting to the node
@ 2016-07-12 15:11     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 15:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:55AM +0100, Mel Gorman wrote:
> There are now a number of accounting oddities such as mapped file pages
> being accounted for on the node while the total number of file pages are
> accounted on the zone. This can be coped with to some extent but it's
> confusing so this patch moves the relevant file-based accounted. Due to
> throttling logic in the page allocator for reliable OOM detection, it is
> still necessary to track dirty and writeback pages on a per-zone basis.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>

The straight conversion bits are mind-numbing to review, so I focussed
mostly on the NR_ZONE_WRITE_PENDING sites. They look good to me except
for the migration one:

> @@ -505,15 +505,17 @@ int migrate_page_move_mapping(struct address_space *mapping,
>  	 * are mapped to swap space.
>  	 */
>  	if (newzone != oldzone) {
> -		__dec_zone_state(oldzone, NR_FILE_PAGES);
> -		__inc_zone_state(newzone, NR_FILE_PAGES);
> +		__dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
> +		__inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
>  		if (PageSwapBacked(page) && !PageSwapCache(page)) {
> -			__dec_zone_state(oldzone, NR_SHMEM);
> -			__inc_zone_state(newzone, NR_SHMEM);
> +			__dec_node_state(oldzone->zone_pgdat, NR_SHMEM);
> +			__inc_node_state(newzone->zone_pgdat, NR_SHMEM);
>  		}
>  		if (dirty && mapping_cap_account_dirty(mapping)) {
> -			__dec_zone_state(oldzone, NR_FILE_DIRTY);
> -			__inc_zone_state(newzone, NR_FILE_DIRTY);
> +			__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
> +			__dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
> +			__inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
> +			__dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);

That double dec of NR_ZONE_WRITE_PENDING should be dec(old) -> inc(new).

Otherwise, the patch looks good to me.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 20/34] mm: move vmscan writes and file write accounting to the node
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 15:15     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 15:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:56AM +0100, Mel Gorman wrote:
> As reclaim is now node-based, it follows that page write activity due to
> page reclaim should also be accounted for on the node.  For consistency,
> also account page writes and page dirtying on a per-node basis.
> 
> After this patch, there are a few remaining zone counters that may appear
> strange but are fine.  NUMA stats are still per-zone as this is a
> user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
> NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
> potentially pin low memory and cannot trivially be reclaimed on demand.
> This information is still useful for debugging a page allocation failure
> warning.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>

More conversion... ;) FWIW, I didn't spot anything problematic. And
agreed with leaving unmovable stuff counters on a per-zone basis.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 20/34] mm: move vmscan writes and file write accounting to the node
@ 2016-07-12 15:15     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 15:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:56AM +0100, Mel Gorman wrote:
> As reclaim is now node-based, it follows that page write activity due to
> page reclaim should also be accounted for on the node.  For consistency,
> also account page writes and page dirtying on a per-node basis.
> 
> After this patch, there are a few remaining zone counters that may appear
> strange but are fine.  NUMA stats are still per-zone as this is a
> user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
> NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
> potentially pin low memory and cannot trivially be reclaimed on demand.
> This information is still useful for debugging a page allocation failure
> warning.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>

More conversion... ;) FWIW, I didn't spot anything problematic. And
agreed with leaving unmovable stuff counters on a per-zone basis.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 21/34] mm, vmscan: only wakeup kswapd once per node for the requested classzone
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 17:18     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 17:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:57AM +0100, Mel Gorman wrote:
> kswapd is woken when zones are below the low watermark but the wakeup
> decision is not taking the classzone into account.  Now that reclaim is
> node-based, it is only required to wake kswapd once per node and only if
> all zones are unbalanced for the requested classzone.
> 
> Note that one node might be checked multiple times if the zonelist is
> ordered by node because there is no cheap way of tracking what nodes have
> already been visited.  For zone-ordering, each node should be checked only
> once.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 21/34] mm, vmscan: only wakeup kswapd once per node for the requested classzone
@ 2016-07-12 17:18     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 17:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:57AM +0100, Mel Gorman wrote:
> kswapd is woken when zones are below the low watermark but the wakeup
> decision is not taking the classzone into account.  Now that reclaim is
> node-based, it is only required to wake kswapd once per node and only if
> all zones are unbalanced for the requested classzone.
> 
> Note that one node might be checked multiple times if the zonelist is
> ordered by node because there is no cheap way of tracking what nodes have
> already been visited.  For zone-ordering, each node should be checked only
> once.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 22/34] mm, page_alloc: wake kswapd based on the highest eligible zone
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 17:24     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 17:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:58AM +0100, Mel Gorman wrote:
> The ac_classzone_idx is used as the basis for waking kswapd and that is based
> on the preferred zoneref. If the preferred zoneref's first zone is lower
> than what is available on other nodes, it's possible that kswapd is woken
> on a zone with only higher, but still eligible, zones. As classzone_idx
> is strictly adhered to now, it causes a problem because eligible pages
> are skipped.
> 
> For example, node 0 has only DMA32 and node 1 has only NORMAL. An allocating
> context running on node 0 may wake kswapd on node 1 telling it to skip
> all NORMAL pages.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 22/34] mm, page_alloc: wake kswapd based on the highest eligible zone
@ 2016-07-12 17:24     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 17:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:58AM +0100, Mel Gorman wrote:
> The ac_classzone_idx is used as the basis for waking kswapd and that is based
> on the preferred zoneref. If the preferred zoneref's first zone is lower
> than what is available on other nodes, it's possible that kswapd is woken
> on a zone with only higher, but still eligible, zones. As classzone_idx
> is strictly adhered to now, it causes a problem because eligible pages
> are skipped.
> 
> For example, node 0 has only DMA32 and node 1 has only NORMAL. An allocating
> context running on node 0 may wake kswapd on node 1 telling it to skip
> all NORMAL pages.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 23/34] mm: convert zone_reclaim to node_reclaim
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-12 17:28     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 17:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:59AM +0100, Mel Gorman wrote:
> As reclaim is now per-node based, convert zone_reclaim to be node_reclaim.
> It is possible that a node will be reclaimed multiple times if it has
> multiple zones but this is unavoidable without caching all nodes traversed
> so far.  The documentation and interface to userspace is the same from a
> configuration perspective and will will be similar in behaviour unless the
> node-local allocation requests were also limited to lower zones.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 23/34] mm: convert zone_reclaim to node_reclaim
@ 2016-07-12 17:28     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 17:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:59AM +0100, Mel Gorman wrote:
> As reclaim is now per-node based, convert zone_reclaim to be node_reclaim.
> It is possible that a node will be reclaimed multiple times if it has
> multiple zones but this is unavoidable without caching all nodes traversed
> so far.  The documentation and interface to userspace is the same from a
> configuration perspective and will will be similar in behaviour unless the
> node-local allocation requests were also limited to lower zones.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 24/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 17:31     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 17:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:00AM +0100, Mel Gorman wrote:
> shrink_node receives all information it needs about classzone_idx
> from sc->reclaim_idx so remove the aliases.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 24/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node
@ 2016-07-12 17:31     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 17:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:00AM +0100, Mel Gorman wrote:
> shrink_node receives all information it needs about classzone_idx
> from sc->reclaim_idx so remove the aliases.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 25/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 18:01     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:01AM +0100, Mel Gorman wrote:
> The scan_control structure has enough information available for
> compaction_ready() to make a decision. The classzone_idx manipulations in
> shrink_zones() are no longer necessary as the highest populated zone is
> no longer used to determine if shrink_slab should be called or not.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 25/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready
@ 2016-07-12 18:01     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:01AM +0100, Mel Gorman wrote:
> The scan_control structure has enough information available for
> compaction_ready() to make a decision. The classzone_idx manipulations in
> shrink_zones() are no longer necessary as the highest populated zone is
> no longer used to determine if shrink_slab should be called or not.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 26/34] mm, vmscan: avoid passing in remaining unnecessarily to prepare_kswapd_sleep
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 18:06     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:02AM +0100, Mel Gorman wrote:
> As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep
> always passes in 0 for remaining and the second call can trivially
> check the parameter in advance.
> 
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 26/34] mm, vmscan: avoid passing in remaining unnecessarily to prepare_kswapd_sleep
@ 2016-07-12 18:06     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:02AM +0100, Mel Gorman wrote:
> As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep
> always passes in 0 for remaining and the second call can trivially
> check the parameter in advance.
> 
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 27/34] mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 18:10     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:03AM +0100, Mel Gorman wrote:
> The buffer_heads_over_limit limit in kswapd is inconsistent with direct
> reclaim behaviour. It may force an an attempt to reclaim from all zones and
> then not reclaim at all because higher zones were balanced than required
> by the original request.
> 
> This patch will causes kswapd to consider reclaiming from all zones if
> buffer_heads_over_limit.  However, if there are eligible zones for the
> allocation request that woke kswapd then no reclaim will occur even if
> buffer_heads_over_limit. This avoids kswapd over-reclaiming just because
> buffer_heads_over_limit.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 27/34] mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit
@ 2016-07-12 18:10     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:03AM +0100, Mel Gorman wrote:
> The buffer_heads_over_limit limit in kswapd is inconsistent with direct
> reclaim behaviour. It may force an an attempt to reclaim from all zones and
> then not reclaim at all because higher zones were balanced than required
> by the original request.
> 
> This patch will causes kswapd to consider reclaiming from all zones if
> buffer_heads_over_limit.  However, if there are eligible zones for the
> allocation request that woke kswapd then no reclaim will occur even if
> buffer_heads_over_limit. This avoids kswapd over-reclaiming just because
> buffer_heads_over_limit.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 28/34] mm, vmscan: add classzone information to tracepoints
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 18:13     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:04AM +0100, Mel Gorman wrote:
> This is convenient when tracking down why the skip count is high because
> it'll show what classzone kswapd woke up at and what zones are being
> isolated.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 28/34] mm, vmscan: add classzone information to tracepoints
@ 2016-07-12 18:13     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:04AM +0100, Mel Gorman wrote:
> This is convenient when tracking down why the skip count is high because
> it'll show what classzone kswapd woke up at and what zones are being
> isolated.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 29/34] mm, page_alloc: remove fair zone allocation policy
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 18:18     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:05AM +0100, Mel Gorman wrote:
> The fair zone allocation policy interleaves allocation requests between
> zones to avoid an age inversion problem whereby new pages are reclaimed to
> balance a zone.  Reclaim is now node-based so this should no longer be an
> issue and the fair zone allocation policy is not free.  This patch removes
> it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

It's indeed no longer needed with a single set of LRUs on each NUMA
node. I'm glad this wart is finally going. Thanks Mel.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 29/34] mm, page_alloc: remove fair zone allocation policy
@ 2016-07-12 18:18     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:05AM +0100, Mel Gorman wrote:
> The fair zone allocation policy interleaves allocation requests between
> zones to avoid an age inversion problem whereby new pages are reclaimed to
> balance a zone.  Reclaim is now node-based so this should no longer be an
> issue and the fair zone allocation policy is not free.  This patch removes
> it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

It's indeed no longer needed with a single set of LRUs on each NUMA
node. I'm glad this wart is finally going. Thanks Mel.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 30/34] mm: page_alloc: cache the last node whose dirty limit is reached
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 18:43     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:06AM +0100, Mel Gorman wrote:
> If a page is about to be dirtied then the page allocator attempts to limit
> the total number of dirty pages that exists in any given zone.  The call
> to node_dirty_ok is expensive so this patch records if the last pgdat
> examined hit the dirty limits.  In some cases, this reduces the number of
> calls to node_dirty_ok().
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

I guess...

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 30/34] mm: page_alloc: cache the last node whose dirty limit is reached
@ 2016-07-12 18:43     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 18:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:06AM +0100, Mel Gorman wrote:
> If a page is about to be dirtied then the page allocator attempts to limit
> the total number of dirty pages that exists in any given zone.  The call
> to node_dirty_ok is expensive so this patch records if the last pgdat
> examined hit the dirty limits.  In some cases, this reduces the number of
> calls to node_dirty_ok().
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

I guess...

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 32/34] mm: vmstat: account per-zone stalls and pages skipped during reclaim
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 19:06     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 19:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:08AM +0100, Mel Gorman wrote:
> The vmstat allocstall was fairly useful in the general sense but
> node-based LRUs change that.  It's important to know if a stall was for an
> address-limited allocation request as this will require skipping pages
> from other zones.  This patch adds pgstall_* counters to replace
> allocstall.  The sum of the counters will equal the old allocstall so it
> can be trivially recalculated.  A high number of address-limited
> allocation requests may result in a lot of useless LRU scanning for
> suitable pages.
> 
> As address-limited allocations require pages to be skipped, it's important
> to know how much useless LRU scanning took place so this patch adds
> pgskip* counters.  This yields the following model
> 
> 1. The number of address-space limited stalls can be accounted for (pgstall)
> 2. The amount of useless work required to reclaim the data is accounted (pgskip)
> 3. The total number of scans is available from pgscan_kswapd and pgscan_direct
>    so from that the ratio of useful to useless scans can be calculated.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

These statistics should be quite helpful, so:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

But I have one nitpick:

> @@ -23,6 +23,8 @@
>  
>  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		FOR_ALL_ZONES(PGALLOC),
> +		FOR_ALL_ZONES(PGSTALL),
> +		FOR_ALL_ZONES(PGSCAN_SKIP),
>  		PGFREE, PGACTIVATE, PGDEACTIVATE,
>  		PGFAULT, PGMAJFAULT,
>  		PGLAZYFREED,

The PG prefix seems to stand for page, and all stat names that contain
it represent some per-page event. PGSTALL is not a page event, though.
Would you mind sticking with allocstall? allocstall_dma32 etc.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 32/34] mm: vmstat: account per-zone stalls and pages skipped during reclaim
@ 2016-07-12 19:06     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 19:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:08AM +0100, Mel Gorman wrote:
> The vmstat allocstall was fairly useful in the general sense but
> node-based LRUs change that.  It's important to know if a stall was for an
> address-limited allocation request as this will require skipping pages
> from other zones.  This patch adds pgstall_* counters to replace
> allocstall.  The sum of the counters will equal the old allocstall so it
> can be trivially recalculated.  A high number of address-limited
> allocation requests may result in a lot of useless LRU scanning for
> suitable pages.
> 
> As address-limited allocations require pages to be skipped, it's important
> to know how much useless LRU scanning took place so this patch adds
> pgskip* counters.  This yields the following model
> 
> 1. The number of address-space limited stalls can be accounted for (pgstall)
> 2. The amount of useless work required to reclaim the data is accounted (pgskip)
> 3. The total number of scans is available from pgscan_kswapd and pgscan_direct
>    so from that the ratio of useful to useless scans can be calculated.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

These statistics should be quite helpful, so:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

But I have one nitpick:

> @@ -23,6 +23,8 @@
>  
>  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		FOR_ALL_ZONES(PGALLOC),
> +		FOR_ALL_ZONES(PGSTALL),
> +		FOR_ALL_ZONES(PGSCAN_SKIP),
>  		PGFREE, PGACTIVATE, PGDEACTIVATE,
>  		PGFAULT, PGMAJFAULT,
>  		PGLAZYFREED,

The PG prefix seems to stand for page, and all stat names that contain
it represent some per-page event. PGSTALL is not a page event, though.
Would you mind sticking with allocstall? allocstall_dma32 etc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 31/34] mm: vmstat: replace __count_zone_vm_events with a zone id equivalent
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 19:10     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 19:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:07AM +0100, Mel Gorman wrote:
> This is partially a preparation patch for more vmstat work but it also has
> the slight advantage that __count_zid_vm_events is cheaper to calculate
> than __count_zone_vm_events().
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 31/34] mm: vmstat: replace __count_zone_vm_events with a zone id equivalent
@ 2016-07-12 19:10     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 19:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:07AM +0100, Mel Gorman wrote:
> This is partially a preparation patch for more vmstat work but it also has
> the slight advantage that __count_zid_vm_events is cheaper to calculate
> than __count_zone_vm_events().
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 33/34] mm, vmstat: print node-based stats in zoneinfo file
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-12 19:18     ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 19:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:09AM +0100, Mel Gorman wrote:
> There are a number of stats that were previously accessible via zoneinfo
> that are now invisible. While it is possible to create a new file for the
> node stats, this may be missed by users. Instead this patch prints the
> stats under the first populated zone in /proc/zoneinfo.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

I had no idea we could make /proc/zoneinfo any worse!

But it'll work, I guess.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 33/34] mm, vmstat: print node-based stats in zoneinfo file
@ 2016-07-12 19:18     ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-12 19:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:35:09AM +0100, Mel Gorman wrote:
> There are a number of stats that were previously accessible via zoneinfo
> that are now invisible. While it is possible to create a new file for the
> node stats, this may be missed by users. Instead this patch prints the
> stats under the first populated zone in /proc/zoneinfo.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

I had no idea we could make /proc/zoneinfo any worse!

But it'll work, I guess.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/34] mm, vmscan: move lru_lock to the node
  2016-07-12 11:18       ` Mel Gorman
@ 2016-07-13  5:50         ` Balbir Singh
  -1 siblings, 0 replies; 220+ messages in thread
From: Balbir Singh @ 2016-07-13  5:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Balbir Singh, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 12:18:05PM +0100, Mel Gorman wrote:
> On Tue, Jul 12, 2016 at 09:06:04PM +1000, Balbir Singh wrote:
> > > diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> > > index b14abf217239..946e69103cdd 100644
> > > --- a/Documentation/cgroup-v1/memory.txt
> > > +++ b/Documentation/cgroup-v1/memory.txt
> > > @@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
> > >     Other lock order is following:
> > >     PG_locked.
> > >     mm->page_table_lock
> > > -       zone->lru_lock
> > > +       zone_lru_lock
> > 
> > zone_lru_lock is a little confusing, can't we just call it
> > node_lru_lock?
> > 
> 
> It's a matter of perspective. People familiar with the VM already expect
> a zone lock so will be looking for it. I can do a rename if you insist
> but it may not actually help.

I don't want to insist, but zone_ in the name can be confusing, as to
leading us to think that the lru_lock is still in the zone

If the rest of the reviewers are fine with, we don't need to rename

> 
> > > @@ -496,7 +496,6 @@ struct zone {
> > >  	/* Write-intensive fields used by page reclaim */
> > >  
> > >  	/* Fields commonly accessed by the page reclaim scanner */
> > > -	spinlock_t		lru_lock;
> > >  	struct lruvec		lruvec;
> > >  
> > >  	/*
> > > @@ -690,6 +689,9 @@ typedef struct pglist_data {
> > >  	/* Number of pages migrated during the rate limiting time interval */
> > >  	unsigned long numabalancing_migrate_nr_pages;
> > >  #endif
> > > +	/* Write-intensive fields used by page reclaim */
> > > +	ZONE_PADDING(_pad1_)a
> > 
> > I thought this was to have zone->lock and zone->lru_lock in different
> > cachelines, do we still need the padding here?
> > 
> 
> The zone padding current keeps the page lock wait tables, page allocator
> lists, compaction and vmstats on separate cache lines. They're still
> fine.
> 
> The node padding may not be necessary. It currently ensures that zonelists
> and numa balancing are separate from the LRU lock but there is no guarantee
> the current arrangement is optimal. It would depend on both the kernel
> config and the workload but it may be necessary in the future to split
> node into read-mostly sections and then different write-intensive sections
> similar to what has happened to struct zone in the past.
>

Fair enough

Balbir Singh. 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/34] mm, vmscan: move lru_lock to the node
@ 2016-07-13  5:50         ` Balbir Singh
  0 siblings, 0 replies; 220+ messages in thread
From: Balbir Singh @ 2016-07-13  5:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Balbir Singh, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 12:18:05PM +0100, Mel Gorman wrote:
> On Tue, Jul 12, 2016 at 09:06:04PM +1000, Balbir Singh wrote:
> > > diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> > > index b14abf217239..946e69103cdd 100644
> > > --- a/Documentation/cgroup-v1/memory.txt
> > > +++ b/Documentation/cgroup-v1/memory.txt
> > > @@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
> > >     Other lock order is following:
> > >     PG_locked.
> > >     mm->page_table_lock
> > > -       zone->lru_lock
> > > +       zone_lru_lock
> > 
> > zone_lru_lock is a little confusing, can't we just call it
> > node_lru_lock?
> > 
> 
> It's a matter of perspective. People familiar with the VM already expect
> a zone lock so will be looking for it. I can do a rename if you insist
> but it may not actually help.

I don't want to insist, but zone_ in the name can be confusing, as to
leading us to think that the lru_lock is still in the zone

If the rest of the reviewers are fine with, we don't need to rename

> 
> > > @@ -496,7 +496,6 @@ struct zone {
> > >  	/* Write-intensive fields used by page reclaim */
> > >  
> > >  	/* Fields commonly accessed by the page reclaim scanner */
> > > -	spinlock_t		lru_lock;
> > >  	struct lruvec		lruvec;
> > >  
> > >  	/*
> > > @@ -690,6 +689,9 @@ typedef struct pglist_data {
> > >  	/* Number of pages migrated during the rate limiting time interval */
> > >  	unsigned long numabalancing_migrate_nr_pages;
> > >  #endif
> > > +	/* Write-intensive fields used by page reclaim */
> > > +	ZONE_PADDING(_pad1_)a
> > 
> > I thought this was to have zone->lock and zone->lru_lock in different
> > cachelines, do we still need the padding here?
> > 
> 
> The zone padding current keeps the page lock wait tables, page allocator
> lists, compaction and vmstats on separate cache lines. They're still
> fine.
> 
> The node padding may not be necessary. It currently ensures that zonelists
> and numa balancing are separate from the LRU lock but there is no guarantee
> the current arrangement is optimal. It would depend on both the kernel
> config and the workload but it may be necessary in the future to split
> node into read-mostly sections and then different write-intensive sections
> similar to what has happened to struct zone in the past.
>

Fair enough

Balbir Singh. 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/34] mm, vmscan: have kswapd only scan based on the highest requested zone
  2016-07-12 14:05     ` Johannes Weiner
@ 2016-07-13  8:37       ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:05:04AM -0400, Johannes Weiner wrote:
> On Fri, Jul 08, 2016 at 10:34:42AM +0100, Mel Gorman wrote:
> > kswapd checks all eligible zones to see if they need balancing even if it
> > was woken for a lower zone.  This made sense when we reclaimed on a
> > per-zone basis because we wanted to shrink zones fairly so avoid
> > age-inversion problems.  Ideally this is completely unnecessary when
> > reclaiming on a per-node basis.  In theory, there may still be anomalies
> > when all requests are for lower zones and very old pages are preserved in
> > higher zones but this should be the exceptional case.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> I wasn't quite sure at first what the rationale is for this patch,
> since it probably won't make much difference in pratice.

Possibly not, it depends on how much embedded 32-bit platforms use features
like zswap. What I wanted to avoid was a lowmem allocation for zswap
excessively reclaiming highmem putting even further pressure on zswap if
the pages are anonymous.

> But I do
> agree that the code is cleaner to have kswapd check exactly what it
> was asked to check, rather than some do-the-"right"-thing magic.
> 

But this is a justification on its own. I encountered an astonishing number
of magic number handling that just happened to mostly work. I wanted to
iron them out as much as possible.

> A hypothetical onslaught of low-zone allocations will wreak havoc to
> the page age in higher zones anyway, right? So I don't think that case
> matters all that much.

Possibly not, but it was straight-forward to mitigate the damage without
too many side-effects.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/34] mm, vmscan: have kswapd only scan based on the highest requested zone
@ 2016-07-13  8:37       ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:05:04AM -0400, Johannes Weiner wrote:
> On Fri, Jul 08, 2016 at 10:34:42AM +0100, Mel Gorman wrote:
> > kswapd checks all eligible zones to see if they need balancing even if it
> > was woken for a lower zone.  This made sense when we reclaimed on a
> > per-zone basis because we wanted to shrink zones fairly so avoid
> > age-inversion problems.  Ideally this is completely unnecessary when
> > reclaiming on a per-node basis.  In theory, there may still be anomalies
> > when all requests are for lower zones and very old pages are preserved in
> > higher zones but this should be the exceptional case.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> I wasn't quite sure at first what the rationale is for this patch,
> since it probably won't make much difference in pratice.

Possibly not, it depends on how much embedded 32-bit platforms use features
like zswap. What I wanted to avoid was a lowmem allocation for zswap
excessively reclaiming highmem putting even further pressure on zswap if
the pages are anonymous.

> But I do
> agree that the code is cleaner to have kswapd check exactly what it
> was asked to check, rather than some do-the-"right"-thing magic.
> 

But this is a justification on its own. I encountered an astonishing number
of magic number handling that just happened to mostly work. I wanted to
iron them out as much as possible.

> A hypothetical onslaught of low-zone allocations will wreak havoc to
> the page age in higher zones anyway, right? So I don't think that case
> matters all that much.

Possibly not, but it was straight-forward to mitigate the damage without
too many side-effects.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/34] mm, vmscan: move lru_lock to the node
  2016-07-13  5:50         ` Balbir Singh
@ 2016-07-13  8:39           ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-13  8:39 UTC (permalink / raw)
  To: bsingharora, Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner,
	Minchan Kim, Joonsoo Kim, LKML

On 07/13/2016 07:50 AM, Balbir Singh wrote:
> On Tue, Jul 12, 2016 at 12:18:05PM +0100, Mel Gorman wrote:
>> On Tue, Jul 12, 2016 at 09:06:04PM +1000, Balbir Singh wrote:
>>>> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
>>>> index b14abf217239..946e69103cdd 100644
>>>> --- a/Documentation/cgroup-v1/memory.txt
>>>> +++ b/Documentation/cgroup-v1/memory.txt
>>>> @@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
>>>>     Other lock order is following:
>>>>     PG_locked.
>>>>     mm->page_table_lock
>>>> -       zone->lru_lock
>>>> +       zone_lru_lock
>>>
>>> zone_lru_lock is a little confusing, can't we just call it
>>> node_lru_lock?
>>>
>>
>> It's a matter of perspective. People familiar with the VM already expect
>> a zone lock so will be looking for it. I can do a rename if you insist
>> but it may not actually help.
>
> I don't want to insist, but zone_ in the name can be confusing, as to
> leading us to think that the lru_lock is still in the zone

On the other hand, it suggests that the argument of the function is a 
zone. Passing a zone to something called "node_lru_lock()" would be more 
confusing to me. Also it's mostly a convenience wrapper to ease the 
transition, whose usage will likely diminish over time.

> If the rest of the reviewers are fine with, we don't need to rename

Yes, it's not worth the trouble.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/34] mm, vmscan: move lru_lock to the node
@ 2016-07-13  8:39           ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-13  8:39 UTC (permalink / raw)
  To: bsingharora, Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner,
	Minchan Kim, Joonsoo Kim, LKML

On 07/13/2016 07:50 AM, Balbir Singh wrote:
> On Tue, Jul 12, 2016 at 12:18:05PM +0100, Mel Gorman wrote:
>> On Tue, Jul 12, 2016 at 09:06:04PM +1000, Balbir Singh wrote:
>>>> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
>>>> index b14abf217239..946e69103cdd 100644
>>>> --- a/Documentation/cgroup-v1/memory.txt
>>>> +++ b/Documentation/cgroup-v1/memory.txt
>>>> @@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered.
>>>>     Other lock order is following:
>>>>     PG_locked.
>>>>     mm->page_table_lock
>>>> -       zone->lru_lock
>>>> +       zone_lru_lock
>>>
>>> zone_lru_lock is a little confusing, can't we just call it
>>> node_lru_lock?
>>>
>>
>> It's a matter of perspective. People familiar with the VM already expect
>> a zone lock so will be looking for it. I can do a rename if you insist
>> but it may not actually help.
>
> I don't want to insist, but zone_ in the name can be confusing, as to
> leading us to think that the lru_lock is still in the zone

On the other hand, it suggests that the argument of the function is a 
zone. Passing a zone to something called "node_lru_lock()" would be more 
confusing to me. Also it's mostly a convenience wrapper to ease the 
transition, whose usage will likely diminish over time.

> If the rest of the reviewers are fine with, we don't need to rename

Yes, it's not worth the trouble.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
  2016-07-12 14:22     ` Johannes Weiner
@ 2016-07-13  8:40       ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:22:56AM -0400, Johannes Weiner wrote:
> On Fri, Jul 08, 2016 at 10:34:47AM +0100, Mel Gorman wrote:
> > @@ -3008,7 +3008,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
> >  {
> >  	unsigned long mark = high_wmark_pages(zone);
> >  
> > -	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
> > +	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
> > +		return false;
> > +
> > +	/*
> > +	 * If any eligible zone is balanced then the node is not considered
> > +	 * to be congested or dirty
> > +	 */
> > +	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> > +	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
> 
> Predicate functions that secretly modify internal state give me the
> willies... The diffstat is flat, too. Is this really an improvement?

Primarily, it's about less duplicated code and maintenance overhead.
Overall I was both trying to remove side-effects and make the kswapd flow
easier to follow.

I'm open to renaming suggestions that make it clear the function
has side-effects. Best I came up with after the first coffee is
try_reset_zone_balanced() which returns true with congestion/dirty state
cleared if the zone is balanced. The name is not great because it's also
a little misleading. It doesn't try to reset anything, that's reclaim's job.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
@ 2016-07-13  8:40       ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:22:56AM -0400, Johannes Weiner wrote:
> On Fri, Jul 08, 2016 at 10:34:47AM +0100, Mel Gorman wrote:
> > @@ -3008,7 +3008,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
> >  {
> >  	unsigned long mark = high_wmark_pages(zone);
> >  
> > -	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
> > +	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
> > +		return false;
> > +
> > +	/*
> > +	 * If any eligible zone is balanced then the node is not considered
> > +	 * to be congested or dirty
> > +	 */
> > +	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> > +	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
> 
> Predicate functions that secretly modify internal state give me the
> willies... The diffstat is flat, too. Is this really an improvement?

Primarily, it's about less duplicated code and maintenance overhead.
Overall I was both trying to remove side-effects and make the kswapd flow
easier to follow.

I'm open to renaming suggestions that make it clear the function
has side-effects. Best I came up with after the first coffee is
try_reset_zone_balanced() which returns true with congestion/dirty state
cleared if the zone is balanced. The name is not great because it's also
a little misleading. It doesn't try to reset anything, that's reclaim's job.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone
  2016-07-12 14:29     ` Johannes Weiner
@ 2016-07-13  8:47       ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:29:09AM -0400, Johannes Weiner wrote:
> > +		/*
> > +		 * If the number of buffer_heads in the machine exceeds the
> > +		 * maximum allowed level then reclaim from all zones. This is
> > +		 * not specific to highmem as highmem may not exist but it is
> > +		 * it is expected that buffer_heads are stripped in writeback.
> 
> The mention of highmem in this comment make only sense within the
> context of this diff; it'll be pretty confusing in the standalone
> code.
> 
> Also, double "it is" :)

Is this any better?

Note that it's marked as a fix to a later patch to reduce collisions in
mmotm. It's not a bisection risk so I saw little need to cause
unnecessary conflicts for Andrew.

---8<---
mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit -fix

Johannes reported that the comment about buffer_heads_over_limit in
balance_pgdat only made sense in the context of the patch. This
patch clarifies the reasoning and how it applies to 32 and 64 bit
systems.

This is a fix to the mmotm patch
mm-vmscan-have-kswapd-reclaim-from-all-zones-if-reclaiming-and-buffer_heads_over_limit.patch

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d079210d46ee..21eae17ee730 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3131,12 +3131,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		/*
 		 * If the number of buffer_heads exceeds the maximum allowed
-		 * then consider reclaiming from all zones. This is not
-		 * specific to highmem which may not exist but it is it is
-		 * expected that buffer_heads are stripped in writeback.
-		 * Reclaim may still not go ahead if all eligible zones
-		 * for the original allocation request are balanced to
-		 * avoid excessive reclaim from kswapd.
+		 * then consider reclaiming from all zones. This has a dual
+		 * purpose -- on 64-bit systems it is expected that
+		 * buffer_heads are stripped during active rotation. On 32-bit
+		 * systems, highmem pages can pin lowmem memory and shrinking
+		 * buffers can relieve lowmem pressure. Reclaim may still not
+		 * go ahead if all eligible zones for the original allocation
+		 * request are balanced to avoid excessive reclaim from kswapd.
 		 */
 		if (buffer_heads_over_limit) {
 			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone
@ 2016-07-13  8:47       ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:29:09AM -0400, Johannes Weiner wrote:
> > +		/*
> > +		 * If the number of buffer_heads in the machine exceeds the
> > +		 * maximum allowed level then reclaim from all zones. This is
> > +		 * not specific to highmem as highmem may not exist but it is
> > +		 * it is expected that buffer_heads are stripped in writeback.
> 
> The mention of highmem in this comment make only sense within the
> context of this diff; it'll be pretty confusing in the standalone
> code.
> 
> Also, double "it is" :)

Is this any better?

Note that it's marked as a fix to a later patch to reduce collisions in
mmotm. It's not a bisection risk so I saw little need to cause
unnecessary conflicts for Andrew.

---8<---
mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit -fix

Johannes reported that the comment about buffer_heads_over_limit in
balance_pgdat only made sense in the context of the patch. This
patch clarifies the reasoning and how it applies to 32 and 64 bit
systems.

This is a fix to the mmotm patch
mm-vmscan-have-kswapd-reclaim-from-all-zones-if-reclaiming-and-buffer_heads_over_limit.patch

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d079210d46ee..21eae17ee730 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3131,12 +3131,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 		/*
 		 * If the number of buffer_heads exceeds the maximum allowed
-		 * then consider reclaiming from all zones. This is not
-		 * specific to highmem which may not exist but it is it is
-		 * expected that buffer_heads are stripped in writeback.
-		 * Reclaim may still not go ahead if all eligible zones
-		 * for the original allocation request are balanced to
-		 * avoid excessive reclaim from kswapd.
+		 * then consider reclaiming from all zones. This has a dual
+		 * purpose -- on 64-bit systems it is expected that
+		 * buffer_heads are stripped during active rotation. On 32-bit
+		 * systems, highmem pages can pin lowmem memory and shrinking
+		 * buffers can relieve lowmem pressure. Reclaim may still not
+		 * go ahead if all eligible zones for the original allocation
+		 * request are balanced to avoid excessive reclaim from kswapd.
 		 */
 		if (buffer_heads_over_limit) {
 			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 13/34] mm, vmscan: make shrink_node decisions more node-centric
  2016-07-12 14:32     ` Johannes Weiner
@ 2016-07-13  8:48       ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:32:34AM -0400, Johannes Weiner wrote:
> On Fri, Jul 08, 2016 at 10:34:49AM +0100, Mel Gorman wrote:
> > Earlier patches focused on having direct reclaim and kswapd use data that
> > is node-centric for reclaiming but shrink_node() itself still uses too
> > much zone information.  This patch removes unnecessary zone-based
> > information with the most important decision being whether to continue
> > reclaim or not.  Some memcg APIs are adjusted as a result even though
> > memcg itself still uses some zone information.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Second half of the memcg conversion is in the next patch. Ok.

Yeah. I know it bumps the patch count but the combined patch is a headache
to read.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 13/34] mm, vmscan: make shrink_node decisions more node-centric
@ 2016-07-13  8:48       ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:32:34AM -0400, Johannes Weiner wrote:
> On Fri, Jul 08, 2016 at 10:34:49AM +0100, Mel Gorman wrote:
> > Earlier patches focused on having direct reclaim and kswapd use data that
> > is node-centric for reclaiming but shrink_node() itself still uses too
> > much zone information.  This patch removes unnecessary zone-based
> > information with the most important decision being whether to continue
> > reclaim or not.  Some memcg APIs are adjusted as a result even though
> > memcg itself still uses some zone information.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Second half of the memcg conversion is in the next patch. Ok.

Yeah. I know it bumps the patch count but the combined patch is a headache
to read.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-12 14:58     ` Johannes Weiner
@ 2016-07-13  8:55       ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:58:01AM -0400, Johannes Weiner wrote:
> On Fri, Jul 08, 2016 at 10:34:54AM +0100, Mel Gorman wrote:
> > NR_FILE_PAGES  is the number of        file pages.
> > NR_FILE_MAPPED is the number of mapped file pages.
> > NR_ANON_PAGES  is the number of mapped anon pages.
> > 
> > This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
> > NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
> > have
> > 
> > NR_FILE_PAGES  is the number of        file pages.
> > NR_FILE_MAPPED is the number of mapped file pages.
> > NR_ANON_MAPPED is the number of mapped anon pages.
> 
> That looks wrong to me. The symmetry is between NR_FILE_PAGES and
> NR_ANON_PAGES. NR_FILE_MAPPED is merely elaborating on the mapped
> subset of NR_FILE_PAGES, something which isn't necessary for anon
> pages as they're always mapped.

How strongly do you feel about reverting it as later patches would cause
lots of conflicts.

Obviously I found the new names clearer but I was thinking a lot at the
time about mapped vs unmapped due to looking closely at both reclaim and
[f|m]advise functions at the time. I found it mildly irksome to switch
between the semantics of file/anon when looking at the vmstat updates.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-13  8:55       ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13  8:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Tue, Jul 12, 2016 at 10:58:01AM -0400, Johannes Weiner wrote:
> On Fri, Jul 08, 2016 at 10:34:54AM +0100, Mel Gorman wrote:
> > NR_FILE_PAGES  is the number of        file pages.
> > NR_FILE_MAPPED is the number of mapped file pages.
> > NR_ANON_PAGES  is the number of mapped anon pages.
> > 
> > This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
> > NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
> > have
> > 
> > NR_FILE_PAGES  is the number of        file pages.
> > NR_FILE_MAPPED is the number of mapped file pages.
> > NR_ANON_MAPPED is the number of mapped anon pages.
> 
> That looks wrong to me. The symmetry is between NR_FILE_PAGES and
> NR_ANON_PAGES. NR_FILE_MAPPED is merely elaborating on the mapped
> subset of NR_FILE_PAGES, something which isn't necessary for anon
> pages as they're always mapped.

How strongly do you feel about reverting it as later patches would cause
lots of conflicts.

Obviously I found the new names clearer but I was thinking a lot at the
time about mapped vs unmapped due to looking closely at both reclaim and
[f|m]advise functions at the time. I found it mildly irksome to switch
between the semantics of file/anon when looking at the vmstat updates.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone
  2016-07-13  8:47       ` Mel Gorman
@ 2016-07-13 12:28         ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-13 12:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 09:47:42AM +0100, Mel Gorman wrote:
> On Tue, Jul 12, 2016 at 10:29:09AM -0400, Johannes Weiner wrote:
> > > +		/*
> > > +		 * If the number of buffer_heads in the machine exceeds the
> > > +		 * maximum allowed level then reclaim from all zones. This is
> > > +		 * not specific to highmem as highmem may not exist but it is
> > > +		 * it is expected that buffer_heads are stripped in writeback.
> > 
> > The mention of highmem in this comment make only sense within the
> > context of this diff; it'll be pretty confusing in the standalone
> > code.
> > 
> > Also, double "it is" :)
> 
> Is this any better?

Yes, this is great! Thank you.

> Note that it's marked as a fix to a later patch to reduce collisions in
> mmotm. It's not a bisection risk so I saw little need to cause
> unnecessary conflicts for Andrew.

That seems completely reasonable to me.

> ---8<---
> mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit -fix
> 
> Johannes reported that the comment about buffer_heads_over_limit in
> balance_pgdat only made sense in the context of the patch. This
> patch clarifies the reasoning and how it applies to 32 and 64 bit
> systems.
> 
> This is a fix to the mmotm patch
> mm-vmscan-have-kswapd-reclaim-from-all-zones-if-reclaiming-and-buffer_heads_over_limit.patch
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone
@ 2016-07-13 12:28         ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-13 12:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 09:47:42AM +0100, Mel Gorman wrote:
> On Tue, Jul 12, 2016 at 10:29:09AM -0400, Johannes Weiner wrote:
> > > +		/*
> > > +		 * If the number of buffer_heads in the machine exceeds the
> > > +		 * maximum allowed level then reclaim from all zones. This is
> > > +		 * not specific to highmem as highmem may not exist but it is
> > > +		 * it is expected that buffer_heads are stripped in writeback.
> > 
> > The mention of highmem in this comment make only sense within the
> > context of this diff; it'll be pretty confusing in the standalone
> > code.
> > 
> > Also, double "it is" :)
> 
> Is this any better?

Yes, this is great! Thank you.

> Note that it's marked as a fix to a later patch to reduce collisions in
> mmotm. It's not a bisection risk so I saw little need to cause
> unnecessary conflicts for Andrew.

That seems completely reasonable to me.

> ---8<---
> mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit -fix
> 
> Johannes reported that the comment about buffer_heads_over_limit in
> balance_pgdat only made sense in the context of the patch. This
> patch clarifies the reasoning and how it applies to 32 and 64 bit
> systems.
> 
> This is a fix to the mmotm patch
> mm-vmscan-have-kswapd-reclaim-from-all-zones-if-reclaiming-and-buffer_heads_over_limit.patch
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-13  8:55       ` Mel Gorman
@ 2016-07-13 13:04         ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-13 13:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 09:55:16AM +0100, Mel Gorman wrote:
> On Tue, Jul 12, 2016 at 10:58:01AM -0400, Johannes Weiner wrote:
> > On Fri, Jul 08, 2016 at 10:34:54AM +0100, Mel Gorman wrote:
> > > NR_FILE_PAGES  is the number of        file pages.
> > > NR_FILE_MAPPED is the number of mapped file pages.
> > > NR_ANON_PAGES  is the number of mapped anon pages.
> > > 
> > > This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
> > > NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
> > > have
> > > 
> > > NR_FILE_PAGES  is the number of        file pages.
> > > NR_FILE_MAPPED is the number of mapped file pages.
> > > NR_ANON_MAPPED is the number of mapped anon pages.
> > 
> > That looks wrong to me. The symmetry is between NR_FILE_PAGES and
> > NR_ANON_PAGES. NR_FILE_MAPPED is merely elaborating on the mapped
> > subset of NR_FILE_PAGES, something which isn't necessary for anon
> > pages as they're always mapped.
> 
> How strongly do you feel about reverting it as later patches would cause
> lots of conflicts.
> 
> Obviously I found the new names clearer but I was thinking a lot at the
> time about mapped vs unmapped due to looking closely at both reclaim and
> [f|m]advise functions at the time. I found it mildly irksome to switch
> between the semantics of file/anon when looking at the vmstat updates.

I can see that. It all depends on whether you consider mapping state
or page type the more fundamental attribute, and coming from the
mapping perspective those new names make sense as well.

However, that leaves the disconnect between the enum name and what we
print to userspace. I find myself having to associate those quite a
lot to find all the sites that modify a given /proc/vmstat item, and
that's a bit of a pain if the names don't match.

I don't care strongly enough to cause a respin of half the series, and
it's not your problem that I waited until the last revision went into
mmots to review and comment. But if you agreed to a revert, would you
consider tacking on a revert patch at the end of the series?

Something like this?

>From de22dd5dee337db8590f46919616dd7ef2cfd002 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 13 Jul 2016 08:50:24 -0400
Subject: [PATCH] mm: revert NR_ANON_MAPPED to NR_ANON_PAGES

This reverts 'mm: rename NR_ANON_PAGES to NR_ANON_MAPPED', which had
the following rationale:

> NR_FILE_PAGES  is the number of        file pages.
> NR_FILE_MAPPED is the number of mapped file pages.
> NR_ANON_PAGES  is the number of mapped anon pages.
>
> This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
> NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
> have
>
> NR_FILE_PAGES  is the number of        file pages.
> NR_FILE_MAPPED is the number of mapped file pages.
> NR_ANON_MAPPED is the number of mapped anon pages.

Arguably, the symmetry is either between mapped and unmapped, or anon
and file, so both namings work. However, this change disconnected the
internal enum name from the name exported to userspace, which makes it
painful to trace back an observed statistic to its sources in the VM.

Revert back, such that NR_ANON_PAGES and NR_FILE_PAGES go together,
and NR_FILE_MAPPED is an elaboration on the latter. To make this even
clearer, reorder the statistics so that NR_FILE_MAPPED goes with the
other file page specifics, NR_FILE_DIRTY, NR_FILE_WRITEBACK, NR_SHMEM.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 drivers/base/node.c    | 2 +-
 fs/proc/meminfo.c      | 2 +-
 include/linux/mmzone.h | 4 ++--
 mm/migrate.c           | 2 +-
 mm/rmap.c              | 8 ++++----
 mm/vmstat.c            | 2 +-
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 89e4f96..50d4004 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -122,7 +122,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(pgdat, NR_WRITEBACK)),
 		       nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
 		       nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
-		       nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
+		       nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index c1fdcc1..f7b4bbd 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -140,7 +140,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.freeswap),
 		K(global_node_page_state(NR_FILE_DIRTY)),
 		K(global_node_page_state(NR_WRITEBACK)),
-		K(global_node_page_state(NR_ANON_MAPPED)),
+		K(global_node_page_state(NR_ANON_PAGES)),
 		K(global_node_page_state(NR_FILE_MAPPED)),
 		K(i.sharedram),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a3b7f45..fd16082 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,10 +144,10 @@ enum node_stat_item {
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
-	NR_ANON_MAPPED,	/* Mapped anonymous pages */
+	NR_ANON_PAGES,	/* Mapped anonymous pages */
+	NR_FILE_PAGES,
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
-	NR_FILE_PAGES,
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
diff --git a/mm/migrate.c b/mm/migrate.c
index ed2f85e..525679a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -501,7 +501,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	 * new page and drop references to the old page.
 	 *
 	 * Note that anonymous pages are accounted for
-	 * via NR_FILE_PAGES and NR_ANON_MAPPED if they
+	 * via NR_FILE_PAGES and NR_ANON_PAGES if they
 	 * are mapped to swap space.
 	 */
 	if (newzone != oldzone) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 414688c..203ba16 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1217,7 +1217,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 */
 		if (compound)
 			__inc_node_page_state(page, NR_ANON_THPS);
-		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1261,7 +1261,7 @@ void page_add_new_anon_rmap(struct page *page,
 		/* increment count (starts at -1) */
 		atomic_set(&page->_mapcount, 0);
 	}
-	__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
+	__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1378,7 +1378,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		clear_page_mlock(page);
 
 	if (nr) {
-		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
+		__mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, -nr);
 		deferred_split_huge_page(page);
 	}
 }
@@ -1407,7 +1407,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_node_page_state(page, NR_ANON_MAPPED);
+	__dec_node_page_state(page, NR_ANON_PAGES);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7415775..1d5de5d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -953,8 +953,8 @@ const char * const vmstat_text[] = {
 	"workingset_activate",
 	"workingset_nodereclaim",
 	"nr_anon_pages",
-	"nr_mapped",
 	"nr_file_pages",
+	"nr_mapped",
 	"nr_dirty",
 	"nr_writeback",
 	"nr_writeback_temp",
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-13 13:04         ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-13 13:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 09:55:16AM +0100, Mel Gorman wrote:
> On Tue, Jul 12, 2016 at 10:58:01AM -0400, Johannes Weiner wrote:
> > On Fri, Jul 08, 2016 at 10:34:54AM +0100, Mel Gorman wrote:
> > > NR_FILE_PAGES  is the number of        file pages.
> > > NR_FILE_MAPPED is the number of mapped file pages.
> > > NR_ANON_PAGES  is the number of mapped anon pages.
> > > 
> > > This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
> > > NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
> > > have
> > > 
> > > NR_FILE_PAGES  is the number of        file pages.
> > > NR_FILE_MAPPED is the number of mapped file pages.
> > > NR_ANON_MAPPED is the number of mapped anon pages.
> > 
> > That looks wrong to me. The symmetry is between NR_FILE_PAGES and
> > NR_ANON_PAGES. NR_FILE_MAPPED is merely elaborating on the mapped
> > subset of NR_FILE_PAGES, something which isn't necessary for anon
> > pages as they're always mapped.
> 
> How strongly do you feel about reverting it as later patches would cause
> lots of conflicts.
> 
> Obviously I found the new names clearer but I was thinking a lot at the
> time about mapped vs unmapped due to looking closely at both reclaim and
> [f|m]advise functions at the time. I found it mildly irksome to switch
> between the semantics of file/anon when looking at the vmstat updates.

I can see that. It all depends on whether you consider mapping state
or page type the more fundamental attribute, and coming from the
mapping perspective those new names make sense as well.

However, that leaves the disconnect between the enum name and what we
print to userspace. I find myself having to associate those quite a
lot to find all the sites that modify a given /proc/vmstat item, and
that's a bit of a pain if the names don't match.

I don't care strongly enough to cause a respin of half the series, and
it's not your problem that I waited until the last revision went into
mmots to review and comment. But if you agreed to a revert, would you
consider tacking on a revert patch at the end of the series?

Something like this?

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-13 13:04         ` Johannes Weiner
@ 2016-07-13 13:37           ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13 13:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 09:04:15AM -0400, Johannes Weiner wrote:
> > Obviously I found the new names clearer but I was thinking a lot at the
> > time about mapped vs unmapped due to looking closely at both reclaim and
> > [f|m]advise functions at the time. I found it mildly irksome to switch
> > between the semantics of file/anon when looking at the vmstat updates.
> 
> I can see that. It all depends on whether you consider mapping state
> or page type the more fundamental attribute, and coming from the
> mapping perspective those new names make sense as well.
> 

>From a reclaim perspective, I consider the mapped state to be more
important. This is particularly true when the advise calls are taken
into account. For example, madvise unmaps the pages without affecting
memory residency (distinct from RSS) without aging. fadvise ignores mapped
pages so the mapped state is very important for advise hints.  Similarly,
the mapped state can affect how the pages are aged as mapped pages affect
slab scan rates and incur TLB flushes on unmap. I guess I've been thinking
about mapped/unmapped a lot recently which pushed me towards distinct naming.

> However, that leaves the disconnect between the enum name and what we
> print to userspace. I find myself having to associate those quite a
> lot to find all the sites that modify a given /proc/vmstat item, and
> that's a bit of a pain if the names don't match.
> 

I was tempted to rename userspace what is printed to vmstat as well but
worried about breaking tools that parse it.

> I don't care strongly enough to cause a respin of half the series, and
> it's not your problem that I waited until the last revision went into
> mmots to review and comment. But if you agreed to a revert, would you
> consider tacking on a revert patch at the end of the series?
> 

In this case, I'm going to ask the other people on the cc for a
tie-breaker. If someone else prefers the old names then I'm happy for
your patch to be applied on top with my ack instead of respinning the
whole series.

Anyone for a tie breaker?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-13 13:37           ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-13 13:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 09:04:15AM -0400, Johannes Weiner wrote:
> > Obviously I found the new names clearer but I was thinking a lot at the
> > time about mapped vs unmapped due to looking closely at both reclaim and
> > [f|m]advise functions at the time. I found it mildly irksome to switch
> > between the semantics of file/anon when looking at the vmstat updates.
> 
> I can see that. It all depends on whether you consider mapping state
> or page type the more fundamental attribute, and coming from the
> mapping perspective those new names make sense as well.
> 

>From a reclaim perspective, I consider the mapped state to be more
important. This is particularly true when the advise calls are taken
into account. For example, madvise unmaps the pages without affecting
memory residency (distinct from RSS) without aging. fadvise ignores mapped
pages so the mapped state is very important for advise hints.  Similarly,
the mapped state can affect how the pages are aged as mapped pages affect
slab scan rates and incur TLB flushes on unmap. I guess I've been thinking
about mapped/unmapped a lot recently which pushed me towards distinct naming.

> However, that leaves the disconnect between the enum name and what we
> print to userspace. I find myself having to associate those quite a
> lot to find all the sites that modify a given /proc/vmstat item, and
> that's a bit of a pain if the names don't match.
> 

I was tempted to rename userspace what is printed to vmstat as well but
worried about breaking tools that parse it.

> I don't care strongly enough to cause a respin of half the series, and
> it's not your problem that I waited until the last revision went into
> mmots to review and comment. But if you agreed to a revert, would you
> consider tacking on a revert patch at the end of the series?
> 

In this case, I'm going to ask the other people on the cc for a
tie-breaker. If someone else prefers the old names then I'm happy for
your patch to be applied on top with my ack instead of respinning the
whole series.

Anyone for a tie breaker?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-13 13:37           ` Mel Gorman
@ 2016-07-13 21:13             ` Andrew Morton
  -1 siblings, 0 replies; 220+ messages in thread
From: Andrew Morton @ 2016-07-13 21:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, 13 Jul 2016 14:37:01 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> > I don't care strongly enough to cause a respin of half the series, and
> > it's not your problem that I waited until the last revision went into
> > mmots to review and comment. But if you agreed to a revert, would you
> > consider tacking on a revert patch at the end of the series?
> > 
> 
> In this case, I'm going to ask the other people on the cc for a
> tie-breaker. If someone else prefers the old names then I'm happy for
> your patch to be applied on top with my ack instead of respinning the
> whole series.
> 
> Anyone for a tie breaker?

I am aggressively undecided.  I guess as it's a bit of a 51/49
situation, the "stay with what people are familiar with" benefit tips the
balance toward the legacy names?

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-13 21:13             ` Andrew Morton
  0 siblings, 0 replies; 220+ messages in thread
From: Andrew Morton @ 2016-07-13 21:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, 13 Jul 2016 14:37:01 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> > I don't care strongly enough to cause a respin of half the series, and
> > it's not your problem that I waited until the last revision went into
> > mmots to review and comment. But if you agreed to a revert, would you
> > consider tacking on a revert patch at the end of the series?
> > 
> 
> In this case, I'm going to ask the other people on the cc for a
> tie-breaker. If someone else prefers the old names then I'm happy for
> your patch to be applied on top with my ack instead of respinning the
> whole series.
> 
> Anyone for a tie breaker?

I am aggressively undecided.  I guess as it's a bit of a 51/49
situation, the "stay with what people are familiar with" benefit tips the
balance toward the legacy names?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-13 13:37           ` Mel Gorman
@ 2016-07-14  1:27             ` Minchan Kim
  -1 siblings, 0 replies; 220+ messages in thread
From: Minchan Kim @ 2016-07-14  1:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 02:37:01PM +0100, Mel Gorman wrote:
> On Wed, Jul 13, 2016 at 09:04:15AM -0400, Johannes Weiner wrote:
> > > Obviously I found the new names clearer but I was thinking a lot at the
> > > time about mapped vs unmapped due to looking closely at both reclaim and
> > > [f|m]advise functions at the time. I found it mildly irksome to switch
> > > between the semantics of file/anon when looking at the vmstat updates.
> > 
> > I can see that. It all depends on whether you consider mapping state
> > or page type the more fundamental attribute, and coming from the
> > mapping perspective those new names make sense as well.
> > 
> 
> From a reclaim perspective, I consider the mapped state to be more
> important. This is particularly true when the advise calls are taken
> into account. For example, madvise unmaps the pages without affecting
> memory residency (distinct from RSS) without aging. fadvise ignores mapped
> pages so the mapped state is very important for advise hints.  Similarly,
> the mapped state can affect how the pages are aged as mapped pages affect
> slab scan rates and incur TLB flushes on unmap. I guess I've been thinking
> about mapped/unmapped a lot recently which pushed me towards distinct naming.
> 
> > However, that leaves the disconnect between the enum name and what we
> > print to userspace. I find myself having to associate those quite a
> > lot to find all the sites that modify a given /proc/vmstat item, and
> > that's a bit of a pain if the names don't match.
> > 
> 
> I was tempted to rename userspace what is printed to vmstat as well but
> worried about breaking tools that parse it.
> 
> > I don't care strongly enough to cause a respin of half the series, and
> > it's not your problem that I waited until the last revision went into
> > mmots to review and comment. But if you agreed to a revert, would you
> > consider tacking on a revert patch at the end of the series?
> > 
> 
> In this case, I'm going to ask the other people on the cc for a
> tie-breaker. If someone else prefers the old names then I'm happy for
> your patch to be applied on top with my ack instead of respinning the
> whole series.
> 
> Anyone for a tie breaker?

I have thought it from reclaim perspective for a long time so I tempted to
change the naming like new one but there is no big justification for that.
In this chance, I vote new name.

Thanks.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-14  1:27             ` Minchan Kim
  0 siblings, 0 replies; 220+ messages in thread
From: Minchan Kim @ 2016-07-14  1:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 02:37:01PM +0100, Mel Gorman wrote:
> On Wed, Jul 13, 2016 at 09:04:15AM -0400, Johannes Weiner wrote:
> > > Obviously I found the new names clearer but I was thinking a lot at the
> > > time about mapped vs unmapped due to looking closely at both reclaim and
> > > [f|m]advise functions at the time. I found it mildly irksome to switch
> > > between the semantics of file/anon when looking at the vmstat updates.
> > 
> > I can see that. It all depends on whether you consider mapping state
> > or page type the more fundamental attribute, and coming from the
> > mapping perspective those new names make sense as well.
> > 
> 
> From a reclaim perspective, I consider the mapped state to be more
> important. This is particularly true when the advise calls are taken
> into account. For example, madvise unmaps the pages without affecting
> memory residency (distinct from RSS) without aging. fadvise ignores mapped
> pages so the mapped state is very important for advise hints.  Similarly,
> the mapped state can affect how the pages are aged as mapped pages affect
> slab scan rates and incur TLB flushes on unmap. I guess I've been thinking
> about mapped/unmapped a lot recently which pushed me towards distinct naming.
> 
> > However, that leaves the disconnect between the enum name and what we
> > print to userspace. I find myself having to associate those quite a
> > lot to find all the sites that modify a given /proc/vmstat item, and
> > that's a bit of a pain if the names don't match.
> > 
> 
> I was tempted to rename userspace what is printed to vmstat as well but
> worried about breaking tools that parse it.
> 
> > I don't care strongly enough to cause a respin of half the series, and
> > it's not your problem that I waited until the last revision went into
> > mmots to review and comment. But if you agreed to a revert, would you
> > consider tacking on a revert patch at the end of the series?
> > 
> 
> In this case, I'm going to ask the other people on the cc for a
> tie-breaker. If someone else prefers the old names then I'm happy for
> your patch to be applied on top with my ack instead of respinning the
> whole series.
> 
> Anyone for a tie breaker?

I have thought it from reclaim perspective for a long time so I tempted to
change the naming like new one but there is no big justification for that.
In this chance, I vote new name.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 05/34] mm, vmscan: begin reclaiming pages on a per-node basis
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-14  9:19     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14  9:19 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:34 AM, Mel Gorman wrote:
> This patch makes reclaim decisions on a per-node basis.  A reclaimer knows
> what zone is required by the allocation request and skips pages from
> higher zones.  In many cases this will be ok because it's a GFP_HIGHMEM
> request of some description.  On 64-bit, ZONE_DMA32 requests will cause
> some problems but 32-bit devices on 64-bit platforms are increasingly
> rare.  Historically it would have been a major problem on 32-bit with big
> Highmem:Lowmem ratios but such configurations are also now rare and even
> where they exist, they are not encouraged.  If it really becomes a
> problem, it'll manifest as very low reclaim efficiencies.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

I think my previous complaints are fixed.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

[...]

> @@ -2553,7 +2572,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	unsigned long nr_soft_reclaimed;
>  	unsigned long nr_soft_scanned;
>  	gfp_t orig_mask;
> -	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
> +	enum zone_type classzone_idx;
>
>  	/*
>  	 * If the number of buffer_heads in the machine exceeds the maximum
> @@ -2561,17 +2580,23 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	 * highmem pages could be pinning lowmem pages storing buffer_heads
>  	 */
>  	orig_mask = sc->gfp_mask;
> -	if (buffer_heads_over_limit)
> +	if (buffer_heads_over_limit) {
>  		sc->gfp_mask |= __GFP_HIGHMEM;
> +		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);

Setting classzone_idx seems pointless here as it will be overwritten in 
the for loop. Unless that changes with some later patch. Anyway it 
doesn't hurt anything.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 05/34] mm, vmscan: begin reclaiming pages on a per-node basis
@ 2016-07-14  9:19     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14  9:19 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:34 AM, Mel Gorman wrote:
> This patch makes reclaim decisions on a per-node basis.  A reclaimer knows
> what zone is required by the allocation request and skips pages from
> higher zones.  In many cases this will be ok because it's a GFP_HIGHMEM
> request of some description.  On 64-bit, ZONE_DMA32 requests will cause
> some problems but 32-bit devices on 64-bit platforms are increasingly
> rare.  Historically it would have been a major problem on 32-bit with big
> Highmem:Lowmem ratios but such configurations are also now rare and even
> where they exist, they are not encouraged.  If it really becomes a
> problem, it'll manifest as very low reclaim efficiencies.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

I think my previous complaints are fixed.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

[...]

> @@ -2553,7 +2572,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	unsigned long nr_soft_reclaimed;
>  	unsigned long nr_soft_scanned;
>  	gfp_t orig_mask;
> -	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
> +	enum zone_type classzone_idx;
>
>  	/*
>  	 * If the number of buffer_heads in the machine exceeds the maximum
> @@ -2561,17 +2580,23 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	 * highmem pages could be pinning lowmem pages storing buffer_heads
>  	 */
>  	orig_mask = sc->gfp_mask;
> -	if (buffer_heads_over_limit)
> +	if (buffer_heads_over_limit) {
>  		sc->gfp_mask |= __GFP_HIGHMEM;
> +		sc->reclaim_idx = classzone_idx = gfp_zone(sc->gfp_mask);

Setting classzone_idx seems pointless here as it will be overwritten in 
the for loop. Unless that changes with some later patch. Anyway it 
doesn't hurt anything.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-14  9:45     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14  9:45 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:34 AM, Mel Gorman wrote:
> Reclaim may stall if there is too much dirty or congested data on a node.
> This was previously based on zone flags and the logic for clearing the
> flags is in two places.  As congestion/dirty tracking is now tracked on a
> per-node basis, we can remove some duplicate logic.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

I don't think it's that big a side-effect to be concerned. I would be 
more worried about frequently dirtying cache lines from other 
zone_balanced callers where, but I guess even the calls from 
wakeup_kswapd() rarely return true for zone_balanced(), otherwise the 
wakeups wouldn't have a reason to happen.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/vmscan.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 01fe4708e404..8b39b903bd14 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3008,7 +3008,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
>  {
>  	unsigned long mark = high_wmark_pages(zone);
>
> -	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
> +	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
> +		return false;
> +
> +	/*
> +	 * If any eligible zone is balanced then the node is not considered
> +	 * to be congested or dirty
> +	 */
> +	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
> +
> +	return true;
>  }
>
>  /*
> @@ -3154,13 +3164,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			if (!zone_balanced(zone, order, 0)) {
>  				classzone_idx = i;
>  				break;
> -			} else {
> -				/*
> -				 * If any eligible zone is balanced then the
> -				 * node is not considered congested or dirty.
> -				 */
> -				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> -				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
>  			}
>  		}
>
> @@ -3219,11 +3222,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			if (!populated_zone(zone))
>  				continue;
>
> -			if (zone_balanced(zone, sc.order, classzone_idx)) {
> -				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> -				clear_bit(PGDAT_DIRTY, &pgdat->flags);
> +			if (zone_balanced(zone, sc.order, classzone_idx))
>  				goto out;
> -			}
>  		}
>
>  		/*
>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
@ 2016-07-14  9:45     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14  9:45 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:34 AM, Mel Gorman wrote:
> Reclaim may stall if there is too much dirty or congested data on a node.
> This was previously based on zone flags and the logic for clearing the
> flags is in two places.  As congestion/dirty tracking is now tracked on a
> per-node basis, we can remove some duplicate logic.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

I don't think it's that big a side-effect to be concerned. I would be 
more worried about frequently dirtying cache lines from other 
zone_balanced callers where, but I guess even the calls from 
wakeup_kswapd() rarely return true for zone_balanced(), otherwise the 
wakeups wouldn't have a reason to happen.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/vmscan.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 01fe4708e404..8b39b903bd14 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3008,7 +3008,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
>  {
>  	unsigned long mark = high_wmark_pages(zone);
>
> -	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
> +	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
> +		return false;
> +
> +	/*
> +	 * If any eligible zone is balanced then the node is not considered
> +	 * to be congested or dirty
> +	 */
> +	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
> +
> +	return true;
>  }
>
>  /*
> @@ -3154,13 +3164,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			if (!zone_balanced(zone, order, 0)) {
>  				classzone_idx = i;
>  				break;
> -			} else {
> -				/*
> -				 * If any eligible zone is balanced then the
> -				 * node is not considered congested or dirty.
> -				 */
> -				clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> -				clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
>  			}
>  		}
>
> @@ -3219,11 +3222,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			if (!populated_zone(zone))
>  				continue;
>
> -			if (zone_balanced(zone, sc.order, classzone_idx)) {
> -				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> -				clear_bit(PGDAT_DIRTY, &pgdat->flags);
> +			if (zone_balanced(zone, sc.order, classzone_idx))
>  				goto out;
> -			}
>  		}
>
>  		/*
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 22/34] mm, page_alloc: wake kswapd based on the highest eligible zone
  2016-07-08  9:34   ` Mel Gorman
@ 2016-07-14 10:05     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 10:05 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:34 AM, Mel Gorman wrote:
> The ac_classzone_idx is used as the basis for waking kswapd and that is based
> on the preferred zoneref. If the preferred zoneref's first zone is lower
> than what is available on other nodes, it's possible that kswapd is woken
> on a zone with only higher, but still eligible, zones. As classzone_idx
> is strictly adhered to now, it causes a problem because eligible pages
> are skipped.
>
> For example, node 0 has only DMA32 and node 1 has only NORMAL. An allocating
> context running on node 0 may wake kswapd on node 1 telling it to skip
> all NORMAL pages.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bb261885c121..e6ee52f1c15f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3415,7 +3415,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
>  	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
>  					ac->high_zoneidx, ac->nodemask) {
>  		if (last_pgdat != zone->zone_pgdat)
> -			wakeup_kswapd(zone, order, ac_classzone_idx(ac));
> +			wakeup_kswapd(zone, order, ac->high_zoneidx);
>  		last_pgdat = zone->zone_pgdat;
>  	}
>  }
>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 22/34] mm, page_alloc: wake kswapd based on the highest eligible zone
@ 2016-07-14 10:05     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 10:05 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:34 AM, Mel Gorman wrote:
> The ac_classzone_idx is used as the basis for waking kswapd and that is based
> on the preferred zoneref. If the preferred zoneref's first zone is lower
> than what is available on other nodes, it's possible that kswapd is woken
> on a zone with only higher, but still eligible, zones. As classzone_idx
> is strictly adhered to now, it causes a problem because eligible pages
> are skipped.
>
> For example, node 0 has only DMA32 and node 1 has only NORMAL. An allocating
> context running on node 0 may wake kswapd on node 1 telling it to skip
> all NORMAL pages.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bb261885c121..e6ee52f1c15f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3415,7 +3415,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
>  	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
>  					ac->high_zoneidx, ac->nodemask) {
>  		if (last_pgdat != zone->zone_pgdat)
> -			wakeup_kswapd(zone, order, ac_classzone_idx(ac));
> +			wakeup_kswapd(zone, order, ac->high_zoneidx);
>  		last_pgdat = zone->zone_pgdat;
>  	}
>  }
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 24/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-14 10:09     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 10:09 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> shrink_node receives all information it needs about classzone_idx
> from sc->reclaim_idx so remove the aliases.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 24/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node
@ 2016-07-14 10:09     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 10:09 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> shrink_node receives all information it needs about classzone_idx
> from sc->reclaim_idx so remove the aliases.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 25/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-14 12:12     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 12:12 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> The scan_control structure has enough information available for
> compaction_ready() to make a decision. The classzone_idx manipulations in
> shrink_zones() are no longer necessary as the highest populated zone is
> no longer used to determine if shrink_slab should be called or not.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -2621,8 +2609,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  			 */
>  			if (IS_ENABLED(CONFIG_COMPACTION) &&
>  			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> -			    zonelist_zone_idx(z) <= classzone_idx &&
> -			    compaction_ready(zone, sc->order, classzone_idx)) {
> +			    zonelist_zone_idx(z) <= sc->reclaim_idx &&

Hm I notice that the condition on the line above should be always true 
as the same sc->reclaim_idx value is used to limit zone_idx in 
for_each_zone_zonelist_nodemask(). The implication was probably true 
even before this patch (classzone_idx would be <= sc->reclaim_idx), but 
now it stands out clearly.

> +			    compaction_ready(zone, sc)) {
>  				sc->compaction_ready = true;
>  				continue;
>  			}
>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 25/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready
@ 2016-07-14 12:12     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 12:12 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> The scan_control structure has enough information available for
> compaction_ready() to make a decision. The classzone_idx manipulations in
> shrink_zones() are no longer necessary as the highest populated zone is
> no longer used to determine if shrink_slab should be called or not.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -2621,8 +2609,8 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  			 */
>  			if (IS_ENABLED(CONFIG_COMPACTION) &&
>  			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> -			    zonelist_zone_idx(z) <= classzone_idx &&
> -			    compaction_ready(zone, sc->order, classzone_idx)) {
> +			    zonelist_zone_idx(z) <= sc->reclaim_idx &&

Hm I notice that the condition on the line above should be always true 
as the same sc->reclaim_idx value is used to limit zone_idx in 
for_each_zone_zonelist_nodemask(). The implication was probably true 
even before this patch (classzone_idx would be <= sc->reclaim_idx), but 
now it stands out clearly.

> +			    compaction_ready(zone, sc)) {
>  				sc->compaction_ready = true;
>  				continue;
>  			}
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 26/34] mm, vmscan: avoid passing in remaining unnecessarily to prepare_kswapd_sleep
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-14 12:48     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 12:48 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep
> always passes in 0 for remaining and the second call can trivially
> check the parameter in advance.
>
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 26/34] mm, vmscan: avoid passing in remaining unnecessarily to prepare_kswapd_sleep
@ 2016-07-14 12:48     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 12:48 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep
> always passes in 0 for remaining and the second call can trivially
> check the parameter in advance.
>
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 27/34] mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-14 12:54     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 12:54 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> The buffer_heads_over_limit limit in kswapd is inconsistent with direct
> reclaim behaviour. It may force an an attempt to reclaim from all zones and
> then not reclaim at all because higher zones were balanced than required
> by the original request.
>
> This patch will causes kswapd to consider reclaiming from all zones if
> buffer_heads_over_limit.  However, if there are eligible zones for the
> allocation request that woke kswapd then no reclaim will occur even if
> buffer_heads_over_limit. This avoids kswapd over-reclaiming just because
> buffer_heads_over_limit.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 27/34] mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit
@ 2016-07-14 12:54     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 12:54 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> The buffer_heads_over_limit limit in kswapd is inconsistent with direct
> reclaim behaviour. It may force an an attempt to reclaim from all zones and
> then not reclaim at all because higher zones were balanced than required
> by the original request.
>
> This patch will causes kswapd to consider reclaiming from all zones if
> buffer_heads_over_limit.  However, if there are eligible zones for the
> allocation request that woke kswapd then no reclaim will occur even if
> buffer_heads_over_limit. This avoids kswapd over-reclaiming just because
> buffer_heads_over_limit.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 33/34] mm, vmstat: print node-based stats in zoneinfo file
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-14 12:56     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 12:56 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> There are a number of stats that were previously accessible via zoneinfo
> that are now invisible. While it is possible to create a new file for the
> node stats, this may be missed by users. Instead this patch prints the
> stats under the first populated zone in /proc/zoneinfo.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 33/34] mm, vmstat: print node-based stats in zoneinfo file
@ 2016-07-14 12:56     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 12:56 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> There are a number of stats that were previously accessible via zoneinfo
> that are now invisible. While it is possible to create a new file for the
> node stats, this may be missed by users. Instead this patch prints the
> stats under the first populated zone in /proc/zoneinfo.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries
  2016-07-08  9:35   ` Mel Gorman
@ 2016-07-14 13:40     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 13:40 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> The number of LRU pages, dirty pages and writeback pages must be accounted
> for on both zones and nodes because of the reclaim retry logic, compaction
> retry logic and highmem calculations all depending on per-zone stats.
>
> Many lowmem allocations are immune from OOM kill due to a check in
> __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
> 03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The exception
> is costly high-order allocations or allocations that cannot fail. If the
> __alloc_pages_may_oom avoids OOM-kill for low-order lowmem allocations
> then it would fall through to __alloc_pages_direct_compact.
>
> This patch will blindly retry reclaim for zone-constrained allocations
> in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal but
> without per-zone stats there are not many alternatives. The impact it that
> zone-constrained allocations may delay before considering the OOM killer.
>
> As there is no guarantee enough memory can ever be freed to satisfy
> compaction, this patch avoids retrying compaction for zone-contrained
> allocations.
>
> In combination, that means that the per-node stats can be used when deciding
> whether to continue reclaim using a rough approximation.  While it is
> possible this will make the wrong decision on occasion, it will not infinite
> loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES.
>
> The final step is calculating the number of dirtyable highmem pages. As
> those calculations only care about the global count of file pages in
> highmem. This patch uses a global counter used instead of per-zone stats
> as it is sufficient.
>
> In combination, this allows the per-zone LRU and dirty state counters to
> be removed.
>
> Suggested by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

The resulting should_reclaim_retry() makes my head spin, I hope Michal 
can make more sense of it :) So just some comments below.

> @@ -4,6 +4,26 @@
>  #include <linux/huge_mm.h>
>  #include <linux/swap.h>
>
> +#ifdef CONFIG_HIGHMEM
> +extern atomic_t highmem_file_pages;
> +
> +static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> +							int nr_pages)
> +{
> +	if (is_highmem_idx(zid) && is_file_lru(lru)) {
> +		if (nr_pages > 0)

This seems like a unnecessary branch, atomic_add should handle negative 
nr_pages just fine?

> +			atomic_add(nr_pages, &highmem_file_pages);
> +		else
> +			atomic_sub(nr_pages, &highmem_file_pages);
> +	}
> +}

[...]

> @@ -1446,6 +1446,11 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>  {
>  	struct zone *zone;
>  	struct zoneref *z;
> +	pg_data_t *last_pgdat = NULL;
> +
> +	/* Do not retry compaction for zone-constrained allocations */
> +	if (ac->high_zoneidx < ZONE_NORMAL)
> +		return false;
>
>  	/*
>  	 * Make sure at least one zone would pass __compaction_suitable if we continue
> @@ -1456,14 +1461,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>  		unsigned long available;
>  		enum compact_result compact_result;
>
> +		if (last_pgdat == zone->zone_pgdat)
> +			continue;
> +
> +		/*
> +		 * This over-estimates the number of pages available for
> +		 * reclaim/compaction but walking the LRU would take too
> +		 * long. The consequences are that compaction may retry
> +		 * longer than it should for a zone-constrained allocation
> +		 * request.

The comment above says that we don't retry zone-constrained at all. Is 
this an obsolete comment, or does it refer to the ZONE_NORMAL 
constraint? (as opposed to HIGHMEM, MOVABLE etc?).

[...]

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  {
>  	struct zone *zone;
>  	struct zoneref *z;
> +	pg_data_t *current_pgdat = NULL;
>
>  	/*
>  	 * Make sure we converge to OOM if we cannot make any progress
> @@ -3454,6 +3455,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		return false;
>
>  	/*
> +	 * Blindly retry lowmem allocation requests that are often ignored by
> +	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
> +	 * and fast means of calculating reclaimable, dirty and writeback pages
> +	 * in eligible zones.
> +	 */
> +	if (ac->high_zoneidx < ZONE_NORMAL)
> +		goto out;

A goto inside two nested for cycles? Is there no hope for sanity? :(

> +
> +	/*
>  	 * Keep reclaiming pages while there is a chance this will lead somewhere.
>  	 * If none of the target zones can satisfy our allocation request even
>  	 * if all reclaimable pages are considered then we are screwed and have
> @@ -3463,18 +3473,38 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  					ac->nodemask) {
>  		unsigned long available;
>  		unsigned long reclaimable;
> +		int zid;
>
> -		available = reclaimable = zone_reclaimable_pages(zone);
> +		if (current_pgdat == zone->zone_pgdat)
> +			continue;
> +
> +		current_pgdat = zone->zone_pgdat;
> +		available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
>  		available -= DIV_ROUND_UP(no_progress_loops * available,
>  					  MAX_RECLAIM_RETRIES);
> -		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +
> +		/* Account for all free pages on eligible zones */
> +		for (zid = 0; zid <= zone_idx(zone); zid++) {
> +			struct zone *acct_zone = &current_pgdat->node_zones[zid];
> +
> +			available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
> +		}
>
>  		/*
>  		 * Would the allocation succeed if we reclaimed the whole
> -		 * available?
> +		 * available? This is approximate because there is no
> +		 * accurate count of reclaimable pages per zone.
>  		 */
> -		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> -				ac_classzone_idx(ac), alloc_flags, available)) {
> +		for (zid = 0; zid <= zone_idx(zone); zid++) {
> +			struct zone *check_zone = &current_pgdat->node_zones[zid];
> +			unsigned long estimate;
> +
> +			estimate = min(check_zone->managed_pages, available);
> +			if (!__zone_watermark_ok(check_zone, order,
> +					min_wmark_pages(check_zone), ac_classzone_idx(ac),
> +					alloc_flags, estimate))
> +				continue;
> +
>  			/*
>  			 * If we didn't make any progress and have a lot of
>  			 * dirty + writeback pages then we should wait for
> @@ -3484,15 +3514,16 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  			if (!did_some_progress) {
>  				unsigned long write_pending;
>
> -				write_pending = zone_page_state_snapshot(zone,
> -							NR_ZONE_WRITE_PENDING);
> +				write_pending =
> +					node_page_state(current_pgdat, NR_WRITEBACK) +
> +					node_page_state(current_pgdat, NR_FILE_DIRTY);
>
>  				if (2 * write_pending > reclaimable) {
>  					congestion_wait(BLK_RW_ASYNC, HZ/10);
>  					return true;
>  				}
>  			}
> -
> +out:
>  			/*
>  			 * Memory allocation/reclaim might be called from a WQ
>  			 * context and the current implementation of the WQ

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries
@ 2016-07-14 13:40     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-14 13:40 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On 07/08/2016 11:35 AM, Mel Gorman wrote:
> The number of LRU pages, dirty pages and writeback pages must be accounted
> for on both zones and nodes because of the reclaim retry logic, compaction
> retry logic and highmem calculations all depending on per-zone stats.
>
> Many lowmem allocations are immune from OOM kill due to a check in
> __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
> 03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The exception
> is costly high-order allocations or allocations that cannot fail. If the
> __alloc_pages_may_oom avoids OOM-kill for low-order lowmem allocations
> then it would fall through to __alloc_pages_direct_compact.
>
> This patch will blindly retry reclaim for zone-constrained allocations
> in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal but
> without per-zone stats there are not many alternatives. The impact it that
> zone-constrained allocations may delay before considering the OOM killer.
>
> As there is no guarantee enough memory can ever be freed to satisfy
> compaction, this patch avoids retrying compaction for zone-contrained
> allocations.
>
> In combination, that means that the per-node stats can be used when deciding
> whether to continue reclaim using a rough approximation.  While it is
> possible this will make the wrong decision on occasion, it will not infinite
> loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES.
>
> The final step is calculating the number of dirtyable highmem pages. As
> those calculations only care about the global count of file pages in
> highmem. This patch uses a global counter used instead of per-zone stats
> as it is sufficient.
>
> In combination, this allows the per-zone LRU and dirty state counters to
> be removed.
>
> Suggested by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

The resulting should_reclaim_retry() makes my head spin, I hope Michal 
can make more sense of it :) So just some comments below.

> @@ -4,6 +4,26 @@
>  #include <linux/huge_mm.h>
>  #include <linux/swap.h>
>
> +#ifdef CONFIG_HIGHMEM
> +extern atomic_t highmem_file_pages;
> +
> +static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> +							int nr_pages)
> +{
> +	if (is_highmem_idx(zid) && is_file_lru(lru)) {
> +		if (nr_pages > 0)

This seems like a unnecessary branch, atomic_add should handle negative 
nr_pages just fine?

> +			atomic_add(nr_pages, &highmem_file_pages);
> +		else
> +			atomic_sub(nr_pages, &highmem_file_pages);
> +	}
> +}

[...]

> @@ -1446,6 +1446,11 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>  {
>  	struct zone *zone;
>  	struct zoneref *z;
> +	pg_data_t *last_pgdat = NULL;
> +
> +	/* Do not retry compaction for zone-constrained allocations */
> +	if (ac->high_zoneidx < ZONE_NORMAL)
> +		return false;
>
>  	/*
>  	 * Make sure at least one zone would pass __compaction_suitable if we continue
> @@ -1456,14 +1461,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>  		unsigned long available;
>  		enum compact_result compact_result;
>
> +		if (last_pgdat == zone->zone_pgdat)
> +			continue;
> +
> +		/*
> +		 * This over-estimates the number of pages available for
> +		 * reclaim/compaction but walking the LRU would take too
> +		 * long. The consequences are that compaction may retry
> +		 * longer than it should for a zone-constrained allocation
> +		 * request.

The comment above says that we don't retry zone-constrained at all. Is 
this an obsolete comment, or does it refer to the ZONE_NORMAL 
constraint? (as opposed to HIGHMEM, MOVABLE etc?).

[...]

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  {
>  	struct zone *zone;
>  	struct zoneref *z;
> +	pg_data_t *current_pgdat = NULL;
>
>  	/*
>  	 * Make sure we converge to OOM if we cannot make any progress
> @@ -3454,6 +3455,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		return false;
>
>  	/*
> +	 * Blindly retry lowmem allocation requests that are often ignored by
> +	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
> +	 * and fast means of calculating reclaimable, dirty and writeback pages
> +	 * in eligible zones.
> +	 */
> +	if (ac->high_zoneidx < ZONE_NORMAL)
> +		goto out;

A goto inside two nested for cycles? Is there no hope for sanity? :(

> +
> +	/*
>  	 * Keep reclaiming pages while there is a chance this will lead somewhere.
>  	 * If none of the target zones can satisfy our allocation request even
>  	 * if all reclaimable pages are considered then we are screwed and have
> @@ -3463,18 +3473,38 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  					ac->nodemask) {
>  		unsigned long available;
>  		unsigned long reclaimable;
> +		int zid;
>
> -		available = reclaimable = zone_reclaimable_pages(zone);
> +		if (current_pgdat == zone->zone_pgdat)
> +			continue;
> +
> +		current_pgdat = zone->zone_pgdat;
> +		available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
>  		available -= DIV_ROUND_UP(no_progress_loops * available,
>  					  MAX_RECLAIM_RETRIES);
> -		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +
> +		/* Account for all free pages on eligible zones */
> +		for (zid = 0; zid <= zone_idx(zone); zid++) {
> +			struct zone *acct_zone = &current_pgdat->node_zones[zid];
> +
> +			available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
> +		}
>
>  		/*
>  		 * Would the allocation succeed if we reclaimed the whole
> -		 * available?
> +		 * available? This is approximate because there is no
> +		 * accurate count of reclaimable pages per zone.
>  		 */
> -		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> -				ac_classzone_idx(ac), alloc_flags, available)) {
> +		for (zid = 0; zid <= zone_idx(zone); zid++) {
> +			struct zone *check_zone = &current_pgdat->node_zones[zid];
> +			unsigned long estimate;
> +
> +			estimate = min(check_zone->managed_pages, available);
> +			if (!__zone_watermark_ok(check_zone, order,
> +					min_wmark_pages(check_zone), ac_classzone_idx(ac),
> +					alloc_flags, estimate))
> +				continue;
> +
>  			/*
>  			 * If we didn't make any progress and have a lot of
>  			 * dirty + writeback pages then we should wait for
> @@ -3484,15 +3514,16 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  			if (!did_some_progress) {
>  				unsigned long write_pending;
>
> -				write_pending = zone_page_state_snapshot(zone,
> -							NR_ZONE_WRITE_PENDING);
> +				write_pending =
> +					node_page_state(current_pgdat, NR_WRITEBACK) +
> +					node_page_state(current_pgdat, NR_FILE_DIRTY);
>
>  				if (2 * write_pending > reclaimable) {
>  					congestion_wait(BLK_RW_ASYNC, HZ/10);
>  					return true;
>  				}
>  			}
> -
> +out:
>  			/*
>  			 * Memory allocation/reclaim might be called from a WQ
>  			 * context and the current implementation of the WQ

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries
  2016-07-14 13:40     ` Vlastimil Babka
@ 2016-07-15  7:48       ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-15  7:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner,
	Minchan Kim, Joonsoo Kim, LKML

On Thu, Jul 14, 2016 at 03:40:11PM +0200, Vlastimil Babka wrote:
> >@@ -4,6 +4,26 @@
> > #include <linux/huge_mm.h>
> > #include <linux/swap.h>
> >
> >+#ifdef CONFIG_HIGHMEM
> >+extern atomic_t highmem_file_pages;
> >+
> >+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> >+							int nr_pages)
> >+{
> >+	if (is_highmem_idx(zid) && is_file_lru(lru)) {
> >+		if (nr_pages > 0)
> 
> This seems like a unnecessary branch, atomic_add should handle negative
> nr_pages just fine?
> 

On x86 it would but the interface makes no guarantees it'll handle
signed types properly on all architectures.

> >@@ -1456,14 +1461,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> > 		unsigned long available;
> > 		enum compact_result compact_result;
> >
> >+		if (last_pgdat == zone->zone_pgdat)
> >+			continue;
> >+
> >+		/*
> >+		 * This over-estimates the number of pages available for
> >+		 * reclaim/compaction but walking the LRU would take too
> >+		 * long. The consequences are that compaction may retry
> >+		 * longer than it should for a zone-constrained allocation
> >+		 * request.
> 
> The comment above says that we don't retry zone-constrained at all. Is this
> an obsolete comment, or does it refer to the ZONE_NORMAL constraint? (as
> opposed to HIGHMEM, MOVABLE etc?).
> 

It can still over-estimate the amount of memory available if
ZONE_MOVABLE exists even if the request is not zone-constrained.

> >@@ -3454,6 +3455,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> > 		return false;
> >
> > 	/*
> >+	 * Blindly retry lowmem allocation requests that are often ignored by
> >+	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
> >+	 * and fast means of calculating reclaimable, dirty and writeback pages
> >+	 * in eligible zones.
> >+	 */
> >+	if (ac->high_zoneidx < ZONE_NORMAL)
> >+		goto out;
> 
> A goto inside two nested for cycles? Is there no hope for sanity? :(
> 

None, hand it in at the door.

It can be pulled out and put past the "return false" at the end. It's
just not necessarily any better.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries
@ 2016-07-15  7:48       ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-15  7:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner,
	Minchan Kim, Joonsoo Kim, LKML

On Thu, Jul 14, 2016 at 03:40:11PM +0200, Vlastimil Babka wrote:
> >@@ -4,6 +4,26 @@
> > #include <linux/huge_mm.h>
> > #include <linux/swap.h>
> >
> >+#ifdef CONFIG_HIGHMEM
> >+extern atomic_t highmem_file_pages;
> >+
> >+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> >+							int nr_pages)
> >+{
> >+	if (is_highmem_idx(zid) && is_file_lru(lru)) {
> >+		if (nr_pages > 0)
> 
> This seems like a unnecessary branch, atomic_add should handle negative
> nr_pages just fine?
> 

On x86 it would but the interface makes no guarantees it'll handle
signed types properly on all architectures.

> >@@ -1456,14 +1461,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> > 		unsigned long available;
> > 		enum compact_result compact_result;
> >
> >+		if (last_pgdat == zone->zone_pgdat)
> >+			continue;
> >+
> >+		/*
> >+		 * This over-estimates the number of pages available for
> >+		 * reclaim/compaction but walking the LRU would take too
> >+		 * long. The consequences are that compaction may retry
> >+		 * longer than it should for a zone-constrained allocation
> >+		 * request.
> 
> The comment above says that we don't retry zone-constrained at all. Is this
> an obsolete comment, or does it refer to the ZONE_NORMAL constraint? (as
> opposed to HIGHMEM, MOVABLE etc?).
> 

It can still over-estimate the amount of memory available if
ZONE_MOVABLE exists even if the request is not zone-constrained.

> >@@ -3454,6 +3455,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> > 		return false;
> >
> > 	/*
> >+	 * Blindly retry lowmem allocation requests that are often ignored by
> >+	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
> >+	 * and fast means of calculating reclaimable, dirty and writeback pages
> >+	 * in eligible zones.
> >+	 */
> >+	if (ac->high_zoneidx < ZONE_NORMAL)
> >+		goto out;
> 
> A goto inside two nested for cycles? Is there no hope for sanity? :(
> 

None, hand it in at the door.

It can be pulled out and put past the "return false" at the end. It's
just not necessarily any better.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-13 21:13             ` Andrew Morton
@ 2016-07-15 10:46               ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-15 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 02:13:43PM -0700, Andrew Morton wrote:
> On Wed, 13 Jul 2016 14:37:01 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > > I don't care strongly enough to cause a respin of half the series, and
> > > it's not your problem that I waited until the last revision went into
> > > mmots to review and comment. But if you agreed to a revert, would you
> > > consider tacking on a revert patch at the end of the series?
> > > 
> > 
> > In this case, I'm going to ask the other people on the cc for a
> > tie-breaker. If someone else prefers the old names then I'm happy for
> > your patch to be applied on top with my ack instead of respinning the
> > whole series.
> > 
> > Anyone for a tie breaker?
> 
> I am aggressively undecided.  I guess as it's a bit of a 51/49
> situation, the "stay with what people are familiar with" benefit tips the
> balance toward the legacy names?
> 

I still can't decide. It's currently still a draw in terms of naming. If
you're worried, use the old naming. It wouldn't be the first time I
thought a name was odd.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-15 10:46               ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-07-15 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Wed, Jul 13, 2016 at 02:13:43PM -0700, Andrew Morton wrote:
> On Wed, 13 Jul 2016 14:37:01 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > > I don't care strongly enough to cause a respin of half the series, and
> > > it's not your problem that I waited until the last revision went into
> > > mmots to review and comment. But if you agreed to a revert, would you
> > > consider tacking on a revert patch at the end of the series?
> > > 
> > 
> > In this case, I'm going to ask the other people on the cc for a
> > tie-breaker. If someone else prefers the old names then I'm happy for
> > your patch to be applied on top with my ack instead of respinning the
> > whole series.
> > 
> > Anyone for a tie breaker?
> 
> I am aggressively undecided.  I guess as it's a bit of a 51/49
> situation, the "stay with what people are familiar with" benefit tips the
> balance toward the legacy names?
> 

I still can't decide. It's currently still a draw in terms of naming. If
you're worried, use the old naming. It wouldn't be the first time I
thought a name was odd.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries
  2016-07-15  7:48       ` Mel Gorman
@ 2016-07-15 12:20         ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-15 12:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner,
	Minchan Kim, Joonsoo Kim, LKML

On 07/15/2016 09:48 AM, Mel Gorman wrote:
> On Thu, Jul 14, 2016 at 03:40:11PM +0200, Vlastimil Babka wrote:
>>> @@ -4,6 +4,26 @@
>>> #include <linux/huge_mm.h>
>>> #include <linux/swap.h>
>>>
>>> +#ifdef CONFIG_HIGHMEM
>>> +extern atomic_t highmem_file_pages;
>>> +
>>> +static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
>>> +							int nr_pages)
>>> +{
>>> +	if (is_highmem_idx(zid) && is_file_lru(lru)) {
>>> +		if (nr_pages > 0)
>>
>> This seems like a unnecessary branch, atomic_add should handle negative
>> nr_pages just fine?
>>
> 
> On x86 it would but the interface makes no guarantees it'll handle
> signed types properly on all architectures.

Hmm really? At least some drivers do that in an easily grepable way:

drivers/tty/serial/dz.c:                        atomic_add(-1, &mux->map_guard);
drivers/tty/serial/sb1250-duart.c:                      atomic_add(-1, &duart->map_guard);
drivers/tty/serial/zs.c:                        atomic_add(-1, &scc->irq_guard);

And our own __mod_zone_page_state() can get both negative and positive
vales and boils down to atomic_long_add() (I assume the long variant wouldn't
be different in this aspect).

>>> @@ -1456,14 +1461,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>>> 		unsigned long available;
>>> 		enum compact_result compact_result;
>>>
>>> +		if (last_pgdat == zone->zone_pgdat)
>>> +			continue;
>>> +
>>> +		/*
>>> +		 * This over-estimates the number of pages available for
>>> +		 * reclaim/compaction but walking the LRU would take too
>>> +		 * long. The consequences are that compaction may retry
>>> +		 * longer than it should for a zone-constrained allocation
>>> +		 * request.
>>
>> The comment above says that we don't retry zone-constrained at all. Is this
>> an obsolete comment, or does it refer to the ZONE_NORMAL constraint? (as
>> opposed to HIGHMEM, MOVABLE etc?).
>>
> 
> It can still over-estimate the amount of memory available if
> ZONE_MOVABLE exists even if the request is not zone-constrained.

OK.

>>> @@ -3454,6 +3455,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>>> 		return false;
>>>
>>> 	/*
>>> +	 * Blindly retry lowmem allocation requests that are often ignored by
>>> +	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
>>> +	 * and fast means of calculating reclaimable, dirty and writeback pages
>>> +	 * in eligible zones.
>>> +	 */
>>> +	if (ac->high_zoneidx < ZONE_NORMAL)
>>> +		goto out;
>>
>> A goto inside two nested for cycles? Is there no hope for sanity? :(
>>
> 
> None, hand it in at the door.

Mine's long gone, was thinking for the future newbies :)
 
> It can be pulled out and put past the "return false" at the end. It's
> just not necessarily any better.

I see...

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries
@ 2016-07-15 12:20         ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-07-15 12:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner,
	Minchan Kim, Joonsoo Kim, LKML

On 07/15/2016 09:48 AM, Mel Gorman wrote:
> On Thu, Jul 14, 2016 at 03:40:11PM +0200, Vlastimil Babka wrote:
>>> @@ -4,6 +4,26 @@
>>> #include <linux/huge_mm.h>
>>> #include <linux/swap.h>
>>>
>>> +#ifdef CONFIG_HIGHMEM
>>> +extern atomic_t highmem_file_pages;
>>> +
>>> +static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
>>> +							int nr_pages)
>>> +{
>>> +	if (is_highmem_idx(zid) && is_file_lru(lru)) {
>>> +		if (nr_pages > 0)
>>
>> This seems like a unnecessary branch, atomic_add should handle negative
>> nr_pages just fine?
>>
> 
> On x86 it would but the interface makes no guarantees it'll handle
> signed types properly on all architectures.

Hmm really? At least some drivers do that in an easily grepable way:

drivers/tty/serial/dz.c:                        atomic_add(-1, &mux->map_guard);
drivers/tty/serial/sb1250-duart.c:                      atomic_add(-1, &duart->map_guard);
drivers/tty/serial/zs.c:                        atomic_add(-1, &scc->irq_guard);

And our own __mod_zone_page_state() can get both negative and positive
vales and boils down to atomic_long_add() (I assume the long variant wouldn't
be different in this aspect).

>>> @@ -1456,14 +1461,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>>> 		unsigned long available;
>>> 		enum compact_result compact_result;
>>>
>>> +		if (last_pgdat == zone->zone_pgdat)
>>> +			continue;
>>> +
>>> +		/*
>>> +		 * This over-estimates the number of pages available for
>>> +		 * reclaim/compaction but walking the LRU would take too
>>> +		 * long. The consequences are that compaction may retry
>>> +		 * longer than it should for a zone-constrained allocation
>>> +		 * request.
>>
>> The comment above says that we don't retry zone-constrained at all. Is this
>> an obsolete comment, or does it refer to the ZONE_NORMAL constraint? (as
>> opposed to HIGHMEM, MOVABLE etc?).
>>
> 
> It can still over-estimate the amount of memory available if
> ZONE_MOVABLE exists even if the request is not zone-constrained.

OK.

>>> @@ -3454,6 +3455,15 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>>> 		return false;
>>>
>>> 	/*
>>> +	 * Blindly retry lowmem allocation requests that are often ignored by
>>> +	 * the OOM killer up to MAX_RECLAIM_RETRIES as we not have a reliable
>>> +	 * and fast means of calculating reclaimable, dirty and writeback pages
>>> +	 * in eligible zones.
>>> +	 */
>>> +	if (ac->high_zoneidx < ZONE_NORMAL)
>>> +		goto out;
>>
>> A goto inside two nested for cycles? Is there no hope for sanity? :(
>>
> 
> None, hand it in at the door.

Mine's long gone, was thinking for the future newbies :)
 
> It can be pulled out and put past the "return false" at the end. It's
> just not necessarily any better.

I see...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-15 10:46               ` Mel Gorman
@ 2016-07-15 22:35                 ` Andrew Morton
  -1 siblings, 0 replies; 220+ messages in thread
From: Andrew Morton @ 2016-07-15 22:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, 15 Jul 2016 11:46:05 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Jul 13, 2016 at 02:13:43PM -0700, Andrew Morton wrote:
> > On Wed, 13 Jul 2016 14:37:01 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> > 
> > > > I don't care strongly enough to cause a respin of half the series, and
> > > > it's not your problem that I waited until the last revision went into
> > > > mmots to review and comment. But if you agreed to a revert, would you
> > > > consider tacking on a revert patch at the end of the series?
> > > > 
> > > 
> > > In this case, I'm going to ask the other people on the cc for a
> > > tie-breaker. If someone else prefers the old names then I'm happy for
> > > your patch to be applied on top with my ack instead of respinning the
> > > whole series.
> > > 
> > > Anyone for a tie breaker?
> > 
> > I am aggressively undecided.  I guess as it's a bit of a 51/49
> > situation, the "stay with what people are familiar with" benefit tips the
> > balance toward the legacy names?
> > 
> 
> I still can't decide. It's currently still a draw in terms of naming. If
> you're worried, use the old naming. It wouldn't be the first time I
> thought a name was odd.

Well I dunno.  We can leave the series as-is for now and we can merge
the rename-it-back patch sometime during the next -rc cycle if we find
that people are running around in confusion and tumbling out of high
windows.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-15 22:35                 ` Andrew Morton
  0 siblings, 0 replies; 220+ messages in thread
From: Andrew Morton @ 2016-07-15 22:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Linux-MM, Rik van Riel, Vlastimil Babka,
	Minchan Kim, Joonsoo Kim, LKML

On Fri, 15 Jul 2016 11:46:05 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Jul 13, 2016 at 02:13:43PM -0700, Andrew Morton wrote:
> > On Wed, 13 Jul 2016 14:37:01 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> > 
> > > > I don't care strongly enough to cause a respin of half the series, and
> > > > it's not your problem that I waited until the last revision went into
> > > > mmots to review and comment. But if you agreed to a revert, would you
> > > > consider tacking on a revert patch at the end of the series?
> > > > 
> > > 
> > > In this case, I'm going to ask the other people on the cc for a
> > > tie-breaker. If someone else prefers the old names then I'm happy for
> > > your patch to be applied on top with my ack instead of respinning the
> > > whole series.
> > > 
> > > Anyone for a tie breaker?
> > 
> > I am aggressively undecided.  I guess as it's a bit of a 51/49
> > situation, the "stay with what people are familiar with" benefit tips the
> > balance toward the legacy names?
> > 
> 
> I still can't decide. It's currently still a draw in terms of naming. If
> you're worried, use the old naming. It wouldn't be the first time I
> thought a name was odd.

Well I dunno.  We can leave the series as-is for now and we can merge
the rename-it-back patch sometime during the next -rc cycle if we find
that people are running around in confusion and tumbling out of high
windows.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  2016-07-15 22:35                 ` Andrew Morton
@ 2016-07-18 13:34                   ` Johannes Weiner
  -1 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-18 13:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Linux-MM, Rik van Riel, Vlastimil Babka, Minchan Kim,
	Joonsoo Kim, LKML

On Fri, Jul 15, 2016 at 03:35:54PM -0700, Andrew Morton wrote:
> Well I dunno.  We can leave the series as-is for now and we can merge
> the rename-it-back patch sometime during the next -rc cycle if we find
> that people are running around in confusion and tumbling out of high
> windows.

Sounds good to me.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
@ 2016-07-18 13:34                   ` Johannes Weiner
  0 siblings, 0 replies; 220+ messages in thread
From: Johannes Weiner @ 2016-07-18 13:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Linux-MM, Rik van Riel, Vlastimil Babka, Minchan Kim,
	Joonsoo Kim, LKML

On Fri, Jul 15, 2016 at 03:35:54PM -0700, Andrew Morton wrote:
> Well I dunno.  We can leave the series as-is for now and we can merge
> the rename-it-back patch sometime during the next -rc cycle if we find
> that people are running around in confusion and tumbling out of high
> windows.

Sounds good to me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 01/34] mm, vmstat: add infrastructure for per-node vmstats
  2016-07-08  9:34   ` Mel Gorman
@ 2016-08-03 19:13     ` Reza Arbab
  -1 siblings, 0 replies; 220+ messages in thread
From: Reza Arbab @ 2016-08-03 19:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:37AM +0100, Mel Gorman wrote:
> void refresh_zone_stat_thresholds(void)
> {
[...]
>+	/* Zero current pgdat thresholds */
>+	for_each_online_pgdat(pgdat) {
>+		for_each_online_cpu(cpu) {
>+			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold = 0;
>+		}
>+	}

I am oopsing here, for a node whose pgdat->per_cpu_nodestats is NULL.

The node in question is memoryless, so in setup_per_cpu_pageset(), the 
loop over its populated zones doesn't run, and the per_cpu_nodestat 
struct isn't allocated.

This patch fixes things for me:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ea759b9..5221e17 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5271,9 +5271,17 @@ static void __meminit setup_zone_pageset(struct zone *zone)
  void __init setup_per_cpu_pageset(void)
  {
  	struct zone *zone;
+	struct pglist_data *pgdat;
  
  	for_each_populated_zone(zone)
  		setup_zone_pageset(zone);
+
+	for_each_online_pgdat(pgdat) {
+		if (!pgdat->per_cpu_nodestats) {
+			pgdat->per_cpu_nodestats =
+				alloc_percpu(struct per_cpu_nodestat);
+		}
+	}
  }
  
  static noinline __init_refok


-- 
Reza Arbab

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 01/34] mm, vmstat: add infrastructure for per-node vmstats
@ 2016-08-03 19:13     ` Reza Arbab
  0 siblings, 0 replies; 220+ messages in thread
From: Reza Arbab @ 2016-08-03 19:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Jul 08, 2016 at 10:34:37AM +0100, Mel Gorman wrote:
> void refresh_zone_stat_thresholds(void)
> {
[...]
>+	/* Zero current pgdat thresholds */
>+	for_each_online_pgdat(pgdat) {
>+		for_each_online_cpu(cpu) {
>+			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold = 0;
>+		}
>+	}

I am oopsing here, for a node whose pgdat->per_cpu_nodestats is NULL.

The node in question is memoryless, so in setup_per_cpu_pageset(), the 
loop over its populated zones doesn't run, and the per_cpu_nodestat 
struct isn't allocated.

This patch fixes things for me:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ea759b9..5221e17 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5271,9 +5271,17 @@ static void __meminit setup_zone_pageset(struct zone *zone)
  void __init setup_per_cpu_pageset(void)
  {
  	struct zone *zone;
+	struct pglist_data *pgdat;
  
  	for_each_populated_zone(zone)
  		setup_zone_pageset(zone);
+
+	for_each_online_pgdat(pgdat) {
+		if (!pgdat->per_cpu_nodestats) {
+			pgdat->per_cpu_nodestats =
+				alloc_percpu(struct per_cpu_nodestat);
+		}
+	}
  }
  
  static noinline __init_refok


-- 
Reza Arbab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
  2016-07-08  9:34   ` Mel Gorman
  (?)
@ 2016-08-04 20:59     ` James Hogan
  -1 siblings, 0 replies; 220+ messages in thread
From: James Hogan @ 2016-08-04 20:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

On 8 July 2016 at 10:34, Mel Gorman <mgorman@techsingularity.net> wrote:
> This moves the LRU lists from the zone to the node and related data
> such as counters, tracing, congestion tracking and writeback tracking.
> Unfortunately, due to reclaim and compaction retry logic, it is necessary
> to account for the number of LRU pages on both zone and node logic.
> Most reclaim logic is based on the node counters but the retry logic uses
> the zone counters which do not distinguish inactive and active sizes.
> It would be possible to leave the LRU counters on a per-zone basis but
> it's a heavier calculation across multiple cache lines that is much more
> frequent than the retry checks.
>
> Other than the LRU counters, this is mostly a mechanical patch but note
> that it introduces a number of anomalies.  For example, the scans are
> per-zone but using per-node counters.  We also mark a node as congested
> when a zone is congested.  This causes weird problems that are fixed later
> but is easier to review.
>
> In the event that there is excessive overhead on 32-bit systems due to
> the nodes being on LRU then there are two potential solutions
>
> 1. Long-term isolation of highmem pages when reclaim is lowmem
>
>    When pages are skipped, they are immediately added back onto the LRU
>    list. If lowmem reclaim persisted for long periods of time, the same
>    highmem pages get continually scanned. The idea would be that lowmem
>    keeps those pages on a separate list until a reclaim for highmem pages
>    arrives that splices the highmem pages back onto the LRU. It potentially
>    could be implemented similar to the UNEVICTABLE list.
>
>    That would reduce the skip rate with the potential corner case is that
>    highmem pages have to be scanned and reclaimed to free lowmem slab pages.
>
> 2. Linear scan lowmem pages if the initial LRU shrink fails
>
>    This will break LRU ordering but may be preferable and faster during
>    memory pressure than skipping LRU pages.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

This breaks boot on metag architecture:
Oops: err 0007 (Data access general read/write fault) addr 00233008 [#1]

It appears to be in node_page_state_snapshot() (via
pgdat_reclaimable()), and have come via mm_init. Here's the relevant
bit of the backtrace:

    node_page_state_snapshot@0x4009c884(enum node_stat_item item =
???, struct pglist_data * pgdat = ???) + 0x48
    pgdat_reclaimable(struct pglist_data * pgdat = 0x402517a0)
    show_free_areas(unsigned int filter = 0) + 0x2cc
    show_mem(unsigned int filter = 0) + 0x18
    mm_init@0x4025c3d4()
    start_kernel() + 0x204

__per_cpu_offset[0] == 0x233000 (close to bad addr),
pgdat->per_cpu_nodestats = NULL. and setup_per_cpu_pageset()
definitely hasn't been called yet (mm_init is called before
setup_per_cpu_pageset()).

Any ideas what the correct solution is (and why presumably others
haven't seen the same issue on other architectures?).

Thanks
James

> ---
>  arch/tile/mm/pgtable.c                    |   8 +-
>  drivers/base/node.c                       |  19 +--
>  drivers/staging/android/lowmemorykiller.c |   8 +-
>  include/linux/backing-dev.h               |   2 +-
>  include/linux/memcontrol.h                |  18 +--
>  include/linux/mm_inline.h                 |  21 ++-
>  include/linux/mmzone.h                    |  68 +++++----
>  include/linux/swap.h                      |   1 +
>  include/linux/vm_event_item.h             |  10 +-
>  include/linux/vmstat.h                    |  17 +++
>  include/trace/events/vmscan.h             |  12 +-
>  kernel/power/snapshot.c                   |  10 +-
>  mm/backing-dev.c                          |  15 +-
>  mm/compaction.c                           |  18 +--
>  mm/huge_memory.c                          |   2 +-
>  mm/internal.h                             |   2 +-
>  mm/khugepaged.c                           |   4 +-
>  mm/memcontrol.c                           |  17 +--
>  mm/memory-failure.c                       |   4 +-
>  mm/memory_hotplug.c                       |   2 +-
>  mm/mempolicy.c                            |   2 +-
>  mm/migrate.c                              |  21 +--
>  mm/mlock.c                                |   2 +-
>  mm/page-writeback.c                       |   8 +-
>  mm/page_alloc.c                           |  68 +++++----
>  mm/swap.c                                 |  50 +++----
>  mm/vmscan.c                               | 226 +++++++++++++++++-------------
>  mm/vmstat.c                               |  47 ++++---
>  mm/workingset.c                           |   4 +-
>  29 files changed, 386 insertions(+), 300 deletions(-)
>
> diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
> index c4d5bf841a7f..9e389213580d 100644
> --- a/arch/tile/mm/pgtable.c
> +++ b/arch/tile/mm/pgtable.c
> @@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
>         struct zone *zone;
>
>         pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
> -              (global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE)),
> -              (global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE)),
> +              (global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE)),
> +              (global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE)),
>                global_page_state(NR_FILE_DIRTY),
>                global_page_state(NR_WRITEBACK),
>                global_page_state(NR_UNSTABLE_NFS),
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 92d8e090c5b3..b7f01a4a642d 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  {
>         int n;
>         int nid = dev->id;
> +       struct pglist_data *pgdat = NODE_DATA(nid);
>         struct sysinfo i;
>
>         si_meminfo_node(&i, nid);
> @@ -74,15 +75,15 @@ static ssize_t node_read_meminfo(struct device *dev,
>                        nid, K(i.totalram),
>                        nid, K(i.freeram),
>                        nid, K(i.totalram - i.freeram),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
>                        nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
>
>  #ifdef CONFIG_HIGHMEM
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> index 24d2745e9437..93dbcc38eb0f 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -72,10 +72,10 @@ static unsigned long lowmem_deathpending_timeout;
>  static unsigned long lowmem_count(struct shrinker *s,
>                                   struct shrink_control *sc)
>  {
> -       return global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE) +
> -               global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE);
> +       return global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE) +
> +               global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE);
>  }
>
>  static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c82794f20110..491a91717788 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -197,7 +197,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
>  }
>
>  long congestion_wait(int sync, long timeout);
> -long wait_iff_congested(struct zone *zone, int sync, long timeout);
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
>  int pdflush_proc_obsolete(struct ctl_table *table, int write,
>                 void __user *buffer, size_t *lenp, loff_t *ppos);
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 104efa6874db..68f1121c8fe7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -340,7 +340,7 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = zone_lruvec(zone);
>                 goto out;
>         }
>
> @@ -349,15 +349,15 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>  out:
>         /*
>          * Since a node can be onlined after the mem_cgroup was created,
> -        * we have to be prepared to initialize lruvec->zone here;
> +        * we have to be prepared to initialize lruvec->pgdat here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != zone->zone_pgdat))
> +               lruvec->pgdat = zone->zone_pgdat;
>         return lruvec;
>  }
>
> -struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> +struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
>  bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> @@ -438,7 +438,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
>  int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
>
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -               int nr_pages);
> +               enum zone_type zid, int nr_pages);
>
>  unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>                                            int nid, unsigned int lru_mask);
> @@ -613,13 +613,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
>  static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>                                                     struct mem_cgroup *memcg)
>  {
> -       return &zone->lruvec;
> +       return zone_lruvec(zone);
>  }
>
>  static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> -                                                   struct zone *zone)
> +                                                   struct pglist_data *pgdat)
>  {
> -       return &zone->lruvec;
> +       return &pgdat->lruvec;
>  }
>
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 5bd29ba4f174..9aadcc781857 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -23,25 +23,32 @@ static inline int page_is_file_cache(struct page *page)
>  }
>
>  static __always_inline void __update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
> -       __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +       __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
> +       __mod_zone_page_state(&pgdat->node_zones[zid],
> +               NR_ZONE_LRU_BASE + !!is_file_lru(lru),
> +               nr_pages);
>  }
>
>  static __always_inline void update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
>  #ifdef CONFIG_MEMCG
> -       mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
> +       mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
>  #else
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>  #endif
>  }
>
>  static __always_inline void add_page_to_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
> -       update_lru_size(lruvec, lru, hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
>         list_add(&page->lru, &lruvec->lists[lru]);
>  }
>
> @@ -49,7 +56,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
>         list_del(&page->lru);
> -       update_lru_size(lruvec, lru, -hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
>  }
>
>  /**
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cfa870107abe..d4f5cac0a8c3 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -111,12 +111,9 @@ enum zone_stat_item {
>         /* First 128 byte cacheline (assuming 64 bit words) */
>         NR_FREE_PAGES,
>         NR_ALLOC_BATCH,
> -       NR_LRU_BASE,
> -       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> -       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> -       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> -       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> -       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
> +       NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
> +       NR_ZONE_LRU_FILE,
>         NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
>         NR_ANON_PAGES,  /* Mapped anonymous pages */
>         NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> @@ -134,12 +131,9 @@ enum zone_stat_item {
>         NR_VMSCAN_WRITE,
>         NR_VMSCAN_IMMEDIATE,    /* Prioritise for reclaim when writeback ends */
>         NR_WRITEBACK_TEMP,      /* Writeback using temporary buffers */
> -       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> -       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
>         NR_SHMEM,               /* shmem pages (included tmpfs/GEM pages) */
>         NR_DIRTIED,             /* page dirtyings since bootup */
>         NR_WRITTEN,             /* page writings since bootup */
> -       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         NR_ZSPAGES,             /* allocated in zsmalloc */
>  #endif
> @@ -161,6 +155,15 @@ enum zone_stat_item {
>         NR_VM_ZONE_STAT_ITEMS };
>
>  enum node_stat_item {
> +       NR_LRU_BASE,
> +       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> +       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> +       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> +       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> +       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> +       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
> +       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>         NR_VM_NODE_STAT_ITEMS
>  };
>
> @@ -219,7 +222,7 @@ struct lruvec {
>         /* Evictions & activations on the inactive file list */
>         atomic_long_t                   inactive_age;
>  #ifdef CONFIG_MEMCG
> -       struct zone                     *zone;
> +       struct pglist_data *pgdat;
>  #endif
>  };
>
> @@ -357,13 +360,6 @@ struct zone {
>  #ifdef CONFIG_NUMA
>         int node;
>  #endif
> -
> -       /*
> -        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> -        * this zone's LRU.  Maintained by the pageout code.
> -        */
> -       unsigned int inactive_ratio;
> -
>         struct pglist_data      *zone_pgdat;
>         struct per_cpu_pageset __percpu *pageset;
>
> @@ -495,9 +491,6 @@ struct zone {
>
>         /* Write-intensive fields used by page reclaim */
>
> -       /* Fields commonly accessed by the page reclaim scanner */
> -       struct lruvec           lruvec;
> -
>         /*
>          * When free pages are below this point, additional steps are taken
>          * when reading the number of free pages to avoid per-cpu counter
> @@ -537,17 +530,20 @@ struct zone {
>
>  enum zone_flags {
>         ZONE_RECLAIM_LOCKED,            /* prevents concurrent reclaim */
> -       ZONE_CONGESTED,                 /* zone has many dirty pages backed by
> +       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
> +};
> +
> +enum pgdat_flags {
> +       PGDAT_CONGESTED,                /* pgdat has many dirty pages backed by
>                                          * a congested BDI
>                                          */
> -       ZONE_DIRTY,                     /* reclaim scanning has recently found
> +       PGDAT_DIRTY,                    /* reclaim scanning has recently found
>                                          * many dirty file pages at the tail
>                                          * of the LRU.
>                                          */
> -       ZONE_WRITEBACK,                 /* reclaim scanning has recently found
> +       PGDAT_WRITEBACK,                /* reclaim scanning has recently found
>                                          * many pages under writeback
>                                          */
> -       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
>  };
>
>  static inline unsigned long zone_end_pfn(const struct zone *zone)
> @@ -707,6 +703,19 @@ typedef struct pglist_data {
>         unsigned long split_queue_len;
>  #endif
>
> +       /* Fields commonly accessed by the page reclaim scanner */
> +       struct lruvec           lruvec;
> +
> +       /*
> +        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> +        * this node's LRU.  Maintained by the pageout code.
> +        */
> +       unsigned int inactive_ratio;
> +
> +       unsigned long           flags;
> +
> +       ZONE_PADDING(_pad2_)
> +
>         /* Per-node vmstats */
>         struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>         atomic_long_t           vm_stat[NR_VM_NODE_STAT_ITEMS];
> @@ -728,6 +737,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
>         return &zone->zone_pgdat->lru_lock;
>  }
>
> +static inline struct lruvec *zone_lruvec(struct zone *zone)
> +{
> +       return &zone->zone_pgdat->lruvec;
> +}
> +
>  static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
>  {
>         return pgdat->node_start_pfn + pgdat->node_spanned_pages;
> @@ -779,12 +793,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
>
>  extern void lruvec_init(struct lruvec *lruvec);
>
> -static inline struct zone *lruvec_zone(struct lruvec *lruvec)
> +static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
>  {
>  #ifdef CONFIG_MEMCG
> -       return lruvec->zone;
> +       return lruvec->pgdat;
>  #else
> -       return container_of(lruvec, struct zone, lruvec);
> +       return container_of(lruvec, struct pglist_data, lruvec);
>  #endif
>  }
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0af2bb2028fd..c82f916008b7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>
>  /* linux/mm/vmscan.c */
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> +extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 42604173f122..1798ff542517 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -26,11 +26,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 PGFREE, PGACTIVATE, PGDEACTIVATE,
>                 PGFAULT, PGMAJFAULT,
>                 PGLAZYFREED,
> -               FOR_ALL_ZONES(PGREFILL),
> -               FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> -               FOR_ALL_ZONES(PGSTEAL_DIRECT),
> -               FOR_ALL_ZONES(PGSCAN_KSWAPD),
> -               FOR_ALL_ZONES(PGSCAN_DIRECT),
> +               PGREFILL,
> +               PGSTEAL_KSWAPD,
> +               PGSTEAL_DIRECT,
> +               PGSCAN_KSWAPD,
> +               PGSCAN_DIRECT,
>                 PGSCAN_DIRECT_THROTTLE,
>  #ifdef CONFIG_NUMA
>                 PGSCAN_ZONE_RECLAIM_FAILED,
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index d1744aa3ab9c..fee321c98550 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -178,6 +178,23 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>         return x;
>  }
>
> +static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
> +                                       enum node_stat_item item)
> +{
> +       long x = atomic_long_read(&pgdat->vm_stat[item]);
> +
> +#ifdef CONFIG_SMP
> +       int cpu;
> +       for_each_online_cpu(cpu)
> +               x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item];
> +
> +       if (x < 0)
> +               x = 0;
> +#endif
> +       return x;
> +}
> +
> +
>  #ifdef CONFIG_NUMA
>  extern unsigned long sum_zone_node_page_state(int node,
>                                                 enum zone_stat_item item);
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 0101ef37f1ee..897f1aa1ee5f 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -352,15 +352,14 @@ TRACE_EVENT(mm_vmscan_writepage,
>
>  TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>
> -       TP_PROTO(struct zone *zone,
> +       TP_PROTO(int nid,
>                 unsigned long nr_scanned, unsigned long nr_reclaimed,
>                 int priority, int file),
>
> -       TP_ARGS(zone, nr_scanned, nr_reclaimed, priority, file),
> +       TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
>
>         TP_STRUCT__entry(
>                 __field(int, nid)
> -               __field(int, zid)
>                 __field(unsigned long, nr_scanned)
>                 __field(unsigned long, nr_reclaimed)
>                 __field(int, priority)
> @@ -368,16 +367,15 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>         ),
>
>         TP_fast_assign(
> -               __entry->nid = zone_to_nid(zone);
> -               __entry->zid = zone_idx(zone);
> +               __entry->nid = nid;
>                 __entry->nr_scanned = nr_scanned;
>                 __entry->nr_reclaimed = nr_reclaimed;
>                 __entry->priority = priority;
>                 __entry->reclaim_flags = trace_shrink_flags(file);
>         ),
>
> -       TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> -               __entry->nid, __entry->zid,
> +       TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> +               __entry->nid,
>                 __entry->nr_scanned, __entry->nr_reclaimed,
>                 __entry->priority,
>                 show_reclaim_flags(__entry->reclaim_flags))
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 3a970604308f..24a06bc23f85 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1525,11 +1525,11 @@ static unsigned long minimum_image_size(unsigned long saveable)
>         unsigned long size;
>
>         size = global_page_state(NR_SLAB_RECLAIMABLE)
> -               + global_page_state(NR_ACTIVE_ANON)
> -               + global_page_state(NR_INACTIVE_ANON)
> -               + global_page_state(NR_ACTIVE_FILE)
> -               + global_page_state(NR_INACTIVE_FILE)
> -               - global_page_state(NR_FILE_MAPPED);
> +               + global_node_page_state(NR_ACTIVE_ANON)
> +               + global_node_page_state(NR_INACTIVE_ANON)
> +               + global_node_page_state(NR_ACTIVE_FILE)
> +               + global_node_page_state(NR_INACTIVE_FILE)
> +               - global_node_page_state(NR_FILE_MAPPED);
>
>         return saveable <= size ? 0 : saveable - size;
>  }
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index f53b23ab7ed7..a8c3af46bd3d 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -982,24 +982,24 @@ long congestion_wait(int sync, long timeout)
>  EXPORT_SYMBOL(congestion_wait);
>
>  /**
> - * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
> - * @zone: A zone to check if it is heavily congested
> + * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
> + * @pgdat: A pgdat to check if it is heavily congested
>   * @sync: SYNC or ASYNC IO
>   * @timeout: timeout in jiffies
>   *
>   * In the event of a congested backing_dev (any backing_dev) and the given
> - * @zone has experienced recent congestion, this waits for up to @timeout
> + * @pgdat has experienced recent congestion, this waits for up to @timeout
>   * jiffies for either a BDI to exit congestion of the given @sync queue
>   * or a write to complete.
>   *
> - * In the absence of zone congestion, cond_resched() is called to yield
> + * In the absence of pgdat congestion, cond_resched() is called to yield
>   * the processor if necessary but otherwise does not sleep.
>   *
>   * The return value is 0 if the sleep is for the full timeout. Otherwise,
>   * it is the number of jiffies that were still remaining when the function
>   * returned. return_value == timeout implies the function did not sleep.
>   */
> -long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
>  {
>         long ret;
>         unsigned long start = jiffies;
> @@ -1008,12 +1008,13 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>
>         /*
>          * If there is no congestion, or heavy congestion is not being
> -        * encountered in the current zone, yield if necessary instead
> +        * encountered in the current pgdat, yield if necessary instead
>          * of sleeping on the congestion queue
>          */
>         if (atomic_read(&nr_wb_congested[sync]) == 0 ||
> -           !test_bit(ZONE_CONGESTED, &zone->flags)) {
> +           !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
>                 cond_resched();
> +
>                 /* In case we scheduled, work out time remaining */
>                 ret = timeout - (jiffies - start);
>                 if (ret < 0)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7607efb7bee2..a0bd85712516 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -646,8 +646,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
>         list_for_each_entry(page, &cc->migratepages, lru)
>                 count[!!page_is_file_cache(page)]++;
>
> -       mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON, count[0]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, count[1]);
>  }
>
>  /* Similar to reclaim, but different enough that they don't share logic */
> @@ -655,12 +655,12 @@ static bool too_many_isolated(struct zone *zone)
>  {
>         unsigned long active, inactive, isolated;
>
> -       inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_INACTIVE_ANON);
> -       active = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_ACTIVE_ANON);
> -       isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
> -                                       zone_page_state(zone, NR_ISOLATED_ANON);
> +       inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
> +       active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
> +       isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
>
>         return isolated > (inactive + active) / 2;
>  }
> @@ -856,7 +856,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>                         }
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 /* Try isolate the page */
>                 if (__isolate_lru_page(page, isolate_mode) != 0)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2f997328ae64..5d5b2207cfd2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1830,7 +1830,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>         pgoff_t end = -1;
>         int i;
>
> -       lruvec = mem_cgroup_page_lruvec(head, zone);
> +       lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
>
>         /* complete memcg works before add pages to LRU */
>         mem_cgroup_split_huge_fixup(head);
> diff --git a/mm/internal.h b/mm/internal.h
> index 9b6a6c43ac39..2f80d0343c56 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -78,7 +78,7 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> -extern bool zone_reclaimable(struct zone *zone);
> +extern bool pgdat_reclaimable(struct pglist_data *pgdat);
>
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 93d5f87c00d5..d7a49f665f04 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -480,7 +480,7 @@ void __khugepaged_exit(struct mm_struct *mm)
>  static void release_pte_page(struct page *page)
>  {
>         /* 0 stands for page_is_file_cache(page) == false */
> -       dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +       dec_node_page_state(page, NR_ISOLATED_ANON + 0);
>         unlock_page(page);
>         putback_lru_page(page);
>  }
> @@ -576,7 +576,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         goto out;
>                 }
>                 /* 0 stands for page_is_file_cache(page) == false */
> -               inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +               inc_node_page_state(page, NR_ISOLATED_ANON + 0);
>                 VM_BUG_ON_PAGE(!PageLocked(page), page);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9b70f9ca8ddf..50c86ad121bc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -943,14 +943,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>   * and putback protocol: the LRU lock must be held, and the page must
>   * either be PageLRU() or the caller must have isolated/allocated it.
>   */
> -struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> +struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
>  {
>         struct mem_cgroup_per_zone *mz;
>         struct mem_cgroup *memcg;
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = &pgdat->lruvec;
>                 goto out;
>         }
>
> @@ -970,8 +970,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>          * we have to be prepared to initialize lruvec->zone here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != pgdat))
> +               lruvec->pgdat = pgdat;
>         return lruvec;
>  }
>
> @@ -979,6 +979,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * mem_cgroup_update_lru_size - account for adding or removing an lru page
>   * @lruvec: mem_cgroup per zone lru vector
>   * @lru: index of lru list the page is sitting on
> + * @zid: Zone ID of the zone pages have been added to
>   * @nr_pages: positive when adding or negative when removing
>   *
>   * This function must be called under lru_lock, just before a page is added
> @@ -986,14 +987,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * so as to allow it to check that lru_size 0 is consistent with list_empty).
>   */
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -                               int nr_pages)
> +                               enum zone_type zid, int nr_pages)
>  {
>         struct mem_cgroup_per_zone *mz;
>         unsigned long *lru_size;
>         long size;
>         bool empty;
>
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>
>         if (mem_cgroup_disabled())
>                 return;
> @@ -2069,7 +2070,7 @@ static void lock_page_lru(struct page *page, int *isolated)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
>                 *isolated = 1;
> @@ -2084,7 +2085,7 @@ static void unlock_page_lru(struct page *page, int isolated)
>         if (isolated) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>                 add_page_to_lru_list(page, lruvec, page_lru(page));
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2fcca6b0e005..11de752ccaf5 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1663,7 +1663,7 @@ static int __soft_offline_page(struct page *page, int flags)
>         put_hwpoison_page(page);
>         if (!ret) {
>                 LIST_HEAD(pagelist);
> -               inc_zone_page_state(page, NR_ISOLATED_ANON +
> +               inc_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                 list_add(&page->lru, &pagelist);
>                 ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> @@ -1671,7 +1671,7 @@ static int __soft_offline_page(struct page *page, int flags)
>                 if (ret) {
>                         if (!list_empty(&pagelist)) {
>                                 list_del(&page->lru);
> -                               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                                 page_is_file_cache(page));
>                                 putback_lru_page(page);
>                         }
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 82d0b98d27f8..c5278360ca66 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1586,7 +1586,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>                         put_page(page);
>                         list_add_tail(&page->lru, &source);
>                         move_pages--;
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>
>                 } else {
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 53e40d3f3933..d8c4e38fb5f4 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -962,7 +962,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
>         if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
>                 if (!isolate_lru_page(page)) {
>                         list_add_tail(&page->lru, pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>         }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2232f6923cc7..3033dae33a0a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -168,7 +168,7 @@ void putback_movable_pages(struct list_head *l)
>                         continue;
>                 }
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>                 /*
>                  * We isolated non-lru movable page so here we can use
> @@ -1119,7 +1119,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>                  * restored.
>                  */
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>         }
>
> @@ -1460,7 +1460,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>                 err = isolate_lru_page(page);
>                 if (!err) {
>                         list_add_tail(&page->lru, &pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>  put_and_set:
> @@ -1726,15 +1726,16 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>                                    unsigned long nr_migrate_pages)
>  {
>         int z;
> +
> +       if (!pgdat_reclaimable(pgdat))
> +               return false;
> +
>         for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>                 struct zone *zone = pgdat->node_zones + z;
>
>                 if (!populated_zone(zone))
>                         continue;
>
> -               if (!zone_reclaimable(zone))
> -                       continue;
> -
>                 /* Avoid waking kswapd by allocating pages_to_migrate pages. */
>                 if (!zone_watermark_ok(zone, 0,
>                                        high_wmark_pages(zone) +
> @@ -1828,7 +1829,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>         }
>
>         page_lru = page_is_file_cache(page);
> -       mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
> +       mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
>                                 hpage_nr_pages(page));
>
>         /*
> @@ -1886,7 +1887,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>         if (nr_remaining) {
>                 if (!list_empty(&migratepages)) {
>                         list_del(&page->lru);
> -                       dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                       dec_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                         putback_lru_page(page);
>                 }
> @@ -1979,7 +1980,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>                 /* Retake the callers reference and putback on LRU */
>                 get_page(page);
>                 putback_lru_page(page);
> -               mod_zone_page_state(page_zone(page),
> +               mod_node_page_state(page_pgdat(page),
>                          NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
>
>                 goto out_unlock;
> @@ -2030,7 +2031,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>         count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
>         count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
>
> -       mod_zone_page_state(page_zone(page),
> +       mod_node_page_state(page_pgdat(page),
>                         NR_ISOLATED_ANON + page_lru,
>                         -HPAGE_PMD_NR);
>         return isolated;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 997f63082ff5..14645be06e30 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -103,7 +103,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 if (getpage)
>                         get_page(page);
>                 ClearPageLRU(page);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d578d2a56b19..0ada2b2954b0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -285,8 +285,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
>          */
>         nr_pages -= min(nr_pages, zone->totalreserve_pages);
>
> -       nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
> -       nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         return nr_pages;
>  }
> @@ -348,8 +348,8 @@ static unsigned long global_dirtyable_memory(void)
>          */
>         x -= min(x, totalreserve_pages);
>
> -       x += global_page_state(NR_INACTIVE_FILE);
> -       x += global_page_state(NR_ACTIVE_FILE);
> +       x += global_node_page_state(NR_INACTIVE_FILE);
> +       x += global_node_page_state(NR_ACTIVE_FILE);
>
>         if (!vm_highmem_is_dirtyable)
>                 x -= highmem_dirtyable_memory(x);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48b5414009ac..b84b85ae54ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1090,9 +1090,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>
>         spin_lock(&zone->lock);
>         isolated_pageblocks = has_isolate_pageblock(zone);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         while (count) {
>                 struct page *page;
> @@ -1147,9 +1147,9 @@ static void free_one_page(struct zone *zone,
>  {
>         unsigned long nr_scanned;
>         spin_lock(&zone->lock);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
> @@ -4331,6 +4331,7 @@ void show_free_areas(unsigned int filter)
>         unsigned long free_pcp = 0;
>         int cpu;
>         struct zone *zone;
> +       pg_data_t *pgdat;
>
>         for_each_populated_zone(zone) {
>                 if (skip_free_areas_node(filter, zone_to_nid(zone)))
> @@ -4349,13 +4350,13 @@ void show_free_areas(unsigned int filter)
>                 " anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
>  #endif
>                 " free:%lu free_pcp:%lu free_cma:%lu\n",
> -               global_page_state(NR_ACTIVE_ANON),
> -               global_page_state(NR_INACTIVE_ANON),
> -               global_page_state(NR_ISOLATED_ANON),
> -               global_page_state(NR_ACTIVE_FILE),
> -               global_page_state(NR_INACTIVE_FILE),
> -               global_page_state(NR_ISOLATED_FILE),
> -               global_page_state(NR_UNEVICTABLE),
> +               global_node_page_state(NR_ACTIVE_ANON),
> +               global_node_page_state(NR_INACTIVE_ANON),
> +               global_node_page_state(NR_ISOLATED_ANON),
> +               global_node_page_state(NR_ACTIVE_FILE),
> +               global_node_page_state(NR_INACTIVE_FILE),
> +               global_node_page_state(NR_ISOLATED_FILE),
> +               global_node_page_state(NR_UNEVICTABLE),
>                 global_page_state(NR_FILE_DIRTY),
>                 global_page_state(NR_WRITEBACK),
>                 global_page_state(NR_UNSTABLE_NFS),
> @@ -4374,6 +4375,28 @@ void show_free_areas(unsigned int filter)
>                 free_pcp,
>                 global_page_state(NR_FREE_CMA_PAGES));
>
> +       for_each_online_pgdat(pgdat) {
> +               printk("Node %d"
> +                       " active_anon:%lukB"
> +                       " inactive_anon:%lukB"
> +                       " active_file:%lukB"
> +                       " inactive_file:%lukB"
> +                       " unevictable:%lukB"
> +                       " isolated(anon):%lukB"
> +                       " isolated(file):%lukB"
> +                       " all_unreclaimable? %s"
> +                       "\n",
> +                       pgdat->node_id,
> +                       K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_UNEVICTABLE)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_ANON)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_FILE)),
> +                       !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +       }
> +
>         for_each_populated_zone(zone) {
>                 int i;
>
> @@ -4390,13 +4413,6 @@ void show_free_areas(unsigned int filter)
>                         " min:%lukB"
>                         " low:%lukB"
>                         " high:%lukB"
> -                       " active_anon:%lukB"
> -                       " inactive_anon:%lukB"
> -                       " active_file:%lukB"
> -                       " inactive_file:%lukB"
> -                       " unevictable:%lukB"
> -                       " isolated(anon):%lukB"
> -                       " isolated(file):%lukB"
>                         " present:%lukB"
>                         " managed:%lukB"
>                         " mlocked:%lukB"
> @@ -4419,21 +4435,13 @@ void show_free_areas(unsigned int filter)
>                         " local_pcp:%ukB"
>                         " free_cma:%lukB"
>                         " writeback_tmp:%lukB"
> -                       " pages_scanned:%lu"
> -                       " all_unreclaimable? %s"
> +                       " node_pages_scanned:%lu"
>                         "\n",
>                         zone->name,
>                         K(zone_page_state(zone, NR_FREE_PAGES)),
>                         K(min_wmark_pages(zone)),
>                         K(low_wmark_pages(zone)),
>                         K(high_wmark_pages(zone)),
> -                       K(zone_page_state(zone, NR_ACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_INACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_ACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_INACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_UNEVICTABLE)),
> -                       K(zone_page_state(zone, NR_ISOLATED_ANON)),
> -                       K(zone_page_state(zone, NR_ISOLATED_FILE)),
>                         K(zone->present_pages),
>                         K(zone->managed_pages),
>                         K(zone_page_state(zone, NR_MLOCK)),
> @@ -4458,9 +4466,7 @@ void show_free_areas(unsigned int filter)
>                         K(this_cpu_read(zone->pageset->pcp.count)),
>                         K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
>                         K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
> -                       K(zone_page_state(zone, NR_PAGES_SCANNED)),
> -                       (!zone_reclaimable(zone) ? "yes" : "no")
> -                       );
> +                       K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
>                 printk("lowmem_reserve[]:");
>                 for (i = 0; i < MAX_NR_ZONES; i++)
>                         printk(" %ld", zone->lowmem_reserve[i]);
> @@ -6010,7 +6016,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>                 /* For bootup, initialized properly in watermark setup */
>                 mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
>
> -               lruvec_init(&zone->lruvec);
> +               lruvec_init(zone_lruvec(zone));
>                 if (!size)
>                         continue;
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bf37e5cfae81..77af473635fe 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
>                 unsigned long flags;
>
>                 spin_lock_irqsave(zone_lru_lock(zone), flags);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(!PageLRU(page), page);
>                 __ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -194,7 +194,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>                         spin_lock_irqsave(zone_lru_lock(zone), flags);
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 (*move_fn)(page, lruvec, arg);
>         }
>         if (zone)
> @@ -319,7 +319,7 @@ void activate_page(struct page *page)
>
>         page = compound_head(page);
>         spin_lock_irq(zone_lru_lock(zone));
> -       __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
> +       __activate_page(page, mem_cgroup_page_lruvec(page, zone->zone_pgdat), NULL);
>         spin_unlock_irq(zone_lru_lock(zone));
>  }
>  #endif
> @@ -445,16 +445,16 @@ void lru_cache_add(struct page *page)
>   */
>  void add_page_to_unevictable_list(struct page *page)
>  {
> -       struct zone *zone = page_zone(page);
> +       struct pglist_data *pgdat = page_pgdat(page);
>         struct lruvec *lruvec;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> -       lruvec = mem_cgroup_page_lruvec(page, zone);
> +       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
>         ClearPageActive(page);
>         SetPageUnevictable(page);
>         SetPageLRU(page);
>         add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>  }
>
>  /**
> @@ -730,7 +730,7 @@ void release_pages(struct page **pages, int nr, bool cold)
>  {
>         int i;
>         LIST_HEAD(pages_to_free);
> -       struct zone *zone = NULL;
> +       struct pglist_data *locked_pgdat = NULL;
>         struct lruvec *lruvec;
>         unsigned long uninitialized_var(flags);
>         unsigned int uninitialized_var(lock_batch);
> @@ -741,11 +741,11 @@ void release_pages(struct page **pages, int nr, bool cold)
>                 /*
>                  * Make sure the IRQ-safe lock-holding time does not get
>                  * excessive with a continuous string of pages from the
> -                * same zone. The lock is held only if zone != NULL.
> +                * same pgdat. The lock is held only if pgdat != NULL.
>                  */
> -               if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
> -                       spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                       zone = NULL;
> +               if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> +                       spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                       locked_pgdat = NULL;
>                 }
>
>                 if (is_huge_zero_page(page)) {
> @@ -758,27 +758,27 @@ void release_pages(struct page **pages, int nr, bool cold)
>                         continue;
>
>                 if (PageCompound(page)) {
> -                       if (zone) {
> -                               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                               zone = NULL;
> +                       if (locked_pgdat) {
> +                               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                               locked_pgdat = NULL;
>                         }
>                         __put_compound_page(page);
>                         continue;
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct zone *pagezone = page_zone(page);
> +                       struct pglist_data *pgdat = page_pgdat(page);
>
> -                       if (pagezone != zone) {
> -                               if (zone)
> -                                       spin_unlock_irqrestore(zone_lru_lock(zone),
> +                       if (pgdat != locked_pgdat) {
> +                               if (locked_pgdat)
> +                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
>                                                                         flags);
>                                 lock_batch = 0;
> -                               zone = pagezone;
> -                               spin_lock_irqsave(zone_lru_lock(zone), flags);
> +                               locked_pgdat = pgdat;
> +                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>                         }
>
> -                       lruvec = mem_cgroup_page_lruvec(page, zone);
> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         VM_BUG_ON_PAGE(!PageLRU(page), page);
>                         __ClearPageLRU(page);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -789,8 +789,8 @@ void release_pages(struct page **pages, int nr, bool cold)
>
>                 list_add(&page->lru, &pages_to_free);
>         }
> -       if (zone)
> -               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> +       if (locked_pgdat)
> +               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
>
>         mem_cgroup_uncharge_list(&pages_to_free);
>         free_hot_cold_page_list(&pages_to_free, cold);
> @@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>         VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>         VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>         VM_BUG_ON(NR_CPUS != 1 &&
> -                 !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
> +                 !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
>
>         if (!list)
>                 SetPageLRU(page_tail);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e7ffcd259cc4..86a523a761c9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -191,26 +191,42 @@ static bool sane_reclaim(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * This misses isolated pages which are not accounted for to save counters.
> + * As the data only determines if reclaim or compaction continues, it is
> + * not expected that isolated pages will be a dominating factor.
> + */
>  unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>         unsigned long nr;
>
> -       nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
> +       nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
> +       if (get_nr_swap_pages() > 0)
> +               nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
> +
> +       return nr;
> +}
> +
> +unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
> +{
> +       unsigned long nr;
> +
> +       nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
>
>         if (get_nr_swap_pages() > 0)
> -               nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
> +               nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
>
>         return nr;
>  }
>
> -bool zone_reclaimable(struct zone *zone)
> +bool pgdat_reclaimable(struct pglist_data *pgdat)
>  {
> -       return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
> -               zone_reclaimable_pages(zone) * 6;
> +       return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> +               pgdat_reclaimable_pages(pgdat) * 6;
>  }
>
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -218,7 +234,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
>         if (!mem_cgroup_disabled())
>                 return mem_cgroup_get_lru_size(lruvec, lru);
>
> -       return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> +       return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  }
>
>  /*
> @@ -877,7 +893,7 @@ static void page_check_dirty_writeback(struct page *page,
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
> -                                     struct zone *zone,
> +                                     struct pglist_data *pgdat,
>                                       struct scan_control *sc,
>                                       enum ttu_flags ttu_flags,
>                                       unsigned long *ret_nr_dirty,
> @@ -917,7 +933,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         goto keep;
>
>                 VM_BUG_ON_PAGE(PageActive(page), page);
> -               VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
>                 sc->nr_scanned++;
>
> @@ -996,7 +1011,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         /* Case 1 above */
>                         if (current_is_kswapd() &&
>                             PageReclaim(page) &&
> -                           test_bit(ZONE_WRITEBACK, &zone->flags)) {
> +                           test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>                                 nr_immediate++;
>                                 goto keep_locked;
>
> @@ -1092,7 +1107,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                          */
>                         if (page_is_file_cache(page) &&
>                                         (!current_is_kswapd() ||
> -                                        !test_bit(ZONE_DIRTY, &zone->flags))) {
> +                                        !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
>                                 /*
>                                  * Immediately reclaim when written back.
>                                  * Similar in principal to deactivate_page()
> @@ -1266,11 +1281,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>                 }
>         }
>
> -       ret = shrink_page_list(&clean_pages, zone, &sc,
> +       ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
>                         TTU_UNMAP|TTU_IGNORE_ACCESS,
>                         &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
>         list_splice(&clean_pages, page_list);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
>         return ret;
>  }
>
> @@ -1375,7 +1390,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  {
>         struct list_head *src = &lruvec->lists[lru];
>         unsigned long nr_taken = 0;
> -       unsigned long scan;
> +       unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
> +       unsigned long scan, nr_pages;
>
>         for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
>                                         !list_empty(src); scan++) {
> @@ -1388,7 +1404,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
>                 switch (__isolate_lru_page(page, mode)) {
>                 case 0:
> -                       nr_taken += hpage_nr_pages(page);
> +                       nr_pages = hpage_nr_pages(page);
> +                       nr_taken += nr_pages;
> +                       nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
>
> @@ -1405,6 +1423,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>         *nr_scanned = scan;
>         trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
>                                     nr_taken, mode, is_file_lru(lru));
> +       for (scan = 0; scan < MAX_NR_ZONES; scan++) {
> +               nr_pages = nr_zone_taken[scan];
> +               if (!nr_pages)
> +                       continue;
> +
> +               update_lru_size(lruvec, lru, scan, -nr_pages);
> +       }
>         return nr_taken;
>  }
>
> @@ -1445,7 +1470,7 @@ int isolate_lru_page(struct page *page)
>                 struct lruvec *lruvec;
>
>                 spin_lock_irq(zone_lru_lock(zone));
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 if (PageLRU(page)) {
>                         int lru = page_lru(page);
>                         get_page(page);
> @@ -1465,7 +1490,7 @@ int isolate_lru_page(struct page *page)
>   * the LRU list will go small and be scanned faster than necessary, leading to
>   * unnecessary swapping, thrashing and OOM.
>   */
> -static int too_many_isolated(struct zone *zone, int file,
> +static int too_many_isolated(struct pglist_data *pgdat, int file,
>                 struct scan_control *sc)
>  {
>         unsigned long inactive, isolated;
> @@ -1477,11 +1502,11 @@ static int too_many_isolated(struct zone *zone, int file,
>                 return 0;
>
>         if (file) {
> -               inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -               isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
>         } else {
> -               inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> -               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
>         }
>
>         /*
> @@ -1499,7 +1524,7 @@ static noinline_for_stack void
>  putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  {
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         LIST_HEAD(pages_to_free);
>
>         /*
> @@ -1512,13 +1537,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(zone_lru_lock(zone));
> +                       spin_unlock_irq(&pgdat->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(zone_lru_lock(zone));
> +                       spin_lock_irq(&pgdat->lru_lock);
>                         continue;
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 SetPageLRU(page);
>                 lru = page_lru(page);
> @@ -1535,10 +1560,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>                 }
> @@ -1582,10 +1607,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         unsigned long nr_immediate = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>
> -       while (unlikely(too_many_isolated(zone, file, sc))) {
> +       while (unlikely(too_many_isolated(pgdat, file, sc))) {
>                 congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                 /* We are about to die and free our memory. Return now. */
> @@ -1600,48 +1625,45 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc)) {
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>                 else
> -                       __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_DIRECT, nr_scanned);
>         }
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         if (nr_taken == 0)
>                 return 0;
>
> -       nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> +       nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
>                                 &nr_dirty, &nr_unqueued_dirty, &nr_congested,
>                                 &nr_writeback, &nr_immediate,
>                                 false);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         if (global_reclaim(sc)) {
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSTEAL_KSWAPD, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
>                 else
> -                       __count_zone_vm_events(PGSTEAL_DIRECT, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
>         }
>
>         putback_inactive_pages(lruvec, &page_list);
>
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&page_list);
>         free_hot_cold_page_list(&page_list, true);
> @@ -1661,7 +1683,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          * are encountered in the nr_immediate check below.
>          */
>         if (nr_writeback && nr_writeback == nr_taken)
> -               set_bit(ZONE_WRITEBACK, &zone->flags);
> +               set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * Legacy memcg will stall in page writeback so avoid forcibly
> @@ -1673,16 +1695,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>                  * backed by a congested BDI and wait_iff_congested will stall.
>                  */
>                 if (nr_dirty && nr_dirty == nr_congested)
> -                       set_bit(ZONE_CONGESTED, &zone->flags);
> +                       set_bit(PGDAT_CONGESTED, &pgdat->flags);
>
>                 /*
>                  * If dirty pages are scanned that are not queued for IO, it
>                  * implies that flushers are not keeping up. In this case, flag
> -                * the zone ZONE_DIRTY and kswapd will start writing pages from
> +                * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
>                  * reclaim context.
>                  */
>                 if (nr_unqueued_dirty == nr_taken)
> -                       set_bit(ZONE_DIRTY, &zone->flags);
> +                       set_bit(PGDAT_DIRTY, &pgdat->flags);
>
>                 /*
>                  * If kswapd scans pages marked marked for immediate
> @@ -1701,9 +1723,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          */
>         if (!sc->hibernation_mode && !current_is_kswapd() &&
>             current_may_throttle())
> -               wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +               wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>
> -       trace_mm_vmscan_lru_shrink_inactive(zone, nr_scanned, nr_reclaimed,
> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> +                       nr_scanned, nr_reclaimed,
>                         sc->priority, file);
>         return nr_reclaimed;
>  }
> @@ -1731,20 +1754,20 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                                      struct list_head *pages_to_free,
>                                      enum lru_list lru)
>  {
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long pgmoved = 0;
>         struct page *page;
>         int nr_pages;
>
>         while (!list_empty(list)) {
>                 page = lru_to_page(list);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>
>                 nr_pages = hpage_nr_pages(page);
> -               update_lru_size(lruvec, lru, nr_pages);
> +               update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
>                 list_move(&page->lru, &lruvec->lists[lru]);
>                 pgmoved += nr_pages;
>
> @@ -1754,10 +1777,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, pages_to_free);
>                 }
> @@ -1783,7 +1806,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         unsigned long nr_rotated = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
>         lru_add_drain();
>
> @@ -1792,20 +1815,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc))
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> -       __count_zone_vm_events(PGREFILL, zone, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
> +       __count_vm_events(PGREFILL, nr_scanned);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         while (!list_empty(&l_hold)) {
>                 cond_resched();
> @@ -1850,7 +1872,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         /*
>          * Move pages back to the lru list.
>          */
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         /*
>          * Count referenced pages from currently used mappings as rotated,
>          * even though only some of them are actually re-activated.  This
> @@ -1861,8 +1883,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>         move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>         move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&l_hold);
>         free_hot_cold_page_list(&l_hold, true);
> @@ -1956,7 +1978,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>         u64 fraction[2];
>         u64 denominator = 0;    /* gcc */
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long anon_prio, file_prio;
>         enum scan_balance scan_balance;
>         unsigned long anon, file;
> @@ -1977,7 +1999,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * well.
>          */
>         if (current_is_kswapd()) {
> -               if (!zone_reclaimable(zone))
> +               if (!pgdat_reclaimable(pgdat))
>                         force_scan = true;
>                 if (!mem_cgroup_online(memcg))
>                         force_scan = true;
> @@ -2023,14 +2045,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * anon pages.  Try to detect this based on file LRU size.
>          */
>         if (global_reclaim(sc)) {
> -               unsigned long zonefile;
> -               unsigned long zonefree;
> +               unsigned long pgdatfile;
> +               unsigned long pgdatfree;
> +               int z;
> +               unsigned long total_high_wmark = 0;
>
> -               zonefree = zone_page_state(zone, NR_FREE_PAGES);
> -               zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                          zone_page_state(zone, NR_INACTIVE_FILE);
> +               pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> +               pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                          node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> +               for (z = 0; z < MAX_NR_ZONES; z++) {
> +                       struct zone *zone = &pgdat->node_zones[z];
> +                       if (!populated_zone(zone))
> +                               continue;
> +
> +                       total_high_wmark += high_wmark_pages(zone);
> +               }
>
> -               if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
> +               if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
>                         scan_balance = SCAN_ANON;
>                         goto out;
>                 }
> @@ -2077,7 +2109,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
>                 lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
>                 reclaim_stat->recent_scanned[0] /= 2;
>                 reclaim_stat->recent_rotated[0] /= 2;
> @@ -2098,7 +2130,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>
>         fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
>         fp /= reclaim_stat->recent_rotated[1] + 1;
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         fraction[0] = ap;
>         fraction[1] = fp;
> @@ -2352,9 +2384,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
>          * inactive lists are large enough, continue reclaiming
>          */
>         pages_for_compaction = (2UL << sc->order);
> -       inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
> +       inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
>         if (get_nr_swap_pages() > 0)
> -               inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +               inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
>         if (sc->nr_reclaimed < pages_for_compaction &&
>                         inactive_lru_pages > pages_for_compaction)
>                 return true;
> @@ -2554,7 +2586,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>                                 continue;
>
>                         if (sc->priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;       /* Let kswapd poll it */
>
>                         /*
> @@ -2692,7 +2724,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
>                 if (!populated_zone(zone) ||
> -                   zone_reclaimable_pages(zone) == 0)
> +                   pgdat_reclaimable_pages(pgdat) == 0)
>                         continue;
>
>                 pfmemalloc_reserve += min_wmark_pages(zone);
> @@ -3000,7 +3032,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>                  * DEF_PRIORITY. Effectively, it considers them balanced so
>                  * they must be considered balanced here as well!
>                  */
> -               if (!zone_reclaimable(zone)) {
> +               if (!pgdat_reclaimable(zone->zone_pgdat)) {
>                         balanced_pages += zone->managed_pages;
>                         continue;
>                 }
> @@ -3063,6 +3095,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  {
>         unsigned long balance_gap;
>         bool lowmem_pressure;
> +       struct pglist_data *pgdat = zone->zone_pgdat;
>
>         /* Reclaim above the high watermark. */
>         sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> @@ -3087,7 +3120,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
>
>         shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
>
> -       clear_bit(ZONE_WRITEBACK, &zone->flags);
> +       /* TODO: ANOMALY */
> +       clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * If a zone reaches its high watermark, consider it to be no longer
> @@ -3095,10 +3129,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
>          * BDIs but as pressure is relieved, speculatively avoid congestion
>          * waits.
>          */
> -       if (zone_reclaimable(zone) &&
> +       if (pgdat_reclaimable(zone->zone_pgdat) &&
>             zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
> -               clear_bit(ZONE_CONGESTED, &zone->flags);
> -               clear_bit(ZONE_DIRTY, &zone->flags);
> +               clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> +               clear_bit(PGDAT_DIRTY, &pgdat->flags);
>         }
>
>         return sc->nr_scanned >= sc->nr_to_reclaim;
> @@ -3157,7 +3191,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         /*
> @@ -3184,9 +3218,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 /*
>                                  * If balanced, clear the dirty and congested
>                                  * flags
> +                                *
> +                                * TODO: ANOMALY
>                                  */
> -                               clear_bit(ZONE_CONGESTED, &zone->flags);
> -                               clear_bit(ZONE_DIRTY, &zone->flags);
> +                               clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +                               clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
>                         }
>                 }
>
> @@ -3216,7 +3252,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         sc.nr_scanned = 0;
> @@ -3612,8 +3648,8 @@ int sysctl_min_slab_ratio = 5;
>  static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
>  {
>         unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> -       unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
> -               zone_page_state(zone, NR_ACTIVE_FILE);
> +       unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +               node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         /*
>          * It's possible for there to be more file mapped pages than
> @@ -3716,7 +3752,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>             zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>                 return ZONE_RECLAIM_FULL;
>
> -       if (!zone_reclaimable(zone))
> +       if (!pgdat_reclaimable(zone->zone_pgdat))
>                 return ZONE_RECLAIM_FULL;
>
>         /*
> @@ -3795,7 +3831,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
>                         zone = pagezone;
>                         spin_lock_irq(zone_lru_lock(zone));
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3345d396a99b..de0c17076270 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -936,11 +936,8 @@ const char * const vmstat_text[] = {
>         /* enum zone_stat_item countes */
>         "nr_free_pages",
>         "nr_alloc_batch",
> -       "nr_inactive_anon",
> -       "nr_active_anon",
> -       "nr_inactive_file",
> -       "nr_active_file",
> -       "nr_unevictable",
> +       "nr_zone_anon_lru",
> +       "nr_zone_file_lru",
>         "nr_mlock",
>         "nr_anon_pages",
>         "nr_mapped",
> @@ -956,12 +953,9 @@ const char * const vmstat_text[] = {
>         "nr_vmscan_write",
>         "nr_vmscan_immediate_reclaim",
>         "nr_writeback_temp",
> -       "nr_isolated_anon",
> -       "nr_isolated_file",
>         "nr_shmem",
>         "nr_dirtied",
>         "nr_written",
> -       "nr_pages_scanned",
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         "nr_zspages",
>  #endif
> @@ -981,6 +975,16 @@ const char * const vmstat_text[] = {
>         "nr_shmem_pmdmapped",
>         "nr_free_cma",
>
> +       /* Node-based counters */
> +       "nr_inactive_anon",
> +       "nr_active_anon",
> +       "nr_inactive_file",
> +       "nr_active_file",
> +       "nr_unevictable",
> +       "nr_isolated_anon",
> +       "nr_isolated_file",
> +       "nr_pages_scanned",
> +
>         /* enum writeback_stat_item counters */
>         "nr_dirty_threshold",
>         "nr_dirty_background_threshold",
> @@ -1002,11 +1006,11 @@ const char * const vmstat_text[] = {
>         "pgmajfault",
>         "pglazyfreed",
>
> -       TEXTS_FOR_ZONES("pgrefill")
> -       TEXTS_FOR_ZONES("pgsteal_kswapd")
> -       TEXTS_FOR_ZONES("pgsteal_direct")
> -       TEXTS_FOR_ZONES("pgscan_kswapd")
> -       TEXTS_FOR_ZONES("pgscan_direct")
> +       "pgrefill",
> +       "pgsteal_kswapd",
> +       "pgsteal_direct",
> +       "pgscan_kswapd",
> +       "pgscan_direct",
>         "pgscan_direct_throttle",
>
>  #ifdef CONFIG_NUMA
> @@ -1434,7 +1438,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        min      %lu"
>                    "\n        low      %lu"
>                    "\n        high     %lu"
> -                  "\n        scanned  %lu"
> +                  "\n   node_scanned  %lu"
>                    "\n        spanned  %lu"
>                    "\n        present  %lu"
>                    "\n        managed  %lu",
> @@ -1442,13 +1446,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    min_wmark_pages(zone),
>                    low_wmark_pages(zone),
>                    high_wmark_pages(zone),
> -                  zone_page_state(zone, NR_PAGES_SCANNED),
> +                  node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>                    zone->spanned_pages,
>                    zone->present_pages,
>                    zone->managed_pages);
>
>         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> -               seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
> +               seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
>                                 zone_page_state(zone, i));
>
>         seq_printf(m,
> @@ -1478,12 +1482,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  #endif
>         }
>         seq_printf(m,
> -                  "\n  all_unreclaimable: %u"
> -                  "\n  start_pfn:         %lu"
> -                  "\n  inactive_ratio:    %u",
> -                  !zone_reclaimable(zone),
> +                  "\n  node_unreclaimable:  %u"
> +                  "\n  start_pfn:           %lu"
> +                  "\n  node_inactive_ratio: %u",
> +                  !pgdat_reclaimable(zone->zone_pgdat),
>                    zone->zone_start_pfn,
> -                  zone->inactive_ratio);
> +                  zone->zone_pgdat->inactive_ratio);
>         seq_putc(m, '\n');
>  }
>
> @@ -1574,7 +1578,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  {
>         unsigned long *l = arg;
>         unsigned long off = l - (unsigned long *)m->private;
> -
>         seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
>         return 0;
>  }
> diff --git a/mm/workingset.c b/mm/workingset.c
> index ba972ac2dfdd..ebe14445809a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -355,8 +355,8 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>                 pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>                                                      LRU_ALL_FILE);
>         } else {
> -               pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
> -                       sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
> +               pages = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
> +                       node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
>         }
>
>         /*
> --
> 2.6.4
>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-08-04 20:59     ` James Hogan
  0 siblings, 0 replies; 220+ messages in thread
From: James Hogan @ 2016-08-04 20:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

On 8 July 2016 at 10:34, Mel Gorman <mgorman@techsingularity.net> wrote:
> This moves the LRU lists from the zone to the node and related data
> such as counters, tracing, congestion tracking and writeback tracking.
> Unfortunately, due to reclaim and compaction retry logic, it is necessary
> to account for the number of LRU pages on both zone and node logic.
> Most reclaim logic is based on the node counters but the retry logic uses
> the zone counters which do not distinguish inactive and active sizes.
> It would be possible to leave the LRU counters on a per-zone basis but
> it's a heavier calculation across multiple cache lines that is much more
> frequent than the retry checks.
>
> Other than the LRU counters, this is mostly a mechanical patch but note
> that it introduces a number of anomalies.  For example, the scans are
> per-zone but using per-node counters.  We also mark a node as congested
> when a zone is congested.  This causes weird problems that are fixed later
> but is easier to review.
>
> In the event that there is excessive overhead on 32-bit systems due to
> the nodes being on LRU then there are two potential solutions
>
> 1. Long-term isolation of highmem pages when reclaim is lowmem
>
>    When pages are skipped, they are immediately added back onto the LRU
>    list. If lowmem reclaim persisted for long periods of time, the same
>    highmem pages get continually scanned. The idea would be that lowmem
>    keeps those pages on a separate list until a reclaim for highmem pages
>    arrives that splices the highmem pages back onto the LRU. It potentially
>    could be implemented similar to the UNEVICTABLE list.
>
>    That would reduce the skip rate with the potential corner case is that
>    highmem pages have to be scanned and reclaimed to free lowmem slab pages.
>
> 2. Linear scan lowmem pages if the initial LRU shrink fails
>
>    This will break LRU ordering but may be preferable and faster during
>    memory pressure than skipping LRU pages.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

This breaks boot on metag architecture:
Oops: err 0007 (Data access general read/write fault) addr 00233008 [#1]

It appears to be in node_page_state_snapshot() (via
pgdat_reclaimable()), and have come via mm_init. Here's the relevant
bit of the backtrace:

    node_page_state_snapshot@0x4009c884(enum node_stat_item item =
???, struct pglist_data * pgdat = ???) + 0x48
    pgdat_reclaimable(struct pglist_data * pgdat = 0x402517a0)
    show_free_areas(unsigned int filter = 0) + 0x2cc
    show_mem(unsigned int filter = 0) + 0x18
    mm_init@0x4025c3d4()
    start_kernel() + 0x204

__per_cpu_offset[0] == 0x233000 (close to bad addr),
pgdat->per_cpu_nodestats = NULL. and setup_per_cpu_pageset()
definitely hasn't been called yet (mm_init is called before
setup_per_cpu_pageset()).

Any ideas what the correct solution is (and why presumably others
haven't seen the same issue on other architectures?).

Thanks
James

> ---
>  arch/tile/mm/pgtable.c                    |   8 +-
>  drivers/base/node.c                       |  19 +--
>  drivers/staging/android/lowmemorykiller.c |   8 +-
>  include/linux/backing-dev.h               |   2 +-
>  include/linux/memcontrol.h                |  18 +--
>  include/linux/mm_inline.h                 |  21 ++-
>  include/linux/mmzone.h                    |  68 +++++----
>  include/linux/swap.h                      |   1 +
>  include/linux/vm_event_item.h             |  10 +-
>  include/linux/vmstat.h                    |  17 +++
>  include/trace/events/vmscan.h             |  12 +-
>  kernel/power/snapshot.c                   |  10 +-
>  mm/backing-dev.c                          |  15 +-
>  mm/compaction.c                           |  18 +--
>  mm/huge_memory.c                          |   2 +-
>  mm/internal.h                             |   2 +-
>  mm/khugepaged.c                           |   4 +-
>  mm/memcontrol.c                           |  17 +--
>  mm/memory-failure.c                       |   4 +-
>  mm/memory_hotplug.c                       |   2 +-
>  mm/mempolicy.c                            |   2 +-
>  mm/migrate.c                              |  21 +--
>  mm/mlock.c                                |   2 +-
>  mm/page-writeback.c                       |   8 +-
>  mm/page_alloc.c                           |  68 +++++----
>  mm/swap.c                                 |  50 +++----
>  mm/vmscan.c                               | 226 +++++++++++++++++-------------
>  mm/vmstat.c                               |  47 ++++---
>  mm/workingset.c                           |   4 +-
>  29 files changed, 386 insertions(+), 300 deletions(-)
>
> diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
> index c4d5bf841a7f..9e389213580d 100644
> --- a/arch/tile/mm/pgtable.c
> +++ b/arch/tile/mm/pgtable.c
> @@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
>         struct zone *zone;
>
>         pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
> -              (global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE)),
> -              (global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE)),
> +              (global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE)),
> +              (global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE)),
>                global_page_state(NR_FILE_DIRTY),
>                global_page_state(NR_WRITEBACK),
>                global_page_state(NR_UNSTABLE_NFS),
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 92d8e090c5b3..b7f01a4a642d 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  {
>         int n;
>         int nid = dev->id;
> +       struct pglist_data *pgdat = NODE_DATA(nid);
>         struct sysinfo i;
>
>         si_meminfo_node(&i, nid);
> @@ -74,15 +75,15 @@ static ssize_t node_read_meminfo(struct device *dev,
>                        nid, K(i.totalram),
>                        nid, K(i.freeram),
>                        nid, K(i.totalram - i.freeram),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
>                        nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
>
>  #ifdef CONFIG_HIGHMEM
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> index 24d2745e9437..93dbcc38eb0f 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -72,10 +72,10 @@ static unsigned long lowmem_deathpending_timeout;
>  static unsigned long lowmem_count(struct shrinker *s,
>                                   struct shrink_control *sc)
>  {
> -       return global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE) +
> -               global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE);
> +       return global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE) +
> +               global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE);
>  }
>
>  static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c82794f20110..491a91717788 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -197,7 +197,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
>  }
>
>  long congestion_wait(int sync, long timeout);
> -long wait_iff_congested(struct zone *zone, int sync, long timeout);
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
>  int pdflush_proc_obsolete(struct ctl_table *table, int write,
>                 void __user *buffer, size_t *lenp, loff_t *ppos);
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 104efa6874db..68f1121c8fe7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -340,7 +340,7 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = zone_lruvec(zone);
>                 goto out;
>         }
>
> @@ -349,15 +349,15 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>  out:
>         /*
>          * Since a node can be onlined after the mem_cgroup was created,
> -        * we have to be prepared to initialize lruvec->zone here;
> +        * we have to be prepared to initialize lruvec->pgdat here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != zone->zone_pgdat))
> +               lruvec->pgdat = zone->zone_pgdat;
>         return lruvec;
>  }
>
> -struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> +struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
>  bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> @@ -438,7 +438,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
>  int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
>
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -               int nr_pages);
> +               enum zone_type zid, int nr_pages);
>
>  unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>                                            int nid, unsigned int lru_mask);
> @@ -613,13 +613,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
>  static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>                                                     struct mem_cgroup *memcg)
>  {
> -       return &zone->lruvec;
> +       return zone_lruvec(zone);
>  }
>
>  static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> -                                                   struct zone *zone)
> +                                                   struct pglist_data *pgdat)
>  {
> -       return &zone->lruvec;
> +       return &pgdat->lruvec;
>  }
>
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 5bd29ba4f174..9aadcc781857 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -23,25 +23,32 @@ static inline int page_is_file_cache(struct page *page)
>  }
>
>  static __always_inline void __update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
> -       __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +       __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
> +       __mod_zone_page_state(&pgdat->node_zones[zid],
> +               NR_ZONE_LRU_BASE + !!is_file_lru(lru),
> +               nr_pages);
>  }
>
>  static __always_inline void update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
>  #ifdef CONFIG_MEMCG
> -       mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
> +       mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
>  #else
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>  #endif
>  }
>
>  static __always_inline void add_page_to_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
> -       update_lru_size(lruvec, lru, hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
>         list_add(&page->lru, &lruvec->lists[lru]);
>  }
>
> @@ -49,7 +56,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
>         list_del(&page->lru);
> -       update_lru_size(lruvec, lru, -hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
>  }
>
>  /**
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cfa870107abe..d4f5cac0a8c3 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -111,12 +111,9 @@ enum zone_stat_item {
>         /* First 128 byte cacheline (assuming 64 bit words) */
>         NR_FREE_PAGES,
>         NR_ALLOC_BATCH,
> -       NR_LRU_BASE,
> -       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> -       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> -       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> -       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> -       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
> +       NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
> +       NR_ZONE_LRU_FILE,
>         NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
>         NR_ANON_PAGES,  /* Mapped anonymous pages */
>         NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> @@ -134,12 +131,9 @@ enum zone_stat_item {
>         NR_VMSCAN_WRITE,
>         NR_VMSCAN_IMMEDIATE,    /* Prioritise for reclaim when writeback ends */
>         NR_WRITEBACK_TEMP,      /* Writeback using temporary buffers */
> -       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> -       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
>         NR_SHMEM,               /* shmem pages (included tmpfs/GEM pages) */
>         NR_DIRTIED,             /* page dirtyings since bootup */
>         NR_WRITTEN,             /* page writings since bootup */
> -       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         NR_ZSPAGES,             /* allocated in zsmalloc */
>  #endif
> @@ -161,6 +155,15 @@ enum zone_stat_item {
>         NR_VM_ZONE_STAT_ITEMS };
>
>  enum node_stat_item {
> +       NR_LRU_BASE,
> +       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> +       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> +       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> +       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> +       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> +       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
> +       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>         NR_VM_NODE_STAT_ITEMS
>  };
>
> @@ -219,7 +222,7 @@ struct lruvec {
>         /* Evictions & activations on the inactive file list */
>         atomic_long_t                   inactive_age;
>  #ifdef CONFIG_MEMCG
> -       struct zone                     *zone;
> +       struct pglist_data *pgdat;
>  #endif
>  };
>
> @@ -357,13 +360,6 @@ struct zone {
>  #ifdef CONFIG_NUMA
>         int node;
>  #endif
> -
> -       /*
> -        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> -        * this zone's LRU.  Maintained by the pageout code.
> -        */
> -       unsigned int inactive_ratio;
> -
>         struct pglist_data      *zone_pgdat;
>         struct per_cpu_pageset __percpu *pageset;
>
> @@ -495,9 +491,6 @@ struct zone {
>
>         /* Write-intensive fields used by page reclaim */
>
> -       /* Fields commonly accessed by the page reclaim scanner */
> -       struct lruvec           lruvec;
> -
>         /*
>          * When free pages are below this point, additional steps are taken
>          * when reading the number of free pages to avoid per-cpu counter
> @@ -537,17 +530,20 @@ struct zone {
>
>  enum zone_flags {
>         ZONE_RECLAIM_LOCKED,            /* prevents concurrent reclaim */
> -       ZONE_CONGESTED,                 /* zone has many dirty pages backed by
> +       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
> +};
> +
> +enum pgdat_flags {
> +       PGDAT_CONGESTED,                /* pgdat has many dirty pages backed by
>                                          * a congested BDI
>                                          */
> -       ZONE_DIRTY,                     /* reclaim scanning has recently found
> +       PGDAT_DIRTY,                    /* reclaim scanning has recently found
>                                          * many dirty file pages at the tail
>                                          * of the LRU.
>                                          */
> -       ZONE_WRITEBACK,                 /* reclaim scanning has recently found
> +       PGDAT_WRITEBACK,                /* reclaim scanning has recently found
>                                          * many pages under writeback
>                                          */
> -       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
>  };
>
>  static inline unsigned long zone_end_pfn(const struct zone *zone)
> @@ -707,6 +703,19 @@ typedef struct pglist_data {
>         unsigned long split_queue_len;
>  #endif
>
> +       /* Fields commonly accessed by the page reclaim scanner */
> +       struct lruvec           lruvec;
> +
> +       /*
> +        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> +        * this node's LRU.  Maintained by the pageout code.
> +        */
> +       unsigned int inactive_ratio;
> +
> +       unsigned long           flags;
> +
> +       ZONE_PADDING(_pad2_)
> +
>         /* Per-node vmstats */
>         struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>         atomic_long_t           vm_stat[NR_VM_NODE_STAT_ITEMS];
> @@ -728,6 +737,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
>         return &zone->zone_pgdat->lru_lock;
>  }
>
> +static inline struct lruvec *zone_lruvec(struct zone *zone)
> +{
> +       return &zone->zone_pgdat->lruvec;
> +}
> +
>  static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
>  {
>         return pgdat->node_start_pfn + pgdat->node_spanned_pages;
> @@ -779,12 +793,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
>
>  extern void lruvec_init(struct lruvec *lruvec);
>
> -static inline struct zone *lruvec_zone(struct lruvec *lruvec)
> +static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
>  {
>  #ifdef CONFIG_MEMCG
> -       return lruvec->zone;
> +       return lruvec->pgdat;
>  #else
> -       return container_of(lruvec, struct zone, lruvec);
> +       return container_of(lruvec, struct pglist_data, lruvec);
>  #endif
>  }
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0af2bb2028fd..c82f916008b7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>
>  /* linux/mm/vmscan.c */
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> +extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 42604173f122..1798ff542517 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -26,11 +26,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 PGFREE, PGACTIVATE, PGDEACTIVATE,
>                 PGFAULT, PGMAJFAULT,
>                 PGLAZYFREED,
> -               FOR_ALL_ZONES(PGREFILL),
> -               FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> -               FOR_ALL_ZONES(PGSTEAL_DIRECT),
> -               FOR_ALL_ZONES(PGSCAN_KSWAPD),
> -               FOR_ALL_ZONES(PGSCAN_DIRECT),
> +               PGREFILL,
> +               PGSTEAL_KSWAPD,
> +               PGSTEAL_DIRECT,
> +               PGSCAN_KSWAPD,
> +               PGSCAN_DIRECT,
>                 PGSCAN_DIRECT_THROTTLE,
>  #ifdef CONFIG_NUMA
>                 PGSCAN_ZONE_RECLAIM_FAILED,
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index d1744aa3ab9c..fee321c98550 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -178,6 +178,23 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>         return x;
>  }
>
> +static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
> +                                       enum node_stat_item item)
> +{
> +       long x = atomic_long_read(&pgdat->vm_stat[item]);
> +
> +#ifdef CONFIG_SMP
> +       int cpu;
> +       for_each_online_cpu(cpu)
> +               x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item];
> +
> +       if (x < 0)
> +               x = 0;
> +#endif
> +       return x;
> +}
> +
> +
>  #ifdef CONFIG_NUMA
>  extern unsigned long sum_zone_node_page_state(int node,
>                                                 enum zone_stat_item item);
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 0101ef37f1ee..897f1aa1ee5f 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -352,15 +352,14 @@ TRACE_EVENT(mm_vmscan_writepage,
>
>  TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>
> -       TP_PROTO(struct zone *zone,
> +       TP_PROTO(int nid,
>                 unsigned long nr_scanned, unsigned long nr_reclaimed,
>                 int priority, int file),
>
> -       TP_ARGS(zone, nr_scanned, nr_reclaimed, priority, file),
> +       TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
>
>         TP_STRUCT__entry(
>                 __field(int, nid)
> -               __field(int, zid)
>                 __field(unsigned long, nr_scanned)
>                 __field(unsigned long, nr_reclaimed)
>                 __field(int, priority)
> @@ -368,16 +367,15 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>         ),
>
>         TP_fast_assign(
> -               __entry->nid = zone_to_nid(zone);
> -               __entry->zid = zone_idx(zone);
> +               __entry->nid = nid;
>                 __entry->nr_scanned = nr_scanned;
>                 __entry->nr_reclaimed = nr_reclaimed;
>                 __entry->priority = priority;
>                 __entry->reclaim_flags = trace_shrink_flags(file);
>         ),
>
> -       TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> -               __entry->nid, __entry->zid,
> +       TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> +               __entry->nid,
>                 __entry->nr_scanned, __entry->nr_reclaimed,
>                 __entry->priority,
>                 show_reclaim_flags(__entry->reclaim_flags))
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 3a970604308f..24a06bc23f85 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1525,11 +1525,11 @@ static unsigned long minimum_image_size(unsigned long saveable)
>         unsigned long size;
>
>         size = global_page_state(NR_SLAB_RECLAIMABLE)
> -               + global_page_state(NR_ACTIVE_ANON)
> -               + global_page_state(NR_INACTIVE_ANON)
> -               + global_page_state(NR_ACTIVE_FILE)
> -               + global_page_state(NR_INACTIVE_FILE)
> -               - global_page_state(NR_FILE_MAPPED);
> +               + global_node_page_state(NR_ACTIVE_ANON)
> +               + global_node_page_state(NR_INACTIVE_ANON)
> +               + global_node_page_state(NR_ACTIVE_FILE)
> +               + global_node_page_state(NR_INACTIVE_FILE)
> +               - global_node_page_state(NR_FILE_MAPPED);
>
>         return saveable <= size ? 0 : saveable - size;
>  }
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index f53b23ab7ed7..a8c3af46bd3d 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -982,24 +982,24 @@ long congestion_wait(int sync, long timeout)
>  EXPORT_SYMBOL(congestion_wait);
>
>  /**
> - * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
> - * @zone: A zone to check if it is heavily congested
> + * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
> + * @pgdat: A pgdat to check if it is heavily congested
>   * @sync: SYNC or ASYNC IO
>   * @timeout: timeout in jiffies
>   *
>   * In the event of a congested backing_dev (any backing_dev) and the given
> - * @zone has experienced recent congestion, this waits for up to @timeout
> + * @pgdat has experienced recent congestion, this waits for up to @timeout
>   * jiffies for either a BDI to exit congestion of the given @sync queue
>   * or a write to complete.
>   *
> - * In the absence of zone congestion, cond_resched() is called to yield
> + * In the absence of pgdat congestion, cond_resched() is called to yield
>   * the processor if necessary but otherwise does not sleep.
>   *
>   * The return value is 0 if the sleep is for the full timeout. Otherwise,
>   * it is the number of jiffies that were still remaining when the function
>   * returned. return_value == timeout implies the function did not sleep.
>   */
> -long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
>  {
>         long ret;
>         unsigned long start = jiffies;
> @@ -1008,12 +1008,13 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>
>         /*
>          * If there is no congestion, or heavy congestion is not being
> -        * encountered in the current zone, yield if necessary instead
> +        * encountered in the current pgdat, yield if necessary instead
>          * of sleeping on the congestion queue
>          */
>         if (atomic_read(&nr_wb_congested[sync]) == 0 ||
> -           !test_bit(ZONE_CONGESTED, &zone->flags)) {
> +           !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
>                 cond_resched();
> +
>                 /* In case we scheduled, work out time remaining */
>                 ret = timeout - (jiffies - start);
>                 if (ret < 0)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7607efb7bee2..a0bd85712516 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -646,8 +646,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
>         list_for_each_entry(page, &cc->migratepages, lru)
>                 count[!!page_is_file_cache(page)]++;
>
> -       mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON, count[0]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, count[1]);
>  }
>
>  /* Similar to reclaim, but different enough that they don't share logic */
> @@ -655,12 +655,12 @@ static bool too_many_isolated(struct zone *zone)
>  {
>         unsigned long active, inactive, isolated;
>
> -       inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_INACTIVE_ANON);
> -       active = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_ACTIVE_ANON);
> -       isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
> -                                       zone_page_state(zone, NR_ISOLATED_ANON);
> +       inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
> +       active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
> +       isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
>
>         return isolated > (inactive + active) / 2;
>  }
> @@ -856,7 +856,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>                         }
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 /* Try isolate the page */
>                 if (__isolate_lru_page(page, isolate_mode) != 0)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2f997328ae64..5d5b2207cfd2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1830,7 +1830,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>         pgoff_t end = -1;
>         int i;
>
> -       lruvec = mem_cgroup_page_lruvec(head, zone);
> +       lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
>
>         /* complete memcg works before add pages to LRU */
>         mem_cgroup_split_huge_fixup(head);
> diff --git a/mm/internal.h b/mm/internal.h
> index 9b6a6c43ac39..2f80d0343c56 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -78,7 +78,7 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> -extern bool zone_reclaimable(struct zone *zone);
> +extern bool pgdat_reclaimable(struct pglist_data *pgdat);
>
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 93d5f87c00d5..d7a49f665f04 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -480,7 +480,7 @@ void __khugepaged_exit(struct mm_struct *mm)
>  static void release_pte_page(struct page *page)
>  {
>         /* 0 stands for page_is_file_cache(page) == false */
> -       dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +       dec_node_page_state(page, NR_ISOLATED_ANON + 0);
>         unlock_page(page);
>         putback_lru_page(page);
>  }
> @@ -576,7 +576,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         goto out;
>                 }
>                 /* 0 stands for page_is_file_cache(page) == false */
> -               inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +               inc_node_page_state(page, NR_ISOLATED_ANON + 0);
>                 VM_BUG_ON_PAGE(!PageLocked(page), page);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9b70f9ca8ddf..50c86ad121bc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -943,14 +943,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>   * and putback protocol: the LRU lock must be held, and the page must
>   * either be PageLRU() or the caller must have isolated/allocated it.
>   */
> -struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> +struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
>  {
>         struct mem_cgroup_per_zone *mz;
>         struct mem_cgroup *memcg;
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = &pgdat->lruvec;
>                 goto out;
>         }
>
> @@ -970,8 +970,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>          * we have to be prepared to initialize lruvec->zone here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != pgdat))
> +               lruvec->pgdat = pgdat;
>         return lruvec;
>  }
>
> @@ -979,6 +979,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * mem_cgroup_update_lru_size - account for adding or removing an lru page
>   * @lruvec: mem_cgroup per zone lru vector
>   * @lru: index of lru list the page is sitting on
> + * @zid: Zone ID of the zone pages have been added to
>   * @nr_pages: positive when adding or negative when removing
>   *
>   * This function must be called under lru_lock, just before a page is added
> @@ -986,14 +987,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * so as to allow it to check that lru_size 0 is consistent with list_empty).
>   */
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -                               int nr_pages)
> +                               enum zone_type zid, int nr_pages)
>  {
>         struct mem_cgroup_per_zone *mz;
>         unsigned long *lru_size;
>         long size;
>         bool empty;
>
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>
>         if (mem_cgroup_disabled())
>                 return;
> @@ -2069,7 +2070,7 @@ static void lock_page_lru(struct page *page, int *isolated)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
>                 *isolated = 1;
> @@ -2084,7 +2085,7 @@ static void unlock_page_lru(struct page *page, int isolated)
>         if (isolated) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>                 add_page_to_lru_list(page, lruvec, page_lru(page));
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2fcca6b0e005..11de752ccaf5 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1663,7 +1663,7 @@ static int __soft_offline_page(struct page *page, int flags)
>         put_hwpoison_page(page);
>         if (!ret) {
>                 LIST_HEAD(pagelist);
> -               inc_zone_page_state(page, NR_ISOLATED_ANON +
> +               inc_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                 list_add(&page->lru, &pagelist);
>                 ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> @@ -1671,7 +1671,7 @@ static int __soft_offline_page(struct page *page, int flags)
>                 if (ret) {
>                         if (!list_empty(&pagelist)) {
>                                 list_del(&page->lru);
> -                               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                                 page_is_file_cache(page));
>                                 putback_lru_page(page);
>                         }
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 82d0b98d27f8..c5278360ca66 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1586,7 +1586,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>                         put_page(page);
>                         list_add_tail(&page->lru, &source);
>                         move_pages--;
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>
>                 } else {
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 53e40d3f3933..d8c4e38fb5f4 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -962,7 +962,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
>         if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
>                 if (!isolate_lru_page(page)) {
>                         list_add_tail(&page->lru, pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>         }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2232f6923cc7..3033dae33a0a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -168,7 +168,7 @@ void putback_movable_pages(struct list_head *l)
>                         continue;
>                 }
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>                 /*
>                  * We isolated non-lru movable page so here we can use
> @@ -1119,7 +1119,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>                  * restored.
>                  */
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>         }
>
> @@ -1460,7 +1460,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>                 err = isolate_lru_page(page);
>                 if (!err) {
>                         list_add_tail(&page->lru, &pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>  put_and_set:
> @@ -1726,15 +1726,16 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>                                    unsigned long nr_migrate_pages)
>  {
>         int z;
> +
> +       if (!pgdat_reclaimable(pgdat))
> +               return false;
> +
>         for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>                 struct zone *zone = pgdat->node_zones + z;
>
>                 if (!populated_zone(zone))
>                         continue;
>
> -               if (!zone_reclaimable(zone))
> -                       continue;
> -
>                 /* Avoid waking kswapd by allocating pages_to_migrate pages. */
>                 if (!zone_watermark_ok(zone, 0,
>                                        high_wmark_pages(zone) +
> @@ -1828,7 +1829,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>         }
>
>         page_lru = page_is_file_cache(page);
> -       mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
> +       mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
>                                 hpage_nr_pages(page));
>
>         /*
> @@ -1886,7 +1887,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>         if (nr_remaining) {
>                 if (!list_empty(&migratepages)) {
>                         list_del(&page->lru);
> -                       dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                       dec_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                         putback_lru_page(page);
>                 }
> @@ -1979,7 +1980,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>                 /* Retake the callers reference and putback on LRU */
>                 get_page(page);
>                 putback_lru_page(page);
> -               mod_zone_page_state(page_zone(page),
> +               mod_node_page_state(page_pgdat(page),
>                          NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
>
>                 goto out_unlock;
> @@ -2030,7 +2031,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>         count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
>         count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
>
> -       mod_zone_page_state(page_zone(page),
> +       mod_node_page_state(page_pgdat(page),
>                         NR_ISOLATED_ANON + page_lru,
>                         -HPAGE_PMD_NR);
>         return isolated;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 997f63082ff5..14645be06e30 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -103,7 +103,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 if (getpage)
>                         get_page(page);
>                 ClearPageLRU(page);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d578d2a56b19..0ada2b2954b0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -285,8 +285,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
>          */
>         nr_pages -= min(nr_pages, zone->totalreserve_pages);
>
> -       nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
> -       nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         return nr_pages;
>  }
> @@ -348,8 +348,8 @@ static unsigned long global_dirtyable_memory(void)
>          */
>         x -= min(x, totalreserve_pages);
>
> -       x += global_page_state(NR_INACTIVE_FILE);
> -       x += global_page_state(NR_ACTIVE_FILE);
> +       x += global_node_page_state(NR_INACTIVE_FILE);
> +       x += global_node_page_state(NR_ACTIVE_FILE);
>
>         if (!vm_highmem_is_dirtyable)
>                 x -= highmem_dirtyable_memory(x);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48b5414009ac..b84b85ae54ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1090,9 +1090,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>
>         spin_lock(&zone->lock);
>         isolated_pageblocks = has_isolate_pageblock(zone);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         while (count) {
>                 struct page *page;
> @@ -1147,9 +1147,9 @@ static void free_one_page(struct zone *zone,
>  {
>         unsigned long nr_scanned;
>         spin_lock(&zone->lock);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
> @@ -4331,6 +4331,7 @@ void show_free_areas(unsigned int filter)
>         unsigned long free_pcp = 0;
>         int cpu;
>         struct zone *zone;
> +       pg_data_t *pgdat;
>
>         for_each_populated_zone(zone) {
>                 if (skip_free_areas_node(filter, zone_to_nid(zone)))
> @@ -4349,13 +4350,13 @@ void show_free_areas(unsigned int filter)
>                 " anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
>  #endif
>                 " free:%lu free_pcp:%lu free_cma:%lu\n",
> -               global_page_state(NR_ACTIVE_ANON),
> -               global_page_state(NR_INACTIVE_ANON),
> -               global_page_state(NR_ISOLATED_ANON),
> -               global_page_state(NR_ACTIVE_FILE),
> -               global_page_state(NR_INACTIVE_FILE),
> -               global_page_state(NR_ISOLATED_FILE),
> -               global_page_state(NR_UNEVICTABLE),
> +               global_node_page_state(NR_ACTIVE_ANON),
> +               global_node_page_state(NR_INACTIVE_ANON),
> +               global_node_page_state(NR_ISOLATED_ANON),
> +               global_node_page_state(NR_ACTIVE_FILE),
> +               global_node_page_state(NR_INACTIVE_FILE),
> +               global_node_page_state(NR_ISOLATED_FILE),
> +               global_node_page_state(NR_UNEVICTABLE),
>                 global_page_state(NR_FILE_DIRTY),
>                 global_page_state(NR_WRITEBACK),
>                 global_page_state(NR_UNSTABLE_NFS),
> @@ -4374,6 +4375,28 @@ void show_free_areas(unsigned int filter)
>                 free_pcp,
>                 global_page_state(NR_FREE_CMA_PAGES));
>
> +       for_each_online_pgdat(pgdat) {
> +               printk("Node %d"
> +                       " active_anon:%lukB"
> +                       " inactive_anon:%lukB"
> +                       " active_file:%lukB"
> +                       " inactive_file:%lukB"
> +                       " unevictable:%lukB"
> +                       " isolated(anon):%lukB"
> +                       " isolated(file):%lukB"
> +                       " all_unreclaimable? %s"
> +                       "\n",
> +                       pgdat->node_id,
> +                       K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_UNEVICTABLE)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_ANON)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_FILE)),
> +                       !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +       }
> +
>         for_each_populated_zone(zone) {
>                 int i;
>
> @@ -4390,13 +4413,6 @@ void show_free_areas(unsigned int filter)
>                         " min:%lukB"
>                         " low:%lukB"
>                         " high:%lukB"
> -                       " active_anon:%lukB"
> -                       " inactive_anon:%lukB"
> -                       " active_file:%lukB"
> -                       " inactive_file:%lukB"
> -                       " unevictable:%lukB"
> -                       " isolated(anon):%lukB"
> -                       " isolated(file):%lukB"
>                         " present:%lukB"
>                         " managed:%lukB"
>                         " mlocked:%lukB"
> @@ -4419,21 +4435,13 @@ void show_free_areas(unsigned int filter)
>                         " local_pcp:%ukB"
>                         " free_cma:%lukB"
>                         " writeback_tmp:%lukB"
> -                       " pages_scanned:%lu"
> -                       " all_unreclaimable? %s"
> +                       " node_pages_scanned:%lu"
>                         "\n",
>                         zone->name,
>                         K(zone_page_state(zone, NR_FREE_PAGES)),
>                         K(min_wmark_pages(zone)),
>                         K(low_wmark_pages(zone)),
>                         K(high_wmark_pages(zone)),
> -                       K(zone_page_state(zone, NR_ACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_INACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_ACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_INACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_UNEVICTABLE)),
> -                       K(zone_page_state(zone, NR_ISOLATED_ANON)),
> -                       K(zone_page_state(zone, NR_ISOLATED_FILE)),
>                         K(zone->present_pages),
>                         K(zone->managed_pages),
>                         K(zone_page_state(zone, NR_MLOCK)),
> @@ -4458,9 +4466,7 @@ void show_free_areas(unsigned int filter)
>                         K(this_cpu_read(zone->pageset->pcp.count)),
>                         K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
>                         K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
> -                       K(zone_page_state(zone, NR_PAGES_SCANNED)),
> -                       (!zone_reclaimable(zone) ? "yes" : "no")
> -                       );
> +                       K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
>                 printk("lowmem_reserve[]:");
>                 for (i = 0; i < MAX_NR_ZONES; i++)
>                         printk(" %ld", zone->lowmem_reserve[i]);
> @@ -6010,7 +6016,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>                 /* For bootup, initialized properly in watermark setup */
>                 mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
>
> -               lruvec_init(&zone->lruvec);
> +               lruvec_init(zone_lruvec(zone));
>                 if (!size)
>                         continue;
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bf37e5cfae81..77af473635fe 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
>                 unsigned long flags;
>
>                 spin_lock_irqsave(zone_lru_lock(zone), flags);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(!PageLRU(page), page);
>                 __ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -194,7 +194,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>                         spin_lock_irqsave(zone_lru_lock(zone), flags);
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 (*move_fn)(page, lruvec, arg);
>         }
>         if (zone)
> @@ -319,7 +319,7 @@ void activate_page(struct page *page)
>
>         page = compound_head(page);
>         spin_lock_irq(zone_lru_lock(zone));
> -       __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
> +       __activate_page(page, mem_cgroup_page_lruvec(page, zone->zone_pgdat), NULL);
>         spin_unlock_irq(zone_lru_lock(zone));
>  }
>  #endif
> @@ -445,16 +445,16 @@ void lru_cache_add(struct page *page)
>   */
>  void add_page_to_unevictable_list(struct page *page)
>  {
> -       struct zone *zone = page_zone(page);
> +       struct pglist_data *pgdat = page_pgdat(page);
>         struct lruvec *lruvec;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> -       lruvec = mem_cgroup_page_lruvec(page, zone);
> +       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
>         ClearPageActive(page);
>         SetPageUnevictable(page);
>         SetPageLRU(page);
>         add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>  }
>
>  /**
> @@ -730,7 +730,7 @@ void release_pages(struct page **pages, int nr, bool cold)
>  {
>         int i;
>         LIST_HEAD(pages_to_free);
> -       struct zone *zone = NULL;
> +       struct pglist_data *locked_pgdat = NULL;
>         struct lruvec *lruvec;
>         unsigned long uninitialized_var(flags);
>         unsigned int uninitialized_var(lock_batch);
> @@ -741,11 +741,11 @@ void release_pages(struct page **pages, int nr, bool cold)
>                 /*
>                  * Make sure the IRQ-safe lock-holding time does not get
>                  * excessive with a continuous string of pages from the
> -                * same zone. The lock is held only if zone != NULL.
> +                * same pgdat. The lock is held only if pgdat != NULL.
>                  */
> -               if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
> -                       spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                       zone = NULL;
> +               if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> +                       spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                       locked_pgdat = NULL;
>                 }
>
>                 if (is_huge_zero_page(page)) {
> @@ -758,27 +758,27 @@ void release_pages(struct page **pages, int nr, bool cold)
>                         continue;
>
>                 if (PageCompound(page)) {
> -                       if (zone) {
> -                               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                               zone = NULL;
> +                       if (locked_pgdat) {
> +                               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                               locked_pgdat = NULL;
>                         }
>                         __put_compound_page(page);
>                         continue;
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct zone *pagezone = page_zone(page);
> +                       struct pglist_data *pgdat = page_pgdat(page);
>
> -                       if (pagezone != zone) {
> -                               if (zone)
> -                                       spin_unlock_irqrestore(zone_lru_lock(zone),
> +                       if (pgdat != locked_pgdat) {
> +                               if (locked_pgdat)
> +                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
>                                                                         flags);
>                                 lock_batch = 0;
> -                               zone = pagezone;
> -                               spin_lock_irqsave(zone_lru_lock(zone), flags);
> +                               locked_pgdat = pgdat;
> +                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>                         }
>
> -                       lruvec = mem_cgroup_page_lruvec(page, zone);
> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         VM_BUG_ON_PAGE(!PageLRU(page), page);
>                         __ClearPageLRU(page);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -789,8 +789,8 @@ void release_pages(struct page **pages, int nr, bool cold)
>
>                 list_add(&page->lru, &pages_to_free);
>         }
> -       if (zone)
> -               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> +       if (locked_pgdat)
> +               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
>
>         mem_cgroup_uncharge_list(&pages_to_free);
>         free_hot_cold_page_list(&pages_to_free, cold);
> @@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>         VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>         VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>         VM_BUG_ON(NR_CPUS != 1 &&
> -                 !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
> +                 !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
>
>         if (!list)
>                 SetPageLRU(page_tail);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e7ffcd259cc4..86a523a761c9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -191,26 +191,42 @@ static bool sane_reclaim(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * This misses isolated pages which are not accounted for to save counters.
> + * As the data only determines if reclaim or compaction continues, it is
> + * not expected that isolated pages will be a dominating factor.
> + */
>  unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>         unsigned long nr;
>
> -       nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
> +       nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
> +       if (get_nr_swap_pages() > 0)
> +               nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
> +
> +       return nr;
> +}
> +
> +unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
> +{
> +       unsigned long nr;
> +
> +       nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
>
>         if (get_nr_swap_pages() > 0)
> -               nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
> +               nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
>
>         return nr;
>  }
>
> -bool zone_reclaimable(struct zone *zone)
> +bool pgdat_reclaimable(struct pglist_data *pgdat)
>  {
> -       return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
> -               zone_reclaimable_pages(zone) * 6;
> +       return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> +               pgdat_reclaimable_pages(pgdat) * 6;
>  }
>
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -218,7 +234,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
>         if (!mem_cgroup_disabled())
>                 return mem_cgroup_get_lru_size(lruvec, lru);
>
> -       return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> +       return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  }
>
>  /*
> @@ -877,7 +893,7 @@ static void page_check_dirty_writeback(struct page *page,
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
> -                                     struct zone *zone,
> +                                     struct pglist_data *pgdat,
>                                       struct scan_control *sc,
>                                       enum ttu_flags ttu_flags,
>                                       unsigned long *ret_nr_dirty,
> @@ -917,7 +933,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         goto keep;
>
>                 VM_BUG_ON_PAGE(PageActive(page), page);
> -               VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
>                 sc->nr_scanned++;
>
> @@ -996,7 +1011,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         /* Case 1 above */
>                         if (current_is_kswapd() &&
>                             PageReclaim(page) &&
> -                           test_bit(ZONE_WRITEBACK, &zone->flags)) {
> +                           test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>                                 nr_immediate++;
>                                 goto keep_locked;
>
> @@ -1092,7 +1107,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                          */
>                         if (page_is_file_cache(page) &&
>                                         (!current_is_kswapd() ||
> -                                        !test_bit(ZONE_DIRTY, &zone->flags))) {
> +                                        !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
>                                 /*
>                                  * Immediately reclaim when written back.
>                                  * Similar in principal to deactivate_page()
> @@ -1266,11 +1281,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>                 }
>         }
>
> -       ret = shrink_page_list(&clean_pages, zone, &sc,
> +       ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
>                         TTU_UNMAP|TTU_IGNORE_ACCESS,
>                         &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
>         list_splice(&clean_pages, page_list);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
>         return ret;
>  }
>
> @@ -1375,7 +1390,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  {
>         struct list_head *src = &lruvec->lists[lru];
>         unsigned long nr_taken = 0;
> -       unsigned long scan;
> +       unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
> +       unsigned long scan, nr_pages;
>
>         for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
>                                         !list_empty(src); scan++) {
> @@ -1388,7 +1404,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
>                 switch (__isolate_lru_page(page, mode)) {
>                 case 0:
> -                       nr_taken += hpage_nr_pages(page);
> +                       nr_pages = hpage_nr_pages(page);
> +                       nr_taken += nr_pages;
> +                       nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
>
> @@ -1405,6 +1423,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>         *nr_scanned = scan;
>         trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
>                                     nr_taken, mode, is_file_lru(lru));
> +       for (scan = 0; scan < MAX_NR_ZONES; scan++) {
> +               nr_pages = nr_zone_taken[scan];
> +               if (!nr_pages)
> +                       continue;
> +
> +               update_lru_size(lruvec, lru, scan, -nr_pages);
> +       }
>         return nr_taken;
>  }
>
> @@ -1445,7 +1470,7 @@ int isolate_lru_page(struct page *page)
>                 struct lruvec *lruvec;
>
>                 spin_lock_irq(zone_lru_lock(zone));
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 if (PageLRU(page)) {
>                         int lru = page_lru(page);
>                         get_page(page);
> @@ -1465,7 +1490,7 @@ int isolate_lru_page(struct page *page)
>   * the LRU list will go small and be scanned faster than necessary, leading to
>   * unnecessary swapping, thrashing and OOM.
>   */
> -static int too_many_isolated(struct zone *zone, int file,
> +static int too_many_isolated(struct pglist_data *pgdat, int file,
>                 struct scan_control *sc)
>  {
>         unsigned long inactive, isolated;
> @@ -1477,11 +1502,11 @@ static int too_many_isolated(struct zone *zone, int file,
>                 return 0;
>
>         if (file) {
> -               inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -               isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
>         } else {
> -               inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> -               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
>         }
>
>         /*
> @@ -1499,7 +1524,7 @@ static noinline_for_stack void
>  putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  {
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         LIST_HEAD(pages_to_free);
>
>         /*
> @@ -1512,13 +1537,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(zone_lru_lock(zone));
> +                       spin_unlock_irq(&pgdat->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(zone_lru_lock(zone));
> +                       spin_lock_irq(&pgdat->lru_lock);
>                         continue;
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 SetPageLRU(page);
>                 lru = page_lru(page);
> @@ -1535,10 +1560,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>                 }
> @@ -1582,10 +1607,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         unsigned long nr_immediate = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>
> -       while (unlikely(too_many_isolated(zone, file, sc))) {
> +       while (unlikely(too_many_isolated(pgdat, file, sc))) {
>                 congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                 /* We are about to die and free our memory. Return now. */
> @@ -1600,48 +1625,45 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc)) {
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>                 else
> -                       __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_DIRECT, nr_scanned);
>         }
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         if (nr_taken == 0)
>                 return 0;
>
> -       nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> +       nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
>                                 &nr_dirty, &nr_unqueued_dirty, &nr_congested,
>                                 &nr_writeback, &nr_immediate,
>                                 false);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         if (global_reclaim(sc)) {
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSTEAL_KSWAPD, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
>                 else
> -                       __count_zone_vm_events(PGSTEAL_DIRECT, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
>         }
>
>         putback_inactive_pages(lruvec, &page_list);
>
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&page_list);
>         free_hot_cold_page_list(&page_list, true);
> @@ -1661,7 +1683,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          * are encountered in the nr_immediate check below.
>          */
>         if (nr_writeback && nr_writeback == nr_taken)
> -               set_bit(ZONE_WRITEBACK, &zone->flags);
> +               set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * Legacy memcg will stall in page writeback so avoid forcibly
> @@ -1673,16 +1695,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>                  * backed by a congested BDI and wait_iff_congested will stall.
>                  */
>                 if (nr_dirty && nr_dirty == nr_congested)
> -                       set_bit(ZONE_CONGESTED, &zone->flags);
> +                       set_bit(PGDAT_CONGESTED, &pgdat->flags);
>
>                 /*
>                  * If dirty pages are scanned that are not queued for IO, it
>                  * implies that flushers are not keeping up. In this case, flag
> -                * the zone ZONE_DIRTY and kswapd will start writing pages from
> +                * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
>                  * reclaim context.
>                  */
>                 if (nr_unqueued_dirty == nr_taken)
> -                       set_bit(ZONE_DIRTY, &zone->flags);
> +                       set_bit(PGDAT_DIRTY, &pgdat->flags);
>
>                 /*
>                  * If kswapd scans pages marked marked for immediate
> @@ -1701,9 +1723,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          */
>         if (!sc->hibernation_mode && !current_is_kswapd() &&
>             current_may_throttle())
> -               wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +               wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>
> -       trace_mm_vmscan_lru_shrink_inactive(zone, nr_scanned, nr_reclaimed,
> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> +                       nr_scanned, nr_reclaimed,
>                         sc->priority, file);
>         return nr_reclaimed;
>  }
> @@ -1731,20 +1754,20 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                                      struct list_head *pages_to_free,
>                                      enum lru_list lru)
>  {
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long pgmoved = 0;
>         struct page *page;
>         int nr_pages;
>
>         while (!list_empty(list)) {
>                 page = lru_to_page(list);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>
>                 nr_pages = hpage_nr_pages(page);
> -               update_lru_size(lruvec, lru, nr_pages);
> +               update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
>                 list_move(&page->lru, &lruvec->lists[lru]);
>                 pgmoved += nr_pages;
>
> @@ -1754,10 +1777,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, pages_to_free);
>                 }
> @@ -1783,7 +1806,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         unsigned long nr_rotated = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
>         lru_add_drain();
>
> @@ -1792,20 +1815,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc))
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> -       __count_zone_vm_events(PGREFILL, zone, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
> +       __count_vm_events(PGREFILL, nr_scanned);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         while (!list_empty(&l_hold)) {
>                 cond_resched();
> @@ -1850,7 +1872,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         /*
>          * Move pages back to the lru list.
>          */
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         /*
>          * Count referenced pages from currently used mappings as rotated,
>          * even though only some of them are actually re-activated.  This
> @@ -1861,8 +1883,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>         move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>         move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&l_hold);
>         free_hot_cold_page_list(&l_hold, true);
> @@ -1956,7 +1978,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>         u64 fraction[2];
>         u64 denominator = 0;    /* gcc */
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long anon_prio, file_prio;
>         enum scan_balance scan_balance;
>         unsigned long anon, file;
> @@ -1977,7 +1999,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * well.
>          */
>         if (current_is_kswapd()) {
> -               if (!zone_reclaimable(zone))
> +               if (!pgdat_reclaimable(pgdat))
>                         force_scan = true;
>                 if (!mem_cgroup_online(memcg))
>                         force_scan = true;
> @@ -2023,14 +2045,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * anon pages.  Try to detect this based on file LRU size.
>          */
>         if (global_reclaim(sc)) {
> -               unsigned long zonefile;
> -               unsigned long zonefree;
> +               unsigned long pgdatfile;
> +               unsigned long pgdatfree;
> +               int z;
> +               unsigned long total_high_wmark = 0;
>
> -               zonefree = zone_page_state(zone, NR_FREE_PAGES);
> -               zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                          zone_page_state(zone, NR_INACTIVE_FILE);
> +               pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> +               pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                          node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> +               for (z = 0; z < MAX_NR_ZONES; z++) {
> +                       struct zone *zone = &pgdat->node_zones[z];
> +                       if (!populated_zone(zone))
> +                               continue;
> +
> +                       total_high_wmark += high_wmark_pages(zone);
> +               }
>
> -               if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
> +               if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
>                         scan_balance = SCAN_ANON;
>                         goto out;
>                 }
> @@ -2077,7 +2109,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
>                 lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
>                 reclaim_stat->recent_scanned[0] /= 2;
>                 reclaim_stat->recent_rotated[0] /= 2;
> @@ -2098,7 +2130,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>
>         fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
>         fp /= reclaim_stat->recent_rotated[1] + 1;
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         fraction[0] = ap;
>         fraction[1] = fp;
> @@ -2352,9 +2384,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
>          * inactive lists are large enough, continue reclaiming
>          */
>         pages_for_compaction = (2UL << sc->order);
> -       inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
> +       inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
>         if (get_nr_swap_pages() > 0)
> -               inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +               inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
>         if (sc->nr_reclaimed < pages_for_compaction &&
>                         inactive_lru_pages > pages_for_compaction)
>                 return true;
> @@ -2554,7 +2586,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>                                 continue;
>
>                         if (sc->priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;       /* Let kswapd poll it */
>
>                         /*
> @@ -2692,7 +2724,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
>                 if (!populated_zone(zone) ||
> -                   zone_reclaimable_pages(zone) == 0)
> +                   pgdat_reclaimable_pages(pgdat) == 0)
>                         continue;
>
>                 pfmemalloc_reserve += min_wmark_pages(zone);
> @@ -3000,7 +3032,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>                  * DEF_PRIORITY. Effectively, it considers them balanced so
>                  * they must be considered balanced here as well!
>                  */
> -               if (!zone_reclaimable(zone)) {
> +               if (!pgdat_reclaimable(zone->zone_pgdat)) {
>                         balanced_pages += zone->managed_pages;
>                         continue;
>                 }
> @@ -3063,6 +3095,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  {
>         unsigned long balance_gap;
>         bool lowmem_pressure;
> +       struct pglist_data *pgdat = zone->zone_pgdat;
>
>         /* Reclaim above the high watermark. */
>         sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> @@ -3087,7 +3120,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
>
>         shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
>
> -       clear_bit(ZONE_WRITEBACK, &zone->flags);
> +       /* TODO: ANOMALY */
> +       clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * If a zone reaches its high watermark, consider it to be no longer
> @@ -3095,10 +3129,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
>          * BDIs but as pressure is relieved, speculatively avoid congestion
>          * waits.
>          */
> -       if (zone_reclaimable(zone) &&
> +       if (pgdat_reclaimable(zone->zone_pgdat) &&
>             zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
> -               clear_bit(ZONE_CONGESTED, &zone->flags);
> -               clear_bit(ZONE_DIRTY, &zone->flags);
> +               clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> +               clear_bit(PGDAT_DIRTY, &pgdat->flags);
>         }
>
>         return sc->nr_scanned >= sc->nr_to_reclaim;
> @@ -3157,7 +3191,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         /*
> @@ -3184,9 +3218,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 /*
>                                  * If balanced, clear the dirty and congested
>                                  * flags
> +                                *
> +                                * TODO: ANOMALY
>                                  */
> -                               clear_bit(ZONE_CONGESTED, &zone->flags);
> -                               clear_bit(ZONE_DIRTY, &zone->flags);
> +                               clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +                               clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
>                         }
>                 }
>
> @@ -3216,7 +3252,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         sc.nr_scanned = 0;
> @@ -3612,8 +3648,8 @@ int sysctl_min_slab_ratio = 5;
>  static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
>  {
>         unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> -       unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
> -               zone_page_state(zone, NR_ACTIVE_FILE);
> +       unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +               node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         /*
>          * It's possible for there to be more file mapped pages than
> @@ -3716,7 +3752,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>             zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>                 return ZONE_RECLAIM_FULL;
>
> -       if (!zone_reclaimable(zone))
> +       if (!pgdat_reclaimable(zone->zone_pgdat))
>                 return ZONE_RECLAIM_FULL;
>
>         /*
> @@ -3795,7 +3831,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
>                         zone = pagezone;
>                         spin_lock_irq(zone_lru_lock(zone));
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3345d396a99b..de0c17076270 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -936,11 +936,8 @@ const char * const vmstat_text[] = {
>         /* enum zone_stat_item countes */
>         "nr_free_pages",
>         "nr_alloc_batch",
> -       "nr_inactive_anon",
> -       "nr_active_anon",
> -       "nr_inactive_file",
> -       "nr_active_file",
> -       "nr_unevictable",
> +       "nr_zone_anon_lru",
> +       "nr_zone_file_lru",
>         "nr_mlock",
>         "nr_anon_pages",
>         "nr_mapped",
> @@ -956,12 +953,9 @@ const char * const vmstat_text[] = {
>         "nr_vmscan_write",
>         "nr_vmscan_immediate_reclaim",
>         "nr_writeback_temp",
> -       "nr_isolated_anon",
> -       "nr_isolated_file",
>         "nr_shmem",
>         "nr_dirtied",
>         "nr_written",
> -       "nr_pages_scanned",
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         "nr_zspages",
>  #endif
> @@ -981,6 +975,16 @@ const char * const vmstat_text[] = {
>         "nr_shmem_pmdmapped",
>         "nr_free_cma",
>
> +       /* Node-based counters */
> +       "nr_inactive_anon",
> +       "nr_active_anon",
> +       "nr_inactive_file",
> +       "nr_active_file",
> +       "nr_unevictable",
> +       "nr_isolated_anon",
> +       "nr_isolated_file",
> +       "nr_pages_scanned",
> +
>         /* enum writeback_stat_item counters */
>         "nr_dirty_threshold",
>         "nr_dirty_background_threshold",
> @@ -1002,11 +1006,11 @@ const char * const vmstat_text[] = {
>         "pgmajfault",
>         "pglazyfreed",
>
> -       TEXTS_FOR_ZONES("pgrefill")
> -       TEXTS_FOR_ZONES("pgsteal_kswapd")
> -       TEXTS_FOR_ZONES("pgsteal_direct")
> -       TEXTS_FOR_ZONES("pgscan_kswapd")
> -       TEXTS_FOR_ZONES("pgscan_direct")
> +       "pgrefill",
> +       "pgsteal_kswapd",
> +       "pgsteal_direct",
> +       "pgscan_kswapd",
> +       "pgscan_direct",
>         "pgscan_direct_throttle",
>
>  #ifdef CONFIG_NUMA
> @@ -1434,7 +1438,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        min      %lu"
>                    "\n        low      %lu"
>                    "\n        high     %lu"
> -                  "\n        scanned  %lu"
> +                  "\n   node_scanned  %lu"
>                    "\n        spanned  %lu"
>                    "\n        present  %lu"
>                    "\n        managed  %lu",
> @@ -1442,13 +1446,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    min_wmark_pages(zone),
>                    low_wmark_pages(zone),
>                    high_wmark_pages(zone),
> -                  zone_page_state(zone, NR_PAGES_SCANNED),
> +                  node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>                    zone->spanned_pages,
>                    zone->present_pages,
>                    zone->managed_pages);
>
>         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> -               seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
> +               seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
>                                 zone_page_state(zone, i));
>
>         seq_printf(m,
> @@ -1478,12 +1482,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  #endif
>         }
>         seq_printf(m,
> -                  "\n  all_unreclaimable: %u"
> -                  "\n  start_pfn:         %lu"
> -                  "\n  inactive_ratio:    %u",
> -                  !zone_reclaimable(zone),
> +                  "\n  node_unreclaimable:  %u"
> +                  "\n  start_pfn:           %lu"
> +                  "\n  node_inactive_ratio: %u",
> +                  !pgdat_reclaimable(zone->zone_pgdat),
>                    zone->zone_start_pfn,
> -                  zone->inactive_ratio);
> +                  zone->zone_pgdat->inactive_ratio);
>         seq_putc(m, '\n');
>  }
>
> @@ -1574,7 +1578,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  {
>         unsigned long *l = arg;
>         unsigned long off = l - (unsigned long *)m->private;
> -
>         seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
>         return 0;
>  }
> diff --git a/mm/workingset.c b/mm/workingset.c
> index ba972ac2dfdd..ebe14445809a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -355,8 +355,8 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>                 pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>                                                      LRU_ALL_FILE);
>         } else {
> -               pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
> -                       sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
> +               pages = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
> +                       node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
>         }
>
>         /*
> --
> 2.6.4
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-08-04 20:59     ` James Hogan
  0 siblings, 0 replies; 220+ messages in thread
From: James Hogan @ 2016-08-04 20:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

On 8 July 2016 at 10:34, Mel Gorman <mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt@public.gmane.org> wrote:
> This moves the LRU lists from the zone to the node and related data
> such as counters, tracing, congestion tracking and writeback tracking.
> Unfortunately, due to reclaim and compaction retry logic, it is necessary
> to account for the number of LRU pages on both zone and node logic.
> Most reclaim logic is based on the node counters but the retry logic uses
> the zone counters which do not distinguish inactive and active sizes.
> It would be possible to leave the LRU counters on a per-zone basis but
> it's a heavier calculation across multiple cache lines that is much more
> frequent than the retry checks.
>
> Other than the LRU counters, this is mostly a mechanical patch but note
> that it introduces a number of anomalies.  For example, the scans are
> per-zone but using per-node counters.  We also mark a node as congested
> when a zone is congested.  This causes weird problems that are fixed later
> but is easier to review.
>
> In the event that there is excessive overhead on 32-bit systems due to
> the nodes being on LRU then there are two potential solutions
>
> 1. Long-term isolation of highmem pages when reclaim is lowmem
>
>    When pages are skipped, they are immediately added back onto the LRU
>    list. If lowmem reclaim persisted for long periods of time, the same
>    highmem pages get continually scanned. The idea would be that lowmem
>    keeps those pages on a separate list until a reclaim for highmem pages
>    arrives that splices the highmem pages back onto the LRU. It potentially
>    could be implemented similar to the UNEVICTABLE list.
>
>    That would reduce the skip rate with the potential corner case is that
>    highmem pages have to be scanned and reclaimed to free lowmem slab pages.
>
> 2. Linear scan lowmem pages if the initial LRU shrink fails
>
>    This will break LRU ordering but may be preferable and faster during
>    memory pressure than skipping LRU pages.
>
> Signed-off-by: Mel Gorman <mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt@public.gmane.org>
> Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Acked-by: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>

This breaks boot on metag architecture:
Oops: err 0007 (Data access general read/write fault) addr 00233008 [#1]

It appears to be in node_page_state_snapshot() (via
pgdat_reclaimable()), and have come via mm_init. Here's the relevant
bit of the backtrace:

    node_page_state_snapshot@0x4009c884(enum node_stat_item item =
???, struct pglist_data * pgdat = ???) + 0x48
    pgdat_reclaimable(struct pglist_data * pgdat = 0x402517a0)
    show_free_areas(unsigned int filter = 0) + 0x2cc
    show_mem(unsigned int filter = 0) + 0x18
    mm_init@0x4025c3d4()
    start_kernel() + 0x204

__per_cpu_offset[0] == 0x233000 (close to bad addr),
pgdat->per_cpu_nodestats = NULL. and setup_per_cpu_pageset()
definitely hasn't been called yet (mm_init is called before
setup_per_cpu_pageset()).

Any ideas what the correct solution is (and why presumably others
haven't seen the same issue on other architectures?).

Thanks
James

> ---
>  arch/tile/mm/pgtable.c                    |   8 +-
>  drivers/base/node.c                       |  19 +--
>  drivers/staging/android/lowmemorykiller.c |   8 +-
>  include/linux/backing-dev.h               |   2 +-
>  include/linux/memcontrol.h                |  18 +--
>  include/linux/mm_inline.h                 |  21 ++-
>  include/linux/mmzone.h                    |  68 +++++----
>  include/linux/swap.h                      |   1 +
>  include/linux/vm_event_item.h             |  10 +-
>  include/linux/vmstat.h                    |  17 +++
>  include/trace/events/vmscan.h             |  12 +-
>  kernel/power/snapshot.c                   |  10 +-
>  mm/backing-dev.c                          |  15 +-
>  mm/compaction.c                           |  18 +--
>  mm/huge_memory.c                          |   2 +-
>  mm/internal.h                             |   2 +-
>  mm/khugepaged.c                           |   4 +-
>  mm/memcontrol.c                           |  17 +--
>  mm/memory-failure.c                       |   4 +-
>  mm/memory_hotplug.c                       |   2 +-
>  mm/mempolicy.c                            |   2 +-
>  mm/migrate.c                              |  21 +--
>  mm/mlock.c                                |   2 +-
>  mm/page-writeback.c                       |   8 +-
>  mm/page_alloc.c                           |  68 +++++----
>  mm/swap.c                                 |  50 +++----
>  mm/vmscan.c                               | 226 +++++++++++++++++-------------
>  mm/vmstat.c                               |  47 ++++---
>  mm/workingset.c                           |   4 +-
>  29 files changed, 386 insertions(+), 300 deletions(-)
>
> diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
> index c4d5bf841a7f..9e389213580d 100644
> --- a/arch/tile/mm/pgtable.c
> +++ b/arch/tile/mm/pgtable.c
> @@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
>         struct zone *zone;
>
>         pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
> -              (global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE)),
> -              (global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE)),
> +              (global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE)),
> +              (global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE)),
>                global_page_state(NR_FILE_DIRTY),
>                global_page_state(NR_WRITEBACK),
>                global_page_state(NR_UNSTABLE_NFS),
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 92d8e090c5b3..b7f01a4a642d 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  {
>         int n;
>         int nid = dev->id;
> +       struct pglist_data *pgdat = NODE_DATA(nid);
>         struct sysinfo i;
>
>         si_meminfo_node(&i, nid);
> @@ -74,15 +75,15 @@ static ssize_t node_read_meminfo(struct device *dev,
>                        nid, K(i.totalram),
>                        nid, K(i.freeram),
>                        nid, K(i.totalram - i.freeram),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
> -                               sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
> -                      nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON) +
> +                               node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                      nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                      nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
>                        nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
>
>  #ifdef CONFIG_HIGHMEM
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> index 24d2745e9437..93dbcc38eb0f 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -72,10 +72,10 @@ static unsigned long lowmem_deathpending_timeout;
>  static unsigned long lowmem_count(struct shrinker *s,
>                                   struct shrink_control *sc)
>  {
> -       return global_page_state(NR_ACTIVE_ANON) +
> -               global_page_state(NR_ACTIVE_FILE) +
> -               global_page_state(NR_INACTIVE_ANON) +
> -               global_page_state(NR_INACTIVE_FILE);
> +       return global_node_page_state(NR_ACTIVE_ANON) +
> +               global_node_page_state(NR_ACTIVE_FILE) +
> +               global_node_page_state(NR_INACTIVE_ANON) +
> +               global_node_page_state(NR_INACTIVE_FILE);
>  }
>
>  static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c82794f20110..491a91717788 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -197,7 +197,7 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
>  }
>
>  long congestion_wait(int sync, long timeout);
> -long wait_iff_congested(struct zone *zone, int sync, long timeout);
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout);
>  int pdflush_proc_obsolete(struct ctl_table *table, int write,
>                 void __user *buffer, size_t *lenp, loff_t *ppos);
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 104efa6874db..68f1121c8fe7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -340,7 +340,7 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = zone_lruvec(zone);
>                 goto out;
>         }
>
> @@ -349,15 +349,15 @@ static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>  out:
>         /*
>          * Since a node can be onlined after the mem_cgroup was created,
> -        * we have to be prepared to initialize lruvec->zone here;
> +        * we have to be prepared to initialize lruvec->pgdat here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != zone->zone_pgdat))
> +               lruvec->pgdat = zone->zone_pgdat;
>         return lruvec;
>  }
>
> -struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> +struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
>  bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> @@ -438,7 +438,7 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
>  int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
>
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -               int nr_pages);
> +               enum zone_type zid, int nr_pages);
>
>  unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>                                            int nid, unsigned int lru_mask);
> @@ -613,13 +613,13 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
>  static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>                                                     struct mem_cgroup *memcg)
>  {
> -       return &zone->lruvec;
> +       return zone_lruvec(zone);
>  }
>
>  static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> -                                                   struct zone *zone)
> +                                                   struct pglist_data *pgdat)
>  {
> -       return &zone->lruvec;
> +       return &pgdat->lruvec;
>  }
>
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 5bd29ba4f174..9aadcc781857 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -23,25 +23,32 @@ static inline int page_is_file_cache(struct page *page)
>  }
>
>  static __always_inline void __update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
> -       __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +       __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
> +       __mod_zone_page_state(&pgdat->node_zones[zid],
> +               NR_ZONE_LRU_BASE + !!is_file_lru(lru),
> +               nr_pages);
>  }
>
>  static __always_inline void update_lru_size(struct lruvec *lruvec,
> -                               enum lru_list lru, int nr_pages)
> +                               enum lru_list lru, enum zone_type zid,
> +                               int nr_pages)
>  {
>  #ifdef CONFIG_MEMCG
> -       mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
> +       mem_cgroup_update_lru_size(lruvec, lru, zid, nr_pages);
>  #else
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>  #endif
>  }
>
>  static __always_inline void add_page_to_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
> -       update_lru_size(lruvec, lru, hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
>         list_add(&page->lru, &lruvec->lists[lru]);
>  }
>
> @@ -49,7 +56,7 @@ static __always_inline void del_page_from_lru_list(struct page *page,
>                                 struct lruvec *lruvec, enum lru_list lru)
>  {
>         list_del(&page->lru);
> -       update_lru_size(lruvec, lru, -hpage_nr_pages(page));
> +       update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
>  }
>
>  /**
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cfa870107abe..d4f5cac0a8c3 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -111,12 +111,9 @@ enum zone_stat_item {
>         /* First 128 byte cacheline (assuming 64 bit words) */
>         NR_FREE_PAGES,
>         NR_ALLOC_BATCH,
> -       NR_LRU_BASE,
> -       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> -       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> -       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> -       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> -       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
> +       NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
> +       NR_ZONE_LRU_FILE,
>         NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
>         NR_ANON_PAGES,  /* Mapped anonymous pages */
>         NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> @@ -134,12 +131,9 @@ enum zone_stat_item {
>         NR_VMSCAN_WRITE,
>         NR_VMSCAN_IMMEDIATE,    /* Prioritise for reclaim when writeback ends */
>         NR_WRITEBACK_TEMP,      /* Writeback using temporary buffers */
> -       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> -       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
>         NR_SHMEM,               /* shmem pages (included tmpfs/GEM pages) */
>         NR_DIRTIED,             /* page dirtyings since bootup */
>         NR_WRITTEN,             /* page writings since bootup */
> -       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         NR_ZSPAGES,             /* allocated in zsmalloc */
>  #endif
> @@ -161,6 +155,15 @@ enum zone_stat_item {
>         NR_VM_ZONE_STAT_ITEMS };
>
>  enum node_stat_item {
> +       NR_LRU_BASE,
> +       NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
> +       NR_ACTIVE_ANON,         /*  "     "     "   "       "         */
> +       NR_INACTIVE_FILE,       /*  "     "     "   "       "         */
> +       NR_ACTIVE_FILE,         /*  "     "     "   "       "         */
> +       NR_UNEVICTABLE,         /*  "     "     "   "       "         */
> +       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> +       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
> +       NR_PAGES_SCANNED,       /* pages scanned since last reclaim */
>         NR_VM_NODE_STAT_ITEMS
>  };
>
> @@ -219,7 +222,7 @@ struct lruvec {
>         /* Evictions & activations on the inactive file list */
>         atomic_long_t                   inactive_age;
>  #ifdef CONFIG_MEMCG
> -       struct zone                     *zone;
> +       struct pglist_data *pgdat;
>  #endif
>  };
>
> @@ -357,13 +360,6 @@ struct zone {
>  #ifdef CONFIG_NUMA
>         int node;
>  #endif
> -
> -       /*
> -        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> -        * this zone's LRU.  Maintained by the pageout code.
> -        */
> -       unsigned int inactive_ratio;
> -
>         struct pglist_data      *zone_pgdat;
>         struct per_cpu_pageset __percpu *pageset;
>
> @@ -495,9 +491,6 @@ struct zone {
>
>         /* Write-intensive fields used by page reclaim */
>
> -       /* Fields commonly accessed by the page reclaim scanner */
> -       struct lruvec           lruvec;
> -
>         /*
>          * When free pages are below this point, additional steps are taken
>          * when reading the number of free pages to avoid per-cpu counter
> @@ -537,17 +530,20 @@ struct zone {
>
>  enum zone_flags {
>         ZONE_RECLAIM_LOCKED,            /* prevents concurrent reclaim */
> -       ZONE_CONGESTED,                 /* zone has many dirty pages backed by
> +       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
> +};
> +
> +enum pgdat_flags {
> +       PGDAT_CONGESTED,                /* pgdat has many dirty pages backed by
>                                          * a congested BDI
>                                          */
> -       ZONE_DIRTY,                     /* reclaim scanning has recently found
> +       PGDAT_DIRTY,                    /* reclaim scanning has recently found
>                                          * many dirty file pages at the tail
>                                          * of the LRU.
>                                          */
> -       ZONE_WRITEBACK,                 /* reclaim scanning has recently found
> +       PGDAT_WRITEBACK,                /* reclaim scanning has recently found
>                                          * many pages under writeback
>                                          */
> -       ZONE_FAIR_DEPLETED,             /* fair zone policy batch depleted */
>  };
>
>  static inline unsigned long zone_end_pfn(const struct zone *zone)
> @@ -707,6 +703,19 @@ typedef struct pglist_data {
>         unsigned long split_queue_len;
>  #endif
>
> +       /* Fields commonly accessed by the page reclaim scanner */
> +       struct lruvec           lruvec;
> +
> +       /*
> +        * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
> +        * this node's LRU.  Maintained by the pageout code.
> +        */
> +       unsigned int inactive_ratio;
> +
> +       unsigned long           flags;
> +
> +       ZONE_PADDING(_pad2_)
> +
>         /* Per-node vmstats */
>         struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>         atomic_long_t           vm_stat[NR_VM_NODE_STAT_ITEMS];
> @@ -728,6 +737,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
>         return &zone->zone_pgdat->lru_lock;
>  }
>
> +static inline struct lruvec *zone_lruvec(struct zone *zone)
> +{
> +       return &zone->zone_pgdat->lruvec;
> +}
> +
>  static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
>  {
>         return pgdat->node_start_pfn + pgdat->node_spanned_pages;
> @@ -779,12 +793,12 @@ extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
>
>  extern void lruvec_init(struct lruvec *lruvec);
>
> -static inline struct zone *lruvec_zone(struct lruvec *lruvec)
> +static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
>  {
>  #ifdef CONFIG_MEMCG
> -       return lruvec->zone;
> +       return lruvec->pgdat;
>  #else
> -       return container_of(lruvec, struct zone, lruvec);
> +       return container_of(lruvec, struct pglist_data, lruvec);
>  #endif
>  }
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0af2bb2028fd..c82f916008b7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>
>  /* linux/mm/vmscan.c */
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
> +extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 42604173f122..1798ff542517 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -26,11 +26,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 PGFREE, PGACTIVATE, PGDEACTIVATE,
>                 PGFAULT, PGMAJFAULT,
>                 PGLAZYFREED,
> -               FOR_ALL_ZONES(PGREFILL),
> -               FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> -               FOR_ALL_ZONES(PGSTEAL_DIRECT),
> -               FOR_ALL_ZONES(PGSCAN_KSWAPD),
> -               FOR_ALL_ZONES(PGSCAN_DIRECT),
> +               PGREFILL,
> +               PGSTEAL_KSWAPD,
> +               PGSTEAL_DIRECT,
> +               PGSCAN_KSWAPD,
> +               PGSCAN_DIRECT,
>                 PGSCAN_DIRECT_THROTTLE,
>  #ifdef CONFIG_NUMA
>                 PGSCAN_ZONE_RECLAIM_FAILED,
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index d1744aa3ab9c..fee321c98550 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -178,6 +178,23 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
>         return x;
>  }
>
> +static inline unsigned long node_page_state_snapshot(pg_data_t *pgdat,
> +                                       enum node_stat_item item)
> +{
> +       long x = atomic_long_read(&pgdat->vm_stat[item]);
> +
> +#ifdef CONFIG_SMP
> +       int cpu;
> +       for_each_online_cpu(cpu)
> +               x += per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->vm_node_stat_diff[item];
> +
> +       if (x < 0)
> +               x = 0;
> +#endif
> +       return x;
> +}
> +
> +
>  #ifdef CONFIG_NUMA
>  extern unsigned long sum_zone_node_page_state(int node,
>                                                 enum zone_stat_item item);
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 0101ef37f1ee..897f1aa1ee5f 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -352,15 +352,14 @@ TRACE_EVENT(mm_vmscan_writepage,
>
>  TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>
> -       TP_PROTO(struct zone *zone,
> +       TP_PROTO(int nid,
>                 unsigned long nr_scanned, unsigned long nr_reclaimed,
>                 int priority, int file),
>
> -       TP_ARGS(zone, nr_scanned, nr_reclaimed, priority, file),
> +       TP_ARGS(nid, nr_scanned, nr_reclaimed, priority, file),
>
>         TP_STRUCT__entry(
>                 __field(int, nid)
> -               __field(int, zid)
>                 __field(unsigned long, nr_scanned)
>                 __field(unsigned long, nr_reclaimed)
>                 __field(int, priority)
> @@ -368,16 +367,15 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>         ),
>
>         TP_fast_assign(
> -               __entry->nid = zone_to_nid(zone);
> -               __entry->zid = zone_idx(zone);
> +               __entry->nid = nid;
>                 __entry->nr_scanned = nr_scanned;
>                 __entry->nr_reclaimed = nr_reclaimed;
>                 __entry->priority = priority;
>                 __entry->reclaim_flags = trace_shrink_flags(file);
>         ),
>
> -       TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> -               __entry->nid, __entry->zid,
> +       TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
> +               __entry->nid,
>                 __entry->nr_scanned, __entry->nr_reclaimed,
>                 __entry->priority,
>                 show_reclaim_flags(__entry->reclaim_flags))
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 3a970604308f..24a06bc23f85 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1525,11 +1525,11 @@ static unsigned long minimum_image_size(unsigned long saveable)
>         unsigned long size;
>
>         size = global_page_state(NR_SLAB_RECLAIMABLE)
> -               + global_page_state(NR_ACTIVE_ANON)
> -               + global_page_state(NR_INACTIVE_ANON)
> -               + global_page_state(NR_ACTIVE_FILE)
> -               + global_page_state(NR_INACTIVE_FILE)
> -               - global_page_state(NR_FILE_MAPPED);
> +               + global_node_page_state(NR_ACTIVE_ANON)
> +               + global_node_page_state(NR_INACTIVE_ANON)
> +               + global_node_page_state(NR_ACTIVE_FILE)
> +               + global_node_page_state(NR_INACTIVE_FILE)
> +               - global_node_page_state(NR_FILE_MAPPED);
>
>         return saveable <= size ? 0 : saveable - size;
>  }
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index f53b23ab7ed7..a8c3af46bd3d 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -982,24 +982,24 @@ long congestion_wait(int sync, long timeout)
>  EXPORT_SYMBOL(congestion_wait);
>
>  /**
> - * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
> - * @zone: A zone to check if it is heavily congested
> + * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
> + * @pgdat: A pgdat to check if it is heavily congested
>   * @sync: SYNC or ASYNC IO
>   * @timeout: timeout in jiffies
>   *
>   * In the event of a congested backing_dev (any backing_dev) and the given
> - * @zone has experienced recent congestion, this waits for up to @timeout
> + * @pgdat has experienced recent congestion, this waits for up to @timeout
>   * jiffies for either a BDI to exit congestion of the given @sync queue
>   * or a write to complete.
>   *
> - * In the absence of zone congestion, cond_resched() is called to yield
> + * In the absence of pgdat congestion, cond_resched() is called to yield
>   * the processor if necessary but otherwise does not sleep.
>   *
>   * The return value is 0 if the sleep is for the full timeout. Otherwise,
>   * it is the number of jiffies that were still remaining when the function
>   * returned. return_value == timeout implies the function did not sleep.
>   */
> -long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +long wait_iff_congested(struct pglist_data *pgdat, int sync, long timeout)
>  {
>         long ret;
>         unsigned long start = jiffies;
> @@ -1008,12 +1008,13 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>
>         /*
>          * If there is no congestion, or heavy congestion is not being
> -        * encountered in the current zone, yield if necessary instead
> +        * encountered in the current pgdat, yield if necessary instead
>          * of sleeping on the congestion queue
>          */
>         if (atomic_read(&nr_wb_congested[sync]) == 0 ||
> -           !test_bit(ZONE_CONGESTED, &zone->flags)) {
> +           !test_bit(PGDAT_CONGESTED, &pgdat->flags)) {
>                 cond_resched();
> +
>                 /* In case we scheduled, work out time remaining */
>                 ret = timeout - (jiffies - start);
>                 if (ret < 0)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7607efb7bee2..a0bd85712516 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -646,8 +646,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
>         list_for_each_entry(page, &cc->migratepages, lru)
>                 count[!!page_is_file_cache(page)]++;
>
> -       mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON, count[0]);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, count[1]);
>  }
>
>  /* Similar to reclaim, but different enough that they don't share logic */
> @@ -655,12 +655,12 @@ static bool too_many_isolated(struct zone *zone)
>  {
>         unsigned long active, inactive, isolated;
>
> -       inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_INACTIVE_ANON);
> -       active = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                                       zone_page_state(zone, NR_ACTIVE_ANON);
> -       isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
> -                                       zone_page_state(zone, NR_ISOLATED_ANON);
> +       inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
> +       active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
> +       isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
> +                       node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
>
>         return isolated > (inactive + active) / 2;
>  }
> @@ -856,7 +856,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>                         }
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 /* Try isolate the page */
>                 if (__isolate_lru_page(page, isolate_mode) != 0)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2f997328ae64..5d5b2207cfd2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1830,7 +1830,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>         pgoff_t end = -1;
>         int i;
>
> -       lruvec = mem_cgroup_page_lruvec(head, zone);
> +       lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
>
>         /* complete memcg works before add pages to LRU */
>         mem_cgroup_split_huge_fixup(head);
> diff --git a/mm/internal.h b/mm/internal.h
> index 9b6a6c43ac39..2f80d0343c56 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -78,7 +78,7 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> -extern bool zone_reclaimable(struct zone *zone);
> +extern bool pgdat_reclaimable(struct pglist_data *pgdat);
>
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 93d5f87c00d5..d7a49f665f04 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -480,7 +480,7 @@ void __khugepaged_exit(struct mm_struct *mm)
>  static void release_pte_page(struct page *page)
>  {
>         /* 0 stands for page_is_file_cache(page) == false */
> -       dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +       dec_node_page_state(page, NR_ISOLATED_ANON + 0);
>         unlock_page(page);
>         putback_lru_page(page);
>  }
> @@ -576,7 +576,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         goto out;
>                 }
>                 /* 0 stands for page_is_file_cache(page) == false */
> -               inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
> +               inc_node_page_state(page, NR_ISOLATED_ANON + 0);
>                 VM_BUG_ON_PAGE(!PageLocked(page), page);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9b70f9ca8ddf..50c86ad121bc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -943,14 +943,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>   * and putback protocol: the LRU lock must be held, and the page must
>   * either be PageLRU() or the caller must have isolated/allocated it.
>   */
> -struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> +struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
>  {
>         struct mem_cgroup_per_zone *mz;
>         struct mem_cgroup *memcg;
>         struct lruvec *lruvec;
>
>         if (mem_cgroup_disabled()) {
> -               lruvec = &zone->lruvec;
> +               lruvec = &pgdat->lruvec;
>                 goto out;
>         }
>
> @@ -970,8 +970,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>          * we have to be prepared to initialize lruvec->zone here;
>          * and if offlined then reonlined, we need to reinitialize it.
>          */
> -       if (unlikely(lruvec->zone != zone))
> -               lruvec->zone = zone;
> +       if (unlikely(lruvec->pgdat != pgdat))
> +               lruvec->pgdat = pgdat;
>         return lruvec;
>  }
>
> @@ -979,6 +979,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * mem_cgroup_update_lru_size - account for adding or removing an lru page
>   * @lruvec: mem_cgroup per zone lru vector
>   * @lru: index of lru list the page is sitting on
> + * @zid: Zone ID of the zone pages have been added to
>   * @nr_pages: positive when adding or negative when removing
>   *
>   * This function must be called under lru_lock, just before a page is added
> @@ -986,14 +987,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   * so as to allow it to check that lru_size 0 is consistent with list_empty).
>   */
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> -                               int nr_pages)
> +                               enum zone_type zid, int nr_pages)
>  {
>         struct mem_cgroup_per_zone *mz;
>         unsigned long *lru_size;
>         long size;
>         bool empty;
>
> -       __update_lru_size(lruvec, lru, nr_pages);
> +       __update_lru_size(lruvec, lru, zid, nr_pages);
>
>         if (mem_cgroup_disabled())
>                 return;
> @@ -2069,7 +2070,7 @@ static void lock_page_lru(struct page *page, int *isolated)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
>                 *isolated = 1;
> @@ -2084,7 +2085,7 @@ static void unlock_page_lru(struct page *page, int isolated)
>         if (isolated) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>                 add_page_to_lru_list(page, lruvec, page_lru(page));
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2fcca6b0e005..11de752ccaf5 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1663,7 +1663,7 @@ static int __soft_offline_page(struct page *page, int flags)
>         put_hwpoison_page(page);
>         if (!ret) {
>                 LIST_HEAD(pagelist);
> -               inc_zone_page_state(page, NR_ISOLATED_ANON +
> +               inc_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                 list_add(&page->lru, &pagelist);
>                 ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> @@ -1671,7 +1671,7 @@ static int __soft_offline_page(struct page *page, int flags)
>                 if (ret) {
>                         if (!list_empty(&pagelist)) {
>                                 list_del(&page->lru);
> -                               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                                 page_is_file_cache(page));
>                                 putback_lru_page(page);
>                         }
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 82d0b98d27f8..c5278360ca66 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1586,7 +1586,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>                         put_page(page);
>                         list_add_tail(&page->lru, &source);
>                         move_pages--;
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>
>                 } else {
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 53e40d3f3933..d8c4e38fb5f4 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -962,7 +962,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
>         if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
>                 if (!isolate_lru_page(page)) {
>                         list_add_tail(&page->lru, pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>         }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2232f6923cc7..3033dae33a0a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -168,7 +168,7 @@ void putback_movable_pages(struct list_head *l)
>                         continue;
>                 }
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>                 /*
>                  * We isolated non-lru movable page so here we can use
> @@ -1119,7 +1119,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>                  * restored.
>                  */
>                 list_del(&page->lru);
> -               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +               dec_node_page_state(page, NR_ISOLATED_ANON +
>                                 page_is_file_cache(page));
>         }
>
> @@ -1460,7 +1460,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>                 err = isolate_lru_page(page);
>                 if (!err) {
>                         list_add_tail(&page->lru, &pagelist);
> -                       inc_zone_page_state(page, NR_ISOLATED_ANON +
> +                       inc_node_page_state(page, NR_ISOLATED_ANON +
>                                             page_is_file_cache(page));
>                 }
>  put_and_set:
> @@ -1726,15 +1726,16 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>                                    unsigned long nr_migrate_pages)
>  {
>         int z;
> +
> +       if (!pgdat_reclaimable(pgdat))
> +               return false;
> +
>         for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>                 struct zone *zone = pgdat->node_zones + z;
>
>                 if (!populated_zone(zone))
>                         continue;
>
> -               if (!zone_reclaimable(zone))
> -                       continue;
> -
>                 /* Avoid waking kswapd by allocating pages_to_migrate pages. */
>                 if (!zone_watermark_ok(zone, 0,
>                                        high_wmark_pages(zone) +
> @@ -1828,7 +1829,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>         }
>
>         page_lru = page_is_file_cache(page);
> -       mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
> +       mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
>                                 hpage_nr_pages(page));
>
>         /*
> @@ -1886,7 +1887,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>         if (nr_remaining) {
>                 if (!list_empty(&migratepages)) {
>                         list_del(&page->lru);
> -                       dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                       dec_node_page_state(page, NR_ISOLATED_ANON +
>                                         page_is_file_cache(page));
>                         putback_lru_page(page);
>                 }
> @@ -1979,7 +1980,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>                 /* Retake the callers reference and putback on LRU */
>                 get_page(page);
>                 putback_lru_page(page);
> -               mod_zone_page_state(page_zone(page),
> +               mod_node_page_state(page_pgdat(page),
>                          NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
>
>                 goto out_unlock;
> @@ -2030,7 +2031,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>         count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
>         count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
>
> -       mod_zone_page_state(page_zone(page),
> +       mod_node_page_state(page_pgdat(page),
>                         NR_ISOLATED_ANON + page_lru,
>                         -HPAGE_PMD_NR);
>         return isolated;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 997f63082ff5..14645be06e30 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -103,7 +103,7 @@ static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>         if (PageLRU(page)) {
>                 struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 if (getpage)
>                         get_page(page);
>                 ClearPageLRU(page);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d578d2a56b19..0ada2b2954b0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -285,8 +285,8 @@ static unsigned long zone_dirtyable_memory(struct zone *zone)
>          */
>         nr_pages -= min(nr_pages, zone->totalreserve_pages);
>
> -       nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
> -       nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
> +       nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         return nr_pages;
>  }
> @@ -348,8 +348,8 @@ static unsigned long global_dirtyable_memory(void)
>          */
>         x -= min(x, totalreserve_pages);
>
> -       x += global_page_state(NR_INACTIVE_FILE);
> -       x += global_page_state(NR_ACTIVE_FILE);
> +       x += global_node_page_state(NR_INACTIVE_FILE);
> +       x += global_node_page_state(NR_ACTIVE_FILE);
>
>         if (!vm_highmem_is_dirtyable)
>                 x -= highmem_dirtyable_memory(x);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48b5414009ac..b84b85ae54ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1090,9 +1090,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>
>         spin_lock(&zone->lock);
>         isolated_pageblocks = has_isolate_pageblock(zone);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         while (count) {
>                 struct page *page;
> @@ -1147,9 +1147,9 @@ static void free_one_page(struct zone *zone,
>  {
>         unsigned long nr_scanned;
>         spin_lock(&zone->lock);
> -       nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> +       nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
>         if (nr_scanned)
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> +               __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
> @@ -4331,6 +4331,7 @@ void show_free_areas(unsigned int filter)
>         unsigned long free_pcp = 0;
>         int cpu;
>         struct zone *zone;
> +       pg_data_t *pgdat;
>
>         for_each_populated_zone(zone) {
>                 if (skip_free_areas_node(filter, zone_to_nid(zone)))
> @@ -4349,13 +4350,13 @@ void show_free_areas(unsigned int filter)
>                 " anon_thp: %lu shmem_thp: %lu shmem_pmdmapped: %lu\n"
>  #endif
>                 " free:%lu free_pcp:%lu free_cma:%lu\n",
> -               global_page_state(NR_ACTIVE_ANON),
> -               global_page_state(NR_INACTIVE_ANON),
> -               global_page_state(NR_ISOLATED_ANON),
> -               global_page_state(NR_ACTIVE_FILE),
> -               global_page_state(NR_INACTIVE_FILE),
> -               global_page_state(NR_ISOLATED_FILE),
> -               global_page_state(NR_UNEVICTABLE),
> +               global_node_page_state(NR_ACTIVE_ANON),
> +               global_node_page_state(NR_INACTIVE_ANON),
> +               global_node_page_state(NR_ISOLATED_ANON),
> +               global_node_page_state(NR_ACTIVE_FILE),
> +               global_node_page_state(NR_INACTIVE_FILE),
> +               global_node_page_state(NR_ISOLATED_FILE),
> +               global_node_page_state(NR_UNEVICTABLE),
>                 global_page_state(NR_FILE_DIRTY),
>                 global_page_state(NR_WRITEBACK),
>                 global_page_state(NR_UNSTABLE_NFS),
> @@ -4374,6 +4375,28 @@ void show_free_areas(unsigned int filter)
>                 free_pcp,
>                 global_page_state(NR_FREE_CMA_PAGES));
>
> +       for_each_online_pgdat(pgdat) {
> +               printk("Node %d"
> +                       " active_anon:%lukB"
> +                       " inactive_anon:%lukB"
> +                       " active_file:%lukB"
> +                       " inactive_file:%lukB"
> +                       " unevictable:%lukB"
> +                       " isolated(anon):%lukB"
> +                       " isolated(file):%lukB"
> +                       " all_unreclaimable? %s"
> +                       "\n",
> +                       pgdat->node_id,
> +                       K(node_page_state(pgdat, NR_ACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_ANON)),
> +                       K(node_page_state(pgdat, NR_ACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_INACTIVE_FILE)),
> +                       K(node_page_state(pgdat, NR_UNEVICTABLE)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_ANON)),
> +                       K(node_page_state(pgdat, NR_ISOLATED_FILE)),
> +                       !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +       }
> +
>         for_each_populated_zone(zone) {
>                 int i;
>
> @@ -4390,13 +4413,6 @@ void show_free_areas(unsigned int filter)
>                         " min:%lukB"
>                         " low:%lukB"
>                         " high:%lukB"
> -                       " active_anon:%lukB"
> -                       " inactive_anon:%lukB"
> -                       " active_file:%lukB"
> -                       " inactive_file:%lukB"
> -                       " unevictable:%lukB"
> -                       " isolated(anon):%lukB"
> -                       " isolated(file):%lukB"
>                         " present:%lukB"
>                         " managed:%lukB"
>                         " mlocked:%lukB"
> @@ -4419,21 +4435,13 @@ void show_free_areas(unsigned int filter)
>                         " local_pcp:%ukB"
>                         " free_cma:%lukB"
>                         " writeback_tmp:%lukB"
> -                       " pages_scanned:%lu"
> -                       " all_unreclaimable? %s"
> +                       " node_pages_scanned:%lu"
>                         "\n",
>                         zone->name,
>                         K(zone_page_state(zone, NR_FREE_PAGES)),
>                         K(min_wmark_pages(zone)),
>                         K(low_wmark_pages(zone)),
>                         K(high_wmark_pages(zone)),
> -                       K(zone_page_state(zone, NR_ACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_INACTIVE_ANON)),
> -                       K(zone_page_state(zone, NR_ACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_INACTIVE_FILE)),
> -                       K(zone_page_state(zone, NR_UNEVICTABLE)),
> -                       K(zone_page_state(zone, NR_ISOLATED_ANON)),
> -                       K(zone_page_state(zone, NR_ISOLATED_FILE)),
>                         K(zone->present_pages),
>                         K(zone->managed_pages),
>                         K(zone_page_state(zone, NR_MLOCK)),
> @@ -4458,9 +4466,7 @@ void show_free_areas(unsigned int filter)
>                         K(this_cpu_read(zone->pageset->pcp.count)),
>                         K(zone_page_state(zone, NR_FREE_CMA_PAGES)),
>                         K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
> -                       K(zone_page_state(zone, NR_PAGES_SCANNED)),
> -                       (!zone_reclaimable(zone) ? "yes" : "no")
> -                       );
> +                       K(node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED)));
>                 printk("lowmem_reserve[]:");
>                 for (i = 0; i < MAX_NR_ZONES; i++)
>                         printk(" %ld", zone->lowmem_reserve[i]);
> @@ -6010,7 +6016,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>                 /* For bootup, initialized properly in watermark setup */
>                 mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
>
> -               lruvec_init(&zone->lruvec);
> +               lruvec_init(zone_lruvec(zone));
>                 if (!size)
>                         continue;
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bf37e5cfae81..77af473635fe 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
>                 unsigned long flags;
>
>                 spin_lock_irqsave(zone_lru_lock(zone), flags);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 VM_BUG_ON_PAGE(!PageLRU(page), page);
>                 __ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -194,7 +194,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>                         spin_lock_irqsave(zone_lru_lock(zone), flags);
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 (*move_fn)(page, lruvec, arg);
>         }
>         if (zone)
> @@ -319,7 +319,7 @@ void activate_page(struct page *page)
>
>         page = compound_head(page);
>         spin_lock_irq(zone_lru_lock(zone));
> -       __activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
> +       __activate_page(page, mem_cgroup_page_lruvec(page, zone->zone_pgdat), NULL);
>         spin_unlock_irq(zone_lru_lock(zone));
>  }
>  #endif
> @@ -445,16 +445,16 @@ void lru_cache_add(struct page *page)
>   */
>  void add_page_to_unevictable_list(struct page *page)
>  {
> -       struct zone *zone = page_zone(page);
> +       struct pglist_data *pgdat = page_pgdat(page);
>         struct lruvec *lruvec;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> -       lruvec = mem_cgroup_page_lruvec(page, zone);
> +       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
>         ClearPageActive(page);
>         SetPageUnevictable(page);
>         SetPageLRU(page);
>         add_page_to_lru_list(page, lruvec, LRU_UNEVICTABLE);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>  }
>
>  /**
> @@ -730,7 +730,7 @@ void release_pages(struct page **pages, int nr, bool cold)
>  {
>         int i;
>         LIST_HEAD(pages_to_free);
> -       struct zone *zone = NULL;
> +       struct pglist_data *locked_pgdat = NULL;
>         struct lruvec *lruvec;
>         unsigned long uninitialized_var(flags);
>         unsigned int uninitialized_var(lock_batch);
> @@ -741,11 +741,11 @@ void release_pages(struct page **pages, int nr, bool cold)
>                 /*
>                  * Make sure the IRQ-safe lock-holding time does not get
>                  * excessive with a continuous string of pages from the
> -                * same zone. The lock is held only if zone != NULL.
> +                * same pgdat. The lock is held only if pgdat != NULL.
>                  */
> -               if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
> -                       spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                       zone = NULL;
> +               if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> +                       spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                       locked_pgdat = NULL;
>                 }
>
>                 if (is_huge_zero_page(page)) {
> @@ -758,27 +758,27 @@ void release_pages(struct page **pages, int nr, bool cold)
>                         continue;
>
>                 if (PageCompound(page)) {
> -                       if (zone) {
> -                               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> -                               zone = NULL;
> +                       if (locked_pgdat) {
> +                               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +                               locked_pgdat = NULL;
>                         }
>                         __put_compound_page(page);
>                         continue;
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct zone *pagezone = page_zone(page);
> +                       struct pglist_data *pgdat = page_pgdat(page);
>
> -                       if (pagezone != zone) {
> -                               if (zone)
> -                                       spin_unlock_irqrestore(zone_lru_lock(zone),
> +                       if (pgdat != locked_pgdat) {
> +                               if (locked_pgdat)
> +                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
>                                                                         flags);
>                                 lock_batch = 0;
> -                               zone = pagezone;
> -                               spin_lock_irqsave(zone_lru_lock(zone), flags);
> +                               locked_pgdat = pgdat;
> +                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>                         }
>
> -                       lruvec = mem_cgroup_page_lruvec(page, zone);
> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         VM_BUG_ON_PAGE(!PageLRU(page), page);
>                         __ClearPageLRU(page);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -789,8 +789,8 @@ void release_pages(struct page **pages, int nr, bool cold)
>
>                 list_add(&page->lru, &pages_to_free);
>         }
> -       if (zone)
> -               spin_unlock_irqrestore(zone_lru_lock(zone), flags);
> +       if (locked_pgdat)
> +               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
>
>         mem_cgroup_uncharge_list(&pages_to_free);
>         free_hot_cold_page_list(&pages_to_free, cold);
> @@ -826,7 +826,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>         VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>         VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>         VM_BUG_ON(NR_CPUS != 1 &&
> -                 !spin_is_locked(zone_lru_lock(lruvec_zone(lruvec))));
> +                 !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
>
>         if (!list)
>                 SetPageLRU(page_tail);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e7ffcd259cc4..86a523a761c9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -191,26 +191,42 @@ static bool sane_reclaim(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * This misses isolated pages which are not accounted for to save counters.
> + * As the data only determines if reclaim or compaction continues, it is
> + * not expected that isolated pages will be a dominating factor.
> + */
>  unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>         unsigned long nr;
>
> -       nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
> -            zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
> +       nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
> +       if (get_nr_swap_pages() > 0)
> +               nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
> +
> +       return nr;
> +}
> +
> +unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
> +{
> +       unsigned long nr;
> +
> +       nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> +            node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
>
>         if (get_nr_swap_pages() > 0)
> -               nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
> -                     zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
> +               nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
> +                     node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
>
>         return nr;
>  }
>
> -bool zone_reclaimable(struct zone *zone)
> +bool pgdat_reclaimable(struct pglist_data *pgdat)
>  {
> -       return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
> -               zone_reclaimable_pages(zone) * 6;
> +       return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> +               pgdat_reclaimable_pages(pgdat) * 6;
>  }
>
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -218,7 +234,7 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
>         if (!mem_cgroup_disabled())
>                 return mem_cgroup_get_lru_size(lruvec, lru);
>
> -       return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> +       return node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  }
>
>  /*
> @@ -877,7 +893,7 @@ static void page_check_dirty_writeback(struct page *page,
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
> -                                     struct zone *zone,
> +                                     struct pglist_data *pgdat,
>                                       struct scan_control *sc,
>                                       enum ttu_flags ttu_flags,
>                                       unsigned long *ret_nr_dirty,
> @@ -917,7 +933,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         goto keep;
>
>                 VM_BUG_ON_PAGE(PageActive(page), page);
> -               VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
>                 sc->nr_scanned++;
>
> @@ -996,7 +1011,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                         /* Case 1 above */
>                         if (current_is_kswapd() &&
>                             PageReclaim(page) &&
> -                           test_bit(ZONE_WRITEBACK, &zone->flags)) {
> +                           test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>                                 nr_immediate++;
>                                 goto keep_locked;
>
> @@ -1092,7 +1107,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                          */
>                         if (page_is_file_cache(page) &&
>                                         (!current_is_kswapd() ||
> -                                        !test_bit(ZONE_DIRTY, &zone->flags))) {
> +                                        !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
>                                 /*
>                                  * Immediately reclaim when written back.
>                                  * Similar in principal to deactivate_page()
> @@ -1266,11 +1281,11 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>                 }
>         }
>
> -       ret = shrink_page_list(&clean_pages, zone, &sc,
> +       ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
>                         TTU_UNMAP|TTU_IGNORE_ACCESS,
>                         &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
>         list_splice(&clean_pages, page_list);
> -       mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> +       mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
>         return ret;
>  }
>
> @@ -1375,7 +1390,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  {
>         struct list_head *src = &lruvec->lists[lru];
>         unsigned long nr_taken = 0;
> -       unsigned long scan;
> +       unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
> +       unsigned long scan, nr_pages;
>
>         for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
>                                         !list_empty(src); scan++) {
> @@ -1388,7 +1404,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>
>                 switch (__isolate_lru_page(page, mode)) {
>                 case 0:
> -                       nr_taken += hpage_nr_pages(page);
> +                       nr_pages = hpage_nr_pages(page);
> +                       nr_taken += nr_pages;
> +                       nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
>
> @@ -1405,6 +1423,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>         *nr_scanned = scan;
>         trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
>                                     nr_taken, mode, is_file_lru(lru));
> +       for (scan = 0; scan < MAX_NR_ZONES; scan++) {
> +               nr_pages = nr_zone_taken[scan];
> +               if (!nr_pages)
> +                       continue;
> +
> +               update_lru_size(lruvec, lru, scan, -nr_pages);
> +       }
>         return nr_taken;
>  }
>
> @@ -1445,7 +1470,7 @@ int isolate_lru_page(struct page *page)
>                 struct lruvec *lruvec;
>
>                 spin_lock_irq(zone_lru_lock(zone));
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>                 if (PageLRU(page)) {
>                         int lru = page_lru(page);
>                         get_page(page);
> @@ -1465,7 +1490,7 @@ int isolate_lru_page(struct page *page)
>   * the LRU list will go small and be scanned faster than necessary, leading to
>   * unnecessary swapping, thrashing and OOM.
>   */
> -static int too_many_isolated(struct zone *zone, int file,
> +static int too_many_isolated(struct pglist_data *pgdat, int file,
>                 struct scan_control *sc)
>  {
>         unsigned long inactive, isolated;
> @@ -1477,11 +1502,11 @@ static int too_many_isolated(struct zone *zone, int file,
>                 return 0;
>
>         if (file) {
> -               inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -               isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
>         } else {
> -               inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> -               isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> +               inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
> +               isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
>         }
>
>         /*
> @@ -1499,7 +1524,7 @@ static noinline_for_stack void
>  putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  {
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         LIST_HEAD(pages_to_free);
>
>         /*
> @@ -1512,13 +1537,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(zone_lru_lock(zone));
> +                       spin_unlock_irq(&pgdat->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(zone_lru_lock(zone));
> +                       spin_lock_irq(&pgdat->lru_lock);
>                         continue;
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 SetPageLRU(page);
>                 lru = page_lru(page);
> @@ -1535,10 +1560,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>                 }
> @@ -1582,10 +1607,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         unsigned long nr_immediate = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>
> -       while (unlikely(too_many_isolated(zone, file, sc))) {
> +       while (unlikely(too_many_isolated(pgdat, file, sc))) {
>                 congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>                 /* We are about to die and free our memory. Return now. */
> @@ -1600,48 +1625,45 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc)) {
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>                 else
> -                       __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
> +                       __count_vm_events(PGSCAN_DIRECT, nr_scanned);
>         }
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         if (nr_taken == 0)
>                 return 0;
>
> -       nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> +       nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
>                                 &nr_dirty, &nr_unqueued_dirty, &nr_congested,
>                                 &nr_writeback, &nr_immediate,
>                                 false);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         if (global_reclaim(sc)) {
>                 if (current_is_kswapd())
> -                       __count_zone_vm_events(PGSTEAL_KSWAPD, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
>                 else
> -                       __count_zone_vm_events(PGSTEAL_DIRECT, zone,
> -                                              nr_reclaimed);
> +                       __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
>         }
>
>         putback_inactive_pages(lruvec, &page_list);
>
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&page_list);
>         free_hot_cold_page_list(&page_list, true);
> @@ -1661,7 +1683,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          * are encountered in the nr_immediate check below.
>          */
>         if (nr_writeback && nr_writeback == nr_taken)
> -               set_bit(ZONE_WRITEBACK, &zone->flags);
> +               set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * Legacy memcg will stall in page writeback so avoid forcibly
> @@ -1673,16 +1695,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>                  * backed by a congested BDI and wait_iff_congested will stall.
>                  */
>                 if (nr_dirty && nr_dirty == nr_congested)
> -                       set_bit(ZONE_CONGESTED, &zone->flags);
> +                       set_bit(PGDAT_CONGESTED, &pgdat->flags);
>
>                 /*
>                  * If dirty pages are scanned that are not queued for IO, it
>                  * implies that flushers are not keeping up. In this case, flag
> -                * the zone ZONE_DIRTY and kswapd will start writing pages from
> +                * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
>                  * reclaim context.
>                  */
>                 if (nr_unqueued_dirty == nr_taken)
> -                       set_bit(ZONE_DIRTY, &zone->flags);
> +                       set_bit(PGDAT_DIRTY, &pgdat->flags);
>
>                 /*
>                  * If kswapd scans pages marked marked for immediate
> @@ -1701,9 +1723,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>          */
>         if (!sc->hibernation_mode && !current_is_kswapd() &&
>             current_may_throttle())
> -               wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +               wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>
> -       trace_mm_vmscan_lru_shrink_inactive(zone, nr_scanned, nr_reclaimed,
> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> +                       nr_scanned, nr_reclaimed,
>                         sc->priority, file);
>         return nr_reclaimed;
>  }
> @@ -1731,20 +1754,20 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                                      struct list_head *pages_to_free,
>                                      enum lru_list lru)
>  {
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long pgmoved = 0;
>         struct page *page;
>         int nr_pages;
>
>         while (!list_empty(list)) {
>                 page = lru_to_page(list);
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>
>                 nr_pages = hpage_nr_pages(page);
> -               update_lru_size(lruvec, lru, nr_pages);
> +               update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
>                 list_move(&page->lru, &lruvec->lists[lru]);
>                 pgmoved += nr_pages;
>
> @@ -1754,10 +1777,10 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(zone_lru_lock(zone));
> +                               spin_unlock_irq(&pgdat->lru_lock);
>                                 mem_cgroup_uncharge(page);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(zone_lru_lock(zone));
> +                               spin_lock_irq(&pgdat->lru_lock);
>                         } else
>                                 list_add(&page->lru, pages_to_free);
>                 }
> @@ -1783,7 +1806,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         unsigned long nr_rotated = 0;
>         isolate_mode_t isolate_mode = 0;
>         int file = is_file_lru(lru);
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
>         lru_add_drain();
>
> @@ -1792,20 +1815,19 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         if (!sc->may_writepage)
>                 isolate_mode |= ISOLATE_CLEAN;
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>                                      &nr_scanned, sc, isolate_mode, lru);
>
> -       update_lru_size(lruvec, lru, -nr_taken);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>         reclaim_stat->recent_scanned[file] += nr_taken;
>
>         if (global_reclaim(sc))
> -               __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> -       __count_zone_vm_events(PGREFILL, zone, nr_scanned);
> +               __mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
> +       __count_vm_events(PGREFILL, nr_scanned);
>
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         while (!list_empty(&l_hold)) {
>                 cond_resched();
> @@ -1850,7 +1872,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         /*
>          * Move pages back to the lru list.
>          */
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         /*
>          * Count referenced pages from currently used mappings as rotated,
>          * even though only some of them are actually re-activated.  This
> @@ -1861,8 +1883,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>         move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>         move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> -       __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         mem_cgroup_uncharge_list(&l_hold);
>         free_hot_cold_page_list(&l_hold, true);
> @@ -1956,7 +1978,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>         u64 fraction[2];
>         u64 denominator = 0;    /* gcc */
> -       struct zone *zone = lruvec_zone(lruvec);
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         unsigned long anon_prio, file_prio;
>         enum scan_balance scan_balance;
>         unsigned long anon, file;
> @@ -1977,7 +1999,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * well.
>          */
>         if (current_is_kswapd()) {
> -               if (!zone_reclaimable(zone))
> +               if (!pgdat_reclaimable(pgdat))
>                         force_scan = true;
>                 if (!mem_cgroup_online(memcg))
>                         force_scan = true;
> @@ -2023,14 +2045,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>          * anon pages.  Try to detect this based on file LRU size.
>          */
>         if (global_reclaim(sc)) {
> -               unsigned long zonefile;
> -               unsigned long zonefree;
> +               unsigned long pgdatfile;
> +               unsigned long pgdatfree;
> +               int z;
> +               unsigned long total_high_wmark = 0;
>
> -               zonefree = zone_page_state(zone, NR_FREE_PAGES);
> -               zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> -                          zone_page_state(zone, NR_INACTIVE_FILE);
> +               pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
> +               pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                          node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> +               for (z = 0; z < MAX_NR_ZONES; z++) {
> +                       struct zone *zone = &pgdat->node_zones[z];
> +                       if (!populated_zone(zone))
> +                               continue;
> +
> +                       total_high_wmark += high_wmark_pages(zone);
> +               }
>
> -               if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
> +               if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
>                         scan_balance = SCAN_ANON;
>                         goto out;
>                 }
> @@ -2077,7 +2109,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>         file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
>                 lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
>
> -       spin_lock_irq(zone_lru_lock(zone));
> +       spin_lock_irq(&pgdat->lru_lock);
>         if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
>                 reclaim_stat->recent_scanned[0] /= 2;
>                 reclaim_stat->recent_rotated[0] /= 2;
> @@ -2098,7 +2130,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>
>         fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
>         fp /= reclaim_stat->recent_rotated[1] + 1;
> -       spin_unlock_irq(zone_lru_lock(zone));
> +       spin_unlock_irq(&pgdat->lru_lock);
>
>         fraction[0] = ap;
>         fraction[1] = fp;
> @@ -2352,9 +2384,9 @@ static inline bool should_continue_reclaim(struct zone *zone,
>          * inactive lists are large enough, continue reclaiming
>          */
>         pages_for_compaction = (2UL << sc->order);
> -       inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
> +       inactive_lru_pages = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
>         if (get_nr_swap_pages() > 0)
> -               inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +               inactive_lru_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
>         if (sc->nr_reclaimed < pages_for_compaction &&
>                         inactive_lru_pages > pages_for_compaction)
>                 return true;
> @@ -2554,7 +2586,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>                                 continue;
>
>                         if (sc->priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;       /* Let kswapd poll it */
>
>                         /*
> @@ -2692,7 +2724,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
>                 if (!populated_zone(zone) ||
> -                   zone_reclaimable_pages(zone) == 0)
> +                   pgdat_reclaimable_pages(pgdat) == 0)
>                         continue;
>
>                 pfmemalloc_reserve += min_wmark_pages(zone);
> @@ -3000,7 +3032,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>                  * DEF_PRIORITY. Effectively, it considers them balanced so
>                  * they must be considered balanced here as well!
>                  */
> -               if (!zone_reclaimable(zone)) {
> +               if (!pgdat_reclaimable(zone->zone_pgdat)) {
>                         balanced_pages += zone->managed_pages;
>                         continue;
>                 }
> @@ -3063,6 +3095,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  {
>         unsigned long balance_gap;
>         bool lowmem_pressure;
> +       struct pglist_data *pgdat = zone->zone_pgdat;
>
>         /* Reclaim above the high watermark. */
>         sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
> @@ -3087,7 +3120,8 @@ static bool kswapd_shrink_zone(struct zone *zone,
>
>         shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
>
> -       clear_bit(ZONE_WRITEBACK, &zone->flags);
> +       /* TODO: ANOMALY */
> +       clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>
>         /*
>          * If a zone reaches its high watermark, consider it to be no longer
> @@ -3095,10 +3129,10 @@ static bool kswapd_shrink_zone(struct zone *zone,
>          * BDIs but as pressure is relieved, speculatively avoid congestion
>          * waits.
>          */
> -       if (zone_reclaimable(zone) &&
> +       if (pgdat_reclaimable(zone->zone_pgdat) &&
>             zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
> -               clear_bit(ZONE_CONGESTED, &zone->flags);
> -               clear_bit(ZONE_DIRTY, &zone->flags);
> +               clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> +               clear_bit(PGDAT_DIRTY, &pgdat->flags);
>         }
>
>         return sc->nr_scanned >= sc->nr_to_reclaim;
> @@ -3157,7 +3191,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         /*
> @@ -3184,9 +3218,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 /*
>                                  * If balanced, clear the dirty and congested
>                                  * flags
> +                                *
> +                                * TODO: ANOMALY
>                                  */
> -                               clear_bit(ZONE_CONGESTED, &zone->flags);
> -                               clear_bit(ZONE_DIRTY, &zone->flags);
> +                               clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
> +                               clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
>                         }
>                 }
>
> @@ -3216,7 +3252,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                                 continue;
>
>                         if (sc.priority != DEF_PRIORITY &&
> -                           !zone_reclaimable(zone))
> +                           !pgdat_reclaimable(zone->zone_pgdat))
>                                 continue;
>
>                         sc.nr_scanned = 0;
> @@ -3612,8 +3648,8 @@ int sysctl_min_slab_ratio = 5;
>  static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
>  {
>         unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> -       unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
> -               zone_page_state(zone, NR_ACTIVE_FILE);
> +       unsigned long file_lru = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
> +               node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
>
>         /*
>          * It's possible for there to be more file mapped pages than
> @@ -3716,7 +3752,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>             zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
>                 return ZONE_RECLAIM_FULL;
>
> -       if (!zone_reclaimable(zone))
> +       if (!pgdat_reclaimable(zone->zone_pgdat))
>                 return ZONE_RECLAIM_FULL;
>
>         /*
> @@ -3795,7 +3831,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages)
>                         zone = pagezone;
>                         spin_lock_irq(zone_lru_lock(zone));
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, zone);
> +               lruvec = mem_cgroup_page_lruvec(page, zone->zone_pgdat);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3345d396a99b..de0c17076270 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -936,11 +936,8 @@ const char * const vmstat_text[] = {
>         /* enum zone_stat_item countes */
>         "nr_free_pages",
>         "nr_alloc_batch",
> -       "nr_inactive_anon",
> -       "nr_active_anon",
> -       "nr_inactive_file",
> -       "nr_active_file",
> -       "nr_unevictable",
> +       "nr_zone_anon_lru",
> +       "nr_zone_file_lru",
>         "nr_mlock",
>         "nr_anon_pages",
>         "nr_mapped",
> @@ -956,12 +953,9 @@ const char * const vmstat_text[] = {
>         "nr_vmscan_write",
>         "nr_vmscan_immediate_reclaim",
>         "nr_writeback_temp",
> -       "nr_isolated_anon",
> -       "nr_isolated_file",
>         "nr_shmem",
>         "nr_dirtied",
>         "nr_written",
> -       "nr_pages_scanned",
>  #if IS_ENABLED(CONFIG_ZSMALLOC)
>         "nr_zspages",
>  #endif
> @@ -981,6 +975,16 @@ const char * const vmstat_text[] = {
>         "nr_shmem_pmdmapped",
>         "nr_free_cma",
>
> +       /* Node-based counters */
> +       "nr_inactive_anon",
> +       "nr_active_anon",
> +       "nr_inactive_file",
> +       "nr_active_file",
> +       "nr_unevictable",
> +       "nr_isolated_anon",
> +       "nr_isolated_file",
> +       "nr_pages_scanned",
> +
>         /* enum writeback_stat_item counters */
>         "nr_dirty_threshold",
>         "nr_dirty_background_threshold",
> @@ -1002,11 +1006,11 @@ const char * const vmstat_text[] = {
>         "pgmajfault",
>         "pglazyfreed",
>
> -       TEXTS_FOR_ZONES("pgrefill")
> -       TEXTS_FOR_ZONES("pgsteal_kswapd")
> -       TEXTS_FOR_ZONES("pgsteal_direct")
> -       TEXTS_FOR_ZONES("pgscan_kswapd")
> -       TEXTS_FOR_ZONES("pgscan_direct")
> +       "pgrefill",
> +       "pgsteal_kswapd",
> +       "pgsteal_direct",
> +       "pgscan_kswapd",
> +       "pgscan_direct",
>         "pgscan_direct_throttle",
>
>  #ifdef CONFIG_NUMA
> @@ -1434,7 +1438,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        min      %lu"
>                    "\n        low      %lu"
>                    "\n        high     %lu"
> -                  "\n        scanned  %lu"
> +                  "\n   node_scanned  %lu"
>                    "\n        spanned  %lu"
>                    "\n        present  %lu"
>                    "\n        managed  %lu",
> @@ -1442,13 +1446,13 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    min_wmark_pages(zone),
>                    low_wmark_pages(zone),
>                    high_wmark_pages(zone),
> -                  zone_page_state(zone, NR_PAGES_SCANNED),
> +                  node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>                    zone->spanned_pages,
>                    zone->present_pages,
>                    zone->managed_pages);
>
>         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> -               seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
> +               seq_printf(m, "\n      %-12s %lu", vmstat_text[i],
>                                 zone_page_state(zone, i));
>
>         seq_printf(m,
> @@ -1478,12 +1482,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  #endif
>         }
>         seq_printf(m,
> -                  "\n  all_unreclaimable: %u"
> -                  "\n  start_pfn:         %lu"
> -                  "\n  inactive_ratio:    %u",
> -                  !zone_reclaimable(zone),
> +                  "\n  node_unreclaimable:  %u"
> +                  "\n  start_pfn:           %lu"
> +                  "\n  node_inactive_ratio: %u",
> +                  !pgdat_reclaimable(zone->zone_pgdat),
>                    zone->zone_start_pfn,
> -                  zone->inactive_ratio);
> +                  zone->zone_pgdat->inactive_ratio);
>         seq_putc(m, '\n');
>  }
>
> @@ -1574,7 +1578,6 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  {
>         unsigned long *l = arg;
>         unsigned long off = l - (unsigned long *)m->private;
> -
>         seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
>         return 0;
>  }
> diff --git a/mm/workingset.c b/mm/workingset.c
> index ba972ac2dfdd..ebe14445809a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -355,8 +355,8 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>                 pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>                                                      LRU_ALL_FILE);
>         } else {
> -               pages = sum_zone_node_page_state(sc->nid, NR_ACTIVE_FILE) +
> -                       sum_zone_node_page_state(sc->nid, NR_INACTIVE_FILE);
> +               pages = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
> +                       node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
>         }
>
>         /*
> --
> 2.6.4
>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
  2016-08-04 20:59     ` James Hogan
@ 2016-08-05  8:41       ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-05  8:41 UTC (permalink / raw)
  To: James Hogan
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

On Thu, Aug 04, 2016 at 09:59:17PM +0100, James Hogan wrote:
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> This breaks boot on metag architecture:
> Oops: err 0007 (Data access general read/write fault) addr 00233008 [#1]
> 
> It appears to be in node_page_state_snapshot() (via
> pgdat_reclaimable()), and have come via mm_init. Here's the relevant
> bit of the backtrace:
> 
>     node_page_state_snapshot@0x4009c884(enum node_stat_item item =
> ???, struct pglist_data * pgdat = ???) + 0x48
>     pgdat_reclaimable(struct pglist_data * pgdat = 0x402517a0)
>     show_free_areas(unsigned int filter = 0) + 0x2cc
>     show_mem(unsigned int filter = 0) + 0x18
>     mm_init@0x4025c3d4()
>     start_kernel() + 0x204
> 
> __per_cpu_offset[0] == 0x233000 (close to bad addr),
> pgdat->per_cpu_nodestats = NULL. and setup_per_cpu_pageset()
> definitely hasn't been called yet (mm_init is called before
> setup_per_cpu_pageset()).
> 
> Any ideas what the correct solution is (and why presumably others
> haven't seen the same issue on other architectures?).
> 

metag calls show_mem in mem_init() before the pagesets are initialised.
What's surprising is that it worked for the zone stats as it appears
that calling zone_reclaimable() from that context should also have
broken. Did anything change recently that would have avoided the
zone->pageset dereference in zone_reclaimable() before?

The easiest option would be to not call show_mem from arch code until
after the pagesets are setup.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-08-05  8:41       ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-05  8:41 UTC (permalink / raw)
  To: James Hogan
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

On Thu, Aug 04, 2016 at 09:59:17PM +0100, James Hogan wrote:
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> This breaks boot on metag architecture:
> Oops: err 0007 (Data access general read/write fault) addr 00233008 [#1]
> 
> It appears to be in node_page_state_snapshot() (via
> pgdat_reclaimable()), and have come via mm_init. Here's the relevant
> bit of the backtrace:
> 
>     node_page_state_snapshot@0x4009c884(enum node_stat_item item =
> ???, struct pglist_data * pgdat = ???) + 0x48
>     pgdat_reclaimable(struct pglist_data * pgdat = 0x402517a0)
>     show_free_areas(unsigned int filter = 0) + 0x2cc
>     show_mem(unsigned int filter = 0) + 0x18
>     mm_init@0x4025c3d4()
>     start_kernel() + 0x204
> 
> __per_cpu_offset[0] == 0x233000 (close to bad addr),
> pgdat->per_cpu_nodestats = NULL. and setup_per_cpu_pageset()
> definitely hasn't been called yet (mm_init is called before
> setup_per_cpu_pageset()).
> 
> Any ideas what the correct solution is (and why presumably others
> haven't seen the same issue on other architectures?).
> 

metag calls show_mem in mem_init() before the pagesets are initialised.
What's surprising is that it worked for the zone stats as it appears
that calling zone_reclaimable() from that context should also have
broken. Did anything change recently that would have avoided the
zone->pageset dereference in zone_reclaimable() before?

The easiest option would be to not call show_mem from arch code until
after the pagesets are setup.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-08-05 10:52         ` James Hogan
  0 siblings, 0 replies; 220+ messages in thread
From: James Hogan @ 2016-08-05 10:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

[-- Attachment #1: Type: text/plain, Size: 2789 bytes --]

On Fri, Aug 05, 2016 at 09:41:15AM +0100, Mel Gorman wrote:
> On Thu, Aug 04, 2016 at 09:59:17PM +0100, James Hogan wrote:
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > 
> > This breaks boot on metag architecture:
> > Oops: err 0007 (Data access general read/write fault) addr 00233008 [#1]
> > 
> > It appears to be in node_page_state_snapshot() (via
> > pgdat_reclaimable()), and have come via mm_init. Here's the relevant
> > bit of the backtrace:
> > 
> >     node_page_state_snapshot@0x4009c884(enum node_stat_item item =
> > ???, struct pglist_data * pgdat = ???) + 0x48
> >     pgdat_reclaimable(struct pglist_data * pgdat = 0x402517a0)
> >     show_free_areas(unsigned int filter = 0) + 0x2cc
> >     show_mem(unsigned int filter = 0) + 0x18
> >     mm_init@0x4025c3d4()
> >     start_kernel() + 0x204
> > 
> > __per_cpu_offset[0] == 0x233000 (close to bad addr),
> > pgdat->per_cpu_nodestats = NULL. and setup_per_cpu_pageset()
> > definitely hasn't been called yet (mm_init is called before
> > setup_per_cpu_pageset()).
> > 
> > Any ideas what the correct solution is (and why presumably others
> > haven't seen the same issue on other architectures?).
> > 
> 
> metag calls show_mem in mem_init() before the pagesets are initialised.

Indeed, I didn't spot yesterday evening that this appears to be
different to other arches.

> What's surprising is that it worked for the zone stats as it appears
> that calling zone_reclaimable() from that context should also have
> broken. Did anything change recently that would have avoided the
> zone->pageset dereference in zone_reclaimable() before?

It appears that zone_pcp_init() was already setting zone->pageset to
&boot_pageset, via paging_init():

zone_pcp_init@0x40265d54(struct zone * zone = ???)
free_area_init_core@0x40265c18(struct pglist_data * pgdat = ???) + 0x138
free_area_init_node(int nid = 0, unsigned long * zones_size = ???, unsigned long node_start_pfn = ???, unsigned long * zholes_size = ???) + 0x1a0
free_area_init_nodes(unsigned long * max_zone_pfn = ???) + 0x440
paging_init(unsigned long mem_end = 0x4fe00000) + 0x378
setup_arch(char ** cmdline_p = 0x4024e038) + 0x2b8
start_kernel() + 0x54

setup_arch() is called prior to mm_init(), which explains why it wasn't
crashing before.

> The easiest option would be to not call show_mem from arch code until
> after the pagesets are setup.

Since no other arches seem to do show_mem earily during boot like metag,
and doing so doesn't really add much value, I'm happy to remove it
anyway.

However could your change break other things and need fixing anyway?

Thanks!
James

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-08-05 10:52         ` James Hogan
  0 siblings, 0 replies; 220+ messages in thread
From: James Hogan @ 2016-08-05 10:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

[-- Attachment #1: Type: text/plain, Size: 2868 bytes --]

On Fri, Aug 05, 2016 at 09:41:15AM +0100, Mel Gorman wrote:
> On Thu, Aug 04, 2016 at 09:59:17PM +0100, James Hogan wrote:
> > > Signed-off-by: Mel Gorman <mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt@public.gmane.org>
> > > Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> > > Acked-by: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>
> > 
> > This breaks boot on metag architecture:
> > Oops: err 0007 (Data access general read/write fault) addr 00233008 [#1]
> > 
> > It appears to be in node_page_state_snapshot() (via
> > pgdat_reclaimable()), and have come via mm_init. Here's the relevant
> > bit of the backtrace:
> > 
> >     node_page_state_snapshot@0x4009c884(enum node_stat_item item =
> > ???, struct pglist_data * pgdat = ???) + 0x48
> >     pgdat_reclaimable(struct pglist_data * pgdat = 0x402517a0)
> >     show_free_areas(unsigned int filter = 0) + 0x2cc
> >     show_mem(unsigned int filter = 0) + 0x18
> >     mm_init@0x4025c3d4()
> >     start_kernel() + 0x204
> > 
> > __per_cpu_offset[0] == 0x233000 (close to bad addr),
> > pgdat->per_cpu_nodestats = NULL. and setup_per_cpu_pageset()
> > definitely hasn't been called yet (mm_init is called before
> > setup_per_cpu_pageset()).
> > 
> > Any ideas what the correct solution is (and why presumably others
> > haven't seen the same issue on other architectures?).
> > 
> 
> metag calls show_mem in mem_init() before the pagesets are initialised.

Indeed, I didn't spot yesterday evening that this appears to be
different to other arches.

> What's surprising is that it worked for the zone stats as it appears
> that calling zone_reclaimable() from that context should also have
> broken. Did anything change recently that would have avoided the
> zone->pageset dereference in zone_reclaimable() before?

It appears that zone_pcp_init() was already setting zone->pageset to
&boot_pageset, via paging_init():

zone_pcp_init@0x40265d54(struct zone * zone = ???)
free_area_init_core@0x40265c18(struct pglist_data * pgdat = ???) + 0x138
free_area_init_node(int nid = 0, unsigned long * zones_size = ???, unsigned long node_start_pfn = ???, unsigned long * zholes_size = ???) + 0x1a0
free_area_init_nodes(unsigned long * max_zone_pfn = ???) + 0x440
paging_init(unsigned long mem_end = 0x4fe00000) + 0x378
setup_arch(char ** cmdline_p = 0x4024e038) + 0x2b8
start_kernel() + 0x54

setup_arch() is called prior to mm_init(), which explains why it wasn't
crashing before.

> The easiest option would be to not call show_mem from arch code until
> after the pagesets are setup.

Since no other arches seem to do show_mem earily during boot like metag,
and doing so doesn't really add much value, I'm happy to remove it
anyway.

However could your change break other things and need fixing anyway?

Thanks!
James

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
  2016-08-05 10:52         ` James Hogan
  (?)
@ 2016-08-05 11:55           ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-05 11:55 UTC (permalink / raw)
  To: James Hogan
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

On Fri, Aug 05, 2016 at 11:52:57AM +0100, James Hogan wrote:
> > What's surprising is that it worked for the zone stats as it appears
> > that calling zone_reclaimable() from that context should also have
> > broken. Did anything change recently that would have avoided the
> > zone->pageset dereference in zone_reclaimable() before?
> 
> It appears that zone_pcp_init() was already setting zone->pageset to
> &boot_pageset, via paging_init():
> 

/me slaps self

Of course.

> > The easiest option would be to not call show_mem from arch code until
> > after the pagesets are setup.
> 
> Since no other arches seem to do show_mem earily during boot like metag,
> and doing so doesn't really add much value, I'm happy to remove it
> anyway.
> 

Thanks. Can I assume you'll merge such a patch or should I roll one?

> However could your change break other things and need fixing anyway?
> 

Not that I'm aware of. There would have to be a node-based stat that has
meaning that early in boot to have an effect. If one happened to added
then it would need fixing but until then the complexity is unnecessary.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-08-05 11:55           ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-05 11:55 UTC (permalink / raw)
  To: James Hogan
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

On Fri, Aug 05, 2016 at 11:52:57AM +0100, James Hogan wrote:
> > What's surprising is that it worked for the zone stats as it appears
> > that calling zone_reclaimable() from that context should also have
> > broken. Did anything change recently that would have avoided the
> > zone->pageset dereference in zone_reclaimable() before?
> 
> It appears that zone_pcp_init() was already setting zone->pageset to
> &boot_pageset, via paging_init():
> 

/me slaps self

Of course.

> > The easiest option would be to not call show_mem from arch code until
> > after the pagesets are setup.
> 
> Since no other arches seem to do show_mem earily during boot like metag,
> and doing so doesn't really add much value, I'm happy to remove it
> anyway.
> 

Thanks. Can I assume you'll merge such a patch or should I roll one?

> However could your change break other things and need fixing anyway?
> 

Not that I'm aware of. There would have to be a node-based stat that has
meaning that early in boot to have an effect. If one happened to added
then it would need fixing but until then the complexity is unnecessary.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-08-05 11:55           ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-05 11:55 UTC (permalink / raw)
  To: James Hogan
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

On Fri, Aug 05, 2016 at 11:52:57AM +0100, James Hogan wrote:
> > What's surprising is that it worked for the zone stats as it appears
> > that calling zone_reclaimable() from that context should also have
> > broken. Did anything change recently that would have avoided the
> > zone->pageset dereference in zone_reclaimable() before?
> 
> It appears that zone_pcp_init() was already setting zone->pageset to
> &boot_pageset, via paging_init():
> 

/me slaps self

Of course.

> > The easiest option would be to not call show_mem from arch code until
> > after the pagesets are setup.
> 
> Since no other arches seem to do show_mem earily during boot like metag,
> and doing so doesn't really add much value, I'm happy to remove it
> anyway.
> 

Thanks. Can I assume you'll merge such a patch or should I roll one?

> However could your change break other things and need fixing anyway?
> 

Not that I'm aware of. There would have to be a node-based stat that has
meaning that early in boot to have an effect. If one happened to added
then it would need fixing but until then the complexity is unnecessary.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-08-05 12:02             ` James Hogan
  0 siblings, 0 replies; 220+ messages in thread
From: James Hogan @ 2016-08-05 12:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

[-- Attachment #1: Type: text/plain, Size: 1336 bytes --]

On Fri, Aug 05, 2016 at 12:55:26PM +0100, Mel Gorman wrote:
> On Fri, Aug 05, 2016 at 11:52:57AM +0100, James Hogan wrote:
> > > What's surprising is that it worked for the zone stats as it appears
> > > that calling zone_reclaimable() from that context should also have
> > > broken. Did anything change recently that would have avoided the
> > > zone->pageset dereference in zone_reclaimable() before?
> > 
> > It appears that zone_pcp_init() was already setting zone->pageset to
> > &boot_pageset, via paging_init():
> > 
> 
> /me slaps self
> 
> Of course.
> 
> > > The easiest option would be to not call show_mem from arch code until
> > > after the pagesets are setup.
> > 
> > Since no other arches seem to do show_mem earily during boot like metag,
> > and doing so doesn't really add much value, I'm happy to remove it
> > anyway.
> > 
> 
> Thanks. Can I assume you'll merge such a patch or should I roll one?

Yep, I'll take care of it.

> 
> > However could your change break other things and need fixing anyway?
> > 
> 
> Not that I'm aware of. There would have to be a node-based stat that has
> meaning that early in boot to have an effect. If one happened to added
> then it would need fixing but until then the complexity is unnecessary.

Okay, thanks for the help,

Cheers
James

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/34] mm, vmscan: move LRU lists to node
@ 2016-08-05 12:02             ` James Hogan
  0 siblings, 0 replies; 220+ messages in thread
From: James Hogan @ 2016-08-05 12:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, metag

[-- Attachment #1: Type: text/plain, Size: 1336 bytes --]

On Fri, Aug 05, 2016 at 12:55:26PM +0100, Mel Gorman wrote:
> On Fri, Aug 05, 2016 at 11:52:57AM +0100, James Hogan wrote:
> > > What's surprising is that it worked for the zone stats as it appears
> > > that calling zone_reclaimable() from that context should also have
> > > broken. Did anything change recently that would have avoided the
> > > zone->pageset dereference in zone_reclaimable() before?
> > 
> > It appears that zone_pcp_init() was already setting zone->pageset to
> > &boot_pageset, via paging_init():
> > 
> 
> /me slaps self
> 
> Of course.
> 
> > > The easiest option would be to not call show_mem from arch code until
> > > after the pagesets are setup.
> > 
> > Since no other arches seem to do show_mem earily during boot like metag,
> > and doing so doesn't really add much value, I'm happy to remove it
> > anyway.
> > 
> 
> Thanks. Can I assume you'll merge such a patch or should I roll one?

Yep, I'll take care of it.

> 
> > However could your change break other things and need fixing anyway?
> > 
> 
> Not that I'm aware of. There would have to be a node-based stat that has
> meaning that early in boot to have an effect. If one happened to added
> then it would need fixing but until then the complexity is unnecessary.

Okay, thanks for the help,

Cheers
James

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
  2016-07-08  9:34 ` Mel Gorman
@ 2016-08-19 13:12   ` Andrea Arcangeli
  -1 siblings, 0 replies; 220+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 13:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

Hello Mel,

On Fri, Jul 08, 2016 at 10:34:36AM +0100, Mel Gorman wrote:
> Minor changes this time
> 
> Changelog since v8
> This is the latest version of a series that moves LRUs from the zones to

I'm afraid this is a bit incomplete...

I had troubles in rebasing the compaction-enabled zone_reclaim feature
(now node_reclaim) to the node model. That is because compaction is
still zone based, and so I would need to do a loop of compaction calls
(for each zone in the node), but what's the point? Movable memory can
always go anywhere, can't it? So it would be better to compact across
the whole node without care of the zone boundaries. Then if the
classzone_idx passed to compaction is not for the highest classzone,
it'll do zone_reclaim and focus itself on the lower zones (but it can
still cross the zone boundaries among those lower zones).

No matter how I tweak my code it doesn't make much sense to do a
manual loop and leave compaction unable to cross zone boundaries. Is
anybody working to complete this work to make compaction work on node
basis instead of zone basis? Or am I missing something for why
compaction scan "lowpfn, highpfn" starting positions cannot possibly
cross zone boundaries?

I'm also uncertain what's the meaning now of zonelist_order=z (default
setting) considering it'll always behave like zone_order=n
anyway... On the same lines, I'm also uncertain of the meaning of the
zonelist in the first place and why it's not a "nodelist +
classzone_idx". Why is there still a zonelist_order=z default setting
and a zonelist_order option in the first place, and a zonelist instead
of a nodelist?

I use zonelist_order=n on my NUMA systems and I always liked the LRU
to be per-node (despite it uses more CPU when you allocate from a
lower classzone as you need to skip the pages of the higher zones not
contained in the classzone_idx). So to be clear I'm not against this
work (I tend to believe there are more pros than cons), but to port
some code to the node model in the right way, I'd need to do too much
work myself on the compaction side.

Also note, the main security left that allows this change to work
stable is in the lowmem reserve ratio feature in the page allocator
that prevents lower classzones to be completely filled by non movable
allocations from higher classzones (i.e. pagetables). As there's no
priority anymore to start shrinking from the higher zone of the
classzone_idx of the allocation (especially effective logic if using
zonelist_order=z which happens to be the default, even though I almost
always use zonelist_order=n which in fact already behaved much closer
to the new behavior). The removal of the bias against the highest zone
to me is the biggest cons in terms of stability in the corner cases,
overall but I believe the security of the lowmem reserve ratio should
suffice.

I also expect this work to make negligible difference for those
systems where DMA32 and DMA zones don't exist or are tiny, as the
node:zone relation is practically already 1:1 there. I believe this
actually will help more in systems where the DMA32 zone is relevant if
compared to the total memory size (as long as there are not too many
DMA32 allocations from pci32 devices, and the zone exists just in
case, for an lowmem allocation once in a while). So this isn't a
change for the long run, it'll be more noticeable on low end systems
or highmem 32bit systems, and it's going to be a noop if you've got a
terabytes of RAM (perhaps some pointer dereference is avoided, but
that difference should get not measurable).

On a side note the compaction enabled node_reclaim that makes
node_reclaim fully effective with THP on, works better with
zonelist_order=z too, so it should work even better with the node
model that practically makes zonelist_order=z impossible to achieve
any longer (which also shows it was a bad default and it was good idea
to manually set it to =n :). It's just the compaction zone model that
forces me to write a for-each-zone loop that isn't ideal and it would
defeat the purpose of the node model as far as compaction is concerned.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
@ 2016-08-19 13:12   ` Andrea Arcangeli
  0 siblings, 0 replies; 220+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 13:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

Hello Mel,

On Fri, Jul 08, 2016 at 10:34:36AM +0100, Mel Gorman wrote:
> Minor changes this time
> 
> Changelog since v8
> This is the latest version of a series that moves LRUs from the zones to

I'm afraid this is a bit incomplete...

I had troubles in rebasing the compaction-enabled zone_reclaim feature
(now node_reclaim) to the node model. That is because compaction is
still zone based, and so I would need to do a loop of compaction calls
(for each zone in the node), but what's the point? Movable memory can
always go anywhere, can't it? So it would be better to compact across
the whole node without care of the zone boundaries. Then if the
classzone_idx passed to compaction is not for the highest classzone,
it'll do zone_reclaim and focus itself on the lower zones (but it can
still cross the zone boundaries among those lower zones).

No matter how I tweak my code it doesn't make much sense to do a
manual loop and leave compaction unable to cross zone boundaries. Is
anybody working to complete this work to make compaction work on node
basis instead of zone basis? Or am I missing something for why
compaction scan "lowpfn, highpfn" starting positions cannot possibly
cross zone boundaries?

I'm also uncertain what's the meaning now of zonelist_order=z (default
setting) considering it'll always behave like zone_order=n
anyway... On the same lines, I'm also uncertain of the meaning of the
zonelist in the first place and why it's not a "nodelist +
classzone_idx". Why is there still a zonelist_order=z default setting
and a zonelist_order option in the first place, and a zonelist instead
of a nodelist?

I use zonelist_order=n on my NUMA systems and I always liked the LRU
to be per-node (despite it uses more CPU when you allocate from a
lower classzone as you need to skip the pages of the higher zones not
contained in the classzone_idx). So to be clear I'm not against this
work (I tend to believe there are more pros than cons), but to port
some code to the node model in the right way, I'd need to do too much
work myself on the compaction side.

Also note, the main security left that allows this change to work
stable is in the lowmem reserve ratio feature in the page allocator
that prevents lower classzones to be completely filled by non movable
allocations from higher classzones (i.e. pagetables). As there's no
priority anymore to start shrinking from the higher zone of the
classzone_idx of the allocation (especially effective logic if using
zonelist_order=z which happens to be the default, even though I almost
always use zonelist_order=n which in fact already behaved much closer
to the new behavior). The removal of the bias against the highest zone
to me is the biggest cons in terms of stability in the corner cases,
overall but I believe the security of the lowmem reserve ratio should
suffice.

I also expect this work to make negligible difference for those
systems where DMA32 and DMA zones don't exist or are tiny, as the
node:zone relation is practically already 1:1 there. I believe this
actually will help more in systems where the DMA32 zone is relevant if
compared to the total memory size (as long as there are not too many
DMA32 allocations from pci32 devices, and the zone exists just in
case, for an lowmem allocation once in a while). So this isn't a
change for the long run, it'll be more noticeable on low end systems
or highmem 32bit systems, and it's going to be a noop if you've got a
terabytes of RAM (perhaps some pointer dereference is avoided, but
that difference should get not measurable).

On a side note the compaction enabled node_reclaim that makes
node_reclaim fully effective with THP on, works better with
zonelist_order=z too, so it should work even better with the node
model that practically makes zonelist_order=z impossible to achieve
any longer (which also shows it was a bad default and it was good idea
to manually set it to =n :). It's just the compaction zone model that
forces me to write a for-each-zone loop that isn't ideal and it would
defeat the purpose of the node model as far as compaction is concerned.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
  2016-08-19 13:12   ` Andrea Arcangeli
@ 2016-08-19 13:23     ` Vlastimil Babka
  -1 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-08-19 13:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner,
	Minchan Kim, Joonsoo Kim, LKML

On 08/19/2016 03:12 PM, Andrea Arcangeli wrote:
> Hello Mel,

Hi Andrea,

> On Fri, Jul 08, 2016 at 10:34:36AM +0100, Mel Gorman wrote:
>> Minor changes this time
>>
>> Changelog since v8
>> This is the latest version of a series that moves LRUs from the zones to
>
> I'm afraid this is a bit incomplete...
>
> I had troubles in rebasing the compaction-enabled zone_reclaim feature
> (now node_reclaim) to the node model.

What's that? Never head of this before, but sounds scary :) I thought 
that zone_reclaim itself was rather discouraged nowadays, not a big 
candidate for further improvement.,,

> That is because compaction is
> still zone based, and so I would need to do a loop of compaction calls
> (for each zone in the node), but what's the point? Movable memory can
> always go anywhere, can't it?

Hm I'm not so sure. Are all movable allocations highmem? For example 
Joonsoo mentions in his ZONE_CMA patchset "blockdev file cache page 
[...] usually has __GFP_MOVABLE but not __GFP_HIGHMEM and __GFP_USER".
Now we also have Minchan's infrastructure for arbitrary driver 
compaction, so those will be movable, but potentially still restricted 
to e.g. DMA32...

Vlastimil

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
@ 2016-08-19 13:23     ` Vlastimil Babka
  0 siblings, 0 replies; 220+ messages in thread
From: Vlastimil Babka @ 2016-08-19 13:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Johannes Weiner,
	Minchan Kim, Joonsoo Kim, LKML

On 08/19/2016 03:12 PM, Andrea Arcangeli wrote:
> Hello Mel,

Hi Andrea,

> On Fri, Jul 08, 2016 at 10:34:36AM +0100, Mel Gorman wrote:
>> Minor changes this time
>>
>> Changelog since v8
>> This is the latest version of a series that moves LRUs from the zones to
>
> I'm afraid this is a bit incomplete...
>
> I had troubles in rebasing the compaction-enabled zone_reclaim feature
> (now node_reclaim) to the node model.

What's that? Never head of this before, but sounds scary :) I thought 
that zone_reclaim itself was rather discouraged nowadays, not a big 
candidate for further improvement.,,

> That is because compaction is
> still zone based, and so I would need to do a loop of compaction calls
> (for each zone in the node), but what's the point? Movable memory can
> always go anywhere, can't it?

Hm I'm not so sure. Are all movable allocations highmem? For example 
Joonsoo mentions in his ZONE_CMA patchset "blockdev file cache page 
[...] usually has __GFP_MOVABLE but not __GFP_HIGHMEM and __GFP_USER".
Now we also have Minchan's infrastructure for arbitrary driver 
compaction, so those will be movable, but potentially still restricted 
to e.g. DMA32...

Vlastimil

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
  2016-08-19 13:23     ` Vlastimil Babka
@ 2016-08-19 13:55       ` Andrea Arcangeli
  -1 siblings, 0 replies; 220+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 13:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Andrew Morton, Linux-MM, Rik van Riel,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Aug 19, 2016 at 03:23:20PM +0200, Vlastimil Babka wrote:
> What's that? Never head of this before, but sounds scary :) I thought 
> that zone_reclaim itself was rather discouraged nowadays, not a big 
> candidate for further improvement.,,

It's some fix that I tried to push upstream but wasn't merged. I kept
maintaining it because I got customers bugreport about THP causing
regressions to node_reclaim.

Hard NUMA bindings would solve that but apparently there are apps that
prefers no memory binding to allow flexible spillover, and they only
use CPU bindings only but with a strong NUMA bias provided by
node_reclaim, by shrinking the cache (and only the cache).

In any case it was a regression caused by THP because compaction
wasn't invoked. Note zone_reclaim has a synchronous more aggressive
option that blocks for write back if needed, so invoking direct
compaction there is sure ok, if it's asked on demand.

As usual it's always a tradeoff between long live and short lived
allocation so if you reserve a system for computations and you know
your allocation are very long lived it make perfect sense to be
aggressive if you tune for it.

zone_reclaim or synchronous direct compaction are obviously bad
defaults for general purpose default settings, it doesn't mean it
should be impossible to tune a system for a certain workload to run
optimal.

> Hm I'm not so sure. Are all movable allocations highmem? For example 
> Joonsoo mentions in his ZONE_CMA patchset "blockdev file cache page 
> [...] usually has __GFP_MOVABLE but not __GFP_HIGHMEM and __GFP_USER".
> Now we also have Minchan's infrastructure for arbitrary driver 
> compaction, so those will be movable, but potentially still restricted 
> to e.g. DMA32...

One option is to forbid such corner cases... and VM_WARN_ON (not a
typo :) available in my tree) if __GFP_MOVABLE is passed on lower
classzones.

The other option would be to have a per-classzone lowpfn, highpnf scan
pointers. That has some cons but hey this whole thing is a tradeoff
isn't it?

It's about the fact we're optimizing for less frequent lowmem
allocations so we can as well provide a worse compaction for lowmem
(by reducing the MOVABLE memory restricted to lower classzones like
mentioned above), but leverage the node model to have a more powerful
that crosses all zone boundaries, when the GFP_HIGHUSER is used.

I don't see why the tradeoff is valid when it comes to the LRU but not
valid when it comes to compaction and then I've to do a blind loop of
(for-each-zone-in-the-node-in-reverse { compact_zone_order(zone) })
which works worse than before and works worse than a
zone-boundary-less compaction based on the node model.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
@ 2016-08-19 13:55       ` Andrea Arcangeli
  0 siblings, 0 replies; 220+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 13:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Andrew Morton, Linux-MM, Rik van Riel,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Aug 19, 2016 at 03:23:20PM +0200, Vlastimil Babka wrote:
> What's that? Never head of this before, but sounds scary :) I thought 
> that zone_reclaim itself was rather discouraged nowadays, not a big 
> candidate for further improvement.,,

It's some fix that I tried to push upstream but wasn't merged. I kept
maintaining it because I got customers bugreport about THP causing
regressions to node_reclaim.

Hard NUMA bindings would solve that but apparently there are apps that
prefers no memory binding to allow flexible spillover, and they only
use CPU bindings only but with a strong NUMA bias provided by
node_reclaim, by shrinking the cache (and only the cache).

In any case it was a regression caused by THP because compaction
wasn't invoked. Note zone_reclaim has a synchronous more aggressive
option that blocks for write back if needed, so invoking direct
compaction there is sure ok, if it's asked on demand.

As usual it's always a tradeoff between long live and short lived
allocation so if you reserve a system for computations and you know
your allocation are very long lived it make perfect sense to be
aggressive if you tune for it.

zone_reclaim or synchronous direct compaction are obviously bad
defaults for general purpose default settings, it doesn't mean it
should be impossible to tune a system for a certain workload to run
optimal.

> Hm I'm not so sure. Are all movable allocations highmem? For example 
> Joonsoo mentions in his ZONE_CMA patchset "blockdev file cache page 
> [...] usually has __GFP_MOVABLE but not __GFP_HIGHMEM and __GFP_USER".
> Now we also have Minchan's infrastructure for arbitrary driver 
> compaction, so those will be movable, but potentially still restricted 
> to e.g. DMA32...

One option is to forbid such corner cases... and VM_WARN_ON (not a
typo :) available in my tree) if __GFP_MOVABLE is passed on lower
classzones.

The other option would be to have a per-classzone lowpfn, highpnf scan
pointers. That has some cons but hey this whole thing is a tradeoff
isn't it?

It's about the fact we're optimizing for less frequent lowmem
allocations so we can as well provide a worse compaction for lowmem
(by reducing the MOVABLE memory restricted to lower classzones like
mentioned above), but leverage the node model to have a more powerful
that crosses all zone boundaries, when the GFP_HIGHUSER is used.

I don't see why the tradeoff is valid when it comes to the LRU but not
valid when it comes to compaction and then I've to do a blind loop of
(for-each-zone-in-the-node-in-reverse { compact_zone_order(zone) })
which works worse than before and works worse than a
zone-boundary-less compaction based on the node model.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
  2016-08-19 13:12   ` Andrea Arcangeli
@ 2016-08-19 14:53     ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-19 14:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Aug 19, 2016 at 03:12:00PM +0200, Andrea Arcangeli wrote:
> Hello Mel,
> 
> On Fri, Jul 08, 2016 at 10:34:36AM +0100, Mel Gorman wrote:
> > Minor changes this time
> > 
> > Changelog since v8
> > This is the latest version of a series that moves LRUs from the zones to
> 
> I'm afraid this is a bit incomplete...
> 

Compaction is not the same as LRU management.

> I had troubles in rebasing the compaction-enabled zone_reclaim feature
> (now node_reclaim) to the node model.

I'm not familiar with this although from the name, I can guess what it's
doing -- migrating pages from lowmem instead of reclaiming.

> That is because compaction is
> still zone based, and so I would need to do a loop of compaction calls
> (for each zone in the node), but what's the point? Movable memory can
> always go anywhere, can't it?

That is not guaranteed. At the time of migration, it is unknown if the
original allocation had addressing limitations or not. I did not audit
the address-limited allocations to see if any of them allow migration.

The filesystems would be the ones that need careful auditing. There are
some places that add lowmem pages to the LRU but far less obvious if any
of them would successfully migrate.

I'm not familiar with the specifics of the series you're working on but
as compaction was zone-based, you'd have to loop across the zones whether
the LRU is node or zone based. Even if cross-zone compaction was allowed,
it does not make a difference how the LRUs are managed.

Historically, the possibility that pages being compacted were address-limited
was the first reason didn't compact across zones. The other was that it
could introduce page aging problems. For example, migrating DMA32 to a
small NORMAL potentially allowed the page to be reclaimed prematurely by
reclaim. That is less of a concern with node-lru.

> So it would be better to compact across
> the whole node without care of the zone boundaries.

That is likely true as long as migration is always towards higher address.

> Then if the
> classzone_idx passed to compaction is not for the highest classzone,
> it'll do zone_reclaim and focus itself on the lower zones (but it can
> still cross the zone boundaries among those lower zones).
> 
> No matter how I tweak my code it doesn't make much sense to do a
> manual loop and leave compaction unable to cross zone boundaries. Is
> anybody working to complete this work to make compaction work on node
> basis instead of zone basis?

Not that I'm aware of but compaction across zones is not directly related
to LRU management.

> Or am I missing something for why
> compaction scan "lowpfn, highpfn" starting positions cannot possibly
> cross zone boundaries?
> 

An audit of all additions to the LRU that are address-limited allocations
is required to determine if any of those pages can migrate.

> I'm also uncertain what's the meaning now of zonelist_order=z (default
> setting) considering it'll always behave like zone_order=n
> anyway...

I'm not sure I understand. The zone allocation preference has the same
meaning as it always had.

>  On the same lines, I'm also uncertain of the meaning of the
> zonelist in the first place and why it's not a "nodelist +
> classzone_idx". Why is there still a zonelist_order=z default setting

On 64-bit, the default order is NODE. Are you using 32-bit NUMA systems?

> and a zonelist_order option in the first place, and a zonelist instead
> of a nodelist?
> 

The zonelist ordering is still required to satisfy address-limited allocation
requests. If it wasn't, free pages could be managed on a per-node basis.

> I use zonelist_order=n on my NUMA systems and I always liked the LRU
> to be per-node (despite it uses more CPU when you allocate from a
> lower classzone as you need to skip the pages of the higher zones not
> contained in the classzone_idx). So to be clear I'm not against this
> work (I tend to believe there are more pros than cons), but to port
> some code to the node model in the right way, I'd need to do too much
> work myself on the compaction side.
> 

Compaction working across zones would be nice to have unconditionally.
It's ortogonal to whether LRUs are managed per-node or not.

> Also note, the main security left that allows this change to work
> stable is in the lowmem reserve ratio feature in the page allocator
> that prevents lower classzones to be completely filled by non movable
> allocations from higher classzones (i.e. pagetables). As there's no
> priority anymore to start shrinking from the higher zone of the
> classzone_idx of the allocation (especially effective logic if using
> zonelist_order=z which happens to be the default, even though I almost
> always use zonelist_order=n which in fact already behaved much closer
> to the new behavior). The removal of the bias against the highest zone
> to me is the biggest cons in terms of stability in the corner cases,
> overall but I believe the security of the lowmem reserve ratio should
> suffice.
> 

It's expected that the lowmem reserve ratio will suffice with the corner
case of lowmem-restricted allocations potentially having to sacn more.

> I also expect this work to make negligible difference for those
> systems where DMA32 and DMA zones don't exist or are tiny, as the
> node:zone relation is practically already 1:1 there. I believe this
> actually will help more in systems where the DMA32 zone is relevant if
> compared to the total memory size (as long as there are not too many
> DMA32 allocations from pci32 devices, and the zone exists just in
> case, for an lowmem allocation once in a while). So this isn't a
> change for the long run, it'll be more noticeable on low end systems
> or highmem 32bit systems, and it's going to be a noop if you've got a
> terabytes of RAM (perhaps some pointer dereference is avoided, but
> that difference should get not measurable).
> 

32-bit systems with large highmem zones are expected to be a rarity. It
was a different story 10 years ago. A system with terabytes of RAM is
not going to be 32-bit.

> On a side note the compaction enabled node_reclaim that makes
> node_reclaim fully effective with THP on, works better with
> zonelist_order=z too, so it should work even better with the node
> model that practically makes zonelist_order=z impossible to achieve
> any longer (which also shows it was a bad default and it was good idea
> to manually set it to =n :). It's just the compaction zone model that
> forces me to write a for-each-zone loop that isn't ideal and it would
> defeat the purpose of the node model as far as compaction is concerned.
> 

Compacting across zones was/is a problem regardless of how the LRU is
managed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
@ 2016-08-19 14:53     ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-19 14:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Aug 19, 2016 at 03:12:00PM +0200, Andrea Arcangeli wrote:
> Hello Mel,
> 
> On Fri, Jul 08, 2016 at 10:34:36AM +0100, Mel Gorman wrote:
> > Minor changes this time
> > 
> > Changelog since v8
> > This is the latest version of a series that moves LRUs from the zones to
> 
> I'm afraid this is a bit incomplete...
> 

Compaction is not the same as LRU management.

> I had troubles in rebasing the compaction-enabled zone_reclaim feature
> (now node_reclaim) to the node model.

I'm not familiar with this although from the name, I can guess what it's
doing -- migrating pages from lowmem instead of reclaiming.

> That is because compaction is
> still zone based, and so I would need to do a loop of compaction calls
> (for each zone in the node), but what's the point? Movable memory can
> always go anywhere, can't it?

That is not guaranteed. At the time of migration, it is unknown if the
original allocation had addressing limitations or not. I did not audit
the address-limited allocations to see if any of them allow migration.

The filesystems would be the ones that need careful auditing. There are
some places that add lowmem pages to the LRU but far less obvious if any
of them would successfully migrate.

I'm not familiar with the specifics of the series you're working on but
as compaction was zone-based, you'd have to loop across the zones whether
the LRU is node or zone based. Even if cross-zone compaction was allowed,
it does not make a difference how the LRUs are managed.

Historically, the possibility that pages being compacted were address-limited
was the first reason didn't compact across zones. The other was that it
could introduce page aging problems. For example, migrating DMA32 to a
small NORMAL potentially allowed the page to be reclaimed prematurely by
reclaim. That is less of a concern with node-lru.

> So it would be better to compact across
> the whole node without care of the zone boundaries.

That is likely true as long as migration is always towards higher address.

> Then if the
> classzone_idx passed to compaction is not for the highest classzone,
> it'll do zone_reclaim and focus itself on the lower zones (but it can
> still cross the zone boundaries among those lower zones).
> 
> No matter how I tweak my code it doesn't make much sense to do a
> manual loop and leave compaction unable to cross zone boundaries. Is
> anybody working to complete this work to make compaction work on node
> basis instead of zone basis?

Not that I'm aware of but compaction across zones is not directly related
to LRU management.

> Or am I missing something for why
> compaction scan "lowpfn, highpfn" starting positions cannot possibly
> cross zone boundaries?
> 

An audit of all additions to the LRU that are address-limited allocations
is required to determine if any of those pages can migrate.

> I'm also uncertain what's the meaning now of zonelist_order=z (default
> setting) considering it'll always behave like zone_order=n
> anyway...

I'm not sure I understand. The zone allocation preference has the same
meaning as it always had.

>  On the same lines, I'm also uncertain of the meaning of the
> zonelist in the first place and why it's not a "nodelist +
> classzone_idx". Why is there still a zonelist_order=z default setting

On 64-bit, the default order is NODE. Are you using 32-bit NUMA systems?

> and a zonelist_order option in the first place, and a zonelist instead
> of a nodelist?
> 

The zonelist ordering is still required to satisfy address-limited allocation
requests. If it wasn't, free pages could be managed on a per-node basis.

> I use zonelist_order=n on my NUMA systems and I always liked the LRU
> to be per-node (despite it uses more CPU when you allocate from a
> lower classzone as you need to skip the pages of the higher zones not
> contained in the classzone_idx). So to be clear I'm not against this
> work (I tend to believe there are more pros than cons), but to port
> some code to the node model in the right way, I'd need to do too much
> work myself on the compaction side.
> 

Compaction working across zones would be nice to have unconditionally.
It's ortogonal to whether LRUs are managed per-node or not.

> Also note, the main security left that allows this change to work
> stable is in the lowmem reserve ratio feature in the page allocator
> that prevents lower classzones to be completely filled by non movable
> allocations from higher classzones (i.e. pagetables). As there's no
> priority anymore to start shrinking from the higher zone of the
> classzone_idx of the allocation (especially effective logic if using
> zonelist_order=z which happens to be the default, even though I almost
> always use zonelist_order=n which in fact already behaved much closer
> to the new behavior). The removal of the bias against the highest zone
> to me is the biggest cons in terms of stability in the corner cases,
> overall but I believe the security of the lowmem reserve ratio should
> suffice.
> 

It's expected that the lowmem reserve ratio will suffice with the corner
case of lowmem-restricted allocations potentially having to sacn more.

> I also expect this work to make negligible difference for those
> systems where DMA32 and DMA zones don't exist or are tiny, as the
> node:zone relation is practically already 1:1 there. I believe this
> actually will help more in systems where the DMA32 zone is relevant if
> compared to the total memory size (as long as there are not too many
> DMA32 allocations from pci32 devices, and the zone exists just in
> case, for an lowmem allocation once in a while). So this isn't a
> change for the long run, it'll be more noticeable on low end systems
> or highmem 32bit systems, and it's going to be a noop if you've got a
> terabytes of RAM (perhaps some pointer dereference is avoided, but
> that difference should get not measurable).
> 

32-bit systems with large highmem zones are expected to be a rarity. It
was a different story 10 years ago. A system with terabytes of RAM is
not going to be 32-bit.

> On a side note the compaction enabled node_reclaim that makes
> node_reclaim fully effective with THP on, works better with
> zonelist_order=z too, so it should work even better with the node
> model that practically makes zonelist_order=z impossible to achieve
> any longer (which also shows it was a bad default and it was good idea
> to manually set it to =n :). It's just the compaction zone model that
> forces me to write a for-each-zone loop that isn't ideal and it would
> defeat the purpose of the node model as far as compaction is concerned.
> 

Compacting across zones was/is a problem regardless of how the LRU is
managed.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
  2016-08-19 14:53     ` Mel Gorman
@ 2016-08-19 15:32       ` Andrea Arcangeli
  -1 siblings, 0 replies; 220+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 15:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Aug 19, 2016 at 03:53:59PM +0100, Mel Gorman wrote:
> Compaction is not the same as LRU management.

Sure but compaction is invoked by reclaim and if reclaim is node-wide,
it makes more sense if compaction would be node-wide as well.

Otherwise what you compact? Just the higher zone, or all of them?

> That is not guaranteed. At the time of migration, it is unknown if the
> original allocation had addressing limitations or not. I did not audit
> the address-limited allocations to see if any of them allow migration.
> 
> The filesystems would be the ones that need careful auditing. There are
> some places that add lowmem pages to the LRU but far less obvious if any
> of them would successfully migrate.

True but that's a tradeoff. This whole patchset is about optimizing
the common case of allocations from the highest possible
classzone_idx, as if a system has no older hardware, the lowmem
classzone_idx allocations practically never happens.

If that tradeoff is valid, retaining migrability of memory allocated
not with the highest classzone_idx is an optimization that goes in the
opposite direction.

Retaining such "optimization" means increasing the likelihood of
succeeding high order allocations from lower zones yes, but it screws
with the main concept of:

     reclaim node wide at high order -> failure to allocate -> compaction node wide

So if the tradeoff works for reclaim_node I don't see why we should
"optimize" for the opposite case in compaction.

> That is likely true as long as migration is always towards higher address.

Well with a node-wide LRU we got rid of any "towards higher address"
bias. So why to worry about these concepts for high order allocations
provided by compaction, if order 0 allocations provided purely by
shrink_node won't care at all about such a concept any longer?

It sounds backwards to even worry about "towards higher address" in
compaction which is only relevant to provide high order allocations,
when the zero order 4kb allocations will not care at all to go
"towards higher address" anymore.

> An audit of all additions to the LRU that are address-limited allocations
> is required to determine if any of those pages can migrate.

Agreed.

Either that or the pageblock needs to be marked with the classzone_idx
that it can tolerate. And a per-allocation-classzone highpfn,lowpfn
markers needs to be added, instead of being global.

> I'm not sure I understand. The zone allocation preference has the same
> meaning as it always had.

What I mean is zonelist is like:

=n

	[ node0->zone0, node0->zone1, node1->zone0, node1->zone1 ]

or:

=z

	[ node0->zone0, node1->zone0, node0->zone1, node1->zone1 ]

So why to call in order (first case above):

1  shrink_node(node0->zone0->node, allocation_classzone_idx /* to limit */)
2  shrink_node(node0->zone1->node, allocation_classzone_idx /* to limit */)
3  shrink_node(node1->zone0->node, allocation_classzone_idx /* to limit */)
4  shrink_node(node1->zone1->node, allocation_classzone_idx /* to limit */)

When in fact 1 and 2 are doing the exact same thing? And 3, 4 are
doing the same thing as well? All it matters is the classzone_idx, the
zonelist looks irrelevant and an unnecessary repetition.

It's possible I missed something in the code, perhaps I misunderstand
how shrink_node is invoked through the zonelist.

This zonelist vs nodelist issue however is orthogonal to the
per-zone vs per-node compaction issue discussed earlier.

> On 64-bit, the default order is NODE. Are you using 32-bit NUMA systems?

Ah that changed in 3.16 but my grub cmdlines didn't change since
then... Good that the default is =n for 64bit indeed.

> The zonelist ordering is still required to satisfy address-limited allocation
> requests. If it wasn't, free pages could be managed on a per-node basis.

But you pass the node, not the zone, to the shrink_node function, the
whole point I'm making is that the "address-limiting" is provided by
the classzone_idx alone with your change, not by the zonelist anymore.

static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)

There's no zone pointer in "scan control". There's the reclaim_idx
(aka classzone_idx, would have been more clear to call it
classzone_idx).

Then when isolating you do:

		if (page_zonenum(page) > sc->reclaim_idx) {

Confirming the limiting factor comes from reclaim_idx
(i.e. allocation_classzone_idx).

Again I may be missing something in the code...

> Compacting across zones was/is a problem regardless of how the LRU is
> managed.

It is a separate problem to make it work, but wanting to making
compaction work node-wide is very much a side effect of shrink_node
not going serially into zones but going node-wide at all times. Hence
it doesn't make sense anymore to only compact a single zone before
invoking shrink_node again.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
@ 2016-08-19 15:32       ` Andrea Arcangeli
  0 siblings, 0 replies; 220+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 15:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Aug 19, 2016 at 03:53:59PM +0100, Mel Gorman wrote:
> Compaction is not the same as LRU management.

Sure but compaction is invoked by reclaim and if reclaim is node-wide,
it makes more sense if compaction would be node-wide as well.

Otherwise what you compact? Just the higher zone, or all of them?

> That is not guaranteed. At the time of migration, it is unknown if the
> original allocation had addressing limitations or not. I did not audit
> the address-limited allocations to see if any of them allow migration.
> 
> The filesystems would be the ones that need careful auditing. There are
> some places that add lowmem pages to the LRU but far less obvious if any
> of them would successfully migrate.

True but that's a tradeoff. This whole patchset is about optimizing
the common case of allocations from the highest possible
classzone_idx, as if a system has no older hardware, the lowmem
classzone_idx allocations practically never happens.

If that tradeoff is valid, retaining migrability of memory allocated
not with the highest classzone_idx is an optimization that goes in the
opposite direction.

Retaining such "optimization" means increasing the likelihood of
succeeding high order allocations from lower zones yes, but it screws
with the main concept of:

     reclaim node wide at high order -> failure to allocate -> compaction node wide

So if the tradeoff works for reclaim_node I don't see why we should
"optimize" for the opposite case in compaction.

> That is likely true as long as migration is always towards higher address.

Well with a node-wide LRU we got rid of any "towards higher address"
bias. So why to worry about these concepts for high order allocations
provided by compaction, if order 0 allocations provided purely by
shrink_node won't care at all about such a concept any longer?

It sounds backwards to even worry about "towards higher address" in
compaction which is only relevant to provide high order allocations,
when the zero order 4kb allocations will not care at all to go
"towards higher address" anymore.

> An audit of all additions to the LRU that are address-limited allocations
> is required to determine if any of those pages can migrate.

Agreed.

Either that or the pageblock needs to be marked with the classzone_idx
that it can tolerate. And a per-allocation-classzone highpfn,lowpfn
markers needs to be added, instead of being global.

> I'm not sure I understand. The zone allocation preference has the same
> meaning as it always had.

What I mean is zonelist is like:

=n

	[ node0->zone0, node0->zone1, node1->zone0, node1->zone1 ]

or:

=z

	[ node0->zone0, node1->zone0, node0->zone1, node1->zone1 ]

So why to call in order (first case above):

1  shrink_node(node0->zone0->node, allocation_classzone_idx /* to limit */)
2  shrink_node(node0->zone1->node, allocation_classzone_idx /* to limit */)
3  shrink_node(node1->zone0->node, allocation_classzone_idx /* to limit */)
4  shrink_node(node1->zone1->node, allocation_classzone_idx /* to limit */)

When in fact 1 and 2 are doing the exact same thing? And 3, 4 are
doing the same thing as well? All it matters is the classzone_idx, the
zonelist looks irrelevant and an unnecessary repetition.

It's possible I missed something in the code, perhaps I misunderstand
how shrink_node is invoked through the zonelist.

This zonelist vs nodelist issue however is orthogonal to the
per-zone vs per-node compaction issue discussed earlier.

> On 64-bit, the default order is NODE. Are you using 32-bit NUMA systems?

Ah that changed in 3.16 but my grub cmdlines didn't change since
then... Good that the default is =n for 64bit indeed.

> The zonelist ordering is still required to satisfy address-limited allocation
> requests. If it wasn't, free pages could be managed on a per-node basis.

But you pass the node, not the zone, to the shrink_node function, the
whole point I'm making is that the "address-limiting" is provided by
the classzone_idx alone with your change, not by the zonelist anymore.

static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)

There's no zone pointer in "scan control". There's the reclaim_idx
(aka classzone_idx, would have been more clear to call it
classzone_idx).

Then when isolating you do:

		if (page_zonenum(page) > sc->reclaim_idx) {

Confirming the limiting factor comes from reclaim_idx
(i.e. allocation_classzone_idx).

Again I may be missing something in the code...

> Compacting across zones was/is a problem regardless of how the LRU is
> managed.

It is a separate problem to make it work, but wanting to making
compaction work node-wide is very much a side effect of shrink_node
not going serially into zones but going node-wide at all times. Hence
it doesn't make sense anymore to only compact a single zone before
invoking shrink_node again.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
  2016-08-19 15:32       ` Andrea Arcangeli
@ 2016-08-19 15:55         ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-19 15:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Aug 19, 2016 at 05:32:59PM +0200, Andrea Arcangeli wrote:
> On Fri, Aug 19, 2016 at 03:53:59PM +0100, Mel Gorman wrote:
> > Compaction is not the same as LRU management.
> 
> Sure but compaction is invoked by reclaim and if reclaim is node-wide,
> it makes more sense if compaction would be node-wide as well.
> 

It might be desirable but it's not necessarily effective.  Reclaim/compaction
was always, at best, a heuristic that replaced lumpy-reclaim.

> Otherwise what you compact? Just the higher zone, or all of them?
> 

Right now, all of them taking into account whether the compaction is likely
to succeed.

> > That is not guaranteed. At the time of migration, it is unknown if the
> > original allocation had addressing limitations or not. I did not audit
> > the address-limited allocations to see if any of them allow migration.
> > 
> > The filesystems would be the ones that need careful auditing. There are
> > some places that add lowmem pages to the LRU but far less obvious if any
> > of them would successfully migrate.
> 
> True but that's a tradeoff. This whole patchset is about optimizing
> the common case of allocations from the highest possible
> classzone_idx, as if a system has no older hardware, the lowmem
> classzone_idx allocations practically never happens.
> 

I'll have to take your word for it. It still is the case that the highest
zone is preferred for allocations but the timing will be different due to
when reclaim is triggered and what is reclaimed.

As long as it is reclaiming, the page age will still be priority. Unless the
LRU is almost perfectly interleaves between zones, the effect of node-reclaim
will be that there is a likelihood that reclaim/compaction will succeed
for at least one zone with similar success rates to zone reclaim.

> If that tradeoff is valid, retaining migrability of memory allocated
> not with the highest classzone_idx is an optimization that goes in the
> opposite direction.
> 
> Retaining such "optimization" means increasing the likelihood of
> succeeding high order allocations from lower zones yes, but it screws
> with the main concept of:
> 
>      reclaim node wide at high order -> failure to allocate -> compaction node wide
> 
> So if the tradeoff works for reclaim_node I don't see why we should
> "optimize" for the opposite case in compaction.
> 
> > That is likely true as long as migration is always towards higher address.
> 
> Well with a node-wide LRU we got rid of any "towards higher address"

The allocation preferences of the zonelist continue to favour higher
zones. If anything, there is a stronger preference because zone-lru used
the fair-zone allocation policy to interleave pages between zones to avoid
page age inversion problems with smaller sized high zones.

> bias. So why to worry about these concepts for high order allocations
> provided by compaction, if order 0 allocations provided purely by
> shrink_node won't care at all about such a concept any longer?
> 

At worst, reclaim/compaction is slighly weakened. As the reclaim is in
LRU order, reclaim/compaction will still make progress for at least one
zone at a time.

> It sounds backwards to even worry about "towards higher address" in
> compaction which is only relevant to provide high order allocations,
> when the zero order 4kb allocations will not care at all to go
> "towards higher address" anymore.
> 

Towards higher addresses in compaction is an implementation detail when
it's not going across zones. Specifically, it was the easiest way to avoid
using the same pageblocks as both migration sources and targets.

> > An audit of all additions to the LRU that are address-limited allocations
> > is required to determine if any of those pages can migrate.
> 
> Agreed.
> 
> Either that or the pageblock needs to be marked with the classzone_idx
> that it can tolerate. And a per-allocation-classzone highpfn,lowpfn
> markers needs to be added, instead of being global.
> 

That would be somewhat severe. A single address-restricted allocation
would prevent any pages in that pageblock migrating out of the zone.
The classzone_idx could only be cleared when the entire pageblock was freed.

> > I'm not sure I understand. The zone allocation preference has the same
> > meaning as it always had.
> 
> What I mean is zonelist is like:
> 
> =n
> 
> 	[ node0->zone0, node0->zone1, node1->zone0, node1->zone1 ]
> 
> or:
> 
> =z
> 
> 	[ node0->zone0, node1->zone0, node0->zone1, node1->zone1 ]
> 
> So why to call in order (first case above):
> 
> 1  shrink_node(node0->zone0->node, allocation_classzone_idx /* to limit */)
> 2  shrink_node(node0->zone1->node, allocation_classzone_idx /* to limit */)
> 3  shrink_node(node1->zone0->node, allocation_classzone_idx /* to limit */)
> 4  shrink_node(node1->zone1->node, allocation_classzone_idx /* to limit */)
> 

For =n in the direct reclaim case, there is a check to see if the pgdat
has changed when reclaming in zonelist order. 1 will call shrink_node, 2
will skip, 3 will shrink_node, 4 will skip.

For =z, shrink_node will be called multiple times but it's also not the
default case. If it is a problem then direct reclaim would need to use a
bitmask of nodes shrunk during a zonelist traversal.

> It's possible I missed something in the code, perhaps I misunderstand
> how shrink_node is invoked through the zonelist.
> 

Look for this bit

                        /*
                         * Shrink each node in the zonelist once. If the
                         * zonelist is ordered by zone (not the default)
                         * then a node may be shrunk multiple times but in that
                         * case the user prefers lower zones being preserved.
                         */
                        if (zone->zone_pgdat == last_pgdat)
                                continue;

> > The zonelist ordering is still required to satisfy address-limited allocation
> > requests. If it wasn't, free pages could be managed on a per-node basis.
> 
> But you pass the node, not the zone, to the shrink_node function, the
> whole point I'm making is that the "address-limiting" is provided by
> the classzone_idx alone with your change, not by the zonelist anymore.
> 
> static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> 
> There's no zone pointer in "scan control". There's the reclaim_idx
> (aka classzone_idx, would have been more clear to call it
> classzone_idx).
> 

There is no need for a zone pointer as the reclaim_idx is sufficient. In
the node-ordered case, the node will be shrunk once based on the
restrictions of reclaim_idx.

> > Compacting across zones was/is a problem regardless of how the LRU is
> > managed.
> 
> It is a separate problem to make it work, but wanting to making
> compaction work node-wide is very much a side effect of shrink_node
> not going serially into zones but going node-wide at all times. Hence
> it doesn't make sense anymore to only compact a single zone before
> invoking shrink_node again.

Calling shrink_node will increase the chance of a single zone successfully
compacting. It's not guaranteed to succeed but then again, it never was.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 00/34] Move LRU page reclaim from zones to nodes v9
@ 2016-08-19 15:55         ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-19 15:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML

On Fri, Aug 19, 2016 at 05:32:59PM +0200, Andrea Arcangeli wrote:
> On Fri, Aug 19, 2016 at 03:53:59PM +0100, Mel Gorman wrote:
> > Compaction is not the same as LRU management.
> 
> Sure but compaction is invoked by reclaim and if reclaim is node-wide,
> it makes more sense if compaction would be node-wide as well.
> 

It might be desirable but it's not necessarily effective.  Reclaim/compaction
was always, at best, a heuristic that replaced lumpy-reclaim.

> Otherwise what you compact? Just the higher zone, or all of them?
> 

Right now, all of them taking into account whether the compaction is likely
to succeed.

> > That is not guaranteed. At the time of migration, it is unknown if the
> > original allocation had addressing limitations or not. I did not audit
> > the address-limited allocations to see if any of them allow migration.
> > 
> > The filesystems would be the ones that need careful auditing. There are
> > some places that add lowmem pages to the LRU but far less obvious if any
> > of them would successfully migrate.
> 
> True but that's a tradeoff. This whole patchset is about optimizing
> the common case of allocations from the highest possible
> classzone_idx, as if a system has no older hardware, the lowmem
> classzone_idx allocations practically never happens.
> 

I'll have to take your word for it. It still is the case that the highest
zone is preferred for allocations but the timing will be different due to
when reclaim is triggered and what is reclaimed.

As long as it is reclaiming, the page age will still be priority. Unless the
LRU is almost perfectly interleaves between zones, the effect of node-reclaim
will be that there is a likelihood that reclaim/compaction will succeed
for at least one zone with similar success rates to zone reclaim.

> If that tradeoff is valid, retaining migrability of memory allocated
> not with the highest classzone_idx is an optimization that goes in the
> opposite direction.
> 
> Retaining such "optimization" means increasing the likelihood of
> succeeding high order allocations from lower zones yes, but it screws
> with the main concept of:
> 
>      reclaim node wide at high order -> failure to allocate -> compaction node wide
> 
> So if the tradeoff works for reclaim_node I don't see why we should
> "optimize" for the opposite case in compaction.
> 
> > That is likely true as long as migration is always towards higher address.
> 
> Well with a node-wide LRU we got rid of any "towards higher address"

The allocation preferences of the zonelist continue to favour higher
zones. If anything, there is a stronger preference because zone-lru used
the fair-zone allocation policy to interleave pages between zones to avoid
page age inversion problems with smaller sized high zones.

> bias. So why to worry about these concepts for high order allocations
> provided by compaction, if order 0 allocations provided purely by
> shrink_node won't care at all about such a concept any longer?
> 

At worst, reclaim/compaction is slighly weakened. As the reclaim is in
LRU order, reclaim/compaction will still make progress for at least one
zone at a time.

> It sounds backwards to even worry about "towards higher address" in
> compaction which is only relevant to provide high order allocations,
> when the zero order 4kb allocations will not care at all to go
> "towards higher address" anymore.
> 

Towards higher addresses in compaction is an implementation detail when
it's not going across zones. Specifically, it was the easiest way to avoid
using the same pageblocks as both migration sources and targets.

> > An audit of all additions to the LRU that are address-limited allocations
> > is required to determine if any of those pages can migrate.
> 
> Agreed.
> 
> Either that or the pageblock needs to be marked with the classzone_idx
> that it can tolerate. And a per-allocation-classzone highpfn,lowpfn
> markers needs to be added, instead of being global.
> 

That would be somewhat severe. A single address-restricted allocation
would prevent any pages in that pageblock migrating out of the zone.
The classzone_idx could only be cleared when the entire pageblock was freed.

> > I'm not sure I understand. The zone allocation preference has the same
> > meaning as it always had.
> 
> What I mean is zonelist is like:
> 
> =n
> 
> 	[ node0->zone0, node0->zone1, node1->zone0, node1->zone1 ]
> 
> or:
> 
> =z
> 
> 	[ node0->zone0, node1->zone0, node0->zone1, node1->zone1 ]
> 
> So why to call in order (first case above):
> 
> 1  shrink_node(node0->zone0->node, allocation_classzone_idx /* to limit */)
> 2  shrink_node(node0->zone1->node, allocation_classzone_idx /* to limit */)
> 3  shrink_node(node1->zone0->node, allocation_classzone_idx /* to limit */)
> 4  shrink_node(node1->zone1->node, allocation_classzone_idx /* to limit */)
> 

For =n in the direct reclaim case, there is a check to see if the pgdat
has changed when reclaming in zonelist order. 1 will call shrink_node, 2
will skip, 3 will shrink_node, 4 will skip.

For =z, shrink_node will be called multiple times but it's also not the
default case. If it is a problem then direct reclaim would need to use a
bitmask of nodes shrunk during a zonelist traversal.

> It's possible I missed something in the code, perhaps I misunderstand
> how shrink_node is invoked through the zonelist.
> 

Look for this bit

                        /*
                         * Shrink each node in the zonelist once. If the
                         * zonelist is ordered by zone (not the default)
                         * then a node may be shrunk multiple times but in that
                         * case the user prefers lower zones being preserved.
                         */
                        if (zone->zone_pgdat == last_pgdat)
                                continue;

> > The zonelist ordering is still required to satisfy address-limited allocation
> > requests. If it wasn't, free pages could be managed on a per-node basis.
> 
> But you pass the node, not the zone, to the shrink_node function, the
> whole point I'm making is that the "address-limiting" is provided by
> the classzone_idx alone with your change, not by the zonelist anymore.
> 
> static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> 
> There's no zone pointer in "scan control". There's the reclaim_idx
> (aka classzone_idx, would have been more clear to call it
> classzone_idx).
> 

There is no need for a zone pointer as the reclaim_idx is sufficient. In
the node-ordered case, the node will be shrunk once based on the
restrictions of reclaim_idx.

> > Compacting across zones was/is a problem regardless of how the LRU is
> > managed.
> 
> It is a separate problem to make it work, but wanting to making
> compaction work node-wide is very much a side effect of shrink_node
> not going serially into zones but going node-wide at all times. Hence
> it doesn't make sense anymore to only compact a single zone before
> invoking shrink_node again.

Calling shrink_node will increase the chance of a single zone successfully
compacting. It's not guaranteed to succeed but then again, it never was.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-07-08  9:34   ` Mel Gorman
@ 2016-08-29  9:38     ` Srikar Dronamraju
  -1 siblings, 0 replies; 220+ messages in thread
From: Srikar Dronamraju @ 2016-08-29  9:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
> thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
> patch gets rid of many of the node-based versus zone-based decisions.
> 
> o A node is considered balanced when any eligible lower zone is balanced.
>   This eliminates one class of age-inversion problem because we avoid
>   reclaiming a newer page just because it's in the wrong zone
> o pgdat_balanced disappears because we now only care about one zone being
>   balanced.
> o Some anomalies related to writeback and congestion tracking being based on
>   zones disappear.
> o kswapd no longer has to take care to reclaim zones in the reverse order
>   that the page allocator uses.
> o Most importantly of all, reclaim from node 0 with multiple zones will
>   have similar aging and reclaiming characteristics as every
>   other node.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

This patch seems to hurt FA_DUMP functionality. This behaviour is not
seen on v4.7 but only after this patch.

So when a kernel on a multinode machine with memblock_reserve() such
that most of the nodes have zero available memory, kswapd seems to be
consuming 100% of the time.

This is independent of CONFIG_DEFERRED_STRUCT_PAGE, i.e this problem is
seen even with parallel page struct initialization disabled.


top - 13:48:52 up  1:07,  3 users,  load average: 15.25, 15.32, 21.18
Tasks: 11080 total,  16 running, 11064 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  2.7 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  15929941+total,  8637824 used, 15843563+free,     2304 buffers
KiB Swap: 91898816 total,        0 used, 91898816 free.  1381312 cached Mem

    PID USER      PR  NI    VIRT    RES    SHR S     %CPU  %MEM     TIME+ COMMAND
  10824 root      20   0       0      0      0 R  100.000 0.000  65:30.76 kswapd2
  10837 root      20   0       0      0      0 R  100.000 0.000  65:31.17 kswapd15
  10823 root      20   0       0      0      0 R   97.059 0.000  65:30.85 kswapd1
  10825 root      20   0       0      0      0 R   97.059 0.000  65:31.10 kswapd3
  10826 root      20   0       0      0      0 R   97.059 0.000  65:31.18 kswapd4
  10827 root      20   0       0      0      0 R   97.059 0.000  65:31.08 kswapd5
  10828 root      20   0       0      0      0 R   97.059 0.000  65:30.91 kswapd6
  10829 root      20   0       0      0      0 R   97.059 0.000  65:31.17 kswapd7
  10830 root      20   0       0      0      0 R   97.059 0.000  65:31.17 kswapd8
  10831 root      20   0       0      0      0 R   97.059 0.000  65:31.18 kswapd9
  10832 root      20   0       0      0      0 R   97.059 0.000  65:31.12 kswapd10
  10833 root      20   0       0      0      0 R   97.059 0.000  65:31.19 kswapd11
  10834 root      20   0       0      0      0 R   97.059 0.000  65:31.13 kswapd12
  10835 root      20   0       0      0      0 R   97.059 0.000  65:31.09 kswapd13
  10836 root      20   0       0      0      0 R   97.059 0.000  65:31.18 kswapd14
 277155 srikar    20   0   16960  13760   3264 R   52.941 0.001   0:00.37 top

top - 13:48:55 up  1:07,  3 users,  load average: 15.23, 15.32, 21.15
Tasks: 11080 total,  16 running, 11064 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  15929941+total,  8637824 used, 15843563+free,     2304 buffers
KiB Swap: 91898816 total,        0 used, 91898816 free.  1381312 cached Mem

    PID USER      PR  NI    VIRT    RES    SHR S     %CPU  %MEM     TIME+ COMMAND
  10836 root      20   0       0      0      0 R  100.000 0.000  65:33.39 kswapd14
  10823 root      20   0       0      0      0 R  100.000 0.000  65:33.05 kswapd1
  10824 root      20   0       0      0      0 R  100.000 0.000  65:32.96 kswapd2
  10825 root      20   0       0      0      0 R  100.000 0.000  65:33.30 kswapd3
  10826 root      20   0       0      0      0 R  100.000 0.000  65:33.38 kswapd4
  10827 root      20   0       0      0      0 R  100.000 0.000  65:33.28 kswapd5
  10828 root      20   0       0      0      0 R  100.000 0.000  65:33.11 kswapd6
  10829 root      20   0       0      0      0 R  100.000 0.000  65:33.37 kswapd7
  10830 root      20   0       0      0      0 R  100.000 0.000  65:33.37 kswapd8
  10831 root      20   0       0      0      0 R  100.000 0.000  65:33.38 kswapd9
  10832 root      20   0       0      0      0 R  100.000 0.000  65:33.32 kswapd10
  10833 root      20   0       0      0      0 R  100.000 0.000  65:33.39 kswapd11
  10834 root      20   0       0      0      0 R  100.000 0.000  65:33.33 kswapd12
  10835 root      20   0       0      0      0 R  100.000 0.000  65:33.29 kswapd13
  10837 root      20   0       0      0      0 R  100.000 0.000  65:33.37 kswapd15
 277155 srikar    20   0   17536  14912   3264 R    9.091 0.001   0:00.57 top
   1092 root      rt   0       0      0      0 S    0.455 0.000   0:00.08 watchdog/178

Please see that there is no used swap space. However 15 kswapd threads
corresponding to 15 out of the 16 nodes are running full throttle.  The
only node 0 has memory, other nodes memory is fully reserved.

git bisect output
I tried git bisect between v4.7 and v4.8-rc3 filtered to mm/vmscan.c

# bad: [d7f05528eedb047efe2288cff777676b028747b6] mm, vmscan: account for skipped pages as a partial scan
# good: [b1123ea6d3b3da25af5c8a9d843bd07ab63213f4] mm: balloon: use general non-lru movable page feature
git bisect start 'HEAD' 'b1123ea6' '--' 'mm/vmscan.c'
# bad: [c4a25635b60d08853a3e4eaae3ab34419a36cfa2] mm: move vmscan writes and file write accounting to the node
git bisect bad c4a25635b60d08853a3e4eaae3ab34419a36cfa2
# bad: [38087d9b0360987a6db46c2c2c4ece37cd048abe] mm, vmscan: simplify the logic deciding whether kswapd sleeps
git bisect bad 38087d9b0360987a6db46c2c2c4ece37cd048abe
# good: [b2e18757f2c9d1cdd746a882e9878852fdec9501] mm, vmscan: begin reclaiming pages on a per-node basis
git bisect good b2e18757f2c9d1cdd746a882e9878852fdec9501
# bad: [1d82de618ddde0f1164e640f79af152f01994c18] mm, vmscan: make kswapd reclaim in terms of nodes
git bisect bad 1d82de618ddde0f1164e640f79af152f01994c18
# good: [f7b60926ebc05944f73d93ffaf6690503b796a88] mm, vmscan: have kswapd only scan based on the highest requested zone
git bisect good f7b60926ebc05944f73d93ffaf6690503b796a88
# first bad commit: [1d82de618ddde0f1164e640f79af152f01994c18] mm, vmscan: make kswapd reclaim in terms of nodes

Here is perf top output on the kernel where kswapd is hogging cpu.

-   93.50%     0.01%  [kernel]                 [k] kswapd
   - kswapd
      - 114.31% shrink_node
         - 111.51% shrink_node_memcg
            - pgdat_reclaimable
               - 95.51% pgdat_reclaimable_pages
                  - 86.34% pgdat_reclaimable_pages
                  - 6.69% _find_next_bit.part.0
                  - 2.47% find_next_bit
               - 14.46% pgdat_reclaimable
                 1.13% _find_next_bit.part.0
               + 0.30% find_next_bit
         - 2.38% shrink_slab
            - super_cache_count
               - 0
                  - __list_lru_count_one.isra.1
                       _raw_spin_lock
      - 28.04% pgdat_reclaimable
         - 23.97% pgdat_reclaimable_pages
            - 21.66% pgdat_reclaimable_pages
            - 1.69% _find_next_bit.part.0
              0.63% find_next_bit
         - 3.70% pgdat_reclaimable
           0.29% _find_next_bit.part.0
      - 16.33% zone_balanced
         - zone_watermark_ok_safe
            - 14.86% zone_watermark_ok_safe
              1.15% _find_next_bit.part.0
              0.31% find_next_bit
      - 2.72% prepare_kswapd_sleep
         - zone_balanced
            - zone_watermark_ok_safe
                 zone_watermark_ok_safe
-   80.72%    10.51%  [kernel]                 [k] pgdat_reclaimable
   - 140.49% pgdat_reclaimable
      - 138.40% pgdat_reclaimable_pages
         - 125.10% pgdat_reclaimable_pages
         - 9.71% _find_next_bit.part.0
         - 3.59% find_next_bit
        1.64% _find_next_bit.part.0
      + 0.44% find_next_bit
   - 21.03% ret_from_kernel_thread
        kthread
      - kswapd
         - 16.75% shrink_node
              shrink_node_memcg
         - 4.28% pgdat_reclaimable
              pgdat_reclaimable
-   69.17%    62.48%  [kernel]                 [k] pgdat_reclaimable_pages
   - 145.91% ret_from_kernel_thread
        kthread
   - 15.61% pgdat_reclaimable_pages
      - 11.33% _find_next_bit.part.0
      - 4.19% find_next_bit
-   66.18%     0.01%  [kernel]                 [k] shrink_node
   - shrink_node
      - 157.54% shrink_node_memcg
         - pgdat_reclaimable
            - 134.94% pgdat_reclaimable_pages
               - 121.99% pgdat_reclaimable_pages
               - 9.46% _find_next_bit.part.0
               - 3.49% find_next_bit
            - 20.44% pgdat_reclaimable
              1.59% _find_next_bit.part.0
            + 0.42% find_next_bit
      - 3.37% shrink_slab
         - super_cache_count
            - 0
               - __list_lru_count_one.isra.1
                    _raw_spin_lock
-   64.56%     0.03%  [kernel]                 [k] shrink_node_memcg
   - shrink_node_memcg
      - pgdat_reclaimable
         - 138.31% pgdat_reclaimable_pages
            - 125.04% pgdat_reclaimable_pages
            - 9.69% _find_next_bit.part.0
            - 3.58% find_next_bit
         - 20.95% pgdat_reclaimable
           1.63% _find_next_bit.part.0
         + 0.43% find_next_bit
    53.73%     0.00%  [kernel]                 [k] kthread
    53.73%     0.00%  [kernel]                 [k] ret_from_kernel_thread
-   11.04%    10.04%  [kernel]                 [k] zone_watermark_ok_safe
   - 146.80% ret_from_kernel_thread
        kthread
      - kswapd
         - 125.81% zone_balanced
              zone_watermark_ok_safe
         - 20.97% prepare_kswapd_sleep
              zone_balanced
              zone_watermark_ok_safe
   - 14.55% zone_watermark_ok_safe
        11.38% _find_next_bit.part.0
        3.06% find_next_bit
-   11.03%     0.00%  [kernel]                 [k] zone_balanced
   - zone_balanced
      - zone_watermark_ok_safe
           145.84% zone_watermark_ok_safe
           11.31% _find_next_bit.part.0
           3.04% find_next_bit


-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-08-29  9:38     ` Srikar Dronamraju
  0 siblings, 0 replies; 220+ messages in thread
From: Srikar Dronamraju @ 2016-08-29  9:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
> thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
> patch gets rid of many of the node-based versus zone-based decisions.
> 
> o A node is considered balanced when any eligible lower zone is balanced.
>   This eliminates one class of age-inversion problem because we avoid
>   reclaiming a newer page just because it's in the wrong zone
> o pgdat_balanced disappears because we now only care about one zone being
>   balanced.
> o Some anomalies related to writeback and congestion tracking being based on
>   zones disappear.
> o kswapd no longer has to take care to reclaim zones in the reverse order
>   that the page allocator uses.
> o Most importantly of all, reclaim from node 0 with multiple zones will
>   have similar aging and reclaiming characteristics as every
>   other node.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

This patch seems to hurt FA_DUMP functionality. This behaviour is not
seen on v4.7 but only after this patch.

So when a kernel on a multinode machine with memblock_reserve() such
that most of the nodes have zero available memory, kswapd seems to be
consuming 100% of the time.

This is independent of CONFIG_DEFERRED_STRUCT_PAGE, i.e this problem is
seen even with parallel page struct initialization disabled.


top - 13:48:52 up  1:07,  3 users,  load average: 15.25, 15.32, 21.18
Tasks: 11080 total,  16 running, 11064 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  2.7 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  15929941+total,  8637824 used, 15843563+free,     2304 buffers
KiB Swap: 91898816 total,        0 used, 91898816 free.  1381312 cached Mem

    PID USER      PR  NI    VIRT    RES    SHR S     %CPU  %MEM     TIME+ COMMAND
  10824 root      20   0       0      0      0 R  100.000 0.000  65:30.76 kswapd2
  10837 root      20   0       0      0      0 R  100.000 0.000  65:31.17 kswapd15
  10823 root      20   0       0      0      0 R   97.059 0.000  65:30.85 kswapd1
  10825 root      20   0       0      0      0 R   97.059 0.000  65:31.10 kswapd3
  10826 root      20   0       0      0      0 R   97.059 0.000  65:31.18 kswapd4
  10827 root      20   0       0      0      0 R   97.059 0.000  65:31.08 kswapd5
  10828 root      20   0       0      0      0 R   97.059 0.000  65:30.91 kswapd6
  10829 root      20   0       0      0      0 R   97.059 0.000  65:31.17 kswapd7
  10830 root      20   0       0      0      0 R   97.059 0.000  65:31.17 kswapd8
  10831 root      20   0       0      0      0 R   97.059 0.000  65:31.18 kswapd9
  10832 root      20   0       0      0      0 R   97.059 0.000  65:31.12 kswapd10
  10833 root      20   0       0      0      0 R   97.059 0.000  65:31.19 kswapd11
  10834 root      20   0       0      0      0 R   97.059 0.000  65:31.13 kswapd12
  10835 root      20   0       0      0      0 R   97.059 0.000  65:31.09 kswapd13
  10836 root      20   0       0      0      0 R   97.059 0.000  65:31.18 kswapd14
 277155 srikar    20   0   16960  13760   3264 R   52.941 0.001   0:00.37 top

top - 13:48:55 up  1:07,  3 users,  load average: 15.23, 15.32, 21.15
Tasks: 11080 total,  16 running, 11064 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  15929941+total,  8637824 used, 15843563+free,     2304 buffers
KiB Swap: 91898816 total,        0 used, 91898816 free.  1381312 cached Mem

    PID USER      PR  NI    VIRT    RES    SHR S     %CPU  %MEM     TIME+ COMMAND
  10836 root      20   0       0      0      0 R  100.000 0.000  65:33.39 kswapd14
  10823 root      20   0       0      0      0 R  100.000 0.000  65:33.05 kswapd1
  10824 root      20   0       0      0      0 R  100.000 0.000  65:32.96 kswapd2
  10825 root      20   0       0      0      0 R  100.000 0.000  65:33.30 kswapd3
  10826 root      20   0       0      0      0 R  100.000 0.000  65:33.38 kswapd4
  10827 root      20   0       0      0      0 R  100.000 0.000  65:33.28 kswapd5
  10828 root      20   0       0      0      0 R  100.000 0.000  65:33.11 kswapd6
  10829 root      20   0       0      0      0 R  100.000 0.000  65:33.37 kswapd7
  10830 root      20   0       0      0      0 R  100.000 0.000  65:33.37 kswapd8
  10831 root      20   0       0      0      0 R  100.000 0.000  65:33.38 kswapd9
  10832 root      20   0       0      0      0 R  100.000 0.000  65:33.32 kswapd10
  10833 root      20   0       0      0      0 R  100.000 0.000  65:33.39 kswapd11
  10834 root      20   0       0      0      0 R  100.000 0.000  65:33.33 kswapd12
  10835 root      20   0       0      0      0 R  100.000 0.000  65:33.29 kswapd13
  10837 root      20   0       0      0      0 R  100.000 0.000  65:33.37 kswapd15
 277155 srikar    20   0   17536  14912   3264 R    9.091 0.001   0:00.57 top
   1092 root      rt   0       0      0      0 S    0.455 0.000   0:00.08 watchdog/178

Please see that there is no used swap space. However 15 kswapd threads
corresponding to 15 out of the 16 nodes are running full throttle.  The
only node 0 has memory, other nodes memory is fully reserved.

git bisect output
I tried git bisect between v4.7 and v4.8-rc3 filtered to mm/vmscan.c

# bad: [d7f05528eedb047efe2288cff777676b028747b6] mm, vmscan: account for skipped pages as a partial scan
# good: [b1123ea6d3b3da25af5c8a9d843bd07ab63213f4] mm: balloon: use general non-lru movable page feature
git bisect start 'HEAD' 'b1123ea6' '--' 'mm/vmscan.c'
# bad: [c4a25635b60d08853a3e4eaae3ab34419a36cfa2] mm: move vmscan writes and file write accounting to the node
git bisect bad c4a25635b60d08853a3e4eaae3ab34419a36cfa2
# bad: [38087d9b0360987a6db46c2c2c4ece37cd048abe] mm, vmscan: simplify the logic deciding whether kswapd sleeps
git bisect bad 38087d9b0360987a6db46c2c2c4ece37cd048abe
# good: [b2e18757f2c9d1cdd746a882e9878852fdec9501] mm, vmscan: begin reclaiming pages on a per-node basis
git bisect good b2e18757f2c9d1cdd746a882e9878852fdec9501
# bad: [1d82de618ddde0f1164e640f79af152f01994c18] mm, vmscan: make kswapd reclaim in terms of nodes
git bisect bad 1d82de618ddde0f1164e640f79af152f01994c18
# good: [f7b60926ebc05944f73d93ffaf6690503b796a88] mm, vmscan: have kswapd only scan based on the highest requested zone
git bisect good f7b60926ebc05944f73d93ffaf6690503b796a88
# first bad commit: [1d82de618ddde0f1164e640f79af152f01994c18] mm, vmscan: make kswapd reclaim in terms of nodes

Here is perf top output on the kernel where kswapd is hogging cpu.

-   93.50%     0.01%  [kernel]                 [k] kswapd
   - kswapd
      - 114.31% shrink_node
         - 111.51% shrink_node_memcg
            - pgdat_reclaimable
               - 95.51% pgdat_reclaimable_pages
                  - 86.34% pgdat_reclaimable_pages
                  - 6.69% _find_next_bit.part.0
                  - 2.47% find_next_bit
               - 14.46% pgdat_reclaimable
                 1.13% _find_next_bit.part.0
               + 0.30% find_next_bit
         - 2.38% shrink_slab
            - super_cache_count
               - 0
                  - __list_lru_count_one.isra.1
                       _raw_spin_lock
      - 28.04% pgdat_reclaimable
         - 23.97% pgdat_reclaimable_pages
            - 21.66% pgdat_reclaimable_pages
            - 1.69% _find_next_bit.part.0
              0.63% find_next_bit
         - 3.70% pgdat_reclaimable
           0.29% _find_next_bit.part.0
      - 16.33% zone_balanced
         - zone_watermark_ok_safe
            - 14.86% zone_watermark_ok_safe
              1.15% _find_next_bit.part.0
              0.31% find_next_bit
      - 2.72% prepare_kswapd_sleep
         - zone_balanced
            - zone_watermark_ok_safe
                 zone_watermark_ok_safe
-   80.72%    10.51%  [kernel]                 [k] pgdat_reclaimable
   - 140.49% pgdat_reclaimable
      - 138.40% pgdat_reclaimable_pages
         - 125.10% pgdat_reclaimable_pages
         - 9.71% _find_next_bit.part.0
         - 3.59% find_next_bit
        1.64% _find_next_bit.part.0
      + 0.44% find_next_bit
   - 21.03% ret_from_kernel_thread
        kthread
      - kswapd
         - 16.75% shrink_node
              shrink_node_memcg
         - 4.28% pgdat_reclaimable
              pgdat_reclaimable
-   69.17%    62.48%  [kernel]                 [k] pgdat_reclaimable_pages
   - 145.91% ret_from_kernel_thread
        kthread
   - 15.61% pgdat_reclaimable_pages
      - 11.33% _find_next_bit.part.0
      - 4.19% find_next_bit
-   66.18%     0.01%  [kernel]                 [k] shrink_node
   - shrink_node
      - 157.54% shrink_node_memcg
         - pgdat_reclaimable
            - 134.94% pgdat_reclaimable_pages
               - 121.99% pgdat_reclaimable_pages
               - 9.46% _find_next_bit.part.0
               - 3.49% find_next_bit
            - 20.44% pgdat_reclaimable
              1.59% _find_next_bit.part.0
            + 0.42% find_next_bit
      - 3.37% shrink_slab
         - super_cache_count
            - 0
               - __list_lru_count_one.isra.1
                    _raw_spin_lock
-   64.56%     0.03%  [kernel]                 [k] shrink_node_memcg
   - shrink_node_memcg
      - pgdat_reclaimable
         - 138.31% pgdat_reclaimable_pages
            - 125.04% pgdat_reclaimable_pages
            - 9.69% _find_next_bit.part.0
            - 3.58% find_next_bit
         - 20.95% pgdat_reclaimable
           1.63% _find_next_bit.part.0
         + 0.43% find_next_bit
    53.73%     0.00%  [kernel]                 [k] kthread
    53.73%     0.00%  [kernel]                 [k] ret_from_kernel_thread
-   11.04%    10.04%  [kernel]                 [k] zone_watermark_ok_safe
   - 146.80% ret_from_kernel_thread
        kthread
      - kswapd
         - 125.81% zone_balanced
              zone_watermark_ok_safe
         - 20.97% prepare_kswapd_sleep
              zone_balanced
              zone_watermark_ok_safe
   - 14.55% zone_watermark_ok_safe
        11.38% _find_next_bit.part.0
        3.06% find_next_bit
-   11.03%     0.00%  [kernel]                 [k] zone_balanced
   - zone_balanced
      - zone_watermark_ok_safe
           145.84% zone_watermark_ok_safe
           11.31% _find_next_bit.part.0
           3.04% find_next_bit


-- 
Thanks and Regards
Srikar Dronamraju

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-29  9:38     ` Srikar Dronamraju
@ 2016-08-30 12:07       ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-30 12:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Mon, Aug 29, 2016 at 03:08:44PM +0530, Srikar Dronamraju wrote:
> > Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
> > thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
> > patch gets rid of many of the node-based versus zone-based decisions.
> > 
> > o A node is considered balanced when any eligible lower zone is balanced.
> >   This eliminates one class of age-inversion problem because we avoid
> >   reclaiming a newer page just because it's in the wrong zone
> > o pgdat_balanced disappears because we now only care about one zone being
> >   balanced.
> > o Some anomalies related to writeback and congestion tracking being based on
> >   zones disappear.
> > o kswapd no longer has to take care to reclaim zones in the reverse order
> >   that the page allocator uses.
> > o Most importantly of all, reclaim from node 0 with multiple zones will
> >   have similar aging and reclaiming characteristics as every
> >   other node.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> This patch seems to hurt FA_DUMP functionality. This behaviour is not
> seen on v4.7 but only after this patch.
> 
> So when a kernel on a multinode machine with memblock_reserve() such
> that most of the nodes have zero available memory, kswapd seems to be
> consuming 100% of the time.
> 

Why is FA_DUMP specifically the trigger? If the nodes have zero available
memory then is the zone_populated() check failing when FA_DUMP is enabled? If
so, that would both allow kswapd to wake and stay awake.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-08-30 12:07       ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-30 12:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Mon, Aug 29, 2016 at 03:08:44PM +0530, Srikar Dronamraju wrote:
> > Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
> > thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
> > patch gets rid of many of the node-based versus zone-based decisions.
> > 
> > o A node is considered balanced when any eligible lower zone is balanced.
> >   This eliminates one class of age-inversion problem because we avoid
> >   reclaiming a newer page just because it's in the wrong zone
> > o pgdat_balanced disappears because we now only care about one zone being
> >   balanced.
> > o Some anomalies related to writeback and congestion tracking being based on
> >   zones disappear.
> > o kswapd no longer has to take care to reclaim zones in the reverse order
> >   that the page allocator uses.
> > o Most importantly of all, reclaim from node 0 with multiple zones will
> >   have similar aging and reclaiming characteristics as every
> >   other node.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> This patch seems to hurt FA_DUMP functionality. This behaviour is not
> seen on v4.7 but only after this patch.
> 
> So when a kernel on a multinode machine with memblock_reserve() such
> that most of the nodes have zero available memory, kswapd seems to be
> consuming 100% of the time.
> 

Why is FA_DUMP specifically the trigger? If the nodes have zero available
memory then is the zone_populated() check failing when FA_DUMP is enabled? If
so, that would both allow kswapd to wake and stay awake.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-30 12:07       ` Mel Gorman
@ 2016-08-30 14:25         ` Srikar Dronamraju
  -1 siblings, 0 replies; 220+ messages in thread
From: Srikar Dronamraju @ 2016-08-30 14:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> > 
> > This patch seems to hurt FA_DUMP functionality. This behaviour is not
> > seen on v4.7 but only after this patch.
> > 
> > So when a kernel on a multinode machine with memblock_reserve() such
> > that most of the nodes have zero available memory, kswapd seems to be
> > consuming 100% of the time.
> > 
> 
> Why is FA_DUMP specifically the trigger? If the nodes have zero available
> memory then is the zone_populated() check failing when FA_DUMP is enabled? If
> so, that would both allow kswapd to wake and stay awake.
> 

The trigger is memblock_reserve() for the complete node memory.  And
this is exactly what FA_DUMP does.  Here again the node has memory but
its all reserved so there is no free memory in the node.

Did you mean populated_zone() when you said zone_populated or have I
mistaken? populated_zone() does return 1 since it checks for
zone->present_pages.

Here is revelant log from the dmesg log at boot 

ppc64_pft_size    = 0x26
phys_mem_size     = 0x1e4600000000
dcache_bsize      = 0x80
icache_bsize      = 0x80
cpu_features      = 0x27fc7aec18500249
  possible        = 0x3fffffff18500649
  always          = 0x0000000018100040
cpu_user_features = 0xdc0065c2 0xef000000
mmu_features      = 0x7c000001
firmware_features = 0x00000003c45bfc57
htab_hash_mask    = 0x7fffffff
-----------------------------------------------------
Node 0 Memory: 0x0-0x1fb50000000
Node 1 Memory: 0x1fb50000000-0x3fa90000000
Node 2 Memory: 0x3fa90000000-0x5f9b0000000
Node 3 Memory: 0x5f9b0000000-0x76850000000
Node 4 Memory: 0x76850000000-0x95020000000
Node 5 Memory: 0x95020000000-0xb37f0000000
Node 6 Memory: 0xb37f0000000-0xd1fc0000000
Node 7 Memory: 0xd1fc0000000-0xf0790000000
Node 8 Memory: 0xf0790000000-0x10ef60000000
Node 9 Memory: 0x10ef60000000-0x12d730000000
Node 10 Memory: 0x12d730000000-0x14bf00000000
Node 11 Memory: 0x14bf00000000-0x16a6d0000000
Node 12 Memory: 0x16a6d0000000-0x188ea0000000
Node 13 Memory: 0x188ea0000000-0x1a7660000000
Node 14 Memory: 0x1a7660000000-0x1c5e30000000
Node 15 Memory: 0x1c5e30000000-0x1e4600000000
numa: Initmem setup node 0 [mem 0x00000000-0x1fb4fffffff]
numa:   NODE_DATA [mem 0x1837fe23680-0x1837fe2d37f]
numa: Initmem setup node 1 [mem 0x1fb50000000-0x3fa8fffffff]
numa:   NODE_DATA [mem 0x1837fa19980-0x1837fa2367f]
numa:     NODE_DATA(1) on node 0
numa: Initmem setup node 2 [mem 0x3fa90000000-0x5f9afffffff]
numa:   NODE_DATA [mem 0x1837f60fc80-0x1837f61997f]
numa:     NODE_DATA(2) on node 0
numa: Initmem setup node 3 [mem 0x5f9b0000000-0x7684fffffff]
numa:   NODE_DATA [mem 0x1837f205f80-0x1837f20fc7f]
numa:     NODE_DATA(3) on node 0
numa: Initmem setup node 4 [mem 0x76850000000-0x9501fffffff]
numa:   NODE_DATA [mem 0x1837ef1c280-0x1837ef25f7f]
numa:     NODE_DATA(4) on node 0
numa: Initmem setup node 5 [mem 0x95020000000-0xb37efffffff]
numa:   NODE_DATA [mem 0x1837eb42580-0x1837eb4c27f]
numa:     NODE_DATA(5) on node 0
numa: Initmem setup node 6 [mem 0xb37f0000000-0xd1fbfffffff]
numa:   NODE_DATA [mem 0x1837e778880-0x1837e78257f]
numa:     NODE_DATA(6) on node 0
numa: Initmem setup node 7 [mem 0xd1fc0000000-0xf078fffffff]
numa:   NODE_DATA [mem 0x1837e39eb80-0x1837e3a887f]
numa:     NODE_DATA(7) on node 0
numa: Initmem setup node 8 [mem 0xf0790000000-0x10ef5fffffff]
numa:   NODE_DATA [mem 0x1837dfc4e80-0x1837dfceb7f]
numa:     NODE_DATA(8) on node 0
numa: Initmem setup node 9 [mem 0x10ef60000000-0x12d72fffffff]
numa:   NODE_DATA [mem 0x1837dbeb180-0x1837dbf4e7f]
numa:     NODE_DATA(9) on node 0
numa: Initmem setup node 10 [mem 0x12d730000000-0x14beffffffff]
numa:   NODE_DATA [mem 0x1837d811480-0x1837d81b17f]
numa:     NODE_DATA(10) on node 0
numa: Initmem setup node 11 [mem 0x14bf00000000-0x16a6cfffffff]
numa:   NODE_DATA [mem 0x1837d437780-0x1837d44147f]
numa:     NODE_DATA(11) on node 0
numa: Initmem setup node 12 [mem 0x16a6d0000000-0x188e9fffffff]
numa:   NODE_DATA [mem 0x1837d05da80-0x1837d06777f]
numa:     NODE_DATA(12) on node 0
numa: Initmem setup node 13 [mem 0x188ea0000000-0x1a765fffffff]
numa:   NODE_DATA [mem 0x1837cc83d80-0x1837cc8da7f]
numa:     NODE_DATA(13) on node 0
numa: Initmem setup node 14 [mem 0x1a7660000000-0x1c5e2fffffff]
numa:   NODE_DATA [mem 0x1837c8aa080-0x1837c8b3d7f]
numa:     NODE_DATA(14) on node 0
numa: Initmem setup node 15 [mem 0x1c5e30000000-0x1e45ffffffff]
numa:   NODE_DATA [mem 0x1837c4d0380-0x1837c4da07f]
numa:     NODE_DATA(15) on node 0
Section 99194 and 99199 (node 0) have a circular dependency on usemap and pgdat allocations
node 1 must be removed before remove section 99193
node 1 must be removed before remove section 99194
node 2 must be removed before remove section 99193
node 4 must be removed before remove section 99193
node 8 must be removed before remove section 99193
node 13 must be removed before remove section 99193
PCI host bridge /pci@800000020000032  ranges:
 MEM 0x00003fd480000000..0x00003fd4feffffff -> 0x0000000080000000 
 MEM 0x0000329000000000..0x0000329fffffffff -> 0x0003d29000000000 
PCI host bridge /pci@800000020000164  ranges:
 MEM 0x00003fc2e0000000..0x00003fc2efffffff -> 0x00000000e0000000 
 MEM 0x0000305800000000..0x0000305bffffffff -> 0x0003d05800000000 
PPC64 nvram contains 15360 bytes
Top of RAM: 0x1e4600000000, Total RAM: 0x1e4600000000
Memory hole size: 0MB
Zone ranges:
  DMA      [mem 0x0000000000000000-0x00001e45ffffffff]
  DMA32    empty
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000000000-0x000001fb4fffffff]
  node   1: [mem 0x000001fb50000000-0x000003fa8fffffff]
  node   2: [mem 0x000003fa90000000-0x000005f9afffffff]
  node   3: [mem 0x000005f9b0000000-0x000007684fffffff]
  node   4: [mem 0x0000076850000000-0x000009501fffffff]
  node   5: [mem 0x0000095020000000-0x00000b37efffffff]
  node   6: [mem 0x00000b37f0000000-0x00000d1fbfffffff]
  node   7: [mem 0x00000d1fc0000000-0x00000f078fffffff]
  node   8: [mem 0x00000f0790000000-0x000010ef5fffffff]
  node   9: [mem 0x000010ef60000000-0x000012d72fffffff]
  node  10: [mem 0x000012d730000000-0x000014beffffffff]
  node  11: [mem 0x000014bf00000000-0x000016a6cfffffff]
  node  12: [mem 0x000016a6d0000000-0x0000188e9fffffff]
  node  13: [mem 0x0000188ea0000000-0x00001a765fffffff]
  node  14: [mem 0x00001a7660000000-0x00001c5e2fffffff]
  node  15: [mem 0x00001c5e30000000-0x00001e45ffffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x000001fb4fffffff]
On node 0 totalpages: 33247232
  DMA zone: 32468 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 33247232 pages, LIFO batch:1
Initmem setup node 1 [mem 0x000001fb50000000-0x000003fa8fffffff]
On node 1 totalpages: 33505280
  DMA zone: 32720 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 33505280 pages, LIFO batch:1
Initmem setup node 2 [mem 0x000003fa90000000-0x000005f9afffffff]
On node 2 totalpages: 33497088
  DMA zone: 32712 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 33497088 pages, LIFO batch:1
Initmem setup node 3 [mem 0x000005f9b0000000-0x000007684fffffff]
On node 3 totalpages: 24027136
  DMA zone: 23464 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 24027136 pages, LIFO batch:1
Initmem setup node 4 [mem 0x0000076850000000-0x000009501fffffff]
On node 4 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 5 [mem 0x0000095020000000-0x00000b37efffffff]
On node 5 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 6 [mem 0x00000b37f0000000-0x00000d1fbfffffff]
On node 6 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 7 [mem 0x00000d1fc0000000-0x00000f078fffffff]
On node 7 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 8 [mem 0x00000f0790000000-0x000010ef5fffffff]
On node 8 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 9 [mem 0x000010ef60000000-0x000012d72fffffff]
On node 9 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 10 [mem 0x000012d730000000-0x000014beffffffff]
On node 10 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 11 [mem 0x000014bf00000000-0x000016a6cfffffff]
On node 11 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 12 [mem 0x000016a6d0000000-0x0000188e9fffffff]
On node 12 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 13 [mem 0x0000188ea0000000-0x00001a765fffffff]
On node 13 totalpages: 31965184
  DMA zone: 31216 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31965184 pages, LIFO batch:1
Initmem setup node 14 [mem 0x00001a7660000000-0x00001c5e2fffffff]
On node 14 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 15 [mem 0x00001c5e30000000-0x00001e45ffffffff]
On node 15 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-08-30 14:25         ` Srikar Dronamraju
  0 siblings, 0 replies; 220+ messages in thread
From: Srikar Dronamraju @ 2016-08-30 14:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> > 
> > This patch seems to hurt FA_DUMP functionality. This behaviour is not
> > seen on v4.7 but only after this patch.
> > 
> > So when a kernel on a multinode machine with memblock_reserve() such
> > that most of the nodes have zero available memory, kswapd seems to be
> > consuming 100% of the time.
> > 
> 
> Why is FA_DUMP specifically the trigger? If the nodes have zero available
> memory then is the zone_populated() check failing when FA_DUMP is enabled? If
> so, that would both allow kswapd to wake and stay awake.
> 

The trigger is memblock_reserve() for the complete node memory.  And
this is exactly what FA_DUMP does.  Here again the node has memory but
its all reserved so there is no free memory in the node.

Did you mean populated_zone() when you said zone_populated or have I
mistaken? populated_zone() does return 1 since it checks for
zone->present_pages.

Here is revelant log from the dmesg log at boot 

ppc64_pft_size    = 0x26
phys_mem_size     = 0x1e4600000000
dcache_bsize      = 0x80
icache_bsize      = 0x80
cpu_features      = 0x27fc7aec18500249
  possible        = 0x3fffffff18500649
  always          = 0x0000000018100040
cpu_user_features = 0xdc0065c2 0xef000000
mmu_features      = 0x7c000001
firmware_features = 0x00000003c45bfc57
htab_hash_mask    = 0x7fffffff
-----------------------------------------------------
Node 0 Memory: 0x0-0x1fb50000000
Node 1 Memory: 0x1fb50000000-0x3fa90000000
Node 2 Memory: 0x3fa90000000-0x5f9b0000000
Node 3 Memory: 0x5f9b0000000-0x76850000000
Node 4 Memory: 0x76850000000-0x95020000000
Node 5 Memory: 0x95020000000-0xb37f0000000
Node 6 Memory: 0xb37f0000000-0xd1fc0000000
Node 7 Memory: 0xd1fc0000000-0xf0790000000
Node 8 Memory: 0xf0790000000-0x10ef60000000
Node 9 Memory: 0x10ef60000000-0x12d730000000
Node 10 Memory: 0x12d730000000-0x14bf00000000
Node 11 Memory: 0x14bf00000000-0x16a6d0000000
Node 12 Memory: 0x16a6d0000000-0x188ea0000000
Node 13 Memory: 0x188ea0000000-0x1a7660000000
Node 14 Memory: 0x1a7660000000-0x1c5e30000000
Node 15 Memory: 0x1c5e30000000-0x1e4600000000
numa: Initmem setup node 0 [mem 0x00000000-0x1fb4fffffff]
numa:   NODE_DATA [mem 0x1837fe23680-0x1837fe2d37f]
numa: Initmem setup node 1 [mem 0x1fb50000000-0x3fa8fffffff]
numa:   NODE_DATA [mem 0x1837fa19980-0x1837fa2367f]
numa:     NODE_DATA(1) on node 0
numa: Initmem setup node 2 [mem 0x3fa90000000-0x5f9afffffff]
numa:   NODE_DATA [mem 0x1837f60fc80-0x1837f61997f]
numa:     NODE_DATA(2) on node 0
numa: Initmem setup node 3 [mem 0x5f9b0000000-0x7684fffffff]
numa:   NODE_DATA [mem 0x1837f205f80-0x1837f20fc7f]
numa:     NODE_DATA(3) on node 0
numa: Initmem setup node 4 [mem 0x76850000000-0x9501fffffff]
numa:   NODE_DATA [mem 0x1837ef1c280-0x1837ef25f7f]
numa:     NODE_DATA(4) on node 0
numa: Initmem setup node 5 [mem 0x95020000000-0xb37efffffff]
numa:   NODE_DATA [mem 0x1837eb42580-0x1837eb4c27f]
numa:     NODE_DATA(5) on node 0
numa: Initmem setup node 6 [mem 0xb37f0000000-0xd1fbfffffff]
numa:   NODE_DATA [mem 0x1837e778880-0x1837e78257f]
numa:     NODE_DATA(6) on node 0
numa: Initmem setup node 7 [mem 0xd1fc0000000-0xf078fffffff]
numa:   NODE_DATA [mem 0x1837e39eb80-0x1837e3a887f]
numa:     NODE_DATA(7) on node 0
numa: Initmem setup node 8 [mem 0xf0790000000-0x10ef5fffffff]
numa:   NODE_DATA [mem 0x1837dfc4e80-0x1837dfceb7f]
numa:     NODE_DATA(8) on node 0
numa: Initmem setup node 9 [mem 0x10ef60000000-0x12d72fffffff]
numa:   NODE_DATA [mem 0x1837dbeb180-0x1837dbf4e7f]
numa:     NODE_DATA(9) on node 0
numa: Initmem setup node 10 [mem 0x12d730000000-0x14beffffffff]
numa:   NODE_DATA [mem 0x1837d811480-0x1837d81b17f]
numa:     NODE_DATA(10) on node 0
numa: Initmem setup node 11 [mem 0x14bf00000000-0x16a6cfffffff]
numa:   NODE_DATA [mem 0x1837d437780-0x1837d44147f]
numa:     NODE_DATA(11) on node 0
numa: Initmem setup node 12 [mem 0x16a6d0000000-0x188e9fffffff]
numa:   NODE_DATA [mem 0x1837d05da80-0x1837d06777f]
numa:     NODE_DATA(12) on node 0
numa: Initmem setup node 13 [mem 0x188ea0000000-0x1a765fffffff]
numa:   NODE_DATA [mem 0x1837cc83d80-0x1837cc8da7f]
numa:     NODE_DATA(13) on node 0
numa: Initmem setup node 14 [mem 0x1a7660000000-0x1c5e2fffffff]
numa:   NODE_DATA [mem 0x1837c8aa080-0x1837c8b3d7f]
numa:     NODE_DATA(14) on node 0
numa: Initmem setup node 15 [mem 0x1c5e30000000-0x1e45ffffffff]
numa:   NODE_DATA [mem 0x1837c4d0380-0x1837c4da07f]
numa:     NODE_DATA(15) on node 0
Section 99194 and 99199 (node 0) have a circular dependency on usemap and pgdat allocations
node 1 must be removed before remove section 99193
node 1 must be removed before remove section 99194
node 2 must be removed before remove section 99193
node 4 must be removed before remove section 99193
node 8 must be removed before remove section 99193
node 13 must be removed before remove section 99193
PCI host bridge /pci@800000020000032  ranges:
 MEM 0x00003fd480000000..0x00003fd4feffffff -> 0x0000000080000000 
 MEM 0x0000329000000000..0x0000329fffffffff -> 0x0003d29000000000 
PCI host bridge /pci@800000020000164  ranges:
 MEM 0x00003fc2e0000000..0x00003fc2efffffff -> 0x00000000e0000000 
 MEM 0x0000305800000000..0x0000305bffffffff -> 0x0003d05800000000 
PPC64 nvram contains 15360 bytes
Top of RAM: 0x1e4600000000, Total RAM: 0x1e4600000000
Memory hole size: 0MB
Zone ranges:
  DMA      [mem 0x0000000000000000-0x00001e45ffffffff]
  DMA32    empty
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000000000-0x000001fb4fffffff]
  node   1: [mem 0x000001fb50000000-0x000003fa8fffffff]
  node   2: [mem 0x000003fa90000000-0x000005f9afffffff]
  node   3: [mem 0x000005f9b0000000-0x000007684fffffff]
  node   4: [mem 0x0000076850000000-0x000009501fffffff]
  node   5: [mem 0x0000095020000000-0x00000b37efffffff]
  node   6: [mem 0x00000b37f0000000-0x00000d1fbfffffff]
  node   7: [mem 0x00000d1fc0000000-0x00000f078fffffff]
  node   8: [mem 0x00000f0790000000-0x000010ef5fffffff]
  node   9: [mem 0x000010ef60000000-0x000012d72fffffff]
  node  10: [mem 0x000012d730000000-0x000014beffffffff]
  node  11: [mem 0x000014bf00000000-0x000016a6cfffffff]
  node  12: [mem 0x000016a6d0000000-0x0000188e9fffffff]
  node  13: [mem 0x0000188ea0000000-0x00001a765fffffff]
  node  14: [mem 0x00001a7660000000-0x00001c5e2fffffff]
  node  15: [mem 0x00001c5e30000000-0x00001e45ffffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x000001fb4fffffff]
On node 0 totalpages: 33247232
  DMA zone: 32468 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 33247232 pages, LIFO batch:1
Initmem setup node 1 [mem 0x000001fb50000000-0x000003fa8fffffff]
On node 1 totalpages: 33505280
  DMA zone: 32720 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 33505280 pages, LIFO batch:1
Initmem setup node 2 [mem 0x000003fa90000000-0x000005f9afffffff]
On node 2 totalpages: 33497088
  DMA zone: 32712 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 33497088 pages, LIFO batch:1
Initmem setup node 3 [mem 0x000005f9b0000000-0x000007684fffffff]
On node 3 totalpages: 24027136
  DMA zone: 23464 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 24027136 pages, LIFO batch:1
Initmem setup node 4 [mem 0x0000076850000000-0x000009501fffffff]
On node 4 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 5 [mem 0x0000095020000000-0x00000b37efffffff]
On node 5 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 6 [mem 0x00000b37f0000000-0x00000d1fbfffffff]
On node 6 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 7 [mem 0x00000d1fc0000000-0x00000f078fffffff]
On node 7 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 8 [mem 0x00000f0790000000-0x000010ef5fffffff]
On node 8 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 9 [mem 0x000010ef60000000-0x000012d72fffffff]
On node 9 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 10 [mem 0x000012d730000000-0x000014beffffffff]
On node 10 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 11 [mem 0x000014bf00000000-0x000016a6cfffffff]
On node 11 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 12 [mem 0x000016a6d0000000-0x0000188e9fffffff]
On node 12 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 13 [mem 0x0000188ea0000000-0x00001a765fffffff]
On node 13 totalpages: 31965184
  DMA zone: 31216 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31965184 pages, LIFO batch:1
Initmem setup node 14 [mem 0x00001a7660000000-0x00001c5e2fffffff]
On node 14 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 15 [mem 0x00001c5e30000000-0x00001e45ffffffff]
On node 15 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1

-- 
Thanks and Regards
Srikar Dronamraju

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-30 14:25         ` Srikar Dronamraju
@ 2016-08-30 15:00           ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-30 15:00 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Tue, Aug 30, 2016 at 07:55:08PM +0530, Srikar Dronamraju wrote:
> > > 
> > > This patch seems to hurt FA_DUMP functionality. This behaviour is not
> > > seen on v4.7 but only after this patch.
> > > 
> > > So when a kernel on a multinode machine with memblock_reserve() such
> > > that most of the nodes have zero available memory, kswapd seems to be
> > > consuming 100% of the time.
> > > 
> > 
> > Why is FA_DUMP specifically the trigger? If the nodes have zero available
> > memory then is the zone_populated() check failing when FA_DUMP is enabled? If
> > so, that would both allow kswapd to wake and stay awake.
> > 
> 
> The trigger is memblock_reserve() for the complete node memory.  And
> this is exactly what FA_DUMP does.  Here again the node has memory but
> its all reserved so there is no free memory in the node.
> 
> Did you mean populated_zone() when you said zone_populated or have I
> mistaken? populated_zone() does return 1 since it checks for
> zone->present_pages.
> 

Yes, I meant populated_zone(). Using present pages may have hidden
a long-lived corner case as it was unexpected that an entire node
would be reserved. The old code happened to survive *probably* because
pgdat_reclaimable would look false and kswapd checks for pgdat being
balanced would happen to do the right thing in this case.

Can you check if something like this works?

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..cf64a5456cf6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -830,7 +830,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
 
 static inline int populated_zone(struct zone *zone)
 {
-	return (!!zone->present_pages);
+	return (!!zone->managed_pages);
 }
 
 extern int movable_zone;

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-08-30 15:00           ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-30 15:00 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Tue, Aug 30, 2016 at 07:55:08PM +0530, Srikar Dronamraju wrote:
> > > 
> > > This patch seems to hurt FA_DUMP functionality. This behaviour is not
> > > seen on v4.7 but only after this patch.
> > > 
> > > So when a kernel on a multinode machine with memblock_reserve() such
> > > that most of the nodes have zero available memory, kswapd seems to be
> > > consuming 100% of the time.
> > > 
> > 
> > Why is FA_DUMP specifically the trigger? If the nodes have zero available
> > memory then is the zone_populated() check failing when FA_DUMP is enabled? If
> > so, that would both allow kswapd to wake and stay awake.
> > 
> 
> The trigger is memblock_reserve() for the complete node memory.  And
> this is exactly what FA_DUMP does.  Here again the node has memory but
> its all reserved so there is no free memory in the node.
> 
> Did you mean populated_zone() when you said zone_populated or have I
> mistaken? populated_zone() does return 1 since it checks for
> zone->present_pages.
> 

Yes, I meant populated_zone(). Using present pages may have hidden
a long-lived corner case as it was unexpected that an entire node
would be reserved. The old code happened to survive *probably* because
pgdat_reclaimable would look false and kswapd checks for pgdat being
balanced would happen to do the right thing in this case.

Can you check if something like this works?

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..cf64a5456cf6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -830,7 +830,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
 
 static inline int populated_zone(struct zone *zone)
 {
-	return (!!zone->present_pages);
+	return (!!zone->managed_pages);
 }
 
 extern int movable_zone;

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-30 15:00           ` Mel Gorman
@ 2016-08-31  6:09             ` Srikar Dronamraju
  -1 siblings, 0 replies; 220+ messages in thread
From: Srikar Dronamraju @ 2016-08-31  6:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> > The trigger is memblock_reserve() for the complete node memory.  And
> > this is exactly what FA_DUMP does.  Here again the node has memory but
> > its all reserved so there is no free memory in the node.
> > 
> > Did you mean populated_zone() when you said zone_populated or have I
> > mistaken? populated_zone() does return 1 since it checks for
> > zone->present_pages.
> > 
> 
> Yes, I meant populated_zone(). Using present pages may have hidden
> a long-lived corner case as it was unexpected that an entire node
> would be reserved. The old code happened to survive *probably* because
> pgdat_reclaimable would look false and kswapd checks for pgdat being
> balanced would happen to do the right thing in this case.
> 
> Can you check if something like this works?
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d572b78b65e1..cf64a5456cf6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -830,7 +830,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
> 
>  static inline int populated_zone(struct zone *zone)
>  {
> -	return (!!zone->present_pages);
> +	return (!!zone->managed_pages);
>  }
> 
>  extern int movable_zone;
> 

This indeed fixes the problem.
Please add my 
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-08-31  6:09             ` Srikar Dronamraju
  0 siblings, 0 replies; 220+ messages in thread
From: Srikar Dronamraju @ 2016-08-31  6:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> > The trigger is memblock_reserve() for the complete node memory.  And
> > this is exactly what FA_DUMP does.  Here again the node has memory but
> > its all reserved so there is no free memory in the node.
> > 
> > Did you mean populated_zone() when you said zone_populated or have I
> > mistaken? populated_zone() does return 1 since it checks for
> > zone->present_pages.
> > 
> 
> Yes, I meant populated_zone(). Using present pages may have hidden
> a long-lived corner case as it was unexpected that an entire node
> would be reserved. The old code happened to survive *probably* because
> pgdat_reclaimable would look false and kswapd checks for pgdat being
> balanced would happen to do the right thing in this case.
> 
> Can you check if something like this works?
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d572b78b65e1..cf64a5456cf6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -830,7 +830,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
> 
>  static inline int populated_zone(struct zone *zone)
>  {
> -	return (!!zone->present_pages);
> +	return (!!zone->managed_pages);
>  }
> 
>  extern int movable_zone;
> 

This indeed fixes the problem.
Please add my 
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-31  6:09             ` Srikar Dronamraju
@ 2016-08-31  8:49               ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-31  8:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Wed, Aug 31, 2016 at 11:39:59AM +0530, Srikar Dronamraju wrote:
> This indeed fixes the problem.
> Please add my 
> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> 

Ok, thanks. Unfortunately we cannot do a wide conversion like this
because some users of populated_zone() really meant to check for
present_pages. In all cases, the expectation was that reserved pages
would be tiny but fadump messes that up. Can you verify this also works
please?

---8<---
mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator

Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
of memory when booting a secondary kernel. Srikar Dronamraju reported that
multiple nodes may have no memory managed by the buddy allocator but still
return true for populated_zone().

Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
was reported to cause kswapd to spin at 100% CPU usage when fadump was
enabled. The old code happened to deal with the situation of a populated
node with zero free pages by co-incidence but the current code tries to
reclaim populated zones without realising that is impossible.

We cannot just convert populated_zone() as many existing users really
need to check for present_pages. This patch introduces a managed_zone()
helper and uses it in the few cases where it is critical that the check
is made for managed pages -- zonelist constuction and page reclaim.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h | 11 +++++++++--
 mm/page_alloc.c        |  4 ++--
 mm/vmscan.c            | 22 +++++++++++-----------
 3 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..69f886b79656 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -828,9 +828,16 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
  */
 #define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
 
-static inline int populated_zone(struct zone *zone)
+/* Returns true if a zone has pages managed by the buddy allocator */
+static inline bool managed_zone(struct zone *zone)
 {
-	return (!!zone->present_pages);
+	return zone->managed_pages;
+}
+
+/* Returns true if a zone has memory */
+static inline bool populated_zone(struct zone *zone)
+{
+	return zone->present_pages;
 }
 
 extern int movable_zone;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c09d9f7f692..ea7558149ee5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4405,7 +4405,7 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (populated_zone(zone)) {
+		if (managed_zone(zone)) {
 			zoneref_set_zone(zone,
 				&zonelist->_zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
@@ -4643,7 +4643,7 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 		for (j = 0; j < nr_nodes; j++) {
 			node = node_order[j];
 			z = &NODE_DATA(node)->node_zones[zone_type];
-			if (populated_zone(z)) {
+			if (managed_zone(z)) {
 				zoneref_set_zone(z,
 					&zonelist->_zonerefs[pos++]);
 				check_highest_zone(zone_type);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 98774f45b04a..55943a284082 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1665,7 +1665,7 @@ static bool inactive_reclaimable_pages(struct lruvec *lruvec,
 
 	for (zid = sc->reclaim_idx; zid >= 0; zid--) {
 		zone = &pgdat->node_zones[zid];
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		if (zone_page_state_snapshot(zone, NR_ZONE_LRU_BASE +
@@ -2036,7 +2036,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 		struct zone *zone = &pgdat->node_zones[zid];
 		unsigned long inactive_zone, active_zone;
 
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		inactive_zone = zone_page_state(zone,
@@ -2171,7 +2171,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 		for (z = 0; z < MAX_NR_ZONES; z++) {
 			struct zone *zone = &pgdat->node_zones[z];
-			if (!populated_zone(zone))
+			if (!managed_zone(zone))
 				continue;
 
 			total_high_wmark += high_wmark_pages(zone);
@@ -2508,7 +2508,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	/* If compaction would go ahead or the allocation would succeed, stop */
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		struct zone *zone = &pgdat->node_zones[z];
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
@@ -2835,7 +2835,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!populated_zone(zone) ||
+		if (!managed_zone(zone) ||
 		    pgdat_reclaimable_pages(pgdat) == 0)
 			continue;
 
@@ -3136,7 +3136,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		if (!zone_balanced(zone, order, classzone_idx))
@@ -3164,7 +3164,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 	sc->nr_to_reclaim = 0;
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		zone = pgdat->node_zones + z;
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
@@ -3237,7 +3237,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		if (buffer_heads_over_limit) {
 			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
 				zone = pgdat->node_zones + i;
-				if (!populated_zone(zone))
+				if (!managed_zone(zone))
 					continue;
 
 				sc.reclaim_idx = i;
@@ -3257,7 +3257,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		for (i = classzone_idx; i >= 0; i--) {
 			zone = pgdat->node_zones + i;
-			if (!populated_zone(zone))
+			if (!managed_zone(zone))
 				continue;
 
 			if (zone_balanced(zone, sc.order, classzone_idx))
@@ -3503,7 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	pg_data_t *pgdat;
 	int z;
 
-	if (!populated_zone(zone))
+	if (!managed_zone(zone))
 		return;
 
 	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
@@ -3517,7 +3517,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		if (zone_balanced(zone, order, classzone_idx))

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-08-31  8:49               ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-31  8:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Wed, Aug 31, 2016 at 11:39:59AM +0530, Srikar Dronamraju wrote:
> This indeed fixes the problem.
> Please add my 
> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> 

Ok, thanks. Unfortunately we cannot do a wide conversion like this
because some users of populated_zone() really meant to check for
present_pages. In all cases, the expectation was that reserved pages
would be tiny but fadump messes that up. Can you verify this also works
please?

---8<---
mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator

Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
of memory when booting a secondary kernel. Srikar Dronamraju reported that
multiple nodes may have no memory managed by the buddy allocator but still
return true for populated_zone().

Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
was reported to cause kswapd to spin at 100% CPU usage when fadump was
enabled. The old code happened to deal with the situation of a populated
node with zero free pages by co-incidence but the current code tries to
reclaim populated zones without realising that is impossible.

We cannot just convert populated_zone() as many existing users really
need to check for present_pages. This patch introduces a managed_zone()
helper and uses it in the few cases where it is critical that the check
is made for managed pages -- zonelist constuction and page reclaim.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h | 11 +++++++++--
 mm/page_alloc.c        |  4 ++--
 mm/vmscan.c            | 22 +++++++++++-----------
 3 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..69f886b79656 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -828,9 +828,16 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
  */
 #define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
 
-static inline int populated_zone(struct zone *zone)
+/* Returns true if a zone has pages managed by the buddy allocator */
+static inline bool managed_zone(struct zone *zone)
 {
-	return (!!zone->present_pages);
+	return zone->managed_pages;
+}
+
+/* Returns true if a zone has memory */
+static inline bool populated_zone(struct zone *zone)
+{
+	return zone->present_pages;
 }
 
 extern int movable_zone;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c09d9f7f692..ea7558149ee5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4405,7 +4405,7 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (populated_zone(zone)) {
+		if (managed_zone(zone)) {
 			zoneref_set_zone(zone,
 				&zonelist->_zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
@@ -4643,7 +4643,7 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 		for (j = 0; j < nr_nodes; j++) {
 			node = node_order[j];
 			z = &NODE_DATA(node)->node_zones[zone_type];
-			if (populated_zone(z)) {
+			if (managed_zone(z)) {
 				zoneref_set_zone(z,
 					&zonelist->_zonerefs[pos++]);
 				check_highest_zone(zone_type);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 98774f45b04a..55943a284082 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1665,7 +1665,7 @@ static bool inactive_reclaimable_pages(struct lruvec *lruvec,
 
 	for (zid = sc->reclaim_idx; zid >= 0; zid--) {
 		zone = &pgdat->node_zones[zid];
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		if (zone_page_state_snapshot(zone, NR_ZONE_LRU_BASE +
@@ -2036,7 +2036,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 		struct zone *zone = &pgdat->node_zones[zid];
 		unsigned long inactive_zone, active_zone;
 
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		inactive_zone = zone_page_state(zone,
@@ -2171,7 +2171,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 		for (z = 0; z < MAX_NR_ZONES; z++) {
 			struct zone *zone = &pgdat->node_zones[z];
-			if (!populated_zone(zone))
+			if (!managed_zone(zone))
 				continue;
 
 			total_high_wmark += high_wmark_pages(zone);
@@ -2508,7 +2508,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	/* If compaction would go ahead or the allocation would succeed, stop */
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		struct zone *zone = &pgdat->node_zones[z];
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
@@ -2835,7 +2835,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!populated_zone(zone) ||
+		if (!managed_zone(zone) ||
 		    pgdat_reclaimable_pages(pgdat) == 0)
 			continue;
 
@@ -3136,7 +3136,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		if (!zone_balanced(zone, order, classzone_idx))
@@ -3164,7 +3164,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 	sc->nr_to_reclaim = 0;
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		zone = pgdat->node_zones + z;
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
@@ -3237,7 +3237,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		if (buffer_heads_over_limit) {
 			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
 				zone = pgdat->node_zones + i;
-				if (!populated_zone(zone))
+				if (!managed_zone(zone))
 					continue;
 
 				sc.reclaim_idx = i;
@@ -3257,7 +3257,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		for (i = classzone_idx; i >= 0; i--) {
 			zone = pgdat->node_zones + i;
-			if (!populated_zone(zone))
+			if (!managed_zone(zone))
 				continue;
 
 			if (zone_balanced(zone, sc.order, classzone_idx))
@@ -3503,7 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	pg_data_t *pgdat;
 	int z;
 
-	if (!populated_zone(zone))
+	if (!managed_zone(zone))
 		return;
 
 	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
@@ -3517,7 +3517,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		if (zone_balanced(zone, order, classzone_idx))

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-31  8:49               ` Mel Gorman
@ 2016-08-31 11:09                 ` Michal Hocko
  -1 siblings, 0 replies; 220+ messages in thread
From: Michal Hocko @ 2016-08-31 11:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Wed 31-08-16 09:49:42, Mel Gorman wrote:
> On Wed, Aug 31, 2016 at 11:39:59AM +0530, Srikar Dronamraju wrote:
> > This indeed fixes the problem.
> > Please add my 
> > Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > 
> 
> Ok, thanks. Unfortunately we cannot do a wide conversion like this
> because some users of populated_zone() really meant to check for
> present_pages. In all cases, the expectation was that reserved pages
> would be tiny but fadump messes that up. Can you verify this also works
> please?
> 
> ---8<---
> mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator
> 
> Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
> of memory when booting a secondary kernel. Srikar Dronamraju reported that
> multiple nodes may have no memory managed by the buddy allocator but still
> return true for populated_zone().
> 
> Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> was reported to cause kswapd to spin at 100% CPU usage when fadump was
> enabled. The old code happened to deal with the situation of a populated
> node with zero free pages by co-incidence but the current code tries to
> reclaim populated zones without realising that is impossible.
> 
> We cannot just convert populated_zone() as many existing users really
> need to check for present_pages. This patch introduces a managed_zone()
> helper and uses it in the few cases where it is critical that the check
> is made for managed pages -- zonelist constuction and page reclaim.

OK, the patch makes sense to me. I am not happy about two very similar
functions, to be honest though. managed vs. present checks will be quite
subtle and it is not entirely clear when to use which one. I agree that
the reclaim path is the most critical one so the patch seems OK to me.
At least from a quick glance it should help with the reported issue so
feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

I expect we might want to turn other places as well but they are far
from critical. I would appreciate some lead there and stick a clarifying
comment

[...]
> -static inline int populated_zone(struct zone *zone)
> +/* Returns true if a zone has pages managed by the buddy allocator */

/*
 * Returns true if a zone has pages managed by the buddy allocator.
 * All the reclaim decisions have to use this function rather than
 * populated_zone(). If the whole zone is reserved then we can easily
 * end up with populated_zone() && !managed_zone().
 */

What do you think?

> +static inline bool managed_zone(struct zone *zone)
>  {
> -	return (!!zone->present_pages);
> +	return zone->managed_pages;
> +}
> +
> +/* Returns true if a zone has memory */
> +static inline bool populated_zone(struct zone *zone)
> +{
> +	return zone->present_pages;
>  }
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-08-31 11:09                 ` Michal Hocko
  0 siblings, 0 replies; 220+ messages in thread
From: Michal Hocko @ 2016-08-31 11:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Wed 31-08-16 09:49:42, Mel Gorman wrote:
> On Wed, Aug 31, 2016 at 11:39:59AM +0530, Srikar Dronamraju wrote:
> > This indeed fixes the problem.
> > Please add my 
> > Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > 
> 
> Ok, thanks. Unfortunately we cannot do a wide conversion like this
> because some users of populated_zone() really meant to check for
> present_pages. In all cases, the expectation was that reserved pages
> would be tiny but fadump messes that up. Can you verify this also works
> please?
> 
> ---8<---
> mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator
> 
> Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
> of memory when booting a secondary kernel. Srikar Dronamraju reported that
> multiple nodes may have no memory managed by the buddy allocator but still
> return true for populated_zone().
> 
> Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> was reported to cause kswapd to spin at 100% CPU usage when fadump was
> enabled. The old code happened to deal with the situation of a populated
> node with zero free pages by co-incidence but the current code tries to
> reclaim populated zones without realising that is impossible.
> 
> We cannot just convert populated_zone() as many existing users really
> need to check for present_pages. This patch introduces a managed_zone()
> helper and uses it in the few cases where it is critical that the check
> is made for managed pages -- zonelist constuction and page reclaim.

OK, the patch makes sense to me. I am not happy about two very similar
functions, to be honest though. managed vs. present checks will be quite
subtle and it is not entirely clear when to use which one. I agree that
the reclaim path is the most critical one so the patch seems OK to me.
At least from a quick glance it should help with the reported issue so
feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

I expect we might want to turn other places as well but they are far
from critical. I would appreciate some lead there and stick a clarifying
comment

[...]
> -static inline int populated_zone(struct zone *zone)
> +/* Returns true if a zone has pages managed by the buddy allocator */

/*
 * Returns true if a zone has pages managed by the buddy allocator.
 * All the reclaim decisions have to use this function rather than
 * populated_zone(). If the whole zone is reserved then we can easily
 * end up with populated_zone() && !managed_zone().
 */

What do you think?

> +static inline bool managed_zone(struct zone *zone)
>  {
> -	return (!!zone->present_pages);
> +	return zone->managed_pages;
> +}
> +
> +/* Returns true if a zone has memory */
> +static inline bool populated_zone(struct zone *zone)
> +{
> +	return zone->present_pages;
>  }
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-31 11:09                 ` Michal Hocko
@ 2016-08-31 12:46                   ` Mel Gorman
  -1 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-31 12:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Srikar Dronamraju, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Wed, Aug 31, 2016 at 01:09:33PM +0200, Michal Hocko wrote:
> > We cannot just convert populated_zone() as many existing users really
> > need to check for present_pages. This patch introduces a managed_zone()
> > helper and uses it in the few cases where it is critical that the check
> > is made for managed pages -- zonelist constuction and page reclaim.
> 
> OK, the patch makes sense to me. I am not happy about two very similar
> functions, to be honest though. managed vs. present checks will be quite
> subtle and it is not entirely clear when to use which one.

In the vast majority of cases, the distinction is irrelevant. The patch
only updates the places where it really matters to minimise any
confusion.

> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks.

> /*
>  * Returns true if a zone has pages managed by the buddy allocator.
>  * All the reclaim decisions have to use this function rather than
>  * populated_zone(). If the whole zone is reserved then we can easily
>  * end up with populated_zone() && !managed_zone().
>  */
> 
> What do you think?
> 

This makes a lot of sense. I've updated the patch and will await a test
from Srikar before reposting.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-08-31 12:46                   ` Mel Gorman
  0 siblings, 0 replies; 220+ messages in thread
From: Mel Gorman @ 2016-08-31 12:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Srikar Dronamraju, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Wed, Aug 31, 2016 at 01:09:33PM +0200, Michal Hocko wrote:
> > We cannot just convert populated_zone() as many existing users really
> > need to check for present_pages. This patch introduces a managed_zone()
> > helper and uses it in the few cases where it is critical that the check
> > is made for managed pages -- zonelist constuction and page reclaim.
> 
> OK, the patch makes sense to me. I am not happy about two very similar
> functions, to be honest though. managed vs. present checks will be quite
> subtle and it is not entirely clear when to use which one.

In the vast majority of cases, the distinction is irrelevant. The patch
only updates the places where it really matters to minimise any
confusion.

> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks.

> /*
>  * Returns true if a zone has pages managed by the buddy allocator.
>  * All the reclaim decisions have to use this function rather than
>  * populated_zone(). If the whole zone is reserved then we can easily
>  * end up with populated_zone() && !managed_zone().
>  */
> 
> What do you think?
> 

This makes a lot of sense. I've updated the patch and will await a test
from Srikar before reposting.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-31  8:49               ` Mel Gorman
@ 2016-08-31 17:33                 ` Srikar Dronamraju
  -1 siblings, 0 replies; 220+ messages in thread
From: Srikar Dronamraju @ 2016-08-31 17:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator
> 
> Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
> of memory when booting a secondary kernel. Srikar Dronamraju reported that
> multiple nodes may have no memory managed by the buddy allocator but still
> return true for populated_zone().
> 
> Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> was reported to cause kswapd to spin at 100% CPU usage when fadump was
> enabled. The old code happened to deal with the situation of a populated
> node with zero free pages by co-incidence but the current code tries to
> reclaim populated zones without realising that is impossible.
> 
> We cannot just convert populated_zone() as many existing users really
> need to check for present_pages. This patch introduces a managed_zone()
> helper and uses it in the few cases where it is critical that the check
> is made for managed pages -- zonelist constuction and page reclaim.

one nit
s/constuction/construction/

> 

Verified that it works fine.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
@ 2016-08-31 17:33                 ` Srikar Dronamraju
  0 siblings, 0 replies; 220+ messages in thread
From: Srikar Dronamraju @ 2016-08-31 17:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator
> 
> Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
> of memory when booting a secondary kernel. Srikar Dronamraju reported that
> multiple nodes may have no memory managed by the buddy allocator but still
> return true for populated_zone().
> 
> Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> was reported to cause kswapd to spin at 100% CPU usage when fadump was
> enabled. The old code happened to deal with the situation of a populated
> node with zero free pages by co-incidence but the current code tries to
> reclaim populated zones without realising that is impossible.
> 
> We cannot just convert populated_zone() as many existing users really
> need to check for present_pages. This patch introduces a managed_zone()
> helper and uses it in the few cases where it is critical that the check
> is made for managed pages -- zonelist constuction and page reclaim.

one nit
s/constuction/construction/

> 

Verified that it works fine.

-- 
Thanks and Regards
Srikar Dronamraju

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 220+ messages in thread

end of thread, other threads:[~2016-08-31 17:33 UTC | newest]

Thread overview: 220+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-08  9:34 [PATCH 00/34] Move LRU page reclaim from zones to nodes v9 Mel Gorman
2016-07-08  9:34 ` Mel Gorman
2016-07-08  9:34 ` [PATCH 01/34] mm, vmstat: add infrastructure for per-node vmstats Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-08-03 19:13   ` Reza Arbab
2016-08-03 19:13     ` Reza Arbab
2016-07-08  9:34 ` [PATCH 02/34] mm, vmscan: move lru_lock to the node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 11:06   ` Balbir Singh
2016-07-12 11:06     ` Balbir Singh
2016-07-12 11:18     ` Mel Gorman
2016-07-12 11:18       ` Mel Gorman
2016-07-13  5:50       ` Balbir Singh
2016-07-13  5:50         ` Balbir Singh
2016-07-13  8:39         ` Vlastimil Babka
2016-07-13  8:39           ` Vlastimil Babka
2016-07-08  9:34 ` [PATCH 03/34] mm, vmscan: move LRU lists to node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-08-04 20:59   ` James Hogan
2016-08-04 20:59     ` James Hogan
2016-08-04 20:59     ` James Hogan
2016-08-05  8:41     ` Mel Gorman
2016-08-05  8:41       ` Mel Gorman
2016-08-05 10:52       ` James Hogan
2016-08-05 10:52         ` James Hogan
2016-08-05 11:55         ` Mel Gorman
2016-08-05 11:55           ` Mel Gorman
2016-08-05 11:55           ` Mel Gorman
2016-08-05 12:02           ` James Hogan
2016-08-05 12:02             ` James Hogan
2016-07-08  9:34 ` [PATCH 04/34] mm, mmzone: clarify the usage of zone padding Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 13:49   ` Johannes Weiner
2016-07-12 13:49     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 05/34] mm, vmscan: begin reclaiming pages on a per-node basis Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 13:54   ` Johannes Weiner
2016-07-12 13:54     ` Johannes Weiner
2016-07-14  9:19   ` Vlastimil Babka
2016-07-14  9:19     ` Vlastimil Babka
2016-07-08  9:34 ` [PATCH 06/34] mm, vmscan: have kswapd only scan based on the highest requested zone Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:05   ` Johannes Weiner
2016-07-12 14:05     ` Johannes Weiner
2016-07-13  8:37     ` Mel Gorman
2016-07-13  8:37       ` Mel Gorman
2016-07-08  9:34 ` [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-08-29  9:38   ` Srikar Dronamraju
2016-08-29  9:38     ` Srikar Dronamraju
2016-08-30 12:07     ` Mel Gorman
2016-08-30 12:07       ` Mel Gorman
2016-08-30 14:25       ` Srikar Dronamraju
2016-08-30 14:25         ` Srikar Dronamraju
2016-08-30 15:00         ` Mel Gorman
2016-08-30 15:00           ` Mel Gorman
2016-08-31  6:09           ` Srikar Dronamraju
2016-08-31  6:09             ` Srikar Dronamraju
2016-08-31  8:49             ` Mel Gorman
2016-08-31  8:49               ` Mel Gorman
2016-08-31 11:09               ` Michal Hocko
2016-08-31 11:09                 ` Michal Hocko
2016-08-31 12:46                 ` Mel Gorman
2016-08-31 12:46                   ` Mel Gorman
2016-08-31 17:33               ` Srikar Dronamraju
2016-08-31 17:33                 ` Srikar Dronamraju
2016-07-08  9:34 ` [PATCH 08/34] mm, vmscan: remove balance gap Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:06   ` Johannes Weiner
2016-07-12 14:06     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 09/34] mm, vmscan: simplify the logic deciding whether kswapd sleeps Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-08  9:34 ` [PATCH 10/34] mm, vmscan: by default have direct reclaim only shrink once per node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-08  9:34 ` [PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:22   ` Johannes Weiner
2016-07-12 14:22     ` Johannes Weiner
2016-07-13  8:40     ` Mel Gorman
2016-07-13  8:40       ` Mel Gorman
2016-07-14  9:45   ` Vlastimil Babka
2016-07-14  9:45     ` Vlastimil Babka
2016-07-08  9:34 ` [PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:29   ` Johannes Weiner
2016-07-12 14:29     ` Johannes Weiner
2016-07-13  8:47     ` Mel Gorman
2016-07-13  8:47       ` Mel Gorman
2016-07-13 12:28       ` Johannes Weiner
2016-07-13 12:28         ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 13/34] mm, vmscan: make shrink_node decisions more node-centric Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:32   ` Johannes Weiner
2016-07-12 14:32     ` Johannes Weiner
2016-07-13  8:48     ` Mel Gorman
2016-07-13  8:48       ` Mel Gorman
2016-07-08  9:34 ` [PATCH 14/34] mm, memcg: move memcg limit enforcement from zones to nodes Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:38   ` Johannes Weiner
2016-07-12 14:38     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 15/34] mm, workingset: make working set detection node-aware Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-08  9:34 ` [PATCH 16/34] mm, page_alloc: consider dirtyable memory in terms of nodes Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-08  9:34 ` [PATCH 17/34] mm: move page mapped accounting to the node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:42   ` Johannes Weiner
2016-07-12 14:42     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 14:58   ` Johannes Weiner
2016-07-12 14:58     ` Johannes Weiner
2016-07-13  8:55     ` Mel Gorman
2016-07-13  8:55       ` Mel Gorman
2016-07-13 13:04       ` Johannes Weiner
2016-07-13 13:04         ` Johannes Weiner
2016-07-13 13:37         ` Mel Gorman
2016-07-13 13:37           ` Mel Gorman
2016-07-13 21:13           ` Andrew Morton
2016-07-13 21:13             ` Andrew Morton
2016-07-15 10:46             ` Mel Gorman
2016-07-15 10:46               ` Mel Gorman
2016-07-15 22:35               ` Andrew Morton
2016-07-15 22:35                 ` Andrew Morton
2016-07-18 13:34                 ` Johannes Weiner
2016-07-18 13:34                   ` Johannes Weiner
2016-07-14  1:27           ` Minchan Kim
2016-07-14  1:27             ` Minchan Kim
2016-07-08  9:34 ` [PATCH 19/34] mm: move most file-based accounting to the node Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 15:11   ` Johannes Weiner
2016-07-12 15:11     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 20/34] mm: move vmscan writes and file write " Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 15:15   ` Johannes Weiner
2016-07-12 15:15     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 21/34] mm, vmscan: only wakeup kswapd once per node for the requested classzone Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 17:18   ` Johannes Weiner
2016-07-12 17:18     ` Johannes Weiner
2016-07-08  9:34 ` [PATCH 22/34] mm, page_alloc: wake kswapd based on the highest eligible zone Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 17:24   ` Johannes Weiner
2016-07-12 17:24     ` Johannes Weiner
2016-07-14 10:05   ` Vlastimil Babka
2016-07-14 10:05     ` Vlastimil Babka
2016-07-08  9:34 ` [PATCH 23/34] mm: convert zone_reclaim to node_reclaim Mel Gorman
2016-07-08  9:34   ` Mel Gorman
2016-07-12 17:28   ` Johannes Weiner
2016-07-12 17:28     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 24/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 17:31   ` Johannes Weiner
2016-07-12 17:31     ` Johannes Weiner
2016-07-14 10:09   ` Vlastimil Babka
2016-07-14 10:09     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 25/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:01   ` Johannes Weiner
2016-07-12 18:01     ` Johannes Weiner
2016-07-14 12:12   ` Vlastimil Babka
2016-07-14 12:12     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 26/34] mm, vmscan: avoid passing in remaining unnecessarily to prepare_kswapd_sleep Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:06   ` Johannes Weiner
2016-07-12 18:06     ` Johannes Weiner
2016-07-14 12:48   ` Vlastimil Babka
2016-07-14 12:48     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 27/34] mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:10   ` Johannes Weiner
2016-07-12 18:10     ` Johannes Weiner
2016-07-14 12:54   ` Vlastimil Babka
2016-07-14 12:54     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 28/34] mm, vmscan: add classzone information to tracepoints Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:13   ` Johannes Weiner
2016-07-12 18:13     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 29/34] mm, page_alloc: remove fair zone allocation policy Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:18   ` Johannes Weiner
2016-07-12 18:18     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 30/34] mm: page_alloc: cache the last node whose dirty limit is reached Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 18:43   ` Johannes Weiner
2016-07-12 18:43     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 31/34] mm: vmstat: replace __count_zone_vm_events with a zone id equivalent Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 19:10   ` Johannes Weiner
2016-07-12 19:10     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 32/34] mm: vmstat: account per-zone stalls and pages skipped during reclaim Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 19:06   ` Johannes Weiner
2016-07-12 19:06     ` Johannes Weiner
2016-07-08  9:35 ` [PATCH 33/34] mm, vmstat: print node-based stats in zoneinfo file Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-12 19:18   ` Johannes Weiner
2016-07-12 19:18     ` Johannes Weiner
2016-07-14 12:56   ` Vlastimil Babka
2016-07-14 12:56     ` Vlastimil Babka
2016-07-08  9:35 ` [PATCH 34/34] mm, vmstat: remove zone and node double accounting by approximating retries Mel Gorman
2016-07-08  9:35   ` Mel Gorman
2016-07-14 13:40   ` Vlastimil Babka
2016-07-14 13:40     ` Vlastimil Babka
2016-07-15  7:48     ` Mel Gorman
2016-07-15  7:48       ` Mel Gorman
2016-07-15 12:20       ` Vlastimil Babka
2016-07-15 12:20         ` Vlastimil Babka
2016-08-19 13:12 ` [PATCH 00/34] Move LRU page reclaim from zones to nodes v9 Andrea Arcangeli
2016-08-19 13:12   ` Andrea Arcangeli
2016-08-19 13:23   ` Vlastimil Babka
2016-08-19 13:23     ` Vlastimil Babka
2016-08-19 13:55     ` Andrea Arcangeli
2016-08-19 13:55       ` Andrea Arcangeli
2016-08-19 14:53   ` Mel Gorman
2016-08-19 14:53     ` Mel Gorman
2016-08-19 15:32     ` Andrea Arcangeli
2016-08-19 15:32       ` Andrea Arcangeli
2016-08-19 15:55       ` Mel Gorman
2016-08-19 15:55         ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.