* [patch 00/10] mm: thrash detection-based file cache sizing
@ 2013-05-30 18:03 ` Johannes Weiner
  0 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave the inactive list enough room for a new set of frequently
used pages (a new "working set") to establish itself, because the
young pages get pushed out of memory before they have a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once streams,
but was not satisfactory when use-once streaming persisted over longer
periods of time and the established working set was temporarily
suspended, like a nightly backup evicting all the interactive user
program data.
    
Subsequently, the rules were changed to only age active pages when
they exceed the number of inactive pages, i.e. leave the working set
alone as long as the other half of memory holds easy-to-reclaim
use-once pages.  This works well until working set transitions exceed
half of memory and the average access distance between the pages of
the new working set is bigger than the inactive list.  The VM will
mistake the thrashing new working set for use-once streaming, while
the unused old working set pages are stuck on the active list.

This happens on file servers and media streaming servers, where the
popular set of files changes over time.  Even though the individual
files might be smaller than half of memory, concurrent access to many
of them may still result in their inter-reference distance being
greater than half of memory.  It's also been reported on database
workloads that switch back and forth between tables that are bigger
than half of memory.
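
For reference, the current balancing rule described above boils down
to something like the following userspace sketch (illustration only,
with made-up names, not the actual kernel code):

  #include <stdbool.h>

  struct file_lists {
          unsigned long nr_inactive;
          unsigned long nr_active;
  };

  /* Should this reclaim cycle also take pages off the active list? */
  static bool should_age_active_list(const struct file_lists *lists)
  {
          /* The active list is left alone until it outgrows the inactive list. */
          return lists->nr_active > lists->nr_inactive;
  }

As long as that check stays false, new pages only ever compete with
each other for the inactive half of memory.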

This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to tell actual
use-once streaming from inactive list thrashing and subsequently adjust
the balance between the lists.
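
To illustrate the idea (the names and details below are made up for
this sketch and are not the series' actual code), the history can be
as small as a per-page snapshot of a global eviction counter, taken
when a page falls off the inactive list and compared against the
counter again when the page refaults:

  #include <stdbool.h>

  static unsigned long evictions;         /* global "eviction clock" */

  /* A page falls off the inactive list: remember when that happened. */
  static unsigned long remember_eviction(void)
  {
          return ++evictions;     /* stored in place of the evicted page */
  }

  /*
   * A previously evicted page is faulted back in.  The number of
   * evictions since it was reclaimed approximates how much more
   * inactive list space it would have needed to stay resident.  One
   * plausible threshold, used here purely for illustration, is the
   * size of the active list: if rebalancing the two lists could have
   * kept the page in memory, the refault indicates thrashing rather
   * than use-once streaming.
   */
  static bool refault_is_thrashing(unsigned long eviction_snapshot,
                                   unsigned long nr_active_pages)
  {
          unsigned long refault_distance = evictions - eviction_snapshot;

          return refault_distance <= nr_active_pages;
  }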

Version 2 of this series includes many updates to the comments,
documentation, code structure, and eviction history tracking in
response to Peter Zijlstra, Rik van Riel, Minchan Kim, Andrea
Arcangeli, Andrew Morton, and Mel Gorman.  Thanks a lot!!

wschange - test adaptiveness to new workingsets
-----------------------------------------------

On a 16G machine, a sequence of 12G files is read into the cache.
Every file is read repeatedly until fully cached in memory, then the
test moves on to the next file to show how quickly the VM adapts to a
new workingset.

--- vanilla:
Dropping caches...
Reading files until fully cached (+2 reads for activation):
data-1 (1):  9.32 4.48 4.14 4.10
data-2 (1):  9.73 9.95 10.00 10.00 9.99 9.75 9.99 9.56 10.04 9.56 10.02
    9.62 10.02 9.56 10.06 9.57 10.04 9.58 9.74 10.06 10.04 10.04 10.06
    9.60 10.07 10.07 9.70 10.03 10.07 9.65
ERROR: data-2 not fully cached after reading it 30x

The vanilla kernel never adapts to new workingsets with
inter-reference distances bigger than half of memory. The active list
is simply not challenged as long as it is bigger than the inactive
list, i.e. effectively half of memory in size, which does not give the
new pages enough time for activation.  As a result, they are thrashing
on the inactive list, which the VM mistakes for "plenty of used-once
cache" and protects the stale cache indefinitely.

--- patched:
Dropping caches...
Reading files until fully cached (+2 reads for activation):
data-1 (1):  9.41 4.58 4.21 4.16
data-2 (1):  9.58 10.00 9.72 10.22 5.77 4.29 4.22 4.20
data-3 (1):  9.71 9.71 10.13 10.25 6.08 4.42 4.19 4.17
data-1 (2):  10.00 9.79 10.32 7.53 4.49 4.21 4.18
data-2 (2):  10.02 10.27 9.10 4.64 4.25 4.19
data-3 (2):  10.02 10.33 9.14 4.66 4.25 4.21
data-1 (3):  10.04 10.35 9.18 4.67 4.27 4.22
data-2 (3):  10.08 10.36 9.33 4.72 4.26 4.23
data-3 (3):  10.09 10.41 9.31 4.72 4.29 4.24
...

The patched kernel detects the thrashing on the inactive list and
challenges the stale cache on the active list, which is eventually
evicted to make room for the new workingset.

wsprotect - test protection of workingset in presence of streaming
------------------------------------------------------------------

Streaming data does not benefit from caching, and repeatedly accessed
data that is bigger than memory cannot reasonably be cached at this
point.  That's why the VM needs to protect an existing working set in
the presence of such streaming / uncachable competitor sets.

On a 16G machine, a 4G file is read into cache.  When a 17G file is
read repeatedly, the 4G workingset data should remain cached as much
as possible.

--- vanilla:
Dropping caches...
Caching workingset file 'ws':
3.13
1.49
1.37
1.36
1.37
mincore: ws: 1048576/1048576 (100%)
Repeatedly streaming bigger than memory file 'stream':
13.46
14.09
14.09
14.14
14.09
14.00
13.45
13.43
13.47
14.04
mincore: ws: 1048576/1048576 (100%)

--- patched:
Dropping caches...
Caching workingset file 'ws':
3.18
1.56
1.43
1.41
1.41
mincore: ws: 1048576/1048576 (100%)
Repeatedly streaming bigger than memory file 'stream':
13.45
13.66
13.69
13.75
13.85
13.83
13.95
14.36
14.38
14.40
mincore: ws: 1048576/1048576 (100%)

The patched kernel observes refaulting streaming pages, but recognizes
that the set is bigger than memory and could never be fully cached.
As a result, it continues to protect the existing cache.

pft - page fault overhead
-------------------------

The zone round-robin allocator (RRALLOC) adds some overhead that shows
up in this microbenchmark, which serves tmpfs faults purely out of memory.
There is no significant impact from the remaining workingset patches:

pft
                              BASE               RRALLOC            WORKINGSET
User       1       0.0235 (  0.00%)       0.0275 (-17.02%)       0.0270 (-14.89%)
User       2       0.0275 (  0.00%)       0.0275 ( -0.00%)       0.0285 ( -3.64%)
User       3       0.0330 (  0.00%)       0.0365 (-10.61%)       0.0335 ( -1.52%)
User       4       0.0390 (  0.00%)       0.0390 (  0.00%)       0.0380 (  2.56%)
System     1       0.2645 (  0.00%)       0.2620 (  0.95%)       0.2625 (  0.76%)
System     2       0.3215 (  0.00%)       0.3310 ( -2.95%)       0.3285 ( -2.18%)
System     3       0.3935 (  0.00%)       0.4080 ( -3.68%)       0.4130 ( -4.96%)
System     4       0.4920 (  0.00%)       0.5030 ( -2.24%)       0.5045 ( -2.54%)
Elapsed    1       0.2905 (  0.00%)       0.2905 (  0.00%)       0.2905 (  0.00%)
Elapsed    2       0.1800 (  0.00%)       0.1800 (  0.00%)       0.1800 (  0.00%)
Elapsed    3       0.1500 (  0.00%)       0.1600 ( -6.67%)       0.1600 ( -6.67%)
Elapsed    4       0.1305 (  0.00%)       0.1420 ( -8.81%)       0.1415 ( -8.43%)
Faults/cpu 1  667251.7997 (  0.00%)  666296.4749 ( -0.14%)  667880.8099 (  0.09%)
Faults/cpu 2  551464.0345 (  0.00%)  536113.4630 ( -2.78%)  538286.2087 ( -2.39%)
Faults/cpu 3  452403.4425 (  0.00%)  433856.5320 ( -4.10%)  432193.9888 ( -4.47%)
Faults/cpu 4  362691.4491 (  0.00%)  356514.8821 ( -1.70%)  356436.5711 ( -1.72%)
Faults/sec 1  663612.5980 (  0.00%)  662501.4959 ( -0.17%)  664037.3123 (  0.06%)
Faults/sec 2 1096166.5317 (  0.00%) 1064679.7154 ( -2.87%) 1068906.1040 ( -2.49%)
Faults/sec 3 1272925.4995 (  0.00%) 1209241.9167 ( -5.00%) 1202868.9190 ( -5.50%)
Faults/sec 4 1437691.1054 (  0.00%) 1362549.9877 ( -5.23%) 1381633.9889 ( -3.90%)

                BASE     RRALLOC  WORKINGSET
User            2.53        2.63        2.59
System         34.01       34.94       35.08
Elapsed        18.93       19.49       19.52

kernbench - impact on kernel hacker workloads
---------------------------------------------

In a workload that is not purely allocator bound and also does some
computation and IO, the added allocator overhead is in the noise:

                                BASE               RRALLOC            WORKINGSET
User    min        1163.95 (  0.00%)     1131.79 (  2.76%)     1123.41 (  3.48%)
User    mean       1170.76 (  0.00%)     1139.68 (  2.65%)     1125.63 (  3.85%)
User    stddev        6.38 (  0.00%)        7.91 (-24.00%)        1.37 ( 78.60%)
User    max        1182.17 (  0.00%)     1149.63 (  2.75%)     1127.55 (  4.62%)
User    range        18.22 (  0.00%)       17.84 (  2.09%)        4.14 ( 77.28%)
System  min          79.97 (  0.00%)       80.13 ( -0.20%)       78.21 (  2.20%)
System  mean         80.55 (  0.00%)       80.68 ( -0.16%)       78.93 (  2.01%)
System  stddev        0.80 (  0.00%)        0.55 ( 31.73%)        0.44 ( 44.91%)
System  max          82.11 (  0.00%)       81.38 (  0.89%)       79.33 (  3.39%)
System  range         2.14 (  0.00%)        1.25 ( 41.59%)        1.12 ( 47.66%)
Elapsed min         319.04 (  0.00%)      310.75 (  2.60%)      307.69 (  3.56%)
Elapsed mean        320.98 (  0.00%)      313.65 (  2.28%)      309.33 (  3.63%)
Elapsed stddev        2.37 (  0.00%)        2.27 (  4.37%)        1.40 ( 40.92%)
Elapsed max         325.52 (  0.00%)      316.83 (  2.67%)      311.69 (  4.25%)
Elapsed range         6.48 (  0.00%)        6.08 (  6.17%)        4.00 ( 38.27%)
CPU     min         388.00 (  0.00%)      386.00 (  0.52%)      386.00 (  0.52%)
CPU     mean        389.40 (  0.00%)      388.60 (  0.21%)      389.00 (  0.10%)
CPU     stddev        0.80 (  0.00%)        1.50 (-87.08%)        1.55 (-93.65%)
CPU     max         390.00 (  0.00%)      390.00 (  0.00%)      390.00 (  0.00%)
CPU     range         2.00 (  0.00%)        4.00 (-100.00%)        4.00 (-100.00%)

                BASE     RRALLOC  WORKINGSET
User         7009.94     6821.10     6755.85
System        489.88      490.82      481.82
Elapsed      1974.68     1930.58     1909.76

micro - reclaim micro benchmark
-------------------------------

This multi-threaded micro benchmark creates memory pressure with a mix
of anonymous and mapped file memory.  By spreading memory more evenly
among the available nodes, the round-robin allocator greatly improves
reclaim behavior in terms of overall IO, swapping, efficiency, direct
reclaim invocations, and reclaim writeback:

                BASE     RRALLOC  WORKINGSET
User          558.11      566.39      564.37
System         28.36       25.60       24.29
Elapsed       394.70      387.38      386.07

                                  BASE     RRALLOC  WORKINGSET
Page Ins                       6853744     5764336     5672052
Page Outs                     12136640    10673568    10617640
Swap Ins                             0           0           0
Swap Outs                         6702           0           0
Direct pages scanned           1751264      176965      238264
Kswapd pages scanned           4832689     3751475     3595031
Kswapd pages reclaimed         2347185     2325232     2239671
Direct pages reclaimed          419104      176226      236990
Kswapd efficiency                  48%         61%         62%
Kswapd velocity              12243.955    9684.225    9311.863
Direct efficiency                  23%         99%         99%
Direct velocity               4436.950     456.825     617.152
Percentage direct scans            26%          4%          6%
Page writes by reclaim          661863       10182       11310
Page writes file                655161       10182       11310
Page writes anon                  6702           0           0
Page reclaim immediate         1083840       15373       24797
Page rescued immediate               0           0           0
Slabs scanned                    10240       13312       11776
Direct inode steals                  0           0           0
Kswapd inode steals                  0           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                   2057        2249        3163
THP collapse alloc                   0           0           0
THP splits                           6           0           0
THP fault fallback                5824        5630        4719
THP collapse fail                    0           0           0
Compaction stalls                  551         484         610
Compaction success                 105          47          91
Compaction failures                446         437         484
Page migrate success            176065      103764      135377
Page migrate failure                 0           0           0
Compaction pages isolated       443314      263699      346198
Compaction migrate scanned      687684      598263      640277
Compaction free scanned       14437356     5061851     4744974
Compaction cost                    195         116         151
NUMA PTE updates                     0           0           0
NUMA hint faults                     0           0           0
NUMA hint local faults               0           0           0
NUMA pages migrated                  0           0           0
AutoNUMA cost                        0           0           0

memcachetest - streaming IO impact on anonymous workingset
----------------------------------------------------------

This test runs a latency-sensitive in-core workload that is
accompanied by use-once page cache streams of increasing size in the
background.

It too shows great improvements in allocation/reclaim behavior.  The
in-core workload is much less affected by the background IO, even
though IO throughput itself increased.  The same reclaim improvements
show up as before: reduced swapping and page faults, increased reclaim
efficiency, and less writeback from reclaim:

                                              BASE                     RRALLOC                  WORKINGSET
Ops memcachetest-0M             15294.00 (  0.00%)          15492.00 (  1.29%)          16420.00 (  7.36%)
Ops memcachetest-375M           15574.00 (  0.00%)          15510.00 ( -0.41%)          16602.00 (  6.60%)
Ops memcachetest-1252M           8908.00 (  0.00%)          15733.00 ( 76.62%)          16640.00 ( 86.80%)
Ops memcachetest-2130M           2652.00 (  0.00%)          16089.00 (506.67%)          16764.00 (532.13%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-375M                6.00 (  0.00%)              5.00 ( 16.67%)              6.00 (  0.00%)
Ops io-duration-1252M              52.00 (  0.00%)             17.00 ( 67.31%)             17.00 ( 67.31%)
Ops io-duration-2130M             124.00 (  0.00%)             30.00 ( 75.81%)             30.00 ( 75.81%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-375M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-1252M            169167.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-2130M            278835.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-375M                     0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-1252M                78117.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2130M               135073.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M             776489.00 (  0.00%)         779312.00 ( -0.36%)         783329.00 ( -0.88%)
Ops minorfaults-375M           778665.00 (  0.00%)         780201.00 ( -0.20%)         784954.00 ( -0.81%)
Ops minorfaults-1252M          898776.00 (  0.00%)         781391.00 ( 13.06%)         785025.00 ( 12.66%)
Ops minorfaults-2130M          838654.00 (  0.00%)         782741.00 (  6.67%)         785580.00 (  6.33%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-375M                0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-1252M           10916.00 (  0.00%)             38.00 ( 99.65%)             38.00 ( 99.65%)
Ops majorfaults-2130M           19278.00 (  0.00%)             38.00 ( 99.80%)             38.00 ( 99.80%)

                BASE     RRALLOC  WORKINGSET
User          521.34      654.91      671.03
System       1694.60     2181.44     2157.61
Elapsed      4781.91     4701.73     4700.31

                                  BASE     RRALLOC  WORKINGSET
Page Ins                       3609444       18304       18296
Page Outs                     23111464    19283920    19285644
Swap Ins                        831734           0           0
Swap Outs                       950459           0           0
Direct pages scanned            354478           0        1061
Kswapd pages scanned           6490315     2808074     2875760
Kswapd pages reclaimed         3116126     2808050     2875738
Direct pages reclaimed          324821           0        1061
Kswapd efficiency                  48%         99%         99%
Kswapd velocity               1357.264     597.243     611.823
Direct efficiency                  91%        100%        100%
Direct velocity                 74.129       0.000       0.226
Percentage direct scans             5%          0%          0%
Page writes by reclaim         2088376           0           0
Page writes file               1137917           0           0
Page writes anon                950459           0           0
Page reclaim immediate          195121           0           0
Page rescued immediate               0           0           0
Slabs scanned                    35328           0           0
Direct inode steals                  0           0           0
Kswapd inode steals              19613           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8           3           0
THP collapse alloc                2495         871        1025
THP splits                          18          10           7
THP fault fallback                   0           0           0
THP collapse fail                   24          65          59
Compaction stalls                   66           2           2
Compaction success                  45           0           0
Compaction failures                 21           2           2
Page migrate success             39331           0           0
Page migrate failure                 0           0           0
Compaction pages isolated        84996           0           0
Compaction migrate scanned       59149           0           0
Compaction free scanned         916327           0           0
Compaction cost                     42           0           0
NUMA PTE updates                     0           0           0
NUMA hint faults                     0           0           0
NUMA hint local faults               0           0           0
NUMA pages migrated                  0           0           0
AutoNUMA cost                        0           0           0

---

Patch #1 solves a fairness problem we have with the per-zone LRU
lists, where the time a file cache page gets to stay in memory depends
on the zone it is allocated from.  The proposed solution is a very
simple (and maybe too crude) round-robin allocator.  It's a problem
that exists without this patch series, but the thrash detection
fundamentally relies on fair aging, so this is included here.

Patches #2-#6 prepare the page cache radix tree for non-page entries
that represent evicted pages.
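
As a rough illustration of what such non-page entries could look like
(the actual encoding in these patches may differ), the eviction
information can be packed into a word with a low tag bit set, which a
real page pointer, being at least word-aligned, can never have:

  #include <stdbool.h>

  #define SHADOW_TAG      1UL     /* bit 0 is never set in a page pointer */

  static inline void *pack_shadow(unsigned long eviction_clock)
  {
          return (void *)((eviction_clock << 1) | SHADOW_TAG);
  }

  static inline bool is_shadow_entry(void *entry)
  {
          return (unsigned long)entry & SHADOW_TAG;
  }

  static inline unsigned long unpack_shadow(void *entry)
  {
          return (unsigned long)entry >> 1;
  }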

Patch #7 prepares the page cache allocation path for passing refault
information from the fault handler down to the page allocator,
which will later use it to prime the reclaim scanner for list
rebalancing.

Patch #9 is the thrash detection code.

Patch #10 keeps the eviction history in check, both by throttling the
number of non-page entries remembered in the radix trees when the
per-file refault ratio is very small and by adding a shrinker that
trims those entries when they still grow excessively.

 fs/btrfs/compression.c           |   9 +-
 fs/cachefiles/rdwr.c             |  25 ++-
 fs/ceph/xattr.c                  |   2 +-
 fs/inode.c                       |   8 +-
 fs/logfs/readwrite.c             |   9 +-
 fs/nfs/blocklayout/blocklayout.c |   2 +-
 fs/nilfs2/inode.c                |   4 +-
 fs/ntfs/file.c                   |  10 +-
 fs/splice.c                      |   9 +-
 include/linux/fs.h               |   3 +
 include/linux/gfp.h              |  18 +-
 include/linux/mm.h               |   8 +
 include/linux/mmzone.h           |   9 +
 include/linux/pagemap.h          |  59 ++++--
 include/linux/pagevec.h          |   3 +
 include/linux/radix-tree.h       |   5 +-
 include/linux/shmem_fs.h         |   1 +
 include/linux/swap.h             |   9 +
 include/linux/vm_event_item.h    |   1 +
 include/linux/writeback.h        |   1 +
 lib/radix-tree.c                 | 105 +++------
 mm/Makefile                      |   2 +-
 mm/filemap.c                     | 289 ++++++++++++++++++++-----
 mm/memcontrol.c                  |   3 +
 mm/mempolicy.c                   |  17 +-
 mm/mincore.c                     |  20 +-
 mm/mmzone.c                      |   1 +
 mm/page-writeback.c              |   2 +-
 mm/page_alloc.c                  |  90 +++++---
 mm/readahead.c                   |  12 +-
 mm/shmem.c                       | 122 +++--------
 mm/swap.c                        |  22 ++
 mm/truncate.c                    |  78 +++++--
 mm/vmscan.c                      |  45 +++-
 mm/vmstat.c                      |   4 +
 mm/workingset.c                  | 423 +++++++++++++++++++++++++++++++++++++
 net/ceph/pagelist.c              |   4 +-
 net/ceph/pagevec.c               |   2 +-
 38 files changed, 1083 insertions(+), 353 deletions(-)


* [patch 01/10] mm: page_alloc: zone round-robin allocator
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:03   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

Each zone that holds pages of one workload must be aged at a speed
proportional to the zone size.  Otherwise, the time an individual page
gets to stay in memory depends on the zone it happened to be allocated
in.  Asymmetry in the zone aging creates rather unpredictable aging
behavior and results in the wrong pages being reclaimed, activated,
etc.

But exactly this happens right now because of the way the page
allocator and kswapd interact.  The page allocator uses per-node lists
of all zones in the system, ordered by preference, when allocating a
new page.  When the first iteration does not yield any results, kswapd
is woken up and the allocator retries.  Because kswapd reclaims a zone
until it is above the high watermark, while a zone can be allocated
from as soon as it is above the low watermark, the allocator may keep
kswapd running, and kswapd reclaim ensures that the page allocator can
keep allocating from the first zone in the zonelist for extended
periods of time.  Meanwhile the other zones rarely see new allocations
and thus get aged much more slowly in comparison.

The result is that the occasional page placed in lower zones gets
relatively more time in memory and may even get promoted to the active
list after its peers have long been evicted.  Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there
may be significant amounts of memory available in the lower zones.

Even the most basic test -- repeatedly reading a file slightly bigger
than memory -- shows how broken the zone aging is.  In this scenario,
no single page should be able to stay in memory long enough to get
referenced twice and activated, but activation happens in spades:

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 0
      nr_active_file 8
      nr_inactive_file 1582
      nr_active_file 11994
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 70
      nr_inactive_file 258753
      nr_active_file 443214
      nr_inactive_file 149793
      nr_active_file 12021

This problem will be more pronounced when subsequent patches base list
rebalancing decisions on the time between the eviction of pages and
their refault into memory, as the measured time values might be
heavily skewed by the aging speed imbalances.

Fix this with a very simple round-robin allocator.  Each zone is
allowed a batch of allocations that is proportional to the zone's
size, after which it is treated as full.  The batch counters are reset
when all zones have been tried and kswapd is woken up.

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 174
      nr_active_file 4865
      nr_inactive_file 53
      nr_active_file 860
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 666622
      nr_active_file 4988
      nr_inactive_file 190969
      nr_active_file 937

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  3 +++
 mm/page_alloc.c        | 32 +++++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c74092e..370a35f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -367,6 +367,9 @@ struct zone {
 #endif
 	struct free_area	free_area[MAX_ORDER];
 
+	/* zone round-robin allocator batch */
+	atomic_t		alloc_batch;
+
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..a64d786 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1896,6 +1896,14 @@ zonelist_scan:
 		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
 			goto this_zone_full;
 
+		/*
+		 * XXX: Ensure similar zone aging speeds by
+		 * round-robin allocating through the zonelist.
+		 */
+		if (atomic_read(&zone->alloc_batch) >
+		    high_wmark_pages(zone) - low_wmark_pages(zone))
+			goto this_zone_full;
+
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
@@ -1906,6 +1914,21 @@ zonelist_scan:
 				    classzone_idx, alloc_flags))
 				goto try_this_zone;
 
+			/*
+			 * XXX: With multiple nodes, kswapd balancing
+			 * is not synchronized with the round robin
+			 * allocation quotas.  If kswapd of a node
+			 * goes to sleep at the wrong time, a zone
+			 * might reach the low watermark while there
+			 * is still allocation quota left.  Kick
+			 * kswapd in this situation to ensure the
+			 * aging speed of the zone.  It's got to be
+			 * rebalanced anyway...
+			 */
+			if (!(gfp_mask & __GFP_NO_KSWAPD))
+				wakeup_kswapd(zone, order,
+					      zone_idx(preferred_zone));
+
 			if (IS_ENABLED(CONFIG_NUMA) &&
 					!did_zlc_setup && nr_online_nodes > 1) {
 				/*
@@ -1962,7 +1985,8 @@ this_zone_full:
 		goto zonelist_scan;
 	}
 
-	if (page)
+	if (page) {
+		atomic_add(1 << order, &zone->alloc_batch);
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -1971,7 +1995,7 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
-
+	}
 	return page;
 }
 
@@ -2303,8 +2327,10 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 	struct zoneref *z;
 	struct zone *zone;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		atomic_set(&zone->alloc_batch, 0);
 		wakeup_kswapd(zone, order, classzone_idx);
+	}
 }
 
 static inline int
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [patch 01/10] mm: page_alloc: zone round-robin allocator
@ 2013-05-30 18:03   ` Johannes Weiner
  0 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

Each zone that holds pages of one workload must be aged at a speed
proportional to the zone size.  Otherwise, the time an individual page
gets to stay in memory depends on the zone it happened to be allocated
in.  Asymmetry in the zone aging creates rather unpredictable aging
behavior and results in the wrong pages being reclaimed, activated
etc.

But exactly this happens right now because of the way the page
allocator and kswapd interact.  The page allocator uses per-node lists
of all zones in the system, ordered by preference, when allocating a
new page.  When the first iteration does not yield any results, kswapd
is woken up and the allocator retries.  Due to the way kswapd reclaims
zones below the high watermark but a zone can be allocated from when
it is above the low watermark, the allocator may keep kswapd running
while kswapd reclaim ensures that the page allocator can keep
allocating from the first zone in the zonelist for extended periods of
time.  Meanwhile the other zones rarely see new allocations and thus
get aged much slower in comparison.

The result is that the occasional page placed in lower zones gets
relatively more time in memory, even get promoted to the active list
after its peers have long been evicted.  Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there
may be significant amounts of memory available in the lower zones.

Even the most basic test -- repeatedly reading a file slightly bigger
than memory -- shows how broken the zone aging is.  In this scenario,
no single page should be able stay in memory long enough to get
referenced twice and activated, but activation happens in spades:

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 0
      nr_active_file 8
      nr_inactive_file 1582
      nr_active_file 11994
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 70
      nr_inactive_file 258753
      nr_active_file 443214
      nr_inactive_file 149793
      nr_active_file 12021

This problem will be more pronounced when subsequent patches base list
rebalancing decisions on the time between the eviction of pages and
their refault into memory, as the measured time values might be
heavily skewed by the aging speed imbalances.

Fix this with a very simple round-robin allocator.  Each zone is
allowed a batch of allocations that is proportional to the zone's
size, after which it is treated as full.  The batch counters are reset
when all zones have been tried and kswapd is woken up.

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 174
      nr_active_file 4865
      nr_inactive_file 53
      nr_active_file 860
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 666622
      nr_active_file 4988
      nr_inactive_file 190969
      nr_active_file 937

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  3 +++
 mm/page_alloc.c        | 32 +++++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c74092e..370a35f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -367,6 +367,9 @@ struct zone {
 #endif
 	struct free_area	free_area[MAX_ORDER];
 
+	/* zone round-robin allocator batch */
+	atomic_t		alloc_batch;
+
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..a64d786 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1896,6 +1896,14 @@ zonelist_scan:
 		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
 			goto this_zone_full;
 
+		/*
+		 * XXX: Ensure similar zone aging speeds by
+		 * round-robin allocating through the zonelist.
+		 */
+		if (atomic_read(&zone->alloc_batch) >
+		    high_wmark_pages(zone) - low_wmark_pages(zone))
+			goto this_zone_full;
+
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
 			unsigned long mark;
@@ -1906,6 +1914,21 @@ zonelist_scan:
 				    classzone_idx, alloc_flags))
 				goto try_this_zone;
 
+			/*
+			 * XXX: With multiple nodes, kswapd balancing
+			 * is not synchronized with the round robin
+			 * allocation quotas.  If kswapd of a node
+			 * goes to sleep at the wrong time, a zone
+			 * might reach the low watermark while there
+			 * is still allocation quota left.  Kick
+			 * kswapd in this situation to ensure the
+			 * aging speed of the zone.  It's got to be
+			 * rebalanced anyway...
+			 */
+			if (!(gfp_mask & __GFP_NO_KSWAPD))
+				wakeup_kswapd(zone, order,
+					      zone_idx(preferred_zone));
+
 			if (IS_ENABLED(CONFIG_NUMA) &&
 					!did_zlc_setup && nr_online_nodes > 1) {
 				/*
@@ -1962,7 +1985,8 @@ this_zone_full:
 		goto zonelist_scan;
 	}
 
-	if (page)
+	if (page) {
+		atomic_add(1 << order, &zone->alloc_batch);
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
 		 * necessary to allocate the page. The expectation is
@@ -1971,7 +1995,7 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
-
+	}
 	return page;
 }
 
@@ -2303,8 +2327,10 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 	struct zoneref *z;
 	struct zone *zone;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		atomic_set(&zone->alloc_batch, 0);
 		wakeup_kswapd(zone, order, classzone_idx);
+	}
 }
 
 static inline int
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [patch 02/10] lib: radix-tree: radix_tree_delete_item()
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:03   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

Provide a function that does not just delete an entry at a given
index, but also allows passing in an expected item.  Delete only if
that item is still located at the specified index.

This is handy for lockless tree traversals that also want to delete
entries, because they no longer have to do a second, locked lookup to
verify that the slot has not changed under them before deleting the
entry.
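
A minimal sketch of the intended calling pattern (hypothetical caller;
"page" stands for whatever entry the tree holds, and the mapping's
tree_lock is assumed to be the lock that protects the tree):

	void *entry;

	/*
	 * Delete 'page' only if it is still the entry at 'index'.  If a
	 * concurrent update replaced the slot after the lockless lookup,
	 * radix_tree_delete_item() returns NULL and leaves the tree
	 * untouched.
	 */
	spin_lock_irq(&mapping->tree_lock);
	entry = radix_tree_delete_item(&mapping->page_tree, index, page);
	spin_unlock_irq(&mapping->tree_lock);
	if (entry != page)
		return -ENOENT;

The shmem patch later in this series uses exactly this pattern in
shmem_free_swap().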

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h |  1 +
 lib/radix-tree.c           | 30 ++++++++++++++++++++++++++----
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..622b8d4 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -219,6 +219,7 @@ static inline void radix_tree_replace_slot(void **pslot, void *item)
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..2c1c994 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1298,15 +1298,18 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 }
 
 /**
- *	radix_tree_delete    -    delete an item from a radix tree
+ *	radix_tree_delete_item    -    delete an item from a radix tree
  *	@root:		radix tree root
  *	@index:		index key
+ *	@item:		expected item
  *
- *	Remove the item at @index from the radix tree rooted at @root.
+ *	Remove @item at @index from the radix tree rooted at @root.
  *
- *	Returns the address of the deleted item, or NULL if it was not present.
+ *	Returns the address of the deleted item, or NULL if it was not present
+ *	or the entry at the given @index was not @item.
  */
-void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+void *radix_tree_delete_item(struct radix_tree_root *root,
+			     unsigned long index, void *item)
 {
 	struct radix_tree_node *node = NULL;
 	struct radix_tree_node *slot = NULL;
@@ -1341,6 +1344,11 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 	if (slot == NULL)
 		goto out;
 
+	if (item && slot != item) {
+		slot = NULL;
+		goto out;
+	}
+
 	/*
 	 * Clear all tags associated with the item to be deleted.
 	 * This way of doing it would be inefficient, but seldom is any set.
@@ -1385,6 +1393,20 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 out:
 	return slot;
 }
+
+/**
+ *	radix_tree_delete    -    delete an item from a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Remove the item at @index from the radix tree rooted at @root.
+ *
+ *	Returns the address of the deleted item, or NULL if it was not present.
+ */
+void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+{
+	return radix_tree_delete_item(root, index, NULL);
+}
 EXPORT_SYMBOL(radix_tree_delete);
 
 /**
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [patch 03/10] mm: shmem: save one radix tree lookup when truncating swapped pages
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:03   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:03 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

Page cache radix tree slots are usually stabilized by the page lock,
but shmem's swap cookies have no such thing.  Because the overall
truncation loop is lockless, the swap entry is currently confirmed by
a tree lookup and then deleted by another tree lookup under the same
tree lock region.

Use radix_tree_delete_item() instead, which does the verification and
deletion with only one lookup.  This also allows removing the
delete-only special case from shmem_radix_tree_replace().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/shmem.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 1c44af7..f6f5e4c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -240,19 +240,17 @@ static int shmem_radix_tree_replace(struct address_space *mapping,
 			pgoff_t index, void *expected, void *replacement)
 {
 	void **pslot;
-	void *item = NULL;
+	void *item;
 
 	VM_BUG_ON(!expected);
+	VM_BUG_ON(!replacement);
 	pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
-	if (pslot)
-		item = radix_tree_deref_slot_protected(pslot,
-							&mapping->tree_lock);
+	if (!pslot)
+		return -ENOENT;
+	item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
 	if (item != expected)
 		return -ENOENT;
-	if (replacement)
-		radix_tree_replace_slot(pslot, replacement);
-	else
-		radix_tree_delete(&mapping->page_tree, index);
+	radix_tree_replace_slot(pslot, replacement);
 	return 0;
 }
 
@@ -384,14 +382,15 @@ export:
 static int shmem_free_swap(struct address_space *mapping,
 			   pgoff_t index, void *radswap)
 {
-	int error;
+	void *old;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = shmem_radix_tree_replace(mapping, index, radswap, NULL);
+	old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
 	spin_unlock_irq(&mapping->tree_lock);
-	if (!error)
-		free_swap_and_cache(radix_to_swp_entry(radswap));
-	return error;
+	if (old != radswap)
+		return -ENOENT;
+	free_swap_and_cache(radix_to_swp_entry(radswap));
+	return 0;
 }
 
 /*
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [patch 04/10] mm: filemap: move radix tree hole searching here
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:04   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:04 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

The radix tree hole searching code is only used for the page cache,
for example by the readahead code trying to get a picture of the area
surrounding a fault.

So far it sufficed to rely on the radix tree definition of a hole,
which is "empty tree slot".  But this is about to change, as shadow
page descriptors will be stored in the page cache after the actual
pages get evicted from memory.

Move the functions over to mm/filemap.c and make them native page
cache operations, where they can later be adapted to handle the new
definition of "page cache hole".
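
The calling convention carries over unchanged from the radix tree
helpers.  As a reminder, a hypothetical caller checks for "no hole
found" like this (ignoring the rare wrap-around case spelled out in
the kerneldoc below):

	pgoff_t hole;

	rcu_read_lock();
	hole = page_cache_next_hole(mapping, index, max_scan);
	rcu_read_unlock();

	if (hole - index >= max_scan) {
		/* no hole within max_scan slots: the range is populated */
	} else {
		/* 'hole' is the first index in the range without a page */
	}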

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/nfs/blocklayout/blocklayout.c |  2 +-
 include/linux/pagemap.h          |  5 +++
 include/linux/radix-tree.h       |  4 ---
 lib/radix-tree.c                 | 75 ---------------------------------------
 mm/filemap.c                     | 76 ++++++++++++++++++++++++++++++++++++++++
 mm/readahead.c                   |  4 +--
 6 files changed, 84 insertions(+), 82 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 434b93e..821d5cb 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1219,7 +1219,7 @@ static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t idx)
 	end = DIV_ROUND_UP(i_size_read(inode), PAGE_CACHE_SIZE);
 	if (end != NFS_I(inode)->npages) {
 		rcu_read_lock();
-		end = radix_tree_next_hole(&mapping->page_tree, idx + 1, ULONG_MAX);
+		end = page_cache_next_hole(mapping, idx + 1, ULONG_MAX);
 		rcu_read_unlock();
 	}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0e38e13..206ae96 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -243,6 +243,11 @@ static inline struct page *page_cache_alloc_readahead(struct address_space *x)
 
 typedef int filler_t(void *, struct page *);
 
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+
 extern struct page * find_get_page(struct address_space *mapping,
 				pgoff_t index);
 extern struct page * find_lock_page(struct address_space *mapping,
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 622b8d4..91de5c2 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -227,10 +227,6 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 			void ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 2c1c994..6428f04 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -909,81 +909,6 @@ next:
 }
 EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);
 
-
-/**
- *	radix_tree_next_hole    -    find the next hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the lowest
- *	indexed hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'return - index >= max_scan'
- *	will be true). In rare cases of index wrap-around, 0 will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 5, then subsequently a hole is created at index 10,
- *	radix_tree_next_hole covering both indexes may return 10 if called
- *	under rcu_read_lock.
- */
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index++;
-		if (index == 0)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_next_hole);
-
-/**
- *	radix_tree_prev_hole    -    find the prev hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search backwards in the range [max(index-max_scan+1, 0), index]
- *	for the first hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'index - return >= max_scan'
- *	will be true). In rare cases of wrap-around, ULONG_MAX will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 10, then subsequently a hole is created at index 5,
- *	radix_tree_prev_hole covering both indexes may return 5 if called under
- *	rcu_read_lock.
- */
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				   unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index--;
-		if (index == ULONG_MAX)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_prev_hole);
-
 /**
  *	radix_tree_gang_lookup - perform multiple lookup on a radix tree
  *	@root:		radix tree root
diff --git a/mm/filemap.c b/mm/filemap.c
index e1979fd..ccd5af8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -666,6 +666,82 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 }
 
 /**
+ * page_cache_next_hole - find the next hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
+ * lowest indexed hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'return - index >=
+ * max_scan' will be true). In rare cases of index wrap-around, 0 will
+ * be returned.
+ *
+ * page_cache_next_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 5, then subsequently a hole is created at
+ * index 10, page_cache_next_hole covering both indexes may return 10
+ * if called under rcu_read_lock.
+ */
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index++;
+		if (index == 0)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_next_hole);
+
+/**
+ * page_cache_prev_hole - find the prev hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search backwards in the range [max(index-max_scan+1, 0), index] for
+ * the first hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'index - return >=
+ * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX
+ * will be returned.
+ *
+ * page_cache_prev_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 10, then subsequently a hole is created at
+ * index 5, page_cache_prev_hole covering both indexes may return 5 if
+ * called under rcu_read_lock.
+ */
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index--;
+		if (index == ULONG_MAX)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_prev_hole);
+
+/**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
diff --git a/mm/readahead.c b/mm/readahead.c
index 7963f23..187cbea 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -351,7 +351,7 @@ static pgoff_t count_history_pages(struct address_space *mapping,
 	pgoff_t head;
 
 	rcu_read_lock();
-	head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
+	head = page_cache_prev_hole(mapping, offset - 1, max);
 	rcu_read_unlock();
 
 	return offset - 1 - head;
@@ -430,7 +430,7 @@ ondemand_readahead(struct address_space *mapping,
 		pgoff_t start;
 
 		rcu_read_lock();
-		start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+		start = page_cache_next_hole(mapping, offset + 1, max);
 		rcu_read_unlock();
 
 		if (!start || start - offset > max)
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [patch 05/10] mm + fs: prepare for non-page entries in page cache radix trees
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:04   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:04 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for the regular page cache,
prepare every site that deals with the page cache radix tree directly
to handle entries other than pages.

The common lookup functions will filter out non-page entries and
return NULL for page cache holes, just as before.  But provide a raw
version of the API which returns non-page entries as well, and switch
shmem over to use it.
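
As a sketch of what the raw lookup hands back (hypothetical caller;
this mirrors the mincore and shmem hunks below and is essentially what
the non-raw find_get_page() wrapper does internally):

	struct page *page;

	page = __find_get_page(mapping, index);
	if (radix_tree_exceptional_entry(page)) {
		/*
		 * Not a struct page and no reference held: a shmem swap
		 * entry today, a shadow entry for an evicted page later
		 * in this series.  Decode it or ignore it, but never
		 * treat it as a page.
		 */
		page = NULL;
	}
	if (page) {
		/* A real page; the lookup already raised its refcount. */
		/* ... use the page, then drop the reference ... */
		page_cache_release(page);
	}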

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c   |   2 +-
 include/linux/mm.h       |   8 +++
 include/linux/pagemap.h  |  15 +++---
 include/linux/pagevec.h  |   3 ++
 include/linux/shmem_fs.h |   1 +
 mm/filemap.c             | 130 +++++++++++++++++++++++++++++++++++++++--------
 mm/mincore.c             |  20 +++++---
 mm/readahead.c           |   2 +-
 mm/shmem.c               |  97 +++++++----------------------------
 mm/swap.c                |  20 ++++++++
 mm/truncate.c            |  75 +++++++++++++++++++++------
 11 files changed, 245 insertions(+), 128 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 15b9408..4a80f6b 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -472,7 +472,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, pg_index);
 		rcu_read_unlock();
-		if (page) {
+		if (page && !radix_tree_exceptional_entry(page)) {
 			misses++;
 			if (misses > 4)
 				break;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2091b8..c1dcd2f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -905,6 +905,14 @@ extern void show_free_areas(unsigned int flags);
 extern bool skip_free_areas_node(unsigned int flags, int nid);
 
 int shmem_zero_setup(struct vm_area_struct *);
+#ifdef CONFIG_SHMEM
+bool shmem_mapping(struct address_space *mapping);
+#else
+static inline bool shmem_mapping(struct address_space *mapping)
+{
+	return false;
+}
+#endif
 
 extern int can_do_mlock(void);
 extern int user_shm_lock(size_t, struct user_struct *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 206ae96..a972341 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -248,12 +248,15 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 pgoff_t page_cache_prev_hole(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
-extern struct page * find_get_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_lock_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_or_create_page(struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
+				 gfp_t gfp_mask);
+unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
+			  unsigned int nr_pages, struct page **pages,
+			  pgoff_t *indices);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 2aa12b8..a2eeb43 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,6 +22,9 @@ struct pagevec {
 
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices);
+void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 30aa0dc..deb4960 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -49,6 +49,7 @@ extern struct file *shmem_file_setup(const char *name,
 					loff_t size, unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
 extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern bool shmem_mapping(struct address_space *mapping);
 extern void shmem_unlock_mapping(struct address_space *mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 					pgoff_t index, gfp_t gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index ccd5af8..df9a1db 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -429,6 +429,24 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
+static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
+			     struct page *page)
+{
+	void **slot;
+
+	slot = radix_tree_lookup_slot(&mapping->page_tree, offset);
+	if (slot) {
+		void *p;
+
+		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+		if (!radix_tree_exceptional_entry(p))
+			return -EEXIST;
+		radix_tree_replace_slot(slot, page);
+		return 0;
+	}
+	return radix_tree_insert(&mapping->page_tree, offset, page);
+}
+
 /**
  * add_to_page_cache_locked - add a locked page to the pagecache
  * @page:	page to add
@@ -459,7 +477,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		page->index = offset;
 
 		spin_lock_irq(&mapping->tree_lock);
-		error = radix_tree_insert(&mapping->page_tree, offset, page);
+		error = page_cache_insert(mapping, offset, page);
 		if (likely(!error)) {
 			mapping->nrpages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
@@ -692,7 +710,10 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index++;
 		if (index == 0)
@@ -730,7 +751,10 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index--;
 		if (index == ULONG_MAX)
@@ -741,15 +765,7 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 }
 EXPORT_SYMBOL(page_cache_prev_hole);
 
-/**
- * find_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Is there a pagecache struct page at the given (mapping, offset) tuple?
- * If yes, increment its refcount and return it; if no, return NULL.
- */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
 {
 	void **pagep;
 	struct page *page;
@@ -790,24 +806,30 @@ out:
 
 	return page;
 }
-EXPORT_SYMBOL(find_get_page);
 
 /**
- * find_lock_page - locate, pin and lock a pagecache page
+ * find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
  *
- * Locates the desired pagecache page, locks it, increments its reference
- * count and returns its address.
- *
- * Returns zero if the page was not present. find_lock_page() may sleep.
+ * Is there a pagecache struct page at the given (mapping, offset) tuple?
+ * If yes, increment its refcount and return it; if no, return NULL.
  */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
 {
-	struct page *page;
+	struct page *page = __find_get_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
+EXPORT_SYMBOL(find_get_page);
 
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page;
 repeat:
-	page = find_get_page(mapping, offset);
+	page = __find_get_page(mapping, offset);
 	if (page && !radix_tree_exception(page)) {
 		lock_page(page);
 		/* Has the page been truncated? */
@@ -820,6 +842,25 @@ repeat:
 	}
 	return page;
 }
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Locates the desired pagecache page, locks it, increments its reference
+ * count and returns its address.
+ *
+ * Returns zero if the page was not present. find_lock_page() may sleep.
+ */
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page = __find_lock_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
 EXPORT_SYMBOL(find_lock_page);
 
 /**
@@ -869,6 +910,53 @@ repeat:
 }
 EXPORT_SYMBOL(find_or_create_page);
 
+unsigned __find_get_pages(struct address_space *mapping,
+			  pgoff_t start, unsigned int nr_pages,
+			  struct page **pages, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_pages)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+			/*
+			 * Otherwise, we must be storing a swap entry
+			 * here as an exceptional entry: so return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		pages[ret] = page;
+		if (++ret == nr_pages)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 /**
  * find_get_pages - gang pagecache lookup
  * @mapping:	The address_space to search
diff --git a/mm/mincore.c b/mm/mincore.c
index da2be56..ad411ec 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -70,13 +70,21 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 	 * any other file mapping (ie. marked !present and faulted in with
 	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
 	 */
-	page = find_get_page(mapping, pgoff);
 #ifdef CONFIG_SWAP
-	/* shmem/tmpfs may return swap: account for swapcache page too. */
-	if (radix_tree_exceptional_entry(page)) {
-		swp_entry_t swap = radix_to_swp_entry(page);
-		page = find_get_page(swap_address_space(swap), swap.val);
-	}
+	if (shmem_mapping(mapping)) {
+		page = __find_get_page(mapping, pgoff);
+		/*
+		 * shmem/tmpfs may return swap: account for swapcache
+		 * page too.
+		 */
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swp = radix_to_swp_entry(page);
+			page = find_get_page(swap_address_space(swp), swp.val);
+		}
+	} else
+		page = find_get_page(mapping, pgoff);
+#else
+	page = find_get_page(mapping, pgoff);
 #endif
 	if (page) {
 		present = PageUptodate(page);
diff --git a/mm/readahead.c b/mm/readahead.c
index 187cbea..29efd45 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -179,7 +179,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
 		rcu_read_unlock();
-		if (page)
+		if (page && !radix_tree_exceptional_entry(page))
 			continue;
 
 		page = page_cache_alloc_readahead(mapping);
diff --git a/mm/shmem.c b/mm/shmem.c
index f6f5e4c..9bb4a7f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -327,56 +327,6 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 }
 
 /*
- * Like find_get_pages, but collecting swap entries as well as pages.
- */
-static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
-					pgoff_t start, unsigned int nr_pages,
-					struct page **pages, pgoff_t *indices)
-{
-	void **slot;
-	unsigned int ret = 0;
-	struct radix_tree_iter iter;
-
-	if (!nr_pages)
-		return 0;
-
-	rcu_read_lock();
-restart:
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
-		struct page *page;
-repeat:
-		page = radix_tree_deref_slot(slot);
-		if (unlikely(!page))
-			continue;
-		if (radix_tree_exception(page)) {
-			if (radix_tree_deref_retry(page))
-				goto restart;
-			/*
-			 * Otherwise, we must be storing a swap entry
-			 * here as an exceptional entry: so return it
-			 * without attempting to raise page count.
-			 */
-			goto export;
-		}
-		if (!page_cache_get_speculative(page))
-			goto repeat;
-
-		/* Has the page moved? */
-		if (unlikely(page != *slot)) {
-			page_cache_release(page);
-			goto repeat;
-		}
-export:
-		indices[ret] = iter.index;
-		pages[ret] = page;
-		if (++ret == nr_pages)
-			break;
-	}
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
  * Remove swap entry from radix tree, free the swap and its page cache.
  */
 static int shmem_free_swap(struct address_space *mapping,
@@ -394,21 +344,6 @@ static int shmem_free_swap(struct address_space *mapping,
 }
 
 /*
- * Pagevec may contain swap entries, so shuffle up pages before releasing.
- */
-static void shmem_deswap_pagevec(struct pagevec *pvec)
-{
-	int i, j;
-
-	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		if (!radix_tree_exceptional_entry(page))
-			pvec->pages[j++] = page;
-	}
-	pvec->nr = j;
-}
-
-/*
  * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
@@ -426,12 +361,12 @@ void shmem_unlock_mapping(struct address_space *mapping)
 		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
 		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
 		 */
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					PAGEVEC_SIZE, pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		index = indices[pvec.nr - 1] + 1;
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		check_move_unevictable_pages(pvec.pages, pvec.nr);
 		pagevec_release(&pvec);
 		cond_resched();
@@ -463,9 +398,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index < end) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
-				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+		pvec.nr = __find_get_pages(mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		mem_cgroup_uncharge_start();
@@ -494,7 +429,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -532,9 +467,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+
+		pvec.nr = __find_get_pages(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+				pvec.pages, indices);
 		if (!pvec.nr) {
 			if (index == start || unfalloc)
 				break;
@@ -542,7 +478,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			continue;
 		}
 		if ((index == start || unfalloc) && indices[0] >= end) {
-			shmem_deswap_pagevec(&pvec);
+			pagevec_remove_exceptionals(&pvec);
 			pagevec_release(&pvec);
 			break;
 		}
@@ -571,7 +507,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -1079,7 +1015,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return -EFBIG;
 repeat:
 	swap.val = 0;
-	page = find_lock_page(mapping, index);
+	page = __find_lock_page(mapping, index);
 	if (radix_tree_exceptional_entry(page)) {
 		swap = radix_to_swp_entry(page);
 		page = NULL;
@@ -1416,6 +1352,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	return inode;
 }
 
+bool shmem_mapping(struct address_space *mapping)
+{
+	return mapping->backing_dev_info == &shmem_backing_dev_info;
+}
+
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
 static const struct inode_operations shmem_short_symlink_operations;
@@ -1728,7 +1669,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 	pagevec_init(&pvec, 0);
 	pvec.nr = 1;		/* start small: we may be there already */
 	while (!done) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					pvec.nr, pvec.pages, indices);
 		if (!pvec.nr) {
 			if (whence == SEEK_DATA)
@@ -1755,7 +1696,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 				break;
 			}
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		pvec.nr = PAGEVEC_SIZE;
 		cond_resched();
diff --git a/mm/swap.c b/mm/swap.c
index 8a529a0..37bfe2d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -816,6 +816,26 @@ void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
 }
 EXPORT_SYMBOL(__pagevec_lru_add);
 
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices)
+{
+	pvec->nr = __find_get_pages(mapping, start, nr_pages,
+				    pvec->pages, indices);
+	return pagevec_count(pvec);
+}
+
+void pagevec_remove_exceptionals(struct pagevec *pvec)
+{
+	int i, j;
+
+	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		if (!radix_tree_exceptional_entry(page))
+			pvec->pages[j++] = page;
+	}
+	pvec->nr = j;
+}
+
 /**
  * pagevec_lookup - gang pagecache lookup
  * @pvec:	Where the resulting pages are placed
diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..d6ec30c 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -22,6 +22,22 @@
 #include <linux/cleancache.h>
 #include "internal.h"
 
+static void clear_exceptional_entry(struct address_space *mapping,
+				    pgoff_t index, struct page *page)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	radix_tree_delete_item(&mapping->page_tree, index, page);
+	spin_unlock_irq(&mapping->tree_lock);
+}
 
 /**
  * do_invalidatepage - invalidate part or all of a page
@@ -206,31 +222,36 @@ void truncate_inode_pages_range(struct address_space *mapping,
 {
 	const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
 	const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index;
 	pgoff_t end;
 	int i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0)
-		return;
 
 	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
 	end = (lend >> PAGE_CACHE_SHIFT);
 
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -241,6 +262,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -260,14 +282,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		if (!__pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 			if (index == start)
 				break;
 			index = start;
 			continue;
 		}
-		if (index == start && pvec.pages[0]->index > end) {
+		if (index == start && indices[0] > end) {
 			pagevec_release(&pvec);
 			break;
 		}
@@ -276,16 +299,22 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -328,6 +357,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index = start;
 	unsigned long ret;
@@ -343,17 +373,23 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	 */
 
 	pagevec_init(&pvec, 0);
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -367,6 +403,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				deactivate_page(page);
 			count += ret;
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -434,6 +471,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index;
 	int i;
@@ -444,17 +482,23 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	cleancache_invalidate_inode(mapping);
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			if (page->mapping != mapping) {
@@ -492,6 +536,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				ret = ret2;
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [patch 06/10] mm + fs: store shadow entries in page cache
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:04   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:04 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

Reclaim will be leaving shadow entries in the page cache radix tree
upon evicting the real page.  Because those pages are found via the
LRU rather than through the inode, an iput() can lead to the inode
being freed concurrently.  At that point, reclaim must no longer
install shadow entries because the inode freeing code needs to ensure
the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code
sets under the tree lock before doing the final truncate.  Reclaim
will check for this flag before installing shadow entries.
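
To illustrate how the flag is meant to be consumed, here is a minimal
sketch, not part of this patch, of a reclaim-side eviction deciding
whether to leave a shadow entry behind; make_shadow_entry() is a
hypothetical stand-in for the shadow packing added later in the series:

	static void example_evict_to_shadow(struct address_space *mapping,
					    struct page *page)
	{
		void *shadow = NULL;

		spin_lock_irq(&mapping->tree_lock);
		/*
		 * The final truncate expects the tree to end up empty,
		 * so leave no entry behind once the inode is exiting.
		 */
		if (!mapping_exiting(mapping))
			shadow = make_shadow_entry(page);	/* hypothetical */
		__delete_from_page_cache(page, shadow);
		spin_unlock_irq(&mapping->tree_lock);
	}

The check runs under tree_lock, the same lock the freeing code holds
when setting AS_EXITING, so the two paths cannot race.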

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c              |  7 ++++++-
 fs/nilfs2/inode.c       |  4 ++--
 include/linux/fs.h      |  1 +
 include/linux/pagemap.h | 13 ++++++++++++-
 mm/filemap.c            | 16 ++++++++++++----
 mm/truncate.c           |  5 +++--
 mm/vmscan.c             |  2 +-
 7 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index a898b3d..3bd7916 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -509,6 +509,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
+	BUG_ON(inode->i_data.nrshadows);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
@@ -551,10 +552,14 @@ static void evict(struct inode *inode)
 	 */
 	inode_wait_for_writeback(inode);
 
+	spin_lock_irq(&inode->i_data.tree_lock);
+	mapping_set_exiting(&inode->i_data);
+	spin_unlock_irq(&inode->i_data.tree_lock);
+
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 	}
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 6b49f14..fbc3f00 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -747,7 +747,7 @@ void nilfs_evict_inode(struct inode *inode)
 	int ret;
 
 	if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 		nilfs_clear_inode(inode);
@@ -755,7 +755,7 @@ void nilfs_evict_inode(struct inode *inode)
 	}
 	nilfs_transaction_begin(sb, &ti, 0); /* never fails */
 
-	if (inode->i_data.nrpages)
+	if (inode->i_data.nrpages || inode->i_data.nrshadows)
 		truncate_inode_pages(&inode->i_data, 0);
 
 	/* TODO: some of the following operations may fail.  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..5bf1d99 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -413,6 +413,7 @@ struct address_space {
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
+	unsigned long		nrshadows;	/* number of shadow entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a972341..258eb38 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_EXITING	= __GFP_BITS_SHIFT + 5, /* inode is being evicted */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -69,6 +70,16 @@ static inline int mapping_balloon(struct address_space *mapping)
 	return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
 }
 
+static inline void mapping_set_exiting(struct address_space *mapping)
+{
+	set_bit(AS_EXITING, &mapping->flags);
+}
+
+static inline int mapping_exiting(struct address_space *mapping)
+{
+	return test_bit(AS_EXITING, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
@@ -547,7 +558,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page);
+extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index df9a1db..dd0835e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -109,7 +109,7 @@
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the mapping's tree_lock.
  */
-void __delete_from_page_cache(struct page *page)
+void __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
@@ -123,7 +123,14 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (shadow) {
+		void **slot;
+
+		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+		radix_tree_replace_slot(slot, shadow);
+		mapping->nrshadows++;
+	} else
+		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
 	mapping->nrpages--;
@@ -162,7 +169,7 @@ void delete_from_page_cache(struct page *page)
 
 	freepage = mapping->a_ops->freepage;
 	spin_lock_irq(&mapping->tree_lock);
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
@@ -409,7 +416,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->index = offset;
 
 		spin_lock_irq(&mapping->tree_lock);
-		__delete_from_page_cache(old);
+		__delete_from_page_cache(old, NULL);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
 		BUG_ON(error);
 		mapping->nrpages++;
@@ -442,6 +449,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
+		mapping->nrshadows--;
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index d6ec30c..c1a5147 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -35,7 +35,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	radix_tree_delete_item(&mapping->page_tree, index, page);
+	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
+		mapping->nrshadows--;
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
@@ -434,7 +435,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 669fba3..ff0d92f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -498,7 +498,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page);
+		__delete_from_page_cache(page, NULL);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [patch 07/10] mm + fs: provide refault distance to page cache allocations
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:04   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:04 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

In order to make informed placement and reclaim decisions, the page
allocator requires the eviction information of refaulting pages, in
the form of a refault distance.

Every site that does a find_or_create()-style allocation is converted
to derive this distance from the page cache lookup and to pass it to
the page_cache_alloc() family of functions, which in turn pass it down
to the page allocator.
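
The per-site conversion follows a common pattern, roughly the sketch
below; example_grab_page() is an illustrative name rather than a
function added here, and workingset_refault_distance() is still the
stub defined in swap.h at this point in the series:

	static struct page *example_grab_page(struct address_space *mapping,
					      pgoff_t index)
	{
		struct page *page;

		page = __find_lock_page(mapping, index);
		if (!page || radix_tree_exceptional_entry(page)) {
			unsigned long distance = workingset_refault_distance(page);

			/* Hand the eviction information down to the allocator. */
			page = page_cache_alloc(mapping, distance);
			/* Real call sites then add_to_page_cache_lru() and retry. */
		}
		return page;
	}

Callers that must not sleep use __find_get_page() with trylock_page()
instead, as in the grab_cache_page_nowait() hunk below.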

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c  |  7 +++--
 fs/cachefiles/rdwr.c    | 25 ++++++++++-------
 fs/ceph/xattr.c         |  2 +-
 fs/logfs/readwrite.c    |  9 ++++--
 fs/ntfs/file.c          | 10 +++++--
 fs/splice.c             |  9 +++---
 include/linux/gfp.h     | 18 +++++++-----
 include/linux/pagemap.h | 26 +++++++++++------
 include/linux/swap.h    |  6 ++++
 mm/filemap.c            | 74 ++++++++++++++++++++++++++++++-------------------
 mm/mempolicy.c          | 17 +++++++-----
 mm/page_alloc.c         | 51 +++++++++++++++++++---------------
 mm/readahead.c          |  6 ++--
 net/ceph/pagelist.c     |  4 +--
 net/ceph/pagevec.c      |  2 +-
 15 files changed, 163 insertions(+), 103 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 4a80f6b..9c83b84 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -464,6 +464,8 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 	end_index = (i_size_read(inode) - 1) >> PAGE_CACHE_SHIFT;
 
 	while (last_offset < compressed_end) {
+		unsigned long distance;
+
 		pg_index = last_offset >> PAGE_CACHE_SHIFT;
 
 		if (pg_index > end_index)
@@ -478,12 +480,11 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 				break;
 			goto next;
 		}
-
+		distance = workingset_refault_distance(page);
 		page = __page_cache_alloc(mapping_gfp_mask(mapping) &
-								~__GFP_FS);
+					  ~__GFP_FS, distance);
 		if (!page)
 			break;
-
 		if (add_to_page_cache_lru(page, mapping, pg_index,
 								GFP_NOFS)) {
 			page_cache_release(page);
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 4809922..3d4a75a 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -12,6 +12,7 @@
 #include <linux/mount.h>
 #include <linux/slab.h>
 #include <linux/file.h>
+#include <linux/swap.h>
 #include "internal.h"
 
 /*
@@ -256,17 +257,19 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
 	newpage = NULL;
 
 	for (;;) {
-		backpage = find_get_page(bmapping, netpage->index);
-		if (backpage)
-			goto backing_page_already_present;
+		unsigned long distance;
 
+		backpage = __find_get_page(bmapping, netpage->index);
+		if (backpage && !radix_tree_exceptional_entry(backpage))
+			goto backing_page_already_present;
+		distance = workingset_refault_distance(backpage);
 		if (!newpage) {
 			newpage = __page_cache_alloc(cachefiles_gfp |
-						     __GFP_COLD);
+						     __GFP_COLD,
+						     distance);
 			if (!newpage)
 				goto nomem_monitor;
 		}
-
 		ret = add_to_page_cache(newpage, bmapping,
 					netpage->index, cachefiles_gfp);
 		if (ret == 0)
@@ -507,17 +510,19 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 		}
 
 		for (;;) {
-			backpage = find_get_page(bmapping, netpage->index);
-			if (backpage)
-				goto backing_page_already_present;
+			unsigned long distance;
 
+			backpage = __find_get_page(bmapping, netpage->index);
+			if (backpage && !radix_tree_exceptional_entry(backpage))
+				goto backing_page_already_present;
+			distance = workingset_refault_distance(backpage);
 			if (!newpage) {
 				newpage = __page_cache_alloc(cachefiles_gfp |
-							     __GFP_COLD);
+							     __GFP_COLD,
+							     distance);
 				if (!newpage)
 					goto nomem;
 			}
-
 			ret = add_to_page_cache(newpage, bmapping,
 						netpage->index, cachefiles_gfp);
 			if (ret == 0)
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 9b6b2b6..d52c9f0 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -815,7 +815,7 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 			return -ENOMEM;
 		err = -ENOMEM;
 		for (i = 0; i < nr_pages; i++) {
-			pages[i] = __page_cache_alloc(GFP_NOFS);
+			pages[i] = __page_cache_alloc(GFP_NOFS, 0);
 			if (!pages[i]) {
 				nr_pages = i;
 				goto out;
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 9a59cba..0c4535d 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -19,6 +19,7 @@
 #include "logfs.h"
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/swap.h>
 
 static u64 adjust_bix(u64 bix, level_t level)
 {
@@ -316,9 +317,11 @@ static struct page *logfs_get_write_page(struct inode *inode, u64 bix,
 	int err;
 
 repeat:
-	page = find_get_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(GFP_NOFS);
+	page = __find_get_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		unsigned long distance = workingset_refault_distance(page);
+
+		page = __page_cache_alloc(GFP_NOFS, distance);
 		if (!page)
 			return NULL;
 		err = add_to_page_cache_lru(page, mapping, index, GFP_NOFS);
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 5b2d4f0..a8a4e07 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -412,10 +412,14 @@ static inline int __ntfs_grab_cache_pages(struct address_space *mapping,
 	BUG_ON(!nr_pages);
 	err = nr = 0;
 	do {
-		pages[nr] = find_lock_page(mapping, index);
-		if (!pages[nr]) {
+		pages[nr] = __find_lock_page(mapping, index);
+		if (!pages[nr] || radix_tree_exceptional_entry(pages[nr])) {
+			unsigned long distance;
+
+			distance = workingset_refault_distance(pages[nr]);
 			if (!*cached_page) {
-				*cached_page = page_cache_alloc(mapping);
+				*cached_page = page_cache_alloc(mapping,
+								distance);
 				if (unlikely(!*cached_page)) {
 					err = -ENOMEM;
 					goto err_out;
diff --git a/fs/splice.c b/fs/splice.c
index 29e394e..e60ddfc 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -352,15 +352,16 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		 * Page could be there, find_get_pages_contig() breaks on
 		 * the first hole.
 		 */
-		page = find_get_page(mapping, index);
-		if (!page) {
+		page = __find_get_page(mapping, index);
+		if (!page || radix_tree_exceptional_entry(page)) {
+			unsigned long distance;
 			/*
 			 * page didn't exist, allocate one.
 			 */
-			page = page_cache_alloc_cold(mapping);
+			distance = workingset_refault_distance(page);
+			page = page_cache_alloc_cold(mapping, distance);
 			if (!page)
 				break;
-
 			error = add_to_page_cache_lru(page, mapping, index,
 						GFP_KERNEL);
 			if (unlikely(error)) {
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0f615eb..caf8d34 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -298,13 +298,16 @@ static inline void arch_alloc_page(struct page *page, int order) { }
 
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-		       struct zonelist *zonelist, nodemask_t *nodemask);
+		       struct zonelist *zonelist, nodemask_t *nodemask,
+		       unsigned long refault_distance);
 
 static inline struct page *
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+	      struct zonelist *zonelist, unsigned long refault_distance)
 {
-	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
+	return __alloc_pages_nodemask(gfp_mask, order,
+				      zonelist, NULL,
+				      refault_distance);
 }
 
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
@@ -314,7 +317,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	if (nid < 0)
 		nid = numa_node_id();
 
-	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask), 0);
 }
 
 static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
@@ -322,16 +325,17 @@ static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
 
-	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask), 0);
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
+extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order,
+					unsigned long refault_distance);
 
 static inline struct page *
 alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
-	return alloc_pages_current(gfp_mask, order);
+	return alloc_pages_current(gfp_mask, order, 0);
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 258eb38..d758243 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -228,28 +228,36 @@ static inline void page_unfreeze_refs(struct page *page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp,
+				       unsigned long refault_distance);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp,
+					      unsigned long refault_distance)
 {
-	return alloc_pages(gfp, 0);
+	return __alloc_pages(gfp, 0, node_zonelist(numa_node_id(), gfp),
+			     refault_distance);
 }
 #endif
 
-static inline struct page *page_cache_alloc(struct address_space *x)
+static inline struct page *page_cache_alloc(struct address_space *x,
+					    unsigned long refault_distance)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x), refault_distance);
 }
 
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *x,
+						 unsigned long refault_distance)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD,
+				  refault_distance);
 }
 
-static inline struct page *page_cache_alloc_readahead(struct address_space *x)
+static inline struct page *page_cache_alloc_readahead(struct address_space *x,
+						      unsigned long refault_distance)
 {
 	return __page_cache_alloc(mapping_gfp_mask(x) |
-				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
+				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN,
+				  refault_distance);
 }
 
 typedef int filler_t(void *, struct page *);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2818a12..ffa323a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -221,6 +221,12 @@ struct swap_list_t {
 	int next;	/* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+static inline unsigned long workingset_refault_distance(struct page *page)
+{
+	return ~0UL;
+}
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index dd0835e..10f8a62 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -518,7 +518,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, unsigned long refault_distance)
 {
 	int n;
 	struct page *page;
@@ -528,12 +528,12 @@ struct page *__page_cache_alloc(gfp_t gfp)
 		do {
 			cpuset_mems_cookie = get_mems_allowed();
 			n = cpuset_mem_spread_node();
-			page = alloc_pages_exact_node(n, gfp, 0);
+			page = __alloc_pages(gfp, 0, node_zonelist(n, gfp),
+					     refault_distance);
 		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
-
-		return page;
-	}
-	return alloc_pages(gfp, 0);
+	} else
+		page = alloc_pages_current(gfp, 0, refault_distance);
+	return page;
 }
 EXPORT_SYMBOL(__page_cache_alloc);
 #endif
@@ -894,9 +894,11 @@ struct page *find_or_create_page(struct address_space *mapping,
 	struct page *page;
 	int err;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+	page = __find_lock_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		unsigned long distance = workingset_refault_distance(page);
+
+		page = __page_cache_alloc(gfp_mask, distance);
 		if (!page)
 			return NULL;
 		/*
@@ -1199,16 +1201,21 @@ EXPORT_SYMBOL(find_get_pages_tag);
 struct page *
 grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 {
-	struct page *page = find_get_page(mapping, index);
+	struct page *page = __find_get_page(mapping, index);
+	unsigned long distance;
 
-	if (page) {
+	if (page && !radix_tree_exceptional_entry(page)) {
 		if (trylock_page(page))
 			return page;
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
+	distance = workingset_refault_distance(page);
+	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+				  distance);
+	if (!page)
+		return NULL;
+	if (add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
 	}
@@ -1270,6 +1277,7 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
 	offset = *ppos & ~PAGE_CACHE_MASK;
 
 	for (;;) {
+		unsigned long distance;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -1282,8 +1290,9 @@ find_page:
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
-			page = find_get_page(mapping, index);
-			if (unlikely(page == NULL))
+			page = __find_get_page(mapping, index);
+			if (unlikely(!page ||
+				     radix_tree_exceptional_entry(page)))
 				goto no_cached_page;
 		}
 		if (PageReadahead(page)) {
@@ -1441,7 +1450,8 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		page = page_cache_alloc_cold(mapping);
+		distance = workingset_refault_distance(page);
+		page = page_cache_alloc_cold(mapping, distance);
 		if (!page) {
 			desc->error = -ENOMEM;
 			goto out;
@@ -1650,21 +1660,22 @@ EXPORT_SYMBOL(generic_file_aio_read);
  * page_cache_read - adds requested page to the page cache if not already there
  * @file:	file to read
  * @offset:	page index
+ * @distance:	refault distance
  *
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset,
+			   unsigned long distance)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page; 
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, distance);
 		if (!page)
 			return -ENOMEM;
-
 		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
@@ -1767,6 +1778,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
 	pgoff_t offset = vmf->pgoff;
+	unsigned long distance;
 	struct page *page;
 	pgoff_t size;
 	int ret = 0;
@@ -1792,8 +1804,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 retry_find:
-		page = find_get_page(mapping, offset);
-		if (!page)
+		page = __find_get_page(mapping, offset);
+		if (!page || radix_tree_exceptional_entry(page))
 			goto no_cached_page;
 	}
 
@@ -1836,7 +1848,8 @@ no_cached_page:
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	distance = workingset_refault_distance(page);
+	error = page_cache_read(file, offset, distance);
 
 	/*
 	 * The page we want has now been added to the page cache.
@@ -1958,9 +1971,11 @@ static struct page *__read_cache_page(struct address_space *mapping,
 	struct page *page;
 	int err;
 repeat:
-	page = find_get_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp | __GFP_COLD);
+	page = __find_get_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		unsigned long distance = workingset_refault_distance(page);
+
+		page = __page_cache_alloc(gfp | __GFP_COLD, distance);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
 		err = add_to_page_cache_lru(page, mapping, index, gfp);
@@ -2424,6 +2439,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	gfp_t gfp_mask;
 	struct page *page;
 	gfp_t gfp_notmask = 0;
+	unsigned long distance;
 
 	gfp_mask = mapping_gfp_mask(mapping);
 	if (mapping_cap_account_dirty(mapping))
@@ -2431,11 +2447,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (page)
+	page = __find_lock_page(mapping, index);
+	if (page && !radix_tree_exceptional_entry(page))
 		goto found;
-
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	distance = workingset_refault_distance(page);
+	page = __page_cache_alloc(gfp_mask & ~gfp_notmask, distance);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..69f57b8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1944,13 +1944,14 @@ out:
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
-					unsigned nid)
+					  unsigned nid,
+					  unsigned long refault_distance)
 {
 	struct zonelist *zl;
 	struct page *page;
 
 	zl = node_zonelist(nid, gfp);
-	page = __alloc_pages(gfp, order, zl);
+	page = __alloc_pages(gfp, order, zl, refault_distance);
 	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
 	return page;
@@ -1996,7 +1997,7 @@ retry_cpuset:
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
-		page = alloc_page_interleave(gfp, order, nid);
+		page = alloc_page_interleave(gfp, order, nid, 0);
 		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 			goto retry_cpuset;
 
@@ -2004,7 +2005,7 @@ retry_cpuset:
 	}
 	page = __alloc_pages_nodemask(gfp, order,
 				      policy_zonelist(gfp, pol, node),
-				      policy_nodemask(gfp, pol));
+				      policy_nodemask(gfp, pol), 0);
 	if (unlikely(mpol_needs_cond_ref(pol)))
 		__mpol_put(pol);
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
@@ -2031,7 +2032,8 @@ retry_cpuset:
  *	1) it's ok to take cpuset_sem (can WAIT), and
  *	2) allocating for current task (not interrupt).
  */
-struct page *alloc_pages_current(gfp_t gfp, unsigned order)
+struct page *alloc_pages_current(gfp_t gfp, unsigned order,
+				 unsigned long refault_distance)
 {
 	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
@@ -2048,11 +2050,12 @@ retry_cpuset:
 	 * nor system default_policy
 	 */
 	if (pol->mode == MPOL_INTERLEAVE)
-		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
+		page = alloc_page_interleave(gfp, order, interleave_nodes(pol),
+					     refault_distance);
 	else
 		page = __alloc_pages_nodemask(gfp, order,
 				policy_zonelist(gfp, pol, numa_node_id()),
-				policy_nodemask(gfp, pol));
+				policy_nodemask(gfp, pol), refault_distance);
 
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a64d786..92b4c01 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1842,7 +1842,8 @@ static inline void init_zone_allows_reclaim(int nid)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int migratetype)
+		struct zone *preferred_zone, int migratetype,
+		unsigned long refault_distance)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -2105,7 +2106,7 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, unsigned long refault_distance)
 {
 	struct page *page;
 
@@ -2123,7 +2124,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, migratetype);
+		preferred_zone, migratetype, refault_distance);
 	if (page)
 		goto out;
 
@@ -2158,7 +2159,7 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, bool sync_migration,
+	int migratetype, unsigned long refault_distance, bool sync_migration,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
@@ -2186,7 +2187,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, migratetype);
+				preferred_zone, migratetype, refault_distance);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
 			preferred_zone->compact_considered = 0;
@@ -2221,7 +2222,7 @@ static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, bool sync_migration,
+	int migratetype, unsigned long refault_distance, bool sync_migration,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
@@ -2262,7 +2263,8 @@ static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress)
+	int migratetype, unsigned long refault_distance,
+	unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	bool drained = false;
@@ -2278,9 +2280,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
-					zonelist, high_zoneidx,
-					alloc_flags & ~ALLOC_NO_WATERMARKS,
-					preferred_zone, migratetype);
+				zonelist, high_zoneidx,
+				alloc_flags & ~ALLOC_NO_WATERMARKS,
+				preferred_zone, migratetype, refault_distance);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -2303,14 +2305,14 @@ static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, unsigned long refault_distance)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype, refault_distance);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2391,7 +2393,7 @@ static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, unsigned long refault_distance)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -2449,7 +2451,7 @@ rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype, refault_distance);
 	if (page)
 		goto got_pg;
 
@@ -2464,7 +2466,8 @@ rebalance:
 
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, migratetype,
+				refault_distance);
 		if (page) {
 			goto got_pg;
 		}
@@ -2490,7 +2493,8 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, sync_migration,
+					migratetype, refault_distance,
+					sync_migration,
 					&contended_compaction,
 					&deferred_compaction,
 					&did_some_progress);
@@ -2513,7 +2517,8 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress);
+					migratetype, refault_distance,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -2532,7 +2537,7 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					migratetype);
+					migratetype, refault_distance);
 			if (page)
 				goto got_pg;
 
@@ -2575,7 +2580,8 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, sync_migration,
+					migratetype, refault_distance,
+					sync_migration,
 					&contended_compaction,
 					&deferred_compaction,
 					&did_some_progress);
@@ -2598,7 +2604,8 @@ got_pg:
  */
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-			struct zonelist *zonelist, nodemask_t *nodemask)
+		       struct zonelist *zonelist, nodemask_t *nodemask,
+		       unsigned long refault_distance)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
@@ -2649,7 +2656,7 @@ retry_cpuset:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype, refault_distance);
 	if (unlikely(!page)) {
 		/*
 		 * Runtime PM, block IO and its error handling path
@@ -2659,7 +2666,7 @@ retry_cpuset:
 		gfp_mask = memalloc_noio_flags(gfp_mask);
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, migratetype, refault_distance);
 	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
diff --git a/mm/readahead.c b/mm/readahead.c
index 29efd45..1ff6104 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
 #include <linux/export.h>
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
@@ -172,6 +173,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	 */
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		pgoff_t page_offset = offset + page_idx;
+		unsigned long distance;
 
 		if (page_offset > end_index)
 			break;
@@ -181,8 +183,8 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		rcu_read_unlock();
 		if (page && !radix_tree_exceptional_entry(page))
 			continue;
-
-		page = page_cache_alloc_readahead(mapping);
+		distance = workingset_refault_distance(page);
+		page = page_cache_alloc_readahead(mapping, distance);
 		if (!page)
 			break;
 		page->index = page_offset;
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 92866be..fabdc16 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -32,7 +32,7 @@ static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
 	struct page *page;
 
 	if (!pl->num_pages_free) {
-		page = __page_cache_alloc(GFP_NOFS);
+		page = __page_cache_alloc(GFP_NOFS, 0);
 	} else {
 		page = list_first_entry(&pl->free_list, struct page, lru);
 		list_del(&page->lru);
@@ -83,7 +83,7 @@ int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space)
 	space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT;   /* conv to num pages */
 
 	while (space > pl->num_pages_free) {
-		struct page *page = __page_cache_alloc(GFP_NOFS);
+		struct page *page = __page_cache_alloc(GFP_NOFS, 0);
 		if (!page)
 			return -ENOMEM;
 		list_add_tail(&page->lru, &pl->free_list);
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 815a224..b1151f4 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -79,7 +79,7 @@ struct page **ceph_alloc_page_vector(int num_pages, gfp_t flags)
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < num_pages; i++) {
-		pages[i] = __page_cache_alloc(flags);
+		pages[i] = __page_cache_alloc(flags, 0);
 		if (pages[i] == NULL) {
 			ceph_release_page_vector(pages, i);
 			return ERR_PTR(-ENOMEM);
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread


* [patch 08/10] mm: make global_dirtyable_memory() available to other mm code
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:04   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:04 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

Subsequent patches need a rough estimate of memory available for page
cache.
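
No callers are converted in this patch; purely as an illustrative
sketch (the helper below is made up for this example, only
global_dirtyable_memory() itself is real), later code can now base
heuristics on the estimate along these lines:

	#include <linux/writeback.h>

	/*
	 * Hypothetical consumer: how many pages make up one percent
	 * of the memory that could potentially hold page cache?
	 */
	static unsigned long one_percent_of_cacheable(void)
	{
		return global_dirtyable_memory() / 100;
	}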

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/writeback.h | 1 +
 mm/page-writeback.c       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9a9367c..832f86b 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -148,6 +148,7 @@ struct ctl_table;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int,
 				      void __user *, size_t *, loff_t *);
 
+unsigned long global_dirtyable_memory(void);
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..5e302e6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -231,7 +231,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  * Returns the global number of pages potentially available for dirty
  * page cache.  This is the base value for the global dirty limits.
  */
-static unsigned long global_dirtyable_memory(void)
+unsigned long global_dirtyable_memory(void)
 {
 	unsigned long x;
 
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread


* [patch 09/10] mm: thrash detection-based file cache sizing
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:04   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:04 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room for the inactive list for a new set of
frequently used pages, "working set", to establish itself because the
young pages get pushed out of memory before having a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more gracetime in the face of temporary use once streams,
but was not satisfactory when use once streaming persisted over longer
periods of time and the established working set was temporarily
suspended, like a nightly backup evicting all the interactive user
program data.

Subsequently, the rules were changed to only age active pages when
they exceeded the amount of inactive pages, i.e. leave the working set
alone as long as the other half of memory is easy to reclaim use once
pages.  This works well until working set transitions exceed the size
of half of memory and the average access distance between the pages of
the new working set is bigger than the inactive list.  The VM will
mistake the thrashing new working set for use once streaming, while
the unused old working set pages are stuck on the active list.

This patch solves this problem by maintaining a history of recently
evicted file pages, which in turn allows the VM to tell used-once page
streams from thrashing file cache.

To accomplish this, a global counter is increased every time a page is
evicted and a snapshot of that counter is stored as shadow entry in
the page's now empty page cache radix tree slot.  Upon refault of that
page, the difference between the current value of that counter and the
shadow entry value is called the refault distance.  It tells how many
pages have been evicted since this page's eviction, which is how many
page slots are missing from the inactive list for this page to get
accessed twice while in memory.  If the number of missing slots is
less than or equal to the number of active pages, increasing the
inactive list at the cost of the active list would give this thrashing
set a chance to establish itself:

eviction counter = 4
                        evicted      inactive           active
 Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
  Shadow entries:         0 1 2 3
Refault distance:         4 3 2 1

When c is faulted back into memory, it is noted that two more page
slots on the inactive list could have prevented the refault.  Thus,
the active list needs to be challenged for those two page slots as it
is possible that c is used more frequently than l, m, n.  However, c
might also be used much less frequently than the active pages and so
a) pages can not be directly reclaimed from the tail of the active list
and b) refaulting pages can not be directly activated.  Instead,
active pages are moved from the tail of the active list to the head of
the inactive list and placed directly next to the refaulting pages.
This way, they both have the same time on the inactive list to prove
which page is actually used more frequently without incurring
unnecessary major faults or diluting the active page set in case the
previously active page is in fact the more frequently used one.
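
To make the arithmetic above concrete, here is a minimal userspace
sketch of the same bookkeeping (the names and the array are illustrative
only; the actual patch stores the counter snapshot in the vacated radix
tree slot and does the comparison per zone, see
workingset_refault_distance() and workingset_zone_balance() below):

#include <stdio.h>

/*
 * Toy model of the eviction clock described above.  An array indexed
 * by page name stands in for the shadow entries in the radix tree.
 */
static unsigned long eviction_counter;
static unsigned long shadow[26];

static void evict(char page)
{
	shadow[page - 'a'] = eviction_counter++;
}

static unsigned long refault_distance(char page)
{
	/* evictions since this page's own eviction == missing inactive slots */
	return eviction_counter - shadow[page - 'a'];
}

int main(void)
{
	unsigned long nr_active = 3;	/* l, m, n in the example above */
	char page;

	for (page = 'a'; page <= 'd'; page++)
		evict(page);

	for (page = 'a'; page <= 'd'; page++)
		printf("%c: refault distance %lu -> %s\n", page,
		       refault_distance(page),
		       refault_distance(page) <= nr_active ?
		       "challenge the active list" : "leave the lists alone");
	return 0;
}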

On multi-node systems, workloads with different page access
frequencies may execute concurrently on separate nodes.  On refault,
the page allocator walks the list of allowed zones to allocate a page
frame for the refaulting page.  For each zone, a local refault
distance is calculated that is proportional to the zone's recent share
of global evictions.  This local distance is then compared to the
local number of active pages, so the decision to rebalance the lists
is made on an individual per-zone basis.
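
As a rough sketch of that per-zone translation (function name and
sample numbers are made up; the patch derives the zone's share from a
decaying proportional counter via prop_fraction_percpu() and applies
free page and overflow checks first, see workingset_zone_balance()
below):

#include <stdio.h>

/*
 * Illustrative only: scale a global refault distance by the zone's
 * recent share of evictions and compare it to that zone's active
 * file pages.
 */
static int zone_wants_rebalance(unsigned long refault_distance,
				unsigned long zone_evictions,
				unsigned long global_evictions,
				unsigned long zone_active_file)
{
	unsigned long local_distance;

	if (!global_evictions)
		return 0;
	local_distance = refault_distance * zone_evictions / global_evictions;
	return local_distance <= zone_active_file;
}

int main(void)
{
	/*
	 * A zone responsible for a quarter of recent evictions, with
	 * 20000 active file pages: a global distance of 60000 scales
	 * down to 15000, which fits, so this zone rebalances its lists.
	 */
	printf("rebalance: %d\n", zone_wants_rebalance(60000, 1, 4, 20000));
	return 0;
}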

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |   6 ++
 include/linux/swap.h   |   8 +-
 mm/Makefile            |   2 +-
 mm/memcontrol.c        |   3 +
 mm/mmzone.c            |   1 +
 mm/page_alloc.c        |   9 +-
 mm/swap.c              |   2 +
 mm/vmscan.c            |  45 +++++++---
 mm/vmstat.c            |   3 +
 mm/workingset.c        | 233 +++++++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 293 insertions(+), 19 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 370a35f..505bd80 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -16,6 +16,7 @@
 #include <linux/nodemask.h>
 #include <linux/pageblock-flags.h>
 #include <linux/page-flags-layout.h>
+#include <linux/proportions.h>
 #include <linux/atomic.h>
 #include <asm/page.h>
 
@@ -141,6 +142,9 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	WORKINGSET_STALE,
+	WORKINGSET_BALANCE,
+	WORKINGSET_BALANCE_FORCE,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
@@ -202,6 +206,8 @@ struct zone_reclaim_stat {
 struct lruvec {
 	struct list_head lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat reclaim_stat;
+	struct prop_local_percpu evictions;
+	long shrink_active;
 #ifdef CONFIG_MEMCG
 	struct zone *zone;
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ffa323a..c3d5237 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -222,10 +222,10 @@ struct swap_list_t {
 };
 
 /* linux/mm/workingset.c */
-static inline unsigned long workingset_refault_distance(struct page *page)
-{
-	return ~0UL;
-}
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+unsigned long workingset_refault_distance(struct page *page);
+void workingset_zone_balance(struct zone *zone, unsigned long refault_distance);
+void workingset_activation(struct page *page);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..b5a7416 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o \
-			   interval_tree.o $(mmu-y)
+			   interval_tree.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2b55222..cc3026a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1221,6 +1221,9 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 		goto out;
 	}
 
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
 	mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
 	lruvec = &mz->lruvec;
 out:
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..3d71a01 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -95,6 +95,7 @@ void lruvec_init(struct lruvec *lruvec)
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
+	prop_local_init_percpu(&lruvec->evictions);
 }
 
 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 92b4c01..9fd11c3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1893,9 +1893,12 @@ zonelist_scan:
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if ((alloc_flags & ALLOC_WMARK_LOW) &&
-		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
-			goto this_zone_full;
+		if (alloc_flags & ALLOC_WMARK_LOW) {
+			if (refault_distance)
+				workingset_zone_balance(zone, refault_distance);
+			if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+				goto this_zone_full;
+		}
 
 		/*
 		 * XXX: Ensure similar zone aging speeds by
diff --git a/mm/swap.c b/mm/swap.c
index 37bfe2d..af394c1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -440,6 +440,8 @@ void mark_page_accessed(struct page *page)
 			PageReferenced(page) && PageLRU(page)) {
 		activate_page(page);
 		ClearPageReferenced(page);
+		if (page_is_file_cache(page))
+			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ff0d92f..ef5b73a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -449,7 +449,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+			    bool reclaimed)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -495,10 +496,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
+		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page, NULL);
+		if (reclaimed && page_is_file_cache(page))
+			shadow = workingset_eviction(mapping, page);
+		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
@@ -521,7 +525,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (__remove_mapping(mapping, page, false)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -903,7 +907,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
 
 		/*
@@ -1593,21 +1597,40 @@ static inline int inactive_anon_is_low(struct lruvec *lruvec)
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct lruvec *lruvec)
+static int inactive_file_is_low(struct lruvec *lruvec, unsigned long nr_to_scan)
 {
 	unsigned long inactive;
 	unsigned long active;
 
+	if (lruvec->shrink_active > 0) {
+		inc_zone_state(lruvec_zone(lruvec), WORKINGSET_BALANCE);
+		lruvec->shrink_active -= nr_to_scan;
+		return true;
+	}
+	/*
+	 * Make sure there is always a reasonable amount of inactive
+	 * file pages around to keep the zone reclaimable.
+	 *
+	 * We could do better than requiring half of memory, but we
+	 * need a big safety buffer until we are smarter about
+	 * dirty/writeback pages and file readahead windows.
+	 * Otherwise, we can end up with all pages on the inactive
+	 * list being dirty, or trash readahead pages before use.
+	 */
 	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
-
-	return active > inactive;
+	if (active > inactive) {
+		inc_zone_state(lruvec_zone(lruvec), WORKINGSET_BALANCE_FORCE);
+		return true;
+	}
+	return false;
 }
 
-static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
+static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru,
+				unsigned long nr_to_scan)
 {
 	if (is_file_lru(lru))
-		return inactive_file_is_low(lruvec);
+		return inactive_file_is_low(lruvec, nr_to_scan);
 	else
 		return inactive_anon_is_low(lruvec);
 }
@@ -1616,7 +1639,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, lru))
+		if (inactive_list_is_low(lruvec, lru, nr_to_scan))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -1727,7 +1750,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * There is enough inactive page cache, do not reclaim
 	 * anything from the anonymous working set right now.
 	 */
-	if (!inactive_file_is_low(lruvec)) {
+	if (!inactive_file_is_low(lruvec, 0)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e1d8ed1..17c19b0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -735,6 +735,9 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"workingset_stale",
+	"workingset_balance",
+	"workingset_balance_force",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 0000000..7986aa4
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,233 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ *		Double CLOCK lists
+ *
+ * Per zone, two clock lists are maintained for file pages: the
+ * inactive and the active list.  Freshly faulted pages start out at
+ * the head of the inactive list and page reclaim scans pages from the
+ * tail.  Pages that are accessed multiple times on the inactive list
+ * are promoted to the active list, to protect them from reclaim,
+ * whereas active pages are demoted to the inactive list when the
+ * inactive list requires more space to detect repeatedly accessed
+ * pages in the current workload and prevent them from thrashing:
+ *
+ *   fault -----------------------+
+ *                                |
+ *              +-------------+   |            +-------------+
+ *   reclaim <- | inactive    | <-+-- demotion | active      | <--+
+ *              +-------------+                +-------------+    |
+ *                       |                                        |
+ *                       +----------- promotion ------------------+
+ *
+ *
+ *		Access frequency and refault distance
+ *
+ * A workload is thrashing when the distances between the first and
+ * second access of pages that are frequently used are bigger than the
+ * current inactive clock list size, as the pages get reclaimed before
+ * the second access would have promoted them instead:
+ *
+ *    Access #: 1 2 3 4 5 6 7 8 9
+ *     Page ID: x y b c d e f x y
+ *                  | inactive  |
+ *
+ * To prevent this workload from thrashing, a bigger inactive list is
+ * required.  And the only way the inactive list can grow on a full
+ * zone is by taking away space from the corresponding active list.
+ *
+ *      +-inactive--+-active------+
+ *  x y | b c d e f | G H I J K L |
+ *      +-----------+-------------+
+ *
+ * Not every refault should lead to growing the inactive list at the
+ * cost of the active list, however: if the access distances are
+ * bigger than available memory overall, there is little point in
+ * challenging the protected pages on the active list, as those
+ * refaulting pages will not fit completely into memory.
+ *
+ * It is prohibitively expensive to track the access frequency of
+ * in-core pages, but it is possible to track their refault distance,
+ * which is the number of page slots shrunk from the inactive list
+ * between a page's eviction and subsequent refault.  This indicates
+ * how many page slots are missing on the inactive list in order to
+ * prevent future thrashing of that page.  Thus, instead of comparing
+ * access frequency to total available memory, one can compare the
+ * refault distance to the inactive list's potential for growth: the
+ * size of the active list.
+ *
+ *
+ *		Rebalancing the lists
+ *
+ * Shrinking the active list has to be done carefully because the
+ * pages on it may have vastly different access frequencies compared
+ * to the pages on the inactive list.  Thus, pages are not reclaimed
+ * directly from the tail of the active list, but instead moved to the
+ * head of the inactive list.  This way, they are competing directly
+ * with the pages that challenged their protected status.  If they are
+ * unused, they will eventually be reclaimed, but if they are indeed
+ * used more frequently than the challenging inactive pages, they will
+ * be reactivated.  This allows the existing protected set to be
+ * challenged without incurring major faults in case of a mistake.
+ */
+
+/*
+ * Monotonic workingset clock for non-resident pages.
+ *
+ * The refault distance of a page is the number of ticks that occurred
+ * between that page's eviction and subsequent refault.
+ *
+ * Every page slot that is taken away from the inactive list is one
+ * more slot the inactive list would have to grow again in order to
+ * hold the current non-resident pages in memory as well.
+ *
+ * As the refault distance needs to reflect the space missing on the
+ * inactive list, the workingset time is advanced every time the
+ * inactive list is shrunk.  This means eviction, but also activation.
+ */
+static atomic_long_t workingset_time;
+
+/*
+ * Workingset clock snapshots are stored in the page cache radix tree
+ * as exceptional entries (shadows).
+ */
+#define EV_SHIFT	RADIX_TREE_EXCEPTIONAL_SHIFT
+#define EV_MASK		(~0UL >> EV_SHIFT)
+
+/*
+ * Per-zone proportional eviction counter to keep track of recent zone
+ * eviction speed and be able to calculate per-zone refault distances.
+ */
+static struct prop_descriptor global_evictions;
+
+void *workingset_eviction(struct address_space *mapping, struct page *page)
+{
+	struct lruvec *lruvec;
+	unsigned long time;
+
+	time = atomic_long_inc_return(&workingset_time);
+
+	lruvec = mem_cgroup_zone_lruvec(page_zone(page), NULL);
+	prop_inc_percpu(&global_evictions, &lruvec->evictions);
+
+	/*
+	 * Don't store shadows in an inode that is being reclaimed.
+	 * This is not just an optimization: inode reclaim needs to
+	 * empty out the radix tree or the nodes are lost, so don't
+	 * plant shadows behind its back.
+	 */
+	if (mapping_exiting(mapping))
+		return NULL;
+
+	return (void *)((time << EV_SHIFT) | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+unsigned long workingset_refault_distance(struct page *page)
+{
+	unsigned long time_of_eviction;
+	unsigned long now;
+
+	if (!page)
+		return ~0UL;
+
+	BUG_ON(!radix_tree_exceptional_entry(page));
+	time_of_eviction = (unsigned long)page >> EV_SHIFT;
+	now = atomic_long_read(&workingset_time);
+	return (now - time_of_eviction) & EV_MASK;
+}
+EXPORT_SYMBOL(workingset_refault_distance);
+
+void workingset_zone_balance(struct zone *zone, unsigned long refault_distance)
+{
+	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, NULL);
+	unsigned long zone_active;
+	unsigned long zone_free;
+	unsigned long missing;
+	long denominator;
+	long numerator;
+
+	/*
+	 * The dirty balance reserve is a generous estimation of the
+	 * zone's memory reserve that is not available to page cache.
+	 * If the zone has more free pages than that, it means that
+	 * there are pages ready to allocate without reclaiming from
+	 * the zone at all, let alone putting pressure on its active
+	 * pages.
+	 */
+	zone_free = zone_page_state(zone, NR_FREE_PAGES);
+	if (zone_free > zone->dirty_balance_reserve)
+		return;
+
+	/*
+	 * Pages with a refault distance bigger than available memory
+	 * won't fit in memory no matter what, leave the active pages
+	 * alone.  This also makes sure the multiplication below does
+	 * not overflow.
+	 */
+	if (refault_distance > global_dirtyable_memory())
+		return;
+
+	/*
+	 * Translate the global refault distance using the zone's
+	 * share of global evictions, to make it comparable to the
+	 * zone's number of active and free pages.
+	 */
+	prop_fraction_percpu(&global_evictions, &lruvec->evictions,
+			     &numerator, &denominator);
+	missing = refault_distance * numerator;
+	do_div(missing, denominator);
+
+	/*
+	 * Protected pages should be challenged when the refault
+	 * distance indicates that thrashing could be stopped by
+	 * increasing the inactive list at the cost of the active
+	 * list.
+	 */
+	zone_active = zone_page_state(zone, NR_ACTIVE_FILE);
+	if (missing > zone_active)
+		return;
+
+	inc_zone_state(zone, WORKINGSET_STALE);
+	lruvec->shrink_active++;
+}
+
+void workingset_activation(struct page *page)
+{
+	struct lruvec *lruvec;
+
+	atomic_long_inc(&workingset_time);
+
+	/*
+	 * The lists are rebalanced when the inactive list is observed
+	 * to be too small for activations.  An activation means that
+	 * the inactive list is now big enough again for at least one
+	 * page, so back off further deactivation.
+	 */
+	lruvec = mem_cgroup_zone_lruvec(page_zone(page), NULL);
+	if (lruvec->shrink_active > 0)
+		lruvec->shrink_active--;
+}
+
+static int __init workingset_init(void)
+{
+	int shift;
+
+	/* XXX: adapt shift during memory hotplug */
+	shift = ilog2(global_dirtyable_memory() - 1);
+	prop_descriptor_init(&global_evictions, shift);
+	return 0;
+}
+module_init(workingset_init);
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [patch 10/10] mm: workingset: keep shadow entries in check
  2013-05-30 18:03 ` Johannes Weiner
@ 2013-05-30 18:04   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:04 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, two mechanisms are used:

1. A refault balance counter is maintained per file that grows with
   each shadow entry planted and shrinks with each refault.  Once the
   counter grows beyond a certain threshold, planting new shadows in
   that file is throttled.  It's per file so that a single file can
   not disable thrashing detection globally.  However, this still
   allows shadow entries to grow excessively when many files show this
   usage pattern, and so:

2. a list of files that contain shadow entries is maintained.  If the
   global number of shadows exceeds a certain threshold, a shrinker is
   activated that reclaims old entries from the mappings.  This is
   heavy-handed but it should not be a common case and is only there
   to protect from accidentally/maliciously induced OOM kills.
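
For mechanism 1 above, here is a small standalone restatement of the
throttling arithmetic used by workingset_eviction() and
workingset_refault() below (the memory_shift value and the globals are
stand-ins for the per-mapping shadow_debt, which the patch seeds with
global_dirtyable_memory()):

#include <stdbool.h>
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL	/* kernel batching unit, redefined for this sketch */

/*
 * Toy model: planting a shadow raises the debt, a refault pays a chunk
 * back, and once the debt exceeds a memory-sized threshold only every
 * Nth eviction still plants a shadow.
 */
static const int memory_shift = 16;	/* threshold of 65536 in this toy */
static unsigned long shadow_debt;

static bool plant_shadow(unsigned long time)
{
	unsigned int excess_order = shadow_debt >> memory_shift;

	if (excess_order &&
	    (time & ((SWAP_CLUSTER_MAX << (excess_order - 1)) - 1)))
		return false;		/* throttled: skip this shadow */
	if (excess_order < 4)
		shadow_debt++;
	return true;
}

static void note_refault(void)
{
	unsigned int excess_order = shadow_debt >> memory_shift;
	unsigned long delta = excess_order ?
			SWAP_CLUSTER_MAX << (excess_order - 1) : 1;

	shadow_debt -= shadow_debt > delta ? delta : shadow_debt;
}

int main(void)
{
	unsigned long time, planted = 0;

	for (time = 0; time < (2UL << memory_shift); time++)
		planted += plant_shadow(time);
	printf("planted %lu of %lu evictions, debt %lu\n",
	       planted, 2UL << memory_shift, shadow_debt);

	note_refault();			/* each refault pays a chunk back */
	printf("debt after one refault: %lu\n", shadow_debt);
	return 0;
}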

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c                    |   1 +
 include/linux/fs.h            |   2 +
 include/linux/swap.h          |   3 +
 include/linux/vm_event_item.h |   1 +
 mm/filemap.c                  |   5 +-
 mm/truncate.c                 |   2 +-
 mm/vmstat.c                   |   1 +
 mm/workingset.c               | 198 +++++++++++++++++++++++++++++++++++++++++-
 8 files changed, 206 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 3bd7916..f48ce73 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -168,6 +168,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->private_data = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
+	mapping->shadow_debt = global_dirtyable_memory();
 
 	/*
 	 * If the block_device provides a backing_dev_info for client
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bf1d99..7fc3f3a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -414,6 +414,8 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	struct list_head	shadow_list;
+	unsigned long		shadow_debt;
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c3d5237..ad153b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -225,7 +225,10 @@ struct swap_list_t {
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 unsigned long workingset_refault_distance(struct page *page);
 void workingset_zone_balance(struct zone *zone, unsigned long refault_distance);
+void workingset_refault(struct address_space *mapping);
 void workingset_activation(struct page *page);
+void workingset_shadows_inc(struct address_space *mapping);
+void workingset_shadows_dec(struct address_space *mapping);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index bd6cf61..cbbc323 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -70,6 +70,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 #endif
+		WORKINGSET_SHADOWS_RECLAIMED,
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 10f8a62..3900bea 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -128,7 +128,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
 		radix_tree_replace_slot(slot, shadow);
-		mapping->nrshadows++;
+		workingset_shadows_inc(mapping);
 	} else
 		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
@@ -449,7 +449,8 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
+		workingset_refault(mapping);
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index c1a5147..621c581 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -36,7 +36,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * need verification under the tree lock.
 	 */
 	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 17c19b0..9fef546 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -818,6 +818,7 @@ const char * const vmstat_text[] = {
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 #endif
+	"workingset_shadows_reclaimed",
 
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
diff --git a/mm/workingset.c b/mm/workingset.c
index 7986aa4..e6294cb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -84,6 +84,8 @@
  * challenged without incurring major faults in case of a mistake.
  */
 
+static int memory_shift;
+
 /*
  * Monotonic workingset clock for non-resident pages.
  *
@@ -115,6 +117,7 @@ static struct prop_descriptor global_evictions;
 
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
+	unsigned int excess_order;
 	struct lruvec *lruvec;
 	unsigned long time;
 
@@ -132,6 +135,26 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	if (mapping_exiting(mapping))
 		return NULL;
 
+	/*
+ 	 * If the planted shadows exceed the refaults, throttle the
+ 	 * planting to relieve the shadow shrinker.
+ 	 */
+	excess_order = mapping->shadow_debt >> memory_shift;
+	if (excess_order &&
+	    (time & ((SWAP_CLUSTER_MAX << (excess_order - 1)) - 1)))
+		return NULL;
+
+	/*
+ 	 * The counter needs a safety buffer above the excess
+ 	 * threshold to not oscillate, but don't plant shadows too
+ 	 * sparsely, either.  This is a trade-off between shrinker
+ 	 * activity during streaming IO and adaptiveness when the
+ 	 * workload actually does start using this file's pages
+ 	 * frequently.
+ 	 */
+	if (excess_order < 4)
+		mapping->shadow_debt++;
+
 	return (void *)((time << EV_SHIFT) | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
@@ -204,6 +227,20 @@ void workingset_zone_balance(struct zone *zone, unsigned long refault_distance)
 	lruvec->shrink_active++;
 }
 
+void workingset_refault(struct address_space *mapping)
+{
+	unsigned int excess_order;
+	unsigned long delta = 1;
+
+	excess_order = mapping->shadow_debt >> memory_shift;
+	if (excess_order)
+		delta = SWAP_CLUSTER_MAX << (excess_order - 1);
+	if (mapping->shadow_debt > delta)
+		mapping->shadow_debt -= delta;
+	else
+		mapping->shadow_debt = 0;
+}
+
 void workingset_activation(struct page *page)
 {
 	struct lruvec *lruvec;
@@ -221,13 +258,166 @@ void workingset_activation(struct page *page)
 		lruvec->shrink_active--;
 }
 
-static int __init workingset_init(void)
+static DEFINE_PER_CPU(unsigned long, nr_shadows);
+static DEFINE_SPINLOCK(shadow_lock);
+static LIST_HEAD(shadow_list);
+
+void workingset_shadows_inc(struct address_space *mapping)
+{
+	might_lock(&shadow_lock);
+	if (mapping->nrshadows == 0) {
+		spin_lock(&shadow_lock);
+		list_add(&mapping->shadow_list, &shadow_list);
+		spin_unlock(&shadow_lock);
+	}
+	mapping->nrshadows++;
+	this_cpu_inc(nr_shadows);
+}
+
+void workingset_shadows_dec(struct address_space *mapping)
+{
+	might_lock(&shadow_lock);
+	if (mapping->nrshadows == 1) {
+		spin_lock(&shadow_lock);
+		list_del(&mapping->shadow_list);
+		spin_unlock(&shadow_lock);
+	}
+	mapping->nrshadows--;
+	this_cpu_dec(nr_shadows);
+}
+
+static unsigned long get_nr_shadows(void)
+{
+	long sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(nr_shadows, cpu);
+	return max(sum, 0L);
+}
+
+static unsigned long nr_old_shadows(unsigned long cutoff,
+				    unsigned long nr_shadows)
+{
+	if (nr_shadows <= cutoff)
+		return 0;
+	return nr_shadows - cutoff;
+}
+
+static unsigned long prune_mapping(struct address_space *mapping,
+				   unsigned long nr_to_scan,
+				   unsigned long cutoff)
 {
-	int shift;
+	struct radix_tree_iter iter;
+	unsigned long nr_pruned = 0;
+	void **slot;
 
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
+		unsigned long time_of_eviction;
+		unsigned long nrshadows;
+		unsigned long diff;
+		struct page *page;
+
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (!radix_tree_exception(page))
+			continue;
+		if (radix_tree_deref_retry(page))
+			goto restart;
+
+		time_of_eviction = (unsigned long)page >> EV_SHIFT;
+		/*
+		 * Throw out entries older than the cutoff.  But watch
+		 * out for time wrap and pages that were installed
+		 * after the collection cycle started.
+		 */
+		diff = (cutoff - time_of_eviction) & EV_MASK;
+		if (diff & ~(EV_MASK >> 1))
+			continue;
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (radix_tree_delete_item(&mapping->page_tree,
+					   iter.index, page)) {
+			workingset_shadows_dec(mapping);
+			nr_pruned++;
+		}
+		nrshadows = mapping->nrshadows;
+		spin_unlock_irq(&mapping->tree_lock);
+
+		if (nrshadows == 0)
+			break;
+
+		if (--nr_to_scan == 0)
+			break;
+	}
+	rcu_read_unlock();
+	return nr_pruned;
+}
+
+static int prune_shadows(struct shrinker *shrink, struct shrink_control *sc)
+{
+	unsigned long nr_shadows;
+	unsigned long nr_to_scan;
+	unsigned long nr_max;
+	unsigned long nr_old;
+	unsigned long cutoff;
+	unsigned long now;
+
+	nr_shadows = get_nr_shadows();
+	if (!nr_shadows)
+		return 0;
+
+	nr_max = 2UL << memory_shift;
+	nr_old = nr_old_shadows(nr_max, nr_shadows);
+
+	if (!sc->nr_to_scan)
+		return nr_old;
+
+	nr_to_scan = sc->nr_to_scan;
+	now = atomic_long_read(&workingset_time);
+	cutoff = (now - nr_max) & EV_MASK;
+
+	while (nr_to_scan && nr_old) {
+		struct address_space *mapping;
+		unsigned long nr_pruned;
+
+		spin_lock(&shadow_lock);
+		if (list_empty(&shadow_list)) {
+			spin_unlock(&shadow_lock);
+			return 0;
+		}
+		mapping = list_entry(shadow_list.prev,
+				     struct address_space,
+				     shadow_list);
+		__iget(mapping->host);
+		list_move(&mapping->shadow_list, &shadow_list);
+		spin_unlock(&shadow_lock);
+
+		nr_pruned = prune_mapping(mapping, nr_to_scan, cutoff);
+		nr_to_scan -= nr_pruned;
+		iput(mapping->host);
+
+		count_vm_events(WORKINGSET_SHADOWS_RECLAIMED, nr_pruned);
+
+		nr_old = nr_old_shadows(nr_max, get_nr_shadows());
+	}
+	return nr_old;
+}
+
+static struct shrinker shadow_shrinker = {
+	.shrink = prune_shadows,
+	.seeks = 1,
+};
+
+static int __init workingset_init(void)
+{
 	/* XXX: adapt shift during memory hotplug */
-	shift = ilog2(global_dirtyable_memory() - 1);
-	prop_descriptor_init(&global_evictions, shift);
+	memory_shift = ilog2(global_dirtyable_memory() - 1);
+	prop_descriptor_init(&global_evictions, memory_shift);
+	register_shrinker(&shadow_shrinker);
 	return 0;
 }
 module_init(workingset_init);
-- 
1.8.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [patch 10/10] mm: workingset: keep shadow entries in check
@ 2013-05-30 18:04   ` Johannes Weiner
  0 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-05-30 18:04 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, two mechanisms are used:

1. A refault balance counter is maintained per file that grows with
   each shadow entry planted and shrinks with each refault.  Once the
   counter grows beyond a certain threshold, planting new shadows in
   that file is throttled.  It's per file so that a single file can
   not disable thrashing detection globally.  However, this still
   allows shadow entries to grow excessively when many files show this
   usage pattern, and so:

2. a list of files that contain shadow entries is maintained.  If the
   global number of shadows exceeds a certain threshold, a shrinker is
   activated that reclaims old entries from the mappings.  This is
   heavy-handed but it should not be a common case and is only there
   to protect from accidentally/maliciously induced OOM kills.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c                    |   1 +
 include/linux/fs.h            |   2 +
 include/linux/swap.h          |   3 +
 include/linux/vm_event_item.h |   1 +
 mm/filemap.c                  |   5 +-
 mm/truncate.c                 |   2 +-
 mm/vmstat.c                   |   1 +
 mm/workingset.c               | 198 +++++++++++++++++++++++++++++++++++++++++-
 8 files changed, 206 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 3bd7916..f48ce73 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -168,6 +168,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->private_data = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
+	mapping->shadow_debt = global_dirtyable_memory();
 
 	/*
 	 * If the block_device provides a backing_dev_info for client
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bf1d99..7fc3f3a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -414,6 +414,8 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	struct list_head	shadow_list;
+	unsigned long		shadow_debt;
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c3d5237..ad153b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -225,7 +225,10 @@ struct swap_list_t {
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 unsigned long workingset_refault_distance(struct page *page);
 void workingset_zone_balance(struct zone *zone, unsigned long refault_distance);
+void workingset_refault(struct address_space *mapping);
 void workingset_activation(struct page *page);
+void workingset_shadows_inc(struct address_space *mapping);
+void workingset_shadows_dec(struct address_space *mapping);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index bd6cf61..cbbc323 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -70,6 +70,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 #endif
+		WORKINGSET_SHADOWS_RECLAIMED,
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 10f8a62..3900bea 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -128,7 +128,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
 		radix_tree_replace_slot(slot, shadow);
-		mapping->nrshadows++;
+		workingset_shadows_inc(mapping);
 	} else
 		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
@@ -449,7 +449,8 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
+		workingset_refault(mapping);
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index c1a5147..621c581 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -36,7 +36,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * need verification under the tree lock.
 	 */
 	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 17c19b0..9fef546 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -818,6 +818,7 @@ const char * const vmstat_text[] = {
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 #endif
+	"workingset_shadows_reclaimed",
 
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
diff --git a/mm/workingset.c b/mm/workingset.c
index 7986aa4..e6294cb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -84,6 +84,8 @@
  * challenged without incurring major faults in case of a mistake.
  */
 
+static int memory_shift;
+
 /*
  * Monotonic workingset clock for non-resident pages.
  *
@@ -115,6 +117,7 @@ static struct prop_descriptor global_evictions;
 
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
+	unsigned int excess_order;
 	struct lruvec *lruvec;
 	unsigned long time;
 
@@ -132,6 +135,26 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	if (mapping_exiting(mapping))
 		return NULL;
 
+	/*
+ 	 * If the planted shadows exceed the refaults, throttle the
+ 	 * planting to relieve the shadow shrinker.
+ 	 */
+	excess_order = mapping->shadow_debt >> memory_shift;
+	if (excess_order &&
+	    (time & ((SWAP_CLUSTER_MAX << (excess_order - 1)) - 1)))
+		return NULL;
+
+	/*
+ 	 * The counter needs a safety buffer above the excess
+ 	 * threshold to not oscillate, but don't plant shadows too
+ 	 * sparsely, either.  This is a trade-off between shrinker
+ 	 * activity during streaming IO and adaptiveness when the
+ 	 * workload actually does start using this file's pages
+ 	 * frequently.
+ 	 */
+	if (excess_order < 4)
+		mapping->shadow_debt++;
+
 	return (void *)((time << EV_SHIFT) | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
@@ -204,6 +227,20 @@ void workingset_zone_balance(struct zone *zone, unsigned long refault_distance)
 	lruvec->shrink_active++;
 }
 
+void workingset_refault(struct address_space *mapping)
+{
+	unsigned int excess_order;
+	unsigned long delta = 1;
+
+	excess_order = mapping->shadow_debt >> memory_shift;
+	if (excess_order)
+		delta = SWAP_CLUSTER_MAX << (excess_order - 1);
+	if (mapping->shadow_debt > delta)
+		mapping->shadow_debt -= delta;
+	else
+		mapping->shadow_debt = 0;
+}
+
 void workingset_activation(struct page *page)
 {
 	struct lruvec *lruvec;
@@ -221,13 +258,166 @@ void workingset_activation(struct page *page)
 		lruvec->shrink_active--;
 }
 
-static int __init workingset_init(void)
+static DEFINE_PER_CPU(unsigned long, nr_shadows);
+static DEFINE_SPINLOCK(shadow_lock);
+static LIST_HEAD(shadow_list);
+
+void workingset_shadows_inc(struct address_space *mapping)
+{
+	might_lock(&shadow_lock);
+	if (mapping->nrshadows == 0) {
+		spin_lock(&shadow_lock);
+		list_add(&mapping->shadow_list, &shadow_list);
+		spin_unlock(&shadow_lock);
+	}
+	mapping->nrshadows++;
+	this_cpu_inc(nr_shadows);
+}
+
+void workingset_shadows_dec(struct address_space *mapping)
+{
+	might_lock(&shadow_lock);
+	if (mapping->nrshadows == 1) {
+		spin_lock(&shadow_lock);
+		list_del(&mapping->shadow_list);
+		spin_unlock(&shadow_lock);
+	}
+	mapping->nrshadows--;
+	this_cpu_dec(nr_shadows);
+}
+
+static unsigned long get_nr_shadows(void)
+{
+	long sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(nr_shadows, cpu);
+	return max(sum, 0L);
+}
+
+static unsigned long nr_old_shadows(unsigned long cutoff,
+				    unsigned long nr_shadows)
+{
+	if (nr_shadows <= cutoff)
+		return 0;
+	return nr_shadows - cutoff;
+}
+
+static unsigned long prune_mapping(struct address_space *mapping,
+				   unsigned long nr_to_scan,
+				   unsigned long cutoff)
 {
-	int shift;
+	struct radix_tree_iter iter;
+	unsigned long nr_pruned = 0;
+	void **slot;
 
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
+		unsigned long time_of_eviction;
+		unsigned long nrshadows;
+		unsigned long diff;
+		struct page *page;
+
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (!radix_tree_exception(page))
+			continue;
+		if (radix_tree_deref_retry(page))
+			goto restart;
+
+		time_of_eviction = (unsigned long)page >> EV_SHIFT;
+		/*
+		 * Throw out entries older than the cutoff.  But watch
+		 * out for time wrap and pages that were installed
+		 * after the collection cycle started.
+		 */
+		diff = (cutoff - time_of_eviction) & EV_MASK;
+		if (diff & ~(EV_MASK >> 1))
+			continue;
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (radix_tree_delete_item(&mapping->page_tree,
+					   iter.index, page)) {
+			workingset_shadows_dec(mapping);
+			nr_pruned++;
+		}
+		nrshadows = mapping->nrshadows;
+		spin_unlock_irq(&mapping->tree_lock);
+
+		if (nrshadows == 0)
+			break;
+
+		if (--nr_to_scan == 0)
+			break;
+	}
+	rcu_read_unlock();
+	return nr_pruned;
+}
+
+static int prune_shadows(struct shrinker *shrink, struct shrink_control *sc)
+{
+	unsigned long nr_shadows;
+	unsigned long nr_to_scan;
+	unsigned long nr_max;
+	unsigned long nr_old;
+	unsigned long cutoff;
+	unsigned long now;
+
+	nr_shadows = get_nr_shadows();
+	if (!nr_shadows)
+		return 0;
+
+	nr_max = 2UL << memory_shift;
+	nr_old = nr_old_shadows(nr_max, nr_shadows);
+
+	if (!sc->nr_to_scan)
+		return nr_old;
+
+	nr_to_scan = sc->nr_to_scan;
+	now = atomic_long_read(&workingset_time);
+	cutoff = (now - nr_max) & EV_MASK;
+
+	while (nr_to_scan && nr_old) {
+		struct address_space *mapping;
+		unsigned long nr_pruned;
+
+		spin_lock(&shadow_lock);
+		if (list_empty(&shadow_list)) {
+			spin_unlock(&shadow_lock);
+			return 0;
+		}
+		mapping = list_entry(shadow_list.prev,
+				     struct address_space,
+				     shadow_list);
+		__iget(mapping->host);
+		list_move(&mapping->shadow_list, &shadow_list);
+		spin_unlock(&shadow_lock);
+
+		nr_pruned = prune_mapping(mapping, nr_to_scan, cutoff);
+		nr_to_scan -= nr_pruned;
+		iput(mapping->host);
+
+		count_vm_events(WORKINGSET_SHADOWS_RECLAIMED, nr_pruned);
+
+		nr_old = nr_old_shadows(nr_max, get_nr_shadows());
+	}
+	return nr_old;
+}
+
+static struct shrinker shadow_shrinker = {
+	.shrink = prune_shadows,
+	.seeks = 1,
+};
+
+static int __init workingset_init(void)
+{
 	/* XXX: adapt shift during memory hotplug */
-	shift = ilog2(global_dirtyable_memory() - 1);
-	prop_descriptor_init(&global_evictions, shift);
+	memory_shift = ilog2(global_dirtyable_memory() - 1);
+	prop_descriptor_init(&global_evictions, memory_shift);
+	register_shrinker(&shadow_shrinker);
 	return 0;
 }
 module_init(workingset_init);
-- 
1.8.3

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [patch 10/10] mm: workingset: keep shadow entries in check
  2013-05-30 18:04   ` Johannes Weiner
@ 2013-06-03  8:22     ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2013-06-03  8:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

On Thu, May 30, 2013 at 02:04:06PM -0400, Johannes Weiner wrote:
> 2. a list of files that contain shadow entries is maintained.  If the
>    global number of shadows exceeds a certain threshold, a shrinker is
>    activated that reclaims old entries from the mappings.  This is
>    heavy-handed but it should not be a common case and is only there
>    to protect from accidentally/maliciously induced OOM kills.

Grrr.. another global files list. We've been trying rather hard to get
rid of the first one :/

I see why you want it but ugh.

I have similar worries for your global time counter, large machines
might thrash on that one cacheline.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 10/10] mm: workingset: keep shadow entries in check
  2013-05-30 18:04   ` Johannes Weiner
@ 2013-06-03  8:25     ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2013-06-03  8:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

On Thu, May 30, 2013 at 02:04:06PM -0400, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 

Can't we simply prune all refault entries that have a distance larger
than the memory size? Then we must assume that no refault entry means
it's too old, which I think is a fair assumption.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 10/10] mm: workingset: keep shadow entries in check
  2013-06-03  8:22     ` Peter Zijlstra
@ 2013-06-03 15:01       ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-06-03 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

On Mon, Jun 03, 2013 at 10:22:09AM +0200, Peter Zijlstra wrote:
> On Thu, May 30, 2013 at 02:04:06PM -0400, Johannes Weiner wrote:
> > 2. a list of files that contain shadow entries is maintained.  If the
> >    global number of shadows exceeds a certain threshold, a shrinker is
> >    activated that reclaims old entries from the mappings.  This is
> >    heavy-handed but it should not be a common case and is only there
> >    to protect from accidentally/maliciously induced OOM kills.
> 
> Grrr.. another global files list. We've been trying rather hard to get
> rid of the first one :/
> 
> I see why you want it but ugh.

I'll try to make it per-SB like the inode list.  It probably won't be
per-SB shrinkers because of the global nature of the shadow limit, but
at least per-SB inode lists should be doable.

> I have similar worries for your global time counter, large machines
> might thrash on that one cacheline.

Fair enough.

So I'm trying the following idea: instead of the global time counter,
have per-zone time counters and store the zone along with those local
timestamps in the shadow entries (nid | zid | time).  On refault, we
can calculate the zone-local distance first and then use the inverse
of the zone's eviction proportion to scale it to a global distance.
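
As a rough illustration of that scaling step (a userspace sketch only,
with made-up names; the delta below does this with
prop_fraction_percpu() and mult_frac()): a zone that accounts for a
quarter of recent evictions turns a zone-local distance of 1000 into a
global distance of about 4000.

#include <stdio.h>

/*
 * Scale a zone-local refault distance to a global one using the
 * inverse of the zone's share of recent evictions, where the share
 * is reported as numerator/denominator.
 */
static unsigned long scale_to_global(unsigned long local_distance,
				     long numerator, long denominator)
{
	if (!numerator)		/* zone saw no recent evictions */
		numerator = 1;
	return local_distance * denominator / numerator;
}

int main(void)
{
	/* zone did 1/4 of recent evictions, local distance 1000 */
	printf("%lu\n", scale_to_global(1000, 1, 4));	/* prints 4000 */
	return 0;
}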

Delta for 9/10:

---
 include/linux/mmzone.h |  1 +
 mm/workingset.c        | 74 ++++++++++++++++++++++++++++++--------------------
 2 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 505bd80..24e9805 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -206,6 +206,7 @@ struct zone_reclaim_stat {
 struct lruvec {
 	struct list_head lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat reclaim_stat;
+	atomic_long_t workingset_time;
 	struct prop_local_percpu evictions;
 	long shrink_active;
 #ifdef CONFIG_MEMCG
diff --git a/mm/workingset.c b/mm/workingset.c
index 7986aa4..5fd7277 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -85,27 +85,10 @@
  */
 
 /*
- * Monotonic workingset clock for non-resident pages.
- *
- * The refault distance of a page is the number of ticks that occurred
- * between that page's eviction and subsequent refault.
- *
- * Every page slot that is taken away from the inactive list is one
- * more slot the inactive list would have to grow again in order to
- * hold the current non-resident pages in memory as well.
- *
- * As the refault distance needs to reflect the space missing on the
- * inactive list, the workingset time is advanced every time the
- * inactive list is shrunk.  This means eviction, but also activation.
- */
-static atomic_long_t workingset_time;
-
-/*
  * Workingset clock snapshots are stored in the page cache radix tree
  * as exceptional entries (shadows).
  */
 #define EV_SHIFT	RADIX_TREE_EXCEPTIONAL_SHIFT
-#define EV_MASK		(~0UL >> EV_SHIFT)
 
 /*
  * Per-zone proportional eviction counter to keep track of recent zone
@@ -115,12 +98,12 @@ static struct prop_descriptor global_evictions;
 
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
+	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 	unsigned long time;
 
-	time = atomic_long_inc_return(&workingset_time);
-
-	lruvec = mem_cgroup_zone_lruvec(page_zone(page), NULL);
+	lruvec = mem_cgroup_zone_lruvec(zone, NULL);
+	time = atomic_long_inc_return(&lruvec->workingset_time);
 	prop_inc_percpu(&global_evictions, &lruvec->evictions);
 
 	/*
@@ -132,21 +115,57 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	if (mapping_exiting(mapping))
 		return NULL;
 
+	time = (time << NODES_SHIFT) | zone->node;
+	time = (time << ZONES_SHIFT) | zone_idx(zone);
+
 	return (void *)((time << EV_SHIFT) | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
-unsigned long workingset_refault_distance(struct page *page)
+static void lruvec_refault_distance(unsigned long shadow,
+				    struct lruvec **lruvec,
+				    unsigned long *distance)
 {
 	unsigned long time_of_eviction;
+	struct zone *zone;
 	unsigned long now;
+	int zid, nid;
+
+	shadow >>= EV_SHIFT;
+	zid = shadow & ((1UL << ZONES_SHIFT) - 1);
+	shadow >>= ZONES_SHIFT;
+	nid = shadow & ((1UL << NODES_SHIFT) - 1);
+	shadow >>= NODES_SHIFT;
+	time_of_eviction = shadow;
+	zone = NODE_DATA(nid)->node_zones + zid;
+
+	*lruvec = mem_cgroup_zone_lruvec(zone, NULL);
+
+	now = atomic_long_read(&(*lruvec)->workingset_time);
+
+	*distance = (now - time_of_eviction) &
+		(~0UL >> (EV_SHIFT + ZONES_SHIFT + NODES_SHIFT));
+}
+
+unsigned long workingset_refault_distance(struct page *page)
+{
+	unsigned long refault_distance;
+	unsigned long lruvec_distance;
+	struct lruvec *lruvec;
+	long denominator;
+	long numerator;
 
 	if (!page)
 		return ~0UL;
 
 	BUG_ON(!radix_tree_exceptional_entry(page));
-	time_of_eviction = (unsigned long)page >> EV_SHIFT;
-	now = atomic_long_read(&workingset_time);
-	return (now - time_of_eviction) & EV_MASK;
+	lruvec_refault_distance((unsigned long)page,
+				&lruvec, &lruvec_distance);
+	prop_fraction_percpu(&global_evictions, &lruvec->evictions,
+			     &numerator, &denominator);
+	if (!numerator)
+		numerator = 1;
+	refault_distance = mult_frac(lruvec_distance, denominator, numerator);
+	return refault_distance;
 }
 EXPORT_SYMBOL(workingset_refault_distance);
 
@@ -187,8 +206,7 @@ void workingset_zone_balance(struct zone *zone, unsigned long refault_distance)
 	 */
 	prop_fraction_percpu(&global_evictions, &lruvec->evictions,
 			     &numerator, &denominator);
-	missing = refault_distance * numerator;
-	do_div(missing, denominator);
+	missing = mult_frac(refault_distance, numerator, denominator);
 
 	/*
 	 * Protected pages should be challenged when the refault
@@ -207,9 +225,6 @@ void workingset_zone_balance(struct zone *zone, unsigned long refault_distance)
 void workingset_activation(struct page *page)
 {
 	struct lruvec *lruvec;
-
-	atomic_long_inc(&workingset_time);
-
 	/*
 	 * The lists are rebalanced when the inactive list is observed
 	 * to be too small for activations.  An activation means that
@@ -217,6 +232,7 @@ void workingset_activation(struct page *page)
 	 * page, so back off further deactivation.
 	 */
 	lruvec = mem_cgroup_zone_lruvec(page_zone(page), NULL);
+	atomic_long_inc(&lruvec->workingset_time);
 	if (lruvec->shrink_active > 0)
 		lruvec->shrink_active--;
 }
-- 
1.8.2.3


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [patch 10/10] mm: workingset: keep shadow entries in check
  2013-06-03  8:25     ` Peter Zijlstra
@ 2013-06-03 15:20       ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-06-03 15:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

On Mon, Jun 03, 2013 at 10:25:33AM +0200, Peter Zijlstra wrote:
> On Thu, May 30, 2013 at 02:04:06PM -0400, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> 
> Can't we simply prune all refault entries that have a distance larger
> than the memory size? Then we must assume that no refault entry means
> it's too old, which I think is a fair assumption.

Two workloads bound to two nodes might not push pages through the LRUs
at the same pace, so a distance might be bigger than memory due to the
faster moving node, yet still be a hit in the slower moving one.  We
can't really know until we evaluate it on a per-zone basis.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 10/10] mm: workingset: keep shadow entries in check
  2013-06-03 15:01       ` Johannes Weiner
@ 2013-06-03 17:10         ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2013-06-03 17:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

On Mon, Jun 03, 2013 at 11:01:54AM -0400, Johannes Weiner wrote:
> On Mon, Jun 03, 2013 at 10:22:09AM +0200, Peter Zijlstra wrote:
> > On Thu, May 30, 2013 at 02:04:06PM -0400, Johannes Weiner wrote:
> > > 2. a list of files that contain shadow entries is maintained.  If the
> > >    global number of shadows exceeds a certain threshold, a shrinker is
> > >    activated that reclaims old entries from the mappings.  This is
> > >    heavy-handed but it should not be a common case and is only there
> > >    to protect from accidentally/maliciously induced OOM kills.
> > 
> > Grrr.. another global files list. We've been trying rather hard to get
> > rid of the first one :/
> > 
> > I see why you want it but ugh.
> 
> I'll try to make it per-SB like the inode list.  It probably won't be
> per-SB shrinkers because of the global nature of the shadow limit, but
> at least per-SB inode lists should be doable.

We have per-cpu-per-sb lists, see file_sb_list_{add,del} and
do_file_list_for_each_entry()

> > I have similar worries for your global time counter, large machines
> > might thrash on that one cacheline.
> 
> Fair enough.
> 
> So I'm trying the following idea: instead of the global time counter,
> have per-zone time counters and store the zone along with those local
> timestamps in the shadow entries (nid | zid | time).  On refault, we
> can calculate the zone-local distance first and then use the inverse
> of the zone's eviction proportion to scale it to a global distance.

The thinking is that since that's the same granularity as the zone lock,
you're likely to at least thrash the zone lock in equal measure?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 10/10] mm: workingset: keep shadow entries in check
  2013-06-03 15:20       ` Johannes Weiner
@ 2013-06-03 17:15         ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2013-06-03 17:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

On Mon, Jun 03, 2013 at 11:20:32AM -0400, Johannes Weiner wrote:
> On Mon, Jun 03, 2013 at 10:25:33AM +0200, Peter Zijlstra wrote:
> > On Thu, May 30, 2013 at 02:04:06PM -0400, Johannes Weiner wrote:
> > > Previously, page cache radix tree nodes were freed after reclaim
> > > emptied out their page pointers.  But now reclaim stores shadow
> > > entries in their place, which are only reclaimed when the inodes
> > > themselves are reclaimed.  This is problematic for bigger files that
> > > are still in use after they have a significant amount of their cache
> > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > entries will just sit there and waste memory.  In the worst case, the
> > > shadow entries will accumulate until the machine runs out of memory.
> > > 
> > 
> > Can't we simply prune all refault entries that have a distance larger
> > than the memory size? Then we must assume that no refault entry means
> > it's too old, which I think is a fair assumption.
> 
> Two workloads bound to two nodes might not push pages through the LRUs
> at the same pace, so a distance might be bigger than memory due to the
> faster moving node, yet still be a hit in the slower moving one.  We
> can't really know until we evaluate it on a per-zone basis.

But wasn't patch 1 of this series about making sure each zone is scanned
proportionally to its size?

But given that, sure maybe 1 memory size is a bit strict, but surely we
can put a limit on things at about 2 memory sizes?




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 10/10] mm: workingset: keep shadow entries in check
  2013-06-03 17:15         ` Peter Zijlstra
@ 2013-06-03 18:12           ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-06-03 18:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

On Mon, Jun 03, 2013 at 07:15:58PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 03, 2013 at 11:20:32AM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2013 at 10:25:33AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 30, 2013 at 02:04:06PM -0400, Johannes Weiner wrote:
> > > > Previously, page cache radix tree nodes were freed after reclaim
> > > > emptied out their page pointers.  But now reclaim stores shadow
> > > > entries in their place, which are only reclaimed when the inodes
> > > > themselves are reclaimed.  This is problematic for bigger files that
> > > > are still in use after they have a significant amount of their cache
> > > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > > entries will just sit there and waste memory.  In the worst case, the
> > > > shadow entries will accumulate until the machine runs out of memory.
> > > > 
> > > 
> > > Can't we simply prune all refault entries that have a distance larger
> > > than the memory size? Then we must assume that no refault entry means
> > > it's too old, which I think is a fair assumption.
> > 
> > Two workloads bound to two nodes might not push pages through the LRUs
> > at the same pace, so a distance might be bigger than memory due to the
> > faster moving node, yet still be a hit in the slower moving one.  We
> > can't really know until we evaluate it on a per-zone basis.
> 
> But wasn't patch 1 of this series about making sure each zone is scanned
> proportionally to its size?

Only within any given zonelist.  It's just so that pages used together
are aged fairly.  But if the tasks are isolated from each other, their
pages may age at different paces without it being unfair, since the
tasks do not contend for the same memory.

> But given that, sure maybe 1 memory size is a bit strict, but surely we
> can put a limit on things at about 2 memory sizes?

That's what this 10/10 patch does (prune everything older than 2 *
global_dirtyable_memory()), so I think we're talking past each other.

Maybe the wording of the changelog was confusing?  The paragraph you
quoted above explains the problem resulting from 9/10, which this
patch 10/10 then fixes.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 10/10] mm: workingset: keep shadow entries in check
  2013-06-03 18:12           ` Johannes Weiner
@ 2013-06-03 18:52             ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2013-06-03 18:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

On Mon, Jun 03, 2013 at 02:12:02PM -0400, Johannes Weiner wrote:
> > But given that, sure maybe 1 memory size is a bit strict, but surely we
> > can put a limit on things at about 2 memory sizes?
> 
> That's what this 10/10 patch does (prune everything older than 2 *
> global_dirtyable_memory()), so I think we're talking past each other.
> 
> Maybe the wording of the changelog was confusing?  The paragraph you
> quoted above explains the problem resulting from 9/10, which this
> patch 10/10 then fixes.

Could be I just didn't read very well -- I pretty much raced through the
patches trying to get a general overview and see if I could spot
something weird.

I'll try again and let you know :-)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 10/10] mm: workingset: keep shadow entries in check
  2013-06-03 17:10         ` Peter Zijlstra
@ 2013-06-06 18:31           ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-06-06 18:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, metin d,
	linux-kernel, linux-fsdevel

On Mon, Jun 03, 2013 at 07:10:31PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 03, 2013 at 11:01:54AM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2013 at 10:22:09AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 30, 2013 at 02:04:06PM -0400, Johannes Weiner wrote:
> > > > 2. a list of files that contain shadow entries is maintained.  If the
> > > >    global number of shadows exceeds a certain threshold, a shrinker is
> > > >    activated that reclaims old entries from the mappings.  This is
> > > >    heavy-handed but it should not be a common case and is only there
> > > >    to protect from accidentally/maliciously induced OOM kills.
> > > 
> > > Grrr.. another global files list. We've been trying rather hard to get
> > > rid of the first one :/
> > > 
> > > I see why you want it but ugh.
> > 
> > I'll try to make it per-SB like the inode list.  It probably won't be
> > per-SB shrinkers because of the global nature of the shadow limit, but
> > at least per-SB inode lists should be doable.
> 
> We have per-cpu-per-sb lists, see file_sb_list_{add,del} and
> do_file_list_for_each_entry()

Ok, I'll give it a look.  Thanks.

> > > I have similar worries for your global time counter, large machines
> > > might thrash on that one cacheline.
> > 
> > Fair enough.
> > 
> > So I'm trying the following idea: instead of the global time counter,
> > have per-zone time counters and store the zone along with those local
> > timestamps in the shadow entries (nid | zid | time).  On refault, we
> > can calculate the zone-local distance first and then use the inverse
> > of the zone's eviction proportion to scale it to a global distance.
> 
> The thinking is that since that's the same granularity as the zone lock,
> you're likely to at least thrash the zone lock in equal measure?

Yeah, and prevent the cross-node bouncing.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 09/10] mm: thrash detection-based file cache sizing
  2013-05-30 18:04   ` Johannes Weiner
@ 2013-06-07 14:16     ` Roman Gushchin
  -1 siblings, 0 replies; 44+ messages in thread
From: Roman Gushchin @ 2013-06-07 14:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Rik van Riel, Michel Lespinasse, Seth Jennings, metin d,
	linux-kernel, linux-fsdevel

On 30.05.2013 22:04, Johannes Weiner wrote:
> +/*
> + * Monotonic workingset clock for non-resident pages.
> + *
> + * The refault distance of a page is the number of ticks that occurred
> + * between that page's eviction and subsequent refault.
> + *
> + * Every page slot that is taken away from the inactive list is one
> + * more slot the inactive list would have to grow again in order to
> + * hold the current non-resident pages in memory as well.
> + *
> + * As the refault distance needs to reflect the space missing on the
> + * inactive list, the workingset time is advanced every time the
> + * inactive list is shrunk.  This means eviction, but also activation.
> + */
> +static atomic_long_t workingset_time;

It seems strange to me that workingset_time is global.
Don't you want to make it per-cgroup?

Two more questions:
1) do you plan to take fadvise's into account somehow?
2) do you plan to use workingset information to enhance
	the readahead mechanism?

Thanks!

Regards,
Roman

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 09/10] mm: thrash detection-based file cache sizing
  2013-06-07 14:16     ` Roman Gushchin
@ 2013-06-07 17:36       ` Johannes Weiner
  0 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2013-06-07 17:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Rik van Riel, Michel Lespinasse, Seth Jennings, metin d,
	linux-kernel, linux-fsdevel

On Fri, Jun 07, 2013 at 06:16:05PM +0400, Roman Gushchin wrote:
> On 30.05.2013 22:04, Johannes Weiner wrote:
> >+/*
> >+ * Monotonic workingset clock for non-resident pages.
> >+ *
> >+ * The refault distance of a page is the number of ticks that occurred
> >+ * between that page's eviction and subsequent refault.
> >+ *
> >+ * Every page slot that is taken away from the inactive list is one
> >+ * more slot the inactive list would have to grow again in order to
> >+ * hold the current non-resident pages in memory as well.
> >+ *
> >+ * As the refault distance needs to reflect the space missing on the
> >+ * inactive list, the workingset time is advanced every time the
> >+ * inactive list is shrunk.  This means eviction, but also activation.
> >+ */
> >+static atomic_long_t workingset_time;
> 
> It seems strange to me that workingset_time is global.
> Don't you want to make it per-cgroup?

Yes, we need to go there, and the code is structured so that it will
be possible to adapt it for memcg in the future.

But we will still need to maintain a global view of the workingset
time: memory and data are not exclusive resources, or at least can't
be guaranteed to be, so refault distances always need to be applicable
to all containers in the system.  In response to Peter's feedback, I
changed the global workingset_time variable to a per-zone one, and I
now use the per-zone floating proportions that I used to break down
the global speed, in reverse, to scale the zone time up to a global
time.
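
Roughly, that scaling amounts to dividing the zone-local distance by
the zone's decaying share of recent evictions.  A minimal sketch with
assumed names and a simplified decay scheme (this is not the kernel's
floating-proportions code):

/* Decaying per-zone eviction share and the local-to-global scale-up. */
struct evict_prop {
	unsigned long events;	/* evictions counted in the current period */
	unsigned long period;	/* decayed running total */
};

static void evict_prop_decay(struct evict_prop *p)
{
	/* Halve the history each period so the share tracks recent behaviour. */
	p->period = p->period / 2 + p->events;
	p->events = 0;
}

static unsigned long scale_to_global(unsigned long zone_distance,
				     const struct evict_prop *zone,
				     unsigned long global_period)
{
	/*
	 * zone->period / global_period approximates this zone's share of
	 * recent evictions; dividing the zone-local distance by that share
	 * yields a machine-wide distance.
	 */
	if (!zone->period)
		return zone_distance;
	return zone_distance * global_period / zone->period;
}

A zone that did a quarter of the recent evictions would thus turn a
local distance of N into a global distance of roughly 4N.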

> Two more questions:
> 1) Do you plan to take fadvise hints into account somehow?

DONTNEED is honored, shadow entries will be dropped in the fadvised
region.  Is that what you meant?
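
(For reference, the userspace side of that is just posix_fadvise() with
POSIX_FADV_DONTNEED on the range in question; the file name below is
made up:)

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("data.bin", O_RDONLY);	/* hypothetical file */
	int err;

	if (fd < 0)
		return 1;
	/* Drop the cached pages for the first 16MB; per the mail above,
	 * the corresponding shadow entries are dropped as well. */
	err = posix_fadvise(fd, 0, 16 << 20, POSIX_FADV_DONTNEED);
	if (err)
		fprintf(stderr, "posix_fadvise: %d\n", err);
	close(fd);
	return 0;
}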

> 2) Do you plan to use workingset information to enhance
> 	the readahead mechanism?

I don't have any specific plans for this and I'm not sure if detecting
thrashing alone would be a good predicate.  It would make more sense
to adjust readahead windows if readahead pages are reclaimed before
they are used, and that may happen even in the absence of refaults.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2013-06-07 17:37 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-30 18:03 [patch 00/10] mm: thrash detection-based file cache sizing Johannes Weiner
2013-05-30 18:03 ` Johannes Weiner
2013-05-30 18:03 ` [patch 01/10] mm: page_alloc: zone round-robin allocator Johannes Weiner
2013-05-30 18:03   ` Johannes Weiner
2013-05-30 18:03 ` [patch 02/10] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2013-05-30 18:03   ` Johannes Weiner
2013-05-30 18:03 ` [patch 03/10] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2013-05-30 18:03   ` Johannes Weiner
2013-05-30 18:04 ` [patch 04/10] mm: filemap: move radix tree hole searching here Johannes Weiner
2013-05-30 18:04   ` Johannes Weiner
2013-05-30 18:04 ` [patch 05/10] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2013-05-30 18:04   ` Johannes Weiner
2013-05-30 18:04 ` [patch 06/10] mm + fs: store shadow entries in page cache Johannes Weiner
2013-05-30 18:04   ` Johannes Weiner
2013-05-30 18:04 ` [patch 07/10] mm + fs: provide refault distance to page cache allocations Johannes Weiner
2013-05-30 18:04   ` Johannes Weiner
2013-05-30 18:04 ` [patch 08/10] mm: make global_dirtyable_memory() available to other mm code Johannes Weiner
2013-05-30 18:04   ` Johannes Weiner
2013-05-30 18:04 ` [patch 09/10] mm: thrash detection-based file cache sizing Johannes Weiner
2013-05-30 18:04   ` Johannes Weiner
2013-06-07 14:16   ` Roman Gushchin
2013-06-07 14:16     ` Roman Gushchin
2013-06-07 17:36     ` Johannes Weiner
2013-06-07 17:36       ` Johannes Weiner
2013-05-30 18:04 ` [patch 10/10] mm: workingset: keep shadow entries in check Johannes Weiner
2013-05-30 18:04   ` Johannes Weiner
2013-06-03  8:22   ` Peter Zijlstra
2013-06-03  8:22     ` Peter Zijlstra
2013-06-03 15:01     ` Johannes Weiner
2013-06-03 15:01       ` Johannes Weiner
2013-06-03 17:10       ` Peter Zijlstra
2013-06-03 17:10         ` Peter Zijlstra
2013-06-06 18:31         ` Johannes Weiner
2013-06-06 18:31           ` Johannes Weiner
2013-06-03  8:25   ` Peter Zijlstra
2013-06-03  8:25     ` Peter Zijlstra
2013-06-03 15:20     ` Johannes Weiner
2013-06-03 15:20       ` Johannes Weiner
2013-06-03 17:15       ` Peter Zijlstra
2013-06-03 17:15         ` Peter Zijlstra
2013-06-03 18:12         ` Johannes Weiner
2013-06-03 18:12           ` Johannes Weiner
2013-06-03 18:52           ` Peter Zijlstra
2013-06-03 18:52             ` Peter Zijlstra
