linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-06 19:48 ` [PATCH 07/10] mm: base LRU balancing on an explicit cost model Johannes Weiner
@ 2016-06-06 19:13   ` kbuild test robot
  2016-06-07  2:34   ` Rik van Riel
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 67+ messages in thread
From: kbuild test robot @ 2016-06-06 19:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, linux-mm, linux-kernel, Andrew Morton, Rik van Riel,
	Mel Gorman, Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

[-- Attachment #1: Type: text/plain, Size: 2599 bytes --]

Hi,

[auto build test ERROR on cifs/for-next]
[also build test ERROR on v4.7-rc2 next-20160606]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-balance-LRU-lists-based-on-relative-thrashing/20160607-035348
base:   git://git.samba.org/sfrench/cifs-2.6.git for-next
config: s390-default_defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 5.3.1-8) 5.3.1 20160205
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=s390 

All errors (new ones prefixed by >>):

   mm/memcontrol.c: In function 'memcg_stat_show':
>> mm/memcontrol.c:3221:24: error: 'struct lruvec' has no member named 'reclaim_stat'
        rstat = &mz->lruvec.reclaim_stat;
                           ^
>> mm/memcontrol.c:3223:31: error: dereferencing pointer to incomplete type 'struct zone_reclaim_stat'
        recent_rotated[0] += rstat->recent_rotated[0];
                                  ^

vim +3221 mm/memcontrol.c

7f016ee8 KOSAKI Motohiro 2009-01-07  3215  		unsigned long recent_rotated[2] = {0, 0};
7f016ee8 KOSAKI Motohiro 2009-01-07  3216  		unsigned long recent_scanned[2] = {0, 0};
7f016ee8 KOSAKI Motohiro 2009-01-07  3217  
7f016ee8 KOSAKI Motohiro 2009-01-07  3218  		for_each_online_node(nid)
7f016ee8 KOSAKI Motohiro 2009-01-07  3219  			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
e231875b Jianyu Zhan     2014-06-06  3220  				mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
89abfab1 Hugh Dickins    2012-05-29 @3221  				rstat = &mz->lruvec.reclaim_stat;
7f016ee8 KOSAKI Motohiro 2009-01-07  3222  
89abfab1 Hugh Dickins    2012-05-29 @3223  				recent_rotated[0] += rstat->recent_rotated[0];
89abfab1 Hugh Dickins    2012-05-29  3224  				recent_rotated[1] += rstat->recent_rotated[1];
89abfab1 Hugh Dickins    2012-05-29  3225  				recent_scanned[0] += rstat->recent_scanned[0];
89abfab1 Hugh Dickins    2012-05-29  3226  				recent_scanned[1] += rstat->recent_scanned[1];

:::::: The code at line 3221 was first introduced by commit
:::::: 89abfab133ef1f5902abafb744df72793213ac19 mm/memcg: move reclaim_stat into lruvec

:::::: TO: Hugh Dickins <hughd@google.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 16213 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-06 19:48 ` [PATCH 10/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
@ 2016-06-06 19:22   ` kbuild test robot
  2016-06-06 23:50   ` Tim Chen
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 67+ messages in thread
From: kbuild test robot @ 2016-06-06 19:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kbuild-all, linux-mm, linux-kernel, Andrew Morton, Rik van Riel,
	Mel Gorman, Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

[-- Attachment #1: Type: text/plain, Size: 1716 bytes --]

Hi,

[auto build test ERROR on cifs/for-next]
[also build test ERROR on v4.7-rc2 next-20160606]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-balance-LRU-lists-based-on-relative-thrashing/20160607-035348
base:   git://git.samba.org/sfrench/cifs-2.6.git for-next
config: s390-default_defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 5.3.1-8) 5.3.1 20160205
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=s390 

All errors (new ones prefixed by >>):

   mm/migrate.c: In function 'migrate_misplaced_transhuge_page':
>> mm/migrate.c:1814:7: error: implicit declaration of function 'TestClearPageWorkingset' [-Werror=implicit-function-declaration]
      if (TestClearPageWorkingset(new_page))
          ^
   cc1: some warnings being treated as errors

vim +/TestClearPageWorkingset +1814 mm/migrate.c

  1808		if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
  1809	fail_putback:
  1810			spin_unlock(ptl);
  1811			mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
  1812	
  1813			/* Reverse changes made by migrate_page_copy() */
> 1814			if (TestClearPageWorkingset(new_page))
  1815				ClearPageWorkingset(page);
  1816			if (TestClearPageActive(new_page))
  1817				SetPageActive(page);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 16213 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 00/10] mm: balance LRU lists based on relative thrashing
@ 2016-06-06 19:48 Johannes Weiner
  2016-06-06 19:48 ` [PATCH 01/10] mm: allow swappiness that prefers anon over file Johannes Weiner
                   ` (10 more replies)
  0 siblings, 11 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

Hi everybody,

this series re-implements the LRU balancing between page cache and
anonymous pages to work better with fast random IO swap devices.

The LRU balancing code evolved under slow rotational disks with high
seek overhead, and it had to extrapolate the cost of reclaiming a list
based on in-memory reference patterns alone, which is error prone and,
in combination with the high IO cost of mistakes, risky. As a result,
the balancing code is now at a point where it mostly goes for page
cache and avoids the random IO of swapping altogether until the VM is
under significant memory pressure.

With the proliferation of fast random IO devices such as SSDs and
persistent memory, though, swap becomes interesting again, not just as
a last-resort overflow, but as an extension of memory that can be used
to optimize the in-memory balance between the page cache and the
anonymous workingset even during moderate load. Our current reclaim
choices don't exploit the potential of this hardware. This series sets
out to address this.

Having exact tracking of refault IO - the ultimate cost of reclaiming
the wrong pages - allows us to use an IO cost based balancing model
that is more aggressive about swapping on fast backing devices while
holding back on existing setups that still use rotational storage.

These patches base the LRU balancing on the rate of refaults on each
list, times the relative IO cost between swap device and filesystem
(swappiness), in order to optimize reclaim for least IO cost incurred.
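To make that accounting concrete, here is a rough sketch of the model described above. This is illustrative only: the names, data layout, and exact arithmetic are made up for this sketch and are not taken from the patches.

```c
#include <assert.h>

/*
 * Rough sketch, NOT the kernel implementation. Refaults on each list
 * are charged at that list's relative IO cost (swappiness on a 0..200
 * scale: anon refaults cost `swappiness`, file refaults cost
 * `200 - swappiness`), and scan pressure is then aimed at the list
 * that is thrashing less.
 */
struct lru_cost {
	unsigned long anon;	/* accumulated anon refault IO cost */
	unsigned long file;	/* accumulated file refault IO cost */
};

static void charge_refaults(struct lru_cost *c, int swappiness,
			    unsigned long anon_refaults,
			    unsigned long file_refaults)
{
	c->anon += anon_refaults * swappiness;
	c->file += file_refaults * (200 - swappiness);
}

/* Percentage of scan pressure aimed at the anon list. */
static int anon_scan_percent(const struct lru_cost *c)
{
	unsigned long total = c->anon + c->file;

	if (!total)
		return 50;	/* no refault signal: split evenly */
	/* The list refaulting more is costing IO; scan the other one. */
	return (int)(100 * c->file / total);
}
```

With equal IO cost (swappiness 100), a burst of file refaults shifts all pressure onto the anon list, and the split relaxes back toward even as anon refaults accumulate.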

---

The following postgres benchmark demonstrates the benefits of this new
model. The machine has 7G, the database is 5.6G with 1G for shared
buffers, and the system has a little over 1G worth of anonymous pages
from mostly idle processes and tmpfs files. The filesystem is on
spinning rust, the swap partition is on an SSD; swappiness is set to
115 to ballpark the relative IO cost between them. The test run is
preceded by 30 minutes of warmup using the same workload:

transaction type: TPC-B (sort of)
scaling factor: 420
query mode: simple
number of clients: 8
number of threads: 4
duration: 3600 s

vanilla:
number of transactions actually processed: 290360
latency average: 99.187 ms
latency stddev: 261.171 ms
tps = 80.654848 (including connections establishing)
tps = 80.654878 (excluding connections establishing)

patched:
number of transactions actually processed: 377960
latency average: 76.198 ms
latency stddev: 229.411 ms
tps = 104.987704 (including connections establishing)
tps = 104.987743 (excluding connections establishing)

The patched kernel shows a 30% increase in throughput, and a 23%
decrease in average latency. Latency variance is reduced as well.

The reclaim statistics explain the difference in behavior:

                         PGBENCH5.6G-vanilla      PGBENCH5.6G-lrucost
Real time                 3600.49 (  +0.00%)      3600.26 (   -0.01%)
User time                   17.85 (  +0.00%)        18.80 (   +5.05%)
System time                 17.52 (  +0.00%)        17.02 (   -2.72%)
Allocation stalls            3.00 (  +0.00%)         0.00 (  -75.00%)
Anon scanned              6579.00 (  +0.00%)    201845.00 (+2967.57%)
Anon reclaimed            3426.00 (  +0.00%)     86924.00 (+2436.48%)
Anon reclaim efficiency     52.07 (  +0.00%)        43.06 (  -16.98%)
File scanned            364444.00 (  +0.00%)     27706.00 (  -92.40%)
File reclaimed          363136.00 (  +0.00%)     27366.00 (  -92.46%)
File reclaim efficiency     99.64 (  +0.00%)        98.77 (   -0.86%)
Swap out                  3149.00 (  +0.00%)     86932.00 (+2659.78%)
Swap in                    313.00 (  +0.00%)       503.00 (  +60.51%)
File refault            222486.00 (  +0.00%)    101041.00 (  -54.59%)
Total refaults          222799.00 (  +0.00%)    101544.00 (  -54.42%)

The patched kernel works much harder to find idle anonymous pages in
order to alleviate the thrashing of the page cache. And it pays off:
overall, refault IO is cut in half, more time is spent in userspace,
less time is spent in the kernel.

---

The parallelio test from the mmtests package shows the backward
compatibility of the new model. It runs a memcache workload while
copying large files in parallel. The page cache isn't thrashing, so
the VM shouldn't swap except to relieve immediate memory pressure.
Swappiness is reset to the default setting of 60 as well.

parallelio Transactions
                                                vanilla                     lrucost
                                                     60                          60
Min      memcachetest-0M             83736.00 (  0.00%)          84376.00 (  0.76%)
Min      memcachetest-769M           83708.00 (  0.00%)          85038.00 (  1.59%)
Min      memcachetest-2565M          85419.00 (  0.00%)          85740.00 (  0.38%)
Min      memcachetest-4361M          85979.00 (  0.00%)          86746.00 (  0.89%)
Hmean    memcachetest-0M             84805.85 (  0.00%)          84852.31 (  0.05%)
Hmean    memcachetest-769M           84273.56 (  0.00%)          85160.52 (  1.05%)
Hmean    memcachetest-2565M          85792.43 (  0.00%)          85967.59 (  0.20%)
Hmean    memcachetest-4361M          86212.90 (  0.00%)          86891.87 (  0.79%)
Stddev   memcachetest-0M               959.16 (  0.00%)            339.07 ( 64.65%)
Stddev   memcachetest-769M             421.00 (  0.00%)            110.07 ( 73.85%)
Stddev   memcachetest-2565M            277.86 (  0.00%)            252.33 (  9.19%)
Stddev   memcachetest-4361M            193.55 (  0.00%)            106.30 ( 45.08%)
CoeffVar memcachetest-0M                 1.13 (  0.00%)              0.40 ( 64.66%)
CoeffVar memcachetest-769M               0.50 (  0.00%)              0.13 ( 74.13%)
CoeffVar memcachetest-2565M              0.32 (  0.00%)              0.29 (  9.37%)
CoeffVar memcachetest-4361M              0.22 (  0.00%)              0.12 ( 45.51%)
Max      memcachetest-0M             86067.00 (  0.00%)          85129.00 ( -1.09%)
Max      memcachetest-769M           84715.00 (  0.00%)          85305.00 (  0.70%)
Max      memcachetest-2565M          86084.00 (  0.00%)          86320.00 (  0.27%)
Max      memcachetest-4361M          86453.00 (  0.00%)          86996.00 (  0.63%)

parallelio Background IO
                                               vanilla                     lrucost
                                                    60                          60
Min      io-duration-0M                 0.00 (  0.00%)              0.00 (  0.00%)
Min      io-duration-769M               6.00 (  0.00%)              6.00 (  0.00%)
Min      io-duration-2565M             21.00 (  0.00%)             21.00 (  0.00%)
Min      io-duration-4361M             36.00 (  0.00%)             37.00 ( -2.78%)
Amean    io-duration-0M                 0.00 (  0.00%)              0.00 (  0.00%)
Amean    io-duration-769M               6.67 (  0.00%)              6.67 (  0.00%)
Amean    io-duration-2565M             21.67 (  0.00%)             21.67 (  0.00%)
Amean    io-duration-4361M             36.33 (  0.00%)             37.00 ( -1.83%)
Stddev   io-duration-0M                 0.00 (  0.00%)              0.00 (  0.00%)
Stddev   io-duration-769M               0.47 (  0.00%)              0.47 (  0.00%)
Stddev   io-duration-2565M              0.47 (  0.00%)              0.47 (  0.00%)
Stddev   io-duration-4361M              0.47 (  0.00%)              0.00 (100.00%)
CoeffVar io-duration-0M                 0.00 (  0.00%)              0.00 (  0.00%)
CoeffVar io-duration-769M               7.07 (  0.00%)              7.07 (  0.00%)
CoeffVar io-duration-2565M              2.18 (  0.00%)              2.18 (  0.00%)
CoeffVar io-duration-4361M              1.30 (  0.00%)              0.00 (100.00%)
Max      io-duration-0M                 0.00 (  0.00%)              0.00 (  0.00%)
Max      io-duration-769M               7.00 (  0.00%)              7.00 (  0.00%)
Max      io-duration-2565M             22.00 (  0.00%)             22.00 (  0.00%)
Max      io-duration-4361M             37.00 (  0.00%)             37.00 (  0.00%)

parallelio Swap totals
                                               vanilla                     lrucost
                                                    60                          60
Min      swapin-0M                 244169.00 (  0.00%)         281418.00 (-15.26%)
Min      swapin-769M               269973.00 (  0.00%)         231669.00 ( 14.19%)
Min      swapin-2565M              204356.00 (  0.00%)         188934.00 (  7.55%)
Min      swapin-4361M              178044.00 (  0.00%)         147799.00 ( 16.99%)
Min      swaptotal-0M              810441.00 (  0.00%)         832580.00 ( -2.73%)
Min      swaptotal-769M            827282.00 (  0.00%)         705879.00 ( 14.67%)
Min      swaptotal-2565M           690422.00 (  0.00%)         656948.00 (  4.85%)
Min      swaptotal-4361M           660507.00 (  0.00%)         582026.00 ( 11.88%)
Min      minorfaults-0M           2677904.00 (  0.00%)        2706086.00 ( -1.05%)
Min      minorfaults-769M         2731412.00 (  0.00%)        2606587.00 (  4.57%)
Min      minorfaults-2565M        2599647.00 (  0.00%)        2572429.00 (  1.05%)
Min      minorfaults-4361M        2573117.00 (  0.00%)        2514047.00 (  2.30%)
Min      majorfaults-0M             82864.00 (  0.00%)          98005.00 (-18.27%)
Min      majorfaults-769M           95047.00 (  0.00%)          78789.00 ( 17.11%)
Min      majorfaults-2565M          69486.00 (  0.00%)          65934.00 (  5.11%)
Min      majorfaults-4361M          60009.00 (  0.00%)          50955.00 ( 15.09%)
Amean    swapin-0M                 291429.67 (  0.00%)         290184.67 (  0.43%)
Amean    swapin-769M               294641.33 (  0.00%)         247553.33 ( 15.98%)
Amean    swapin-2565M              224398.67 (  0.00%)         199541.33 ( 11.08%)
Amean    swapin-4361M              188710.67 (  0.00%)         155103.67 ( 17.81%)
Amean    swaptotal-0M              877847.33 (  0.00%)         842476.33 (  4.03%)
Amean    swaptotal-769M            860593.67 (  0.00%)         765749.00 ( 11.02%)
Amean    swaptotal-2565M           724284.33 (  0.00%)         674759.67 (  6.84%)
Amean    swaptotal-4361M           669080.67 (  0.00%)         594949.33 ( 11.08%)
Amean    minorfaults-0M           2743339.00 (  0.00%)        2707815.33 (  1.29%)
Amean    minorfaults-769M         2740174.33 (  0.00%)        2656168.33 (  3.07%)
Amean    minorfaults-2565M        2624234.00 (  0.00%)        2579847.00 (  1.69%)
Amean    minorfaults-4361M        2582434.67 (  0.00%)        2525946.33 (  2.19%)
Amean    majorfaults-0M             99845.67 (  0.00%)         101007.33 ( -1.16%)
Amean    majorfaults-769M          101037.67 (  0.00%)          87706.00 ( 13.19%)
Amean    majorfaults-2565M          74771.67 (  0.00%)          68243.67 (  8.73%)
Amean    majorfaults-4361M          62557.33 (  0.00%)          52668.33 ( 15.81%)
Stddev   swapin-0M                  33554.61 (  0.00%)           6370.43 ( 81.01%)
Stddev   swapin-769M                18283.19 (  0.00%)          11586.05 ( 36.63%)
Stddev   swapin-2565M               14314.16 (  0.00%)           9023.96 ( 36.96%)
Stddev   swapin-4361M               11000.92 (  0.00%)           6770.47 ( 38.46%)
Stddev   swaptotal-0M               47680.16 (  0.00%)           8319.84 ( 82.55%)
Stddev   swaptotal-769M             23632.76 (  0.00%)          42426.42 (-79.52%)
Stddev   swaptotal-2565M            24761.63 (  0.00%)          14504.40 ( 41.42%)
Stddev   swaptotal-4361M             8173.20 (  0.00%)           9177.32 (-12.29%)
Stddev   minorfaults-0M             49578.82 (  0.00%)           1928.88 ( 96.11%)
Stddev   minorfaults-769M            7305.53 (  0.00%)          35084.61 (-380.25%)
Stddev   minorfaults-2565M          17393.80 (  0.00%)           5259.94 ( 69.76%)
Stddev   minorfaults-4361M           7780.48 (  0.00%)          10048.60 (-29.15%)
Stddev   majorfaults-0M             12102.64 (  0.00%)           2178.49 ( 82.00%)
Stddev   majorfaults-769M            4839.82 (  0.00%)           6313.49 (-30.45%)
Stddev   majorfaults-2565M           3748.79 (  0.00%)           2707.31 ( 27.78%)
Stddev   majorfaults-4361M           3292.87 (  0.00%)           1466.92 ( 55.45%)
CoeffVar swapin-0M                     11.51 (  0.00%)              2.20 ( 80.93%)
CoeffVar swapin-769M                    6.21 (  0.00%)              4.68 ( 24.58%)
CoeffVar swapin-2565M                   6.38 (  0.00%)              4.52 ( 29.10%)
CoeffVar swapin-4361M                   5.83 (  0.00%)              4.37 ( 25.12%)
CoeffVar swaptotal-0M                   5.43 (  0.00%)              0.99 ( 81.82%)
CoeffVar swaptotal-769M                 2.75 (  0.00%)              5.54 (-101.76%)
CoeffVar swaptotal-2565M                3.42 (  0.00%)              2.15 ( 37.12%)
CoeffVar swaptotal-4361M                1.22 (  0.00%)              1.54 (-26.28%)
CoeffVar minorfaults-0M                 1.81 (  0.00%)              0.07 ( 96.06%)
CoeffVar minorfaults-769M               0.27 (  0.00%)              1.32 (-395.44%)
CoeffVar minorfaults-2565M              0.66 (  0.00%)              0.20 ( 69.24%)
CoeffVar minorfaults-4361M              0.30 (  0.00%)              0.40 (-32.04%)
CoeffVar majorfaults-0M                12.12 (  0.00%)              2.16 ( 82.21%)
CoeffVar majorfaults-769M               4.79 (  0.00%)              7.20 (-50.28%)
CoeffVar majorfaults-2565M              5.01 (  0.00%)              3.97 ( 20.87%)
CoeffVar majorfaults-4361M              5.26 (  0.00%)              2.79 ( 47.09%)
Max      swapin-0M                 318760.00 (  0.00%)         296366.00 (  7.03%)
Max      swapin-769M               313685.00 (  0.00%)         258977.00 ( 17.44%)
Max      swapin-2565M              236882.00 (  0.00%)         210990.00 ( 10.93%)
Max      swapin-4361M              203852.00 (  0.00%)         164117.00 ( 19.49%)
Max      swaptotal-0M              913095.00 (  0.00%)         852936.00 (  6.59%)
Max      swaptotal-769M            879597.00 (  0.00%)         799103.00 (  9.15%)
Max      swaptotal-2565M           748943.00 (  0.00%)         692476.00 (  7.54%)
Max      swaptotal-4361M           680081.00 (  0.00%)         602448.00 ( 11.42%)
Max      minorfaults-0M           2797869.00 (  0.00%)        2710507.00 (  3.12%)
Max      minorfaults-769M         2749296.00 (  0.00%)        2682591.00 (  2.43%)
Max      minorfaults-2565M        2637180.00 (  0.00%)        2584036.00 (  2.02%)
Max      minorfaults-4361M        2592162.00 (  0.00%)        2538624.00 (  2.07%)
Max      majorfaults-0M            110188.00 (  0.00%)         103107.00 (  6.43%)
Max      majorfaults-769M          106900.00 (  0.00%)          92559.00 ( 13.42%)
Max      majorfaults-2565M          77770.00 (  0.00%)          72043.00 (  7.36%)
Max      majorfaults-4361M          67207.00 (  0.00%)          54538.00 ( 18.85%)

             vanilla     lrucost
                  60          60
User         1108.24     1122.37
System       4636.57     4650.63
Elapsed      6046.97     6047.82

                               vanilla     lrucost
                                    60          60
Minor Faults                  34022711    33360104
Major Faults                   1014895      929273
Swap Ins                       2997968     2677588
Swap Outs                      6397877     5956707
Allocation stalls                   27          31
DMA allocs                           0           0
DMA32 allocs                  15080196    14356136
Normal allocs                 26177871    26662120
Movable allocs                       0           0
Direct pages scanned             31625       27194
Kswapd pages scanned          33103442    27727713
Kswapd pages reclaimed        11817394    11598677
Direct pages reclaimed           21146       24043
Kswapd efficiency                  35%         41%
Kswapd velocity               5474.385    4584.745
Direct efficiency                  66%         88%
Direct velocity                  5.230       4.496
Percentage direct scans             0%          0%
Zone normal velocity          3786.073    3908.266
Zone dma32 velocity           1693.542     680.975
Zone dma velocity                0.000       0.000
Page writes by reclaim     6398557.000 5962129.000
Page writes file                   680        5422
Page writes anon               6397877     5956707
Page reclaim immediate            3750       12647
Sector Reads                  12608512    11624860
Sector Writes                 49304260    47539216
Page rescued immediate               0           0
Slabs scanned                   148322      164263
Direct inode steals                  0           0
Kswapd inode steals                  0          22
Kswapd skipped wait                  0           0
THP fault alloc                      6           3
THP collapse alloc                3490        3567
THP splits                           0           0
THP fault fallback                   0           0
THP collapse fail                   13          17
Compaction stalls                  431         446
Compaction success                 405         416
Compaction failures                 26          30
Page migrate success            199708      211181
Page migrate failure                71         121
Compaction pages isolated       425244      452352
Compaction migrate scanned      209471      226018
Compaction free scanned       20950979    23257076
Compaction cost                    216         229
NUMA alloc hit                38459351    38177612
NUMA alloc miss                      0           0
NUMA interleave hit                  0           0
NUMA alloc local              38455861    38174045
NUMA base PTE updates                0           0
NUMA huge PMD updates                0           0
NUMA page range updates              0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA hint local percent            100         100
NUMA pages migrated                  0           0
AutoNUMA cost                       0%          0%

Both the memcache transactions and the background IO throughput are
unchanged.

Overall reclaim activity actually went down in the patched kernel,
since the VM is now deterred by the swapins, whereas previously a
successful swapout followed by a swapin would actually make the anon
LRU more attractive (swapout is a scanned but not rotated page; swapin
puts pages on the inactive list, which used to be a scan event too).
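A toy contrast of the two accounting schemes described above (hypothetical names, not the kernel's actual bookkeeping):

```c
#include <assert.h>

/*
 * Toy illustration only; hypothetical helpers, not kernel code.
 */

/* Old heuristic: anon "value" was rotated/scanned. A swapout counted
 * as a scan without a rotation, and the later swapin was yet another
 * scan event, so a swapout/swapin cycle made the anon list look
 * *less* worth keeping. */
static unsigned long old_anon_value(unsigned long rotated,
				    unsigned long scanned)
{
	return scanned ? 100 * rotated / scanned : 100;
}

/* New model: the swapin itself is refault IO charged against the
 * anon list, so pages that come back from swap raise the measured
 * cost of scanning anon any further. */
static unsigned long new_anon_cost(unsigned long swapins, int swappiness)
{
	return swapins * swappiness;
}
```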

The changes are fairly straightforward, but they do require a new page
flag to tell inactive cache refaults (a cache transition) from active
ones (the existing cache needs more space). On x86-32 PAE, that bumps
us to 22 core flags + 7 section bits + 2 zone bits = 31 bits; the
configurable hwpoison flag makes 32, and thus uses up the last page
flag. However, this is core VM functionality, and we can make new
features 64-bit-only, as we did with page idle tracking.

Thanks

 Documentation/sysctl/vm.txt    |  16 +++--
 fs/cifs/file.c                 |  10 +--
 fs/fuse/dev.c                  |   2 +-
 include/linux/mmzone.h         |  29 ++++----
 include/linux/page-flags.h     |   2 +
 include/linux/pagevec.h        |   2 +-
 include/linux/swap.h           |  11 ++-
 include/trace/events/mmflags.h |   1 +
 kernel/sysctl.c                |   3 +-
 mm/filemap.c                   |   9 +--
 mm/migrate.c                   |   4 ++
 mm/mlock.c                     |   2 +-
 mm/shmem.c                     |   4 +-
 mm/swap.c                      | 124 +++++++++++++++++++---------------
 mm/swap_state.c                |   3 +-
 mm/vmscan.c                    |  48 ++++++-------
 mm/vmstat.c                    |   6 +-
 mm/workingset.c                | 142 +++++++++++++++++++++++++++++----------
 18 files changed, 258 insertions(+), 160 deletions(-)


* [PATCH 01/10] mm: allow swappiness that prefers anon over file
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-07  0:25   ` Minchan Kim
  2016-06-06 19:48 ` [PATCH 02/10] mm: swap: unexport __pagevec_lru_add() Johannes Weiner
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

With the advent of fast random IO devices (SSDs, PMEM) and in-memory
swap devices such as zswap, it's possible for swap to be much faster
than filesystems, and for swapping to be preferable over thrashing
filesystem caches.

Allow setting swappiness - which defines the relative IO cost of cache
misses between page cache and swap-backed pages - to reflect such
situations by making the swap-preferred range configurable.
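An illustrative helper (an assumption for this sketch, not code from the patch) showing how the extended 0..200 range could translate into relative reclaim priorities: 100 treats both lists as equally expensive to refault, and values above 100 mark swap as the cheaper medium.

```c
#include <assert.h>

/* Hypothetical sketch, not the patch's code. */
#define MAX_SWAPPINESS 200

static void swappiness_to_prio(int swappiness, int *anon_prio, int *file_prio)
{
	/* Clamp to the valid sysctl range. */
	if (swappiness < 0)
		swappiness = 0;
	if (swappiness > MAX_SWAPPINESS)
		swappiness = MAX_SWAPPINESS;
	*anon_prio = swappiness;
	*file_prio = MAX_SWAPPINESS - swappiness;
}
```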

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/sysctl/vm.txt | 16 +++++++++++-----
 kernel/sysctl.c             |  3 ++-
 mm/vmscan.c                 |  2 +-
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 720355cbdf45..54030750cd31 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -771,14 +771,20 @@ with no ill effects: errors and warnings on these stats are suppressed.)
 
 swappiness
 
-This control is used to define how aggressive the kernel will swap
-memory pages.  Higher values will increase agressiveness, lower values
-decrease the amount of swap.  A value of 0 instructs the kernel not to
-initiate swap until the amount of free and file-backed pages is less
-than the high water mark in a zone.
+This control is used to define the relative IO cost of cache misses
+between the swap device and the filesystem as a value between 0 and
+200. At 100, the VM assumes equal IO cost and will thus apply memory
+pressure to the page cache and swap-backed pages equally. At 0, the
+kernel will not initiate swap until the amount of free and file-backed
+pages is less than the high watermark in a zone.
 
 The default value is 60.
 
+On non-rotational swap devices, a value of 100 (or higher, depending
+on what's backing the filesystem) is recommended.
+
+For in-memory swap, like zswap, values closer to 200 are recommended.
+
 ==============================================================
 
 - user_reserve_kbytes
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2effd84d83e3..56a9243eb171 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -126,6 +126,7 @@ static int __maybe_unused two = 2;
 static int __maybe_unused four = 4;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
+static int two_hundred = 200;
 static int one_thousand = 1000;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 10000;
@@ -1323,7 +1324,7 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &one_hundred,
+		.extra2		= &two_hundred,
 	},
 #ifdef CONFIG_HUGETLB_PAGE
 	{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..f79010bbcdd4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -136,7 +136,7 @@ struct scan_control {
 #endif
 
 /*
- * From 0 .. 100.  Higher means more swappy.
+ * From 0 .. 200.  Higher means more swappy.
  */
 int vm_swappiness = 60;
 /*
-- 
2.8.3


* [PATCH 02/10] mm: swap: unexport __pagevec_lru_add()
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
  2016-06-06 19:48 ` [PATCH 01/10] mm: allow swappiness that prefers anon over file Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-06 21:32   ` Rik van Riel
                     ` (2 more replies)
  2016-06-06 19:48 ` [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() Johannes Weiner
                   ` (8 subsequent siblings)
  10 siblings, 3 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

There is currently no modular user of this function. We used to have
filesystems that open-coded the page cache instantiation, but luckily
they're all streamlined, and we don't want this to come back.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/swap.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index 95916142fc46..d810c3d95c97 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -860,7 +860,6 @@ void __pagevec_lru_add(struct pagevec *pvec)
 {
 	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
 }
-EXPORT_SYMBOL(__pagevec_lru_add);
 
 /**
  * pagevec_lookup_entries - gang pagecache lookup
-- 
2.8.3


* [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
  2016-06-06 19:48 ` [PATCH 01/10] mm: allow swappiness that prefers anon over file Johannes Weiner
  2016-06-06 19:48 ` [PATCH 02/10] mm: swap: unexport __pagevec_lru_add() Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-06 21:33   ` Rik van Riel
                     ` (2 more replies)
  2016-06-06 19:48 ` [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

They're the same function, and for the purpose of all callers they are
equivalent to lru_cache_add().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/cifs/file.c       | 10 +++++-----
 fs/fuse/dev.c        |  2 +-
 include/linux/swap.h |  2 --
 mm/shmem.c           |  4 ++--
 mm/swap.c            | 40 +++++++++-------------------------------
 mm/swap_state.c      |  2 +-
 6 files changed, 18 insertions(+), 42 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 9793ae0bcaa2..232390879640 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3261,7 +3261,7 @@ cifs_readv_complete(struct work_struct *work)
 	for (i = 0; i < rdata->nr_pages; i++) {
 		struct page *page = rdata->pages[i];
 
-		lru_cache_add_file(page);
+		lru_cache_add(page);
 
 		if (rdata->result == 0 ||
 		    (rdata->result == -EAGAIN && got_bytes)) {
@@ -3321,7 +3321,7 @@ cifs_readpages_read_into_pages(struct TCP_Server_Info *server,
 			 * fill them until the writes are flushed.
 			 */
 			zero_user(page, 0, PAGE_SIZE);
-			lru_cache_add_file(page);
+			lru_cache_add(page);
 			flush_dcache_page(page);
 			SetPageUptodate(page);
 			unlock_page(page);
@@ -3331,7 +3331,7 @@ cifs_readpages_read_into_pages(struct TCP_Server_Info *server,
 			continue;
 		} else {
 			/* no need to hold page hostage */
-			lru_cache_add_file(page);
+			lru_cache_add(page);
 			unlock_page(page);
 			put_page(page);
 			rdata->pages[i] = NULL;
@@ -3488,7 +3488,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 			/* best to give up if we're out of mem */
 			list_for_each_entry_safe(page, tpage, &tmplist, lru) {
 				list_del(&page->lru);
-				lru_cache_add_file(page);
+				lru_cache_add(page);
 				unlock_page(page);
 				put_page(page);
 			}
@@ -3518,7 +3518,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
 			add_credits_and_wake_if(server, rdata->credits, 0);
 			for (i = 0; i < rdata->nr_pages; i++) {
 				page = rdata->pages[i];
-				lru_cache_add_file(page);
+				lru_cache_add(page);
 				unlock_page(page);
 				put_page(page);
 			}
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cbece1221417..c7264d4a7f3f 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -900,7 +900,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
 	get_page(newpage);
 
 	if (!(buf->flags & PIPE_BUF_FLAG_LRU))
-		lru_cache_add_file(newpage);
+		lru_cache_add(newpage);
 
 	err = 0;
 	spin_lock(&cs->req->waitq.lock);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0af2bb2028fd..38fe1e91ba55 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,8 +296,6 @@ extern unsigned long nr_free_pagecache_pages(void);
 
 /* linux/mm/swap.c */
 extern void lru_cache_add(struct page *);
-extern void lru_cache_add_anon(struct page *page);
-extern void lru_cache_add_file(struct page *page);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
diff --git a/mm/shmem.c b/mm/shmem.c
index e418a995427d..ff210317022d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1098,7 +1098,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 		oldpage = newpage;
 	} else {
 		mem_cgroup_migrate(oldpage, newpage);
-		lru_cache_add_anon(newpage);
+		lru_cache_add(newpage);
 		*pagep = newpage;
 	}
 
@@ -1289,7 +1289,7 @@ repeat:
 			goto decused;
 		}
 		mem_cgroup_commit_charge(page, memcg, false, false);
-		lru_cache_add_anon(page);
+		lru_cache_add(page);
 
 		spin_lock(&info->lock);
 		info->alloced++;
diff --git a/mm/swap.c b/mm/swap.c
index d810c3d95c97..d2786a6308dd 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -386,36 +386,6 @@ void mark_page_accessed(struct page *page)
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
-static void __lru_cache_add(struct page *page)
-{
-	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
-
-	get_page(page);
-	if (!pagevec_space(pvec))
-		__pagevec_lru_add(pvec);
-	pagevec_add(pvec, page);
-	put_cpu_var(lru_add_pvec);
-}
-
-/**
- * lru_cache_add: add a page to the page lists
- * @page: the page to add
- */
-void lru_cache_add_anon(struct page *page)
-{
-	if (PageActive(page))
-		ClearPageActive(page);
-	__lru_cache_add(page);
-}
-
-void lru_cache_add_file(struct page *page)
-{
-	if (PageActive(page))
-		ClearPageActive(page);
-	__lru_cache_add(page);
-}
-EXPORT_SYMBOL(lru_cache_add_file);
-
 /**
  * lru_cache_add - add a page to a page list
  * @page: the page to be added to the LRU.
@@ -427,10 +397,18 @@ EXPORT_SYMBOL(lru_cache_add_file);
  */
 void lru_cache_add(struct page *page)
 {
+	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
+
 	VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
-	__lru_cache_add(page);
+
+	get_page(page);
+	if (!pagevec_space(pvec))
+		__pagevec_lru_add(pvec);
+	pagevec_add(pvec, page);
+	put_cpu_var(lru_add_pvec);
 }
+EXPORT_SYMBOL(lru_cache_add);
 
 /**
  * add_page_to_unevictable_list - add a page to the unevictable list
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0d457e7db8d6..5400f814ae12 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -365,7 +365,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_anon(new_page);
+			lru_cache_add(new_page);
 			*new_page_allocated = true;
 			return new_page;
 		}
-- 
2.8.3

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
                   ` (2 preceding siblings ...)
  2016-06-06 19:48 ` [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-06 21:36   ` Rik van Riel
                     ` (2 more replies)
  2016-06-06 19:48 ` [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation Johannes Weiner
                   ` (6 subsequent siblings)
  10 siblings, 3 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

Currently, THP are counted as single pages until they are split right
before being swapped out. However, at that point the VM is already in
the middle of reclaim, and adjusting the LRU balance then is useless.

Always account THP by the number of basepages, and remove the fixup
from the splitting path.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/swap.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index d2786a6308dd..c6936507abb5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -249,13 +249,14 @@ void rotate_reclaimable_page(struct page *page)
 }
 
 static void update_page_reclaim_stat(struct lruvec *lruvec,
-				     int file, int rotated)
+				     int file, int rotated,
+				     unsigned int nr_pages)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	reclaim_stat->recent_scanned[file]++;
+	reclaim_stat->recent_scanned[file] += nr_pages;
 	if (rotated)
-		reclaim_stat->recent_rotated[file]++;
+		reclaim_stat->recent_rotated[file] += nr_pages;
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
@@ -272,7 +273,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
-		update_page_reclaim_stat(lruvec, file, 1);
+		update_page_reclaim_stat(lruvec, file, 1, hpage_nr_pages(page));
 	}
 }
 
@@ -532,7 +533,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 
 	if (active)
 		__count_vm_event(PGDEACTIVATE);
-	update_page_reclaim_stat(lruvec, file, 0);
+	update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
 }
 
 
@@ -549,7 +550,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 		add_page_to_lru_list(page, lruvec, lru);
 
 		__count_vm_event(PGDEACTIVATE);
-		update_page_reclaim_stat(lruvec, file, 0);
+		update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
 	}
 }
 
@@ -809,9 +810,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 		list_head = page_tail->lru.prev;
 		list_move_tail(&page_tail->lru, list_head);
 	}
-
-	if (!PageUnevictable(page))
-		update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -826,7 +824,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
-	update_page_reclaim_stat(lruvec, file, active);
+	update_page_reclaim_stat(lruvec, file, active, hpage_nr_pages(page));
 	trace_mm_lru_insertion(page, lru);
 }
 
-- 
2.8.3

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
                   ` (3 preceding siblings ...)
  2016-06-06 19:48 ` [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-06 21:56   ` Rik van Riel
                     ` (2 more replies)
  2016-06-06 19:48 ` [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing Johannes Weiner
                   ` (5 subsequent siblings)
  10 siblings, 3 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

Isolating an existing LRU page and subsequently putting it back on the
list currently influences the balance between the anon and file LRUs.
For example, heavy page migration or compaction could skew the balance
and make one type more attractive for reclaim simply because that type
of page happens to be affected more than the other. That doesn't make sense.

Add a dedicated LRU cache for putback, so that we can tell new LRU
pages from existing ones at the time of linking them to the lists.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/pagevec.h |  2 +-
 include/linux/swap.h    |  1 +
 mm/mlock.c              |  2 +-
 mm/swap.c               | 34 ++++++++++++++++++++++++++++------
 mm/vmscan.c             |  2 +-
 5 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index b45d391b4540..3f8a2a01131c 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -21,7 +21,7 @@ struct pagevec {
 };
 
 void __pagevec_release(struct pagevec *pvec);
-void __pagevec_lru_add(struct pagevec *pvec);
+void __pagevec_lru_add(struct pagevec *pvec, bool new);
 unsigned pagevec_lookup_entries(struct pagevec *pvec,
 				struct address_space *mapping,
 				pgoff_t start, unsigned nr_entries,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38fe1e91ba55..178f084365c2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -296,6 +296,7 @@ extern unsigned long nr_free_pagecache_pages(void);
 
 /* linux/mm/swap.c */
 extern void lru_cache_add(struct page *);
+extern void lru_cache_putback(struct page *page);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
diff --git a/mm/mlock.c b/mm/mlock.c
index 96f001041928..449c291a286d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -264,7 +264,7 @@ static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
 	 *__pagevec_lru_add() calls release_pages() so we don't call
 	 * put_page() explicitly
 	 */
-	__pagevec_lru_add(pvec);
+	__pagevec_lru_add(pvec, false);
 	count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index c6936507abb5..576c721f210b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -44,6 +44,7 @@
 int page_cluster;
 
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
+static DEFINE_PER_CPU(struct pagevec, lru_putback_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
@@ -405,12 +406,23 @@ void lru_cache_add(struct page *page)
 
 	get_page(page);
 	if (!pagevec_space(pvec))
-		__pagevec_lru_add(pvec);
+		__pagevec_lru_add(pvec, true);
 	pagevec_add(pvec, page);
 	put_cpu_var(lru_add_pvec);
 }
 EXPORT_SYMBOL(lru_cache_add);
 
+void lru_cache_putback(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_putback_pvec);
+
+	get_page(page);
+	if (!pagevec_space(pvec))
+		__pagevec_lru_add(pvec, false);
+	pagevec_add(pvec, page);
+	put_cpu_var(lru_putback_pvec);
+}
+
 /**
  * add_page_to_unevictable_list - add a page to the unevictable list
  * @page:  the page to be added to the unevictable list
@@ -561,10 +573,15 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
  */
 void lru_add_drain_cpu(int cpu)
 {
-	struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
+	struct pagevec *pvec;
+
+	pvec = &per_cpu(lru_add_pvec, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_add(pvec, true);
 
+	pvec = &per_cpu(lru_putback_pvec, cpu);
 	if (pagevec_count(pvec))
-		__pagevec_lru_add(pvec);
+		__pagevec_lru_add(pvec, false);
 
 	pvec = &per_cpu(lru_rotate_pvecs, cpu);
 	if (pagevec_count(pvec)) {
@@ -819,12 +836,17 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 	int file = page_is_file_cache(page);
 	int active = PageActive(page);
 	enum lru_list lru = page_lru(page);
+	bool new = (bool)arg;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
-	update_page_reclaim_stat(lruvec, file, active, hpage_nr_pages(page));
+
+	if (new)
+		update_page_reclaim_stat(lruvec, file, active,
+					 hpage_nr_pages(page));
+
 	trace_mm_lru_insertion(page, lru);
 }
 
@@ -832,9 +854,9 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
-void __pagevec_lru_add(struct pagevec *pvec)
+void __pagevec_lru_add(struct pagevec *pvec, bool new)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, (void *)new);
 }
 
 /**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f79010bbcdd4..8503713bb60e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -737,7 +737,7 @@ redo:
 		 * We know how to handle that.
 		 */
 		is_unevictable = false;
-		lru_cache_add(page);
+		lru_cache_putback(page);
 	} else {
 		/*
 		 * Put unevictable pages directly on zone's unevictable
-- 
2.8.3

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
                   ` (4 preceding siblings ...)
  2016-06-06 19:48 ` [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-07  2:20   ` Rik van Riel
                     ` (2 more replies)
  2016-06-06 19:48 ` [PATCH 07/10] mm: base LRU balancing on an explicit cost model Johannes Weiner
                   ` (4 subsequent siblings)
  10 siblings, 3 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

When the split-LRU patches divided page cache and swap-backed pages
into separate LRU lists, the pressure balance between the lists was
biased to account for the fact that streaming IO can cause memory
pressure with a flood of pages that are used only once. New page cache
additions would tip the balance toward the file LRU, and repeat access
would neutralize that bias again. This ensured that page reclaim would
always go for used-once cache first.

Since e9868505987a ("mm,vmscan: only evict file pages when we have
plenty"), page reclaim generally skips over swap-backed memory
entirely as long as there is used-once cache present, and will apply
the LRU balancing when only repeatedly accessed cache pages are left -
at which point the previous use-once bias will have been neutralized.

This makes the use-once cache balancing bias unnecessary. Remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/swap.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 576c721f210b..814e3a2e54b4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -264,7 +264,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
 
 		del_page_from_lru_list(page, lruvec, lru);
@@ -274,7 +273,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		trace_mm_lru_activate(page);
 
 		__count_vm_event(PGACTIVATE);
-		update_page_reclaim_stat(lruvec, file, 1, hpage_nr_pages(page));
 	}
 }
 
@@ -797,8 +795,6 @@ EXPORT_SYMBOL(__pagevec_release);
 void lru_add_page_tail(struct page *page, struct page *page_tail,
 		       struct lruvec *lruvec, struct list_head *list)
 {
-	const int file = 0;
-
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
@@ -833,20 +829,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 				 void *arg)
 {
-	int file = page_is_file_cache(page);
-	int active = PageActive(page);
 	enum lru_list lru = page_lru(page);
-	bool new = (bool)arg;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
 
-	if (new)
-		update_page_reclaim_stat(lruvec, file, active,
-					 hpage_nr_pages(page));
-
 	trace_mm_lru_insertion(page, lru);
 }
 
-- 
2.8.3

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
                   ` (5 preceding siblings ...)
  2016-06-06 19:48 ` [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-06 19:13   ` kbuild test robot
                     ` (3 more replies)
  2016-06-06 19:48 ` [PATCH 08/10] mm: deactivations shouldn't bias the LRU balance Johannes Weiner
                   ` (3 subsequent siblings)
  10 siblings, 4 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

Currently, scan pressure between the anon and file LRU lists is
balanced based on a mixture of reclaim efficiency and a somewhat vague
notion of "value" of having certain pages in memory over others. That
concept of value is problematic, because it has caused us to count any
event that remotely makes one LRU list more or less preferable for
reclaim, even when these events are not directly comparable to each
other and impose very different costs on the system - such as a
referenced file page that we still deactivate and a referenced
anonymous page that we actually rotate back to the head of the list.

There is also conceptual overlap with the LRU algorithm itself. By
rotating recently used pages instead of reclaiming them, the algorithm
already biases the applied scan pressure based on page value. Thus,
when rebalancing scan pressure due to rotations, we should think of
reclaim cost, and leave assessing the page value to the LRU algorithm.

Lastly, considering both value-increasing and value-decreasing events
can cause the same type of event to be counted twice, i.e. rotating a
page increases the LRU value, while reclaiming it successfully
decreases the value. In itself this will balance out fine, but it
quietly skews the impact of events that are only recorded once.

The abstract metric of "value", the murky relationship with the LRU
algorithm, and accounting both negative and positive events make the
current pressure balancing model hard to reason about and modify.

In preparation for thrashing-based LRU balancing, this patch switches
to a balancing model of accounting the concrete, actually observed
cost of reclaiming one LRU over another. For now, that cost includes
pages that are scanned but rotated back to the list head. Subsequent
patches will add consideration for IO caused by refaulting recently
evicted pages. The idea is to primarily scan the LRU that thrashes the
least, and secondarily scan the LRU that needs the least amount of
work to free memory.

Rename struct zone_reclaim_stat to struct lru_cost, and move from two
separate value ratios for the LRU lists to a relative LRU cost metric
with a shared denominator. Then make everything that affects the cost
go through a new lru_note_cost() function.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h | 23 +++++++++++------------
 include/linux/swap.h   |  2 ++
 mm/swap.c              | 15 +++++----------
 mm/vmscan.c            | 35 +++++++++++++++--------------------
 4 files changed, 33 insertions(+), 42 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02069c23486d..4d257d00fbf5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -191,22 +191,21 @@ static inline int is_active_lru(enum lru_list lru)
 	return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
 }
 
-struct zone_reclaim_stat {
-	/*
-	 * The pageout code in vmscan.c keeps track of how many of the
-	 * mem/swap backed and file backed pages are referenced.
-	 * The higher the rotated/scanned ratio, the more valuable
-	 * that cache is.
-	 *
-	 * The anon LRU stats live in [0], file LRU stats in [1]
-	 */
-	unsigned long		recent_rotated[2];
-	unsigned long		recent_scanned[2];
+/*
+ * This tracks cost of reclaiming one LRU type - file or anon - over
+ * the other. As the observed cost of pressure on one type increases,
+ * the scan balance in vmscan.c tips toward the other type.
+ *
+ * The recorded cost for anon is in numer[0], file in numer[1].
+ */
+struct lru_cost {
+	unsigned long		numer[2];
+	unsigned long		denom;
 };
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
-	struct zone_reclaim_stat	reclaim_stat;
+	struct lru_cost			balance;
 	/* Evictions & activations on the inactive file list */
 	atomic_long_t			inactive_age;
 #ifdef CONFIG_MEMCG
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 178f084365c2..c461ce0533da 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -295,6 +295,8 @@ extern unsigned long nr_free_pagecache_pages(void);
 
 
 /* linux/mm/swap.c */
+extern void lru_note_cost(struct lruvec *lruvec, bool file,
+			  unsigned int nr_pages);
 extern void lru_cache_add(struct page *);
 extern void lru_cache_putback(struct page *page);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
diff --git a/mm/swap.c b/mm/swap.c
index 814e3a2e54b4..645d21242324 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -249,15 +249,10 @@ void rotate_reclaimable_page(struct page *page)
 	}
 }
 
-static void update_page_reclaim_stat(struct lruvec *lruvec,
-				     int file, int rotated,
-				     unsigned int nr_pages)
+void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-
-	reclaim_stat->recent_scanned[file] += nr_pages;
-	if (rotated)
-		reclaim_stat->recent_rotated[file] += nr_pages;
+	lruvec->balance.numer[file] += nr_pages;
+	lruvec->balance.denom += nr_pages;
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
@@ -543,7 +538,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 
 	if (active)
 		__count_vm_event(PGDEACTIVATE);
-	update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
+	lru_note_cost(lruvec, !file, hpage_nr_pages(page));
 }
 
 
@@ -560,7 +555,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 		add_page_to_lru_list(page, lruvec, lru);
 
 		__count_vm_event(PGDEACTIVATE);
-		update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
+		lru_note_cost(lruvec, !file, hpage_nr_pages(page));
 	}
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8503713bb60e..06e381e1004c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1492,7 +1492,6 @@ static int too_many_isolated(struct zone *zone, int file,
 static noinline_for_stack void
 putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 {
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	struct zone *zone = lruvec_zone(lruvec);
 	LIST_HEAD(pages_to_free);
 
@@ -1521,8 +1520,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		if (is_active_lru(lru)) {
 			int file = is_file_lru(lru);
 			int numpages = hpage_nr_pages(page);
-			reclaim_stat->recent_rotated[file] += numpages;
+			/*
+			 * Rotating pages costs CPU without actually
+			 * progressing toward the reclaim goal.
+			 */
+			lru_note_cost(lruvec, file, numpages);
 		}
+
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
@@ -1577,7 +1581,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
 	struct zone *zone = lruvec_zone(lruvec);
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1601,7 +1604,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	update_lru_size(lruvec, lru, -nr_taken);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
-	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc)) {
 		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
@@ -1773,7 +1775,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
 	struct page *page;
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	unsigned long nr_rotated = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
@@ -1793,7 +1794,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	update_lru_size(lruvec, lru, -nr_taken);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
-	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc))
 		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
@@ -1851,7 +1851,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 * helps balance scan pressure between file and anonymous pages in
 	 * get_scan_count.
 	 */
-	reclaim_stat->recent_rotated[file] += nr_rotated;
+	lru_note_cost(lruvec, file, nr_rotated);
 
 	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
@@ -1947,7 +1947,6 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 			   unsigned long *lru_pages)
 {
 	int swappiness = mem_cgroup_swappiness(memcg);
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
 	struct zone *zone = lruvec_zone(lruvec);
@@ -2072,14 +2071,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
 
 	spin_lock_irq(&zone->lru_lock);
-	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
-		reclaim_stat->recent_scanned[0] /= 2;
-		reclaim_stat->recent_rotated[0] /= 2;
-	}
-
-	if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
-		reclaim_stat->recent_scanned[1] /= 2;
-		reclaim_stat->recent_rotated[1] /= 2;
+	if (unlikely(lruvec->balance.denom > (anon + file) / 8)) {
+		lruvec->balance.numer[0] /= 2;
+		lruvec->balance.numer[1] /= 2;
+		lruvec->balance.denom /= 2;
 	}
 
 	/*
@@ -2087,11 +2082,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * proportional to the fraction of recently scanned pages on
 	 * each list that were recently referenced and in active use.
 	 */
-	ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
-	ap /= reclaim_stat->recent_rotated[0] + 1;
+	ap = anon_prio * (lruvec->balance.denom + 1);
+	ap /= lruvec->balance.numer[0] + 1;
 
-	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
-	fp /= reclaim_stat->recent_rotated[1] + 1;
+	fp = file_prio * (lruvec->balance.denom + 1);
+	fp /= lruvec->balance.numer[1] + 1;
 	spin_unlock_irq(&zone->lru_lock);
 
 	fraction[0] = ap;
-- 
2.8.3

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 08/10] mm: deactivations shouldn't bias the LRU balance
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
                   ` (6 preceding siblings ...)
  2016-06-06 19:48 ` [PATCH 07/10] mm: base LRU balancing on an explicit cost model Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-08  8:15   ` Minchan Kim
  2016-06-08 12:57   ` Michal Hocko
  2016-06-06 19:48 ` [PATCH 09/10] mm: only count actual rotations as LRU reclaim cost Johannes Weiner
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

Operations like MADV_FREE, FADV_DONTNEED etc. currently move any
affected active pages to the inactive list to accelerate their reclaim
(good) but also steer page reclaim toward that LRU type, or away from
the other (bad).

The reason why this is undesirable is that such operations are not
part of the regular page aging cycle, and rather a fluke that doesn't
say much about the remaining pages on that list. They might all be in
heavy use. But once the chunk of easy victims has been purged, the VM
continues to apply elevated pressure on the remaining hot pages. The
other LRU, meanwhile, might have easily reclaimable pages, and there
was never a need to steer away from it in the first place.

As the previous patch outlined, we should focus on recording actually
observed cost to steer the balance rather than speculating about the
potential value of one LRU list over the other. In that spirit, leave
explicitly deactivated pages to the LRU algorithm to pick up, and let
rotations decide which list is the easiest to reclaim.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/swap.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 645d21242324..ae07b469ddca 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -538,7 +538,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 
 	if (active)
 		__count_vm_event(PGDEACTIVATE);
-	lru_note_cost(lruvec, !file, hpage_nr_pages(page));
 }
 
 
@@ -546,7 +545,6 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
-		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
 
 		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
@@ -555,7 +553,6 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 		add_page_to_lru_list(page, lruvec, lru);
 
 		__count_vm_event(PGDEACTIVATE);
-		lru_note_cost(lruvec, !file, hpage_nr_pages(page));
 	}
 }
 
-- 
2.8.3

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 09/10] mm: only count actual rotations as LRU reclaim cost
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
                   ` (7 preceding siblings ...)
  2016-06-06 19:48 ` [PATCH 08/10] mm: deactivations shouldn't bias the LRU balance Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-08  8:19   ` Minchan Kim
  2016-06-08 13:18   ` Michal Hocko
  2016-06-06 19:48 ` [PATCH 10/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
  2016-06-07  9:51 ` [PATCH 00/10] " Michal Hocko
  10 siblings, 2 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

Noting a reference on an active file page but still deactivating it
represents a smaller reclaim cost than noting a referenced anonymous
page and physically rotating it back to the head of the list. The
file page *might* refault later on, but deactivating it is definite
progress toward freeing pages, whereas rotating the anonymous page
costs real time without bringing us closer to the reclaim goal.

Don't treat both events as equal. The following patch will hook up LRU
balancing to cache and swap refaults, which are a much more concrete
cost signal for reclaiming one list over the other. Remove the
maybe-IO cost bias from page references, and only note the CPU cost
for actual rotations that prevent the pages from getting reclaimed.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06e381e1004c..acbd212eab6e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1821,7 +1821,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 		if (page_referenced(page, 0, sc->target_mem_cgroup,
 				    &vm_flags)) {
-			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
 			 * give them one more trip around the active list. So
@@ -1832,6 +1831,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			 * so we ignore them here.
 			 */
 			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
+				nr_rotated += hpage_nr_pages(page);
 				list_add(&page->lru, &l_active);
 				continue;
 			}
@@ -1846,10 +1846,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 */
 	spin_lock_irq(&zone->lru_lock);
 	/*
-	 * Count referenced pages from currently used mappings as rotated,
-	 * even though only some of them are actually re-activated.  This
-	 * helps balance scan pressure between file and anonymous pages in
-	 * get_scan_count.
+	 * Rotating pages costs CPU without actually
+	 * progressing toward the reclaim goal.
 	 */
 	lru_note_cost(lruvec, file, nr_rotated);
 
-- 
2.8.3

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
                   ` (8 preceding siblings ...)
  2016-06-06 19:48 ` [PATCH 09/10] mm: only count actual rotations as LRU reclaim cost Johannes Weiner
@ 2016-06-06 19:48 ` Johannes Weiner
  2016-06-06 19:22   ` kbuild test robot
                     ` (3 more replies)
  2016-06-07  9:51 ` [PATCH 00/10] " Michal Hocko
  10 siblings, 4 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 19:48 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, Tim Chen, kernel-team

Since the LRUs were split into anon and file lists, the VM has been
balancing between page cache and anonymous pages based on per-list
ratios of scanned vs. rotated pages. In most cases that tips page
reclaim towards the list that is easier to reclaim and has the fewest
actively used pages, but there are a few problems with it:

1. Refaults and in-memory rotations are weighted the same way, even
   though one costs IO and the other costs CPU. When the balance is
   off, the page cache can be thrashing while anonymous pages are
   aged comparatively more slowly and thus have more time to get even
   their coldest pages referenced. The VM would consider this a fair
   equilibrium.

2. The page cache usually has a share of use-once pages that further
   dilutes its scanned/rotated ratio in the above-mentioned scenario.
   This can stall scanning of the anonymous list almost entirely -
   again while the page cache is thrashing and IO-bound.

Historically, swap has been an emergency overflow for high memory
pressure, and we avoided using it as long as new page allocations
could be served from recycling page cache. However, when recycling
page cache incurs a higher cost in IO than swapping out a few unused
anonymous pages would, it makes sense to increase swap pressure.

In order to accomplish this, we can extend the thrash detection code
that currently detects workingset changes within the page cache: when
inactive cache pages are thrashing, the VM raises LRU pressure on the
otherwise protected active file list to increase competition. However,
when active pages begin refaulting as well, it means that the page
cache is thrashing as a whole and the LRU balance should tip toward
anonymous. This is what this patch implements.

To tell inactive from active refaults, a page flag is introduced that
marks pages that have been on the active list in their lifetime. This
flag is remembered in the shadow page entry on reclaim, and restored
when the page refaults. It is also set on anonymous pages during
swapin. When a page with that flag set is added to the LRU, the LRU
balance is adjusted for the IO cost of reclaiming the thrashing list.

Rotations continue to influence the LRU balance as well, but with a
different weight factor. That factor is statically chosen such that
refaults are considered more costly than rotations at this point. We
might want to revisit this for ultra-fast swap or secondary memory
devices, where rotating referenced pages might be more costly than
swapping or relocating them directly and letting some of them refault.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h         |   6 +-
 include/linux/page-flags.h     |   2 +
 include/linux/swap.h           |  10 ++-
 include/trace/events/mmflags.h |   1 +
 mm/filemap.c                   |   9 +--
 mm/migrate.c                   |   4 ++
 mm/swap.c                      |  38 ++++++++++-
 mm/swap_state.c                |   1 +
 mm/vmscan.c                    |   5 +-
 mm/vmstat.c                    |   6 +-
 mm/workingset.c                | 142 +++++++++++++++++++++++++++++++----------
 11 files changed, 172 insertions(+), 52 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4d257d00fbf5..d7aaee25b536 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -148,9 +148,9 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
-	WORKINGSET_REFAULT,
-	WORKINGSET_ACTIVATE,
-	WORKINGSET_NODERECLAIM,
+	REFAULT_INACTIVE_FILE,
+	REFAULT_ACTIVE_FILE,
+	REFAULT_NODERECLAIM,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e5a32445f930..a1b9d7dddd68 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -79,6 +79,7 @@ enum pageflags {
 	PG_dirty,
 	PG_lru,
 	PG_active,
+	PG_workingset,
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
@@ -259,6 +260,7 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
+PAGEFLAG(Workingset, workingset, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c461ce0533da..9923b51ee8e9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,7 +250,7 @@ struct swap_info_struct {
 
 /* linux/mm/workingset.c */
 void *workingset_eviction(struct address_space *mapping, struct page *page);
-bool workingset_refault(void *shadow);
+void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 extern struct list_lru workingset_shadow_nodes;
 
@@ -295,8 +295,12 @@ extern unsigned long nr_free_pagecache_pages(void);
 
 
 /* linux/mm/swap.c */
-extern void lru_note_cost(struct lruvec *lruvec, bool file,
-			  unsigned int nr_pages);
+enum lru_cost_type {
+	COST_CPU,
+	COST_IO,
+};
+extern void lru_note_cost(struct lruvec *lruvec, enum lru_cost_type cost,
+			  bool file, unsigned int nr_pages);
 extern void lru_cache_add(struct page *);
 extern void lru_cache_putback(struct page *page);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 43cedbf0c759..bc05e0ac1b8c 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -86,6 +86,7 @@
 	{1UL << PG_dirty,		"dirty"		},		\
 	{1UL << PG_lru,			"lru"		},		\
 	{1UL << PG_active,		"active"	},		\
+	{1UL << PG_workingset,		"workingset"	},		\
 	{1UL << PG_slab,		"slab"		},		\
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
 	{1UL << PG_arch_1,		"arch_1"	},		\
diff --git a/mm/filemap.c b/mm/filemap.c
index 9665b1d4f318..1b356b47381b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -700,12 +700,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 		 * data from the working set, only to cache data that will
 		 * get overwritten with something else, is a waste of memory.
 		 */
-		if (!(gfp_mask & __GFP_WRITE) &&
-		    shadow && workingset_refault(shadow)) {
-			SetPageActive(page);
-			workingset_activation(page);
-		} else
-			ClearPageActive(page);
+		WARN_ON_ONCE(PageActive(page));
+		if (!(gfp_mask & __GFP_WRITE) && shadow)
+			workingset_refault(page, shadow);
 		lru_cache_add(page);
 	}
 	return ret;
diff --git a/mm/migrate.c b/mm/migrate.c
index 9baf41c877ff..115d49441c6c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -544,6 +544,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 		SetPageActive(newpage);
 	} else if (TestClearPageUnevictable(page))
 		SetPageUnevictable(newpage);
+	if (PageWorkingset(page))
+		SetPageWorkingset(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
@@ -1809,6 +1811,8 @@ fail_putback:
 		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 		/* Reverse changes made by migrate_page_copy() */
+		if (TestClearPageWorkingset(new_page))
+			ClearPageWorkingset(page);
 		if (TestClearPageActive(new_page))
 			SetPageActive(page);
 		if (TestClearPageUnevictable(new_page))
diff --git a/mm/swap.c b/mm/swap.c
index ae07b469ddca..cb6773e1424e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -249,8 +249,28 @@ void rotate_reclaimable_page(struct page *page)
 	}
 }
 
-void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
+void lru_note_cost(struct lruvec *lruvec, enum lru_cost_type cost,
+		   bool file, unsigned int nr_pages)
 {
+	if (cost == COST_IO) {
+		/*
+		 * Reflect the relative reclaim cost between incurring
+		 * IO from refaults on one hand, and incurring CPU
+		 * cost from rotating scanned pages on the other.
+		 *
+		 * XXX: For now, the relative cost factor for IO is
+		 * set statically to outweigh the cost of rotating
+		 * referenced pages. This might change with ultra-fast
+		 * IO devices, or with secondary memory devices that
+		 * allow users continued access of swapped out pages.
+		 *
+		 * Until then, the value is chosen simply such that we
+		 * balance for IO cost first and optimize for CPU only
+		 * once the thrashing subsides.
+		 */
+		nr_pages *= SWAP_CLUSTER_MAX;
+	}
+
 	lruvec->balance.numer[file] += nr_pages;
 	lruvec->balance.denom += nr_pages;
 }
@@ -262,6 +282,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		int lru = page_lru_base_type(page);
 
 		del_page_from_lru_list(page, lruvec, lru);
+		SetPageWorkingset(page);
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
@@ -821,13 +842,28 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 				 void *arg)
 {
+	unsigned int nr_pages = hpage_nr_pages(page);
 	enum lru_list lru = page_lru(page);
+	bool active = is_active_lru(lru);
+	bool file = is_file_lru(lru);
+	bool new = (bool)arg;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
 	SetPageLRU(page);
 	add_page_to_lru_list(page, lruvec, lru);
 
+	if (new) {
+		/*
+		 * If the workingset is thrashing, note the IO cost of
+		 * reclaiming that list and steer reclaim away from it.
+		 */
+		if (PageWorkingset(page))
+			lru_note_cost(lruvec, COST_IO, file, nr_pages);
+		else if (active)
+			SetPageWorkingset(page);
+	}
+
 	trace_mm_lru_insertion(page, lru);
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5400f814ae12..43561a56ba5d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -365,6 +365,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			/*
 			 * Initiate read into locked page and return.
 			 */
+			SetPageWorkingset(new_page);
 			lru_cache_add(new_page);
 			*new_page_allocated = true;
 			return new_page;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index acbd212eab6e..b2cb4f4f9d31 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1216,6 +1216,7 @@ activate_locked:
 		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
 			try_to_free_swap(page);
 		VM_BUG_ON_PAGE(PageActive(page), page);
+		SetPageWorkingset(page);
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -1524,7 +1525,7 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			 * Rotating pages costs CPU without actually
 			 * progressing toward the reclaim goal.
 			 */
-			lru_note_cost(lruvec, file, numpages);
+			lru_note_cost(lruvec, COST_CPU, file, numpages);
 		}
 
 		if (put_page_testzero(page)) {
@@ -1849,7 +1850,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	 * Rotating pages costs CPU without actually
 	 * progressing toward the reclaim goal.
 	 */
-	lru_note_cost(lruvec, file, nr_rotated);
+	lru_note_cost(lruvec, COST_CPU, file, nr_rotated);
 
 	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 77e42ef388c2..6c8d658f5b7f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -727,9 +727,9 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
-	"workingset_refault",
-	"workingset_activate",
-	"workingset_nodereclaim",
+	"refault_inactive_file",
+	"refault_active_file",
+	"refault_nodereclaim",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 8a75f8d2916a..261cf583fb62 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -118,7 +118,7 @@
  * the only thing eating into inactive list space is active pages.
  *
  *
- *		Activating refaulting pages
+ *		Refaulting inactive pages
  *
  * All that is known about the active list is that the pages have been
  * accessed more than once in the past.  This means that at any given
@@ -131,6 +131,10 @@
  * used less frequently than the refaulting page - or even not used at
  * all anymore.
  *
+ * That means, if inactive cache is refaulting with a suitable refault
+ * distance, we assume the cache workingset is transitioning and put
+ * pressure on the existing cache pages on the active list.
+ *
  * If this is wrong and demotion kicks in, the pages which are truly
  * used more frequently will be reactivated while the less frequently
  * used once will be evicted from memory.
@@ -139,6 +143,30 @@
  * and the used pages get to stay in cache.
  *
  *
+ *		Refaulting active pages
+ *
+ * If, on the other hand, the refaulting pages have been recently
+ * deactivated, it means that the active list is no longer protecting
+ * actively used cache from reclaim: the cache is not transitioning to
+ * a different workingset, the existing workingset is thrashing in the
+ * space allocated to the page cache.
+ *
+ * When that is the case, mere activation of the refaulting pages is
+ * not enough. The page reclaim code needs to be informed of the high
+ * IO cost associated with the continued reclaim of page cache, so
+ * that it can steer pressure to the anonymous list.
+ *
+ * Just as when refaulting inactive pages, it's possible that there
+ * are cold(er) anonymous pages that can be swapped and forgotten in
+ * order to increase the space available to the page cache as a whole.
+ *
+ * If anonymous pages start thrashing as well, the reclaim scanner
+ * will aim for the list that imposes the lowest cost on the system,
+ * where cost is defined as:
+ *
+ *	refault rate * relative IO cost (as determined by swappiness)
+ *
+ *
  *		Implementation
  *
  * For each zone's file LRU lists, a counter for inactive evictions
@@ -150,10 +178,25 @@
  *
  * On cache misses for which there are shadow entries, an eligible
  * refault distance will immediately activate the refaulting page.
+ *
+ * On activation, cache pages are marked PageWorkingset, which is not
+ * cleared until the page is freed. Shadow entries will remember that
+ * flag to be able to tell inactive from active refaults. Refaults of
+ * previous workingset pages will restore that page flag and inform
+ * page reclaim of the IO cost.
+ *
+ * XXX: Since we don't track anonymous references, every swap-in event
+ * is considered a workingset refault - regardless of distance. Swapin
+ * floods will thus always raise the assumed IO cost of reclaiming the
+ * anonymous LRU lists, even if the pages haven't been used recently.
+ * Temporary events don't matter that much other than they might delay
+ * the stabilization a bit. But during continuous thrashing, anonymous
+ * pages can have a leg-up against page cache. This might need fixing
+ * for ultra-fast IO devices or secondary memory types.
  */
 
-#define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
-			 ZONES_SHIFT + NODES_SHIFT +	\
+#define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY +			\
+			 1 + ZONES_SHIFT + NODES_SHIFT +		\
 			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
@@ -167,24 +210,29 @@
  */
 static unsigned int bucket_order __read_mostly;
 
-static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
+static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction,
+			 bool workingset)
 {
 	eviction >>= bucket_order;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
 	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
+	eviction = (eviction << 1) | workingset;
 	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
 
 	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
 static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
-			  unsigned long *evictionp)
+			  unsigned long *evictionp, bool *workingsetp)
 {
 	unsigned long entry = (unsigned long)shadow;
 	int memcgid, nid, zid;
+	bool workingset;
 
 	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	workingset = entry & 1;
+	entry >>= 1;
 	zid = entry & ((1UL << ZONES_SHIFT) - 1);
 	entry >>= ZONES_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
@@ -195,6 +243,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
 	*memcgidp = memcgid;
 	*zonep = NODE_DATA(nid)->node_zones + zid;
 	*evictionp = entry << bucket_order;
+	*workingsetp = workingset;
 }
 
 /**
@@ -220,19 +269,18 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 
 	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
-	return pack_shadow(memcgid, zone, eviction);
+	return pack_shadow(memcgid, zone, eviction, PageWorkingset(page));
 }
 
 /**
  * workingset_refault - evaluate the refault of a previously evicted page
+ * @page: the freshly allocated replacement page
  * @shadow: shadow entry of the evicted page
  *
  * Calculates and evaluates the refault distance of the previously
  * evicted page in the context of the zone it was allocated in.
- *
- * Returns %true if the page should be activated, %false otherwise.
  */
-bool workingset_refault(void *shadow)
+void workingset_refault(struct page *page, void *shadow)
 {
 	unsigned long refault_distance;
 	unsigned long active_file;
@@ -240,10 +288,12 @@ bool workingset_refault(void *shadow)
 	unsigned long eviction;
 	struct lruvec *lruvec;
 	unsigned long refault;
+	unsigned long anon;
 	struct zone *zone;
+	bool workingset;
 	int memcgid;
 
-	unpack_shadow(shadow, &memcgid, &zone, &eviction);
+	unpack_shadow(shadow, &memcgid, &zone, &eviction, &workingset);
 
 	rcu_read_lock();
 	/*
@@ -263,40 +313,64 @@ bool workingset_refault(void *shadow)
 	 * configurations instead.
 	 */
 	memcg = mem_cgroup_from_id(memcgid);
-	if (!mem_cgroup_disabled() && !memcg) {
-		rcu_read_unlock();
-		return false;
-	}
+	if (!mem_cgroup_disabled() && !memcg)
+		goto out;
 	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
-	rcu_read_unlock();
+	if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
+		anon = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
+		       lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
+	else
+		anon = 0;
 
 	/*
-	 * The unsigned subtraction here gives an accurate distance
-	 * across inactive_age overflows in most cases.
+	 * Calculate the refault distance.
 	 *
-	 * There is a special case: usually, shadow entries have a
-	 * short lifetime and are either refaulted or reclaimed along
-	 * with the inode before they get too old.  But it is not
-	 * impossible for the inactive_age to lap a shadow entry in
-	 * the field, which can then can result in a false small
-	 * refault distance, leading to a false activation should this
-	 * old entry actually refault again.  However, earlier kernels
-	 * used to deactivate unconditionally with *every* reclaim
-	 * invocation for the longest time, so the occasional
-	 * inappropriate activation leading to pressure on the active
-	 * list is not a problem.
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old.  But it is not impossible for the
+	 * inactive_age to lap a shadow entry in the field, which can
+	 * then can result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again.  However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
-	inc_zone_state(zone, WORKINGSET_REFAULT);
+	/*
+	 * Compare the distance with the existing workingset. We don't
+	 * act on pages that couldn't stay resident even with all the
+	 * memory available to the page cache.
+	 */
+	if (refault_distance > active_file + anon)
+		goto out;
 
-	if (refault_distance <= active_file) {
-		inc_zone_state(zone, WORKINGSET_ACTIVATE);
-		return true;
+	/*
+	 * If inactive cache is refaulting, activate the page to
+	 * challenge the current cache workingset. The existing cache
+	 * might be stale, or at least colder than the contender.
+	 *
+	 * If active cache is refaulting (PageWorkingset set at time
+	 * of eviction), it means that the page cache as a whole is
+	 * thrashing. Restore PageWorkingset to inform the LRU code
+	 * about the additional cost of reclaiming more page cache.
+	 */
+	SetPageActive(page);
+	atomic_long_inc(&lruvec->inactive_age);
+
+	if (workingset) {
+		SetPageWorkingset(page);
+		inc_zone_state(zone, REFAULT_ACTIVE_FILE);
+	} else {
+		inc_zone_state(zone, REFAULT_INACTIVE_FILE);
 	}
-	return false;
+out:
+	rcu_read_unlock();
 }
 
 /**
@@ -433,7 +507,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 		}
 	}
 	BUG_ON(node->count);
-	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+	inc_zone_state(page_zone(virt_to_page(node)), REFAULT_NODERECLAIM);
 	if (!__radix_tree_delete_node(&mapping->page_tree, node))
 		BUG();
 
-- 
2.8.3

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 02/10] mm: swap: unexport __pagevec_lru_add()
  2016-06-06 19:48 ` [PATCH 02/10] mm: swap: unexport __pagevec_lru_add() Johannes Weiner
@ 2016-06-06 21:32   ` Rik van Riel
  2016-06-07  9:07   ` Michal Hocko
  2016-06-08  7:14   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Rik van Riel @ 2016-06-06 21:32 UTC (permalink / raw)
  To: Johannes Weiner, linux-mm, linux-kernel
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, Andi Kleen,
	Michal Hocko, Tim Chen, kernel-team

[-- Attachment #1: Type: text/plain, Size: 408 bytes --]

On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> There is currently no modular user of this function. We used to have
> filesystems that open-coded the page cache instantiation, but luckily
> they're all streamlined, and we don't want this to come back.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()
  2016-06-06 19:48 ` [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() Johannes Weiner
@ 2016-06-06 21:33   ` Rik van Riel
  2016-06-07  9:12   ` Michal Hocko
  2016-06-08  7:24   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Rik van Riel @ 2016-06-06 21:33 UTC (permalink / raw)
  To: Johannes Weiner, linux-mm, linux-kernel
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, Andi Kleen,
	Michal Hocko, Tim Chen, kernel-team

[-- Attachment #1: Type: text/plain, Size: 309 bytes --]

On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> They're the same function, and for the purpose of all callers they
> are
> equivalent to lru_cache_add().
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages
  2016-06-06 19:48 ` [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
@ 2016-06-06 21:36   ` Rik van Riel
  2016-06-07  9:19   ` Michal Hocko
  2016-06-08  7:28   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Rik van Riel @ 2016-06-06 21:36 UTC (permalink / raw)
  To: Johannes Weiner, linux-mm, linux-kernel
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, Andi Kleen,
	Michal Hocko, Tim Chen, kernel-team

[-- Attachment #1: Type: text/plain, Size: 522 bytes --]

On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> Currently, THP are counted as single pages until they are split right
> before being swapped out. However, at that point the VM is already in
> the middle of reclaim, and adjusting the LRU balance then is useless.
> 
> Always account THP by the number of basepages, and remove the fixup
> from the splitting path.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-06 19:48 ` [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation Johannes Weiner
@ 2016-06-06 21:56   ` Rik van Riel
  2016-06-06 22:15     ` Johannes Weiner
  2016-06-07  9:49   ` Michal Hocko
  2016-06-08  7:39   ` Minchan Kim
  2 siblings, 1 reply; 67+ messages in thread
From: Rik van Riel @ 2016-06-06 21:56 UTC (permalink / raw)
  To: Johannes Weiner, linux-mm, linux-kernel
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, Andi Kleen,
	Michal Hocko, Tim Chen, kernel-team

[-- Attachment #1: Type: text/plain, Size: 591 bytes --]

On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> 
> +void lru_cache_putback(struct page *page)
> +{
> +	struct pagevec *pvec = &get_cpu_var(lru_putback_pvec);
> +
> +	get_page(page);
> +	if (!pagevec_space(pvec))
> +		__pagevec_lru_add(pvec, false);
> +	pagevec_add(pvec, page);
> +	put_cpu_var(lru_putback_pvec);
> +}
> 

Wait a moment.

So now we have a putback_lru_page, which does adjust
the statistics, and an lru_cache_putback which does
not?

This function could use a name that is not as similar
to its counterpart :)

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-06 21:56   ` Rik van Riel
@ 2016-06-06 22:15     ` Johannes Weiner
  2016-06-07  1:11       ` Rik van Riel
  2016-06-07  9:26       ` Michal Hocko
  0 siblings, 2 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-06 22:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 05:56:09PM -0400, Rik van Riel wrote:
> On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> > 
> > +void lru_cache_putback(struct page *page)
> > +{
> > +	struct pagevec *pvec = &get_cpu_var(lru_putback_pvec);
> > +
> > +	get_page(page);
> > +	if (!pagevec_space(pvec))
> > +		__pagevec_lru_add(pvec, false);
> > +	pagevec_add(pvec, page);
> > +	put_cpu_var(lru_putback_pvec);
> > +}
> > 
> 
> Wait a moment.
> 
> So now we have a putback_lru_page, which does adjust
> the statistics, and an lru_cache_putback which does
> not?
> 
> This function could use a name that is not as similar
> to its counterpart :)

lru_cache_add() and lru_cache_putback() are the two sibling functions,
where the first influences the LRU balance and the second one doesn't.

The last hunk in the patch (obscured by showing the label instead of
the function name as context) updates putback_lru_page() from using
lru_cache_add() to using lru_cache_putback().

Does that make sense?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-06 19:48 ` [PATCH 10/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
  2016-06-06 19:22   ` kbuild test robot
@ 2016-06-06 23:50   ` Tim Chen
  2016-06-07 16:23     ` Johannes Weiner
  2016-06-08 13:58   ` Michal Hocko
  2016-06-10  2:19   ` Minchan Kim
  3 siblings, 1 reply; 67+ messages in thread
From: Tim Chen @ 2016-06-06 23:50 UTC (permalink / raw)
  To: Johannes Weiner, linux-mm, linux-kernel
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Andrea Arcangeli,
	Andi Kleen, Michal Hocko, kernel-team

On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> Since the LRUs were split into anon and file lists, the VM has been
> balancing between page cache and anonymous pages based on per-list
> ratios of scanned vs. rotated pages. In most cases that tips page
> reclaim towards the list that is easier to reclaim and has the fewest
> actively used pages, but there are a few problems with it:
> 
> 1. Refaults and in-memory rotations are weighted the same way, even
>    though one costs IO and the other costs CPU. When the balance is
>    off, the page cache can be thrashing while anonymous pages are aged
>    comparably slower and thus have more time to get even their coldest
>    pages referenced. The VM would consider this a fair equilibrium.
> 
> 2. The page cache usually has a share of use-once pages that will
>    further dilute its scanned/rotated ratio in the above-mentioned
>    scenario. This can cease scanning of the anonymous list almost
>    entirely - again while the page cache is thrashing and IO-bound.
> 
> Historically, swap has been an emergency overflow for high memory
> pressure, and we avoided using it as long as new page allocations
> could be served from recycling page cache. However, when recycling
> page cache incurs a higher cost in IO than swapping out a few unused
> anonymous pages would, it makes sense to increase swap pressure.
> 
> In order to accomplish this, we can extend the thrash detection code
> that currently detects workingset changes within the page cache: when
> inactive cache pages are thrashing, the VM raises LRU pressure on the
> otherwise protected active file list to increase competition. However,
> when active pages begin refaulting as well, it means that the page
> cache is thrashing as a whole and the LRU balance should tip toward
> anonymous. This is what this patch implements.
> 
> To tell inactive from active refaults, a page flag is introduced that
> marks pages that have been on the active list in their lifetime. This
> flag is remembered in the shadow page entry on reclaim, and restored
> when the page refaults. It is also set on anonymous pages during
> swapin. When a page with that flag set is added to the LRU, the LRU
> balance is adjusted for the IO cost of reclaiming the thrashing list.

Johannes,

It seems like you are saying that the shadow entry is also present
for anonymous pages that are swapped out.  But once a page is swapped
out, its entry is removed from the radix tree, and we won't be able
to store a shadow entry for it the way we do for file-mapped pages
in __remove_mapping.  Or are you thinking of modifying
the current code to keep the radix tree entry? I may be missing something,
so I would appreciate it if you could clarify.

Thanks.

Tim

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 01/10] mm: allow swappiness that prefers anon over file
  2016-06-06 19:48 ` [PATCH 01/10] mm: allow swappiness that prefers anon over file Johannes Weiner
@ 2016-06-07  0:25   ` Minchan Kim
  2016-06-07 14:18     ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2016-06-07  0:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

Hi Johannes,

Thanks for the nice work. I haven't read the whole patchset yet, but the
design makes sense to me, so it should be an improvement for zram-based
workloads compared to the current behavior.

On Mon, Jun 06, 2016 at 03:48:27PM -0400, Johannes Weiner wrote:
> With the advent of fast random IO devices (SSDs, PMEM) and in-memory
> swap devices such as zswap, it's possible for swap to be much faster
> than filesystems, and for swapping to be preferable over thrashing
> filesystem caches.
> 
> Allow setting swappiness - which defines the relative IO cost of cache
> misses between page cache and swap-backed pages - to reflect such
> situations by making the swap-preferred range configurable.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/sysctl/vm.txt | 16 +++++++++++-----
>  kernel/sysctl.c             |  3 ++-
>  mm/vmscan.c                 |  2 +-
>  3 files changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 720355cbdf45..54030750cd31 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -771,14 +771,20 @@ with no ill effects: errors and warnings on these stats are suppressed.)
>  
>  swappiness
>  
> -This control is used to define how aggressive the kernel will swap
> -memory pages.  Higher values will increase agressiveness, lower values
> -decrease the amount of swap.  A value of 0 instructs the kernel not to
> -initiate swap until the amount of free and file-backed pages is less
> -than the high water mark in a zone.
> +This control is used to define the relative IO cost of cache misses
> +between the swap device and the filesystem as a value between 0 and
> +200. At 100, the VM assumes equal IO cost and will thus apply memory
> +pressure to the page cache and swap-backed pages equally. At 0, the
> +kernel will not initiate swap until the amount of free and file-backed
> +pages is less than the high watermark in a zone.

Generally, I agree that extending the swappiness range is good, but I'm
not sure 200 is enough to represent the speed gap between file and swap
storage in every case. - Just a nitpick.

Some years ago, I extended it to 200 like your patch does and experimented
with it on zram in our platform workload. At that time, app switching
became terribly slow if swappiness was higher than 150.
Although that was highly workload-dependent, I think it's dangerous to
recommend such values before fixing the balancing between file and anon.
IOW, I think this patch should be the last one in this patchset.

>  
>  The default value is 60.
>  
> +On non-rotational swap devices, a value of 100 (or higher, depending
> +on what's backing the filesystem) is recommended.
> +
> +For in-memory swap, like zswap, values closer to 200 are recommended.

                maybe, like zram

I'm not sure it would be a good suggestion for zswap, because zswap ends
up writing cached pages to the backing swap device once it reaches its
threshold. Then the cost is compression + decompression + write I/O,
which is heavier than a normal swap device (i.e., write I/O alone).
OTOH, zram has no writeback I/O or decompression cost on that path.

> +
>  ==============================================================
>  
>  - user_reserve_kbytes
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 2effd84d83e3..56a9243eb171 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -126,6 +126,7 @@ static int __maybe_unused two = 2;
>  static int __maybe_unused four = 4;
>  static unsigned long one_ul = 1;
>  static int one_hundred = 100;
> +static int two_hundred = 200;
>  static int one_thousand = 1000;
>  #ifdef CONFIG_PRINTK
>  static int ten_thousand = 10000;
> @@ -1323,7 +1324,7 @@ static struct ctl_table vm_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec_minmax,
>  		.extra1		= &zero,
> -		.extra2		= &one_hundred,
> +		.extra2		= &two_hundred,
>  	},
>  #ifdef CONFIG_HUGETLB_PAGE
>  	{
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c4a2f4512fca..f79010bbcdd4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -136,7 +136,7 @@ struct scan_control {
>  #endif
>  
>  /*
> - * From 0 .. 100.  Higher means more swappy.
> + * From 0 .. 200.  Higher means more swappy.
>   */
>  int vm_swappiness = 60;
>  /*
> -- 
> 2.8.3
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-06 22:15     ` Johannes Weiner
@ 2016-06-07  1:11       ` Rik van Riel
  2016-06-07 13:57         ` Johannes Weiner
  2016-06-07  9:26       ` Michal Hocko
  1 sibling, 1 reply; 67+ messages in thread
From: Rik van Riel @ 2016-06-07  1:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

[-- Attachment #1: Type: text/plain, Size: 1326 bytes --]

On Mon, 2016-06-06 at 18:15 -0400, Johannes Weiner wrote:
> On Mon, Jun 06, 2016 at 05:56:09PM -0400, Rik van Riel wrote:
> > 
> > On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> > > 
> > >  
> > > +void lru_cache_putback(struct page *page)
> > > +{
> > > +	struct pagevec *pvec = &get_cpu_var(lru_putback_pvec);
> > > +
> > > +	get_page(page);
> > > +	if (!pagevec_space(pvec))
> > > +		__pagevec_lru_add(pvec, false);
> > > +	pagevec_add(pvec, page);
> > > +	put_cpu_var(lru_putback_pvec);
> > > +}
> > > 
> > Wait a moment.
> > 
> > So now we have a putback_lru_page, which does adjust
> > the statistics, and an lru_cache_putback which does
> > not?
> > 
> > This function could use a name that is not as similar
> > to its counterpart :)
> lru_cache_add() and lru_cache_putback() are the two sibling functions,
> where the first influences the LRU balance and the second one doesn't.
> 
> The last hunk in the patch (obscured by showing the label instead of
> the function name as context) updates putback_lru_page() from using
> lru_cache_add() to using lru_cache_putback().
> 
> Does that make sense?

That means the page reclaim does not update the
"rotated" statistics.  That seems undesirable,
no?  Am I overlooking something?


-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing
  2016-06-06 19:48 ` [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing Johannes Weiner
@ 2016-06-07  2:20   ` Rik van Riel
  2016-06-07 14:11     ` Johannes Weiner
  2016-06-08  8:03   ` Minchan Kim
  2016-06-08 12:31   ` Michal Hocko
  2 siblings, 1 reply; 67+ messages in thread
From: Rik van Riel @ 2016-06-07  2:20 UTC (permalink / raw)
  To: Johannes Weiner, linux-mm, linux-kernel
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, Andi Kleen,
	Michal Hocko, Tim Chen, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1146 bytes --]

On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> When the splitlru patches divided page cache and swap-backed pages
> into separate LRU lists, the pressure balance between the lists was
> biased to account for the fact that streaming IO can cause memory
> pressure with a flood of pages that are used only once. New page cache
> additions would tip the balance toward the file LRU, and repeat access
> would neutralize that bias again. This ensured that page reclaim would
> always go for used-once cache first.
> 
> Since e9868505987a ("mm,vmscan: only evict file pages when we have
> plenty"), page reclaim generally skips over swap-backed memory
> entirely as long as there is used-once cache present, and will apply
> the LRU balancing when only repeatedly accessed cache pages are left -
> at which point the previous use-once bias will have been neutralized.
> 
> This makes the use-once cache balancing bias unnecessary. Remove it.
> 

The code in get_scan_count() still seems to use the statistics
whose updating you just removed.

What am I overlooking?

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-06 19:48 ` [PATCH 07/10] mm: base LRU balancing on an explicit cost model Johannes Weiner
  2016-06-06 19:13   ` kbuild test robot
@ 2016-06-07  2:34   ` Rik van Riel
  2016-06-07 14:12     ` Johannes Weiner
  2016-06-08  8:14   ` Minchan Kim
  2016-06-08 12:51   ` Michal Hocko
  3 siblings, 1 reply; 67+ messages in thread
From: Rik van Riel @ 2016-06-07  2:34 UTC (permalink / raw)
  To: Johannes Weiner, linux-mm, linux-kernel
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, Andi Kleen,
	Michal Hocko, Tim Chen, kernel-team

[-- Attachment #1: Type: text/plain, Size: 815 bytes --]

On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> Currently, scan pressure between the anon and file LRU lists is
> balanced based on a mixture of reclaim efficiency and a somewhat vague
> notion of "value" of having certain pages in memory over others. That
> concept of value is problematic, because it has caused us to count any
> event that remotely makes one LRU list more or less preferable for
> reclaim, even when these events are not directly comparable to each
> other and impose very different costs on the system - such as a
> referenced file page that we still deactivate and a referenced
> anonymous page that we actually rotate back to the head of the list.
> 

Well, patches 7-10 answered my question on patch 6 :)

I like this design.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 02/10] mm: swap: unexport __pagevec_lru_add()
  2016-06-06 19:48 ` [PATCH 02/10] mm: swap: unexport __pagevec_lru_add() Johannes Weiner
  2016-06-06 21:32   ` Rik van Riel
@ 2016-06-07  9:07   ` Michal Hocko
  2016-06-08  7:14   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2016-06-07  9:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:28, Johannes Weiner wrote:
> There is currently no modular user of this function. We used to have
> filesystems that open-coded the page cache instantiation, but luckily
> they're all streamlined, and we don't want this to come back.

allmodconfig agrees with that

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/swap.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 95916142fc46..d810c3d95c97 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -860,7 +860,6 @@ void __pagevec_lru_add(struct pagevec *pvec)
>  {
>  	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
>  }
> -EXPORT_SYMBOL(__pagevec_lru_add);
>  
>  /**
>   * pagevec_lookup_entries - gang pagecache lookup
> -- 
> 2.8.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()
  2016-06-06 19:48 ` [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() Johannes Weiner
  2016-06-06 21:33   ` Rik van Riel
@ 2016-06-07  9:12   ` Michal Hocko
  2016-06-08  7:24   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2016-06-07  9:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:29, Johannes Weiner wrote:
> They're the same function, and for the purpose of all callers they are
> equivalent to lru_cache_add().
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  fs/cifs/file.c       | 10 +++++-----
>  fs/fuse/dev.c        |  2 +-
>  include/linux/swap.h |  2 --
>  mm/shmem.c           |  4 ++--
>  mm/swap.c            | 40 +++++++++-------------------------------
>  mm/swap_state.c      |  2 +-
>  6 files changed, 18 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 9793ae0bcaa2..232390879640 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -3261,7 +3261,7 @@ cifs_readv_complete(struct work_struct *work)
>  	for (i = 0; i < rdata->nr_pages; i++) {
>  		struct page *page = rdata->pages[i];
>  
> -		lru_cache_add_file(page);
> +		lru_cache_add(page);
>  
>  		if (rdata->result == 0 ||
>  		    (rdata->result == -EAGAIN && got_bytes)) {
> @@ -3321,7 +3321,7 @@ cifs_readpages_read_into_pages(struct TCP_Server_Info *server,
>  			 * fill them until the writes are flushed.
>  			 */
>  			zero_user(page, 0, PAGE_SIZE);
> -			lru_cache_add_file(page);
> +			lru_cache_add(page);
>  			flush_dcache_page(page);
>  			SetPageUptodate(page);
>  			unlock_page(page);
> @@ -3331,7 +3331,7 @@ cifs_readpages_read_into_pages(struct TCP_Server_Info *server,
>  			continue;
>  		} else {
>  			/* no need to hold page hostage */
> -			lru_cache_add_file(page);
> +			lru_cache_add(page);
>  			unlock_page(page);
>  			put_page(page);
>  			rdata->pages[i] = NULL;
> @@ -3488,7 +3488,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
>  			/* best to give up if we're out of mem */
>  			list_for_each_entry_safe(page, tpage, &tmplist, lru) {
>  				list_del(&page->lru);
> -				lru_cache_add_file(page);
> +				lru_cache_add(page);
>  				unlock_page(page);
>  				put_page(page);
>  			}
> @@ -3518,7 +3518,7 @@ static int cifs_readpages(struct file *file, struct address_space *mapping,
>  			add_credits_and_wake_if(server, rdata->credits, 0);
>  			for (i = 0; i < rdata->nr_pages; i++) {
>  				page = rdata->pages[i];
> -				lru_cache_add_file(page);
> +				lru_cache_add(page);
>  				unlock_page(page);
>  				put_page(page);
>  			}
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index cbece1221417..c7264d4a7f3f 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -900,7 +900,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
>  	get_page(newpage);
>  
>  	if (!(buf->flags & PIPE_BUF_FLAG_LRU))
> -		lru_cache_add_file(newpage);
> +		lru_cache_add(newpage);
>  
>  	err = 0;
>  	spin_lock(&cs->req->waitq.lock);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0af2bb2028fd..38fe1e91ba55 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -296,8 +296,6 @@ extern unsigned long nr_free_pagecache_pages(void);
>  
>  /* linux/mm/swap.c */
>  extern void lru_cache_add(struct page *);
> -extern void lru_cache_add_anon(struct page *page);
> -extern void lru_cache_add_file(struct page *page);
>  extern void lru_add_page_tail(struct page *page, struct page *page_tail,
>  			 struct lruvec *lruvec, struct list_head *head);
>  extern void activate_page(struct page *);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e418a995427d..ff210317022d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1098,7 +1098,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
>  		oldpage = newpage;
>  	} else {
>  		mem_cgroup_migrate(oldpage, newpage);
> -		lru_cache_add_anon(newpage);
> +		lru_cache_add(newpage);
>  		*pagep = newpage;
>  	}
>  
> @@ -1289,7 +1289,7 @@ repeat:
>  			goto decused;
>  		}
>  		mem_cgroup_commit_charge(page, memcg, false, false);
> -		lru_cache_add_anon(page);
> +		lru_cache_add(page);
>  
>  		spin_lock(&info->lock);
>  		info->alloced++;
> diff --git a/mm/swap.c b/mm/swap.c
> index d810c3d95c97..d2786a6308dd 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -386,36 +386,6 @@ void mark_page_accessed(struct page *page)
>  }
>  EXPORT_SYMBOL(mark_page_accessed);
>  
> -static void __lru_cache_add(struct page *page)
> -{
> -	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
> -
> -	get_page(page);
> -	if (!pagevec_space(pvec))
> -		__pagevec_lru_add(pvec);
> -	pagevec_add(pvec, page);
> -	put_cpu_var(lru_add_pvec);
> -}
> -
> -/**
> - * lru_cache_add: add a page to the page lists
> - * @page: the page to add
> - */
> -void lru_cache_add_anon(struct page *page)
> -{
> -	if (PageActive(page))
> -		ClearPageActive(page);
> -	__lru_cache_add(page);
> -}
> -
> -void lru_cache_add_file(struct page *page)
> -{
> -	if (PageActive(page))
> -		ClearPageActive(page);
> -	__lru_cache_add(page);
> -}
> -EXPORT_SYMBOL(lru_cache_add_file);
> -
>  /**
>   * lru_cache_add - add a page to a page list
>   * @page: the page to be added to the LRU.
> @@ -427,10 +397,18 @@ EXPORT_SYMBOL(lru_cache_add_file);
>   */
>  void lru_cache_add(struct page *page)
>  {
> +	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
> +
>  	VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
> -	__lru_cache_add(page);
> +
> +	get_page(page);
> +	if (!pagevec_space(pvec))
> +		__pagevec_lru_add(pvec);
> +	pagevec_add(pvec, page);
> +	put_cpu_var(lru_add_pvec);
>  }
> +EXPORT_SYMBOL(lru_cache_add);
>  
>  /**
>   * add_page_to_unevictable_list - add a page to the unevictable list
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 0d457e7db8d6..5400f814ae12 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -365,7 +365,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  			/*
>  			 * Initiate read into locked page and return.
>  			 */
> -			lru_cache_add_anon(new_page);
> +			lru_cache_add(new_page);
>  			*new_page_allocated = true;
>  			return new_page;
>  		}
> -- 
> 2.8.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages
  2016-06-06 19:48 ` [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
  2016-06-06 21:36   ` Rik van Riel
@ 2016-06-07  9:19   ` Michal Hocko
  2016-06-08  7:28   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2016-06-07  9:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:30, Johannes Weiner wrote:
> Currently, THP are counted as single pages until they are split right
> before being swapped out. However, at that point the VM is already in
> the middle of reclaim, and adjusting the LRU balance then is useless.
> 
> Always account THP by the number of basepages, and remove the fixup
> from the splitting path.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/swap.c | 18 ++++++++----------
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index d2786a6308dd..c6936507abb5 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -249,13 +249,14 @@ void rotate_reclaimable_page(struct page *page)
>  }
>  
>  static void update_page_reclaim_stat(struct lruvec *lruvec,
> -				     int file, int rotated)
> +				     int file, int rotated,
> +				     unsigned int nr_pages)
>  {
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>  
> -	reclaim_stat->recent_scanned[file]++;
> +	reclaim_stat->recent_scanned[file] += nr_pages;
>  	if (rotated)
> -		reclaim_stat->recent_rotated[file]++;
> +		reclaim_stat->recent_rotated[file] += nr_pages;
>  }
>  
>  static void __activate_page(struct page *page, struct lruvec *lruvec,
> @@ -272,7 +273,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
>  		trace_mm_lru_activate(page);
>  
>  		__count_vm_event(PGACTIVATE);
> -		update_page_reclaim_stat(lruvec, file, 1);
> +		update_page_reclaim_stat(lruvec, file, 1, hpage_nr_pages(page));
>  	}
>  }
>  
> @@ -532,7 +533,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  
>  	if (active)
>  		__count_vm_event(PGDEACTIVATE);
> -	update_page_reclaim_stat(lruvec, file, 0);
> +	update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
>  }
>  
>  
> @@ -549,7 +550,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>  		add_page_to_lru_list(page, lruvec, lru);
>  
>  		__count_vm_event(PGDEACTIVATE);
> -		update_page_reclaim_stat(lruvec, file, 0);
> +		update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
>  	}
>  }
>  
> @@ -809,9 +810,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>  		list_head = page_tail->lru.prev;
>  		list_move_tail(&page_tail->lru, list_head);
>  	}
> -
> -	if (!PageUnevictable(page))
> -		update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> @@ -826,7 +824,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>  
>  	SetPageLRU(page);
>  	add_page_to_lru_list(page, lruvec, lru);
> -	update_page_reclaim_stat(lruvec, file, active);
> +	update_page_reclaim_stat(lruvec, file, active, hpage_nr_pages(page));
>  	trace_mm_lru_insertion(page, lru);
>  }
>  
> -- 
> 2.8.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-06 22:15     ` Johannes Weiner
  2016-06-07  1:11       ` Rik van Riel
@ 2016-06-07  9:26       ` Michal Hocko
  2016-06-07 14:06         ` Johannes Weiner
  1 sibling, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2016-06-07  9:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 18:15:50, Johannes Weiner wrote:
[...]
> The last hunk in the patch (obscured by showing the label instead of
> the function name as context)

JFYI my ~/.gitconfig has the following to workaround this:
[diff "default"]
        xfuncname = "^[[:alpha:]$_].*[^:]$"

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-06 19:48 ` [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation Johannes Weiner
  2016-06-06 21:56   ` Rik van Riel
@ 2016-06-07  9:49   ` Michal Hocko
  2016-06-08  7:39   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2016-06-07  9:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:31, Johannes Weiner wrote:
> Isolating an existing LRU page and subsequently putting it back on the
> list currently influences the balance between the anon and file LRUs.
> For example, heavy page migration or compaction could influence the
> balance between the LRUs and make one type more attractive when that
> type of page is affected more than the other. That doesn't make sense.
> 
> Add a dedicated LRU cache for putback, so that we can tell new LRU
> pages from existing ones at the time of linking them to the lists.

It is far from trivial to review this one (there are quite a few callers),
but it makes sense to me from the semantic point of view.
 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/pagevec.h |  2 +-
>  include/linux/swap.h    |  1 +
>  mm/mlock.c              |  2 +-
>  mm/swap.c               | 34 ++++++++++++++++++++++++++++------
>  mm/vmscan.c             |  2 +-
>  5 files changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
> index b45d391b4540..3f8a2a01131c 100644
> --- a/include/linux/pagevec.h
> +++ b/include/linux/pagevec.h
> @@ -21,7 +21,7 @@ struct pagevec {
>  };
>  
>  void __pagevec_release(struct pagevec *pvec);
> -void __pagevec_lru_add(struct pagevec *pvec);
> +void __pagevec_lru_add(struct pagevec *pvec, bool new);
>  unsigned pagevec_lookup_entries(struct pagevec *pvec,
>  				struct address_space *mapping,
>  				pgoff_t start, unsigned nr_entries,
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 38fe1e91ba55..178f084365c2 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -296,6 +296,7 @@ extern unsigned long nr_free_pagecache_pages(void);
>  
>  /* linux/mm/swap.c */
>  extern void lru_cache_add(struct page *);
> +extern void lru_cache_putback(struct page *page);
>  extern void lru_add_page_tail(struct page *page, struct page *page_tail,
>  			 struct lruvec *lruvec, struct list_head *head);
>  extern void activate_page(struct page *);
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 96f001041928..449c291a286d 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -264,7 +264,7 @@ static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
>  	 *__pagevec_lru_add() calls release_pages() so we don't call
>  	 * put_page() explicitly
>  	 */
> -	__pagevec_lru_add(pvec);
> +	__pagevec_lru_add(pvec, false);
>  	count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
>  }
>  
> diff --git a/mm/swap.c b/mm/swap.c
> index c6936507abb5..576c721f210b 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -44,6 +44,7 @@
>  int page_cluster;
>  
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
> +static DEFINE_PER_CPU(struct pagevec, lru_putback_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> @@ -405,12 +406,23 @@ void lru_cache_add(struct page *page)
>  
>  	get_page(page);
>  	if (!pagevec_space(pvec))
> -		__pagevec_lru_add(pvec);
> +		__pagevec_lru_add(pvec, true);
>  	pagevec_add(pvec, page);
>  	put_cpu_var(lru_add_pvec);
>  }
>  EXPORT_SYMBOL(lru_cache_add);
>  
> +void lru_cache_putback(struct page *page)
> +{
> +	struct pagevec *pvec = &get_cpu_var(lru_putback_pvec);
> +
> +	get_page(page);
> +	if (!pagevec_space(pvec))
> +		__pagevec_lru_add(pvec, false);
> +	pagevec_add(pvec, page);
> +	put_cpu_var(lru_putback_pvec);
> +}
> +
>  /**
>   * add_page_to_unevictable_list - add a page to the unevictable list
>   * @page:  the page to be added to the unevictable list
> @@ -561,10 +573,15 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>   */
>  void lru_add_drain_cpu(int cpu)
>  {
> -	struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
> +	struct pagevec *pvec;
> +
> +	pvec = &per_cpu(lru_add_pvec, cpu);
> +	if (pagevec_count(pvec))
> +		__pagevec_lru_add(pvec, true);
>  
> +	pvec = &per_cpu(lru_putback_pvec, cpu);
>  	if (pagevec_count(pvec))
> -		__pagevec_lru_add(pvec);
> +		__pagevec_lru_add(pvec, false);
>  
>  	pvec = &per_cpu(lru_rotate_pvecs, cpu);
>  	if (pagevec_count(pvec)) {
> @@ -819,12 +836,17 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>  	int file = page_is_file_cache(page);
>  	int active = PageActive(page);
>  	enum lru_list lru = page_lru(page);
> +	bool new = (bool)arg;
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
>  	SetPageLRU(page);
>  	add_page_to_lru_list(page, lruvec, lru);
> -	update_page_reclaim_stat(lruvec, file, active, hpage_nr_pages(page));
> +
> +	if (new)
> +		update_page_reclaim_stat(lruvec, file, active,
> +					 hpage_nr_pages(page));
> +
>  	trace_mm_lru_insertion(page, lru);
>  }
>  
> @@ -832,9 +854,9 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>   * Add the passed pages to the LRU, then drop the caller's refcount
>   * on them.  Reinitialises the caller's pagevec.
>   */
> -void __pagevec_lru_add(struct pagevec *pvec)
> +void __pagevec_lru_add(struct pagevec *pvec, bool new)
>  {
> -	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
> +	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, (void *)new);
>  }
>  
>  /**
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f79010bbcdd4..8503713bb60e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -737,7 +737,7 @@ redo:
>  		 * We know how to handle that.
>  		 */
>  		is_unevictable = false;
> -		lru_cache_add(page);
> +		lru_cache_putback(page);
>  	} else {
>  		/*
>  		 * Put unevictable pages directly on zone's unevictable
> -- 
> 2.8.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 00/10] mm: balance LRU lists based on relative thrashing
  2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
                   ` (9 preceding siblings ...)
  2016-06-06 19:48 ` [PATCH 10/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
@ 2016-06-07  9:51 ` Michal Hocko
  10 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2016-06-07  9:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:26, Johannes Weiner wrote:
> Hi everybody,
> 
> this series re-implements the LRU balancing between page cache and
> anonymous pages to work better with fast random IO swap devices.

I didn't get to review the full series properly but initial patches
(2-5) seem good to go even without the rest. I will try to get to the
rest ASAP.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-07  1:11       ` Rik van Riel
@ 2016-06-07 13:57         ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-07 13:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 09:11:18PM -0400, Rik van Riel wrote:
> On Mon, 2016-06-06 at 18:15 -0400, Johannes Weiner wrote:
> > On Mon, Jun 06, 2016 at 05:56:09PM -0400, Rik van Riel wrote:
> > > 
> > > On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> > > > 
> > > >  
> > > > +void lru_cache_putback(struct page *page)
> > > > +{
> > > > +	struct pagevec *pvec = &get_cpu_var(lru_putback_pvec);
> > > > +
> > > > +	get_page(page);
> > > > +	if (!pagevec_space(pvec))
> > > > +		__pagevec_lru_add(pvec, false);
> > > > +	pagevec_add(pvec, page);
> > > > +	put_cpu_var(lru_putback_pvec);
> > > > +}
> > > > 
> > > Wait a moment.
> > > 
> > > So now we have a putback_lru_page, which does adjust
> > > the statistics, and an lru_cache_putback which does
> > > not?
> > > 
> > > This function could use a name that is not as similar
> > > to its counterpart :)
> > lru_cache_add() and lru_cache_putback() are the two sibling
> > functions,
> > where the first influences the LRU balance and the second one
> > doesn't.
> > 
> > The last hunk in the patch (obscured by showing the label instead of
> > the function name as context) updates putback_lru_page() from using
> > lru_cache_add() to using lru_cache_putback().
> > 
> > Does that make sense?
> 
> That means the page reclaim does not update the
> "rotated" statistics.  That seems undesirable,
> no?  Am I overlooking something?

Oh, reclaim doesn't use putback_lru_page(), except for the stray
unevictable corner case. It does open-coded putback in batch, and
those functions continue to update the reclaim statistics. See the
recent_scanned/recent_rotated manipulations in putback_inactive_pages(),
shrink_inactive_list(), and shrink_active_list().

putback_lru_page() is mainly used by page migration, cgroup migration,
mlock etc. - all operations which muck with the LRU for purposes other
than reclaim or aging, and so shouldn't affect the anon/file balance.

This patch only changes those LRU users, not page reclaim.


* Re: [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-07  9:26       ` Michal Hocko
@ 2016-06-07 14:06         ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-07 14:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Tue, Jun 07, 2016 at 11:26:29AM +0200, Michal Hocko wrote:
> On Mon 06-06-16 18:15:50, Johannes Weiner wrote:
> [...]
> > The last hunk in the patch (obscured by showing the label instead of
> > the function name as context)
> 
> JFYI my ~/.gitconfig has the following to workaround this:
> [diff "default"]
>         xfuncname = "^[[:alpha:]$_].*[^:]$"

Thanks, that's useful. I added it to my ~/.gitconfig, so this should
be a little less confusing in v2.


* Re: [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing
  2016-06-07  2:20   ` Rik van Riel
@ 2016-06-07 14:11     ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-07 14:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 10:20:31PM -0400, Rik van Riel wrote:
> On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> > When the splitlru patches divided page cache and swap-backed pages
> > into separate LRU lists, the pressure balance between the lists was
> > biased to account for the fact that streaming IO can cause memory
> > pressure with a flood of pages that are used only once. New page
> > cache
> > additions would tip the balance toward the file LRU, and repeat
> > access
> > would neutralize that bias again. This ensured that page reclaim
> > would
> > always go for used-once cache first.
> > 
> > Since e9868505987a ("mm,vmscan: only evict file pages when we have
> > plenty"), page reclaim generally skips over swap-backed memory
> > entirely as long as there is used-once cache present, and will apply
> > the LRU balancing when only repeatedly accessed cache pages are left
> > -
> > at which point the previous use-once bias will have been neutralized.
> > 
> > This makes the use-once cache balancing bias unnecessary. Remove it.
> > 
> 
> The code in get_scan_count() still seems to use the statistics
> whose updates you just removed.
> 
> What am I overlooking?

As I mentioned in 5/10, page reclaim still does updates for each
scanned page and rotated page at this point in the series.

This merely removes the pre-reclaim bias for cache.


* Re: [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-07  2:34   ` Rik van Riel
@ 2016-06-07 14:12     ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-07 14:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 10:34:43PM -0400, Rik van Riel wrote:
> On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> > Currently, scan pressure between the anon and file LRU lists is
> > balanced based on a mixture of reclaim efficiency and a somewhat
> > vague
> > notion of "value" of having certain pages in memory over others. That
> > concept of value is problematic, because it has caused us to count
> > any
> > event that remotely makes one LRU list more or less preferable for
> > reclaim, even when these events are not directly comparable to each
> > other and impose very different costs on the system - such as a
> > referenced file page that we still deactivate and a referenced
> > anonymous page that we actually rotate back to the head of the list.
> > 
> 
> Well, patches 7-10 answered my question on patch 6 :)
> 
> I like this design.

Great! Thanks for reviewing.


* Re: [PATCH 01/10] mm: allow swappiness that prefers anon over file
  2016-06-07  0:25   ` Minchan Kim
@ 2016-06-07 14:18     ` Johannes Weiner
  2016-06-08  0:06       ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2016-06-07 14:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Tue, Jun 07, 2016 at 09:25:50AM +0900, Minchan Kim wrote:
> Hi Johannes,
> 
> Thanks for the nice work. I didn't read the whole patchset yet, but the
> design makes sense to me, so it should work better for zram-based
> workloads compared to the current behavior.

Thanks!

> On Mon, Jun 06, 2016 at 03:48:27PM -0400, Johannes Weiner wrote:
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -771,14 +771,20 @@ with no ill effects: errors and warnings on these stats are suppressed.)
> >  
> >  swappiness
> >  
> > -This control is used to define how aggressive the kernel will swap
> > -memory pages.  Higher values will increase agressiveness, lower values
> > -decrease the amount of swap.  A value of 0 instructs the kernel not to
> > -initiate swap until the amount of free and file-backed pages is less
> > -than the high water mark in a zone.
> > +This control is used to define the relative IO cost of cache misses
> > +between the swap device and the filesystem as a value between 0 and
> > +200. At 100, the VM assumes equal IO cost and will thus apply memory
> > +pressure to the page cache and swap-backed pages equally. At 0, the
> > +kernel will not initiate swap until the amount of free and file-backed
> > +pages is less than the high watermark in a zone.
> 
> Generally, I agree that extending the swappiness value is good, but I'm
> not sure 200 is enough to represent the speed gap between file and swap
> storage in every case. - Just a nitpick.

How so? You can't give swap more weight than 100%. 200 is the maximum
possible value.

> Some years ago, I extended it to 200 like your patch and experimented
> with it using zram in our platform workload. At that time, it was terribly
> slow in the app switching workload if swappiness was higher than 150.
> Although it was highly dependent on the workload, it's dangerous to
> recommend it before fixing the balancing between file and anon, I think.
> IOW, I think this patch should be the last one in this patchset.

Good point. I'll tone down the recommendations. But OTOH it's a fairly
trivial patch, so I wouldn't want to move it after the current 10/10.

> >  The default value is 60.
> >  
> > +On non-rotational swap devices, a value of 100 (or higher, depending
> > +on what's backing the filesystem) is recommended.
> > +
> > +For in-memory swap, like zswap, values closer to 200 are recommended.
> 
>                 maybe, like zram
> 
> I'm not sure it would be a good suggestion for zswap because it ends up
> writing cached pages to the swap device once it reaches its threshold.
> Then, the cost is compression + decompression + write I/O, which is
> heavier than a normal swap device (i.e., write I/O). OTOH, zram has no
> (writeback I/O + decompression) cost.

Oh, good catch. Yeah, I'll change that for v2.

Thanks for your input, Minchan


* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-06 23:50   ` Tim Chen
@ 2016-06-07 16:23     ` Johannes Weiner
  2016-06-07 19:56       ` Tim Chen
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2016-06-07 16:23 UTC (permalink / raw)
  To: Tim Chen
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, kernel-team

Hi Tim,

On Mon, Jun 06, 2016 at 04:50:23PM -0700, Tim Chen wrote:
> On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> > To tell inactive from active refaults, a page flag is introduced that
> > marks pages that have been on the active list in their lifetime. This
> > flag is remembered in the shadow page entry on reclaim, and restored
> > when the page refaults. It is also set on anonymous pages during
> > swapin. When a page with that flag set is added to the LRU, the LRU
> > balance is adjusted for the IO cost of reclaiming the thrashing list.
> 
> Johannes,
> 
> It seems like you are saying that the shadow entry is also present
> for anonymous pages that are swapped out.  But once a page is swapped
> out, its entry is removed from the radix tree and we won't be able
> to store the shadow page entry as we do for a file-mapped page
> in __remove_mapping.  Or are you thinking of modifying
> the current code to keep the radix tree entry? I may be missing something
> so will appreciate if you can clarify.

Sorry if this was ambiguously phrased.

You are correct, there are no shadow entries for anonymous evictions,
only page cache evictions. All swap-ins are treated as "eligible"
refaults and push back against cache, whereas cache only pushes
against anon if the cache workingset is determined to fit into memory.

That implies a fixed hierarchy where the VM always tries to fit the
anonymous workingset into memory first and the page cache second. If
the anonymous set is bigger than memory, the algorithm won't stop
counting IO cost from anonymous refaults and pressuring page cache.

[ Although you can set the effective cost of these refaults to 0
  (swappiness = 200) and reduce effective cache to a minimum -
  possibly to a level where LRU rotations consume most of it.
  But yeah. ]

So the current code works well when we assume that cache workingsets
might exceed memory, but anonymous workingsets don't.

For SSDs and non-DIMM pmem devices this assumption is fine, because
nobody wants half their frequent anonymous memory accesses to be major
faults. Anonymous workingsets will continue to target RAM size there.

Secondary memory types, which userspace can continue to map directly
after "swap out", are a different story. That might need workingset
estimation for anonymous pages. But it would have to build on top of
this series here. These patches are about eliminating or mitigating IO
by swapping idle or colder anon pages when the cache is thrashing.


* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-07 16:23     ` Johannes Weiner
@ 2016-06-07 19:56       ` Tim Chen
  0 siblings, 0 replies; 67+ messages in thread
From: Tim Chen @ 2016-06-07 19:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, kernel-team

On Tue, 2016-06-07 at 12:23 -0400, Johannes Weiner wrote:
> Hi Tim,
> 
> On Mon, Jun 06, 2016 at 04:50:23PM -0700, Tim Chen wrote:
> > 
> > On Mon, 2016-06-06 at 15:48 -0400, Johannes Weiner wrote:
> > > 
> > > To tell inactive from active refaults, a page flag is introduced that
> > > marks pages that have been on the active list in their lifetime. This
> > > flag is remembered in the shadow page entry on reclaim, and restored
> > > when the page refaults. It is also set on anonymous pages during
> > > swapin. When a page with that flag set is added to the LRU, the LRU
> > > balance is adjusted for the IO cost of reclaiming the thrashing list.
> > Johannes,
> > 
> > It seems like you are saying that the shadow entry is also present
> > for anonymous pages that are swapped out.  But once a page is swapped
> > out, its entry is removed from the radix tree and we won't be able
> > to store the shadow page entry as we do for a file-mapped page
> > in __remove_mapping.  Or are you thinking of modifying
> > the current code to keep the radix tree entry? I may be missing something
> > so will appreciate if you can clarify.
> Sorry if this was ambiguously phrased.
> 
> You are correct, there are no shadow entries for anonymous evictions,
> only page cache evictions. All swap-ins are treated as "eligible"
> refaults and push back against cache, whereas cache only pushes
> against anon if the cache workingset is determined to fit into memory.

Thanks. That makes sense.  I wasn't sure before whether you intended
to have a re-fault distance to determine if a
faulted-in anonymous page is in the working set.  I see now that
you always consider it to be in the working set.

> 
> That implies a fixed hierarchy where the VM always tries to fit the
> anonymous workingset into memory first and the page cache second. If
> the anonymous set is bigger than memory, the algorithm won't stop
> counting IO cost from anonymous refaults and pressuring page cache.
> 
> [ Although you can set the effective cost of these refaults to 0
>   (swappiness = 200) and reduce effective cache to a minimum -
>   possibly to a level where LRU rotations consume most of it.
>   But yeah. ]
> 
> So the current code works well when we assume that cache workingsets
> might exceed memory, but anonymous workingsets don't.
> 
> For SSDs and non-DIMM pmem devices this assumption is fine, because
> nobody wants half their frequent anonymous memory accesses to be major
> faults. Anonymous workingsets will continue to target RAM size there.
> 
> Secondary memory types, which userspace can continue to map directly
> after "swap out", are a different story. That might need workingset
> estimation for anonymous pages. 

The direct-mapped swap case is trickier, as we need a method to gauge how
often a page was accessed in place in swap, to decide if we need to
bring it back to RAM.  The accessed bit in the pte only tells
us whether it has been accessed, not how frequently.

If we simply try to mitigate IO cost, we may just have pages migrated and
accessed within the swap space, but not bring the hot ones back to RAM.

That said, this series is a very nice optimization of the balance between
anonymous and file backed page reclaim.

Thanks.

Tim

> But it would have to build on top of
> this series here. These patches are about eliminating or mitigating IO
> by swapping idle or colder anon pages when the cache is thrashing.


* Re: [PATCH 01/10] mm: allow swappiness that prefers anon over file
  2016-06-07 14:18     ` Johannes Weiner
@ 2016-06-08  0:06       ` Minchan Kim
  2016-06-08 15:58         ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2016-06-08  0:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Tue, Jun 07, 2016 at 10:18:18AM -0400, Johannes Weiner wrote:
> On Tue, Jun 07, 2016 at 09:25:50AM +0900, Minchan Kim wrote:
> > Hi Johannes,
> > 
> > Thanks for the nice work. I didn't read the whole patchset yet, but the
> > design makes sense to me, so it should work better for zram-based
> > workloads compared to the current behavior.
> 
> Thanks!
> 
> > On Mon, Jun 06, 2016 at 03:48:27PM -0400, Johannes Weiner wrote:
> > > --- a/Documentation/sysctl/vm.txt
> > > +++ b/Documentation/sysctl/vm.txt
> > > @@ -771,14 +771,20 @@ with no ill effects: errors and warnings on these stats are suppressed.)
> > >  
> > >  swappiness
> > >  
> > > -This control is used to define how aggressive the kernel will swap
> > > -memory pages.  Higher values will increase agressiveness, lower values
> > > -decrease the amount of swap.  A value of 0 instructs the kernel not to
> > > -initiate swap until the amount of free and file-backed pages is less
> > > -than the high water mark in a zone.
> > > +This control is used to define the relative IO cost of cache misses
> > > +between the swap device and the filesystem as a value between 0 and
> > > +200. At 100, the VM assumes equal IO cost and will thus apply memory
> > > +pressure to the page cache and swap-backed pages equally. At 0, the
> > > +kernel will not initiate swap until the amount of free and file-backed
> > > +pages is less than the high watermark in a zone.
> > 
> > Generally, I agree that extending the swappiness value is good, but I'm
> > not sure 200 is enough to represent the speed gap between file and swap
> > storage in every case. - Just a nitpick.
> 
> How so? You can't give swap more weight than 100%. 200 is the maximum
> possible value.

Traditionally, swappiness meant how aggressively to reclaim anonymous pages
in favour of page cache. But when I read your description and the changes
about swappiness in vm.txt, esp. *relative IO cost*, I feel you changed the
definition of swappiness to represent the relative IO cost between swap
storage and file storage. Then, with that, we could balance the anonymous
and file LRUs by that weight.

For example, let's assume that in-memory swap storage is 10x faster than a
slow thumb drive. In that case, the IO cost of 10 anonymous pages swapping
in/out is equal to that of 1 file-backed page discard/read.

I thought that made sense because measuring the speed gap between those
storages is easier than selecting a vague swappiness tendency.

In terms of such an approach, I thought 200 is not enough to express the
gap, because the anon-favouring range only starts at 100.
Isn't that your intention? If so, to me, the description was rather
misleading. :(


* Re: [PATCH 02/10] mm: swap: unexport __pagevec_lru_add()
  2016-06-06 19:48 ` [PATCH 02/10] mm: swap: unexport __pagevec_lru_add() Johannes Weiner
  2016-06-06 21:32   ` Rik van Riel
  2016-06-07  9:07   ` Michal Hocko
@ 2016-06-08  7:14   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2016-06-08  7:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 03:48:28PM -0400, Johannes Weiner wrote:
> There is currently no modular user of this function. We used to have
> filesystems that open-coded the page cache instantiation, but luckily
> they're all streamlined, and we don't want this to come back.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>


* Re: [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()
  2016-06-06 19:48 ` [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() Johannes Weiner
  2016-06-06 21:33   ` Rik van Riel
  2016-06-07  9:12   ` Michal Hocko
@ 2016-06-08  7:24   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2016-06-08  7:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 03:48:29PM -0400, Johannes Weiner wrote:
> They're the same function, and for the purpose of all callers they are
> equivalent to lru_cache_add().
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>


* Re: [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages
  2016-06-06 19:48 ` [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
  2016-06-06 21:36   ` Rik van Riel
  2016-06-07  9:19   ` Michal Hocko
@ 2016-06-08  7:28   ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2016-06-08  7:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 03:48:30PM -0400, Johannes Weiner wrote:
> Currently, THP are counted as single pages until they are split right
> before being swapped out. However, at that point the VM is already in
> the middle of reclaim, and adjusting the LRU balance then is useless.
> 
> Always account THP by the number of basepages, and remove the fixup
> from the splitting path.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>


* Re: [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-06 19:48 ` [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation Johannes Weiner
  2016-06-06 21:56   ` Rik van Riel
  2016-06-07  9:49   ` Michal Hocko
@ 2016-06-08  7:39   ` Minchan Kim
  2016-06-08 16:02     ` Johannes Weiner
  2 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2016-06-08  7:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 03:48:31PM -0400, Johannes Weiner wrote:
> Isolating an existing LRU page and subsequently putting it back on the
> list currently influences the balance between the anon and file LRUs.
> For example, heavy page migration or compaction could influence the
> balance between the LRUs and make one type more attractive when that
> type of page is affected more than the other. That doesn't make sense.
> 
> Add a dedicated LRU cache for putback, so that we can tell new LRU
> pages from existing ones at the time of linking them to the lists.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/pagevec.h |  2 +-
>  include/linux/swap.h    |  1 +
>  mm/mlock.c              |  2 +-
>  mm/swap.c               | 34 ++++++++++++++++++++++++++++------
>  mm/vmscan.c             |  2 +-
>  5 files changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
> index b45d391b4540..3f8a2a01131c 100644
> --- a/include/linux/pagevec.h
> +++ b/include/linux/pagevec.h
> @@ -21,7 +21,7 @@ struct pagevec {
>  };
>  
>  void __pagevec_release(struct pagevec *pvec);
> -void __pagevec_lru_add(struct pagevec *pvec);
> +void __pagevec_lru_add(struct pagevec *pvec, bool new);
>  unsigned pagevec_lookup_entries(struct pagevec *pvec,
>  				struct address_space *mapping,
>  				pgoff_t start, unsigned nr_entries,
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 38fe1e91ba55..178f084365c2 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -296,6 +296,7 @@ extern unsigned long nr_free_pagecache_pages(void);
>  
>  /* linux/mm/swap.c */
>  extern void lru_cache_add(struct page *);
> +extern void lru_cache_putback(struct page *page);
>  extern void lru_add_page_tail(struct page *page, struct page *page_tail,
>  			 struct lruvec *lruvec, struct list_head *head);
>  extern void activate_page(struct page *);
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 96f001041928..449c291a286d 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -264,7 +264,7 @@ static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
>  	 *__pagevec_lru_add() calls release_pages() so we don't call
>  	 * put_page() explicitly
>  	 */
> -	__pagevec_lru_add(pvec);
> +	__pagevec_lru_add(pvec, false);
>  	count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
>  }
>  
> diff --git a/mm/swap.c b/mm/swap.c
> index c6936507abb5..576c721f210b 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -44,6 +44,7 @@
>  int page_cluster;
>  
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
> +static DEFINE_PER_CPU(struct pagevec, lru_putback_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> @@ -405,12 +406,23 @@ void lru_cache_add(struct page *page)
>  
>  	get_page(page);
>  	if (!pagevec_space(pvec))
> -		__pagevec_lru_add(pvec);
> +		__pagevec_lru_add(pvec, true);
>  	pagevec_add(pvec, page);
>  	put_cpu_var(lru_add_pvec);
>  }
>  EXPORT_SYMBOL(lru_cache_add);
>  
> +void lru_cache_putback(struct page *page)
> +{
> +	struct pagevec *pvec = &get_cpu_var(lru_putback_pvec);
> +
> +	get_page(page);
> +	if (!pagevec_space(pvec))
> +		__pagevec_lru_add(pvec, false);
> +	pagevec_add(pvec, page);
> +	put_cpu_var(lru_putback_pvec);
> +}
> +
>  /**
>   * add_page_to_unevictable_list - add a page to the unevictable list
>   * @page:  the page to be added to the unevictable list
> @@ -561,10 +573,15 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>   */
>  void lru_add_drain_cpu(int cpu)
>  {
> -	struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
> +	struct pagevec *pvec;
> +
> +	pvec = &per_cpu(lru_add_pvec, cpu);
> +	if (pagevec_count(pvec))
> +		__pagevec_lru_add(pvec, true);
>  
> +	pvec = &per_cpu(lru_putback_pvec, cpu);
>  	if (pagevec_count(pvec))
> -		__pagevec_lru_add(pvec);
> +		__pagevec_lru_add(pvec, false);
>  
>  	pvec = &per_cpu(lru_rotate_pvecs, cpu);
>  	if (pagevec_count(pvec)) {
> @@ -819,12 +836,17 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>  	int file = page_is_file_cache(page);
>  	int active = PageActive(page);
>  	enum lru_list lru = page_lru(page);
> +	bool new = (bool)arg;
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
>  	SetPageLRU(page);
>  	add_page_to_lru_list(page, lruvec, lru);
> -	update_page_reclaim_stat(lruvec, file, active, hpage_nr_pages(page));
> +
> +	if (new)
> +		update_page_reclaim_stat(lruvec, file, active,
> +					 hpage_nr_pages(page));
> +
>  	trace_mm_lru_insertion(page, lru);
>  }
>  
> @@ -832,9 +854,9 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>   * Add the passed pages to the LRU, then drop the caller's refcount
>   * on them.  Reinitialises the caller's pagevec.
>   */
> -void __pagevec_lru_add(struct pagevec *pvec)
> +void __pagevec_lru_add(struct pagevec *pvec, bool new)
>  {
> -	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
> +	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, (void *)new);
>  }

Just a trivial note:

the meaning of the 'new' argument is not obvious in this context, so it
would be worth a comment, IMO, but no strong opinion.

Other than that,

Acked-by: Minchan Kim <minchan@kernel.org>


* Re: [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing
  2016-06-06 19:48 ` [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing Johannes Weiner
  2016-06-07  2:20   ` Rik van Riel
@ 2016-06-08  8:03   ` Minchan Kim
  2016-06-08 12:31   ` Michal Hocko
  2 siblings, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2016-06-08  8:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 03:48:32PM -0400, Johannes Weiner wrote:
> When the splitlru patches divided page cache and swap-backed pages
> into separate LRU lists, the pressure balance between the lists was
> biased to account for the fact that streaming IO can cause memory
> pressure with a flood of pages that are used only once. New page cache
> additions would tip the balance toward the file LRU, and repeat access
> would neutralize that bias again. This ensured that page reclaim would
> always go for used-once cache first.
> 
> Since e9868505987a ("mm,vmscan: only evict file pages when we have
> plenty"), page reclaim generally skips over swap-backed memory
> entirely as long as there is used-once cache present, and will apply
> the LRU balancing when only repeatedly accessed cache pages are left -
> at which point the previous use-once bias will have been neutralized.
> 
> This makes the use-once cache balancing bias unnecessary. Remove it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-06 19:48 ` [PATCH 07/10] mm: base LRU balancing on an explicit cost model Johannes Weiner
  2016-06-06 19:13   ` kbuild test robot
  2016-06-07  2:34   ` Rik van Riel
@ 2016-06-08  8:14   ` Minchan Kim
  2016-06-08 16:06     ` Johannes Weiner
  2016-06-08 12:51   ` Michal Hocko
  3 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2016-06-08  8:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 03:48:33PM -0400, Johannes Weiner wrote:
> Currently, scan pressure between the anon and file LRU lists is
> balanced based on a mixture of reclaim efficiency and a somewhat vague
> notion of "value" of having certain pages in memory over others. That
> concept of value is problematic, because it has caused us to count any
> event that remotely makes one LRU list more or less preferable for
> reclaim, even when these events are not directly comparable to each
> other and impose very different costs on the system - such as a
> referenced file page that we still deactivate and a referenced
> anonymous page that we actually rotate back to the head of the list.
> 
> There is also conceptual overlap with the LRU algorithm itself. By
> rotating recently used pages instead of reclaiming them, the algorithm
> already biases the applied scan pressure based on page value. Thus,
> when rebalancing scan pressure due to rotations, we should think of
> reclaim cost, and leave assessing the page value to the LRU algorithm.
> 
> Lastly, considering both value-increasing as well as value-decreasing
> events can sometimes cause the same type of event to be counted twice,
> i.e. how rotating a page increases the LRU value, while reclaiming it
> successfully decreases the value. In itself this will balance out fine,
> but it quietly skews the impact of events that are only recorded once.
> 
> The abstract metric of "value", the murky relationship with the LRU
> algorithm, and accounting both negative and positive events make the
> current pressure balancing model hard to reason about and modify.
> 
> In preparation for thrashing-based LRU balancing, this patch switches
> to a balancing model of accounting the concrete, actually observed
> cost of reclaiming one LRU over another. For now, that cost includes
> pages that are scanned but rotated back to the list head. Subsequent
> patches will add consideration for IO caused by refaulting recently
> evicted pages. The idea is to primarily scan the LRU that thrashes the
> least, and secondarily scan the LRU that needs the least amount of
> work to free memory.
> 
> Rename struct zone_reclaim_stat to struct lru_cost, and move from two
> separate value ratios for the LRU lists to a relative LRU cost metric
> with a shared denominator. Then make everything that affects the cost
> go through a new lru_note_cost() function.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h | 23 +++++++++++------------
>  include/linux/swap.h   |  2 ++
>  mm/swap.c              | 15 +++++----------
>  mm/vmscan.c            | 35 +++++++++++++++--------------------
>  4 files changed, 33 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 02069c23486d..4d257d00fbf5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -191,22 +191,21 @@ static inline int is_active_lru(enum lru_list lru)
>  	return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
>  }
>  
> -struct zone_reclaim_stat {
> -	/*
> -	 * The pageout code in vmscan.c keeps track of how many of the
> -	 * mem/swap backed and file backed pages are referenced.
> -	 * The higher the rotated/scanned ratio, the more valuable
> -	 * that cache is.
> -	 *
> -	 * The anon LRU stats live in [0], file LRU stats in [1]
> -	 */
> -	unsigned long		recent_rotated[2];
> -	unsigned long		recent_scanned[2];
> +/*
> + * This tracks cost of reclaiming one LRU type - file or anon - over
> + * the other. As the observed cost of pressure on one type increases,
> + * the scan balance in vmscan.c tips toward the other type.
> + *
> + * The recorded cost for anon is in numer[0], file in numer[1].
> + */
> +struct lru_cost {
> +	unsigned long		numer[2];
> +	unsigned long		denom;
>  };
>  
>  struct lruvec {
>  	struct list_head		lists[NR_LRU_LISTS];
> -	struct zone_reclaim_stat	reclaim_stat;
> +	struct lru_cost			balance;
>  	/* Evictions & activations on the inactive file list */
>  	atomic_long_t			inactive_age;
>  #ifdef CONFIG_MEMCG
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 178f084365c2..c461ce0533da 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -295,6 +295,8 @@ extern unsigned long nr_free_pagecache_pages(void);
>  
>  
>  /* linux/mm/swap.c */
> +extern void lru_note_cost(struct lruvec *lruvec, bool file,
> +			  unsigned int nr_pages);
>  extern void lru_cache_add(struct page *);
>  extern void lru_cache_putback(struct page *page);
>  extern void lru_add_page_tail(struct page *page, struct page *page_tail,
> diff --git a/mm/swap.c b/mm/swap.c
> index 814e3a2e54b4..645d21242324 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -249,15 +249,10 @@ void rotate_reclaimable_page(struct page *page)
>  	}
>  }
>  
> -static void update_page_reclaim_stat(struct lruvec *lruvec,
> -				     int file, int rotated,
> -				     unsigned int nr_pages)
> +void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  {
> -	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -
> -	reclaim_stat->recent_scanned[file] += nr_pages;
> -	if (rotated)
> -		reclaim_stat->recent_rotated[file] += nr_pages;
> +	lruvec->balance.numer[file] += nr_pages;
> +	lruvec->balance.denom += nr_pages;

balance.numer[0] + balance.numer[1] == balance.denom
so we could remove denom, at least for the moment?
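
A quick standalone sketch (plain Python, not kernel code) of the bookkeeping
this patch introduces, illustrating the observation above: because
lru_note_cost() bumps one numerator and the shared denominator by the same
amount, numer[0] + numer[1] == denom does hold here - at least until the decay
in get_scan_count() halves the counters with integer division, and possibly
until later patches in the series record other events:

```python
# Toy model of struct lru_cost and lru_note_cost() from this patch.
# Index 0 is anon, index 1 is file, mirroring the kernel convention.

ANON, FILE = 0, 1

class LruCost:
    def __init__(self):
        self.numer = [0, 0]   # per-type observed reclaim cost
        self.denom = 0        # shared denominator

    def note(self, file, nr_pages):
        # lru_note_cost(): both counters grow by the same amount
        self.numer[file] += nr_pages
        self.denom += nr_pages

cost = LruCost()
cost.note(ANON, 32)
cost.note(FILE, 512)
cost.note(FILE, 64)

# The invariant Minchan points out: before any decay, the numerators
# sum to the denominator.
assert cost.numer[ANON] + cost.numer[FILE] == cost.denom
```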

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 08/10] mm: deactivations shouldn't bias the LRU balance
  2016-06-06 19:48 ` [PATCH 08/10] mm: deactivations shouldn't bias the LRU balance Johannes Weiner
@ 2016-06-08  8:15   ` Minchan Kim
  2016-06-08 12:57   ` Michal Hocko
  1 sibling, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2016-06-08  8:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 03:48:34PM -0400, Johannes Weiner wrote:
> Operations like MADV_FREE, FADV_DONTNEED etc. currently move any
> affected active pages to the inactive list to accelerate their reclaim
> (good) but also steer page reclaim toward that LRU type, or away from
> the other (bad).
> 
> The reason why this is undesirable is that such operations are not
> part of the regular page aging cycle, and rather a fluke that doesn't
> say much about the remaining pages on that list. They might all be in
> heavy use. But once the chunk of easy victims has been purged, the VM
> continues to apply elevated pressure on the remaining hot pages. The
> other LRU, meanwhile, might have easily reclaimable pages, and there
> was never a need to steer away from it in the first place.
> 
> As the previous patch outlined, we should focus on recording actually
> observed cost to steer the balance rather than speculating about the
> potential value of one LRU list over the other. In that spirit, leave
> explicitly deactivated pages to the LRU algorithm to pick up, and let
> rotations decide which list is the easiest to reclaim.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Nice description. Agreed.

Acked-by: Minchan Kim <minchan@kernel.org>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 09/10] mm: only count actual rotations as LRU reclaim cost
  2016-06-06 19:48 ` [PATCH 09/10] mm: only count actual rotations as LRU reclaim cost Johannes Weiner
@ 2016-06-08  8:19   ` Minchan Kim
  2016-06-08 13:18   ` Michal Hocko
  1 sibling, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2016-06-08  8:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 06, 2016 at 03:48:35PM -0400, Johannes Weiner wrote:
> Noting a reference on an active file page but still deactivating it
> represents a smaller cost of reclaim than noting a referenced
> anonymous page and actually physically rotating it back to the head.
> The file page *might* refault later on, but it's definite progress
> toward freeing pages, whereas rotating the anonymous page costs us
> real time without making progress toward the reclaim goal.
> 
> Don't treat both events as equal. The following patch will hook up LRU
> balancing to cache and swap refaults, which are a much more concrete
> cost signal for reclaiming one list over the other. Remove the
> maybe-IO cost bias from page references, and only note the CPU cost
> for actual rotations that prevent the pages from getting reclaimed.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing
  2016-06-06 19:48 ` [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing Johannes Weiner
  2016-06-07  2:20   ` Rik van Riel
  2016-06-08  8:03   ` Minchan Kim
@ 2016-06-08 12:31   ` Michal Hocko
  2 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2016-06-08 12:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:32, Johannes Weiner wrote:
> When the splitlru patches divided page cache and swap-backed pages
> into separate LRU lists, the pressure balance between the lists was
> biased to account for the fact that streaming IO can cause memory
> pressure with a flood of pages that are used only once. New page cache
> additions would tip the balance toward the file LRU, and repeat access
> would neutralize that bias again. This ensured that page reclaim would
> always go for used-once cache first.
> 
> Since e9868505987a ("mm,vmscan: only evict file pages when we have
> plenty"), page reclaim generally skips over swap-backed memory
> entirely as long as there is used-once cache present, and will apply
> the LRU balancing when only repeatedly accessed cache pages are left -
> at which point the previous use-once bias will have been neutralized.
> 
> This makes the use-once cache balancing bias unnecessary. Remove it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/swap.c | 11 -----------
>  1 file changed, 11 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 576c721f210b..814e3a2e54b4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -264,7 +264,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
>  			    void *arg)
>  {
>  	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> -		int file = page_is_file_cache(page);
>  		int lru = page_lru_base_type(page);
>  
>  		del_page_from_lru_list(page, lruvec, lru);
> @@ -274,7 +273,6 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
>  		trace_mm_lru_activate(page);
>  
>  		__count_vm_event(PGACTIVATE);
> -		update_page_reclaim_stat(lruvec, file, 1, hpage_nr_pages(page));
>  	}
>  }
>  
> @@ -797,8 +795,6 @@ EXPORT_SYMBOL(__pagevec_release);
>  void lru_add_page_tail(struct page *page, struct page *page_tail,
>  		       struct lruvec *lruvec, struct list_head *list)
>  {
> -	const int file = 0;
> -
>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>  	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>  	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
> @@ -833,20 +829,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>  static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>  				 void *arg)
>  {
> -	int file = page_is_file_cache(page);
> -	int active = PageActive(page);
>  	enum lru_list lru = page_lru(page);
> -	bool new = (bool)arg;
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
>  	SetPageLRU(page);
>  	add_page_to_lru_list(page, lruvec, lru);
>  
> -	if (new)
> -		update_page_reclaim_stat(lruvec, file, active,
> -					 hpage_nr_pages(page));
> -
>  	trace_mm_lru_insertion(page, lru);
>  }
>  
> -- 
> 2.8.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-06 19:48 ` [PATCH 07/10] mm: base LRU balancing on an explicit cost model Johannes Weiner
                     ` (2 preceding siblings ...)
  2016-06-08  8:14   ` Minchan Kim
@ 2016-06-08 12:51   ` Michal Hocko
  2016-06-08 16:16     ` Johannes Weiner
  3 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2016-06-08 12:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:33, Johannes Weiner wrote:
> Currently, scan pressure between the anon and file LRU lists is
> balanced based on a mixture of reclaim efficiency and a somewhat vague
> notion of "value" of having certain pages in memory over others. That
> concept of value is problematic, because it has caused us to count any
> event that remotely makes one LRU list more or less preferable for
> reclaim, even when these events are not directly comparable to each
> other and impose very different costs on the system - such as a
> referenced file page that we still deactivate and a referenced
> anonymous page that we actually rotate back to the head of the list.
> 
> There is also conceptual overlap with the LRU algorithm itself. By
> rotating recently used pages instead of reclaiming them, the algorithm
> already biases the applied scan pressure based on page value. Thus,
> when rebalancing scan pressure due to rotations, we should think of
> reclaim cost, and leave assessing the page value to the LRU algorithm.
> 
> Lastly, considering both value-increasing as well as value-decreasing
> events can sometimes cause the same type of event to be counted twice,
> i.e. how rotating a page increases the LRU value, while reclaiming it
> successfully decreases the value. In itself this will balance out fine,
> but it quietly skews the impact of events that are only recorded once.
> 
> The abstract metric of "value", the murky relationship with the LRU
> algorithm, and accounting both negative and positive events make the
> current pressure balancing model hard to reason about and modify.
> 
> In preparation for thrashing-based LRU balancing, this patch switches
> to a balancing model of accounting the concrete, actually observed
> cost of reclaiming one LRU over another. For now, that cost includes
> pages that are scanned but rotated back to the list head.

This makes a lot of sense to me

> Subsequent
> patches will add consideration for IO caused by refaulting recently
> evicted pages. The idea is to primarily scan the LRU that thrashes the
> least, and secondarily scan the LRU that needs the least amount of
> work to free memory.
> 
> Rename struct zone_reclaim_stat to struct lru_cost, and move from two
> separate value ratios for the LRU lists to a relative LRU cost metric
> with a shared denominator.

I just do not like the overly generic `numer'. I guess cost or price would
fit better and look better in the code as well. Up to you though...

> Then make everything that affects the cost go through a new
> lru_note_cost() function.

Just curious, have you tried to measure just the effect of this change
without the rest of the series? I do not expect it would show large
differences because we are not doing SCAN_FRACT most of the time...

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/mmzone.h | 23 +++++++++++------------
>  include/linux/swap.h   |  2 ++
>  mm/swap.c              | 15 +++++----------
>  mm/vmscan.c            | 35 +++++++++++++++--------------------
>  4 files changed, 33 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 02069c23486d..4d257d00fbf5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -191,22 +191,21 @@ static inline int is_active_lru(enum lru_list lru)
>  	return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
>  }
>  
> -struct zone_reclaim_stat {
> -	/*
> -	 * The pageout code in vmscan.c keeps track of how many of the
> -	 * mem/swap backed and file backed pages are referenced.
> -	 * The higher the rotated/scanned ratio, the more valuable
> -	 * that cache is.
> -	 *
> -	 * The anon LRU stats live in [0], file LRU stats in [1]
> -	 */
> -	unsigned long		recent_rotated[2];
> -	unsigned long		recent_scanned[2];
> +/*
> + * This tracks cost of reclaiming one LRU type - file or anon - over
> + * the other. As the observed cost of pressure on one type increases,
> + * the scan balance in vmscan.c tips toward the other type.
> + *
> + * The recorded cost for anon is in numer[0], file in numer[1].
> + */
> +struct lru_cost {
> +	unsigned long		numer[2];
> +	unsigned long		denom;
>  };
>  
>  struct lruvec {
>  	struct list_head		lists[NR_LRU_LISTS];
> -	struct zone_reclaim_stat	reclaim_stat;
> +	struct lru_cost			balance;
>  	/* Evictions & activations on the inactive file list */
>  	atomic_long_t			inactive_age;
>  #ifdef CONFIG_MEMCG
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 178f084365c2..c461ce0533da 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -295,6 +295,8 @@ extern unsigned long nr_free_pagecache_pages(void);
>  
>  
>  /* linux/mm/swap.c */
> +extern void lru_note_cost(struct lruvec *lruvec, bool file,
> +			  unsigned int nr_pages);
>  extern void lru_cache_add(struct page *);
>  extern void lru_cache_putback(struct page *page);
>  extern void lru_add_page_tail(struct page *page, struct page *page_tail,
> diff --git a/mm/swap.c b/mm/swap.c
> index 814e3a2e54b4..645d21242324 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -249,15 +249,10 @@ void rotate_reclaimable_page(struct page *page)
>  	}
>  }
>  
> -static void update_page_reclaim_stat(struct lruvec *lruvec,
> -				     int file, int rotated,
> -				     unsigned int nr_pages)
> +void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  {
> -	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -
> -	reclaim_stat->recent_scanned[file] += nr_pages;
> -	if (rotated)
> -		reclaim_stat->recent_rotated[file] += nr_pages;
> +	lruvec->balance.numer[file] += nr_pages;
> +	lruvec->balance.denom += nr_pages;
>  }
>  
>  static void __activate_page(struct page *page, struct lruvec *lruvec,
> @@ -543,7 +538,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  
>  	if (active)
>  		__count_vm_event(PGDEACTIVATE);
> -	update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
> +	lru_note_cost(lruvec, !file, hpage_nr_pages(page));
>  }
>  
>  
> @@ -560,7 +555,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>  		add_page_to_lru_list(page, lruvec, lru);
>  
>  		__count_vm_event(PGDEACTIVATE);
> -		update_page_reclaim_stat(lruvec, file, 0, hpage_nr_pages(page));
> +		lru_note_cost(lruvec, !file, hpage_nr_pages(page));
>  	}
>  }
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8503713bb60e..06e381e1004c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1492,7 +1492,6 @@ static int too_many_isolated(struct zone *zone, int file,
>  static noinline_for_stack void
>  putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  {
> -	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>  	struct zone *zone = lruvec_zone(lruvec);
>  	LIST_HEAD(pages_to_free);
>  
> @@ -1521,8 +1520,13 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  		if (is_active_lru(lru)) {
>  			int file = is_file_lru(lru);
>  			int numpages = hpage_nr_pages(page);
> -			reclaim_stat->recent_rotated[file] += numpages;
> +			/*
> +			 * Rotating pages costs CPU without actually
> +			 * progressing toward the reclaim goal.
> +			 */
> +			lru_note_cost(lruvec, file, numpages);
>  		}
> +
>  		if (put_page_testzero(page)) {
>  			__ClearPageLRU(page);
>  			__ClearPageActive(page);
> @@ -1577,7 +1581,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	isolate_mode_t isolate_mode = 0;
>  	int file = is_file_lru(lru);
>  	struct zone *zone = lruvec_zone(lruvec);
> -	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1601,7 +1604,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  
>  	update_lru_size(lruvec, lru, -nr_taken);
>  	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> -	reclaim_stat->recent_scanned[file] += nr_taken;
>  
>  	if (global_reclaim(sc)) {
>  		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> @@ -1773,7 +1775,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	LIST_HEAD(l_active);
>  	LIST_HEAD(l_inactive);
>  	struct page *page;
> -	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>  	unsigned long nr_rotated = 0;
>  	isolate_mode_t isolate_mode = 0;
>  	int file = is_file_lru(lru);
> @@ -1793,7 +1794,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  
>  	update_lru_size(lruvec, lru, -nr_taken);
>  	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
> -	reclaim_stat->recent_scanned[file] += nr_taken;
>  
>  	if (global_reclaim(sc))
>  		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
> @@ -1851,7 +1851,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	 * helps balance scan pressure between file and anonymous pages in
>  	 * get_scan_count.
>  	 */
> -	reclaim_stat->recent_rotated[file] += nr_rotated;
> +	lru_note_cost(lruvec, file, nr_rotated);
>  
>  	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>  	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> @@ -1947,7 +1947,6 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  			   unsigned long *lru_pages)
>  {
>  	int swappiness = mem_cgroup_swappiness(memcg);
> -	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>  	u64 fraction[2];
>  	u64 denominator = 0;	/* gcc */
>  	struct zone *zone = lruvec_zone(lruvec);
> @@ -2072,14 +2071,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
>  
>  	spin_lock_irq(&zone->lru_lock);
> -	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
> -		reclaim_stat->recent_scanned[0] /= 2;
> -		reclaim_stat->recent_rotated[0] /= 2;
> -	}
> -
> -	if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
> -		reclaim_stat->recent_scanned[1] /= 2;
> -		reclaim_stat->recent_rotated[1] /= 2;
> +	if (unlikely(lruvec->balance.denom > (anon + file) / 8)) {
> +		lruvec->balance.numer[0] /= 2;
> +		lruvec->balance.numer[1] /= 2;
> +		lruvec->balance.denom /= 2;
>  	}
>  
>  	/*
> @@ -2087,11 +2082,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * proportional to the fraction of recently scanned pages on
>  	 * each list that were recently referenced and in active use.
>  	 */
> -	ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
> -	ap /= reclaim_stat->recent_rotated[0] + 1;
> +	ap = anon_prio * (lruvec->balance.denom + 1);
> +	ap /= lruvec->balance.numer[0] + 1;
>  
> -	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
> -	fp /= reclaim_stat->recent_rotated[1] + 1;
> +	fp = file_prio * (lruvec->balance.denom + 1);
> +	fp /= lruvec->balance.numer[1] + 1;
>  	spin_unlock_irq(&zone->lru_lock);
>  
>  	fraction[0] = ap;
> -- 
> 2.8.3
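
To make the get_scan_count() arithmetic above concrete, here is a rough
standalone model (plain Python with made-up numbers, not kernel code) of how
the quoted hunk turns recorded costs into relative scan pressure. anon_prio
and file_prio stand in for the swappiness-derived priorities; the LRU type
that accrued most of the cost ends up with the smaller share:

```python
# Model of the new pressure calculation: a shared denominator and
# per-type cost numerators, as in the hunk above. Integer division
# mirrors the kernel's u64 arithmetic.

def scan_fraction(anon_prio, file_prio, numer, denom):
    ap = anon_prio * (denom + 1) // (numer[0] + 1)  # anon pressure
    fp = file_prio * (denom + 1) // (numer[1] + 1)  # file pressure
    return ap, fp

# e.g. swappiness 60: anon_prio = 60, file_prio = 200 - 60 = 140.
# Anon recorded 900 of the 1000 cost units, file only 100:
ap, fp = scan_fraction(60, 140, numer=[900, 100], denom=1000)

# Reclaim pressure tips toward the cheaper (file) list.
assert fp > ap
```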

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 08/10] mm: deactivations shouldn't bias the LRU balance
  2016-06-06 19:48 ` [PATCH 08/10] mm: deactivations shouldn't bias the LRU balance Johannes Weiner
  2016-06-08  8:15   ` Minchan Kim
@ 2016-06-08 12:57   ` Michal Hocko
  1 sibling, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2016-06-08 12:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:34, Johannes Weiner wrote:
> Operations like MADV_FREE, FADV_DONTNEED etc. currently move any
> affected active pages to the inactive list to accelerate their reclaim
> (good) but also steer page reclaim toward that LRU type, or away from
> the other (bad).
> 
> The reason why this is undesirable is that such operations are not
> part of the regular page aging cycle, and rather a fluke that doesn't
> say much about the remaining pages on that list. They might all be in
> heavy use. But once the chunk of easy victims has been purged, the VM
> continues to apply elevated pressure on the remaining hot pages. The
> other LRU, meanwhile, might have easily reclaimable pages, and there
> was never a need to steer away from it in the first place.
> 
> As the previous patch outlined, we should focus on recording actually
> observed cost to steer the balance rather than speculating about the
> potential value of one LRU list over the other. In that spirit, leave
> explicitly deactivated pages to the LRU algorithm to pick up, and let
> rotations decide which list is the easiest to reclaim.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/swap.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 645d21242324..ae07b469ddca 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -538,7 +538,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  
>  	if (active)
>  		__count_vm_event(PGDEACTIVATE);
> -	lru_note_cost(lruvec, !file, hpage_nr_pages(page));
>  }
>  
>  
> @@ -546,7 +545,6 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>  			    void *arg)
>  {
>  	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		int file = page_is_file_cache(page);
>  		int lru = page_lru_base_type(page);
>  
>  		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> @@ -555,7 +553,6 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>  		add_page_to_lru_list(page, lruvec, lru);
>  
>  		__count_vm_event(PGDEACTIVATE);
> -		lru_note_cost(lruvec, !file, hpage_nr_pages(page));
>  	}
>  }
>  
> -- 
> 2.8.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 09/10] mm: only count actual rotations as LRU reclaim cost
  2016-06-06 19:48 ` [PATCH 09/10] mm: only count actual rotations as LRU reclaim cost Johannes Weiner
  2016-06-08  8:19   ` Minchan Kim
@ 2016-06-08 13:18   ` Michal Hocko
  1 sibling, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2016-06-08 13:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:35, Johannes Weiner wrote:
> Noting a reference on an active file page but still deactivating it
> represents a smaller cost of reclaim than noting a referenced
> anonymous page and actually physically rotating it back to the head.
> The file page *might* refault later on, but it's definite progress
> toward freeing pages, whereas rotating the anonymous page costs us
> real time without making progress toward the reclaim goal.
> 
> Don't treat both events as equal. The following patch will hook up LRU
> balancing to cache and swap refaults, which are a much more concrete
> cost signal for reclaiming one list over the other. Remove the
> maybe-IO cost bias from page references, and only note the CPU cost
> for actual rotations that prevent the pages from getting reclaimed.

The changelog was quite hard to digest for me but I guess I got your
point. The change itself makes sense to me because noting the LRU
cost for pages which we intentionally keep on the active list because
they are really precious is reasonable. Which is not the case for
referenced pages in general because we only find out whether they are
really needed when we encounter them on the inactive list.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 06e381e1004c..acbd212eab6e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1821,7 +1821,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  
>  		if (page_referenced(page, 0, sc->target_mem_cgroup,
>  				    &vm_flags)) {
> -			nr_rotated += hpage_nr_pages(page);
>  			/*
>  			 * Identify referenced, file-backed active pages and
>  			 * give them one more trip around the active list. So
> @@ -1832,6 +1831,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  			 * so we ignore them here.
>  			 */
>  			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
> +				nr_rotated += hpage_nr_pages(page);
>  				list_add(&page->lru, &l_active);
>  				continue;
>  			}
> @@ -1846,10 +1846,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	 */
>  	spin_lock_irq(&zone->lru_lock);
>  	/*
> -	 * Count referenced pages from currently used mappings as rotated,
> -	 * even though only some of them are actually re-activated.  This
> -	 * helps balance scan pressure between file and anonymous pages in
> -	 * get_scan_count.
> +	 * Rotating pages costs CPU without actually
> +	 * progressing toward the reclaim goal.
>  	 */
>  	lru_note_cost(lruvec, file, nr_rotated);
>  
> -- 
> 2.8.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-06 19:48 ` [PATCH 10/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
  2016-06-06 19:22   ` kbuild test robot
  2016-06-06 23:50   ` Tim Chen
@ 2016-06-08 13:58   ` Michal Hocko
  2016-06-10  2:19   ` Minchan Kim
  3 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2016-06-08 13:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Mon 06-06-16 15:48:36, Johannes Weiner wrote:
> Since the LRUs were split into anon and file lists, the VM has been
> balancing between page cache and anonymous pages based on per-list
> ratios of scanned vs. rotated pages. In most cases that tips page
> reclaim towards the list that is easier to reclaim and has the fewest
> actively used pages, but there are a few problems with it:
> 
> 1. Refaults and in-memory rotations are weighted the same way, even
>    though one costs IO and the other costs CPU. When the balance is
>    off, the page cache can be thrashing while anonymous pages are aged
>    comparably slower and thus have more time to get even their coldest
>    pages referenced. The VM would consider this a fair equilibrium.
> 
> 2. The page cache usually has a share of use-once pages that will
>    further dilute its scanned/rotated ratio in the above-mentioned
>    scenario. This can cause scanning of the anonymous list to cease
>    almost entirely - again while the page cache is thrashing and
>    IO-bound.
> 
> Historically, swap has been an emergency overflow for high memory
> pressure, and we avoided using it as long as new page allocations
> could be served from recycling page cache. However, when recycling
> page cache incurs a higher cost in IO than swapping out a few unused
> anonymous pages would, it makes sense to increase swap pressure.
> 
> In order to accomplish this, we can extend the thrash detection code
> that currently detects workingset changes within the page cache: when
> inactive cache pages are thrashing, the VM raises LRU pressure on the
> otherwise protected active file list to increase competition. However,
> when active pages begin refaulting as well, it means that the page
> cache is thrashing as a whole and the LRU balance should tip toward
> anonymous. This is what this patch implements.
> 
> To tell inactive from active refaults, a page flag is introduced that
> marks pages that have been on the active list in their lifetime. This
> flag is remembered in the shadow page entry on reclaim, and restored
> when the page refaults. It is also set on anonymous pages during
> swapin. When a page with that flag set is added to the LRU, the LRU
> balance is adjusted for the IO cost of reclaiming the thrashing list.
> 
> Rotations continue to influence the LRU balance as well, but with a
> different weight factor. That factor is statically chosen such that
> refaults are considered more costly than rotations at this point. We
> might want to revisit this for ultra-fast swap or secondary memory
> devices, where rotating referenced pages might be more costly than
> swapping or relocating them directly and have some of them refault.

The approach seems sensible to me. The additional page flag is far from
nice, to say the least. Maybe we can override some existing one which
doesn't make any other sense for LRU pages, e.g. PG_slab, although we
might have some explicit VM_BUG_ONs etc., so this could get tricky.

I have to think about this more.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h         |   6 +-
>  include/linux/page-flags.h     |   2 +
>  include/linux/swap.h           |  10 ++-
>  include/trace/events/mmflags.h |   1 +
>  mm/filemap.c                   |   9 +--
>  mm/migrate.c                   |   4 ++
>  mm/swap.c                      |  38 ++++++++++-
>  mm/swap_state.c                |   1 +
>  mm/vmscan.c                    |   5 +-
>  mm/vmstat.c                    |   6 +-
>  mm/workingset.c                | 142 +++++++++++++++++++++++++++++++----------
>  11 files changed, 172 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4d257d00fbf5..d7aaee25b536 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -148,9 +148,9 @@ enum zone_stat_item {
>  	NUMA_LOCAL,		/* allocation from local node */
>  	NUMA_OTHER,		/* allocation from other node */
>  #endif
> -	WORKINGSET_REFAULT,
> -	WORKINGSET_ACTIVATE,
> -	WORKINGSET_NODERECLAIM,
> +	REFAULT_INACTIVE_FILE,
> +	REFAULT_ACTIVE_FILE,
> +	REFAULT_NODERECLAIM,
>  	NR_ANON_TRANSPARENT_HUGEPAGES,
>  	NR_FREE_CMA_PAGES,
>  	NR_VM_ZONE_STAT_ITEMS };
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e5a32445f930..a1b9d7dddd68 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -79,6 +79,7 @@ enum pageflags {
>  	PG_dirty,
>  	PG_lru,
>  	PG_active,
> +	PG_workingset,
>  	PG_slab,
>  	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
>  	PG_arch_1,
> @@ -259,6 +260,7 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
>  PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
>  PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
>  	TESTCLEARFLAG(Active, active, PF_HEAD)
> +PAGEFLAG(Workingset, workingset, PF_HEAD)
>  __PAGEFLAG(Slab, slab, PF_NO_TAIL)
>  __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
>  PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c461ce0533da..9923b51ee8e9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -250,7 +250,7 @@ struct swap_info_struct {
>  
>  /* linux/mm/workingset.c */
>  void *workingset_eviction(struct address_space *mapping, struct page *page);
> -bool workingset_refault(void *shadow);
> +void workingset_refault(struct page *page, void *shadow);
>  void workingset_activation(struct page *page);
>  extern struct list_lru workingset_shadow_nodes;
>  
> @@ -295,8 +295,12 @@ extern unsigned long nr_free_pagecache_pages(void);
>  
>  
>  /* linux/mm/swap.c */
> -extern void lru_note_cost(struct lruvec *lruvec, bool file,
> -			  unsigned int nr_pages);
> +enum lru_cost_type {
> +	COST_CPU,
> +	COST_IO,
> +};
> +extern void lru_note_cost(struct lruvec *lruvec, enum lru_cost_type cost,
> +			  bool file, unsigned int nr_pages);
>  extern void lru_cache_add(struct page *);
>  extern void lru_cache_putback(struct page *page);
>  extern void lru_add_page_tail(struct page *page, struct page *page_tail,
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 43cedbf0c759..bc05e0ac1b8c 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -86,6 +86,7 @@
>  	{1UL << PG_dirty,		"dirty"		},		\
>  	{1UL << PG_lru,			"lru"		},		\
>  	{1UL << PG_active,		"active"	},		\
> +	{1UL << PG_workingset,		"workingset"	},		\
>  	{1UL << PG_slab,		"slab"		},		\
>  	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
>  	{1UL << PG_arch_1,		"arch_1"	},		\
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9665b1d4f318..1b356b47381b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -700,12 +700,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  		 * data from the working set, only to cache data that will
>  		 * get overwritten with something else, is a waste of memory.
>  		 */
> -		if (!(gfp_mask & __GFP_WRITE) &&
> -		    shadow && workingset_refault(shadow)) {
> -			SetPageActive(page);
> -			workingset_activation(page);
> -		} else
> -			ClearPageActive(page);
> +		WARN_ON_ONCE(PageActive(page));
> +		if (!(gfp_mask & __GFP_WRITE) && shadow)
> +			workingset_refault(page, shadow);
>  		lru_cache_add(page);
>  	}
>  	return ret;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9baf41c877ff..115d49441c6c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -544,6 +544,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
>  		SetPageActive(newpage);
>  	} else if (TestClearPageUnevictable(page))
>  		SetPageUnevictable(newpage);
> +	if (PageWorkingset(page))
> +		SetPageWorkingset(newpage);
>  	if (PageChecked(page))
>  		SetPageChecked(newpage);
>  	if (PageMappedToDisk(page))
> @@ -1809,6 +1811,8 @@ fail_putback:
>  		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>  
>  		/* Reverse changes made by migrate_page_copy() */
> +		if (TestClearPageWorkingset(new_page))
> +			ClearPageWorkingset(page);
>  		if (TestClearPageActive(new_page))
>  			SetPageActive(page);
>  		if (TestClearPageUnevictable(new_page))
> diff --git a/mm/swap.c b/mm/swap.c
> index ae07b469ddca..cb6773e1424e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -249,8 +249,28 @@ void rotate_reclaimable_page(struct page *page)
>  	}
>  }
>  
> -void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
> +void lru_note_cost(struct lruvec *lruvec, enum lru_cost_type cost,
> +		   bool file, unsigned int nr_pages)
>  {
> +	if (cost == COST_IO) {
> +		/*
> +		 * Reflect the relative reclaim cost between incurring
> +		 * IO from refaults on one hand, and incurring CPU
> +		 * cost from rotating scanned pages on the other.
> +		 *
> +		 * XXX: For now, the relative cost factor for IO is
> +		 * set statically to outweigh the cost of rotating
> +		 * referenced pages. This might change with ultra-fast
> +		 * IO devices, or with secondary memory devices that
> +		 * allow users continued access of swapped out pages.
> +		 *
> +		 * Until then, the value is chosen simply such that we
> +		 * balance for IO cost first and optimize for CPU only
> +		 * once the thrashing subsides.
> +		 */
> +		nr_pages *= SWAP_CLUSTER_MAX;
> +	}
> +
>  	lruvec->balance.numer[file] += nr_pages;
>  	lruvec->balance.denom += nr_pages;
>  }
> @@ -262,6 +282,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
>  		int lru = page_lru_base_type(page);
>  
>  		del_page_from_lru_list(page, lruvec, lru);
> +		SetPageWorkingset(page);
>  		SetPageActive(page);
>  		lru += LRU_ACTIVE;
>  		add_page_to_lru_list(page, lruvec, lru);
> @@ -821,13 +842,28 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>  static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>  				 void *arg)
>  {
> +	unsigned int nr_pages = hpage_nr_pages(page);
>  	enum lru_list lru = page_lru(page);
> +	bool active = is_active_lru(lru);
> +	bool file = is_file_lru(lru);
> +	bool new = (bool)arg;
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
>  	SetPageLRU(page);
>  	add_page_to_lru_list(page, lruvec, lru);
>  
> +	if (new) {
> +		/*
> +		 * If the workingset is thrashing, note the IO cost of
> +		 * reclaiming that list and steer reclaim away from it.
> +		 */
> +		if (PageWorkingset(page))
> +			lru_note_cost(lruvec, COST_IO, file, nr_pages);
> +		else if (active)
> +			SetPageWorkingset(page);
> +	}
> +
>  	trace_mm_lru_insertion(page, lru);
>  }
>  
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 5400f814ae12..43561a56ba5d 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -365,6 +365,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  			/*
>  			 * Initiate read into locked page and return.
>  			 */
> +			SetPageWorkingset(new_page);
>  			lru_cache_add(new_page);
>  			*new_page_allocated = true;
>  			return new_page;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index acbd212eab6e..b2cb4f4f9d31 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1216,6 +1216,7 @@ activate_locked:
>  		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
>  			try_to_free_swap(page);
>  		VM_BUG_ON_PAGE(PageActive(page), page);
> +		SetPageWorkingset(page);
>  		SetPageActive(page);
>  		pgactivate++;
>  keep_locked:
> @@ -1524,7 +1525,7 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  			 * Rotating pages costs CPU without actually
>  			 * progressing toward the reclaim goal.
>  			 */
> -			lru_note_cost(lruvec, file, numpages);
> +			lru_note_cost(lruvec, COST_CPU, file, numpages);
>  		}
>  
>  		if (put_page_testzero(page)) {
> @@ -1849,7 +1850,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	 * Rotating pages costs CPU without actually
>  	 * progressing toward the reclaim goal.
>  	 */
> -	lru_note_cost(lruvec, file, nr_rotated);
> +	lru_note_cost(lruvec, COST_CPU, file, nr_rotated);
>  
>  	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>  	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 77e42ef388c2..6c8d658f5b7f 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -727,9 +727,9 @@ const char * const vmstat_text[] = {
>  	"numa_local",
>  	"numa_other",
>  #endif
> -	"workingset_refault",
> -	"workingset_activate",
> -	"workingset_nodereclaim",
> +	"refault_inactive_file",
> +	"refault_active_file",
> +	"refault_nodereclaim",
>  	"nr_anon_transparent_hugepages",
>  	"nr_free_cma",
>  
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 8a75f8d2916a..261cf583fb62 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -118,7 +118,7 @@
>   * the only thing eating into inactive list space is active pages.
>   *
>   *
> - *		Activating refaulting pages
> + *		Refaulting inactive pages
>   *
>   * All that is known about the active list is that the pages have been
>   * accessed more than once in the past.  This means that at any given
> @@ -131,6 +131,10 @@
>   * used less frequently than the refaulting page - or even not used at
>   * all anymore.
>   *
> + * That means, if inactive cache is refaulting with a suitable refault
> + * distance, we assume the cache workingset is transitioning and put
> + * pressure on the existing cache pages on the active list.
> + *
>   * If this is wrong and demotion kicks in, the pages which are truly
>   * used more frequently will be reactivated while the less frequently
>   * used once will be evicted from memory.
> @@ -139,6 +143,30 @@
>   * and the used pages get to stay in cache.
>   *
>   *
> + *		Refaulting active pages
> + *
> + * If, on the other hand, the refaulting pages have been recently
> + * deactivated, it means that the active list is no longer protecting
> + * actively used cache from reclaim: the cache is not transitioning to
> + * a different workingset, the existing workingset is thrashing in the
> + * space allocated to the page cache.
> + *
> + * When that is the case, mere activation of the refaulting pages is
> + * not enough. The page reclaim code needs to be informed of the high
> + * IO cost associated with the continued reclaim of page cache, so
> + * that it can steer pressure to the anonymous list.
> + *
> + * Just as when refaulting inactive pages, it's possible that there
> + * are cold(er) anonymous pages that can be swapped and forgotten in
> + * order to increase the space available to the page cache as a whole.
> + *
> + * If anonymous pages start thrashing as well, the reclaim scanner
> + * will aim for the list that imposes the lowest cost on the system,
> + * where cost is defined as:
> + *
> + *	refault rate * relative IO cost (as determined by swappiness)
> + *
> + *
>   *		Implementation
>   *
>   * For each zone's file LRU lists, a counter for inactive evictions
> @@ -150,10 +178,25 @@
>   *
>   * On cache misses for which there are shadow entries, an eligible
>   * refault distance will immediately activate the refaulting page.
> + *
> + * On activation, cache pages are marked PageWorkingset, which is not
> + * cleared until the page is freed. Shadow entries will remember that
> + * flag to be able to tell inactive from active refaults. Refaults of
> + * previous workingset pages will restore that page flag and inform
> + * page reclaim of the IO cost.
> + *
> + * XXX: Since we don't track anonymous references, every swap-in event
> + * is considered a workingset refault - regardless of distance. Swapin
> + * floods will thus always raise the assumed IO cost of reclaiming the
> + * anonymous LRU lists, even if the pages haven't been used recently.
> + * Temporary events don't matter that much other than they might delay
> + * the stabilization a bit. But during continuous thrashing, anonymous
> + * pages can have a leg-up against page cache. This might need fixing
> + * for ultra-fast IO devices or secondary memory types.
>   */
>  
> -#define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
> -			 ZONES_SHIFT + NODES_SHIFT +	\
> +#define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY +			\
> +			 1 + ZONES_SHIFT + NODES_SHIFT +		\
>  			 MEM_CGROUP_ID_SHIFT)
>  #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
>  
> @@ -167,24 +210,29 @@
>   */
>  static unsigned int bucket_order __read_mostly;
>  
> -static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
> +static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction,
> +			 bool workingset)
>  {
>  	eviction >>= bucket_order;
>  	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
>  	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
>  	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> +	eviction = (eviction << 1) | workingset;
>  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
>  
>  	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
>  }
>  
>  static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
> -			  unsigned long *evictionp)
> +			  unsigned long *evictionp, bool *workingsetp)
>  {
>  	unsigned long entry = (unsigned long)shadow;
>  	int memcgid, nid, zid;
> +	bool workingset;
>  
>  	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> +	workingset = entry & 1;
> +	entry >>= 1;
>  	zid = entry & ((1UL << ZONES_SHIFT) - 1);
>  	entry >>= ZONES_SHIFT;
>  	nid = entry & ((1UL << NODES_SHIFT) - 1);
> @@ -195,6 +243,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
>  	*memcgidp = memcgid;
>  	*zonep = NODE_DATA(nid)->node_zones + zid;
>  	*evictionp = entry << bucket_order;
> +	*workingsetp = workingset;
>  }
>  
>  /**
> @@ -220,19 +269,18 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
>  
>  	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  	eviction = atomic_long_inc_return(&lruvec->inactive_age);
> -	return pack_shadow(memcgid, zone, eviction);
> +	return pack_shadow(memcgid, zone, eviction, PageWorkingset(page));
>  }
>  
>  /**
>   * workingset_refault - evaluate the refault of a previously evicted page
> + * @page: the freshly allocated replacement page
>   * @shadow: shadow entry of the evicted page
>   *
>   * Calculates and evaluates the refault distance of the previously
>   * evicted page in the context of the zone it was allocated in.
> - *
> - * Returns %true if the page should be activated, %false otherwise.
>   */
> -bool workingset_refault(void *shadow)
> +void workingset_refault(struct page *page, void *shadow)
>  {
>  	unsigned long refault_distance;
>  	unsigned long active_file;
> @@ -240,10 +288,12 @@ bool workingset_refault(void *shadow)
>  	unsigned long eviction;
>  	struct lruvec *lruvec;
>  	unsigned long refault;
> +	unsigned long anon;
>  	struct zone *zone;
> +	bool workingset;
>  	int memcgid;
>  
> -	unpack_shadow(shadow, &memcgid, &zone, &eviction);
> +	unpack_shadow(shadow, &memcgid, &zone, &eviction, &workingset);
>  
>  	rcu_read_lock();
>  	/*
> @@ -263,40 +313,64 @@ bool workingset_refault(void *shadow)
>  	 * configurations instead.
>  	 */
>  	memcg = mem_cgroup_from_id(memcgid);
> -	if (!mem_cgroup_disabled() && !memcg) {
> -		rcu_read_unlock();
> -		return false;
> -	}
> +	if (!mem_cgroup_disabled() && !memcg)
> +		goto out;
>  	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  	refault = atomic_long_read(&lruvec->inactive_age);
>  	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
> -	rcu_read_unlock();
> +	if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
> +		anon = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
> +		       lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
> +	else
> +		anon = 0;
>  
>  	/*
> -	 * The unsigned subtraction here gives an accurate distance
> -	 * across inactive_age overflows in most cases.
> +	 * Calculate the refault distance.
>  	 *
> -	 * There is a special case: usually, shadow entries have a
> -	 * short lifetime and are either refaulted or reclaimed along
> -	 * with the inode before they get too old.  But it is not
> -	 * impossible for the inactive_age to lap a shadow entry in
> -	 * the field, which can then can result in a false small
> -	 * refault distance, leading to a false activation should this
> -	 * old entry actually refault again.  However, earlier kernels
> -	 * used to deactivate unconditionally with *every* reclaim
> -	 * invocation for the longest time, so the occasional
> -	 * inappropriate activation leading to pressure on the active
> -	 * list is not a problem.
> +	 * The unsigned subtraction here gives an accurate distance
> +	 * across inactive_age overflows in most cases. There is a
> +	 * special case: usually, shadow entries have a short lifetime
> +	 * and are either refaulted or reclaimed along with the inode
> +	 * before they get too old.  But it is not impossible for the
> +	 * inactive_age to lap a shadow entry in the field, which can
> +	 * then can result in a false small refault distance, leading
> +	 * to a false activation should this old entry actually
> +	 * refault again.  However, earlier kernels used to deactivate
> +	 * unconditionally with *every* reclaim invocation for the
> +	 * longest time, so the occasional inappropriate activation
> +	 * leading to pressure on the active list is not a problem.
>  	 */
>  	refault_distance = (refault - eviction) & EVICTION_MASK;
>  
> -	inc_zone_state(zone, WORKINGSET_REFAULT);
> +	/*
> +	 * Compare the distance with the existing workingset. We don't
> +	 * act on pages that couldn't stay resident even with all the
> +	 * memory available to the page cache.
> +	 */
> +	if (refault_distance > active_file + anon)
> +		goto out;
>  
> -	if (refault_distance <= active_file) {
> -		inc_zone_state(zone, WORKINGSET_ACTIVATE);
> -		return true;
> +	/*
> +	 * If inactive cache is refaulting, activate the page to
> +	 * challenge the current cache workingset. The existing cache
> +	 * might be stale, or at least colder than the contender.
> +	 *
> +	 * If active cache is refaulting (PageWorkingset set at time
> +	 * of eviction), it means that the page cache as a whole is
> +	 * thrashing. Restore PageWorkingset to inform the LRU code
> +	 * about the additional cost of reclaiming more page cache.
> +	 */
> +	SetPageActive(page);
> +	atomic_long_inc(&lruvec->inactive_age);
> +
> +	if (workingset) {
> +		SetPageWorkingset(page);
> +		inc_zone_state(zone, REFAULT_ACTIVE_FILE);
> +	} else {
> +		inc_zone_state(zone, REFAULT_INACTIVE_FILE);
>  	}
> -	return false;
> +out:
> +	rcu_read_unlock();
>  }
>  
>  /**
> @@ -433,7 +507,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>  		}
>  	}
>  	BUG_ON(node->count);
> -	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
> +	inc_zone_state(page_zone(virt_to_page(node)), REFAULT_NODERECLAIM);
>  	if (!__radix_tree_delete_node(&mapping->page_tree, node))
>  		BUG();
>  
> -- 
> 2.8.3

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 01/10] mm: allow swappiness that prefers anon over file
  2016-06-08  0:06       ` Minchan Kim
@ 2016-06-08 15:58         ` Johannes Weiner
  2016-06-09  1:01           ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2016-06-08 15:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Wed, Jun 08, 2016 at 09:06:32AM +0900, Minchan Kim wrote:
> On Tue, Jun 07, 2016 at 10:18:18AM -0400, Johannes Weiner wrote:
> > On Tue, Jun 07, 2016 at 09:25:50AM +0900, Minchan Kim wrote:
> > > On Mon, Jun 06, 2016 at 03:48:27PM -0400, Johannes Weiner wrote:
> > > > --- a/Documentation/sysctl/vm.txt
> > > > +++ b/Documentation/sysctl/vm.txt
> > > > @@ -771,14 +771,20 @@ with no ill effects: errors and warnings on these stats are suppressed.)
> > > >  
> > > >  swappiness
> > > >  
> > > > -This control is used to define how aggressive the kernel will swap
> > > > -memory pages.  Higher values will increase agressiveness, lower values
> > > > -decrease the amount of swap.  A value of 0 instructs the kernel not to
> > > > -initiate swap until the amount of free and file-backed pages is less
> > > > -than the high water mark in a zone.
> > > > +This control is used to define the relative IO cost of cache misses
> > > > +between the swap device and the filesystem as a value between 0 and
> > > > +200. At 100, the VM assumes equal IO cost and will thus apply memory
> > > > +pressure to the page cache and swap-backed pages equally. At 0, the
> > > > +kernel will not initiate swap until the amount of free and file-backed
> > > > +pages is less than the high watermark in a zone.
> > > 
> > > Generally, I agree extending swappiness value good but not sure 200 is
> > > enough to represent speed gap between file and swap sotrage in every
> > > cases. - Just nitpick.
> > 
> > How so? You can't give swap more weight than 100%. 200 is the maximum
> > possible value.
> 
> In the old days, swappiness meant how aggressively to reclaim anonymous
> pages in favour of page cache. But when I read your description and the
> changes about swappiness in vm.txt, esp. *relative IO cost*, I feel you
> changed the definition of swappiness to represent the relative IO cost
> between swap storage and file storage. Then, with that, we could balance
> the anonymous and file LRUs with that weight.
> 
> For example, let's assume that in-memory swap storage is 10x faster
> than a slow thumb drive. In that case, the IO cost of swapping 10
> anonymous pages in/out is equal to that of 1 file-backed page
> discard/read.
> 
> I thought it made sense because measuring the speed gap between those
> storages is easier than selecting a vague swappiness tendency.
> 
> With such an approach, I thought 200 is not enough to show the gap,
> because the gap starts from 100.
> Isn't that your intention? If so, to me, the description was rather
> misleading. :(

The way swappiness works never actually changed.

The only thing that changed is that we used to look at referenced
pages (recent_rotated) and *assumed* they would likely cause IO when
reclaimed, whereas with my patches we actually know whether they do.
But swappiness has always been about relative IO cost of the LRUs.

Swappiness defines relative IO cost between file and swap on a scale
from 0 to 200, where 100 is the point of equality. The scale factors
are calculated in get_scan_count() like this:

  anon_prio = swappiness
  file_prio = 200 - swappiness

and those are applied to the recorded cost/value ratios like this:

  ap = anon_prio * scanned / rotated
  fp = file_prio * scanned / rotated

That means if your swap device is 10 times faster than your filesystem
device, and you thus want anon to receive 10x the refaults when the
anon and file pages are used equally, you do this:

  x + 10x = 200
        x = 18 (ish)

So your file priority is ~18 and your swap priority is the remainder
of the range, 200 - 18. You set swappiness to 182.

Now fill in the numbers while assuming all pages on both lists have
been referenced before and will likely refault (or in the new model,
all pages are refaulting):

  fraction[anon] = ap      = 182 * 1 / 1 = 182
  fraction[file] = fp      =  18 * 1 / 1 =  18
     denominator = ap + fp =    182 + 18 = 200

and then calculate the scan target like this:

  scan[type] = (lru_size() >> priority) * fraction[type] / denominator

This will scan and reclaim 9% of the file pages and 91% of the anon
pages. On refault, 9% of the IO will be from the filesystem and 91%
from the swap device.


* Re: [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation
  2016-06-08  7:39   ` Minchan Kim
@ 2016-06-08 16:02     ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-08 16:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Wed, Jun 08, 2016 at 04:39:44PM +0900, Minchan Kim wrote:
> On Mon, Jun 06, 2016 at 03:48:31PM -0400, Johannes Weiner wrote:
> > @@ -832,9 +854,9 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
> >   * Add the passed pages to the LRU, then drop the caller's refcount
> >   * on them.  Reinitialises the caller's pagevec.
> >   */
> > -void __pagevec_lru_add(struct pagevec *pvec)
> > +void __pagevec_lru_add(struct pagevec *pvec, bool new)
> >  {
> > -	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
> > +	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, (void *)new);
> >  }
> 
> Just trivial:
> 
> The 'new' argument is not clear in this context - what it means - so
> it's worth a comment, IMO, but no strong opinion.

True, it's a little mysterious. I'll document it.

> Other than that,
> 
> Acked-by: Minchan Kim <minchan@kernel.org>

Thanks!


* Re: [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-08  8:14   ` Minchan Kim
@ 2016-06-08 16:06     ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-08 16:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Wed, Jun 08, 2016 at 05:14:21PM +0900, Minchan Kim wrote:
> On Mon, Jun 06, 2016 at 03:48:33PM -0400, Johannes Weiner wrote:
> > @@ -249,15 +249,10 @@ void rotate_reclaimable_page(struct page *page)
> >  	}
> >  }
> >  
> > -static void update_page_reclaim_stat(struct lruvec *lruvec,
> > -				     int file, int rotated,
> > -				     unsigned int nr_pages)
> > +void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
> >  {
> > -	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > -
> > -	reclaim_stat->recent_scanned[file] += nr_pages;
> > -	if (rotated)
> > -		reclaim_stat->recent_rotated[file] += nr_pages;
> > +	lruvec->balance.numer[file] += nr_pages;
> > +	lruvec->balance.denom += nr_pages;
> 
> balance.numer[0] + balance.numer[1] = balance.denom,
> so we can remove denom at the moment?

You're right, it doesn't make sense to keep that around anymore. I'll
remove it.

Thanks!


* Re: [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-08 12:51   ` Michal Hocko
@ 2016-06-08 16:16     ` Johannes Weiner
  2016-06-09 12:18       ` Michal Hocko
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2016-06-08 16:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Wed, Jun 08, 2016 at 02:51:37PM +0200, Michal Hocko wrote:
> On Mon 06-06-16 15:48:33, Johannes Weiner wrote:
> > Rename struct zone_reclaim_stat to struct lru_cost, and move from two
> > separate value ratios for the LRU lists to a relative LRU cost metric
> > with a shared denominator.
> 
> I just do not like the too generic `numer'. I guess cost or price would
> fit better and look better in the code as well. Up to you though...

Yeah, I picked it as a pair, numerator and denominator. But as Minchan
points out, denom is superfluous in the final version of the patch, so
I'm going to remove it and give the numerators better names.

anon_cost and file_cost?

> > Then make everything that affects the cost go through a new
> > lru_note_cost() function.
> 
> Just curious, have you tried to measure just the effect of this change
> without the rest of the series? I do not expect it would show large
> differences because we are not doing SCAN_FRACT most of the time...

Yes, we default to use-once cache and do fractional scanning when that
runs out and we have to go after workingset, which might potentially
cause refault IO. So you need a workload that has little streaming IO.

I haven't tested this patch in isolation, but it shouldn't make much
of a difference, since we continue to balance based on the same input.

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!


* Re: [PATCH 01/10] mm: allow swappiness that prefers anon over file
  2016-06-08 15:58         ` Johannes Weiner
@ 2016-06-09  1:01           ` Minchan Kim
  2016-06-09 13:32             ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2016-06-09  1:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Wed, Jun 08, 2016 at 11:58:12AM -0400, Johannes Weiner wrote:
> On Wed, Jun 08, 2016 at 09:06:32AM +0900, Minchan Kim wrote:
> > On Tue, Jun 07, 2016 at 10:18:18AM -0400, Johannes Weiner wrote:
> > > On Tue, Jun 07, 2016 at 09:25:50AM +0900, Minchan Kim wrote:
> > > > On Mon, Jun 06, 2016 at 03:48:27PM -0400, Johannes Weiner wrote:
> > > > > --- a/Documentation/sysctl/vm.txt
> > > > > +++ b/Documentation/sysctl/vm.txt
> > > > > @@ -771,14 +771,20 @@ with no ill effects: errors and warnings on these stats are suppressed.)
> > > > >  
> > > > >  swappiness
> > > > >  
> > > > > -This control is used to define how aggressive the kernel will swap
> > > > > -memory pages.  Higher values will increase agressiveness, lower values
> > > > > -decrease the amount of swap.  A value of 0 instructs the kernel not to
> > > > > -initiate swap until the amount of free and file-backed pages is less
> > > > > -than the high water mark in a zone.
> > > > > +This control is used to define the relative IO cost of cache misses
> > > > > +between the swap device and the filesystem as a value between 0 and
> > > > > +200. At 100, the VM assumes equal IO cost and will thus apply memory
> > > > > +pressure to the page cache and swap-backed pages equally. At 0, the
> > > > > +kernel will not initiate swap until the amount of free and file-backed
> > > > > +pages is less than the high watermark in a zone.
> > > > 
> > > > Generally, I agree that extending the swappiness value is good, but I'm
> > > > not sure 200 is enough to represent the speed gap between file and swap
> > > > storage in all cases. - Just a nitpick.
> > > 
> > > How so? You can't give swap more weight than 100%. 200 is the maximum
> > > possible value.
> > 
> > Previously, swappiness was how aggressively to reclaim anonymous pages in
> > favour of page cache. But when I read your description and changes about
> > swappiness in vm.txt, esp. *relative IO cost*, I feel you changed the
> > definition of swappiness to represent the relative IO cost between swap
> > storage and file storage. Then, with that, we could balance the anonymous
> > and file LRUs with that weight.
> > 
> > For example, let's assume that in-memory swap storage is 10x times faster
> > than slow thumb drive. In that case, IO cost of 5 anonymous pages
> > swapping-in/out is equal to 1 file-backed page-discard/read.
> > 
> > I thought it makes sense because measuring the speed gap between those
> > storages is easier than selecting a vague swappiness tendency.
> > 
> > With such an approach, I thought 200 is not enough to show the gap
> > because the gap starts from 100.
> > Isn't that your intention? If so, to me, the description was rather
> > misleading. :(
> 
> The way swappiness works never actually changed.
> 
> The only thing that changed is that we used to look at referenced
> pages (recent_rotated) and *assumed* they would likely cause IO when
> reclaimed, whereas with my patches we actually know whether they are.
> But swappiness has always been about relative IO cost of the LRUs.
> 
> Swappiness defines relative IO cost between file and swap on a scale
> from 0 to 200, where 100 is the point of equality. The scale factors
> are calculated in get_scan_count() like this:
> 
>   anon_prio = swappiness
>   file_prio = 200 - swappiness
> 
> and those are applied to the recorded cost/value ratios like this:
> 
>   ap = anon_prio * scanned / rotated
>   fp = file_prio * scanned / rotated
> 
> That means if your swap device is 10 times faster than your filesystem
> device, and you thus want anon to receive 10x the refaults when the
> anon and file pages are used equally, you do this:
> 
>   x + 10x = 200
>         x = 18 (ish)
> 
> So your file priority is ~18 and your swap priority is the remainder
> of the range, 200 - 18. You set swappiness to 182.
> 
> Now fill in the numbers while assuming all pages on both lists have
> been referenced before and will likely refault (or in the new model,
> all pages are refaulting):
> 
>   fraction[anon] = ap      = 182 * 1 / 1 = 182
>   fraction[file] = fp      =  18 * 1 / 1 =  18
>      denominator = ap + fp =    182 + 18 = 200
> 
> and then calculate the scan target like this:
> 
>   scan[type] = (lru_size() >> priority) * fraction[type] / denominator
> 
> This will scan and reclaim 9% of the file pages and 90% of the anon
> pages. On refault, 9% of the IO will be from the filesystem and 90%
> from the swap device.

Thanks for the detailed example. Then, let's change the example a little bit.

A system has big HDD storage and SSD swap.

HDD:    200 IOPS
SSD: 100000 IOPS
From https://en.wikipedia.org/wiki/IOPS

So, speed gap is 500x.
x + 500x = 200
If we use PCIe-SSD, the gap will be larger.
That's why I said 200 is not enough to represent the speed gap.
Is such a system configuration already nonsense, so that it is okay to ignore
such use cases?


* Re: [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-08 16:16     ` Johannes Weiner
@ 2016-06-09 12:18       ` Michal Hocko
  2016-06-09 13:33         ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2016-06-09 12:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Wed 08-06-16 12:16:05, Johannes Weiner wrote:
> On Wed, Jun 08, 2016 at 02:51:37PM +0200, Michal Hocko wrote:
> > On Mon 06-06-16 15:48:33, Johannes Weiner wrote:
> > > Rename struct zone_reclaim_stat to struct lru_cost, and move from two
> > > separate value ratios for the LRU lists to a relative LRU cost metric
> > > with a shared denominator.
> > 
> > I just do not like the too generic `number'. I guess cost or price would
> > fit better and look better in the code as well. Up to you though...
> 
> Yeah, I picked it as a pair, numerator and denominator. But as Minchan
> points out, denom is superfluous in the final version of the patch, so
> I'm going to remove it and give the numerators better names.
> 
> anon_cost and file_cost?

Yes, that is much more descriptive and easier to grep for. I didn't
propose that because I thought you would want to preserve the array
definition for easier code when updating them.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 01/10] mm: allow swappiness that prefers anon over file
  2016-06-09  1:01           ` Minchan Kim
@ 2016-06-09 13:32             ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-09 13:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Thu, Jun 09, 2016 at 10:01:07AM +0900, Minchan Kim wrote:
> A system has big HDD storage and SSD swap.
> 
> HDD:    200 IOPS
> SSD: 100000 IOPS
> From https://en.wikipedia.org/wiki/IOPS
> 
> So, speed gap is 500x.
> x + 500x = 200
> If we use PCIe-SSD, the gap will be larger.
> That's why I said 200 is not enough to represent the speed gap.

Ah, I see what you're saying.

Yeah, that's unfortunately a limitation in the current ABI. Extending
the range to previously unavailable settings is doable; changing the
meaning of existing values is not. We'd have to add another interface.

> Is such a system configuration already nonsense, so that it is okay to ignore
> such use cases?

I'm not sure we have to be proactive about it, but we can always add a
more fine-grained knob to override swappiness when somebody wants to
use such a setup in practice.


* Re: [PATCH 07/10] mm: base LRU balancing on an explicit cost model
  2016-06-09 12:18       ` Michal Hocko
@ 2016-06-09 13:33         ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2016-06-09 13:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Tim Chen, kernel-team

On Thu, Jun 09, 2016 at 02:18:02PM +0200, Michal Hocko wrote:
> On Wed 08-06-16 12:16:05, Johannes Weiner wrote:
> > On Wed, Jun 08, 2016 at 02:51:37PM +0200, Michal Hocko wrote:
> > > On Mon 06-06-16 15:48:33, Johannes Weiner wrote:
> > > > Rename struct zone_reclaim_stat to struct lru_cost, and move from two
> > > > separate value ratios for the LRU lists to a relative LRU cost metric
> > > > with a shared denominator.
> > > 
> > > I just do not like the too generic `number'. I guess cost or price would
> > > fit better and look better in the code as well. Up to you though...
> > 
> > Yeah, I picked it as a pair, numerator and denominator. But as Minchan
> > points out, denom is superfluous in the final version of the patch, so
> > I'm going to remove it and give the numerators better names.
> > 
> > anon_cost and file_cost?
> 
> Yes, that is much more descriptive and easier to grep for. I didn't
> propose that because I thought you would want to preserve the array
> definition for easier code when updating them.

It'll be slightly more verbose, but that's probably a good thing.
Especially for readability in get_scan_count().


* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-06 19:48 ` [PATCH 10/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
                     ` (2 preceding siblings ...)
  2016-06-08 13:58   ` Michal Hocko
@ 2016-06-10  2:19   ` Minchan Kim
  2016-06-13 15:52     ` Johannes Weiner
  3 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2016-06-10  2:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

Hi Hannes,

On Mon, Jun 06, 2016 at 03:48:36PM -0400, Johannes Weiner wrote:
> Since the LRUs were split into anon and file lists, the VM has been
> balancing between page cache and anonymous pages based on per-list
> ratios of scanned vs. rotated pages. In most cases that tips page
> reclaim towards the list that is easier to reclaim and has the fewest
> actively used pages, but there are a few problems with it:
> 
> 1. Refaults and in-memory rotations are weighted the same way, even
>    though one costs IO and the other costs CPU. When the balance is
>    off, the page cache can be thrashing while anonymous pages are aged
>    comparably slower and thus have more time to get even their coldest
>    pages referenced. The VM would consider this a fair equilibrium.
> 
> 2. The page cache has usually a share of use-once pages that will
>    further dilute its scanned/rotated ratio in the above-mentioned
>    scenario. This can cease scanning of the anonymous list almost
>    entirely - again while the page cache is thrashing and IO-bound.
> 
> Historically, swap has been an emergency overflow for high memory
> pressure, and we avoided using it as long as new page allocations
> could be served from recycling page cache. However, when recycling
> page cache incurs a higher cost in IO than swapping out a few unused
> anonymous pages would, it makes sense to increase swap pressure.
> 
> In order to accomplish this, we can extend the thrash detection code
> that currently detects workingset changes within the page cache: when
> inactive cache pages are thrashing, the VM raises LRU pressure on the
> otherwise protected active file list to increase competition. However,
> when active pages begin refaulting as well, it means that the page
> cache is thrashing as a whole and the LRU balance should tip toward
> anonymous. This is what this patch implements.
> 
> To tell inactive from active refaults, a page flag is introduced that
> marks pages that have been on the active list in their lifetime. This
> flag is remembered in the shadow page entry on reclaim, and restored
> when the page refaults. It is also set on anonymous pages during
> swapin. When a page with that flag set is added to the LRU, the LRU
> balance is adjusted for the IO cost of reclaiming the thrashing list.
> 
> Rotations continue to influence the LRU balance as well, but with a
> different weight factor. That factor is statically chosen such that
> refaults are considered more costly than rotations at this point. We
> might want to revisit this for ultra-fast swap or secondary memory
> devices, where rotating referenced pages might be more costly than
> swapping or relocating them directly and have some of them refault.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h         |   6 +-
>  include/linux/page-flags.h     |   2 +
>  include/linux/swap.h           |  10 ++-
>  include/trace/events/mmflags.h |   1 +
>  mm/filemap.c                   |   9 +--
>  mm/migrate.c                   |   4 ++
>  mm/swap.c                      |  38 ++++++++++-
>  mm/swap_state.c                |   1 +
>  mm/vmscan.c                    |   5 +-
>  mm/vmstat.c                    |   6 +-
>  mm/workingset.c                | 142 +++++++++++++++++++++++++++++++----------
>  11 files changed, 172 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4d257d00fbf5..d7aaee25b536 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -148,9 +148,9 @@ enum zone_stat_item {
>  	NUMA_LOCAL,		/* allocation from local node */
>  	NUMA_OTHER,		/* allocation from other node */
>  #endif
> -	WORKINGSET_REFAULT,
> -	WORKINGSET_ACTIVATE,
> -	WORKINGSET_NODERECLAIM,
> +	REFAULT_INACTIVE_FILE,
> +	REFAULT_ACTIVE_FILE,
> +	REFAULT_NODERECLAIM,
>  	NR_ANON_TRANSPARENT_HUGEPAGES,
>  	NR_FREE_CMA_PAGES,
>  	NR_VM_ZONE_STAT_ITEMS };
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e5a32445f930..a1b9d7dddd68 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -79,6 +79,7 @@ enum pageflags {
>  	PG_dirty,
>  	PG_lru,
>  	PG_active,
> +	PG_workingset,

I think PG_workingset might be a good flag in the future; core MM might
utilize it to optimize something, so I hope it is supported on 32-bit, too.

An old use case for PG_workingset was cleancache. A few years ago, Dan
tried to cache only activated pages from the page cache into cleancache,
IIRC. As well, many systems using zram (i.e., fast swap) are still 32-bit
architectures.

Just an idea: on 32-bit we might be able to move a less important flag (i.e.,
one enabled only in a specific configuration, for example PG_hwpoison or
PG_uncached) to page_extra to avoid allocating extra memory space, and use
the freed bit for PG_workingset. :)

Another concern about PG_workingset is naming. For file-backed pages it's
good, because file-backed pages start from the inactive list's head and are
promoted to the active LRU once touched twice, so they are likely to be
workingset. However, anonymous pages start on the active list, so every
anonymous page has PG_workingset while mlocked pages never get a chance to
have it. It wouldn't matter from the reclaim POV, but if we used PG_workingset
as an indicator to identify real workingset pages, it might be confusing.
Maybe we could mark mlocked pages as workingset unconditionally.

>  	PG_slab,
>  	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
>  	PG_arch_1,
> @@ -259,6 +260,7 @@ PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
>  PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
>  PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
>  	TESTCLEARFLAG(Active, active, PF_HEAD)
> +PAGEFLAG(Workingset, workingset, PF_HEAD)
>  __PAGEFLAG(Slab, slab, PF_NO_TAIL)
>  __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
>  PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c461ce0533da..9923b51ee8e9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -250,7 +250,7 @@ struct swap_info_struct {
>  
>  /* linux/mm/workingset.c */
>  void *workingset_eviction(struct address_space *mapping, struct page *page);
> -bool workingset_refault(void *shadow);
> +void workingset_refault(struct page *page, void *shadow);
>  void workingset_activation(struct page *page);
>  extern struct list_lru workingset_shadow_nodes;
>  
> @@ -295,8 +295,12 @@ extern unsigned long nr_free_pagecache_pages(void);
>  
>  
>  /* linux/mm/swap.c */
> -extern void lru_note_cost(struct lruvec *lruvec, bool file,
> -			  unsigned int nr_pages);
> +enum lru_cost_type {
> +	COST_CPU,
> +	COST_IO,
> +};
> +extern void lru_note_cost(struct lruvec *lruvec, enum lru_cost_type cost,
> +			  bool file, unsigned int nr_pages);
>  extern void lru_cache_add(struct page *);
>  extern void lru_cache_putback(struct page *page);
>  extern void lru_add_page_tail(struct page *page, struct page *page_tail,
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 43cedbf0c759..bc05e0ac1b8c 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -86,6 +86,7 @@
>  	{1UL << PG_dirty,		"dirty"		},		\
>  	{1UL << PG_lru,			"lru"		},		\
>  	{1UL << PG_active,		"active"	},		\
> +	{1UL << PG_workingset,		"workingset"	},		\
>  	{1UL << PG_slab,		"slab"		},		\
>  	{1UL << PG_owner_priv_1,	"owner_priv_1"	},		\
>  	{1UL << PG_arch_1,		"arch_1"	},		\
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9665b1d4f318..1b356b47381b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -700,12 +700,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  		 * data from the working set, only to cache data that will
>  		 * get overwritten with something else, is a waste of memory.
>  		 */
> -		if (!(gfp_mask & __GFP_WRITE) &&
> -		    shadow && workingset_refault(shadow)) {
> -			SetPageActive(page);
> -			workingset_activation(page);
> -		} else
> -			ClearPageActive(page);
> +		WARN_ON_ONCE(PageActive(page));
> +		if (!(gfp_mask & __GFP_WRITE) && shadow)
> +			workingset_refault(page, shadow);
>  		lru_cache_add(page);
>  	}
>  	return ret;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9baf41c877ff..115d49441c6c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -544,6 +544,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
>  		SetPageActive(newpage);
>  	} else if (TestClearPageUnevictable(page))
>  		SetPageUnevictable(newpage);
> +	if (PageWorkingset(page))
> +		SetPageWorkingset(newpage);

When I see this, the thought that pops up is how we handle PG_workingset
when splitting/collapsing THP, and I can't find any logic for that. :(
Every anonymous page is PG_workingset by birth, so did you ignore it
intentionally?


>  	if (PageChecked(page))
>  		SetPageChecked(newpage);
>  	if (PageMappedToDisk(page))
> @@ -1809,6 +1811,8 @@ fail_putback:
>  		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>  
>  		/* Reverse changes made by migrate_page_copy() */
> +		if (TestClearPageWorkingset(new_page))
> +			ClearPageWorkingset(page);
>  		if (TestClearPageActive(new_page))
>  			SetPageActive(page);
>  		if (TestClearPageUnevictable(new_page))
> diff --git a/mm/swap.c b/mm/swap.c
> index ae07b469ddca..cb6773e1424e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -249,8 +249,28 @@ void rotate_reclaimable_page(struct page *page)
>  	}
>  }
>  
> -void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
> +void lru_note_cost(struct lruvec *lruvec, enum lru_cost_type cost,
> +		   bool file, unsigned int nr_pages)
>  {
> +	if (cost == COST_IO) {
> +		/*
> +		 * Reflect the relative reclaim cost between incurring
> +		 * IO from refaults on one hand, and incurring CPU
> +		 * cost from rotating scanned pages on the other.
> +		 *
> +		 * XXX: For now, the relative cost factor for IO is
> +		 * set statically to outweigh the cost of rotating
> +		 * referenced pages. This might change with ultra-fast
> +		 * IO devices, or with secondary memory devices that
> +		 * allow users continued access of swapped out pages.
> +		 *
> +		 * Until then, the value is chosen simply such that we
> +		 * balance for IO cost first and optimize for CPU only
> +		 * once the thrashing subsides.
> +		 */
> +		nr_pages *= SWAP_CLUSTER_MAX;
> +	}
> +
>  	lruvec->balance.numer[file] += nr_pages;
>  	lruvec->balance.denom += nr_pages;

So, lru_cost_type is binary: COST_IO and COST_CPU. 'bool' would be enough to
represent it if you don't have further plans to expand it.
But if you did it to make the code readable, I'm not against it. Just trivial.

>  }
> @@ -262,6 +282,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
>  		int lru = page_lru_base_type(page);
>  
>  		del_page_from_lru_list(page, lruvec, lru);
> +		SetPageWorkingset(page);
>  		SetPageActive(page);
>  		lru += LRU_ACTIVE;
>  		add_page_to_lru_list(page, lruvec, lru);
> @@ -821,13 +842,28 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>  static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>  				 void *arg)
>  {
> +	unsigned int nr_pages = hpage_nr_pages(page);
>  	enum lru_list lru = page_lru(page);
> +	bool active = is_active_lru(lru);
> +	bool file = is_file_lru(lru);
> +	bool new = (bool)arg;
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
>  	SetPageLRU(page);
>  	add_page_to_lru_list(page, lruvec, lru);
>  
> +	if (new) {
> +		/*
> +		 * If the workingset is thrashing, note the IO cost of
> +		 * reclaiming that list and steer reclaim away from it.
> +		 */
> +		if (PageWorkingset(page))
> +			lru_note_cost(lruvec, COST_IO, file, nr_pages);
> +		else if (active)
> +			SetPageWorkingset(page);
> +	}
> +
>  	trace_mm_lru_insertion(page, lru);
>  }
>  
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 5400f814ae12..43561a56ba5d 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -365,6 +365,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  			/*
>  			 * Initiate read into locked page and return.
>  			 */

How about putting the comment you wrote to Tim in here?

"
There are no shadow entries for anonymous evictions, only page cache
evictions. All swap-ins are treated as "eligible" refaults and push back
against cache, whereas cache only pushes against anon if the cache
workingset is determined to fit into memory.
That implies a fixed hierarchy where the VM always tries to fit the
anonymous workingset into memory first and the page cache second.
If the anonymous set is bigger than memory, the algorithm won't stop
counting IO cost from anonymous refaults and pressuring page cache.
"
Or put it in workingset.c. I see you wrote a little about anonymous
refaults in there, but I think adding the above paragraph would be
very helpful.


> +			SetPageWorkingset(new_page);
>  			lru_cache_add(new_page);
>  			*new_page_allocated = true;
>  			return new_page;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index acbd212eab6e..b2cb4f4f9d31 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1216,6 +1216,7 @@ activate_locked:
>  		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
>  			try_to_free_swap(page);
>  		VM_BUG_ON_PAGE(PageActive(page), page);
> +		SetPageWorkingset(page);
>  		SetPageActive(page);
>  		pgactivate++;
>  keep_locked:
> @@ -1524,7 +1525,7 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
>  			 * Rotating pages costs CPU without actually
>  			 * progressing toward the reclaim goal.
>  			 */
> -			lru_note_cost(lruvec, file, numpages);
> +			lru_note_cost(lruvec, COST_CPU, file, numpages);
>  		}
>  
>  		if (put_page_testzero(page)) {
> @@ -1849,7 +1850,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	 * Rotating pages costs CPU without actually
>  	 * progressing toward the reclaim goal.
>  	 */
> -	lru_note_cost(lruvec, file, nr_rotated);
> +	lru_note_cost(lruvec, COST_CPU, file, nr_rotated);
>  
>  	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
>  	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 77e42ef388c2..6c8d658f5b7f 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -727,9 +727,9 @@ const char * const vmstat_text[] = {
>  	"numa_local",
>  	"numa_other",
>  #endif
> -	"workingset_refault",
> -	"workingset_activate",
> -	"workingset_nodereclaim",
> +	"refault_inactive_file",
> +	"refault_active_file",
> +	"refault_nodereclaim",
>  	"nr_anon_transparent_hugepages",
>  	"nr_free_cma",
>  
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 8a75f8d2916a..261cf583fb62 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -118,7 +118,7 @@
>   * the only thing eating into inactive list space is active pages.
>   *
>   *
> - *		Activating refaulting pages
> + *		Refaulting inactive pages
>   *
>   * All that is known about the active list is that the pages have been
>   * accessed more than once in the past.  This means that at any given
> @@ -131,6 +131,10 @@
>   * used less frequently than the refaulting page - or even not used at
>   * all anymore.
>   *
> + * That means, if inactive cache is refaulting with a suitable refault
> + * distance, we assume the cache workingset is transitioning and put
> + * pressure on the existing cache pages on the active list.
> + *
>   * If this is wrong and demotion kicks in, the pages which are truly
>   * used more frequently will be reactivated while the less frequently
>   * used once will be evicted from memory.
> @@ -139,6 +143,30 @@
>   * and the used pages get to stay in cache.
>   *
>   *
> + *		Refaulting active pages
> + *
> + * If, on the other hand, the refaulting pages have been recently
> + * deactivated, it means that the active list is no longer protecting
> + * actively used cache from reclaim: the cache is not transitioning to
> + * a different workingset, the existing workingset is thrashing in the
> + * space allocated to the page cache.
> + *
> + * When that is the case, mere activation of the refaulting pages is
> + * not enough. The page reclaim code needs to be informed of the high
> + * IO cost associated with the continued reclaim of page cache, so
> + * that it can steer pressure to the anonymous list.
> + *
> + * Just as when refaulting inactive pages, it's possible that there
> + * are cold(er) anonymous pages that can be swapped and forgotten in
> + * order to increase the space available to the page cache as a whole.
> + *
> + * If anonymous pages start thrashing as well, the reclaim scanner
> + * will aim for the list that imposes the lowest cost on the system,
> + * where cost is defined as:
> + *
> + *	refault rate * relative IO cost (as determined by swappiness)
> + *
> + *
>   *		Implementation
>   *
>   * For each zone's file LRU lists, a counter for inactive evictions
> @@ -150,10 +178,25 @@
>   *
>   * On cache misses for which there are shadow entries, an eligible
>   * refault distance will immediately activate the refaulting page.
> + *
> + * On activation, cache pages are marked PageWorkingset, which is not
> + * cleared until the page is freed. Shadow entries will remember that
> + * flag to be able to tell inactive from active refaults. Refaults of
> + * previous workingset pages will restore that page flag and inform
> + * page reclaim of the IO cost.
> + *
> + * XXX: Since we don't track anonymous references, every swap-in event
> + * is considered a workingset refault - regardless of distance. Swapin
> + * floods will thus always raise the assumed IO cost of reclaiming the
> + * anonymous LRU lists, even if the pages haven't been used recently.
> + * Temporary events don't matter that much other than they might delay
> + * the stabilization a bit. But during continuous thrashing, anonymous
> + * pages can have a leg-up against page cache. This might need fixing
> + * for ultra-fast IO devices or secondary memory types.
>   */
>  
> -#define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY + \
> -			 ZONES_SHIFT + NODES_SHIFT +	\
> +#define EVICTION_SHIFT	(RADIX_TREE_EXCEPTIONAL_ENTRY +			\
> +			 1 + ZONES_SHIFT + NODES_SHIFT +		\
>  			 MEM_CGROUP_ID_SHIFT)
>  #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
>  
> @@ -167,24 +210,29 @@
>   */
>  static unsigned int bucket_order __read_mostly;
>  
> -static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction)
> +static void *pack_shadow(int memcgid, struct zone *zone, unsigned long eviction,
> +			 bool workingset)
>  {
>  	eviction >>= bucket_order;
>  	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
>  	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
>  	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> +	eviction = (eviction << 1) | workingset;
>  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
>  
>  	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
>  }
>  
>  static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
> -			  unsigned long *evictionp)
> +			  unsigned long *evictionp, bool *workingsetp)
>  {
>  	unsigned long entry = (unsigned long)shadow;
>  	int memcgid, nid, zid;
> +	bool workingset;
>  
>  	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> +	workingset = entry & 1;
> +	entry >>= 1;
>  	zid = entry & ((1UL << ZONES_SHIFT) - 1);
>  	entry >>= ZONES_SHIFT;
>  	nid = entry & ((1UL << NODES_SHIFT) - 1);
> @@ -195,6 +243,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, struct zone **zonep,
>  	*memcgidp = memcgid;
>  	*zonep = NODE_DATA(nid)->node_zones + zid;
>  	*evictionp = entry << bucket_order;
> +	*workingsetp = workingset;
>  }
>  
>  /**
> @@ -220,19 +269,18 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
>  
>  	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  	eviction = atomic_long_inc_return(&lruvec->inactive_age);
> -	return pack_shadow(memcgid, zone, eviction);
> +	return pack_shadow(memcgid, zone, eviction, PageWorkingset(page));
>  }
>  
>  /**
>   * workingset_refault - evaluate the refault of a previously evicted page
> + * @page: the freshly allocated replacement page
>   * @shadow: shadow entry of the evicted page
>   *
>   * Calculates and evaluates the refault distance of the previously
>   * evicted page in the context of the zone it was allocated in.
> - *
> - * Returns %true if the page should be activated, %false otherwise.
>   */
> -bool workingset_refault(void *shadow)
> +void workingset_refault(struct page *page, void *shadow)
>  {
>  	unsigned long refault_distance;
>  	unsigned long active_file;
> @@ -240,10 +288,12 @@ bool workingset_refault(void *shadow)
>  	unsigned long eviction;
>  	struct lruvec *lruvec;
>  	unsigned long refault;
> +	unsigned long anon;
>  	struct zone *zone;
> +	bool workingset;
>  	int memcgid;
>  
> -	unpack_shadow(shadow, &memcgid, &zone, &eviction);
> +	unpack_shadow(shadow, &memcgid, &zone, &eviction, &workingset);
>  
>  	rcu_read_lock();
>  	/*
> @@ -263,40 +313,64 @@ bool workingset_refault(void *shadow)
>  	 * configurations instead.
>  	 */
>  	memcg = mem_cgroup_from_id(memcgid);
> -	if (!mem_cgroup_disabled() && !memcg) {
> -		rcu_read_unlock();
> -		return false;
> -	}
> +	if (!mem_cgroup_disabled() && !memcg)
> +		goto out;
>  	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  	refault = atomic_long_read(&lruvec->inactive_age);
>  	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE);
> -	rcu_read_unlock();
> +	if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
> +		anon = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
> +		       lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
> +	else
> +		anon = 0;
>  
>  	/*
> -	 * The unsigned subtraction here gives an accurate distance
> -	 * across inactive_age overflows in most cases.
> +	 * Calculate the refault distance.
>  	 *
> -	 * There is a special case: usually, shadow entries have a
> -	 * short lifetime and are either refaulted or reclaimed along
> -	 * with the inode before they get too old.  But it is not
> -	 * impossible for the inactive_age to lap a shadow entry in
> -	 * the field, which can then can result in a false small
> -	 * refault distance, leading to a false activation should this
> -	 * old entry actually refault again.  However, earlier kernels
> -	 * used to deactivate unconditionally with *every* reclaim
> -	 * invocation for the longest time, so the occasional
> -	 * inappropriate activation leading to pressure on the active
> -	 * list is not a problem.
> +	 * The unsigned subtraction here gives an accurate distance
> +	 * across inactive_age overflows in most cases. There is a
> +	 * special case: usually, shadow entries have a short lifetime
> +	 * and are either refaulted or reclaimed along with the inode
> +	 * before they get too old.  But it is not impossible for the
> +	 * inactive_age to lap a shadow entry in the field, which can
> +	 * then result in a false small refault distance, leading
> +	 * to a false activation should this old entry actually
> +	 * refault again.  However, earlier kernels used to deactivate
> +	 * unconditionally with *every* reclaim invocation for the
> +	 * longest time, so the occasional inappropriate activation
> +	 * leading to pressure on the active list is not a problem.
>  	 */
>  	refault_distance = (refault - eviction) & EVICTION_MASK;
>  
> -	inc_zone_state(zone, WORKINGSET_REFAULT);
> +	/*
> +	 * Compare the distance with the existing workingset. We don't
> +	 * act on pages that couldn't stay resident even with all the
> +	 * memory available to the page cache.
> +	 */
> +	if (refault_distance > active_file + anon)
> +		goto out;
>  
> -	if (refault_distance <= active_file) {
> -		inc_zone_state(zone, WORKINGSET_ACTIVATE);
> -		return true;
> +	/*
> +	 * If inactive cache is refaulting, activate the page to
> +	 * challenge the current cache workingset. The existing cache
> +	 * might be stale, or at least colder than the contender.
> +	 *
> +	 * If active cache is refaulting (PageWorkingset set at time
> +	 * of eviction), it means that the page cache as a whole is
> +	 * thrashing. Restore PageWorkingset to inform the LRU code
> +	 * about the additional cost of reclaiming more page cache.
> +	 */
> +	SetPageActive(page);
> +	atomic_long_inc(&lruvec->inactive_age);
> +
> +	if (workingset) {
> +		SetPageWorkingset(page);
> +		inc_zone_state(zone, REFAULT_ACTIVE_FILE);
> +	} else {
> +		inc_zone_state(zone, REFAULT_INACTIVE_FILE);
>  	}
> -	return false;
> +out:
> +	rcu_read_unlock();
>  }
>  
>  /**
> @@ -433,7 +507,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>  		}
>  	}
>  	BUG_ON(node->count);
> -	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
> +	inc_zone_state(page_zone(virt_to_page(node)), REFAULT_NODERECLAIM);
>  	if (!__radix_tree_delete_node(&mapping->page_tree, node))
>  		BUG();
>  
> -- 
> 2.8.3
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-10  2:19   ` Minchan Kim
@ 2016-06-13 15:52     ` Johannes Weiner
  2016-06-15  2:23       ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2016-06-13 15:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Fri, Jun 10, 2016 at 11:19:35AM +0900, Minchan Kim wrote:
> On Mon, Jun 06, 2016 at 03:48:36PM -0400, Johannes Weiner wrote:
> > @@ -79,6 +79,7 @@ enum pageflags {
> >  	PG_dirty,
> >  	PG_lru,
> >  	PG_active,
> > +	PG_workingset,
> 
> I think PG_workingset might be a good flag in the future; core MM might
> utilize it to optimize something, so I hope it is supported on 32-bit, too.
> 
> An old use case for PG_workingset was cleancache. A few years ago, Dan
> tried to cache only activated pages from the page cache in cleancache,
> IIRC. Also, many systems using zram (i.e., fast swap) are still 32-bit
> architectures.
> 
> Just an idea: we might be able to move a less important flag (i.e., one
> enabled only in specific configurations, for example PG_hwpoison or
> PG_uncached) on 32-bit to page_extra to avoid allocating extra memory
> space, and use that bit for PG_workingset. :)

Yeah, I do think it should be a core flag. We have the space for it.

> Another concern about PG_workingset is the naming. For file-backed pages,
> it's good, because file-backed pages start at the inactive list's head and
> are promoted to the active LRU after two touches, so they are likely to be
> workingset. However, an anonymous page starts on the active list, so every
> anonymous page has PG_workingset while mlocked pages never get a chance to
> have it. It wouldn't matter from reclaim's POV, but if we used PG_workingset
> as an indicator to identify real workingset pages, it might be confusing.
> Maybe we could mark mlocked pages as workingset unconditionally.

Hm I'm not sure it matters. Technically we don't have to set it on
anon, but since it's otherwise unused anyway, it's nice to set it to
reinforce the notion that anon is currently always workingset.

> > @@ -544,6 +544,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
> >  		SetPageActive(newpage);
> >  	} else if (TestClearPageUnevictable(page))
> >  		SetPageUnevictable(newpage);
> > +	if (PageWorkingset(page))
> > +		SetPageWorkingset(newpage);
> 
> When I saw this, the thought that popped up was how we handle
> PG_workingset when splitting/collapsing THP, and I can't find any
> logic for it. :(
> Every anonymous page is PG_workingset from birth, so did you ignore
> it intentionally?

Good catch. __split_huge_page_tail() should copy it over, will fix that.

> > @@ -1809,6 +1811,8 @@ fail_putback:
> >  		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> >  
> >  		/* Reverse changes made by migrate_page_copy() */
> > +		if (TestClearPageWorkingset(new_page))
> > +			ClearPageWorkingset(page);
> >  		if (TestClearPageActive(new_page))
> >  			SetPageActive(page);
> >  		if (TestClearPageUnevictable(new_page))
> > diff --git a/mm/swap.c b/mm/swap.c
> > index ae07b469ddca..cb6773e1424e 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -249,8 +249,28 @@ void rotate_reclaimable_page(struct page *page)
> >  	}
> >  }
> >  
> > -void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
> > +void lru_note_cost(struct lruvec *lruvec, enum lru_cost_type cost,
> > +		   bool file, unsigned int nr_pages)
> >  {
> > +	if (cost == COST_IO) {
> > +		/*
> > +		 * Reflect the relative reclaim cost between incurring
> > +		 * IO from refaults on one hand, and incurring CPU
> > +		 * cost from rotating scanned pages on the other.
> > +		 *
> > +		 * XXX: For now, the relative cost factor for IO is
> > +		 * set statically to outweigh the cost of rotating
> > +		 * referenced pages. This might change with ultra-fast
> > +		 * IO devices, or with secondary memory devices that
> > +		 * allow users continued access of swapped out pages.
> > +		 *
> > +		 * Until then, the value is chosen simply such that we
> > +		 * balance for IO cost first and optimize for CPU only
> > +		 * once the thrashing subsides.
> > +		 */
> > +		nr_pages *= SWAP_CLUSTER_MAX;
> > +	}
> > +
> >  	lruvec->balance.numer[file] += nr_pages;
> >  	lruvec->balance.denom += nr_pages;
> 
> So lru_cost_type is binary: COST_IO and COST_CPU. 'bool' is enough to
> represent it if you don't have further plans to expand it.
> But if you did it for readability, I'm not against it. Just trivial.

Yeah, it's meant for readability. "true" and "false" make for fairly
cryptic arguments when they are a static property of the callsite:

  lru_note_cost(lruvec, false, page_is_file_cache(page), hpage_nr_pages(page))

???

So I'd rather name these things and leave bool for things that are
based on predicate functions.

> > @@ -821,13 +842,28 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
> >  static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
> >  				 void *arg)
> >  {
> > +	unsigned int nr_pages = hpage_nr_pages(page);
> >  	enum lru_list lru = page_lru(page);
> > +	bool active = is_active_lru(lru);
> > +	bool file = is_file_lru(lru);
> > +	bool new = (bool)arg;
> >  
> >  	VM_BUG_ON_PAGE(PageLRU(page), page);
> >  
> >  	SetPageLRU(page);
> >  	add_page_to_lru_list(page, lruvec, lru);
> >  
> > +	if (new) {
> > +		/*
> > +		 * If the workingset is thrashing, note the IO cost of
> > +		 * reclaiming that list and steer reclaim away from it.
> > +		 */
> > +		if (PageWorkingset(page))
> > +			lru_note_cost(lruvec, COST_IO, file, nr_pages);
> > +		else if (active)
> > +			SetPageWorkingset(page);
> > +	}
> > +
> >  	trace_mm_lru_insertion(page, lru);
> >  }
> >  
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 5400f814ae12..43561a56ba5d 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -365,6 +365,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  			/*
> >  			 * Initiate read into locked page and return.
> >  			 */
> 
> How about putting the comment you said to Tim in here?
> 
> "
> There are no shadow entries for anonymous evictions, only page cache
> evictions. All swap-ins are treated as "eligible" refaults and push back
> against cache, whereas cache only pushes against anon if the cache
> workingset is determined to fit into memory.
> That implies a fixed hierarchy where the VM always tries to fit the
> anonymous workingset into memory first and the page cache second.
> If the anonymous set is bigger than memory, the algorithm won't stop
> counting IO cost from anonymous refaults and pressuring page cache.
> "
> Or put it in workingset.c. I see you wrote up a little bit about
> anonymous refault in there, but I think adding the above paragraph
> would be very helpful.

Agreed, that would probably be helpful. I'll put that in.

Thanks Minchan!

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-13 15:52     ` Johannes Weiner
@ 2016-06-15  2:23       ` Minchan Kim
  2016-06-16 15:12         ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2016-06-15  2:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 13, 2016 at 11:52:31AM -0400, Johannes Weiner wrote:
> On Fri, Jun 10, 2016 at 11:19:35AM +0900, Minchan Kim wrote:
> > On Mon, Jun 06, 2016 at 03:48:36PM -0400, Johannes Weiner wrote:
> > > @@ -79,6 +79,7 @@ enum pageflags {
> > >  	PG_dirty,
> > >  	PG_lru,
> > >  	PG_active,
> > > +	PG_workingset,
> > 
> > I think PG_workingset might be a good flag in the future; core MM might
> > utilize it to optimize something, so I hope it is supported on 32-bit, too.
> > 
> > An old use case for PG_workingset was cleancache. A few years ago, Dan
> > tried to cache only activated pages from the page cache in cleancache,
> > IIRC. Also, many systems using zram (i.e., fast swap) are still 32-bit
> > architectures.
> > 
> > Just an idea: we might be able to move a less important flag (i.e., one
> > enabled only in specific configurations, for example PG_hwpoison or
> > PG_uncached) on 32-bit to page_extra to avoid allocating extra memory
> > space, and use that bit for PG_workingset. :)
> 
> Yeah, I do think it should be a core flag. We have the space for it.
> 
> > Another concern about PG_workingset is the naming. For file-backed pages,
> > it's good, because file-backed pages start at the inactive list's head and
> > are promoted to the active LRU after two touches, so they are likely to be
> > workingset. However, an anonymous page starts on the active list, so every
> > anonymous page has PG_workingset while mlocked pages never get a chance to
> > have it. It wouldn't matter from reclaim's POV, but if we used PG_workingset
> > as an indicator to identify real workingset pages, it might be confusing.
> > Maybe we could mark mlocked pages as workingset unconditionally.
> 
> Hm I'm not sure it matters. Technically we don't have to set it on
> anon, but since it's otherwise unused anyway, it's nice to set it to
> reinforce the notion that anon is currently always workingset.

When I first read your description, I thought the flag for anon pages
was set only on swapin, but now I feel you want to set it for all
anonymous pages. But that has several holes, like mlocked pages, shmem
pages and THP, and you want to fix only the THP case.
Hm, what's the rule?
It's not consistent, and confusing to me. :(

I think it would be better for the PageWorkingset function to return
true when PG_swapbacked is set, if we want to consider all pages on the
anonymous LRU as PG_workingset; that is clearer and less error-prone, IMHO.

Another question:

Do we want to retain [1]?

This patch is motivated by swap IO being potentially much faster than
file IO, so wouldn't it be natural to rely on refault feedback rather
than forcing eviction of the file cache?

[1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-15  2:23       ` Minchan Kim
@ 2016-06-16 15:12         ` Johannes Weiner
  2016-06-17  7:49           ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2016-06-16 15:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Wed, Jun 15, 2016 at 11:23:41AM +0900, Minchan Kim wrote:
> On Mon, Jun 13, 2016 at 11:52:31AM -0400, Johannes Weiner wrote:
> > On Fri, Jun 10, 2016 at 11:19:35AM +0900, Minchan Kim wrote:
> > > Another concern about PG_workingset is the naming. For file-backed pages,
> > > it's good, because file-backed pages start at the inactive list's head and
> > > are promoted to the active LRU after two touches, so they are likely to be
> > > workingset. However, an anonymous page starts on the active list, so every
> > > anonymous page has PG_workingset while mlocked pages never get a chance to
> > > have it. It wouldn't matter from reclaim's POV, but if we used PG_workingset
> > > as an indicator to identify real workingset pages, it might be confusing.
> > > Maybe we could mark mlocked pages as workingset unconditionally.
> > 
> > Hm I'm not sure it matters. Technically we don't have to set it on
> > anon, but since it's otherwise unused anyway, it's nice to set it to
> > reinforce the notion that anon is currently always workingset.
> 
> When I first read your description, I thought the flag for anon pages
> was set only on swapin, but now I feel you want to set it for all
> anonymous pages. But that has several holes, like mlocked pages, shmem
> pages and THP, and you want to fix only the THP case.
> Hm, what's the rule?
> It's not consistent, and confusing to me. :(

I think you might be overthinking this a bit ;)

The current LRU code has a notion of workingset pages, which is anon
pages and multi-referenced file pages. shmem are considered file for
this purpose. That's why anon start out active and files/shmem do
not. This patch adds refaulting pages to the mix.

PG_workingset keeps track of pages that were recently workingset, so
we set it when the page enters the workingset (activations and
refaults, and new anon from the start). The only thing we need out of
this flag is to tell us whether reclaim is going after the workingset
because the LRUs have become too small to hold it.

mlocked pages are not really interesting because not only are they not
evictable, they are entirely exempt from aging. Without aging, we can
not say whether they are workingset or not. We'll just leave the flags
alone, like the active flag right now.

> I think it would be better for the PageWorkingset function to return
> true when PG_swapbacked is set, if we want to consider all pages on the
> anonymous LRU as PG_workingset; that is clearer and less error-prone, IMHO.

I'm not sure I see the upside, it would be more branches and code.

> Another question:
> 
> Do we want to retain [1]?
> 
> This patch is motivated by swap IO being potentially much faster than
> file IO, so wouldn't it be natural to rely on refault feedback rather
> than forcing eviction of the file cache?
> 
> [1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?

Yes! We don't want to go after the workingset, whether it be cache or
anonymous, while there is single-use page cache lying around that we
can reclaim for free, with no IO and little risk of future IO. Anon
memory doesn't have this equivalent. Only cache is lazy-reclaimed.

Once the cache refaults, we activate it to reflect the fact that it's
workingset. Only when we run out of single-use cache do we want to
reclaim multi-use pages, and *then* we balance workingsets based on
cost of refetching each side from secondary storage.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-16 15:12         ` Johannes Weiner
@ 2016-06-17  7:49           ` Minchan Kim
  2016-06-17 17:01             ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2016-06-17  7:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Thu, Jun 16, 2016 at 11:12:07AM -0400, Johannes Weiner wrote:
> On Wed, Jun 15, 2016 at 11:23:41AM +0900, Minchan Kim wrote:
> > On Mon, Jun 13, 2016 at 11:52:31AM -0400, Johannes Weiner wrote:
> > > On Fri, Jun 10, 2016 at 11:19:35AM +0900, Minchan Kim wrote:
> > > > Another concern about PG_workingset is the naming. For file-backed pages,
> > > > it's good, because file-backed pages start at the inactive list's head and
> > > > are promoted to the active LRU after two touches, so they are likely to be
> > > > workingset. However, an anonymous page starts on the active list, so every
> > > > anonymous page has PG_workingset while mlocked pages never get a chance to
> > > > have it. It wouldn't matter from reclaim's POV, but if we used PG_workingset
> > > > as an indicator to identify real workingset pages, it might be confusing.
> > > > Maybe we could mark mlocked pages as workingset unconditionally.
> > > 
> > > Hm I'm not sure it matters. Technically we don't have to set it on
> > > anon, but since it's otherwise unused anyway, it's nice to set it to
> > > reinforce the notion that anon is currently always workingset.
> > 
> > When I first read your description, I thought the flag for anon pages
> > was set only on swapin, but now I feel you want to set it for all
> > anonymous pages. But that has several holes, like mlocked pages, shmem
> > pages and THP, and you want to fix only the THP case.
> > Hm, what's the rule?
> > It's not consistent, and confusing to me. :(
> 
> I think you might be overthinking this a bit ;)
> 
> The current LRU code has a notion of workingset pages, which is anon
> pages and multi-referenced file pages. shmem are considered file for
> this purpose. That's why anon start out active and files/shmem do
> not. This patch adds refaulting pages to the mix.
> 
> PG_workingset keeps track of pages that were recently workingset, so
> we set it when the page enters the workingset (activations and
> refaults, and new anon from the start). The only thing we need out of
> this flag is to tell us whether reclaim is going after the workingset
> because the LRUs have become too small to hold it.

Understood.

The divergence comes from here. It seems you designed the page flag only
so the aging/balancing logic works well, while I was thinking of
leveraging the flag to identify the real workingset. I mean, an
anonymous page would be cold if it holds just cold data for the
application, data which would be swapped out after a short time and
never swapped in until the process exits. However, we start it on the
active list, so it has PG_workingset even though it's a cold page.

Yes, we cannot use the flag for such a purpose in this SEQ replacement,
so I will not insist on it.

> 
> mlocked pages are not really interesting because not only are they not
> evictable, they are entirely exempt from aging. Without aging, we can
> not say whether they are workingset or not. We'll just leave the flags
> alone, like the active flag right now.
> 
> > I think it would be better for the PageWorkingset function to return
> > true when PG_swapbacked is set, if we want to consider all pages on the
> > anonymous LRU as PG_workingset; that is clearer and less error-prone, IMHO.
> 
> I'm not sure I see the upside, it would be more branches and code.
> 
> > Another question:
> > 
> > Do we want to retain [1]?
> > 
> > This patch is motivated by swap IO being potentially much faster than
> > file IO, so wouldn't it be natural to rely on refault feedback rather
> > than forcing eviction of the file cache?
> > 
> > [1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?
> 
> Yes! We don't want to go after the workingset, whether it be cache or
> anonymous, while there is single-use page cache lying around that we
> can reclaim for free, with no IO and little risk of future IO. Anon
> memory doesn't have this equivalent. Only cache is lazy-reclaimed.
> 
> Once the cache refaults, we activate it to reflect the fact that it's
> workingset. Only when we run out of single-use cache do we want to
> reclaim multi-use pages, and *then* we balance workingsets based on
> cost of refetching each side from secondary storage.

If the pages on the inactive file LRU are really single-use page cache,
I agree.

However, how can the logic work like that?
If reclaimed file pages were part of the workingset (i.e., refaults
happen), we apply pressure to the anonymous LRU, but get_scan_count
still forces reclaim of the file LRU until the inactive file LRU is
small enough.

With that, couldn't too much of the file workingset be evicted even
though anon swap is cheaper on fast swap storage?

IOW, the refault mechanism only works once the inactive file LRU is
small enough, but a small inactive file LRU doesn't guarantee that it
holds only multiple-use pages. Hm, isn't that a problem?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-17  7:49           ` Minchan Kim
@ 2016-06-17 17:01             ` Johannes Weiner
  2016-06-20  7:42               ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2016-06-17 17:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Fri, Jun 17, 2016 at 04:49:45PM +0900, Minchan Kim wrote:
> On Thu, Jun 16, 2016 at 11:12:07AM -0400, Johannes Weiner wrote:
> > On Wed, Jun 15, 2016 at 11:23:41AM +0900, Minchan Kim wrote:
> > > On Mon, Jun 13, 2016 at 11:52:31AM -0400, Johannes Weiner wrote:
> > > > On Fri, Jun 10, 2016 at 11:19:35AM +0900, Minchan Kim wrote:
> > > > > Another concern about PG_workingset is the naming. For file-backed pages,
> > > > > it's good, because file-backed pages start at the inactive list's head and
> > > > > are promoted to the active LRU after two touches, so they are likely to be
> > > > > workingset. However, an anonymous page starts on the active list, so every
> > > > > anonymous page has PG_workingset while mlocked pages never get a chance to
> > > > > have it. It wouldn't matter from reclaim's POV, but if we used PG_workingset
> > > > > as an indicator to identify real workingset pages, it might be confusing.
> > > > > Maybe we could mark mlocked pages as workingset unconditionally.
> > > > 
> > > > Hm I'm not sure it matters. Technically we don't have to set it on
> > > > anon, but since it's otherwise unused anyway, it's nice to set it to
> > > > reinforce the notion that anon is currently always workingset.
> > > 
> > > When I first read your description, I thought the flag for anon pages
> > > was set only on swapin, but now I feel you want to set it for all
> > > anonymous pages. But that has several holes, like mlocked pages, shmem
> > > pages and THP, and you want to fix only the THP case.
> > > Hm, what's the rule?
> > > It's not consistent, and confusing to me. :(
> > 
> > I think you might be overthinking this a bit ;)
> > 
> > The current LRU code has a notion of workingset pages, which is anon
> > pages and multi-referenced file pages. shmem are considered file for
> > this purpose. That's why anon start out active and files/shmem do
> > not. This patch adds refaulting pages to the mix.
> > 
> > PG_workingset keeps track of pages that were recently workingset, so
> > we set it when the page enters the workingset (activations and
> > refaults, and new anon from the start). The only thing we need out of
> > this flag is to tell us whether reclaim is going after the workingset
> > because the LRUs have become too small to hold it.
> 
> Understood.
> 
> The divergence comes from here. It seems you designed the page flag only
> so the aging/balancing logic works well, while I was thinking of
> leveraging the flag to identify the real workingset. I mean, an
> anonymous page would be cold if it holds just cold data for the
> application, data which would be swapped out after a short time and
> never swapped in until the process exits. However, we start it on the
> active list, so it has PG_workingset even though it's a cold page.
> 
> Yes, we cannot use the flag for such a purpose in this SEQ replacement,
> so I will not insist on it.

Well, I'm designing the flag so that it's useful for the case I am
introducing it for :)

I have no problem with changing its semantics later on if you want to
build on top of it, rename it, anything - so far as the LRU balancing
is unaffected of course.

But I don't think it makes sense to provision it for potential future
cases that may or may not materialize.

> > > Do we want to retain [1]?
> > > 
> > > This patch is motivated by swap IO being potentially much faster than
> > > file IO, so wouldn't it be natural to rely on refault feedback rather
> > > than forcing eviction of the file cache?
> > > 
> > > [1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?
> > 
> > Yes! We don't want to go after the workingset, whether it be cache or
> > anonymous, while there is single-use page cache lying around that we
> > can reclaim for free, with no IO and little risk of future IO. Anon
> > memory doesn't have this equivalent. Only cache is lazy-reclaimed.
> > 
> > Once the cache refaults, we activate it to reflect the fact that it's
> > workingset. Only when we run out of single-use cache do we want to
> > reclaim multi-use pages, and *then* we balance workingsets based on
> > cost of refetching each side from secondary storage.
> 
> If the pages on the inactive file LRU are really single-use page cache,
> I agree.
> 
> However, how can the logic work like that?
> If reclaimed file pages were part of the workingset (i.e., refaults
> happen), we apply pressure to the anonymous LRU, but get_scan_count
> still forces reclaim of the file LRU until the inactive file LRU is
> small enough.
> 
> With that, couldn't too much of the file workingset be evicted even
> though anon swap is cheaper on fast swap storage?
> 
> IOW, the refault mechanism only works once the inactive file LRU is
> small enough, but a small inactive file LRU doesn't guarantee that it
> holds only multiple-use pages. Hm, isn't that a problem?

It's a trade-off between the cost of detecting a new workingset from a
stream of use-once pages, and the cost of use-once pages impose on the
established workingset.

That's a pretty easy choice, if you ask me. I'd rather ask cache pages
to prove they are multi-use than have use-once pages put pressure on
the workingset.

Sure, a spike like you describe is certainly possible, where a good
portion of the inactive file pages will be re-used in the near future,
yet we evict all of them in a burst of memory pressure when we should
have swapped. That's a worst case scenario for the use-once policy in
a workingset transition.

However, that's much better than use-once pages, which cost no
additional IO to reclaim and do not benefit from being cached at all,
causing the workingset to be trashed or swapped out.

In your scenario, the real multi-use pages will quickly refault and
get activated and the algorithm will adapt to the new circumstances.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-17 17:01             ` Johannes Weiner
@ 2016-06-20  7:42               ` Minchan Kim
  2016-06-22 21:56                 ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2016-06-20  7:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Fri, Jun 17, 2016 at 01:01:29PM -0400, Johannes Weiner wrote:
> On Fri, Jun 17, 2016 at 04:49:45PM +0900, Minchan Kim wrote:
> > On Thu, Jun 16, 2016 at 11:12:07AM -0400, Johannes Weiner wrote:
> > > On Wed, Jun 15, 2016 at 11:23:41AM +0900, Minchan Kim wrote:
> > > > On Mon, Jun 13, 2016 at 11:52:31AM -0400, Johannes Weiner wrote:
> > > > > On Fri, Jun 10, 2016 at 11:19:35AM +0900, Minchan Kim wrote:
> > > > > > Another concern about PG_workingset is the naming. For file-backed pages,
> > > > > > it's good, because file-backed pages start at the inactive list's head and
> > > > > > are promoted to the active LRU after two touches, so they are likely to be
> > > > > > workingset. However, an anonymous page starts on the active list, so every
> > > > > > anonymous page has PG_workingset while mlocked pages never get a chance to
> > > > > > have it. It wouldn't matter from reclaim's POV, but if we used PG_workingset
> > > > > > as an indicator to identify real workingset pages, it might be confusing.
> > > > > > Maybe we could mark mlocked pages as workingset unconditionally.
> > > > > 
> > > > > Hm I'm not sure it matters. Technically we don't have to set it on
> > > > > anon, but since it's otherwise unused anyway, it's nice to set it to
> > > > > reinforce the notion that anon is currently always workingset.
> > > > 
> > > > When I first read your description, I thought the flag for anon pages
> > > > was set only on swapin, but now I feel you want to set it for all
> > > > anonymous pages. But that has several holes, like mlocked pages, shmem
> > > > pages and THP, and you want to fix only the THP case.
> > > > Hm, what's the rule?
> > > > It's not consistent, and confusing to me. :(
> > > 
> > > I think you might be overthinking this a bit ;)
> > > 
> > > The current LRU code has a notion of workingset pages: anon pages
> > > and multi-referenced file pages. shmem is considered file for this
> > > purpose. That's why anon starts out active and file/shmem do not.
> > > This patch adds refaulting pages to the mix.
> > > 
> > > PG_workingset keeps track of pages that were recently workingset, so
> > > we set it when the page enters the workingset (activations and
> > > refaults, and new anon from the start). The only thing we need out of
> > > this flag is to tell us whether reclaim is going after the workingset
> > > because the LRUs have become too small to hold it.
> > 
> > Understood.
> > 
> > This is where we diverge. It seems you designed the page flag only so
> > that the aging/balancing logic works well, while I was thinking of
> > leveraging the flag to identify the real workingset. I mean, an
> > anonymous page can be cold if it holds only cold data for the
> > application; it would be swapped out after a short time and never
> > swapped back in until the process exits. Yet we put it on the active
> > list, so it carries PG_workingset even though it is a cold page.
> > 
> > Yes, we cannot use the flag for such a purpose in this SEQ replacement
> > scheme, so I will not insist on it.
> 
> Well, I'm designing the flag so that it's useful for the case I am
> introducing it for :)
> 
> I have no problem with changing its semantics later on if you want to
> build on top of it, rename it, anything - as long as the LRU balancing
> is unaffected, of course.
> 
> But I don't think it makes sense to provision it for potential future
> cases that may or may not materialize.

I admit I drifted quite far from the topic. Sorry, Johannes. :)

The reason, I guess, is the naming of the flag. When you introduced it,
I had a vague idea of utilizing the flag in the future if it represented
the real workingset, but as I reviewed the code, I realized it is not
what I want: it is just a way to detect pages that were activated before
being reclaimed. So to me it looks like PG_activated rather than
PG_workingset. ;-)
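To make that concrete, the rule as Johannes describes it could be modeled
like this; a toy sketch with illustrative names, not actual kernel code:

```c
#include <stdbool.h>

/* Toy model of the PG_workingset rule under discussion: the flag is set
 * when a page enters the workingset, i.e. on activation, on refault, and
 * on new anonymous pages (which start out on the active list). New file
 * and shmem pages start inactive and must prove reuse first. The enum
 * and function names are illustrative only. */
enum lru_event { NEW_ANON, NEW_FILE, ACTIVATED, REFAULTED };

static bool sets_pg_workingset(enum lru_event event)
{
	switch (event) {
	case NEW_ANON:	/* anon is assumed workingset from the start */
	case ACTIVATED:	/* multi-referenced page promoted to active */
	case REFAULTED:	/* evicted page that came back */
		return true;
	case NEW_FILE:	/* use-once until touched again */
	default:
		return false;
	}
}
```

Mlocked pages go straight to the unevictable list and never generate any of
these events, which is exactly the gap discussed above.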

> 
> > > > Do we want to retain [1]?
> > > > 
> > > > This patch is motivated by swap IO potentially being much faster
> > > > than file IO, so wouldn't it be natural to rely on refault feedback
> > > > rather than forcing the eviction of file cache?
> > > > 
> > > > [1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?
> > > 
> > > Yes! We don't want to go after the workingset, whether it be cache or
> > > anonymous, while there is single-use page cache lying around that we
> > > can reclaim for free, with no IO and little risk of future IO. Anon
> > > memory doesn't have this equivalent. Only cache is lazy-reclaimed.
> > > 
> > > Once the cache refaults, we activate it to reflect the fact that it's
> > > workingset. Only when we run out of single-use cache do we want to
> > > reclaim multi-use pages, and *then* we balance workingsets based on
> > > cost of refetching each side from secondary storage.
> > 
> > If the pages on the inactive file LRU really are single-use page cache,
> > I agree.
> > 
> > However, how can the logic work like that?
> > If reclaimed file pages were part of the workingset (i.e., refaults
> > happen), we shift pressure to the anonymous LRU, but get_scan_count
> > still forces reclaim of the file LRU until the inactive file LRU is
> > small enough.
> > 
> > With that, couldn't too much of the file workingset be evicted, even
> > though anon swap is cheaper on fast swap storage?
> > 
> > IOW, the refault mechanism only works once the inactive file LRU is
> > small enough, but a small inactive file LRU doesn't guarantee that it
> > holds only multi-use pages. Hm, isn't that a problem?
> 
> It's a trade-off between the cost of detecting a new workingset in a
> stream of use-once pages, and the cost that use-once pages impose on
> the established workingset.
> 
> That's a pretty easy choice, if you ask me. I'd rather ask cache pages
> to prove they are multi-use than have use-once pages put pressure on
> the workingset.

Makes sense.

> 
> Sure, a spike like you describe is certainly possible, where a good
> portion of the inactive file pages will be re-used in the near future,
> yet we evict all of them in a burst of memory pressure when we should
> have swapped. That's a worst case scenario for the use-once policy in
> a workingset transition.

So the point is how frequently such a case happens. A scenario I can
think of: if we use one cgroup per app, many file pages would sit on the
inactive LRU while the active LRU stays almost empty until reclaim kicks
in. Normally, parallel reclaim work during the launch of a new app makes
the app's startup really slow; that's why mobile platforms use notifiers
to free memory in advance by killing/reclaiming. Anyway, once we have
freed enough memory and launch the new app in a new cgroup, pages stay
on the LRU list they were born into (i.e., anon: active, file: inactive)
without aging.

Then, an activity manager can set memory.high on a less important
app-cgroup to reclaim it, with a high swappiness value because the swap
device is much faster on that system and there are far more anonymous
pages than file-backed ones. Surely the activity manager will expect
lots of anonymous pages to be swapped out, but contrary to that
expectation he will easily see such a spike: file-backed pages are
reclaimed heavily and refault until the inactive file LRU is small
enough.

I think that's a quite possible scenario on a small system with one
cgroup per app.

> 
> However, that's much better than use-once pages, which cost no
> additional IO to reclaim and do not benefit from being cached at all,
> causing the workingset to be thrashed or swapped out.

I agree that removing e9868505987a entirely is dangerous, but I think
we need something to prevent such a spike. Checking sc->priority might
help. Anyway, I think it's worth discussing.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bbfae9a92819..5d5e8e634a06 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2043,6 +2043,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * system is under heavy pressure.
 	 */
 	if (!inactive_list_is_low(lruvec, true) &&
+	    sc->priority >= DEF_PRIORITY - 2 &&
 	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
 		scan_balance = SCAN_FILE;
 		goto out;

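With DEF_PRIORITY being 12, the extra clause restricts the use-once
shortcut to the first three scan passes (priority 12, 11, 10). A
hypothetical simplification of the combined check, assuming
inactive_list_is_low() already returned false; names are stand-ins:

```c
#include <stdbool.h>

#define DEF_PRIORITY 12	/* as in include/linux/mm.h */

/* Would get_scan_count() still force file-only scanning?
 * inactive_file stands in for
 * lruvec_lru_size(lruvec, LRU_INACTIVE_FILE). */
static bool force_scan_file(unsigned long inactive_file, int priority)
{
	/* proposed clause: give up the shortcut under heavy pressure */
	if (priority < DEF_PRIORITY - 2)
		return false;
	/* existing clause: enough inactive file pages at this priority */
	return (inactive_file >> priority) != 0;
}
```

So a reclaim burst that drives the priority below 10 would fall through to
the normal anon/file balancing instead of sweeping the file list only.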
> 
> In your scenario, the real multi-use pages will quickly refault and
> get activated and the algorithm will adapt to the new circumstances.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-20  7:42               ` Minchan Kim
@ 2016-06-22 21:56                 ` Johannes Weiner
  2016-06-24  6:22                   ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2016-06-22 21:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Mon, Jun 20, 2016 at 04:42:08PM +0900, Minchan Kim wrote:
> On Fri, Jun 17, 2016 at 01:01:29PM -0400, Johannes Weiner wrote:
> > On Fri, Jun 17, 2016 at 04:49:45PM +0900, Minchan Kim wrote:
> > > On Thu, Jun 16, 2016 at 11:12:07AM -0400, Johannes Weiner wrote:
> > > > On Wed, Jun 15, 2016 at 11:23:41AM +0900, Minchan Kim wrote:
> > > > > Do we want to retain [1]?
> > > > > 
> > > > > This patch is motivated by swap IO potentially being much faster
> > > > > than file IO, so wouldn't it be natural to rely on refault feedback
> > > > > rather than forcing the eviction of file cache?
> > > > > 
> > > > > [1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?
> > > > 
> > > > Yes! We don't want to go after the workingset, whether it be cache or
> > > > anonymous, while there is single-use page cache lying around that we
> > > > can reclaim for free, with no IO and little risk of future IO. Anon
> > > > memory doesn't have this equivalent. Only cache is lazy-reclaimed.
> > > > 
> > > > Once the cache refaults, we activate it to reflect the fact that it's
> > > > workingset. Only when we run out of single-use cache do we want to
> > > > reclaim multi-use pages, and *then* we balance workingsets based on
> > > > cost of refetching each side from secondary storage.
> > > 
> > > If the pages on the inactive file LRU really are single-use page cache,
> > > I agree.
> > > 
> > > However, how can the logic work like that?
> > > If reclaimed file pages were part of the workingset (i.e., refaults
> > > happen), we shift pressure to the anonymous LRU, but get_scan_count
> > > still forces reclaim of the file LRU until the inactive file LRU is
> > > small enough.
> > > 
> > > With that, couldn't too much of the file workingset be evicted, even
> > > though anon swap is cheaper on fast swap storage?
> > > 
> > > IOW, the refault mechanism only works once the inactive file LRU is
> > > small enough, but a small inactive file LRU doesn't guarantee that it
> > > holds only multi-use pages. Hm, isn't that a problem?
> > 
> > It's a trade-off between the cost of detecting a new workingset in a
> > stream of use-once pages, and the cost that use-once pages impose on
> > the established workingset.
> > 
> > That's a pretty easy choice, if you ask me. I'd rather ask cache pages
> > to prove they are multi-use than have use-once pages put pressure on
> > the workingset.
> 
> Makes sense.
> 
> > 
> > Sure, a spike like you describe is certainly possible, where a good
> > portion of the inactive file pages will be re-used in the near future,
> > yet we evict all of them in a burst of memory pressure when we should
> > have swapped. That's a worst case scenario for the use-once policy in
> > a workingset transition.
> 
> So the point is how frequently such a case happens. A scenario I can
> think of: if we use one cgroup per app, many file pages would sit on the
> inactive LRU while the active LRU stays almost empty until reclaim kicks
> in. Normally, parallel reclaim work during the launch of a new app makes
> the app's startup really slow; that's why mobile platforms use notifiers
> to free memory in advance by killing/reclaiming. Anyway, once we have
> freed enough memory and launch the new app in a new cgroup, pages stay
> on the LRU list they were born into (i.e., anon: active, file: inactive)
> without aging.
> 
> Then, an activity manager can set memory.high on a less important
> app-cgroup to reclaim it, with a high swappiness value because the swap
> device is much faster on that system and there are far more anonymous
> pages than file-backed ones. Surely the activity manager will expect
> lots of anonymous pages to be swapped out, but contrary to that
> expectation he will easily see such a spike: file-backed pages are
> reclaimed heavily and refault until the inactive file LRU is small
> enough.
> 
> I think that's a quite possible scenario on a small system with one
> cgroup per app.

That's the workingset transition I was talking about. The algorithm is
designed to settle towards stable memory patterns. We can't possibly
remove one of the key components of this - the use-once policy - to
speed up a few seconds of workingset transition when it comes at the
risk of potentially thrashing the workingset for *hours*.

The fact that swap IO can be faster than filesystem IO doesn't change
this at all. The point is that the reclaim and refetch IO cost of
use-once cache is ZERO. Causing swap IO to make room for more and more
unused cache pages doesn't make any sense, no matter the swap speed.

I really don't see the relevance of this discussion to this patch set.


* Re: [PATCH 10/10] mm: balance LRU lists based on relative thrashing
  2016-06-22 21:56                 ` Johannes Weiner
@ 2016-06-24  6:22                   ` Minchan Kim
  0 siblings, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2016-06-24  6:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Andrew Morton, Rik van Riel, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Michal Hocko, Tim Chen,
	kernel-team

On Wed, Jun 22, 2016 at 05:56:52PM -0400, Johannes Weiner wrote:
> On Mon, Jun 20, 2016 at 04:42:08PM +0900, Minchan Kim wrote:
> > On Fri, Jun 17, 2016 at 01:01:29PM -0400, Johannes Weiner wrote:
> > > On Fri, Jun 17, 2016 at 04:49:45PM +0900, Minchan Kim wrote:
> > > > On Thu, Jun 16, 2016 at 11:12:07AM -0400, Johannes Weiner wrote:
> > > > > On Wed, Jun 15, 2016 at 11:23:41AM +0900, Minchan Kim wrote:
> > > > > > Do we want to retain [1]?
> > > > > > 
> > > > > > This patch is motivated by swap IO potentially being much faster
> > > > > > than file IO, so wouldn't it be natural to rely on refault feedback
> > > > > > rather than forcing the eviction of file cache?
> > > > > > 
> > > > > > [1] e9868505987a, mm,vmscan: only evict file pages when we have plenty?
> > > > > 
> > > > > Yes! We don't want to go after the workingset, whether it be cache or
> > > > > anonymous, while there is single-use page cache lying around that we
> > > > > can reclaim for free, with no IO and little risk of future IO. Anon
> > > > > memory doesn't have this equivalent. Only cache is lazy-reclaimed.
> > > > > 
> > > > > Once the cache refaults, we activate it to reflect the fact that it's
> > > > > workingset. Only when we run out of single-use cache do we want to
> > > > > reclaim multi-use pages, and *then* we balance workingsets based on
> > > > > cost of refetching each side from secondary storage.
> > > > 
> > > > If the pages on the inactive file LRU really are single-use page cache,
> > > > I agree.
> > > > 
> > > > However, how can the logic work like that?
> > > > If reclaimed file pages were part of the workingset (i.e., refaults
> > > > happen), we shift pressure to the anonymous LRU, but get_scan_count
> > > > still forces reclaim of the file LRU until the inactive file LRU is
> > > > small enough.
> > > > 
> > > > With that, couldn't too much of the file workingset be evicted, even
> > > > though anon swap is cheaper on fast swap storage?
> > > > 
> > > > IOW, the refault mechanism only works once the inactive file LRU is
> > > > small enough, but a small inactive file LRU doesn't guarantee that it
> > > > holds only multi-use pages. Hm, isn't that a problem?
> > > 
> > > It's a trade-off between the cost of detecting a new workingset in a
> > > stream of use-once pages, and the cost that use-once pages impose on
> > > the established workingset.
> > > 
> > > That's a pretty easy choice, if you ask me. I'd rather ask cache pages
> > > to prove they are multi-use than have use-once pages put pressure on
> > > the workingset.
> > 
> > Makes sense.
> > 
> > > 
> > > Sure, a spike like you describe is certainly possible, where a good
> > > portion of the inactive file pages will be re-used in the near future,
> > > yet we evict all of them in a burst of memory pressure when we should
> > > have swapped. That's a worst case scenario for the use-once policy in
> > > a workingset transition.
> > 
> > So the point is how frequently such a case happens. A scenario I can
> > think of: if we use one cgroup per app, many file pages would sit on the
> > inactive LRU while the active LRU stays almost empty until reclaim kicks
> > in. Normally, parallel reclaim work during the launch of a new app makes
> > the app's startup really slow; that's why mobile platforms use notifiers
> > to free memory in advance by killing/reclaiming. Anyway, once we have
> > freed enough memory and launch the new app in a new cgroup, pages stay
> > on the LRU list they were born into (i.e., anon: active, file: inactive)
> > without aging.
> > 
> > Then, an activity manager can set memory.high on a less important
> > app-cgroup to reclaim it, with a high swappiness value because the swap
> > device is much faster on that system and there are far more anonymous
> > pages than file-backed ones. Surely the activity manager will expect
> > lots of anonymous pages to be swapped out, but contrary to that
> > expectation he will easily see such a spike: file-backed pages are
> > reclaimed heavily and refault until the inactive file LRU is small
> > enough.
> > 
> > I think that's a quite possible scenario on a small system with one
> > cgroup per app.
> 
> That's the workingset transition I was talking about. The algorithm is
> designed to settle towards stable memory patterns. We can't possibly
> remove one of the key components of this - the use-once policy - to
> speed up a few seconds of workingset transition when it comes at the
> risk of potentially thrashing the workingset for *hours*.
> 
> The fact that swap IO can be faster than filesystem IO doesn't change
> this at all. The point is that the reclaim and refetch IO cost of
> use-once cache is ZERO. Causing swap IO to make room for more and more
> unused cache pages doesn't make any sense, no matter the swap speed.

I agree with your overall point about reclaiming use-once pages first,
and as I said in a previous mail, I didn't want to remove e9868505987a
entirely.

My concern is that unconditionally scanning only the file LRU until the
inactive list is low enough by some magic ratio (3:1 or 1:1) is too
heuristic a way to reclaim use-once pages first, so it can evict far
too many file-backed pages that are not use-once.
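For reference, the "magic value" comes from the inactive:active ratio
check; a simplified sketch of that heuristic as I understand it, where
the ratio is assumed to scale roughly with the square root of the cache
size in gigabytes (an assumption-laden paraphrase, not the kernel code):

```c
#include <stdbool.h>

/* Integer square root by linear search; fine for small inputs. */
static unsigned long int_sqrt_ul(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* The inactive file list counts as "low" once active pages outnumber
 * inactive ones by a ratio that grows with cache size, roughly
 * sqrt(10 * gigabytes): about 1:1 below 1GB, about 3:1 at 1GB. */
static bool inactive_file_is_low(unsigned long inactive,
				 unsigned long active,
				 unsigned long total_gb)
{
	unsigned long ratio = int_sqrt_ul(10 * total_gb);

	if (ratio < 1)
		ratio = 1;
	return inactive * ratio < active;
}
```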

Also, think about MADV_FREEed pages on the anonymous LRU list. They
might be an even more attractive reclaim candidate: userspace already
paid for the madvise syscall to mark them preferable, yet the VM
unconditionally keeps them until the inactive file LRU is small enough,
on the assumption that we should sweep use-once file pages first, and
that the unfortunate reclaim of multi-use pages is the trade-off for
detecting workingset transitions, so the user has to live with it even
though he asked to prefer anonymous pages via vm_swappiness.

I don't think that makes sense. vm_swappiness is a user preference knob,
and the user may know his workload better than the kernel does. For
example, a user might accept degrading overall system performance by
swapping out more anonymous memory in order to keep file pages resident,
reducing the latency spike of accessing those file pages when some event
suddenly happens. But the kernel ignores the knob until the inactive LRU
is small enough.


An idea in my mind is as follows. You nicely abstracted the cost model
in this patchset, so if the scanning cost of one LRU grows much higher
than the paging-in/out cost (e.g., 32 * 2 * SWAP_CLUSTER_MAX) of the
other LRU, we could break out of the unconditional scan and turn to the
other LRU temporarily, letting it prove its pages are valuable
workingset, and then repeat that cycle rather than sweeping only the
inactive file LRU. I think it could mitigate the workingset transition
spike while handling cold/freeable pages on the anonymous LRU fairly.
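A sketch of how such a break could look, purely illustrative (the
threshold 32 * 2 * SWAP_CLUSTER_MAX is the one from this mail; the
function and enum names are made up):

```c
#define SWAP_CLUSTER_MAX 32	/* as in include/linux/swap.h */
#define SCAN_COST_LIMIT (32 * 2 * SWAP_CLUSTER_MAX)

enum scan_lru { SCAN_LRU_FILE, SCAN_LRU_ANON };

/* Pick the list the next batch should scan: stay on the current list
 * until its accumulated scan cost exceeds the limit, then switch so
 * the other list gets a chance to prove its pages are workingset. */
static enum scan_lru next_scan_target(unsigned long file_cost,
				      unsigned long anon_cost,
				      enum scan_lru current)
{
	unsigned long cost = (current == SCAN_LRU_FILE) ? file_cost
							: anon_cost;

	if (cost > SCAN_COST_LIMIT)
		return (current == SCAN_LRU_FILE) ? SCAN_LRU_ANON
						  : SCAN_LRU_FILE;
	return current;
}
```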

> 
> I really don't see the relevance of this discussion to this patch set.

Hm, yes, the thing I'm concerned about is *not* newly introduced by your
patch; it has been there for a long time. But your patchset aims to stop
the balancing code from mostly favoring page cache and to exploit the
potential of fast swap devices, as you described in the cover letter,
and e9868505987a might conflict with that approach. That is why I
raised the issue.

If you think it's a separate issue, I don't want to stall your nice work
and waste your time. It can be revisited afterward.

Thanks.


end of thread, other threads:[~2016-06-24  6:22 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-06 19:48 [PATCH 00/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
2016-06-06 19:48 ` [PATCH 01/10] mm: allow swappiness that prefers anon over file Johannes Weiner
2016-06-07  0:25   ` Minchan Kim
2016-06-07 14:18     ` Johannes Weiner
2016-06-08  0:06       ` Minchan Kim
2016-06-08 15:58         ` Johannes Weiner
2016-06-09  1:01           ` Minchan Kim
2016-06-09 13:32             ` Johannes Weiner
2016-06-06 19:48 ` [PATCH 02/10] mm: swap: unexport __pagevec_lru_add() Johannes Weiner
2016-06-06 21:32   ` Rik van Riel
2016-06-07  9:07   ` Michal Hocko
2016-06-08  7:14   ` Minchan Kim
2016-06-06 19:48 ` [PATCH 03/10] mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() Johannes Weiner
2016-06-06 21:33   ` Rik van Riel
2016-06-07  9:12   ` Michal Hocko
2016-06-08  7:24   ` Minchan Kim
2016-06-06 19:48 ` [PATCH 04/10] mm: fix LRU balancing effect of new transparent huge pages Johannes Weiner
2016-06-06 21:36   ` Rik van Riel
2016-06-07  9:19   ` Michal Hocko
2016-06-08  7:28   ` Minchan Kim
2016-06-06 19:48 ` [PATCH 05/10] mm: remove LRU balancing effect of temporary page isolation Johannes Weiner
2016-06-06 21:56   ` Rik van Riel
2016-06-06 22:15     ` Johannes Weiner
2016-06-07  1:11       ` Rik van Riel
2016-06-07 13:57         ` Johannes Weiner
2016-06-07  9:26       ` Michal Hocko
2016-06-07 14:06         ` Johannes Weiner
2016-06-07  9:49   ` Michal Hocko
2016-06-08  7:39   ` Minchan Kim
2016-06-08 16:02     ` Johannes Weiner
2016-06-06 19:48 ` [PATCH 06/10] mm: remove unnecessary use-once cache bias from LRU balancing Johannes Weiner
2016-06-07  2:20   ` Rik van Riel
2016-06-07 14:11     ` Johannes Weiner
2016-06-08  8:03   ` Minchan Kim
2016-06-08 12:31   ` Michal Hocko
2016-06-06 19:48 ` [PATCH 07/10] mm: base LRU balancing on an explicit cost model Johannes Weiner
2016-06-06 19:13   ` kbuild test robot
2016-06-07  2:34   ` Rik van Riel
2016-06-07 14:12     ` Johannes Weiner
2016-06-08  8:14   ` Minchan Kim
2016-06-08 16:06     ` Johannes Weiner
2016-06-08 12:51   ` Michal Hocko
2016-06-08 16:16     ` Johannes Weiner
2016-06-09 12:18       ` Michal Hocko
2016-06-09 13:33         ` Johannes Weiner
2016-06-06 19:48 ` [PATCH 08/10] mm: deactivations shouldn't bias the LRU balance Johannes Weiner
2016-06-08  8:15   ` Minchan Kim
2016-06-08 12:57   ` Michal Hocko
2016-06-06 19:48 ` [PATCH 09/10] mm: only count actual rotations as LRU reclaim cost Johannes Weiner
2016-06-08  8:19   ` Minchan Kim
2016-06-08 13:18   ` Michal Hocko
2016-06-06 19:48 ` [PATCH 10/10] mm: balance LRU lists based on relative thrashing Johannes Weiner
2016-06-06 19:22   ` kbuild test robot
2016-06-06 23:50   ` Tim Chen
2016-06-07 16:23     ` Johannes Weiner
2016-06-07 19:56       ` Tim Chen
2016-06-08 13:58   ` Michal Hocko
2016-06-10  2:19   ` Minchan Kim
2016-06-13 15:52     ` Johannes Weiner
2016-06-15  2:23       ` Minchan Kim
2016-06-16 15:12         ` Johannes Weiner
2016-06-17  7:49           ` Minchan Kim
2016-06-17 17:01             ` Johannes Weiner
2016-06-20  7:42               ` Minchan Kim
2016-06-22 21:56                 ` Johannes Weiner
2016-06-24  6:22                   ` Minchan Kim
2016-06-07  9:51 ` [PATCH 00/10] " Michal Hocko
